0000963: Collation issues in XCU (changes for TC2)

ID	Project	Category	View Status	Date Submitted	Last Update

0000963	1003.1(2013)/Issue7+TC1	Shell and Utilities	public	2015-06-24 11:07	2019-06-10 08:54

Reporter	geoffclare	Assigned To
Priority	normal	Severity	Objection	Type	Error
Status	Closed	Resolution	Accepted

Name	Geoff Clare
Organization	The Open Group
User Reference
Section	2.13.3, awk, comm, expr, join, ls, sort, uniq
Page Number	2356, 2459, 2559, 2740, 2839, 2888, 3210, 3309, and more
Line Number	75082, 78745, 82755, 89708, 93278, 95164, 107544, 111067, and more
Interp Status	---
Final Accepted Text


Summary	0000963: Collation issues in XCU (changes for TC2)
Description	A discussion on the mailing list identified some issues related to collation for locales that do not define a collation sequence with a total ordering of all characters. It is proposed that these issues are addressed in Issue 8 by requiring implementation-provided locales that do not have an '@' modifier in their name to define a collation sequence that has a total ordering of all characters (thus reducing the problem to "special" locales and user-defined locales), and by modifying the requirements for regular expressions and affected utilities so that they cope better with such locales. As an intermediate step, it is proposed that the new requirements slated for Issue 8 are recommended (or at least allowed) in TC2. The necessary changes will be split across four Mantis bugs, targeting XBD TC2, XCU TC2, XBD Issue 8, and XCU Issue 8. This bug contains the changes proposed for XCU in TC2.
Desired Action	On Page: 2356 Line: 75082 Section: 2.13.3 Patterns Used for Filename Expansion
In list item 3, change from:
    ... sorted according to the collating sequence in effect in the current locale.
to:
    ... sorted according to the collating sequence in effect in the current locale. If this collating sequence does not have a total ordering of all characters (see [xref to XBD 7.3.2]), any filenames or pathnames that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.
    <small>Note: a future version of this standard may require the byte-by-byte further comparison described above.</small>

On Page: 2459 Line: 78745 Section: awk
In the EXTENDED DESCRIPTION section, change from:
    operands shall be converted to strings as required and a string comparison shall be made using the locale-specific collation sequence. The value of the comparison expression shall be 1 if the relation is true, or 0 if the relation is false.
to:
    operands shall be converted to strings as required and a string comparison shall be made as follows:
    * For the "!=" and "==" operators, the strings should be compared to check if they are identical but may be compared using the locale-specific collation sequence to check if they collate equally.
    * For the other operators, the strings shall be compared using the locale-specific collation sequence.
    The value of the comparison expression shall be 1 if the relation is true, or 0 if the relation is false.
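
The difference between strings that are identical and strings that merely collate equally can be observed from the shell. The following is a minimal sketch, not part of the proposed text: compare_strings and the CMP_A/CMP_B environment variables are hypothetical names, and the two awk expressions are portable idioms for each kind of test. In a locale whose collating sequence has a total ordering (such as the POSIX locale) both results always agree.

```shell
# Hypothetical helper: report both notions of string equality for two
# arguments, using the collating sequence of the current locale.
compare_strings() {
    # Pass the values through the environment so that awk does not apply
    # escape-sequence processing to them (as it would for -v assignments).
    CMP_A=$1 CMP_B=$2 awk 'BEGIN {
        # Appending "" forces string comparison even for numeric-looking input.
        a = ENVIRON["CMP_A"] ""
        b = ENVIRON["CMP_B"] ""
        # Identical: same length and b occurs at the start of a.
        same = (length(a) == length(b) && index(a, b) == 1)
        # Collate equally: neither string sorts before the other.
        equal = (a <= b && a >= b)
        printf "identical=%d collate_equal=%d\n", same, equal
    }'
}
```

For example, running compare_strings with two different strings under a locale with an '@' modifier would show whether that locale treats them as collating equally.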

On Page: 2478 Line: 79587 Section: awk
In the APPLICATION USAGE section, add two new paragraphs:
    On implementations where the "==" operator checks if strings collate equally, applications needing to check whether strings are identical can use:
        length(a) == length(b) && index(a,b) == 1
    On implementations where the "==" operator checks if strings are identical, applications needing to check whether strings collate equally can use:
        a <= b && a >= b

On Page: 2486 Line: 79914 Section: awk
In the FUTURE DIRECTIONS section, change from:
    None.
to:
    A future version of this standard may require the "!=" and "==" operators to perform string comparisons by checking if the strings are identical (and not by checking if they collate equally).

On Page: 2559 Line: 82755 Section: comm
In the DESCRIPTION section, add a new paragraph:
    If the collating sequence of the current locale does not have a total ordering of all characters (see [xref to XBD 7.3.2]) and any lines from the input files collate equally but are not identical, comm should treat them as different lines but may treat them as being the same. If it treats them as different, comm should expect them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale and if they are not ordered in this way, the output of comm can identify such lines as being both unique to file1 and unique to file2 instead of being in both files.

On Page: 2560 Line: 82810 Section: comm
In the STDOUT section, change from:
    If the input files were ordered according to the collating sequence of the current locale, the lines written shall be in the collating sequence of the original lines.
to:
    If the input files were ordered according to the collating sequence of the current locale, the lines written shall be in the collating sequence of the current locale.
    If the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, and comm treated them as different lines, then lines written that collate equally but are not identical should be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

On Page: 2561 Line: 82825 Section: comm
In the APPLICATION USAGE section, add the following new paragraphs:
    If the collating sequence of the current locale does not have a total ordering of all characters, this can affect the behavior of comm in the following ways:
    * If comm treats lines as being the same only if they are identical, some lines can be misleadingly identified as being both unique to file1 and unique to file2.
    * If comm treats lines as being the same if they collate equally and a line from file1 collates equally with a line from file2 but is not identical to it, one of the lines is misleadingly identified as being in both files and the other is not written to the output at all.
    Such problems can be avoided by forcing the use of the POSIX locale, for example the following identifies lines in both file1 and file2:
        LC_ALL=POSIX sort file1 > file1.posix
        LC_ALL=POSIX sort file2 > file2.posix
        LC_ALL=POSIX comm -12 file1.posix file2.posix | sort
    The final sort re-sorts the output of comm according to the collating sequence of the original locale. Doing this might be difficult if more than one column is output and leading blanks cannot be ignored.

On Page: 2561 Line: 82842 Section: comm
In the FUTURE DIRECTIONS section, change from:
    None.
to:
    A future version of this standard may require that if any lines from the input files collate equally but are not identical, then comm treats them as different lines and expects them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.
    A future version of this standard may require that if the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, then lines written that collate equally but are not identical are ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

On Page: 2740 Line: 89708 Section: expr
In the APPLICATION USAGE section, add a new paragraph:
    For testing string equality the test utility is preferred over expr, as it is usually implemented as a shell built-in. However, the functionality is not quite the same because the expr "=" and "!=" operators check whether strings collate equally, whereas test checks whether they are identical. Therefore, they can produce different results in locales where the collation sequence does not have a total ordering of all characters (see [xref to XBD 7.3.2]).

On Page: 2839 Line: 93278 Section: join
In the DESCRIPTION section, change from:
    that have identical join fields
to:
    that have join fields that collate equally

On Page: 2841 Line: 93377 Section: join
In the APPLICATION USAGE section, add a new paragraph:
    If the collating sequence of the current locale does not have a total ordering of all characters (see [xref to XBD 7.3.2]), join treats fields that collate equally but are not identical as being the same. If this behavior is not desired, it can be avoided by forcing the use of the POSIX locale (although this means re-sorting the input files into the POSIX locale collating sequence).
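
The POSIX-locale workaround for join described above could be wrapped as follows. This is a sketch under stated assumptions, not part of the proposed text: posix_join is a hypothetical helper name, it joins on the first field only, and it relies on mktemp, which is widely available but is an assumption here.

```shell
# Hypothetical helper: join two files on their first field using byte-wise
# (POSIX locale) comparison, so that only identical join fields match.
posix_join() {
    # mktemp is an assumption; substitute another temporary-file scheme if needed.
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 1; }
    # join expects its input ordered as by "sort -b" on the join field,
    # so re-sort both files on field 1 in the POSIX locale first.
    LC_ALL=POSIX sort -k 1b,1 "$1" > "$tmp1"
    LC_ALL=POSIX sort -k 1b,1 "$2" > "$tmp2"
    LC_ALL=POSIX join "$tmp1" "$tmp2"
    status=$?
    rm -f "$tmp1" "$tmp2"
    return "$status"
}
```

The output is in POSIX-locale order; if the original locale's order is needed, it would have to be re-sorted afterwards.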

On Page: 2888 Line: 95164 Section: ls
In the DESCRIPTION section, add a new paragraph:
    Whenever ls sorts filenames or pathnames according to the collating sequence in the current locale, if this collating sequence does not have a total ordering of all characters (see [xref to XBD 7.3.2]), then any filenames or pathnames that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 2896 Line: 95520 Section: ls
In the FUTURE DIRECTIONS section, add a new paragraph:
    A future version of this standard may require that if the collating sequence for the current locale does not have a total ordering of all characters, any filenames or pathnames that collate equally are further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 3210 Line: 107544 Section: sort
In the DESCRIPTION section, change from:
    ... shall be performed using the collating sequence of the current locale.
to:
    ... shall be performed using the collating sequence of the current locale. If this collating sequence does not have a total ordering of all characters (see [xref to XBD 7.3.2]), any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 3214 Line: 107719 Section: sort
In the APPLICATION USAGE section, add a new paragraph:
    If the collating sequence of the current locale does not have a total ordering of all characters, this can affect the behavior of sort in the following ways:
    * As <tt>sort -u</tt> suppresses lines with duplicate keys, it suppresses lines that collate equally but are not identical.
    * The output of sort (without -u) can contain identical lines that are not adjacent, if it does not implement the recommended further byte-by-byte comparison of lines that collate equally.
    This affects the use of sort with comm and uniq; see the APPLICATION USAGE for those utilities.
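
Where sort does not perform the recommended further comparison, an application can approximate it with two passes. The sketch below is an illustration only, not part of the proposed text: sort_with_byte_tiebreak is a hypothetical name, and it assumes the implementation's sort offers a stable-sort option -s, which is an extension and not required by the standard.

```shell
# Hypothetical helper: order lines by the current locale's collating
# sequence, breaking ties between lines that collate equally by byte order.
sort_with_byte_tiebreak() {
    # Pass 1: byte-wise order (POSIX locale).
    # Pass 2: stable sort in the current locale; lines that collate equally
    # keep their pass-1 (byte-wise) relative order.
    # Assumption: -s (stable sort) is supported; it is an extension.
    LC_ALL=POSIX sort "$@" | sort -s
}
```

With this ordering, identical lines always end up adjacent, which is what comm and uniq rely on.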

On Page: 3215 Line: 107783 Section: sort
In the RATIONALE section, add a new paragraph:
    Implementations are encouraged to perform the recommended further byte-by-byte comparison of lines that collate equally, even though this may affect efficiency. The impact on efficiency can be mitigated by only performing the additional comparison if the current locale's collating sequence does not have a total ordering of all characters (if the implementation provides a way to query this) or by only performing the additional comparison if the locale name associated with the LC_COLLATE category has an '@' modifier in the name (since locales without an '@' modifier should have a total ordering of all characters - see [xref to XBD 7.3.2]). Note that if the implementation provides a stable sort option as an extension (usually -s), the additional comparison should not be performed when this option has been specified.

On Page: 3215 Line: 107785 Section: sort
In the FUTURE DIRECTIONS section, change from:
    None.
to:
    A future version of this standard may require that if the collating sequence of the current locale does not have a total ordering of all characters, any lines of input that collate equally when comparing them as whole lines are further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 3309 Line: 111067 Section: uniq
In the ENVIRONMENT VARIABLES section, delete:
    LC_COLLATE Determine the locale for ordering rules.

On Page: 3310 Line: 111099 Section: uniq
In the APPLICATION USAGE section, change from:
    The sort utility can be used to cause repeated lines to be adjacent in the input file.
to:
    If the collating sequence of the current locale has a total ordering of all characters, the sort utility can be used to cause repeated lines to be adjacent in the input file. If the collating sequence does not have a total ordering of all characters, the sort utility should still do this but it might not.
    To ensure that all duplicate lines are eliminated, and have the output sorted according to the collating sequence of the current locale, applications should use:
        LC_ALL=C sort -u | sort
    instead of:
        sort | uniq
    To remove duplicate lines based on whether they collate equally instead of whether they are identical, applications should use:
        sort -u
    instead of:
        sort | uniq
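
The two de-duplication pipelines above can be wrapped as shell functions, and comparing their line counts gives a rough check of whether a file contains lines that collate equally without being identical. This is a minimal sketch; the function names are hypothetical and not part of the proposed text.

```shell
# Remove identical lines only; output in the current locale's collating order.
dedupe_identical() {
    LC_ALL=C sort -u "$@" | sort
}

# Remove lines that collate equally in the current locale.
dedupe_collate_equal() {
    sort -u "$@"
}

# Hypothetical check: number of lines in file $1 that sort -u suppresses
# only because they collate equally with another, non-identical line.
count_collation_only_duplicates() {
    by_bytes=$(LC_ALL=C sort -u "$1" | wc -l)
    by_collation=$(sort -u "$1" | wc -l)
    echo $((by_bytes - by_collation))
}
```

In a locale whose collating sequence has a total ordering of all characters, count_collation_only_duplicates is expected to print 0.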
Tags	tc2-2008, UTF-8_Locale

Date Modified	Username	Field	Change
2015-06-24 11:07	geoffclare	New Issue
2015-06-24 11:07	geoffclare	Name	=> Geoff Clare
2015-06-24 11:07	geoffclare	Organization	=> The Open Group
2015-06-24 11:07	geoffclare	Section	=> 2.13.3, awk, comm, expr, join, ls, sort, uniq
2015-06-24 11:07	geoffclare	Page Number	=> 2356, 2459, 2559, 2740, 2839, 2888, 3210, 3309, and more
2015-06-24 11:07	geoffclare	Line Number	=> 75082, 78745, 82755, 89708, 93278, 95164, 107544, 111067, and more
2015-06-24 11:07	geoffclare	Interp Status	=> ---
2015-06-24 15:24	geoffclare	Relationship added	related to 0000938
2015-07-30 15:20	~~Don Cragun~~	Status	New => Resolved
2015-07-30 15:20	~~Don Cragun~~	Resolution	Open => Accepted
2015-07-30 15:20	~~Don Cragun~~	Desired Action Updated
2015-07-30 15:20	~~Don Cragun~~	Tag Attached: tc2-2008
2015-07-30 17:05	rhansen	Tag Attached: UTF-8_Locale
2016-02-04 16:18	nick	Relationship added	related to 0000948
2016-08-25 11:11	geoffclare	Relationship added	related to 0001070
2019-06-10 08:54	agadmin	Status	Resolved => Closed

Relationships

related to	0000938	Closed	Collation issues in XBD (changes for TC2)
related to	0000948	Closed	Collation issues in XBD (changes for Issue 8)
related to	0001070	Closed	Collation issues in XCU (changes for Issue 8)