Austin Group Defect Tracker

ID 0001070
Category [1003.1(2013)/Issue7+TC1] Shell and Utilities
Severity Objection
Type Error
Date Submitted 2016-08-25 11:11
Last Update 2020-04-21 13:35
Reporter geoffclare
View Status public
Assigned To
Priority normal
Resolution Accepted
Status Applied
Name Geoff Clare
Organization The Open Group
User Reference
Section 2.13.3, awk, comm, localedef, ls, sort, uniq
Page Number 2356, 2459, 2559, 2874, 2888, 3210, 3309, and more
Line Number 75082, 78745, 82755, 94650, 95164, 107544, 111067, and more
Interp Status ---
Final Accepted Text
Summary 0001070: Collation issues in XCU (changes for Issue 8)
Description A discussion on the mailing list identified some issues related to
collation for locales that do not define a collation sequence with
a total ordering of all characters. It is proposed that these issues
are addressed in Issue 8 by requiring implementation-provided locales
that do not have an '@' modifier in their name to define a collation
sequence that has a total ordering of all characters (thus reducing
the problem to "special" locales and user-defined locales), and by
modifying the requirements for regular expressions and affected
utilities so that they cope better with such locales. As an
intermediate step, it is proposed that the new requirements slated
for Issue 8 are recommended (or at least allowed) in TC2.

The necessary changes will be split across four Mantis bugs, targeting
XBD TC2, XCU TC2, XBD Issue 8, and XCU Issue 8. This bug contains the
changes proposed for XCU in Issue 8.
Desired Action After applying the bug 0000963 changes at each of the following
locations, make further changes to the new text as noted below.
(There is also a change to localedef inserted among the changes
derived from bug 963.)

On Page: 2356 Line: 75082 Section: 2.13.3 Patterns Used for Filename Expansion

In the updated list item 3, change from:

any filenames or pathnames that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

any filenames or pathnames that collate equally shall be further compared byte-by-byte using the collating sequence for the POSIX locale.

and delete the small-font note:

<small>Note: a future version of this standard may require the byte-by-byte further comparison described above.</small>

On Page: 2459 Line: 78745 Section: awk

In the updated text, change from:

For the "!=" and "==" operators, the strings should be compared to check if they are identical but may be compared using the locale-specific collation sequence to check if they collate equally.

to:

For the "!=" and "==" operators, the strings shall be compared to check if they are identical (not to check if they collate equally).

On Page: 2478 Line: 79587 Section: awk

Change the two new APPLICATION USAGE paragraphs from:

On implementations where the "==" operator checks if strings collate equally, applications needing to check whether strings are identical can use:
length(a) == length(b) && index(a,b) == 1
On implementations where the "==" operator checks if strings are identical, applications needing to check whether strings collate equally can use:
a <= b && a >= b
to:

Since the "==" operator checks whether strings are identical, not whether they collate equally, applications needing to check whether strings collate equally can use:
a <= b && a >= b
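
Illustration only, not part of the proposed text: the sketch below wraps both comparisons in a shell function so that the difference can be tried out in a given locale. The function name compare_strings is hypothetical, and appending "" to each operand is a choice made here to force awk to compare the operands as strings even when the input looks numeric. Whether the two results can actually differ depends on the awk implementation and on the locale in use.

```shell
# compare_strings A B
# Prints whether A and B are identical and whether they collate equally
# in the current locale. The function name is made up for this sketch.
compare_strings() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        # Appending "" forces a string comparison, not a numeric one.
        identical = ((a "") == (b ""))
        # The idiom from the APPLICATION USAGE text above.
        collate_equal = ((a "") <= (b "") && (a "") >= (b ""))
        printf "identical=%d collate_equal=%d\n", identical, collate_equal
    }'
}
```

For example, compare_strings abc abd prints identical=0 collate_equal=0.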

On Page: 2486 Line: 79914 Section: awk

Change the updated FUTURE DIRECTIONS section from:

A future version of this standard may require the "!=" and "==" operators to perform string comparisons by checking if the strings are identical (and not by checking if they collate equally).

to:

None.

On Page: 2559 Line: 82755 Section: comm

Change the new DESCRIPTION paragraph from:

If the collating sequence of the current locale does not have a total ordering of all characters (see [xref to XBD 7.3.2]) and any lines from the input files collate equally but are not identical, comm should treat them as different lines but may treat them as being the same. If it treats them as different, comm should expect them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale and if they are not ordered in this way, the output of comm can identify such lines as being both unique to file1 and unique to file2 instead of being in both files.

to:

If the collating sequence of the current locale does not have a total ordering of all characters (see [xref to XBD 7.3.2]) and any lines from the input files collate equally but are not identical, comm shall treat them as different lines and shall expect them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale; if they are not ordered in this way, the output of comm can identify such lines as being both unique to file1 and unique to file2 instead of being in both files.

On Page: 2560 Line: 82810 Section: comm

In the updated text, change from:

If the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, and comm treated them as different lines, then lines written that collate equally but are not identical should be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

to:

If the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, then lines written that collate equally but are not identical shall be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

On Page: 2561 Line: 82825 Section: comm

Change the new APPLICATION USAGE paragraphs from:

If the collating sequence of the current locale does not have a total ordering of all characters, this can affect the behaviour of comm in the following ways:
* If comm treats lines as being the same only if they are identical, some lines can be misleadingly identified as being both unique to file1 and unique to file2.

* If comm treats lines as being the same if they collate equally and a line from file1 collates equally with a line from file2 but is not identical to it, one of the lines is misleadingly identified as being in both files and the other is not written to the output at all.
Such problems can be avoided by forcing the use of the POSIX locale, for example the following identifies lines in both file1 and file2:
LC_ALL=POSIX sort file1 > file1.posix
LC_ALL=POSIX sort file2 > file2.posix
LC_ALL=POSIX comm -12 file1.posix file2.posix | sort
The final sort re-sorts the output of comm according to the collating sequence of the original locale. Doing this might be difficult if more than one column is output and leading blanks cannot be ignored.

to:

If the collating sequence of the current locale does not have a total ordering of all characters, since comm treats lines as being the same only if they are identical, some lines can be misleadingly identified as being both unique to file1 and unique to file2 if lines that collate equally but are not identical are not ordered in the way that comm expects. If the input does not come from utilities (such as ls and sort) which provide this ordering, the problem can be avoided by pre-sorting the input files using sort.
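
Illustration only, not part of the proposed text: a minimal sketch of the pre-sorting advice in the paragraph above. The function name common_lines is hypothetical, and the sketch assumes that a mktemp utility is available for creating scratch files.

```shell
# common_lines FILE1 FILE2
# Writes the lines present in both files. Each input is first passed
# through sort so that comm sees the ordering it expects.
# The body runs in a subshell so the scratch variables do not leak.
common_lines() (
    tmp1=$(mktemp) || exit 1
    tmp2=$(mktemp) || { rm -f "$tmp1"; exit 1; }
    sort "$1" > "$tmp1" &&
    sort "$2" > "$tmp2" &&
    comm -12 "$tmp1" "$tmp2"
    status=$?
    rm -f "$tmp1" "$tmp2"
    exit "$status"
)
```

Because sort and comm run in the same locale here, the output is already ordered according to that locale's collating sequence and no final re-sort is needed.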

On Page: 2561 Line: 82842 Section: comm

Change the updated FUTURE DIRECTIONS section from:

A future version of this standard may require that if any lines from the input files collate equally but are not identical, then comm treats them as different lines and expects them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

A future version of this standard may require that if the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, then lines written that collate equally but are not identical are ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

to:

None.

On Page: 2874 Line: 94650 Section: localedef

Add a new paragraph to the DESCRIPTION section:

If the LC_COLLATE category defines a collation sequence that does not have a total ordering of all characters, localedef shall write a warning message to standard error and, if the exit status would otherwise have been zero, shall exit with status 1.
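
Illustration only, not part of the proposed text: a script that calls localedef might inspect the exit status to notice the new warning. The sketch below maps each status to a short description based on the EXIT STATUS values listed for localedef in the standard. The function name and the wording of the messages are hypothetical, as are the file names and the charmap name in the usage comment.

```shell
# describe_localedef_status STATUS
# Prints a short description of a localedef exit status.
#
# Hypothetical usage (my_locale.src and ./my_locale are placeholders):
#   localedef -i my_locale.src -f UTF-8 ./my_locale
#   describe_localedef_status "$?"
describe_localedef_status() {
    case $1 in
        0) echo "locale created, no warnings" ;;
        # With the proposed change, a collation sequence without a total
        # ordering of all characters is one possible cause of status 1.
        1) echo "locale created, warnings issued" ;;
        2) echo "no locale created: limits exceeded or character set not supported" ;;
        3) echo "no locale created: creating locales is not supported" ;;
        *) echo "no locale created: warnings or errors occurred" ;;
    esac
}
```

Status 1 alone does not say which warning was issued, so a caller that cares specifically about collation would also need to read the message written to standard error.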

On Page: 2888 Line: 95164 Section: ls

In the new DESCRIPTION paragraph change from:

any filenames or pathnames that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

any filenames or pathnames that collate equally shall be further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 2896 Line: 95520 Section: ls

In the FUTURE DIRECTIONS section, delete the new paragraph:

A future version of this standard may require that if the collating sequence for the current locale does not have a total ordering of all characters, any filenames or pathnames that collate equally are further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 3210 Line: 107544 Section: sort

In the updated text, change from:

any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

any lines of input that collate equally shall be further compared byte-by-byte using the collating sequence for the POSIX locale.
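
Illustration only, not part of the proposed text: a locale whose collating sequence lacks a total ordering may not be installed on a given system, so the sketch below uses an analogy instead. In the POSIX locale, sort -f makes lines that differ only in case compare equal on the key, and the existing last-resort comparison of sort then orders those lines with all bytes significant. This is not the locale-based rule itself, only a way to observe the same two-level comparison. The function name sort_with_tiebreak is hypothetical.

```shell
# sort_with_tiebreak
# Reads lines on standard input and sorts them ignoring case. Lines that
# compare equal on the folded key fall back to the last-resort comparison
# of the whole line, which is byte-by-byte in the POSIX locale.
sort_with_tiebreak() {
    LC_ALL=POSIX sort -f
}
```

Given the four input lines b, B, a, A, the function writes A, a, B, b: the pairs that fold to the same key are adjacent, and within each pair the uppercase letter comes first because its byte value is lower.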

On Page: 3214 Line: 107719 Section: sort

In the updated APPLICATION USAGE text, change from:

If the collating sequence of the current locale does not have a total ordering of all characters, this can affect the behavior of sort in the following ways:
* As <tt>sort -u</tt> suppresses lines with duplicate keys, it suppresses lines that collate equally but are not identical.

* The output of sort (without -u) can contain identical lines that are not adjacent, if it does not implement the recommended further byte-by-byte comparison of lines that collate equally. This affects the use of sort with comm and uniq; see the APPLICATION USAGE for those utilities.
to:

If the collating sequence of the current locale does not have a total ordering of all characters, since <tt>sort -u</tt> suppresses lines with duplicate keys, it suppresses lines that collate equally but are not identical.

On Page: 3215 Line: 107783 Section: sort

In the new RATIONALE paragraph change from:

Implementations are encouraged to perform the recommended further byte-by-byte comparison of lines that collate equally, even though this may affect efficiency. The impact on efficiency can be mitigated by only performing the additional comparison if the current locale's collating sequence does not have a total ordering of all characters (if the implementation provides a way to query this) or by only performing the additional comparison if the locale name associated with the LC_COLLATE category has an '@' modifier in the name (since locales without an '@' modifier should have a total ordering of all characters - see [xref to XBD 7.3.2]). Note that if the implementation provides a stable sort option as an extension (usually -s), the additional comparison should not be performed when this option has been specified.

to:

The required further byte-by-byte comparison of lines that collate equally may have an impact on efficiency, but this can be mitigated by only performing the additional comparison if the current locale's collating sequence does not have a total ordering of all characters (if the implementation provides a way to query this) or by only performing the additional comparison if the locale name associated with the LC_COLLATE category has an '@' modifier in the name (since implementation-supplied locales without an '@' modifier have a total ordering of all characters - see [xref to XBD 7.3.2] - and localedef users are warned to follow the same convention). Note that if the implementation provides a stable sort option as an extension (usually -s), the additional comparison should not be performed when this option has been specified.

On Page: 3215 Line: 107785 Section: sort

Change the updated FUTURE DIRECTIONS section from:

A future version of this standard may require that if the collating sequence of the current locale does not have a total ordering of all characters, any lines of input that collate equally when comparing them as whole lines are further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

None.

On Page: 3310 Line: 111099 Section: uniq

In the updated APPLICATION USAGE section, change from:

If the collating sequence of the current locale has a total ordering of all characters, the sort utility can be used to cause repeated lines to be adjacent in the input file. If the collating sequence does not have a total ordering of all characters, the sort utility should still do this but it might not. To ensure that all duplicate lines are eliminated, and have the output sorted according to the collating sequence of the current locale, applications should use:
LC_ALL=C sort -u | sort
instead of:
sort | uniq
To remove duplicate lines based on whether they collate equally instead of whether they are identical, applications should use:
sort -u
instead of:
sort | uniq
to:

The sort utility can be used to cause repeated lines to be adjacent in the input file.

If the collating sequence of the current locale does not have a total ordering of all characters, the behavior of <tt>sort | uniq</tt> differs from <tt>sort -u</tt>, as uniq treats lines as duplicates only if they are identical, whereas <tt>sort -u</tt> treats lines as duplicates if they collate equally.
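
Illustration only, not part of the proposed text: the difference described above can be observed without a special locale by using an analogy. In the POSIX locale, sort -f treats lines that differ only in case as having equal keys, which stands in here for lines that collate equally but are not identical. The function names are hypothetical.

```shell
# Each function reads lines on standard input and prints how many
# lines survive duplicate removal.

# uniq removes a line only if it is identical to the previous one.
count_after_sort_uniq() {
    LC_ALL=POSIX sort -f | uniq | wc -l | tr -d ' '
}

# sort -u removes a line if its key compares equal to another line's key.
count_after_sort_u() {
    LC_ALL=POSIX sort -f -u | wc -l | tr -d ' '
}
```

For the five input lines b, B, a, A, a, the first function prints 4 (only the repeated a is dropped) and the second prints 2 (one line is kept per case-folded key).
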
Tags issue8
Attached Files

Relationships
related to 0000963 Closed Collation issues in XCU (changes for TC2)
related to 0000948 Applied Collation issues in XBD (changes for Issue 8)

There are no notes attached to this issue.

Issue History
Date Modified Username Field Change
2016-08-25 11:11 geoffclare New Issue
2016-08-25 11:11 geoffclare Name => Geoff Clare
2016-08-25 11:11 geoffclare Organization => The Open Group
2016-08-25 11:11 geoffclare Section => 2.13.3, awk, comm, localedef, ls, sort, uniq
2016-08-25 11:11 geoffclare Page Number => 2356, 2459, 2559, 2874, 2888, 3210, 3309, and more
2016-08-25 11:11 geoffclare Line Number => 75082, 78745, 82755, 94650, 95164, 107544, 111067, and more
2016-08-25 11:11 geoffclare Interp Status => ---
2016-08-25 11:11 geoffclare Relationship added related to 0000963
2016-08-25 11:12 geoffclare Relationship added related to 0000948
2016-08-25 11:18 geoffclare Desired Action Updated
2018-01-04 17:07 Don Cragun Status New => Resolved
2018-01-04 17:07 Don Cragun Resolution Open => Accepted
2018-01-04 17:07 Don Cragun Tag Attached: issue8
2020-04-21 13:35 geoffclare Status Resolved => Applied

