0000963: Collation issues in XCU (changes for TC2) - Austin Group Defect Tracker

ID: 0000963
Category: [1003.1(2013)/Issue7+TC1] Shell and Utilities
Severity: Objection
Type: Error
Date Submitted: 2015-06-24 11:07
Last Update: 2019-06-10 08:54
Reporter: geoffclare
View Status: public
Assigned To:
Priority: normal
Resolution: Accepted
Status: Closed
Name: Geoff Clare
Organization: The Open Group
User Reference:
Section: 2.13.3, awk, comm, expr, join, ls, sort, uniq
Page Number: 2356, 2459, 2559, 2740, 2839, 2888, 3210, 3309, and more
Line Number: 75082, 78745, 82755, 89708, 93278, 95164, 107544, 111067, and more
Interp Status: ---
Final Accepted Text:
Summary: 0000963: Collation issues in XCU (changes for TC2)

Description

A discussion on the mailing list identified some issues related to
collation for locales that do not define a collation sequence with
a total ordering of all characters. It is proposed that these issues
are addressed in Issue 8 by requiring implementation-provided locales
that do not have an '@' modifier in their name to define a collation
sequence that has a total ordering of all characters (thus reducing
the problem to "special" locales and user-defined locales), and by
modifying the requirements for regular expressions and affected
utilities so that they cope better with such locales. As an
intermediate step, it is proposed that the new requirements slated
for Issue 8 are recommended (or at least allowed) in TC2.

The necessary changes will be split across four Mantis bugs, targeting
XBD TC2, XCU TC2, XBD Issue 8, and XCU Issue 8. This bug contains the
changes proposed for XCU in TC2.

Desired Action

On Page: 2356 Line: 75082 Section: 2.13.3 Patterns Used for Filename Expansion

In list item 3, change from:

... sorted according to the collating sequence in effect in the current locale.

to:

... sorted according to the collating sequence in effect in the current locale. If this collating sequence does not have a total ordering of all characters (see [xref to XBD 7.3.2]), any filenames or pathnames that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

Note: a future version of this standard may require the byte-by-byte further comparison described above.

On Page: 2459 Line: 78745 Section: awk

In the EXTENDED DESCRIPTION section, change from:

operands shall be converted to strings as required and a string comparison shall be made using the locale-specific collation sequence. The value of the comparison expression shall be 1 if the relation is true, or 0 if the relation is false.

to:

operands shall be converted to strings as required and a string comparison shall be made as follows:

* For the "!=" and "==" operators, the strings should be compared to check if they are identical but may be compared using the locale-specific collation sequence to check if they collate equally.

* For the other operators, the strings shall be compared using the locale-specific collation sequence.

The value of the comparison expression shall be 1 if the relation is true, or 0 if the relation is false.

On Page: 2478 Line: 79587 Section: awk

In the APPLICATION USAGE section, add two new paragraphs:

On implementations where the "==" operator checks if strings collate equally, applications needing to check whether strings are identical can use:

length(a) == length(b) && index(a,b) == 1

On implementations where the "==" operator checks if strings are identical, applications needing to check whether strings collate equally can use:

a <= b && a >= b
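
As an illustration only (not part of the proposed text), the two idioms above can be wrapped in awk functions and exercised from the shell. The function names identical() and collate_equal() are hypothetical, and the sketch is run in the POSIX locale, where both kinds of comparison agree:

```shell
# Illustration only: identical() and collate_equal() are made-up names
# wrapping the two idioms from the APPLICATION USAGE text.
awk_prog='
function identical(a, b)     { return length(a) == length(b) && index(a, b) == 1 }
function collate_equal(a, b) { return a <= b && a >= b }
BEGIN {
    x = "abc"; y = "abc"; z = "abd"
    print identical(x, y), identical(x, z), collate_equal(x, y), collate_equal(x, z)
}'

# In the POSIX locale the collation sequence is a total ordering, so
# the two checks give the same answers.
result=$(LC_ALL=POSIX awk "$awk_prog")
echo "$result"    # prints: 1 0 1 0
```

In a locale without a total ordering, the two functions could disagree for strings that collate equally but differ in content. The sketch does not attempt to demonstrate that case, since it depends on which locales are installed.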

On Page: 2486 Line: 79914 Section: awk

In the FUTURE DIRECTIONS section, change from:

None.

to:

A future version of this standard may require the "!=" and "==" operators to perform string comparisons by checking if the strings are identical (and not by checking if they collate equally).

On Page: 2559 Line: 82755 Section: comm

In the DESCRIPTION section, add a new paragraph:

If the collating sequence of the current locale does not have a total ordering of all characters (see [xref to XBD 7.3.2]) and any lines from the input files collate equally but are not identical, comm should treat them as different lines but may treat them as being the same. If it treats them as different, comm should expect them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale and if they are not ordered in this way, the output of comm can identify such lines as being both unique to file1 and unique to file2 instead of being in both files.

On Page: 2560 Line: 82810 Section: comm

In the STDOUT section, change from:

If the input files were ordered according to the collating sequence of the current locale, the lines written shall be in the collating sequence of the original lines.

to:

If the input files were ordered according to the collating sequence of the current locale, the lines written shall be in the collating sequence of the current locale. If the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, and comm treated them as different lines, then lines written that collate equally but are not identical should be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

On Page: 2561 Line: 82825 Section: comm

In the APPLICATION USAGE section, add the following new paragraphs:

If the collating sequence of the current locale does not have a total ordering of all characters, this can affect the behavior of comm in the following ways:

* If comm treats lines as being the same only if they are identical, some lines can be misleadingly identified as being both unique to file1 and unique to file2.

* If comm treats lines as being the same if they collate equally and a line from file1 collates equally with a line from file2 but is not identical to it, one of the lines is misleadingly identified as being in both files and the other is not written to the output at all.

Such problems can be avoided by forcing the use of the POSIX locale, for example the following identifies lines in both file1 and file2:

LC_ALL=POSIX sort file1 > file1.posix
LC_ALL=POSIX sort file2 > file2.posix
LC_ALL=POSIX comm -12 file1.posix file2.posix | sort

The final sort re-sorts the output of comm according to the collating sequence of the original locale. Doing this might be difficult if more than one column is output and leading blanks cannot be ignored.
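
As an illustration only (not part of the proposed text), the same pipeline can be tried on throwaway data. The file contents and the temporary directory are made up for this sketch, and mktemp -d is assumed to be available:

```shell
# Illustration only: file names and contents are invented for the demo.
tmpdir=$(mktemp -d)
printf '%s\n' cherry apple banana > "$tmpdir/file1"
printf '%s\n' apple date cherry   > "$tmpdir/file2"

# Sort and compare in the POSIX locale, so that "the same line" always
# means "identical bytes".
LC_ALL=POSIX sort "$tmpdir/file1" > "$tmpdir/file1.posix"
LC_ALL=POSIX sort "$tmpdir/file2" > "$tmpdir/file2.posix"

# The trailing sort puts the result back into the caller's collation order.
common=$(LC_ALL=POSIX comm -12 "$tmpdir/file1.posix" "$tmpdir/file2.posix" | sort)
rm -rf "$tmpdir"
echo "$common"
```

With this data the lines common to both files are apple and cherry.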

On Page: 2561 Line: 82842 Section: comm

In the FUTURE DIRECTIONS section, change from:

None.

to:

A future version of this standard may require that if any lines from the input files collate equally but are not identical, then comm treats them as different lines and expects them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

A future version of this standard may require that if the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, then lines written that collate equally but are not identical are ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

On Page: 2740 Line: 89708 Section: expr

In the APPLICATION USAGE section, add a new paragraph:

For testing string equality the test utility is preferred over expr, as it is usually implemented as a shell built-in. However, the functionality is not quite the same because the expr '=' and "!=" operators check whether strings collate equally, whereas test checks whether they are identical. Therefore, they can produce different results in locales where the collation sequence does not have a total ordering of all characters (see [xref to XBD 7.3.2]).
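
As an illustration only (not part of the proposed text), the two equality checks can be written side by side. The variable names are made up. The strings here are identical, so both utilities agree in any locale; a difference could only show up for strings that collate equally without being identical:

```shell
# Illustration only: a and b are placeholder variables.
a='resume'
b='resume'

# test checks whether the strings are identical.
if [ "$a" = "$b" ]; then test_result=same; else test_result=different; fi

# expr compares using the collating sequence of the current locale.
# The leading X keeps expr from treating an operand as an operator
# or as an integer.
if expr "X$a" = "X$b" >/dev/null; then expr_result=same; else expr_result=different; fi

echo "test: $test_result, expr: $expr_result"
```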

On Page: 2839 Line: 93278 Section: join

In the DESCRIPTION section, change from:

that have identical join fields

to:

that have join fields that collate equally

On Page: 2841 Line: 93377 Section: join

In the APPLICATION USAGE section, add a new paragraph:

If the collating sequence of the current locale does not have a total ordering of all characters (see [xref to XBD 7.3.2]), join treats fields that collate equally but are not identical as being the same. If this behavior is not desired, it can be avoided by forcing the use of the POSIX locale (although this means re-sorting the input files into the POSIX locale collating sequence).
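
As an illustration only (not part of the proposed text), forcing the POSIX locale for join might look like the following. The file names and contents are made up, and mktemp -d is assumed to be available:

```shell
# Illustration only: "left" and "right" are invented input files.
tmpdir=$(mktemp -d)
printf '%s\n' 'b 2' 'a 1' 'c 3' > "$tmpdir/left"
printf '%s\n' 'c z' 'a x'       > "$tmpdir/right"

# Re-sort both inputs on the join field in the POSIX locale ...
LC_ALL=POSIX sort -k 1,1 "$tmpdir/left"  > "$tmpdir/left.posix"
LC_ALL=POSIX sort -k 1,1 "$tmpdir/right" > "$tmpdir/right.posix"

# ... then join in the same locale, so that only identical join fields
# are paired.
joined=$(LC_ALL=POSIX join "$tmpdir/left.posix" "$tmpdir/right.posix")
rm -rf "$tmpdir"
echo "$joined"
```

The sort and the join must run in the same locale; otherwise join can see input that is not ordered the way it expects.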

On Page: 2888 Line: 95164 Section: ls

In the DESCRIPTION section, add a new paragraph:

Whenever ls sorts filenames or pathnames according to the collating sequence in the current locale, if this collating sequence does not have a total ordering of all characters (see [xref to XBD 7.3.2]), then any filenames or pathnames that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 2896 Line: 95520 Section: ls

In the FUTURE DIRECTIONS section, add a new paragraph:

A future version of this standard may require that if the collating sequence for the current locale does not have a total ordering of all characters, any filenames or pathnames that collate equally are further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 3210 Line: 107544 Section: sort

In the DESCRIPTION section, change from:

... shall be performed using the collating sequence of the current locale.

to:

... shall be performed using the collating sequence of the current locale. If this collating sequence does not have a total ordering of all characters (see [xref to XBD 7.3.2]), any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 3214 Line: 107719 Section: sort

In the APPLICATION USAGE section, add a new paragraph:

If the collating sequence of the current locale does not have a total ordering of all characters, this can affect the behavior of sort in the following ways:

* As sort -u suppresses lines with duplicate keys, it suppresses lines that collate equally but are not identical.

* The output of sort (without -u) can contain identical lines that are not adjacent, if it does not implement the recommended further byte-by-byte comparison of lines that collate equally. This affects the use of sort with comm and uniq; see the APPLICATION USAGE for those utilities.

On Page: 3215 Line: 107783 Section: sort

In the RATIONALE section, add a new paragraph:

Implementations are encouraged to perform the recommended further byte-by-byte comparison of lines that collate equally, even though this may affect efficiency. The impact on efficiency can be mitigated by only performing the additional comparison if the current locale's collating sequence does not have a total ordering of all characters (if the implementation provides a way to query this) or by only performing the additional comparison if the locale name associated with the LC_COLLATE category has an '@' modifier in the name (since locales without an '@' modifier should have a total ordering of all characters - see [xref to XBD 7.3.2]). Note that if the implementation provides a stable sort option as an extension (usually -s), the additional comparison should not be performed when this option has been specified.
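
As an illustration only (not part of the proposed text), the '@' modifier heuristic described above can be sketched at shell level. A real sort implementation would make this check internally. Both function names are hypothetical, and falling back to "POSIX" when no locale variable is set is an assumption of this sketch:

```shell
# Illustration only: both function names are hypothetical.

# Print the locale name in effect for LC_COLLATE, using the usual
# precedence: LC_ALL, then LC_COLLATE, then LANG. Falling back to
# "POSIX" when none is set is an assumption of this sketch.
collate_locale_name() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' POSIX
    fi
}

# Succeed if the given locale name contains an '@' modifier, which is
# when the extra byte-by-byte comparison would be worth performing.
has_at_modifier() {
    case $1 in
        *@*) return 0 ;;
        *)   return 1 ;;
    esac
}

if has_at_modifier "$(collate_locale_name)"; then
    tiebreak=needed
else
    tiebreak=skipped
fi
echo "byte-by-byte tie-break: $tiebreak"
```

The check is only a heuristic: it relies on the recommendation that locales without an '@' modifier have a total ordering, which user-defined locales need not follow.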

On Page: 3215 Line: 107785 Section: sort

In the FUTURE DIRECTIONS section, change from:

None.

to:

A future version of this standard may require that if the collating sequence of the current locale does not have a total ordering of all characters, any lines of input that collate equally when comparing them as whole lines are further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 3309 Line: 111067 Section: uniq

In the ENVIRONMENT VARIABLES section, delete:

LC_COLLATE

Determine the locale for ordering rules.

On Page: 3310 Line: 111099 Section: uniq

In the APPLICATION USAGE section, change from:

The sort utility can be used to cause repeated lines to be adjacent in the input file.

to:

If the collating sequence of the current locale has a total ordering of all characters, the sort utility can be used to cause repeated lines to be adjacent in the input file. If the collating sequence does not have a total ordering of all characters, the sort utility should still do this but it might not. To ensure that all duplicate lines are eliminated, and have the output sorted according to the collating sequence of the current locale, applications should use:

LC_ALL=C sort -u | sort

instead of:

sort | uniq

To remove duplicate lines based on whether they collate equally instead of whether they are identical, applications should use:

sort -u

instead of:

sort | uniq
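
As an illustration only (not part of the proposed text), the first pipeline above can be tried on made-up input:

```shell
# Illustration only: the input words are invented for the demo.
input=$(printf '%s\n' pear apple pear fig apple)

# Remove identical lines in the C locale, then re-sort the survivors
# into the collation order of the current locale.
deduped=$(printf '%s\n' "$input" | LC_ALL=C sort -u | sort)
echo "$deduped"
```

Because the duplicates are removed under LC_ALL=C, only byte-identical lines are dropped; the second sort affects ordering only.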

Tags

tc2-2008, UTF-8_Locale

Relationships

related to	0000938	Closed		Collation issues in XBD (changes for TC2)
related to	0000948	Applied		Collation issues in XBD (changes for Issue 8)
related to	0001070	Applied		Collation issues in XCU (changes for Issue 8)

There are no notes attached to this issue.

Issue History
Date Modified	Username	Field	Change
2015-06-24 11:07	geoffclare	New Issue
2015-06-24 11:07	geoffclare	Name	=> Geoff Clare
2015-06-24 11:07	geoffclare	Organization	=> The Open Group
2015-06-24 11:07	geoffclare	Section	=> 2.13.3, awk, comm, expr, join, ls, sort, uniq
2015-06-24 11:07	geoffclare	Page Number	=> 2356, 2459, 2559, 2740, 2839, 2888, 3210, 3309, and more
2015-06-24 11:07	geoffclare	Line Number	=> 75082, 78745, 82755, 89708, 93278, 95164, 107544, 111067, and more
2015-06-24 11:07	geoffclare	Interp Status	=> ---
2015-06-24 15:24	geoffclare	Relationship added	related to 0000938
2015-07-30 15:20	Don Cragun	Status	New => Resolved
2015-07-30 15:20	Don Cragun	Resolution	Open => Accepted
2015-07-30 15:20	Don Cragun	Desired Action Updated
2015-07-30 15:20	Don Cragun	Tag Attached: tc2-2008
2015-07-30 17:05	rhansen	Tag Attached: UTF-8_Locale
2016-02-04 16:18	nick	Relationship added	related to 0000948
2016-08-25 11:11	geoffclare	Relationship added	related to 0001070
2019-06-10 08:54	agadmin	Status	Resolved => Closed
