View Issue Details

ID: 0000948
Project: 1003.1(2013)/Issue7+TC1
Category: Base Definitions and Headers
View Status: public
Last Update: 2024-06-11 09:02
Reporter: geoffclare
Assigned To:
Priority: normal
Severity: Objection
Type: Error
Status: Closed
Resolution: Accepted
Name: Geoff Clare
Organization: The Open Group
User Reference:
Section: 7.3.2, 9.3.5
Page Number: 147, 150, 184
Line Number: 4393, 4503, 5963, and more
Interp Status: ---
Final Accepted Text:
Summary: 0000948: Collation issues in XBD (changes for Issue 8)
Description: A discussion on the mailing list identified some issues related to
collation for locales that do not define a collation sequence with
a total ordering of all characters. It is proposed that these issues
are addressed in Issue 8 by requiring implementation-provided locales
that do not have an '@' modifier in their name to define a collation
sequence that has a total ordering of all characters (thus reducing
the problem to "special" locales and user-defined locales), and by
modifying the requirements for regular expressions and affected
utilities so that they cope better with such locales. As an
intermediate step, it is proposed that the new requirements slated
for Issue 8 are recommended (or at least allowed) in TC2.

The necessary changes will be split across four Mantis bugs, targeting
XBD TC2, XCU TC2, XBD Issue 8, and XCU Issue 8. This bug contains the
changes proposed for XBD in Issue 8.
Desired Action: After applying the bug 0000938 changes at each of the following locations, make further changes to the new text as noted below.

On Page: 147 Line: 4393 Section: 7.3.2 LC_COLLATE

In the new paragraph after the numbered list, change from:

All implementation-provided locales (either preinstalled or provided as locale definitions which can be installed later) should define ...

to:

All implementation-provided locales (either preinstalled or provided as locale definitions which can be installed later) shall define ...

and delete the first of the new small-font notes:

<small>Note: a future version of this standard may require these locales to define a collation sequence that has a total ordering of all characters (by changing "should" to "shall").</small>

On Page: 150 Line: 4503 Section: 7.3.2.4 Collation Order

In the new paragraph, change from:

Weights should be assigned such that the collation sequence ...

to:

Weights shall be assigned such that the collation sequence ...

and delete the small-font note:

<small>Note: a future version of this standard may require a total ordering of all characters for implementation-provided locales that do not have an '@' modifier in the locale name. See [xref to 7.3.2].</small>

On Page: 150 Line: 4517 Section: 7.3.2.4 Collation Order

In the updated text, change from:

If the collation order has only one weight level, these characters should be assigned unique primary weights, equal to the relative order of their character in the character collation sequence, but may be assigned the same primary weight.

to:

If the collation order has only one weight level, these characters shall be assigned unique primary weights, equal to the relative order of their character in the character collation sequence.

and delete the small-font note:

<small>Note: a future version of this standard may require these characters to be assigned unique primary weights if the collation order has only one weight level.</small>
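
The single-level requirement above can be pictured with a toy model. The sketch below is not locale-definition syntax and not any implementation's code; the weight tables and the helper name `is_total_order` are made up for illustration. It only checks whether a one-level weight table gives every character a distinct primary weight, which is what makes the resulting ordering total.

```python
def is_total_order(primary_weights):
    """Return True if no two characters share a primary weight.

    primary_weights is a hypothetical one-level collation table
    mapping each character to an integer weight.
    """
    weights = list(primary_weights.values())
    return len(weights) == len(set(weights))


# Hypothetical table where each weight equals the character's position
# in the collation sequence, so every weight is unique.
unique_weights = {ch: pos for pos, ch in enumerate("aAbBcC")}

# Hypothetical table of the kind the earlier "should ... but may" wording
# still allowed: upper and lower case share a primary weight.
shared_weights = {"a": 0, "A": 0, "b": 1, "B": 1, "c": 2, "C": 2}

print(is_total_order(unique_weights))  # True
print(is_total_order(shared_weights))  # False
```

Under the Issue 8 wording, only a table like the first would be acceptable for a single-level collation order.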

On Page: 184 Line: 5963 Section: 9.3.5 RE Bracket Expression

In the updated list item 2, change from:

An ordinary character in the list should only match that character, but may match any single character that collates equally with that character; for example, "[abc]" is an RE that should only match one of the characters 'a', 'b', or 'c'.

to:

An ordinary character in the list shall only match that character; for example, "[abc]" is an RE that only matches one of the characters 'a', 'b', or 'c'.

and delete the small-font note:

<small>Note: a future version of this standard may require that an ordinary character in the list only matches that character.</small>

On Page: 184 Line: 5970 Section: 9.3.5 RE Bracket Expression

In the updated list item 3, change from:

For example, if the RE "[abc]" only matches 'a', 'b', or 'c', then "[^abc]" is an RE that matches any character except 'a', 'b', or 'c'.

to:

For example, since the RE "[abc]" only matches 'a', 'b', or 'c', it follows that "[^abc]" is an RE that matches any character except 'a', 'b', or 'c'.
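
As a rough illustration of list items 2 and 3, the sketch below uses Python's `re` module as a stand-in. That module is not a POSIX RE implementation and does not consult LC_COLLATE, so it shows only the behavior the new wording requires, not how a conforming implementation arrives at it.

```python
import re

# "[abc]" matches exactly the listed characters and nothing else.
matching = re.compile(r"[abc]")
# "[^abc]" matches any single character except the listed ones.
non_matching = re.compile(r"[^abc]")

for ch in ["a", "b", "c", "d", "A", "\u00e1"]:
    in_list = matching.fullmatch(ch) is not None
    not_in_list = non_matching.fullmatch(ch) is not None
    # Each character satisfies exactly one of the two expressions.
    assert in_list != not_in_list
    print(ascii(ch), in_list, not_in_list)
```

Because an ordinary character in the list matches only itself, the two expressions partition the set of single characters, which is the point of the "since ... it follows that" rewording.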


Cross-volume changes to XRAT ...

On Page: 3490 Line: 117820 Section: A.7.3.2 LC_COLLATE

In the new paragraph, change from:

This standard recommends (by the use of "should" in the normative text) that ...

to:

This standard requires that ...
Tags: issue8, UTF-8_Locale

Relationships

related to 0000938 Closed Collation issues in XBD (changes for TC2) 
related to 0000963 Closed Collation issues in XCU (changes for TC2) 
related to 0001070 Closed Collation issues in XCU (changes for Issue 8) 

Activities

eblake

2015-06-04 18:02

manager   bugnote:0002697

Is this proposed wording still accurate, in light of 0000872 documenting how REG_ICASE affects range expressions?
 An ordinary character in the list shall only match that character; for example, "[abc]" is an RE that only matches one of the characters 'a', 'b', or 'c'.

shware_systems

2015-06-04 18:38

reporter   bugnote:0002698

It nominally is, as 'upper' and 'lower' determination is independent of collation order and bug 872 relates to equality testing only, not ordering, but a clarification 'when REG_ICASE has not been specified' inserted somewhere in there wouldn't hurt either, imo.
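
To illustrate the point that case-insensitive matching depends on case mapping rather than on collation order, here is a small model. It assumes the XBD 9.2 rule that, for case-insensitive matching, a character's case counterpart (if any) is also matched. `matches_bracket` is a hypothetical helper, and `str.swapcase()` stands in for the locale's case mapping, which in a real implementation comes from LC_CTYPE.

```python
def matches_bracket(ch, members, icase=False):
    """Model matching one character against an ordinary-character list.

    Case-sensitive matching (the default) accepts only listed
    characters. With icase=True the character's case counterpart,
    if any, is also tried. No collation weights are consulted.
    """
    if ch in members:
        return True
    if icase:
        counterpart = ch.swapcase()
        return counterpart != ch and counterpart in members
    return False


print(matches_bracket("A", "abc"))              # False
print(matches_bracket("A", "abc", icase=True))  # True
print(matches_bracket("d", "abc", icase=True))  # False
```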

shware_systems

2015-06-04 20:58

reporter   bugnote:0002699

Last edited: 2015-06-04 21:00

Where the wording changes introduce a possible ambiguity is with the strxfrm() interface. The C standard just states the interface shall refer to the LC_COLLATE category, but is not explicit about when items are copied from a source string verbatim and when a transformed substitute must be stored, so the change from should to shall may break some existing implementations.

If the collation weightings are set up so that after COLL_WEIGHTS_MAX weights have been examined two elements can still compare as equal, though their binary value differs, it is not specified which element has primacy for storage so that strcmp() is deterministic. Using case-insensitive collation of letters as an example: does the lowercase or uppercase version of a character get copied or always stored, or is it the first member of the given weight class pulled from the LC_COLLATE category that gets stored, which may be uppercase where the input was lowercase?

I can see the latter being the intent, but some implementations may prefer a particular case as a last determining factor, to match the expectations of various standards the transformed strings are routinely used with. Maybe I'm off, but I don't see the language precluding such a preference being implemented.
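
The ambiguity described in this note can be made concrete with a toy model. The sketch below does not call the real strxfrm(); `toy_strxfrm` and the weight table are hypothetical, and the optional tie-break is just one policy an implementation might choose, not something the standard text specifies.

```python
# Hypothetical one-level, case-insensitive weight table: upper and
# lower case letters share a weight, as in the example from the note.
WEIGHTS = {ch: i for i, pair in enumerate(["aA", "bB", "cC"]) for ch in pair}


def toy_strxfrm(s, tie_break=False):
    """Map a string to a key whose tuple comparison mimics collation.

    With tie_break=True the code points are appended as a last-resort
    level, so strings that differ in binary value never compare equal.
    """
    key = tuple(WEIGHTS[ch] for ch in s)
    if tie_break:
        key += tuple(ord(ch) for ch in s)
    return key


# Without a tie-break the two spellings are indistinguishable.
print(toy_strxfrm("abc") == toy_strxfrm("ABC"))  # True
# With the tie-break the keys differ and upper case sorts first,
# because 'A' (65) has a smaller code point than 'a' (97).
print(toy_strxfrm("ABC", True) < toy_strxfrm("abc", True))  # True
```

A locale whose collation sequence is a total ordering of all characters never produces the first situation, since no two distinct characters compare equal.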

geoffclare

2015-06-05 09:38

manager   bugnote:0002700

(Response to 0000948:0002697)
I don't see any problem with the new wording as regards REG_ICASE. The way XBD chapter 9 is structured is that the details in 9.3 and 9.4 describe the normal case-sensitive matching process and the variation needed for case-insensitive matching is covered by this statement in 9.2:

"When a standard utility or function that uses regular expressions specifies that pattern matching shall be performed without regard to the case (uppercase or lowercase) of either data or patterns, then when each character in the string is matched against the pattern, not only the character, but also its case counterpart (if any), shall be matched."

Issue History

Date Modified Username Field Change
2015-05-11 15:38 geoffclare New Issue
2015-05-11 15:38 geoffclare Name => Geoff Clare
2015-05-11 15:38 geoffclare Organization => The Open Group
2015-05-11 15:38 geoffclare Section => 7.3.2, 9.3.5
2015-05-11 15:38 geoffclare Page Number => 147, 150, 184
2015-05-11 15:38 geoffclare Line Number => 4393, 4503, 5963, and more
2015-05-11 15:38 geoffclare Interp Status => ---
2015-05-11 15:39 geoffclare Relationship added related to 0000938
2015-06-04 18:02 eblake Note Added: 0002697
2015-06-04 18:38 shware_systems Note Added: 0002698
2015-06-04 20:58 shware_systems Note Added: 0002699
2015-06-04 21:00 shware_systems Note Edited: 0002699
2015-06-05 09:38 geoffclare Note Added: 0002700
2015-07-30 17:08 rhansen Tag Attached: issue8
2015-07-30 17:08 rhansen Tag Attached: UTF-8_Locale
2016-02-04 16:18 nick Relationship added related to 0000963
2016-02-04 16:33 Don Cragun Status New => Resolved
2016-02-04 16:33 Don Cragun Resolution Open => Accepted
2016-02-04 16:33 Don Cragun Desired Action Updated
2016-08-25 11:12 geoffclare Relationship added related to 0001070
2020-04-08 15:31 geoffclare Status Resolved => Applied
2024-06-11 09:02 agadmin Status Applied => Closed