Austin Group Defect Tracker

ID 0000663
Category [1003.1(2008)/Issue 7] System Interfaces
Severity Editorial
Type Clarification Requested
Date Submitted 2013-02-21 21:25
Last Update 2019-06-10 08:55
Reporter dalias
View Status public
Assigned To ajosey
Priority normal
Resolution Accepted As Marked
Status Closed
Name Rich Felker
Organization musl libc
User Reference
Section strcasecmp/strncasecmp
Page Number 1985
Line Number 62819
Interp Status Approved
Final Accepted Text see Note: 0002738
Summary 0000663: Specification of str[n]casecmp is ambiguous
Description The description includes the text:

"When the LC_CTYPE category of the current locale is from the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified."

As far as I can tell, the phrase "converted to lowercase" is not precisely specified anywhere. This is not a serious problem in the event that the POSIX locale only contains the required characters, but per XBD 7.2, implementations are permitted to have other characters available in the POSIX locale ("The tables in Locale Definition describe the characteristics and behavior of the POSIX locale for data consisting entirely of characters from the portable character set and the control character set. For other characters, the behavior is unspecified.")

If additional characters outside the portable character set exist in the POSIX locale, does "converted to lowercase" mean as if by tolower on a byte-by-byte basis, or as if by towlower on a character-by-character basis? Or does the "unspecified" clause in XBD 7.2 leave this choice up to the implementation?

A more worrisome issue is that, if an implementation has multibyte characters in the POSIX locale, the symmetry of strncasecmp is broken. The n argument is specified as the maximum number of bytes to compare from s1, not from s2, so from my reading of the text as written, strncasecmp may read more than n bytes from s2, and strncasecmp(a,b,n)>0 is not necessarily equivalent to strncasecmp(b,a,n)<0. In particular, consider the example (assuming UTF-8):

strncasecmp("\u00df", "\u1e9e", 2)
Desired Action Please clarify whether the standard places any requirements on an implementation to "convert to lowercase" as if by tolower or towlower.

If conversion as if by towlower is permitted, please clarify whether strncasecmp is permitted to read more than n bytes from s2 if necessary to compare (as if by towlower) n bytes from s1, and whether there are any obligations on an implementation when an invalid multibyte sequence is encountered in s1 or s2.

As possibly the only implementor to whom these issues seem relevant (I'm not aware of other implementations with multibyte characters in the POSIX locale), my preference would be to leave these choices unspecified or implementation-defined, and permit strncasecmp, if necessary, to read more than n bytes from s2. If strncasecmp is forbidden from reading more than n bytes from s2, then I believe the only reasonable choice for the other decisions is to require conversion as if by tolower rather than towlower.
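
For illustration, the byte-by-byte ("as if by tolower") reading could be sketched as follows. This is a minimal sketch under that assumption, not the code of any actual implementation; the name sketch_strncasecmp is hypothetical. Under this reading at most n bytes are examined from either string.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: compare at most n bytes of s1 and s2, folding each
 * byte with tolower(). Assumes the byte-by-byte reading of "converted to
 * lowercase"; it never looks past n bytes of either string. */
int sketch_strncasecmp(const char *s1, const char *s2, size_t n)
{
    assert(s1 != NULL && s2 != NULL);

    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    for (; n > 0; a++, b++, n--) {
        int ca = tolower(*a);
        int cb = tolower(*b);

        if (ca != cb)
            return ca - cb;   /* first differing (folded) byte decides */
        if (ca == 0)
            return 0;         /* both strings ended before n bytes */
    }
    return 0;                 /* first n bytes compare equal */
}
```

Because both arguments are treated identically, swapping them only flips the sign of the result, which is the symmetry that a character-by-character (towlower) reading with a byte limit on s1 alone would not guarantee.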
Tags tc2-2008, UTF-8_Locale
Attached Files

- Relationships
related to 0000967 (Closed) 1003.1(2013)/Issue7+TC1 character set confusion
related to 0001182 (Closed) 1003.1(2016/18)/Issue7+TC2 CX behavior wasn't changed appropriately with TC2

-  Notes
(0001500)
eblake (manager)
2013-03-21 15:23

The C standard is clear that in the C locale, isupper() shall return non-zero for exactly 26 characters - the ascii alphabet. Since the POSIX locale is a synonym for the C locale, it should not matter whether the locale supports a multibyte encoding, there are still exactly 26 characters, all single bytes, that can be converted, and nothing else, because those are the only characters that may match the isupper() contract.

That said, we probably need to tighten the POSIX spec to make it clear that the POSIX locale has exactly 26 uppercase letters, regardless of whether it is single-byte or multi-byte, and regardless of whether other characters are recognized.
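
A quick way to probe this claim on a given system is to count the byte values that isupper() accepts in the C locale. The sketch below is only an illustration; the function name is hypothetical, and it covers single-byte values only (a multibyte encoding would need a similar loop over iswupper()).

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical probe: count the byte values 0..255 for which isupper()
 * returns non-zero once LC_CTYPE is set to the "C" locale. */
int count_upper_in_c_locale(void)
{
    int count = 0;

    setlocale(LC_CTYPE, "C");
    for (int c = 0; c <= 255; c++) {
        if (isupper(c))
            count++;
    }
    return count;
}
```

On an implementation that follows the reading above, the count should be 26.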
(0001501)
eblake (manager)
2013-03-21 15:45
edited on: 2015-07-02 16:20

Note that this resolution has been superseded by Note: 0002738.

Interpretation response
-----------------------
The standard clearly states that the POSIX locale is a synonym for the C locale with regards to LC_CTYPE, and defers to the C standard for its definitions of character categories. The C standard is clear that only 26 uppercase and 26 lowercase letters exist in the C locale, all of which have single-byte encodings.

Rationale:
----------
While the POSIX locale may include multibyte characters, it must still honor the C locale rules. Since the C locale has only 26 characters that can be altered by tolower(), all of which are single bytes, it does not matter whether strcasecmp() and strncasecmp() use byte or character normalization, and the length limit of strncasecmp() is specified in bytes.

We are also seeking proposals for the standardization of a "POSIX.UTF-8" locale for Issue 8.

Notes to the Editor (not part of this interpretation):
------------------------------------------------------
Make the following changes:

On page 136 line 3849 [7.2 POSIX locale], move the following sentences:

The tables in Section 7.3 describe the characteristics and behavior of the POSIX locale for data consisting entirely of characters from the portable character set and the control character set. For other characters, the behavior is unspecified.

to page 151 line 4531 [7.3.2.6 LC_COLLATE Category in the POSIX Locale], with a wording change:

The definition below describes the behavior of the POSIX locale for data consisting entirely of characters from the portable character set and the control character set. For other characters, the behavior is unspecified.

On page 139 line 3996 [7.3.1 LC_CTYPE upper], change:

In the POSIX locale, the 26 uppercase letters shall be included:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

to:

In the POSIX locale, only:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
shall be included.

On page 139 line 4003 [7.3.1 LC_CTYPE lower], change:

In the POSIX locale, the 26 lowercase letters shall be included:
a b c d e f g h i j k l m n o p q r s t u v w x y z

to:

In the POSIX locale, only:
a b c d e f g h i j k l m n o p q r s t u v w x y z
shall be included.

On page 139 line 4009 [7.3.1 LC_CTYPE alpha], change:

In the POSIX locale, all characters in the classes upper and lower shall be included.

to:

In the POSIX locale, only characters in the classes upper and lower shall be included.

On page 141 line 4091 [7.3.1 LC_CTYPE toupper], change:

In the POSIX locale, at a minimum, the 26 lowercase characters:

to:

In the POSIX locale, the 26 lowercase characters:

On page 142 line 4105 [7.3.1 LC_CTYPE tolower], change:

In the POSIX locale, at a minimum, the 26 uppercase characters:

to:

In the POSIX locale, the 26 uppercase characters:

On page 143 line 4145 [7.3.1.1 LC_CTYPE Category in the POSIX Locale], change:

# The following is the POSIX locale LC_CTYPE.
# "alpha" is by default "upper" and "lower"
# "alnum" is by definition "alpha" and "digit"
# "print" is by default "alnum", "punct", and the <space>
# "graph" is by default "alnum" and "punct"

to:

# The following is the minimum POSIX locale LC_CTYPE; implementations may
# add additional characters to the "cntrl" and "punct" classifications.
# "alpha" is by definition "upper" and "lower"
# "alnum" is by definition "alpha" and "digit"
# "print" is by definition "alnum", "punct", and the <space>
# "graph" is by definition "alnum" and "punct"

At page 151 line 4534 [7.3.2.6 LC_COLLATE Category in the POSIX Locale], change:

# This is the POSIX locale definition for the LC_COLLATE category.
# The order is the same as in the ASCII codeset.

to:

# This is the minimum input for the POSIX locale definition for the
# LC_COLLATE category. Characters in this list are in the same order
# as in the ASCII codeset.

(0001503)
dalias (reporter)
2013-03-23 01:35

As the reporter of this issue and perhaps the only implementor affected by it, I would prefer a resolution that does not impose a requirement that the POSIX locale's upper and lower cases classes and mappings contain only members of the portable character set. If such a new requirement is to be imposed, it should be based on a strong argument that the existence of additional characters with case mappings could break otherwise-correct applications. I am doubtful that such an argument exists; for the most part, an application using the POSIX locale cannot expect well-defined or predictable behavior if it is processing data which contains byte sequences outside of the portable character set.
(0001504)
eblake (manager)
2013-03-23 12:47
edited on: 2013-03-28 17:35

Phooey - I'm wondering if Note: 0001501 has been over-interpreting the C standard.

POSIX is already clear that the contract of isupper() defers to the C standard [line 38829], and the C standard is already explicit [C99 7.4.1.11] that in the C locale, isupper() may return true only for the characters which are letters according to C99 5.2.1. However, in re-reading 5.2.1, I see the following:

There is a requirement that there are exactly 26 uppercase letters in the basic execution character set. Likewise, there is a requirement that there are exactly 10 decimal digits in the basic execution character set, with consecutive encodings. There is also a strong statement that:

"A letter is an uppercase letter or a lowercase letter as defined above: in this International Standard the term does not include other characters that are letters in other alphabets."

We were originally interpreting this to mean that the C locale provides exactly the basic execution character set, with exactly 26 uppercase letters, and cannot provide extension letters. However, on re-reading, that statement in the C standard sounds to me like a letter is either one of the 26 uppercase letters in the basic character set, OR a character in the extended character set that is a letter, and merely that if the extended character set is smaller than the set of Unicode characters, then the term 'letter' in the standard would not include those Unicode letters not in the implementation's extended character set. But if we read it that way, where the extended character set can provide additional letters, then that also means that the extended character set can provide additional decimal digits.

Yet, in the POSIX standard, we required that there are exactly 10 characters that can belong to the 'digit' classification of the C locale [line 4015]. The approach in Note: 0001501 was to extend the exactness already present on digits to also apply to letters, but maybe the right approach is to instead go the other direction and not only allow extension character set members as additions to the set of letters, but also allow extension character set members as additions to the set of decimal digits. But this means that you can no longer use c-'0' to find the numeric value corresponding to a character c where isdigit(c) returns non-zero.

Now, you DO have some constraints - my understanding is that Unicode includes some characters which are considered letters, but which are neither uppercase nor lowercase. However, while non-C locales are permitted to have a character where isupper(c) and islower(c) both return false, but isalpha(c) returns true, this is forbidden on the C locale.
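
The constraint in the previous paragraph can be checked mechanically for single-byte values. The following is a small sketch with a hypothetical function name; it assumes the C locale is selectable with setlocale(LC_CTYPE, "C").

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical probe: count byte values that isalpha() accepts in the "C"
 * locale although neither isupper() nor islower() does. */
int count_uncased_alpha_in_c_locale(void)
{
    int count = 0;

    setlocale(LC_CTYPE, "C");
    for (int c = 0; c <= 255; c++) {
        if (isalpha(c) && !isupper(c) && !islower(c))
            count++;
    }
    return count;
}
```

If the C locale forbids letters that are neither uppercase nor lowercase, this count is expected to be zero.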

Since you seem to want additional multibyte letters allowed in the C locale, you would help your case by proposing actual wording to place into the standard, to be used in place of Note: 0001501.

(0001505)
dalias (reporter)
2013-03-23 21:42

On further reading of ISO C, I agree that while it permits the existence of additional characters in the C locale, it does not seem to permit the standard character classes to have any members beyond those that exist in the portable character set when the C locale is active. I further agree that it's misleading for POSIX to imply that such freedom of implementation exists, when alignment with the C standard actually precludes such freedom, so provided this interpretation of ISO C is correct, I would support efforts to include the C requirements in the POSIX text.

On the matter of allowing additional decimal digits, which was also mentioned, I think it would be harmful to applications, whether in the C locale or another locale. Under the current requirements of C and POSIX, isdigit(c) implies c-'0' is in the range 0 to 9 and represents the numeric value of the digit. Pulling this assumption out from under applications could lead to serious breakage.
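
The idiom at stake looks roughly like the sketch below (digit_value is a hypothetical name, not a standard function). It is only correct as long as isdigit() accepts nothing beyond the ten characters '0' to '9'.

```c
#include <ctype.h>

/* Hypothetical helper: return the numeric value 0..9 of a decimal digit,
 * or -1 if c is not a digit. Relies on the guarantee discussed above that
 * isdigit(c) implies c - '0' is in the range 0 to 9. */
int digit_value(unsigned char c)
{
    return isdigit(c) ? c - '0' : -1;
}
```

If additional digits from an extended character set were allowed to satisfy isdigit(), the subtraction would silently produce values outside 0 to 9.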
(0001529)
msbrown (manager)
2013-04-04 14:57
edited on: 2013-04-04 14:58

The collating sequence for an EBCDIC system (code page 1047) is going to be different from ASCII, I think. Here is some relevant data.

Here is the collation table definition from our library part edcclocp.c, which is part of the structure of our LC_COLLATE category. This is used for the static initialization of the POSIX locale, which we instantiate in LE during C initialization. I looked at a couple of alternate initializations that we use, but this is the only one that is relevant for POSIX applications. Character values in the array are decimal.

static collel_t _P_colleltbl[]={
   0,     1,     2,     3,    55,    45,    46,    47,    22,
   5,    21,    11,    12,    13,    14,    15,    16,    17,    18,
  19,    60,    61,    50,    38,    24,    25,    63,    39,    28,
  29,    30,    31,    64,    90,   127,   123,    91,   108,    80,
 125,    77,    93,    92,    78,   107,    96,    75,    97,   240,
 241,   242,   243,   244,   245,   246,   247,   248,   249,   122,
  94,    76,   126,   110,   111,   124,   193,   194,   195,   196,
 197,   198,   199,   200,   201,   209,   210,   211,   212,   213,
 214,   215,   216,   217,   226,   227,   228,   229,   230,   231,
 232,   233,   173,   224,   189,    95,   109,   121,   129,   130,
 131,   132,   133,   134,   135,   136,   137,   145,   146,   147,
 148,   149,   150,   151,   152,   153,   162,   163,   164,   165,
 166,   167,   168,   169,   192,    79,   208,   161,     7,     4,
   6,     8,     9,    10,    20,    23,    26,    27,    32,    33,
  34,    35,    36,    37,    40,    41,    42,    43,    44,    51,
  52,    53,    54,    56,    57,    58,    59,    65,    66,    67,
  68,    69,    70,    71,    72,    73,    74,    81,    82,    83,
  84,    85,    86,    87,    88,    89,    98,    99,   100,   101,
 102,   103,   104,   105,   106,   112,   113,   114,   115,   116,
 117,   118,   119,   120,   128,   138,   139,   140,   141,   142,
 143,   144,   154,   155,   156,   157,   158,   159,   160,   170,
 171,   172,   174,   175,   176,   177,   178,   179,   180,   181,
 182,   183,   184,   185,   186,   187,   188,   190,   191,   202,
 203,   204,   205,   206,   207,   218,   219,   220,   221,   222,
 223,   225,   234,   235,   236,   237,   238,   239,   250,   251,
 252,   253,   254,   255,
};

240 -> 249:  Numerals 0 - 9
193 -> 233:  Upper case A-Z
129 -> 169:  Lower case a-z


Here's a pointer to a quick reference to the values for EBCDIC code page 1047, which is what z/OS uses. http://en.wikipedia.org/wiki/EBCDIC_1047
The Wikipedia table entries show the glyph for printable characters (or acronym for control chars), the corresponding ASCII value, and the decimal value of the EBCDIC slot in the table, which helps in deciphering our static array definition.

I believe the ASCII collation sequence is just the numerical ordering of the ASCII code points from 0 - 7F, so it would be fairly easy to show graphically the difference between EBCDIC and ASCII. Also, thinking about it now, there are twice as many possible code points in the EBCDIC collation sequence, so even if the first 127 points matched, as in Unicode, the two orders would be different. Oh well.

(0001530)
eblake (manager)
2013-04-04 15:31

The table is missing collation values for decimal 48, 49, and 62 - what happens when sorting a file with contents $'\x30\n\x31\n\x3e\n\x30\n'?
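
Gaps of this kind can be found mechanically. The sketch below is an illustration only: the function name is hypothetical, and it assumes the table elements fit in an unsigned char (the actual definition of collel_t is not shown in this report).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: store in missing[] every byte value 0..255 that does
 * not appear in the collation table tbl[0..len-1], in ascending order, and
 * return how many such values there are. */
size_t find_missing_code_points(const unsigned char *tbl, size_t len,
                                unsigned char missing[256])
{
    bool seen[256] = { false };
    size_t count = 0;

    for (size_t i = 0; i < len; i++)
        seen[tbl[i]] = true;

    for (int c = 0; c < 256; c++) {
        if (!seen[c])
            missing[count++] = (unsigned char)c;
    }
    return count;
}
```

A table that lists every code point exactly once yields a count of zero; any value reported as missing has no defined position in the collation order.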
(0001669)
geoffclare (manager)
2013-07-09 09:10

Since there have been questions on the mailing list about the status
of this bug, here is a relevant excerpt from the Minutes of the
28 March 2013 teleconference from

http://www.opengroup.org/austin/docs/austin_601.txt

The only change in status since those minutes is that we received
some feedback regarding the POSIX locale collation sequence on EBCDIC
systems, but we are awaiting further feedback.

Bug 0000663: Specification of str[n]casecmp is ambiguous Reopened
http://austingroupbugs.net/view.php?id=663

This item was reopened based on discussions on the reflector.

We agreed that we will proceed with the changes to clarify that the
POSIX locale must have a single-byte 8-bit clean encoding, and that
this was always the intention. We noted that POSIX already has a
big additional requirement for the C/POSIX locale over the C Standard,
namely that it must have 8-bit bytes (whereas the C Standard allows
larger bytes).

We should add an optional POSIX.UTF-8 locale in Issue 8, and this
could be used as the default if a C program calls setlocale() with an
empty string and none of LANG and the LC_* variables are set in the
environment.

We discussed the proposed changes in the thread starting with mail
sequence 18730 and noted the following points:

* The proposed addition to utility APPLICATION USAGE sections needs
a minor change to talk about undefined behaviour instead of errors.
Also, if any of the affected utilities do not require a text file as
input, then something a little different will be needed.

* The EILSEQ change also applies to mblen() and mbrlen().

* For some functions the EILSEQ error is shaded XSI, but should be CX.
(If bug 663 ends up being targeted at Issue 8, these should be fixed
in a separate bug for TC2.)

* The extra change to btowc() is needed (for consistency with mbtowc()).

* We need to check whether the proposed change to the LC_COLLATE
sequence for the POSIX locale matches existing behaviour of
EBCDIC-based systems.
Action: Mark to contact the implementers to ask them.
If it does not match existing behaviour, an alternative that still
fixes the non-adjacent identical lines problem with sort would be to
require that all 256 characters have different primary collation
weights.
(0002738)
geoffclare (manager)
2015-07-02 14:26
edited on: 2015-07-02 16:22

Interpretation response
-----------------------

The standard is unclear on this issue, and no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor.

Rationale:
-------------

The intention was always that the POSIX locale should have an 8-bit-clean single-byte encoding. The omission of an explicit statement to that effect was an oversight.

We are also seeking proposals for the standardization of a "POSIX.UTF-8" locale for Issue 8.

Notes to the Editor (not part of this interpretation):
------------------------------------------------------

On Page: 128 Line: 3596 Section: 6.2 Character Encoding
(2013 edition Page: 128 Line: 3623)

Change from:

The POSIX locale contains the characters in [xref to Table 6-1], which have the properties listed in [xref to 7.3.1]. In other locales, the presence, meaning, and representation of any additional characters are locale-specific.

to:

The POSIX locale shall contain 256 single-byte characters including the characters in [xref to Table 6-1] and [xref to Table 6-2], which have the properties listed in [xref to 7.3.1]. It is unspecified whether characters not listed in those two tables are classified as punct or cntrl, or neither. Other locales shall contain the characters in [xref to Table 6-1] and may contain any or all of the control characters identified in [xref to Table 6-2] that are not included in [xref to Table 6-1]; the presence, meaning, and representation of any additional characters are locale-specific.

[Note to the TC2 editors: the above is a layered change.]

On Page: 136 Line: 3849 Section: 7.2 POSIX locale
(2013 edition Page: 136 Line: 3885)

Delete:

The tables in Section 7.3 describe the characteristics and behavior of the POSIX locale for data consisting entirely of characters from the portable character set and the control character set. For other characters, the behavior is unspecified.

On Page: 139 Line: 3996 Section: 7.3.1 LC_CTYPE
(2013 edition Page: 139 Line: 4032)

Change from:

In the POSIX locale, the 26 uppercase letters shall be included:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
to:

In the POSIX locale, only:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
shall be included.

On Page: 139 Line: 4003 Section: 7.3.1 LC_CTYPE
(2013 edition Page: 139 Line: 4039)

Change from:

In the POSIX locale, the 26 lowercase letters shall be included:
a b c d e f g h i j k l m n o p q r s t u v w x y z
to:

In the POSIX locale, only:
a b c d e f g h i j k l m n o p q r s t u v w x y z
shall be included.

On Page: 139 Line: 4009 Section: 7.3.1 LC_CTYPE
(2013 edition Page: 140 Line: 4045)

Change from:

In the POSIX locale, all characters in the classes upper and lower shall be included.

to:

In the POSIX locale, only characters in the classes upper and lower shall be included.

On Page: 141 Line: 4091 Section: 7.3.1 LC_CTYPE
(2013 edition Page: 141 Line: 4127)

Change from:

In the POSIX locale, at a minimum, the 26 lowercase characters:

to:

In the POSIX locale, the 26 lowercase characters:

On Page: 142 Line: 4105 Section: 7.3.1 LC_CTYPE
(2013 edition Page: 142 Line: 4141)

Change from:

In the POSIX locale, at a minimum, the 26 uppercase characters:

to:

In the POSIX locale, the 26 uppercase characters:

On Page: 143 Line: 4142 Section: 7.3.1.1 LC_CTYPE Category in the POSIX Locale
(2013 edition Page: 143 Line: 4178)

Change from:

The character classifications for the POSIX locale follow; the code listing depicts the localedef input, and the table represents the same information, sorted by character.

to:

The minimum character classifications for the POSIX locale follow; the code listing depicts the localedef input, and the table represents the same information, sorted by character. Implementations may add additional characters to the cntrl and punct classifications but shall not make any other additions.

On Page: 143 Line: 4145 Section: 7.3.1.1 LC_CTYPE Category in the POSIX Locale
(2013 edition Page: 143 Line: 4181)

Change from:
# The following is the POSIX locale LC_CTYPE.
# "alpha" is by default "upper" and "lower"
# "alnum" is by definition "alpha" and "digit"
# "print" is by default "alnum", "punct", and the <space>
# "graph" is by default "alnum" and "punct"
to:
# The following is the minimum POSIX locale LC_CTYPE.
# "alpha" is by definition "upper" and "lower"
# "alnum" is by definition "alpha" and "digit"
# "print" is by definition "alnum", "punct", and the <space>
# "graph" is by definition "alnum" and "punct"

On Page: 151 Line: 4531 Section: 7.3.2.6 LC_COLLATE Category in the POSIX Locale
(2013 edition Page: 151 Line: 4567)

Change from:

The collation sequence definition of the POSIX locale follows; the code listing depicts the localedef input.

to:

The minimum collation sequence definition of the POSIX locale follows; the code listing depicts the localedef input. All characters not explicitly listed here shall be inserted in the character collation order after the listed characters and shall be assigned unique primary weights. If the listed characters have ASCII encoding, the other characters shall be in ascending order according to their coded character set values; otherwise, the order of the other characters is unspecified. The collation sequence shall not include any multi-character collating elements.

On Page: 151 Line: 4534 Section: 7.3.2.6 LC_COLLATE Category in the POSIX Locale
(2013 edition Page: 151 Line: 4570)

Change from:
# This is the POSIX locale definition for the LC_COLLATE category.
# The order is the same as in the ASCII codeset.
to:
# This is the minimum input for the POSIX locale definition for the
# LC_COLLATE category. Characters in this list are in the same order
# as in the ASCII codeset.

On Page: 355 Line: 11953 Section: <stdlib.h>
(2013 edition Page: 358 Line: 12042)

After:

{MB_CUR_MAX} Maximum number of bytes in a character specified by the current locale (category LC_CTYPE).

add a new sentence:

[CX]In the POSIX locale the value of {MB_CUR_MAX} shall be 1.[/CX]
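
As an informal illustration (not part of the proposed text), an application could observe this requirement with a sketch like the following; the function name is hypothetical.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical probe: select the "C" (POSIX) locale and report the value
 * of MB_CUR_MAX in that locale. */
size_t mb_cur_max_in_c_locale(void)
{
    setlocale(LC_ALL, "C");
    return (size_t)MB_CUR_MAX;
}
```

An implementation whose POSIX locale uses a multibyte encoding would report a value greater than 1 here.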

On Page: 622 Line: 21263 Section: btowc()
(2013 edition Page: 627 Line: 21451)

In the RETURN VALUE section, add a new sentence:

[CX]In the POSIX locale, btowc() shall not return WEOF if c has a value in the range 0 to 255 inclusive.[/CX]
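
As an informal illustration (not part of the proposed text), the requirement can be probed with a sketch such as the one below; the function name and its range parameters are hypothetical.

```c
#include <locale.h>
#include <wchar.h>

/* Hypothetical probe: count the byte values in [lo, hi] for which btowc()
 * returns WEOF after selecting the "C" (POSIX) locale. */
int count_btowc_weof(int lo, int hi)
{
    int failures = 0;

    setlocale(LC_ALL, "C");
    for (int c = lo; c <= hi; c++) {
        if (btowc(c) == WEOF)
            failures++;
    }
    return failures;
}
```

With the added sentence in effect, count_btowc_weof(0, 255) is expected to be zero; an implementation that predates this change may still reject byte values above 127.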

On Page: 1270 Line: 41775 Section: mblen()
(2013 edition Page: 1282 Line: 42472)

In the ERRORS section, change from:

[XSI][EILSEQ]
An invalid character sequence is detected.[/XSI]
to:

[CX][EILSEQ]
An invalid character sequence is detected. In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 1272 Line: 41825 Section: mbrlen()
(2013 edition Page: 1284 Line: 42526)

In the ERRORS section, change from:

[EILSEQ]
An invalid character sequence is detected.
to:

[EILSEQ]
An invalid character sequence is detected. [CX]In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 1275 Line: 41890 Section: mbrtowc()
(2013 edition Page: 1287 Line: 42594)

In the ERRORS section, change from:

[EILSEQ]
An invalid character sequence is detected.
to:

[EILSEQ]
An invalid character sequence is detected. [CX]In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 1278 Line: 41998 Section: mbsrtowcs()
(2013 edition Page: 1290 Line: 42706)

In the ERRORS section, change from:

[EILSEQ]
An invalid character sequence is detected.
to:

[EILSEQ]
An invalid character sequence is detected. [CX]In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 1279 Line: 42051 Section: mbstowcs()
(2013 edition Page: 1291 Line: 42760)

In the ERRORS section, change from:

[XSI][EILSEQ]
An invalid byte sequence is detected.[/XSI]
to:

[CX][EILSEQ]
An invalid character sequence is detected. In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 1281 Line: 42104 Section: mbtowc()
(2013 edition Page: 1293 Line: 42815)

In the ERRORS section, change from:

[XSI][EILSEQ]
An invalid character sequence is detected.[/XSI]
to:

[CX][EILSEQ]
An invalid character sequence is detected. In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 2455 Line: 78223 Section: awk
(2013 edition Page: 2478 Line: 79587)

In the APPLICATION USAGE section, add a new paragraph:

When using awk to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 2537 Line: 81424 Section: comm
(2013 edition Page: 2561 Line: 82825)

In the APPLICATION USAGE section, add a new paragraph:

When using comm to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 2786 Line: 90886 Section: grep
(2013 edition Page: 2810 Line: 92292)

In the APPLICATION USAGE section, add a new paragraph:

When using grep to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 2792 Line: 91089 Section: head
(2013 edition Page: 2816 Line: 92496)

In the APPLICATION USAGE section, change from:

None.

to:

When using head to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 2817 Line: 91965 Section: join
(2013 edition Page: 2841 Line: 93377)

In the APPLICATION USAGE section, add a new paragraph:

When using join to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 3159 Line: 105039 Section: sed
(2013 edition Page: 3185 Line: 106550)

In the APPLICATION USAGE section, add a new paragraph:

When using sed to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 3187 Line: 106180 Section: sort
(2013 edition Page: 3214 Line: 107719)

In the APPLICATION USAGE section, add a new paragraph:

When using sort to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 3214 Line: 107142 Section: tail
(2013 edition Page: 3241 Line: 108681)

In the APPLICATION USAGE section, add a new paragraph:

When using tail to process pathnames, and the -c option is not specified, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 3250 Line: 108473 Section: tr
(2013 edition Page: 3277 Line: 110019)

In the RATIONALE section, delete:

This meant that historical practice of being able to specify tr -cd\000-\177 (which would delete all bytes with the top bit set) would have no effect because, in the C locale, bytes with the values octal 200 to octal 377 are not characters.

On Page: 3283 Line: 109551 Section: uniq
(2013 edition Page: 3310 Line: 111099)

In the APPLICATION USAGE section, add a new paragraph:

When using uniq to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 3454 Line: 115933 Section: A.6.2 Character Encoding
(2013 edition Page: 3483 Line: 117516)

Add a new paragraph:

Earlier versions of this standard did not state the requirement that the POSIX locale contains 256 single-byte characters. This was an oversight; the intention was always that the POSIX locale should have an 8-bit-clean single-byte encoding.

(0002739)
ajosey (manager)
2015-07-02 16:27

Interpretation Proposed: 2 July 2015

(0002822)
ajosey (manager)
2015-09-07 11:34

Interpretation approved: 7 Sep 2015

Issue History
Date Modified Username Field Change
2013-02-21 21:25 dalias New Issue
2013-02-21 21:25 dalias Status New => Under Review
2013-02-21 21:25 dalias Assigned To => ajosey
2013-02-21 21:25 dalias Name => Rich Felker
2013-02-21 21:25 dalias Organization => musl libc
2013-02-21 21:25 dalias Section => strcasecmp/strncasecmp
2013-02-21 21:25 dalias Page Number => unknown
2013-02-21 21:25 dalias Line Number => unknown
2013-03-21 15:23 eblake Note Added: 0001500
2013-03-21 15:45 eblake Note Added: 0001501
2013-03-21 16:02 eblake Note Edited: 0001501
2013-03-21 16:21 eblake Note Edited: 0001501
2013-03-21 16:24 eblake Note Edited: 0001501
2013-03-21 16:34 eblake Note Edited: 0001501
2013-03-21 16:36 eblake Interp Status => Pending
2013-03-21 16:36 eblake Final Accepted Text => see Note: 0001501
2013-03-21 16:36 eblake Status Under Review => Interpretation Required
2013-03-21 16:36 eblake Resolution Open => Accepted As Marked
2013-03-21 16:36 eblake Tag Attached: tc2-2008
2013-03-21 16:38 eblake Page Number unknown => 1985
2013-03-21 16:38 eblake Line Number unknown => 62819
2013-03-21 16:41 eblake Note Edited: 0001501
2013-03-22 15:33 geoffclare Note Added: 0001502
2013-03-22 15:33 geoffclare Status Interpretation Required => Under Review
2013-03-22 15:33 geoffclare Resolution Accepted As Marked => Reopened
2013-03-23 01:35 dalias Note Added: 0001503
2013-03-23 12:47 eblake Note Added: 0001504
2013-03-23 12:50 eblake Note Edited: 0001504
2013-03-23 21:42 dalias Note Added: 0001505
2013-03-28 17:17 geoffclare Note Deleted: 0001502
2013-03-28 17:35 eblake Note Edited: 0001504
2013-03-29 08:03 ajosey Interp Status Pending => Proposed
2013-03-29 08:03 ajosey Note Added: 0001512
2013-04-04 14:57 msbrown Note Added: 0001529
2013-04-04 14:58 msbrown Note Edited: 0001529
2013-04-04 15:11 ajosey Interp Status Proposed => ---
2013-04-04 15:11 ajosey Note Deleted: 0001512
2013-04-04 15:31 eblake Note Added: 0001530
2013-07-09 09:10 geoffclare Note Added: 0001669
2013-07-22 17:48 wilx Issue Monitored: wilx
2014-01-16 16:45 Don Cragun Tag Attached: UTF-8_Locale
2015-07-02 14:26 geoffclare Note Added: 0002738
2015-07-02 14:31 geoffclare Note Edited: 0002738
2015-07-02 15:44 geoffclare Note Edited: 0002738
2015-07-02 15:51 geoffclare Note Edited: 0002738
2015-07-02 16:03 geoffclare Note Edited: 0002738
2015-07-02 16:12 geoffclare Note Edited: 0002738
2015-07-02 16:20 geoffclare Interp Status --- => Pending
2015-07-02 16:20 geoffclare Final Accepted Text see Note: 0001501 => see Note: 0002738
2015-07-02 16:20 geoffclare Status Under Review => Interpretation Required
2015-07-02 16:20 geoffclare Resolution Reopened => Accepted As Marked
2015-07-02 16:20 geoffclare Note Edited: 0001501
2015-07-02 16:22 geoffclare Note Edited: 0002738
2015-07-02 16:27 ajosey Interp Status Pending => Proposed
2015-07-02 16:27 ajosey Note Added: 0002739
2015-07-02 20:02 rhansen Relationship added related to 0000967
2015-09-07 11:34 ajosey Interp Status Proposed => Approved
2015-09-07 11:34 ajosey Note Added: 0002822
2019-02-21 16:01 nick Relationship added related to 0001182
2019-06-10 08:55 agadmin Status Interpretation Required => Closed

