0000663: Specification of str[n]casecmp is ambiguous

ID	Project	Category	View Status	Date Submitted	Last Update

0000663	1003.1(2008)/Issue 7	System Interfaces	public	2013-02-21 21:25	2019-06-10 08:55

Reporter	dalias	Assigned To	ajosey
Priority	normal	Severity	Editorial	Type	Clarification Requested
Status	Closed	Resolution	Accepted As Marked

Name	Rich Felker
Organization	musl libc
User Reference
Section	strcasecmp/strncasecmp
Page Number	1985
Line Number	62819
Interp Status	Approved
Final Accepted Text	see 0000663:0002738


Summary	0000663: Specification of str[n]casecmp is ambiguous
Description	The description includes the text: "When the LC_CTYPE category of the current locale is from the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed. Otherwise, the results are unspecified." As far as I can tell, the phrase "converted to lowercase" is not precisely specified anywhere. This is not a serious problem in the event that the POSIX locale only contains the required characters, but per XBD 7.2, implementations are permitted to have other characters available in the POSIX locale ("The tables in Locale Definition describe the characteristics and behavior of the POSIX locale for data consisting entirely of characters from the portable character set and the control character set. For other characters, the behavior is unspecified.") If additional characters outside the portable character set exist in the POSIX locale, does "converted to lowercase" mean as if by tolower on a byte-by-byte basis, or as if by towlower on a character-by-character basis? Or does the "unspecified" clause in XBD 7.2 leave this choice up to the implementation? A more worrisome issue is, in the case that an implementation has multibyte characters in the POSIX locale, the symmetry of strncasecmp is broken. The n argument is specified as the maximum number of bytes to compare from s1, not from s2, so from my reading of the text as written, strncasecmp may read more than n bytes from s2, and strncasecmp(a,b,n)>0 is not necessarily equivalent to strncasecmp(b,a,n)<0. In particular, consider the example (assuming UTF-8): strncasecmp("\u00df", "\u1e9e", 2)
Desired Action	Please clarify whether the standard places any requirements on an implementation to "convert to lowercase" as if by tolower or towlower. If conversion as if by towlower is permitted, please clarify whether strncasecmp is permitted to read more than n bytes from s2 if necessary to compare (as if by towlower) n bytes from s1, and whether there are any obligations on an implementation when an invalid multibyte sequence is encountered in s1 or s2. As possibly the only implementor to whom these issues seem relevant (I'm not aware of other implementations with multibyte characters in the POSIX locale), my preference would be to leave these choices unspecified or implementation-defined, and permit strncasecmp, if necessary, to read more than n bytes from s2. If strncasecmp is forbidden from reading more than n bytes from s2, then I believe the only reasonable choice for the other decisions is to require conversion as if by tolower rather than towlower.
Tags	tc2-2008, UTF-8_Locale
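
For illustration, the "as if by tolower on a byte-by-byte basis" reading raised in the description could be sketched as below. This is a minimal sketch and not text from the standard or from any libc; the name bytewise_strncasecmp is hypothetical. Under this reading at most n bytes are read from either string, so the comparison stays symmetric.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not a libc function): compares at most n bytes of
 * s1 and s2, folding each byte with tolower() before comparing. */
int bytewise_strncasecmp(const char *s1, const char *s2, size_t n)
{
    assert(s1 != NULL && s2 != NULL);

    /* Compare as unsigned char so that bytes >= 0x80 are valid
     * arguments to tolower() and order consistently. */
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    for (; n > 0; n--, a++, b++) {
        int ca = tolower(*a);
        int cb = tolower(*b);
        if (ca != cb)
            return ca - cb;   /* first differing (folded) byte decides */
        if (ca == 0)
            break;            /* both strings ended together */
    }
    return 0;
}
```

With the UTF-8 example from the description, bytewise_strncasecmp("\xc3\x9f", "\xe1\xba\x9e", 2) only inspects the first byte of each string, and swapping the arguments flips the sign of the result.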

eblake 2013-03-21 15:23 manager bugnote:0001500	The C standard is clear that in the C locale, isupper() shall return non-zero for exactly 26 characters - the ascii alphabet. Since the POSIX locale is a synonym for the C locale, it should not matter whether the locale supports a multibyte encoding, there are still exactly 26 characters, all single bytes, that can be converted, and nothing else, because those are the only characters that may match the isupper() contract. That said, we probably need to tighten the POSIX spec to make it clear that the POSIX locale has exactly 26 uppercase letters, regardless of whether it is single-byte or multi-byte, and regardless of whether other characters are recognized.
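
The "exactly 26 uppercase letters" claim in the note above can be probed on a given implementation with a sketch like the following. The helper name count_upper_bytes_in_c_locale is hypothetical, and the check only covers single-byte values 0 to 255; it says nothing about wide characters.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: selects the "C" locale for LC_CTYPE (changing the
 * process-wide locale) and counts the byte values 0..255 for which
 * isupper() returns non-zero. */
int count_upper_bytes_in_c_locale(void)
{
    /* ISO C requires the "C" locale to exist, so failure here would be
     * an implementation defect rather than a normal runtime error. */
    const char *loc = setlocale(LC_CTYPE, "C");
    assert(loc != NULL);
    (void)loc;

    int count = 0;
    for (int c = 0; c <= 255; c++) {
        if (isupper(c))
            count++;
    }
    return count;
}
```

On an implementation that follows the reading given in the note, the function would be expected to return 26.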

eblake 2013-03-21 15:45 manager bugnote:0001501 Last edited: 2015-07-02 16:20

Note that this resolution has been superseded by 0000663:0002738.

Interpretation response
-----------------------
The standard clearly states that the POSIX locale is a synonym for the C locale with regards to LC_CTYPE, and defers to the C standard for its definitions of character categories. The C standard is clear that only 26 uppercase and 26 lowercase letters exist in the C locale, all of which have single-byte encodings.

Rationale:
----------
While the POSIX locale may include multibyte characters, it must still honor the C locale rules. Since the C locale has only 26 characters that can be altered by tolower(), all of which are single bytes, it does not matter whether strcasecmp() and strncasecmp() use byte or character normalization, and the length limit of strncasecmp() is specified in bytes.

We are also seeking proposals for the standardization of a "POSIX.UTF-8" locale for Issue 8.

Notes to the Editor (not part of this interpretation):
------------------------------------------------------
Make the following changes:

On page 136 line 3849 [7.2 POSIX locale], move the following sentences:

    The tables in Section 7.3 describe the characteristics and behavior of the POSIX locale for data consisting entirely of characters from the portable character set and the control character set. For other characters, the behavior is unspecified.

to page 151 line 4531 [7.3.2.6 LC_COLLATE Category in the POSIX Locale], with a wording change:

    The definition below describes the behavior of the POSIX locale for data consisting entirely of characters from the portable character set and the control character set. For other characters, the behavior is unspecified.

On page 139 line 3996 [7.3.1 LC_CTYPE upper], change:

    In the POSIX locale, the 26 uppercase letters shall be included:
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

to:

    In the POSIX locale, only:
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    shall be included.

On page 139 line 4003 [7.3.1 LC_CTYPE lower], change:

    In the POSIX locale, the 26 lowercase letters shall be included:
    a b c d e f g h i j k l m n o p q r s t u v w x y z

to:

    In the POSIX locale, only:
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    shall be included.

On page 139 line 4009 [7.3.1 LC_CTYPE alpha], change:

    In the POSIX locale, all characters in the classes upper and lower shall be included.

to:

    In the POSIX locale, only characters in the classes upper and lower shall be included.

On page 141 line 4091 [7.3.1 LC_CTYPE toupper], change:

    In the POSIX locale, at a minimum, the 26 lowercase characters:

to:

    In the POSIX locale, the 26 lowercase characters:

On page 142 line 4105 [7.3.1 LC_CTYPE tolower], change:

    In the POSIX locale, at a minimum, the 26 uppercase characters:

to:

    In the POSIX locale, the 26 uppercase characters:

On page 143 line 4145 [7.3.1.1 LC_CTYPE Category in the POSIX Locale], change:

    # The following is the POSIX locale LC_CTYPE.
    # "alpha" is by default "upper" and "lower"
    # "alnum" is by definition "alpha" and "digit"
    # "print" is by default "alnum", "punct", and the <space>
    # "graph" is by default "alnum" and "punct"

to:

    # The following is the minimum POSIX locale LC_CTYPE; implementations may
    # add additional characters to the "cntrl" and "punct" classifications.
    # "alpha" is by definition "upper" and "lower"
    # "alnum" is by definition "alpha" and "digit"
    # "print" is by definition "alnum", "punct", and the <space>
    # "graph" is by definition "alnum" and "punct"

At page 151 line 4534 [7.3.2.6 LC_COLLATE Category in the POSIX Locale], change:

    # This is the POSIX locale definition for the LC_COLLATE category.
    # The order is the same as in the ASCII codeset.

to:

    # This is the minimum input for the POSIX locale definition for the
    # LC_COLLATE category. Characters in this list are in the same order
    # as in the ASCII codeset.

dalias 2013-03-23 01:35 reporter bugnote:0001503	As the reporter of this issue and perhaps the only implementor affected by it, I would prefer a resolution that does not impose a requirement that the POSIX locale's upper and lower cases classes and mappings contain only members of the portable character set. If such a new requirement is to be imposed, it should be based on a strong argument that the existence of additional characters with case mappings could break otherwise-correct applications. I am doubtful that such an argument exists; for the most part, an application using the POSIX locale cannot expect well-defined or predictable behavior if it is processing data which contains byte sequences outside of the portable character set.

eblake 2013-03-23 12:47 manager bugnote:0001504 Last edited: 2013-03-28 17:35

Phooey - I'm wondering if 0000663:0001501 has been over-interpreting the C standard.

POSIX is already clear that the contract of isupper() defers to the C standard [line 38829], and the C standard is already explicit [C99 7.4.1.11] that in the C locale, isupper() may return true only for the characters which are letters according to C99 5.2.1.

However, in re-reading 5.2.1, I see the following: There is a requirement that there are exactly 26 uppercase letters in the basic execution character set. Likewise, there is a requirement that there are exactly 10 decimal digits in the basic execution character set, with consecutive encodings. There is also a strong statement that: "A letter is an uppercase letter or a lowercase letter as defined above: in this International Standard the term does not include other characters that are letters in other alphabets."

We were originally interpreting this to mean that the C locale provides exactly the basic execution character set, with exactly 26 uppercase letters, and cannot provide extension letters. However, on re-reading, that statement in the C standard sounds to me like a letter is either one of the 26 uppercase letters in the basic character set, OR a character in the extended character set that is a letter, and merely that if the extended character set is smaller than the set of Unicode characters, then the term 'letter' in the standard would not include those Unicode letters not in the implementation's extended character set.

But if we read it that way, where the extended character set can provide additional letters, then that also means that the extended character set can provide additional decimal digits. Yet, in the POSIX standard, we required that there are exactly 10 characters that can belong to the 'digit' classification of the C locale [line 4015].

The approach in 0000663:0001501 was to extend the exactness already present on digits to also apply to letters, but maybe the right approach is to instead go the other direction and not only allow extension character set members as additions to the set of letters, but also allow extension character set members as additions to the set of decimal digits. But this means that you can no longer use c-'0' to find the numeric value corresponding to a character c where isdigit(c) returns non-zero.

Now, you DO have some constraints - my understanding is that Unicode includes some characters which are considered letters, but which are neither uppercase nor lowercase. However, while non-C locales are permitted to have a character where isupper(c) and islower(c) both return false, but isalpha(c) returns true, this is forbidden on the C locale.

Since you seem to want additional multibyte letters allowed in the C locale, you would help your case by proposing actual wording to place into the standard, to be used in place of 0000663:0001501.
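
The C-locale constraint mentioned in the note above (no character may be alphabetic without being either uppercase or lowercase) can be checked for single-byte values with a sketch like this one. The helper name count_caseless_alpha_bytes is hypothetical, and the loop covers only byte values 0 to 255.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: in the "C" locale, counts byte values for which
 * isalpha() is true while both isupper() and islower() are false. */
int count_caseless_alpha_bytes(void)
{
    /* The "C" locale is required to exist; note that this call changes
     * the process-wide LC_CTYPE setting. */
    const char *loc = setlocale(LC_CTYPE, "C");
    assert(loc != NULL);
    (void)loc;

    int count = 0;
    for (int c = 0; c <= 255; c++) {
        if (isalpha(c) && !isupper(c) && !islower(c))
            count++;
    }
    return count;
}
```

If the C locale honors the constraint described above, the function would be expected to return 0.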

dalias 2013-03-23 21:42 reporter bugnote:0001505	On further reading of ISO C, I agree that while it permits the existence of additional characters in the C locale, it does not seem to permit the standard character classes to have any members beyond those that exist in the portable character set when the C locale is active. I further agree that it's misleading for POSIX to imply that such freedom of implementation exists, when alignment with the C standard actually precludes such freedom, so provided this interpretation of ISO C is correct, I would support efforts to include the C requirements in the POSIX text. On the matter of allowing additional decimal digits, which was also mentioned, I think it would be harmful to applications, whether in the C locale or another locale. Under the current requirements of C and POSIX, isdigit(c) implies c-'0' is in the range 0 to 9 and represents the numeric value of the digit. Pulling this assumption out from under applications could lead to serious breakage.

msbrown 2013-04-04 14:57 manager bugnote:0001529 Last edited: 2013-04-04 14:58

The collating sequence for an EBCDIC (code page 1047) is going to be different than ASCII, I think. Here is some relevant data.

Here is the collation table definition from our library part edcclocp.c which is part of the structure of our LC_Collate category. This is used for the static initialization of the POSIX locale, which we instantiate in LE during C initialization. I looked at a couple of alternate initializations that we use, but this is the only one that is relevant for POSIX applications. Character values in the array are decimal.

static collel_t _P_colleltbl[]={
0, 1, 2, 3, 55, 45, 46, 47, 22, 5, 21, 11, 12, 13, 14, 15,
16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31,
64, 90, 127, 123, 91, 108, 80, 125, 77, 93, 92, 78, 107, 96, 75, 97,
240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 122, 94, 76, 126, 110, 111,
124, 193, 194, 195, 196, 197, 198, 199, 200, 201, 209, 210, 211, 212, 213, 214,
215, 216, 217, 226, 227, 228, 229, 230, 231, 232, 233, 173, 224, 189, 95, 109,
121, 129, 130, 131, 132, 133, 134, 135, 136, 137, 145, 146, 147, 148, 149, 150,
151, 152, 153, 162, 163, 164, 165, 166, 167, 168, 169, 192, 79, 208, 161, 7,
4, 6, 8, 9, 10, 20, 23, 26, 27, 32, 33, 34, 35, 36, 37, 40,
41, 42, 43, 44, 51, 52, 53, 54, 56, 57, 58, 59, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 81, 82, 83, 84, 85, 86, 87, 88, 89, 98,
99, 100, 101, 102, 103, 104, 105, 106, 112, 113, 114, 115, 116, 117, 118, 119,
120, 128, 138, 139, 140, 141, 142, 143, 144, 154, 155, 156, 157, 158, 159, 160,
170, 171, 172, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186,
187, 188, 190, 191, 202, 203, 204, 205, 206, 207, 218, 219, 220, 221, 222, 223,
225, 234, 235, 236, 237, 238, 239, 250, 251, 252, 253, 254, 255,
};

240 -> 249: Numerals 0 - 9
193 -> 233: Upper case A-Z
129 -> 169: Lower case a-z

Here's a pointer to a quick reference to the values for EBCDIC code page 1047, which is what z/OS uses.

http://en.wikipedia.org/wiki/EBCDIC_1047

The Wikipedia table entries show the glyph for printable characters (or acronym for control chars), the corresponding ASCII value, and the decimal value of the EBCDIC slot in the table, which helps in deciphering our static array definition.

I believe the ASCII collation sequence is just the numerical ordering of the ASCII code points from 0 - 7F, so it would be fairly easy to show graphically the difference between EBCDIC and ASCII. Also, thinking about it now, there are twice as many possible code points in the EBCDIC collation sequence, so even if the first 127 points matched, as in Unicode, the two orders would be different. Oh well.

eblake 2013-04-04 15:31 manager bugnote:0001530	The table is missing collation values for decimal 48, 49, and 62 - what happens when sorting a file with contents $'\x30\n\x31\n\x3e\n\x30\n'?
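
A check of the kind behind the observation above, that some byte values never appear in a collation table, could be sketched as follows. The name find_missing_bytes is hypothetical and the sketch treats the table as a plain array of byte values; it does not model the collel_t type from the posted source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: stores in missing[] (ascending) every value
 * 0..255 that does not occur in table[0..len-1] and returns how many
 * such values there are. */
size_t find_missing_bytes(const unsigned char *table, size_t len,
                          unsigned char missing[256])
{
    assert(table != NULL && missing != NULL);

    bool seen[256] = { false };
    for (size_t i = 0; i < len; i++)
        seen[table[i]] = true;

    size_t n = 0;
    for (int v = 0; v < 256; v++) {
        if (!seen[v])
            missing[n++] = (unsigned char)v;
    }
    return n;
}
```

Run over the table posted in bugnote 0001529, a check like this would be expected to report the three values named in the note above: 48, 49, and 62.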

geoffclare 2013-07-09 09:10 manager bugnote:0001669

Since there have been questions on the mailing list about the status of this bug, here is a relevant excerpt from the Minutes of the 28 March 2013 teleconference from http://www.opengroup.org/austin/docs/austin_601.txt

The only change in status since those minutes is that we received some feedback regarding the POSIX locale collation sequence on EBCDIC systems, but we are awaiting further feedback.

Bug 0000663: Specification of str[n]casecmp is ambiguous
Reopened
http://austingroupbugs.net/view.php?id=663

This item was reopened based on discussions on the reflector.

We agreed that we will proceed with the changes to clarify that the POSIX locale must have a single-byte 8-bit clean encoding, and that this was always the intention. We noted that POSIX already has a big additional requirement for the C/POSIX locale over the C Standard, namely that it must have 8-bit bytes (whereas the C Standard allows larger bytes). We should add an optional POSIX.UTF-8 locale in Issue 8, and this could be used as the default if a C program calls setlocale() with an empty string and none of LANG and the LC_* variables are set in the environment.

We discussed the proposed changes in the thread starting with mail sequence 18730 and noted the following points:

* The proposed addition to utility APPLICATION USAGE sections needs a minor change to talk about undefined behaviour instead of errors. Also, if any of the affected utilities do not require a text file as input, then something a little different will be needed.

* The EILSEQ change also applies to mblen() and mbrlen().

* For some functions the EILSEQ error is shaded XSI, but should be CX. (If bug 663 ends up being targeted at Issue 8, these should be fixed in a separate bug for TC2.)

* The extra change to btowc() is needed (for consistency with mbtowc()).

* We need to check whether the proposed change to the LC_COLLATE sequence for the POSIX locale matches existing behaviour of EBCDIC-based systems. Action: Mark to contact the implementers to ask them. If it does not match existing behaviour, an alternative that still fixes the non-adjacent identical lines problem with sort would be to require that all 256 characters have different primary collation weights.

geoffclare 2015-07-02 14:26 manager bugnote:0002738 Last edited: 2015-07-02 16:22

Interpretation response
-----------------------
The standard is unclear on this issue, and no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor.

Rationale:
-------------
The intention was always that the POSIX locale should have an 8-bit-clean single-byte encoding. The omission of an explicit statement to that effect was an oversight.

We are also seeking proposals for the standardization of a "POSIX.UTF-8" locale for Issue 8.

Notes to the Editor (not part of this interpretation):
------------------------------------------------------
On Page: 128 Line: 3596 Section: 6.2 Character Encoding
(2013 edition Page: 128 Line: 3623)

Change from:

    The POSIX locale contains the characters in [xref to Table 6-1], which have the properties listed in [xref to 7.3.1]. In other locales, the presence, meaning, and representation of any additional characters are locale-specific.

to:

    The POSIX locale shall contain 256 single-byte characters including the characters in [xref to Table 6-1] and [xref to Table 6-2], which have the properties listed in [xref to 7.3.1]. It is unspecified whether characters not listed in those two tables are classified as punct or cntrl, or neither. Other locales shall contain the characters in [xref to Table 6-1] and may contain any or all of the control characters identified in [xref to Table 6-2] that are not included in [xref to Table 6-1]; the presence, meaning, and representation of any additional characters are locale-specific.

[Note to the TC2 editors: the above is a layered change.]

On Page: 136 Line: 3849 Section: 7.2 POSIX locale
(2013 edition Page: 136 Line: 3885)

Delete:

    The tables in Section 7.3 describe the characteristics and behavior of the POSIX locale for data consisting entirely of characters from the portable character set and the control character set. For other characters, the behavior is unspecified.

On Page: 139 Line: 3996 Section: 7.3.1 LC_CTYPE
(2013 edition Page: 139 Line: 4032)

Change from:

    In the POSIX locale, the 26 uppercase letters shall be included:
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

to:

    In the POSIX locale, only:
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    shall be included.

On Page: 139 Line: 4003 Section: 7.3.1 LC_CTYPE
(2013 edition Page: 139 Line: 4039)

Change from:

    In the POSIX locale, the 26 lowercase letters shall be included:
    a b c d e f g h i j k l m n o p q r s t u v w x y z

to:

    In the POSIX locale, only:
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    shall be included.

On Page: 139 Line: 4009 Section: 7.3.1 LC_CTYPE
(2013 edition Page: 140 Line: 4045)

Change from:

    In the POSIX locale, all characters in the classes upper and lower shall be included.

to:

    In the POSIX locale, only characters in the classes upper and lower shall be included.

On Page: 141 Line: 4091 Section: 7.3.1 LC_CTYPE
(2013 edition Page: 141 Line: 4127)

Change from:

    In the POSIX locale, at a minimum, the 26 lowercase characters:

to:

    In the POSIX locale, the 26 lowercase characters:

On Page: 142 Line: 4105 Section: 7.3.1 LC_CTYPE
(2013 edition Page: 142 Line: 4141)

Change from:

    In the POSIX locale, at a minimum, the 26 uppercase characters:

to:

    In the POSIX locale, the 26 uppercase characters:

On Page: 143 Line: 4142 Section: 7.3.1.1 LC_CTYPE Category in the POSIX Locale
(2013 edition Page: 143 Line: 4178)

Change from:

    The character classifications for the POSIX locale follow; the code listing depicts the localedef input, and the table represents the same information, sorted by character.

to:

    The minimum character classifications for the POSIX locale follow; the code listing depicts the localedef input, and the table represents the same information, sorted by character. Implementations may add additional characters to the cntrl and punct classifications but shall not make any other additions.

On Page: 143 Line: 4145 Section: 7.3.1.1 LC_CTYPE Category in the POSIX Locale
(2013 edition Page: 143 Line: 4181)

Change from:

    # The following is the POSIX locale LC_CTYPE.
    # "alpha" is by default "upper" and "lower"
    # "alnum" is by definition "alpha" and "digit"
    # "print" is by default "alnum", "punct", and the <space>
    # "graph" is by default "alnum" and "punct"

to:

    # The following is the minimum POSIX locale LC_CTYPE.
    # "alpha" is by definition "upper" and "lower"
    # "alnum" is by definition "alpha" and "digit"
    # "print" is by definition "alnum", "punct", and the <space>
    # "graph" is by definition "alnum" and "punct"

On Page: 151 Line: 4531 Section: 7.3.2.6 LC_COLLATE Category in the POSIX Locale
(2013 edition Page: 151 Line: 4567)

Change from:

    The collation sequence definition of the POSIX locale follows; the code listing depicts the localedef input.

to:

    The minimum collation sequence definition of the POSIX locale follows; the code listing depicts the localedef input. All characters not explicitly listed here shall be inserted in the character collation order after the listed characters and shall be assigned unique primary weights. If the listed characters have ASCII encoding, the other characters shall be in ascending order according to their coded character set values; otherwise, the order of the other characters is unspecified. The collation sequence shall not include any multi-character collating elements.

On Page: 151 Line: 4534 Section: 7.3.2.6 LC_COLLATE Category in the POSIX Locale
(2013 edition Page: 151 Line: 4570)

Change from:

    # This is the POSIX locale definition for the LC_COLLATE category.
    # The order is the same as in the ASCII codeset.

to:

    # This is the minimum input for the POSIX locale definition for the
    # LC_COLLATE category. Characters in this list are in the same order
    # as in the ASCII codeset.

On Page: 355 Line: 11953 Section: <stdlib.h>
(2013 edition Page: 358 Line: 12042)

After:

    {MB_CUR_MAX} Maximum number of bytes in a character specified by the current locale (category LC_CTYPE).

add a new sentence:

    [CX]In the POSIX locale the value of {MB_CUR_MAX} shall be 1.[/CX]

On Page: 622 Line: 21263 Section: btowc()
(2013 edition Page: 627 Line: 21451)

In the RETURN VALUE section, add a new sentence:

    [CX]In the POSIX locale, btowc() shall not return WEOF if c has a value in the range 0 to 255 inclusive.[/CX]

On Page: 1270 Line: 41775 Section: mblen()
(2013 edition Page: 1282 Line: 42472)

In the ERRORS section, change from:

    [XSI][EILSEQ] An invalid character sequence is detected.[/XSI]

to:

    [CX][EILSEQ] An invalid character sequence is detected. In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 1272 Line: 41825 Section: mbrlen()
(2013 edition Page: 1284 Line: 42526)

In the ERRORS section, change from:

    [EILSEQ] An invalid character sequence is detected.

to:

    [EILSEQ] An invalid character sequence is detected. [CX]In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 1275 Line: 41890 Section: mbrtowc()
(2013 edition Page: 1287 Line: 42594)

In the ERRORS section, change from:

    [EILSEQ] An invalid character sequence is detected.

to:

    [EILSEQ] An invalid character sequence is detected. [CX]In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 1278 Line: 41998 Section: mbsrtowcs()
(2013 edition Page: 1290 Line: 42706)

In the ERRORS section, change from:

    [EILSEQ] An invalid character sequence is detected.

to:

    [EILSEQ] An invalid character sequence is detected. [CX]In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 1279 Line: 42051 Section: mbstowcs()
(2013 edition Page: 1291 Line: 42760)

In the ERRORS section, change from:

    [XSI][EILSEQ] An invalid byte sequence is detected.[/XSI]

to:

    [CX][EILSEQ] An invalid character sequence is detected. In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 1281 Line: 42104 Section: mbtowc()
(2013 edition Page: 1293 Line: 42815)

In the ERRORS section, change from:

    [XSI][EILSEQ] An invalid character sequence is detected.[/XSI]

to:

    [CX][EILSEQ] An invalid character sequence is detected. In the POSIX locale an EILSEQ error cannot occur since all byte values are valid characters.[/CX]

On Page: 2455 Line: 78223 Section: awk
(2013 edition Page: 2478 Line: 79587)

In the APPLICATION USAGE section, add a new paragraph:

    When using awk to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 2537 Line: 81424 Section: comm
(2013 edition Page: 2561 Line: 82825)

In the APPLICATION USAGE section, add a new paragraph:

    When using comm to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 2786 Line: 90886 Section: grep
(2013 edition Page: 2810 Line: 92292)

In the APPLICATION USAGE section, add a new paragraph:

    When using grep to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 2792 Line: 91089 Section: head
(2013 edition Page: 2816 Line: 92496)

In the APPLICATION USAGE section, change from:

    None.

to:

    When using head to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 2817 Line: 91965 Section: join
(2013 edition Page: 2841 Line: 93377)

In the APPLICATION USAGE section, add a new paragraph:

    When using join to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 3159 Line: 105039 Section: sed
(2013 edition Page: 3185 Line: 106550)

In the APPLICATION USAGE section, add a new paragraph:

    When using sed to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 3187 Line: 106180 Section: sort
(2013 edition Page: 3214 Line: 107719)

In the APPLICATION USAGE section, add a new paragraph:

    When using sort to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 3214 Line: 107142 Section: tail
(2013 edition Page: 3241 Line: 108681)

In the APPLICATION USAGE section, add a new paragraph:

    When using tail to process pathnames, and the -c option is not specified, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 3250 Line: 108473 Section: tr
(2013 edition Page: 3277 Line: 110019)

In the RATIONALE section, delete:

    This meant that historical practice of being able to specify tr -cd\000-\177 (which would delete all bytes with the top bit set) would have no effect because, in the C locale, bytes with the values octal 200 to octal 377 are not characters.

On Page: 3283 Line: 109551 Section: uniq
(2013 edition Page: 3310 Line: 111099)

In the APPLICATION USAGE section, add a new paragraph:

    When using uniq to process pathnames, it is recommended that LC_ALL, or at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environment, since pathnames can contain byte sequences that do not form valid characters in some locales, in which case the utility's behavior would be undefined. In the POSIX locale each byte is a valid single-byte character, and therefore this problem is avoided.

On Page: 3454 Line: 115933 Section: A.6.2 Character Encoding
(2013 edition Page: 3483 Line: 117516)

Add a new paragraph:

    Earlier versions of this standard did not state the requirement that the POSIX locale contains 256 single-byte characters. This was an oversight; the intention was always that the POSIX locale should have an 8-bit-clean single-byte encoding.
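
Two of the requirements added by the accepted text, {MB_CUR_MAX} being 1 and btowc() never returning WEOF for values 0 to 255 in the POSIX locale, can be probed with a sketch like the one below. The helper names are hypothetical. Implementations that predate these changes may still report WEOF for some byte values, so a non-zero count from the second helper is a finding about the implementation rather than a bug in the sketch.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: selects the "C" locale for LC_CTYPE (changing
 * the process-wide locale) and returns MB_CUR_MAX for it. */
size_t c_locale_mb_cur_max(void)
{
    const char *loc = setlocale(LC_CTYPE, "C");
    assert(loc != NULL);   /* the "C" locale is required to exist */
    (void)loc;
    return (size_t)MB_CUR_MAX;
}

/* Hypothetical helper: counts the byte values 0..255 for which btowc()
 * returns WEOF in the "C" locale. */
int count_weof_bytes_in_c_locale(void)
{
    const char *loc = setlocale(LC_CTYPE, "C");
    assert(loc != NULL);
    (void)loc;

    int weof = 0;
    for (int c = 0; c <= 255; c++) {
        if (btowc(c) == WEOF)
            weof++;
    }
    return weof;
}
```

On an implementation that follows the accepted text, c_locale_mb_cur_max() would be expected to return 1 and count_weof_bytes_in_c_locale() to return 0.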

ajosey 2015-07-02 16:27 manager bugnote:0002739	Interpretation Proposed: 2 July 2015

ajosey 2015-09-07 11:34 manager bugnote:0002822	Interpretation approved: 7 Sep 2015

Date Modified	Username	Field	Change
2013-02-21 21:25	dalias	New Issue
2013-02-21 21:25	dalias	Status	New => Under Review
2013-02-21 21:25	dalias	Assigned To	=> ajosey
2013-02-21 21:25	dalias	Name	=> Rich Felker
2013-02-21 21:25	dalias	Organization	=> musl libc
2013-02-21 21:25	dalias	Section	=> strcasecmp/strncasecmp
2013-02-21 21:25	dalias	Page Number	=> unknown
2013-02-21 21:25	dalias	Line Number	=> unknown
2013-03-21 15:23	eblake	Note Added: 0001500
2013-03-21 15:45	eblake	Note Added: 0001501
2013-03-21 16:02	eblake	Note Edited: 0001501
2013-03-21 16:21	eblake	Note Edited: 0001501
2013-03-21 16:24	eblake	Note Edited: 0001501
2013-03-21 16:34	eblake	Note Edited: 0001501
2013-03-21 16:36	eblake	Interp Status	=> Pending
2013-03-21 16:36	eblake	Final Accepted Text	=> see 0000663:0001501
2013-03-21 16:36	eblake	Status	Under Review => Interpretation Required
2013-03-21 16:36	eblake	Resolution	Open => Accepted As Marked
2013-03-21 16:36	eblake	Tag Attached: tc2-2008
2013-03-21 16:38	eblake	Page Number	unknown => 1985
2013-03-21 16:38	eblake	Line Number	unknown => 62819
2013-03-21 16:41	eblake	Note Edited: 0001501
2013-03-22 15:33	geoffclare	Status	Interpretation Required => Under Review
2013-03-22 15:33	geoffclare	Resolution	Accepted As Marked => Reopened
2013-03-23 01:35	dalias	Note Added: 0001503
2013-03-23 12:47	eblake	Note Added: 0001504
2013-03-23 12:50	eblake	Note Edited: 0001504
2013-03-23 21:42	dalias	Note Added: 0001505
2013-03-28 17:35	eblake	Note Edited: 0001504
2013-03-29 08:03	ajosey	Interp Status	Pending => Proposed
2013-04-04 14:57	msbrown	Note Added: 0001529
2013-04-04 14:58	msbrown	Note Edited: 0001529
2013-04-04 15:11	ajosey	Interp Status	Proposed => ---
2013-04-04 15:31	eblake	Note Added: 0001530
2013-07-09 09:10	geoffclare	Note Added: 0001669
2014-01-16 16:45	Don Cragun	Tag Attached: UTF-8_Locale
2015-07-02 14:26	geoffclare	Note Added: 0002738
2015-07-02 14:31	geoffclare	Note Edited: 0002738
2015-07-02 15:44	geoffclare	Note Edited: 0002738
2015-07-02 15:51	geoffclare	Note Edited: 0002738
2015-07-02 16:03	geoffclare	Note Edited: 0002738
2015-07-02 16:12	geoffclare	Note Edited: 0002738
2015-07-02 16:20	geoffclare	Interp Status	--- => Pending
2015-07-02 16:20	geoffclare	Final Accepted Text	see 0000663:0001501 => see 0000663:0002738
2015-07-02 16:20	geoffclare	Status	Under Review => Interpretation Required
2015-07-02 16:20	geoffclare	Resolution	Reopened => Accepted As Marked
2015-07-02 16:20	geoffclare	Note Edited: 0001501
2015-07-02 16:22	geoffclare	Note Edited: 0002738
2015-07-02 16:27	ajosey	Interp Status	Pending => Proposed
2015-07-02 16:27	ajosey	Note Added: 0002739
2015-07-02 20:02	rhansen	Relationship added	related to 0000967
2015-09-07 11:34	ajosey	Interp Status	Proposed => Approved
2015-09-07 11:34	ajosey	Note Added: 0002822
2019-02-21 16:01	nick	Relationship added	related to 0001182
2019-06-10 08:55	agadmin	Status	Interpretation Required => Closed

Relationships

related to	0000967	Closed		1003.1(2013)/Issue7+TC1	character set confusion
related to	0001182	Closed		1003.1(2016/18)/Issue7+TC2	CX behavior wasn't changed appropriately with TC2