0001924: New word splitting requirements inappropriate in locales with non-self-synchronising character encodings

ID	Project	Category	View Status	Date Submitted	Last Update

0001924	1003.1(2024)/Issue8	Shell and Utilities	public	2025-05-05 19:02	2025-08-05 11:04

Reporter	stephane	Assigned To	geoffclare
Priority	normal	Severity	Objection	Type	Error
Status	Applied	Resolution	Accepted As Marked

Name	Stephane Chazelas
Organization
User Reference
Section	Shell word splitting and "read" utility
Page Number	various
Line Number	various
Interp Status	---
Final Accepted Text	0001924:0007196


Summary	0001924: New word splitting requirements inappropriate in locales with non-self-synchronising character encodings
Description	This is an objection to the resolution of 0001560 and a follow-up on (now withdrawn) 0001920.
0001560 has changed the way word-splitting is meant to work from splitting strings of characters on the characters of IFS to splitting strings of bytes on the byte encoding of characters in IFS.
While having operators that can safely deal with arbitrary strings of (non-null) bytes is a worthwhile endeavour, here the new required behaviour is inappropriate in locales where the character encoding is non-self-synchronising, such as when the encoding of some characters contains the encoding of others, including the single-byte ones that encode characters of the portable character set such as \, *, ?, [, themselves involved as part of or after word splitting (backslash processing by the read utility, globbing by sh). That means word splitting can result in characters being split in the middle and new different characters, including \*?[]'", being introduced or removed by that splitting.
It is not a theoretical problem. There are still locales on real-life systems that use character encodings such as BIG5, BIG5-HKSCS or GB18030, which have dozens of characters whose encoding contains the encoding of \, [ or ] and thousands whose encoding contains those of decimal digits.
For instance, as already mentioned in 0001920, the new word splitting wording would require 'Stéphane' to be split into $'St\x88' (invalid encoding) and 'phane' with IFS=m in a locale using BIG5-HKSCS, as é there is encoded as 0x88 0x6d (and m as 0x6d as in ASCII). And how would word splitting even work with IFS='mé'?
In a locale using the GB18030 encoding, with IFS='芠', '∑[0-9]' would be split into $'\xa1' and '0-9]', turning a glob into two non-globs, one of which is invalid text.
The EUC-JP character encoding does not (AFAIK) have characters whose encoding contains that of others but is still not self-synchronising. There, for example, 与 is encoded as 0xcd 0xbf and 人 as 0xbf 0xcd, so with IFS='与', 人人人 would be split into $'\xbf', '' and $'\xcd'.
Even in locales using UTF-8, the algorithm that POSIX now requires sh/read to implement (and that AFAIK no shell implements other than the ones that don't support multibyte encodings) is arguably not the best one if we remove the constraint of IFS having to contain characters. As a contrived example (as 0xc0, 0xc1, 0xf5..0xff would be better choices as those bytes cannot appear in valid modern UTF-8), since a lone 0x80 byte cannot occur in valid UTF-8, one could want to join valid UTF-8 strings with that byte value and expect word splitting to split the result back with IFS=$'\x80'. But with the new POSIX algorithm, that wouldn't work if the strings contained characters whose encoding contains that byte.
Several systems, when converting strings expected to be UTF-8 encoded into wide character strings, convert valid UTF-8 encoded characters into the corresponding Unicode codepoint value, and each byte that cannot be decoded into a character into a special range of values outside the 0x0..0xd7ff, 0xe000..0x10ffff range covered by Unicode. Many of them use the range reserved for the second half of the UTF-16 surrogate pairs (0xdc00 to 0xdfff) as that means it can also be used to decode into UTF-16. That's the case of python, zsh (in some areas only, like pattern matching), java (I believe) at least.
    $ python3 -c 'import sys; print("{!r}".format(sys.argv[1]))' $'\x80'
    '\udc80'
    $ a=$'\x80' zsh -c $'case $a in ([\ud7ff-\ue000]) echo yes; esac'
    yes
That approach also allows splitting arbitrary strings, and if IFS contained valid UTF-8 it would produce identical results to the 0001560 algorithm, and if not it would arguably be preferable as it wouldn't cut valid characters in the middle. That approach also allows pattern matching on potentially invalid strings or ${#var} to work better.
It doesn't work well for incorrect strings in non-self-synchronising character encodings, but then again there's not much we can do with those in that case.
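The EUC-JP case from the description can be reproduced with a short Python sketch. It is only an illustration under simplifying assumptions: a single non-whitespace IFS character, Python's `euc_jp` codec standing in for the locale, and hypothetical helper names (`split_on_bytes`, `split_on_chars`) that belong to neither POSIX nor any shell. The byte values are the ones quoted in the report.

```python
def split_on_bytes(subject: bytes, ifs_char: bytes) -> list[bytes]:
    """Split wherever the byte encoding of the IFS character occurs."""
    return subject.split(ifs_char)


def split_on_chars(subject: bytes, ifs_char: bytes, encoding: str) -> list[bytes]:
    """Decode first, split on the IFS character, then re-encode each field."""
    text = subject.decode(encoding)
    sep = ifs_char.decode(encoding)
    return [field.encode(encoding) for field in text.split(sep)]


ENCODING = "euc_jp"          # assumption: Python codec standing in for the locale
IFS_CHAR = b"\xcd\xbf"       # 0xcd 0xbf, the encoding the report gives for 与
SUBJECT = b"\xbf\xcd" * 3    # 0xbf 0xcd three times, i.e. 人人人 per the report

if __name__ == "__main__":
    # Byte-level splitting cuts through the middle of characters.
    print(split_on_bytes(SUBJECT, IFS_CHAR))            # [b'\xbf', b'', b'\xcd']
    # Character-level splitting leaves the text alone: it contains no 与.
    print(split_on_chars(SUBJECT, IFS_CHAR, ENCODING))
```

The byte-level result matches the $'\xbf', '' and $'\xcd' fields given in the description, while the character-level variant returns the subject as a single field.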
Desired Action	First, at least the wording should make it clear that shells/read implementations are not required to implement that algorithm, just that whatever algorithm they use must produce the same result as long as IFS contains only properly encoded characters ("shall be split as if by looking for the encoding of characters of IFS...").
In any case, that method should be constrained to locales with self-synchronising encodings (in practice today, probably just single-byte encodings and UTF-8), and must not be used otherwise as it can produce incorrect splitting of perfectly correct text.
In those other locales (the ones with non-self-synchronising encodings), the pre-0001560 wording is probably the best: split on characters, with behaviour unspecified if the subject or IFS contains sequences of bytes that can't be decoded into characters (the STDIN section of the read utility would need to be updated accordingly).
Potentially better algorithms, such as the one described above that would allow pattern matching for instance to work better on arbitrary sequences of bytes, should probably be explored.
Another option would be to put all non-self-synchronising character encodings out of scope of POSIX (like was already done for locking-shift encodings), but that's probably a step too far as that would make a lot of the POSIX spec designed to deal with those irrelevant.
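The surrogate-escape approach mentioned in the description can be sketched in Python, whose `surrogateescape` error handler maps each undecodable byte to a code point in the 0xdc80..0xdcff range. The function name `split_surrogateescape` is hypothetical, and the sketch ignores IFS whitespace rules; it only shows that splitting on decoded values keeps valid characters intact.

```python
def split_surrogateescape(subject: bytes, ifs: bytes) -> list[bytes]:
    """Split on IFS after decoding UTF-8 with undecodable bytes escaped."""
    text = subject.decode("utf-8", "surrogateescape")
    separators = set(ifs.decode("utf-8", "surrogateescape"))
    fields, current = [], []
    for ch in text:
        if ch in separators:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    # Re-encoding with the same handler restores the original bytes.
    return [f.encode("utf-8", "surrogateescape") for f in fields]


if __name__ == "__main__":
    # Two valid UTF-8 strings joined with a lone 0x80 byte.
    joined = "À".encode("utf-8") + b"\x80" + "é".encode("utf-8")
    # Splitting on decoded values keeps À (0xc3 0x80) whole.
    print(split_surrogateescape(joined, b"\x80"))   # [b'\xc3\x80', b'\xc3\xa9']
    # Splitting on raw bytes cuts À in the middle.
    print(joined.split(b"\x80"))                    # [b'\xc3', b'', b'\xc3\xa9']
```

When IFS holds only valid UTF-8, both approaches should agree, since a valid character's encoding cannot start in the middle of another one.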
Tags	tc1-2024

geoffclare 2025-05-15 15:14 manager bugnote:0007183	After page 79 line 2388 section 3 Definitions, add:
    3.328 Self-synchronizing Character Encoding
    A character encoding in which no contiguous subset of bytes from the encoding of any one character or two adjacent characters can also represent the encoding of any valid character on its own.
and renumber the later subsections.
On page 2481 line 80454 section 2.5.3 Shell Variables (IFS), after:
    If the value of IFS includes any bytes that do not form part of a valid character, the results of field splitting, expansion of '*', and use of the read utility are unspecified.
add a sentence:
    If the character encoding used for the characters in IFS is not self-synchronizing and the value of IFS includes any character for which the byte encoding can overlap with the byte encoding of any other sequence of characters, the results of field splitting, expansion of '*', and use of the read utility are unspecified. (Note: the UTF-8 encoding is self-synchronizing, meaning that no character's encoding can be confused with any other sequence of characters, and thus does not trigger this exception.)
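One way to get a feel for the proposed definition is to test it on a small sample of characters. The sketch below is a hedged illustration, not a proof: it reads the definition as excluding each character's own encoding, it only inspects the characters passed in (so it can find counterexamples but cannot show an encoding is self-synchronizing), and `overlaps` is a hypothetical helper name.

```python
from itertools import product


def overlaps(encoding: str, sample: str) -> list[tuple[str, str]]:
    """Return (context, found) pairs where the encoding of `found` appears
    inside `context` (one character or two adjacent characters) at a
    position that is not a character of `context` itself."""
    enc = {ch: ch.encode(encoding) for ch in sample}
    by_bytes = {b: ch for ch, b in enc.items()}
    hits = []
    # Proper contiguous subsets of a single character's encoding.
    for ch, b in enc.items():
        for i in range(len(b)):
            for j in range(i + 1, len(b) + 1):
                if (i, j) != (0, len(b)) and b[i:j] in by_bytes:
                    hits.append((ch, by_bytes[b[i:j]]))
    # Contiguous subsets straddling the boundary between two characters.
    for first, second in product(sample, repeat=2):
        pair = enc[first] + enc[second]
        boundary = len(enc[first])
        for i in range(boundary):
            for j in range(boundary + 1, len(pair) + 1):
                if pair[i:j] in by_bytes:
                    hits.append((first + second, by_bytes[pair[i:j]]))
    return hits


if __name__ == "__main__":
    print(overlaps("utf-8", "mé€À"))    # no overlap found in this sample
    print(overlaps("big5", "α\\"))      # backslash byte inside α
    # The two EUC-JP characters from the report (0xcd 0xbf and 0xbf 0xcd).
    kanji = b"\xcd\xbf".decode("euc_jp") + b"\xbf\xcd".decode("euc_jp")
    print(overlaps("euc_jp", kanji))
```

The Big5 and EUC-JP samples illustrate the two failure modes discussed in this issue: an encoding contained in another character's encoding, and an encoding that appears across the boundary of two adjacent characters.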

stephane 2025-05-16 06:25 reporter bugnote:0007186	Re: 0001924:0007183
Thanks for that. A few comments:
> After page 79 line 2388 section 3 Definitions, add:
>
> 3.328 Self-synchronizing Character Encoding
>
> A character encoding in which no contiguous subset of bytes
> from the encoding of any one character or two adjacent
> characters can also represent the encoding of any valid
> character on its own.
[...]
Not sure that wording works. There's necessarily "A subset of bytes from the encoding of two adjacent characters" that can "represent the encoding of any valid character on its own", since it contains the encoding of each of those two characters. Maybe a "subset (other than the encoding of each character)".
> On page 2481 line 80454 section 2.5.3 Shell Variables (IFS), after:
>
> If the value of IFS includes any bytes that do not form part
> of a valid character, the results of field splitting,
> expansion of '*', and use of the read utility are
> unspecified.
>
> add a sentence:
>
> If the character encoding used for the characters in IFS is
> not self-synchronizing and the value of IFS includes any
> character for which the byte encoding can overlap with the
> byte encoding of any other sequence of characters, the
> results of field splitting, expansion of '*', and use of the
> read utility are unspecified. (Note: the UTF-8 encoding is
> self-synchronizing, meaning that no character's encoding can
> be confused with any other sequence of characters, and thus
> does not trigger this exception.)
"encoding used for the characters in IFS" is not clear to me. In the shells I know (with the possible exception of yash), when IFS is assigned a value, it's assigned a sequence of bytes which may or may not form characters in the locale (as determined by ${LC_ALL:-${LC_CTYPE:-$LANG}}) at the time, but what matters wrt word splitting is the locale (specifically ${LC_ALL:-${LC_CTYPE:-$LANG}}) in effect at the time splitting is performed, and the characters that those bytes form (and potentially whether they're classified as iswspace()).
So it would be about whether the locale's character encoding is self-synchronizing or not, not "the character encoding used for the characters in IFS" (whatever that means).
Now, those concerns aside, AFAICT that resolution addresses this issue (other than the request to add an "as if by") and it's nice that it makes it clear that character encodings such as BIG5/GB18030 and other non-self-synchronising encodings are not usable (at least reliably), but I fear it's not going to be very useful to a portable application writer. How is someone to know which character may or may not be used in IFS?
In practice, on systems that have locales that use GB18030 or BIG5-HKSCS charsets (which are many), we're basically telling them that they can't use characters other than U+0001..U+002F (control characters, space and !"#$%&'()*+,-./), U+003A..U+003F (:;<=>?) and U+007F (DEL) in IFS. They can't use IFS='\|', IFS=_, IFS='~', IFS=X or non-ASCII characters for instance if they want their script to be usable on user input in any of the system's locales.
Word splitting is meant to be about splitting text on characters of IFS; we should be able to tell application writers that if they have valid text in IFS and in the subject being split (as input by the user in their own locale for instance), it will be split correctly.
AFAICT, and bugs aside, shells that support multibyte encodings (bash, zsh, AT&T ksh, bosh, yash at least) do that; they do not "split on the encoding of characters of IFS" like bug:1560 requires.
It's a welcome addition to mandate that in locales using a self-synchronising character encoding (and IFS containing valid text as per that encoding), implementations must be able to split arbitrary sequences of bytes as if by splitting on the encoding of characters of IFS. But then, IMO, it should say that. As in:
- split on characters of IFS (essentially revert bug:1560)
- and also: in locales using a self-synchronising character encoding (and IFS containing valid text as per that encoding), implementations must be able to split arbitrary sequences of bytes even if they don't form valid characters as if by splitting on the encoding of characters of IFS. (same for read -d delimiter with the added constraint that the delimiter must be a single-byte character).
With a non-self-synchronising encoding, behaviour is unspecified on a non-text subject.
Also, why make $* unspecified? $* unquoted is not useful, so I don't really care what POSIX says about it, but I can't see why "$*" can't be just the concatenation of positional parameters with the first character of IFS (at byte level) regardless of what that character may be (assuming IFS contains valid text), even if the positional parameters don't contain valid text (which may result in character recombination, but why would we care at that point?).
[btw, one still can't use $IFS or ${IFS} in this bug tracker, any way that particular rule could be disabled?]

stephane 2025-05-16 06:28 reporter bugnote:0007187	Re: 0001924:0007186
> [btw, one still can't use $IFS or ${IFS} in this bug
> tracker, any way that particular rule could be disabled?]
For the record, I did enter those as &#36;IFS, &#36;{IFS} and that's rendered as $IFS or ${IFS}, so that would be a work-around for now.

hvd 2025-05-16 09:39 reporter bugnote:0007188	I'm not sure what the standardese would be, but I think it's possible to make it less unspecified so that it still allows handling file names containing arbitrary bytes, but restores the handling of all locales to what Issue 7 required.
The rule that, as far as I know, all shells that support multibyte characters try to implement, is simple:
    When a shell interprets a byte string as a character string, this is done as if by repeated calls to mbrtowc(), except that if it would encounter EILSEQ, an unspecified character (other than a null character) is produced and conversion resumes from the initial conversion state.
Are there any shells that do not try to follow this general principle? If not, and if there is a way to phrase that in a manner appropriate for standardization, the changes to require splitting on byte sequences can be reverted; the intended aim of those changes would then be handled transparently.
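The rule quoted in this note can be approximated in Python as a rough model. Python has no mbrtowc(), so the sketch tries successively longer prefixes with a strict decoder and, when none decodes, treats one byte as an invalid unit and resumes at the next byte. The names `chars_with_bytes` and `split_with_resync`, the `max_len` bound and the choice to keep undecodable bytes verbatim in the field are all assumptions made for illustration.

```python
def chars_with_bytes(data: bytes, encoding: str, max_len: int = 4):
    """Yield (char, raw) pairs; char is None for a byte that starts no
    valid character, after which decoding resumes at the next byte."""
    i = 0
    while i < len(data):
        char, size = None, 1          # default: one undecodable byte
        for n in range(1, min(max_len, len(data) - i) + 1):
            try:
                char, size = data[i:i + n].decode(encoding), n
                break                 # shortest prefix that decodes
            except UnicodeDecodeError:
                continue
        yield char, data[i:i + size]
        i += size


def split_with_resync(data: bytes, ifs: bytes, encoding: str) -> list[bytes]:
    """Split on the characters of IFS, never inside a valid character."""
    separators = {c for c, _ in chars_with_bytes(ifs, encoding) if c is not None}
    fields, current = [], b""
    for char, raw in chars_with_bytes(data, encoding):
        if char is not None and char in separators:
            fields.append(current)
            current = b""
        else:
            current += raw            # invalid bytes are kept as-is
    fields.append(current)
    return fields


if __name__ == "__main__":
    # Valid EUC-JP text from the report is not cut in the middle.
    print(split_with_resync(b"\xbf\xcd" * 3, b"\xcd\xbf", "euc_jp"))
    # A truncated character followed by the delimiter still splits.
    print(split_with_resync(b"\xbfm\xbf\xcd", b"m", "euc_jp"))
    print(split_with_resync(b"a\x80,b", b",", "utf-8"))
```

Under this model, valid text is split exactly as Issue 7 required, and only the treatment of undecodable bytes is left to the implementation.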

stephane 2025-05-16 14:13 reporter bugnote:0007189	Re: 0001924:0007188 I don't really like the idea of specifying implementation algorithms unless that's the obvious one and it can't be perfectible (maybe because I'm on the user and not implementor side). The 0001561 algorithm is perfectly fine in locales using single-byte or self-synchronising (UTF-8) encodings, likely the most efficient and the ones implementations / systems that don't intend to support other encodings (and don't otherwise decode all input à la PYTHONIOENCODING=utf-8:surrogateescape) may want to use. But it's plain wrong for other encodings. In locales using non-self-synchronising encodings, with sequences of bytes that can't be decoded into text, I don't think a perfect solution exists. That's the point, if you lose synchronisation, there's no sure way to tell where it was lost and where to resume it, whether the corruption was caused by left or right truncation, byte deletion/insertion, bit flip or the input was actually encoded using a different encoding or a different version of the encoding, or is supplied by an attacker trying to trick you or exploit a bug. Whatever you do, you may very well be hallucinating new characters, missing perfectly encoded characters... In those cases IMO, the best thing for POSIX to do is leave the behaviour unspecified, letting implementations decide what they think is best in their specific context. 
For instance, some may want to detect that the input is actually encoded in UTF-8 and treat it as such (because that's the most likely cause on those systems for instance), some may want to treat input and IFS as single-byte characters when the input or IFS can't be decoded into characters (like bash does for pattern matching when subject or pattern cannot be decoded as text¹).
I don't know if shell implementations use the algorithm you describe with mbrtowc() and handling of EILSEQ, but for the record, I reported 0001920 and this follow-up bug after having been made aware of the bash bug described at:
https://mywiki.wooledge.org/BashPitfalls#pf65
https://lists.gnu.org/archive/html/bug-bash/2025-04/msg00065.html
where read can read past the delimiter (even if newline or null, which can't be found in the encoding of other characters) if reading sequences of bytes that don't form valid characters. This suggests that it may not be how it does it, or that it's not as simple as that.
Also bear in mind that on non-seekable (and non-peekable) input at least, read has to read one byte at a time (and try to decode what has been read at every step) so as not to read past the delimiter, which complicates things further.
---
¹ Which I personally consider a bug, see https://lists.gnu.org/archive/html/bug-bash/2021-02/msg00054.html

hvd 2025-05-16 15:31 reporter bugnote:0007190	> I don't really like the idea of specifying implementation algorithms unless that's the obvious one and it can't be perfectible (maybe because I'm on the user and not implementor side). Although I specified it as an implementation algorithm, it's from the user's perspective that I'm suggesting it. The reason for the spec changes is to be able to hold arbitrary file names that are not valid characters according to the current locale. I have such files myself, it's from that perspective that I care about this. In some cases we need to have multiple file names joined by something other than '\0', and being able to resume the conversion after invalid bytes have been encountered on a "best effort" basis is important for that. > The 0001561 algorithm is perfectly fine in locales using single-byte or self-synchronising (UTF-8) encodings, likely the most efficient and the ones implementations / systems that don't intend to support other encodings (and don't otherwise decode all input à la PYTHONIOENCODING=utf-8:surrogateescape) may want to use. It's not fine even there, in my opinion. I recall invalid bytes being interpreted by some shells in some situations as the same characters as other valid bytes, and I can imagine scenarios where that would make sense (e.g. interpreting a single 0xA0 byte, which represents U+00A0 in ISO-8859-1, as U+00A0 even in an UTF-8 locale), that should IMO be permitted so that shell implementors can figure out what works best for them. The current wording does not permit it, my suggested wording does. > In those cases IMO, the best thing for POSIX to do is leave the behaviour unspecified, I don't mind if the behaviour is unspecified for bytes that do not form valid characters, I do mind if the required behaviour is contrary to previously required behaviour for bytes that do form valid characters. That is the basis for my suggestion, it limits the unspecified behaviour to those cases. 
> I don't know if shell implementations use the algorithm you describe with mbrtowc() and handling of EILSEQ, but for the record, I reported 0001920 and this follow-up bug after having been made aware of the bash bug described at: Thanks for the pointer, it looks like bash didn't do it this way in this specific case but it was acknowledged as a bug and will be fixed for the next version? > some may want to treat input and IFS as single-byte characters when the input or IFS can't be decoded into characters This, however, I am less sure about. This is neither permitted by the current wording nor by my suggested wording, but is a valid idea of what constitutes "best effort" and it seems reasonable to find some way of allowing it. > Also bear in mind, that on non-seekable (and non-peekable) input at least, read has to read one byte at a time (at try to decode what has been read at every step) so as not to read past the delimiter, which complicates things further. I'm aware of that, that is easy to handle portably since mbrtowc() allows processing one single byte at a time, and better optimised implementations for specific locales would be able to do the same even easier.

chet_ramey 2025-05-16 18:48 reporter bugnote:0007191	>> I don't know if shell implementations use the algorithm you describe with mbrtowc() and handling of EILSEQ, but for the record, I reported 0001920 and this follow-up bug after having been made aware of the bash bug described at: >Thanks for the pointer, it looks like bash didn't do it this way in this specific case but it was acknowledged as a bug and will be fixed for the next version? The problem was that the bash `read' took the bytes that resulted in an invalid multibyte character and added them to the current word, without checking whether any of those bytes were the delimiter. The fix required adding the check.
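The kind of check described in this note can be illustrated with a byte-at-a-time reader in Python. This is not bash's code: `read_field` and its `check_invalid` switch are hypothetical, and the sketch assumes a single-byte delimiter that never occurs inside another character (such as newline), so that on a decoding error only the byte that triggered the error needs to be compared with the delimiter.

```python
import codecs
import io


def read_field(stream, encoding: str, delim: bytes = b"\n",
               check_invalid: bool = True) -> bytes:
    """Read one byte at a time up to (and consuming) the delimiter."""
    decoder = codecs.getincrementaldecoder(encoding)()
    word = bytearray()
    pending = bytearray()             # bytes of the character being assembled
    while True:
        byte = stream.read(1)
        if not byte:                  # end of input
            return bytes(word + pending)
        pending += byte
        try:
            text = decoder.decode(byte)
        except UnicodeDecodeError:
            decoder.reset()           # resume from the initial state
            if check_invalid and byte == delim:
                # The delimiter ended an invalid sequence: stop here.
                return bytes(word + pending[:-1])
            word += pending           # keep the invalid bytes in the word
            pending.clear()
            continue
        if not text:
            continue                  # character still incomplete
        if pending == delim:          # a complete character that is the delimiter
            return bytes(word)
        word += pending
        pending.clear()


if __name__ == "__main__":
    data = b"ab\xc3\nrest\n"          # 0xc3 is a truncated UTF-8 character
    fixed = io.BytesIO(data)
    print(read_field(fixed, "utf-8"), fixed.read())     # b'ab\xc3' b'rest\n'
    # Without the check, the first newline is swallowed into the word.
    print(read_field(io.BytesIO(data), "utf-8", check_invalid=False))
```

Comparing the whole pending sequence with the delimiter, rather than just its last byte, is what keeps a character such as Big5 α (0xa3 0x5c) from being mistaken for a backslash delimiter.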

geoffclare 2025-06-05 16:11 manager bugnote:0007196	In the June 5, 2025 teleconference the issues raised since the original resolution were discussed. The following is a new proposed resolution but the issue is being left open for feedback.
After page 79 line 2388 section 3 Definitions, add:
    3.328 Self-synchronizing Character Encoding
    A character encoding in which no contiguous subset (other than the encoding of each character) of bytes from the encoding of any one character or two adjacent characters can also represent the encoding of any valid character on its own.
and renumber the later subsections.
On page 120 line 3840 section 6.2, change:
    Likewise, the byte values used to encode <period>, <slash>, <newline>, and <carriage-return> shall not occur as part of any other character in any locale.
to:
    Likewise, the byte values used to encode <newline>, <carriage-return>, <tab>, <space>, <hyphen-minus>, <period>, <slash>, and <colon> shall not occur as part of any other character in any locale.
On page 2481 line 80454 section 2.5.3 Shell Variables (IFS), after:
    If the value of IFS includes any bytes that do not form part of a valid character, the results of field splitting, expansion of '*', and use of the read utility are unspecified.
add a sentence:
    If the current locale's character encoding is not self-synchronizing and the value of IFS includes any character for which the byte encoding can overlap with the byte encoding of any other sequence of characters, the results of field splitting, expansion of '*', and use of the read utility are unspecified.
and two small-font notes: <small>Note: The UTF-8 encoding is self-synchronizing, meaning that no character's encoding can be confused with any other sequence of characters, and thus all characters are safe to use in IFS when the current locale uses this encoding.</small> <small>Note: [xref to XBD 6.2 Character Encoding] specifies a set of characters from the portable character set whose byte values are not allowed to occur as part of any other character in any locale. These characters are safe to use in IFS with any locale.</small>
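The extended list in XBD 6.2 can be spot-checked against a few codecs. The sketch below scans the Basic Multilingual Plane of some Python codecs and collects every byte value used inside a multi-byte character. It is a rough check under stated assumptions: Python's codecs may differ from a given system's locale charmaps, only code points below `limit` are examined, and `bytes_used_in_multibyte` is a hypothetical helper name.

```python
from functools import lru_cache

# <newline>, <carriage-return>, <tab>, <space>, <hyphen-minus>,
# <period>, <slash>, <colon>: the characters listed in the new 6.2 text.
SAFE_EVERYWHERE = "\n\r\t -./:"


@lru_cache(maxsize=None)
def bytes_used_in_multibyte(encoding: str, limit: int = 0x10000) -> frozenset:
    """Byte values that occur inside some multi-byte character."""
    used = set()
    for code_point in range(limit):
        try:
            encoded = chr(code_point).encode(encoding)
        except UnicodeError:
            continue                  # not representable in this encoding
        if len(encoded) > 1:
            used.update(encoded)
    return frozenset(used)


def risky_ifs_chars(candidates: str, encoding: str) -> str:
    """ASCII candidates whose byte also appears inside other characters."""
    used = bytes_used_in_multibyte(encoding)
    return "".join(c for c in candidates if ord(c) in used)


if __name__ == "__main__":
    for enc in ("utf-8", "euc_jp", "big5", "gb18030"):
        print(enc, repr(risky_ifs_chars(SAFE_EVERYWHERE + "\\|_~X0", enc)))
```

For the codecs tried here the eight listed characters should never be reported, whereas characters such as backslash (Big5) or decimal digits (GB18030) are, which is the practical restriction on IFS discussed in the notes above.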

stephane 2025-06-12 14:18 reporter bugnote:0007200	Thanks, but please also change the wording to add "as if" as per: > First, at least the wording should make it clear that shells/read implementations are not required to implement that algorithm, just that whatever algorithm they use must produce the same result as long as IFS contains only properly encoded characters ("shall be split as if by looking for the encoding of characters of IFS..."). In the desired action. I thought we were in agreement that splitting by "looking for the encoding of characters of IFS" was not the algorithm that shells would want to implement here (even in locales using self-synchronising encodings such as UTF-8), but "as if by looking..." is fine under the currently stated constraints (as long as IFS contains only valid characters and the locale uses a self-synchronising encoding) and avoids having to mandate a specific algorithm.

geoffclare 2025-06-12 14:58 manager bugnote:0007202	Re 0001924:0007200 This was discussed in the teleconference. There is already an "as if" on line 80923 and we didn't see the need to add another one.

Date Modified	Username	Field	Change
2025-05-05 19:02	stephane	New Issue
2025-05-15 15:14	geoffclare	Note Added: 0007183
2025-05-15 15:16	geoffclare	Status	New => Resolved
2025-05-15 15:16	geoffclare	Resolution	Open => Accepted As Marked
2025-05-15 15:16	geoffclare	Interp Status	=> ---
2025-05-15 15:16	geoffclare	Final Accepted Text	=> 0001924:0007183
2025-05-15 15:16	geoffclare	Tag Attached: tc1-2024
2025-05-16 06:25	stephane	Note Added: 0007186
2025-05-16 06:28	stephane	Note Added: 0007187
2025-05-16 09:39	hvd	Note Added: 0007188
2025-05-16 14:13	stephane	Note Added: 0007189
2025-05-16 15:31	hvd	Note Added: 0007190
2025-05-16 18:48	chet_ramey	Note Added: 0007191
2025-06-05 16:11	geoffclare	Note Added: 0007196
2025-06-05 16:12	geoffclare	Assigned To	=> geoffclare
2025-06-05 16:12	geoffclare	Status	Resolved => Under Review
2025-06-05 16:12	geoffclare	Resolution	Accepted As Marked => Reopened
2025-06-12 14:18	stephane	Note Added: 0007200
2025-06-12 14:58	geoffclare	Note Added: 0007202
2025-07-10 15:24	geoffclare	Status	Under Review => Resolved
2025-07-10 15:24	geoffclare	Resolution	Reopened => Accepted As Marked
2025-07-10 15:24	geoffclare	Final Accepted Text	0001924:0007183 => 0001924:0007196
2025-08-05 11:04	geoffclare	Status	Resolved => Applied
