View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001920 | 1003.1(2024)/Issue8 | Shell and Utilities | public | 2025-04-21 07:16 | 2025-05-01 15:56 |
Reporter | stephane | Assigned To | |||
Priority | normal | Severity | Objection | Type | Omission |
Status | Closed | Resolution | Withdrawn | ||
Name | Stephane Chazelas | ||||
Organization | |||||
User Reference | |||||
Section | read utility, stdin section | ||||
Page Number | 3321 | ||||
Line Number | 112915 | ||||
Interp Status | --- | ||||
Final Accepted Text | |||||
Summary | 0001920: read -d '' on invalid text without -r and IFS= | ||||
Description | Issue 8 added the -d option to the read utility as a resolution of 0000243 to be able to read records from the output of find -print0 reliably with IFS= read -rd '' pathname. That includes in the STDIN section: > If the -d delim option is specified and delim is the null string, > the standard input shall contain zero or more bytes (which > need not form valid characters). However if -r is not also included, that leaves it unclear how an implementation should identify backslash characters (as used to escape separators and delimiter) and if IFS is unset or non-empty, how it should locate those in non-text input. That also implies that shells are required to store variable values internally as the raw byte encoding, not decoded text like yash does for instance (but then again there are other parts of the specification that imply it as well and IMO yash's approach is not sustainable and should be discouraged). | ||||
Desired Action | Change to: > If the -d delim option is specified and delim is the null string, the -r option is specified > and IFS is set to the empty string, the standard input shall contain zero or more bytes > (which need not form valid characters). Instead of "IFS is set to the empty string", I guess we could make it "IFS is empty or contains only single-byte characters among those whose encoding is guaranteed not to be found in that of multibyte characters" (period, slash, newline, carriage return). Shell variables, environment variables and command line arguments being required to be able to contain non-text would likely warrant a separate ticket. | ||||
Tags | No tags attached. |
|
Ah, I was focusing on the IFS= read -rd '' case, but overlooked that the previous paragraph had:
> If the −d delim option is not specified, or if it is specified and delim is not the null string, the
> standard input shall contain zero or more bytes (which need not form valid characters) and shall
> not contain any null bytes.

Which also leaves it unclear how backslash and IFS characters should be identified, so my proposed resolution is not adequate. The whole stdin section should probably be something like:
> if the -r option is specified, IFS is set to an empty
> string (or alternative mentioned above) and the -d delim is
> either not specified or delim is one of period, slash, newline
> or carriage-return, standard input shall contain zero or more
> bytes (which need not form valid characters) other than null
> (unless the -d delim option is specified and delim is the null
> string). Otherwise, the input shall be text except that it is
> not required to end in a newline character and lines are not
> limited to {LINE_MAX} bytes in length. |
|
I can't see a problem here. Or if there is a problem, it is not as big as Stephane is claiming. All of the processing of the input is specified in terms of bytes, not characters, to allow for the fact that the input need not form valid characters. This was done because the input often contains pathnames. The wording does refer to characters in a few places, but only when referring to single-byte characters (including a delim character specified with -d, because if delim is not a single-byte character, the behaviour is unspecified) or to characters in IFS. Although IFS contains characters, XCU 2.6.5 says "The shell shall use the byte sequences that form the characters in the value of the IFS variable as delimiters", and "Note that the shell processes arbitrary bytes from the input fields; there is no requirement that those bytes form valid characters." As regards Stephane's suggested text at the end of 0001920:0007139, we definitely don't want to go back to requiring the input to be text under any circumstances (because of the pathname thing). |
|
Remember that in some character sets, byte sequences that form valid characters are also subsequences of other valid characters. In BIG-5, α is encoded as two bytes, 0xA3 followed by 0x5C. Backslash is encoded as in US-ASCII, as 0x5C. BIG-5 text that contains α should not have the second byte of that character misinterpreted as a backslash. |
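The hazard described in this note can be sketched with the byte values it quotes (0xA3 0x5C for α, 0x5C for backslash). This is only an illustration: the helper name `naive_backslash_offsets` is hypothetical, and the bytes are written out literally rather than produced by a BIG-5 codec.

```python
# Byte values taken from the note above: in BIG-5, alpha is 0xA3 0x5C and
# backslash is 0x5C, as in US-ASCII.
ALPHA_BIG5 = b"\xa3\x5c"
BACKSLASH = b"\x5c"


def naive_backslash_offsets(data: bytes) -> list[int]:
    """Return every offset holding the backslash byte, ignoring character boundaries."""
    return [i for i in range(len(data)) if data[i:i + 1] == BACKSLASH]


# "a", alpha, "b": the text contains no backslash character, yet a byte-wise
# scan reports one at the offset of alpha's second byte.
text = b"a" + ALPHA_BIG5 + b"b"
print(naive_backslash_offsets(text))
```

A scanner that wants to avoid this has to know where character boundaries fall, which is exactly what cannot be assumed for input that need not form valid characters.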
|
Re: 7140 That's reasonable for byte sequences that do not form valid multibyte characters in the current encoding, but a delimiter byte that happens to appear as part of a valid multibyte character in the current encoding should not terminate a field. |
|
This proposed text can't be correct:
> Otherwise, the input shall be text except that it is
> not required to end in a newline character and lines are not
> limited to {LINE_MAX} bytes in length.

It can't be correct because you CANNOT presume the input is text in the case of read -rd ''. The expected use case of read -rd '' is to handle arbitrary pathnames which are NOT necessarily text at all. Pathnames are sequences of bytes, and the spec never guarantees that those byte sequences are text.

Using read with -d '' but without -r is generally a mistake. Once you use -d '', you can't assume the fields are "text", and the lack of "-r" presumes that the input is text and that we know what its encoding is. The *obvious* solution would be to allow implementations to silently enable '-r' whenever -d is passed an empty string. Frankly, I'd also allow returning a failure when using read with -d '' but without -r, as it's not really a sensible combination. If you assume "all of the world is UTF-8" I guess it's easy to implement, but it's not clear *why* you would do it :-).

Instead, after:
> If the -d delim option is specified and delim is the null string,
> the standard input shall contain zero or more bytes (which
> need not form valid characters).

I would add:
> In this case, implementations MAY act as if -r was also used
> or return an error.

In some future version of the spec I would be happy to add a read "-0" option that did both -d '' and -r. It's good to make security-relevant options easy. |
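As a rough model of what the `IFS= read -rd ''` use case is meant to deliver, the sketch below splits a byte stream on NUL and leaves every other byte untouched: no backslash processing, no field splitting, no decoding. The function name `split_nul_records` and the sample data are hypothetical; this is Python standing in for the idea, not a description of any shell's implementation.

```python
def split_nul_records(stream: bytes) -> list[bytes]:
    """Split NUL-terminated records; a trailing unterminated record is kept."""
    records = stream.split(b"\0")
    # A stream that ends with NUL leaves one empty tail element; drop it.
    if records and records[-1] == b"":
        records.pop()
    return records


# Hypothetical `find -print0`-style output: the first name contains a newline
# and a backslash, the second contains 0xFF, which is not valid UTF-8.
sample = b"./a\nb\\c\0./bad\xffname\0"
for record in split_nul_records(sample):
    print(record)
```

Because pathnames cannot contain NUL, it is the one delimiter that never collides with record content, whatever the locale's encoding.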
|
Re: 0001920:0007143 The "otherwise" is:
- if IFS is not empty (or is unset, which is equivalent to the default IFS value), or at least contains characters other than dot, slash, CR, LF. In which case how would you do IFS splitting (which is defined in terms of characters, not bytes) on non-text?
- or if -r is not provided. Otherwise how would you identify backslash characters (considering that in several charmaps, the encoding of backslash is found inside the encoding of many other characters)?
- or the delimiter (as passed to -d, defaulting to LF) is something other than dot, slash, CR, LF or the empty string (which means the NUL character, with its guaranteed single 0 byte encoding), as again that character could have an encoding found in that of other characters.

"IFS= read -rd '' filename" in practice is the only reliable way to read an arbitrary file path (if we ignore bugs in some versions of bash as recently reported and which led me to submit this ticket).

read -d was first introduced in ksh93, and is also found in bash and zsh. I believe all 3 were (at least initially) treating the argument as a byte and were looking for it in the input *before* decoding it into text (to look for backslashes and IFS characters).

I wouldn't personally object to (and would probably even approve) POSIX mandating that behaviour even for delimiter values other than NUL, dot, slash, CR, LF (the ones whose encoding is guaranteed to be single byte and not found in the encoding of any other character in the locale), even if in practice that may lead to unwanted behaviour in locales that use GB18030, BIG5 and BIG5-HKSCS (and maybe others). In practice those character encodings are not workable anyway, and the mere fact of /enabling/ locales with those charmaps (let alone using them) is a sure way to introduce security vulnerabilities on a system. |
|
Re: 0001920:0007143
> Once you use -d '', you can't assume the fields are "text", and the lack of
> "-r" presumes that the input is text and that we know what its encoding is.
> The *obvious* solution would be to allow implementations to silently
> enable '-r' whenever -d is passed an empty string.

While implying -r, and maybe skipping IFS-splitting may make sense for -d '' (when reading NUL-delimited records), -d was not initially intended to read NUL-delimited records. Actually, read -d '' didn't work in ksh93 where that option came from over 30 years ago. AFAIK, it's bash that first extended it so read -d '' meant read until the next 0 byte (more or less as an accident of implementation, as it's implemented in C with its NUL-delimited strings, and it was taking the first byte of such strings).

read -d has been used and is still being used to read text records with different delimiters, like IFS=' ' read -d , word to read comma-separated words, trimming leading and trailing spaces. |
|
Better: > In this case, implementations MAY act as if -r was also used. > If the -d delim option is specified and delim is the null string, and -r was not used, > implementations may return an error or return an implementation-defined result. |
|
> In which case how would you do IFS splitting (which is defined in terms of characters, not bytes) on non-text? IFS splitting is not defined in terms of characters. See my quotes from 2.6.5 in 0001920:0007140. |
|
> Using read with -d '' but without -r is generally a mistake. The standard already advises in APPLICATION USAGE: "portable applications need to specify −r whenever they specify −d delim (and delim is not <newline>)." |
|
So does that mean that, for instance, with IFS=m, Stéphane should be split into $'St\210' and phane in a locale that uses BIG5-HKSCS (where é is encoded as 0x88 0x6d and m as 0x6d)? That's not what I observe with any of the shells that support multi-byte text, only those that don't:

    $ LANG=zh_HK luit
    $ locale charmap
    BIG5-HKSCS
    $ bash -o posix -o noglob -c 'IFS=m; printf "<%q>\n" $1' bash Stéphane
    <Stéphane>
    $ ksh93 -o posix -o noglob -c 'IFS=m; printf "<%q>\n" $1' bash Stéphane
    <Stéphane>
    $ zsh --emulate sh -o noglob -c 'IFS=m; printf "<%q>\n" $1' bash Stéphane
    <Stéphane>
    $ yash -o posix -o noglob -c 'IFS=m; printf "<%q>\n" $1' bash Stéphane
    printf: `q' is not a valid conversion specifier
    $ yash -o posix -o noglob -c 'IFS=m; /bin/printf "<%q>\n" $1' bash Stéphane
    <Stéphane>
    $ dash -o noglob -c 'IFS=m; /bin/printf "<%q>\n" $1' bash Stéphane
    <'St'$'\210'>
    <phane>
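The two behaviours reported in this note can be modelled in a few lines. The sketch below is an illustration only: the function names are hypothetical, the bytes for é are the ones quoted in the note, and the lead-byte rule is a deliberate simplification of BIG5-HKSCS (assumption: any byte in 0x81..0xFE starts a two-byte character).

```python
# "Stéphane" with é as the two bytes 0x88 0x6D, as stated in the note above.
NAME = b"St\x88\x6dphane"
DELIM = 0x6D  # "m"


def split_bytewise(data: bytes, delim: int) -> list[bytes]:
    """Split on every occurrence of the delimiter byte."""
    return data.split(bytes([delim]))


def split_lead_byte_aware(data: bytes, delim: int) -> list[bytes]:
    """Split on the delimiter byte only when it is not a trail byte.

    ASSUMPTION: any byte in 0x81..0xFE starts a two-byte character. This is a
    simplification of BIG5-HKSCS made for illustration only.
    """
    fields, current, i = [], bytearray(), 0
    while i < len(data):
        if 0x81 <= data[i] <= 0xFE and i + 1 < len(data):
            current += data[i:i + 2]  # keep the two-byte character whole
            i += 2
        elif data[i] == delim:
            fields.append(bytes(current))
            current = bytearray()
            i += 1
        else:
            current.append(data[i])
            i += 1
    fields.append(bytes(current))
    return fields


print(split_bytewise(NAME, DELIM))         # two fields, like the dash output
print(split_lead_byte_aware(NAME, DELIM))  # one field, like bash, ksh93 and zsh
```

Neither function models the rest of shell field splitting (IFS white space, trailing delimiters); they differ only in whether the 0x6D inside é counts as a delimiter.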
|
Wow, I see the field splitting section has changed dramatically between the 2018 and 2024 editions (seemingly as the result of 0001560 which was about "clarify wording of command substitution"). And by the look of it, no shell has been updated to implement the new behaviour yet. With `IFS=mé` in my BIG5-HKSCS example above, POSIX now requires the input to be split on either `0x88 0x6d` or `0x6d` byte sequences. Does the order matter or should it look for the longest of those occurrences? For "read" specifically, for backslash processing, it may be worth clarifying that when the spec says "backslash", it means the byte encoding of that character, wherever it's found (including inside the encoding of other characters). |
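To make the question about order concrete, here are two hypothetical ways of splitting on a set of byte-sequence delimiters. Neither is claimed to be the algorithm in the standard; both function names are invented for this sketch, and the bytes for é are those quoted earlier in the thread.

```python
M = b"\x6d"            # "m"
E_ACUTE = b"\x88\x6d"  # "é" in BIG5-HKSCS, per the earlier note
NAME = b"St" + E_ACUTE + b"phane"


def split_leftmost_longest(data: bytes, delims: list[bytes]) -> list[bytes]:
    """Scan left to right; at each offset prefer the longest matching delimiter."""
    ordered = sorted(delims, key=len, reverse=True)
    fields, start, i = [], 0, 0
    while i < len(data):
        match = next((d for d in ordered if data.startswith(d, i)), None)
        if match is None:
            i += 1
            continue
        fields.append(data[start:i])
        i += len(match)
        start = i
    fields.append(data[start:])
    return fields


def split_one_delimiter_at_a_time(data: bytes, delims: list[bytes]) -> list[bytes]:
    """Split on each delimiter in turn, in the order given."""
    fields = [data]
    for d in delims:
        fields = [piece for field in fields for piece in field.split(d)]
    return fields


print(split_leftmost_longest(NAME, [M, E_ACUTE]))
print(split_one_delimiter_at_a_time(NAME, [M, E_ACUTE]))
print(split_one_delimiter_at_a_time(NAME, [E_ACUTE, M]))
```

With these inputs the left-to-right scan removes the whole é, while splitting on `m` first leaves a stray 0x88 at the end of the first field, so the outcome depends on a choice the quoted wording does not obviously settle.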
|
> The standard already advises in APPLICATION USAGE: "portable applications need to specify −r whenever they specify −d delim (and delim is not <newline>)." Ah, thank you. I missed that. I guess that technically deals with the -d '' case, because users are supposed to specify -r as well. It's kind of a footgun, as it's easy for users to specify -d without -r, but I get it. |
|
In view of 0001561 (about yash behaviour now being proscribed) and the changes wrt word splitting introduced by resolutions of 0001560 and 0001649 (which AFAICT no shell has implemented yet), I'll agree that most of the points I raised in this issue are void.

The only remaining one is the handling of backslash by the read utility in the absence of the -r option. We'd need to have either:
1. stdin shall be text (not required to end in newline, no LINE_MAX limit) if -r is not specified
2. or, in the same vein as the recent changes to IFS-splitting, change the wording so it's not the backslash character that is considered as the escaper, but the byte encoding of the backslash character, whether it's found in the encoding of backslash or that of any other character or of no character.

Option 2 would however mean that backslash processing in "read" would be done differently from anywhere else, and raises additional questions if IFS contains characters whose encoding contains that of backslash.

More generally, while I welcome the changes to word splitting that make it possible to handle arbitrary strings of non-null bytes in locales that use single-byte encodings or UTF-8 or other multi-byte encodings that don't have characters whose encoding is found inside the encoding of other characters, for locales that use multi-byte encodings such as BIG5-HKSCS or GB18030, those changes are really counterproductive and *require* shells to implement a total mess inconsistent with the rest of the system.

So it sounds to me like this current (0001920) issue should be withdrawn and another issue raised about the more general problem of locales where characters can contain the encoding of other characters. 
And it seems to me that the only sensible resolution to that one would be that those character encodings such as BIG5-HKSCS or GB18030 that have characters whose encoding contains the encoding of other characters (including from the portable charset including backslash in the case of those two) should be left out of scope of POSIX, so multi-byte aware shells such as bash/ksh93/zsh can carry on doing the more sensible thing they're doing just now and don't have to implement those changes from 0001560 other than making sure UTF-8 decoding errors (for those that decode before splitting and doing backslash processing) don't prevent splitting strings (or process backslashes) safely. Once those character encodings are out of the picture, it should also be possible to simplify the standard. |
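The second option in this note (treat the byte 0x5C as the escape wherever it occurs) can be modelled as below. This is a simplified sketch, not the behaviour of any particular shell: the name `unescape_bytewise` is hypothetical, line continuation is not modelled, and dropping a trailing escape byte is a choice made for this sketch. The α bytes are the BIG-5 values quoted earlier in the thread.

```python
def unescape_bytewise(data: bytes) -> bytes:
    """Drop each 0x5C byte and keep the byte that follows it literally."""
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == 0x5C and i + 1 < len(data):
            out.append(data[i + 1])  # the escaped byte is kept as-is
            i += 2
        elif data[i] == 0x5C:
            i += 1  # trailing escape byte: dropped in this sketch
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


print(unescape_bytewise(b"a\\ b"))      # ordinary escaped space
print(unescape_bytewise(b"\xa3\x5cx"))  # BIG-5 alpha followed by "x"
```

In the second call the trail byte of α is consumed as an escape, so the output no longer contains α: the byte-wise rule is simple to state but is not lossless for text in such encodings.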
|
Those "encodings that have characters whose encoding contains the encoding of other characters" are not the only problematic ones. Obviously, the stateful ones with locking shift would be too, but those seem to already be out of scope as per https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/basedefs/V1_chap06.html#tag_06_02 A character encoding where α is encoded as 0xaa 0xbb and β as 0xbb 0xaa, for instance, would also be problematic: with IFS='β', `αα` (encoded as 0xaa 0xbb 0xaa 0xbb) would be split into two single-byte strings (0xaa and 0xbb) by a shell that implements the algorithm from 0001560. So maybe best would be to put all non-self-synchronising encodings (https://en.wikipedia.org/wiki/Self-synchronizing_code) out of scope. |
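The α/β example can be checked directly on bytes. In this sketch the encoding is the made-up one from the note, `match_offsets` is a hypothetical helper, and `bytes.split` stands in for a purely byte-based splitter.

```python
ALPHA = b"\xaa\xbb"  # made-up encoding of alpha from the note above
BETA = b"\xbb\xaa"   # made-up encoding of beta


def match_offsets(data: bytes, delim: bytes) -> list[int]:
    """Return every offset at which delim occurs in data."""
    return [i for i in range(len(data) - len(delim) + 1) if data.startswith(delim, i)]


text = ALPHA + ALPHA             # two alphas, no beta
print(match_offsets(text, BETA))  # character boundaries are at offsets 0 and 2
print(text.split(BETA))           # the fragments left by a byte-based split
```

The only match starts at offset 1, in the middle of the first α and ending in the middle of the second, so the split leaves the fragments 0xaa and 0xbb, neither of which is a character. A self-synchronising encoding rules this out because no code word can appear straddling two others.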
|
That wouldn't be enough to accurately specify what shells do even if limited to UTF-8. Since it's now the explicit intent that variables may contain bytes that do not form valid characters, we have to ask what happens when IFS contains bytes that do not form valid characters.

In UTF-8, é is encoded as 0xC3 0xA9. 0xA9 on its own is not a valid character. But IFS can be set to 0xA9. If IFS is set to 0xA9, and X is set to 0xC3 0xA9 0xA9 0x40 (é, invalid byte, @), then in most locale-aware shells that I know of that permit arbitrary bytes in variables (bash, gwsh, bosh, ksh), $X is split into two fields, the first one being 0xC3 0xA9, the second one being @. Most shells do not do any pure byte-based splitting. Exceptions are mksh, which does appear to do exactly that (producing 0xC3, empty, 0x40), and zsh, which does not split at all in this case. Clearly the current wording is defective.

A long time ago I wrote on the mailing list in more detail about what shells actually did with variables containing bytes that do not form valid characters in the context of pattern matching (subject: "[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work") and asked whether there was any interest in getting this standardized. There was no interest then. Given the mess that we have now ended up with, please now actually look at what shells do, and specify that, rather than coming up with more broken specs that only handle the trivial cases. |
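The two results reported in this note for IFS=0xA9 can be reproduced with a small model. It is a sketch under stated assumptions, not a description of how any of the named shells is implemented: `utf8_units` and `split_character_aware` are hypothetical names, and the model treats each valid UTF-8 character as one unit and each stray byte as its own unit.

```python
def utf8_units(data: bytes) -> list[bytes]:
    """Cut data into valid UTF-8 characters, with each stray byte as its own unit."""
    units, i = [], 0
    while i < len(data):
        for width in (1, 2, 3, 4):
            chunk = data[i:i + width]
            try:
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            break
        else:
            chunk = data[i:i + 1]  # no valid character starts here
        units.append(chunk)
        i += len(chunk)
    return units


def split_character_aware(data: bytes, delim: bytes) -> list[bytes]:
    """Split only where a whole unit equals the delimiter."""
    fields, current = [], bytearray()
    for unit in utf8_units(data):
        if unit == delim:
            fields.append(bytes(current))
            current = bytearray()
        else:
            current += unit
    fields.append(bytes(current))
    return fields


X = b"\xc3\xa9\xa9\x40"  # é, a stray 0xA9, @ (values from the note above)
print(X.split(b"\xa9"))                   # pure byte-based split
print(split_character_aware(X, b"\xa9"))  # unit-based split
```

The byte-based split yields 0xC3, an empty field and @, matching what the note reports for mksh; the unit-based split yields é and @, matching what it reports for bash, gwsh, bosh and ksh.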
|
As requested by the submitter in 0001920:0007155, this bug is being marked withdrawn. We will leave it to the submitter to add another bug report for the related issue. |
Date Modified | Username | Field | Change |
---|---|---|---|
2025-04-21 07:16 | stephane | New Issue | |
2025-04-21 07:30 | stephane | Note Added: 0007139 | |
2025-04-21 07:38 | stephane | Note Edited: 0007139 | |
2025-04-22 14:46 | geoffclare | Note Added: 0007140 | |
2025-04-22 15:20 | hvd | Note Added: 0007141 | |
2025-04-22 18:45 | chet_ramey | Note Added: 0007142 | |
2025-04-23 14:59 | dwheeler | Note Added: 0007143 | |
2025-04-23 16:47 | stephane | Note Added: 0007144 | |
2025-04-23 16:57 | stephane | Note Added: 0007145 | |
2025-04-23 19:53 | dwheeler | Note Added: 0007146 | |
2025-04-24 10:54 | geoffclare | Note Added: 0007147 | |
2025-04-24 10:55 | geoffclare | Note Edited: 0007147 | |
2025-04-24 11:00 | geoffclare | Note Added: 0007148 | |
2025-04-24 11:23 | stephane | Note Added: 0007149 | |
2025-04-24 12:18 | stephane | Note Added: 0007150 | |
2025-04-24 15:43 | dwheeler | Note Added: 0007152 | |
2025-04-27 08:51 | stephane | Note Added: 0007155 | |
2025-04-28 18:37 | stephane | Note Added: 0007156 | |
2025-04-28 19:30 | hvd | Note Added: 0007158 | |
2025-04-29 13:47 | geoffclare | Project | 1003.1(2013)/Issue7+TC1 => 1003.1(2024)/Issue8 |
2025-04-30 15:26 | msbtester | Note Added: 0007163 | |
2025-05-01 15:04 | eblake | Note Added: 0007165 | |
2025-05-01 15:05 | Don Cragun | Note Added: 0007166 | |
2025-05-01 15:39 | msbrown | Note Edited: 0007163 | |
2025-05-01 15:55 | Don Cragun | Note Added: 0007168 | |
2025-05-01 15:56 | Don Cragun | Status | New => Closed |
2025-05-01 15:56 | Don Cragun | Resolution | Open => Withdrawn |
2025-05-01 15:56 | Don Cragun | Interp Status | => --- |