Austin Group Defect Tracker

Aardvark Mark IV


ID Category Severity Type Date Submitted Last Update
0000243 [1003.1(2008)/Issue 7] Shell and Utilities Objection Enhancement Request 2010-04-29 19:23 2024-06-11 08:53
Reporter dwheeler View Status public  
Assigned To ajosey
Priority normal Resolution Accepted As Marked  
Status Closed  
Name David A. Wheeler
Organization IDA
User Reference
Section find
Page Number 2740
Line Number 89194
Interp Status ---
Final Accepted Text Note: 0006110
Summary 0000243: Add -print0 to "find"
Description The POSIX specification and common implementations permit nearly all bytes to be in pathnames, and yet it is surprisingly difficult to portably and correctly process such pathnames. This is one of the more common reasons for security vulnerabilities (see CERT’s "Secure Coding" item MSC09-C, CWE 78, CWE 73, and CWE 116, and the 2009 CWE/SANS Top 25 Most Dangerous Programming Errors). For more details about this problem, see:
 http://www.dwheeler.com/essays/filenames-in-shell.html
 http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

The find command's "-exec...+" was intended to fix this, but it is simply inadequate. This is only practical for trivial commands. It also fails to acknowledge a very common construct, find ... -print0 | xargs -0, which is technically not portable (it's not in the spec) but is actually in wide use.

The 2008 specification notes that "Other implementations have added other ways to get around this problem, notably a -print0 primary that wrote filenames with a null byte terminator. This was considered here, but not adopted. Using a null terminator meant that any utility that was going to process find's -print0 output had to add a new option to parse the null terminators it would now be reading." I believe that this decision must be revisited. While it's true that adding null terminator support means that other extensions are necessary, the POSIX -exec...+ construct is simply inadequate to support robust filename processing. Complex commands are ridiculously unreadable when placed there, for example, and xargs supports other capabilities (such as limiting the number of parameters) that find does not duplicate. Nor should find duplicate xargs; the beauty of POSIX is that different tools can be good at one job. POSIX should either completely forbid characters such as newline in filenames, or it should be extended to adequately support such filenames.

The current situation is that it is too hard to *correctly* process filenames, leading to a number of security vulnerabilities. Expecting users and developers to use complicated constructs to handle filenames is unreasonable and dangerous; they should be given a safer and easy-to-use set of constructs for this common case.
Desired Action After line 89195, add:

-print0
The primary shall always evaluate as true; it shall cause the current pathname to be written to standard output, followed by a null byte.


In lines 89387-89401, delete the now-obsolete text "Other implementations... reading."

In line 89285, append: "(but note that pathnames may include newlines, so you cannot be sure that each line is actually a different pathname)"

In the "STDOUT" section, after line 89257, state:

The −print0 primary shall cause the current pathnames to be written to standard output, with each pathname terminated by a null byte. The format shall be:
           "%s", <path>
  followed by a null byte for each <path>.

Note that this change is a prerequisite for several other proposals that are necessary to make "find" useful and secure for ALL pathnames permitted by POSIX.
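
A minimal sketch of the ambiguity this request targets. It assumes a find that already implements -print0 (GNU and BSD find do) and the widely available but non-POSIX mktemp -d; the scratch directory and the file name are made up for illustration.

```shell
# Scratch directory holding exactly one file whose name contains a newline.
demo_dir=$(mktemp -d)
name=$(printf 'one\ntwo')
touch "$demo_dir/$name"

# -print: the single pathname spans two lines, so counting lines over-reports.
lines=$(( $(find "$demo_dir" -type f -print | wc -l) ))

# -print0: count the null terminators instead; there is one per pathname.
records=$(( $(find "$demo_dir" -type f -print0 | tr -dc '\000' | wc -c) ))

echo "lines=$lines records=$records"
rm -rf "$demo_dir"
```

The line count and the record count disagree for the same single file, which is the case that a consumer of -print output cannot detect on its own.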
Tags issue8
Attached Files

- Relationships
has duplicate 0000244 Closed ajosey 1003.1(2008)/Issue 7 Add -0 to xargs
has duplicate 0000245 Closed ajosey 1003.1(2008)/Issue 7 Add -0 option to shell's "read"
has duplicate 0000903 Closed 1003.1(2013)/Issue7+TC1 Please, add find -print0, xargs -0, read -d and other such options
related to 0000251 Closed ajosey 1003.1(2008)/Issue 7 Forbid newline, or even bytes 1 through 31 (inclusive), in filenames
related to 0001861 Interpretation Required 1003.1(2024)/Issue8 xargs -L broken by 0000243 resolution

-  Notes
(0000882)
Don Cragun (manager)
2011-07-06 23:54

The current plan is to add a set of byte values (based on single-byte characters in
the C Locale) that will not be allowed in newly created filenames using 0000251
as the bug to make the changes. If consensus is reached on a resolution for bug
251, the plan is to reject and close bugs 243, 244, and 245. These three bugs
will remain open until bug 251 is resolved.
(0001020)
dwheeler (reporter)
2011-11-16 18:22

On further reflection, I recommend that bugs 243, 244, and 245 be accepted, regardless of the resolution of bug 251.

Adding these capabilities will make it easier to implement portable applications. Most POSIX systems today permit filenames that include anything except NUL (including newline). Even if a future version of POSIX forbids it, there's no guarantee that implementations will move quickly to implement this change to POSIX. In addition, most application developers will want to develop software that works correctly on both older and newer systems. Technically older POSIX systems need not implement the features requested in bugs 243, 244, and 245, but they are very widely implemented.

Adding these capabilities will make many programs - and various widely-recommended and used constructs - POSIX-compliant.
(0006091)
geoffclare (manager)
2022-12-08 15:39
edited on: 2022-12-09 11:21

It is looking like the group might decide to add find -print0 and related xargs and read features (for reasons I won't go into here).

To minimise the delay to draft 3 should this be decided, here are some suggested wording changes.

Page and line numbers are for Issue 8 draft 2.1.

On page 2763 line 91806 section find (OPERANDS), change:
-print
The primary shall always evaluate as true; it shall cause the current pathname to be written to standard output.
to:
-print
The primary shall always evaluate as true; it shall cause the current pathname to be written to standard output, followed by a <newline>.
-print0
The primary shall always evaluate as true; it shall cause the current pathname to be written to standard output, followed by a null byte.

On page 2765 line 91869 section find (STDOUT), change:
current pathnames to be written
to:
current pathname to be written

After page 2765 line 91871 section find (STDOUT), add:
The -print0 primary shall cause the current pathname to be written to standard output, followed by a null byte.

On page 2766 line 91911 section find (EXAMPLES), after:
They both write out the entire directory hierarchy from the current directory.
append:
With this output format, if any pathnames include <newline> characters, it is not possible to tell where each pathname begins and ends. This problem can be avoided by omitting such pathnames:
LC_ALL=POSIX find . -name $'*\n*' -prune -o -print
or by using a sentinel in the pathname that find would never otherwise produce, such as:
find .//. -print
or by using -print0 instead of -print and processing the output with a utility that can accept null-terminated pathnames as input, such as xargs with the -0 option or read with -d "", for example:
find . -print0 | while IFS= read -rd "" file
do
    # process "$file"
done
It should be noted that using find with -print0 is less safe than using find with -exec because if find -print0 is terminated after it has written a partial pathname, the partial pathname will be processed as if it was a complete pathname.
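
As an illustration of the sentinel approach mentioned above, here is a small sketch that counts pathnames reliably from -print output. The function name count_paths is hypothetical. It relies on the observation that every pathname written by find .//. -print starts a line with ".//", while a continuation line caused by an embedded <newline> cannot, because find does not emit empty pathname components.

```shell
# Count the pathnames at and below the current directory, even when some
# of them contain <newline> characters.
# Only the first line of each pathname begins with the ".//" sentinel.
count_paths() {
  find .//. -print | grep -c '^\.//'
}
```

For example, (cd /some/dir && count_paths) counts the starting directory itself plus every entry below it.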

On page 2769 line 92033-92037 section find (RATIONALE), delete:
Other implementations [...] it would now be reading.

On page 3106 line 105084 section read (SYNOPSIS), change:
read [-r] var...
to:
read [-r] [-d delim] var...

On page 3106 line 105088 section read (DESCRIPTION), change:
By default, unless the -r option is specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of the following character, with the exception of a <newline>. If a <newline> follows the <backslash>, the read utility shall interpret this as line continuation. The <backslash> and <newline> shall be removed before splitting the input into fields.
to:
By default, unless the -r option is specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of the following character, with the exception of either <newline> or the logical line delimiter specified with the -d delim option (if it is used and delim is not <newline>); it is unspecified which. If this excepted character follows the <backslash>, the read utility shall interpret this as line continuation. The <backslash> and the excepted character shall be removed before splitting the input into fields.

On page 3106 line 105097 section read (DESCRIPTION), change:
The terminating <newline> (if any) shall be removed from the input
to:
The terminating logical line delimiter (if any) shall be removed from the input

On page 3106 line 105118 section read (OPTIONS), change:
The following option is supported:
to:
The following options shall be supported:

-d delim
If delim consists of one single-byte character, that byte shall be used as the logical line delimiter. If delim is the null string, the logical line delimiter shall be the null byte. Otherwise, the behavior is unspecified.

On page 3107 line 105125 section read (STDIN), change:
The standard input shall be a text file.
to:
If the -d delim option is not specified, or if it is specified and delim is <newline>, the standard input shall be a text file, except that it can contain lines longer than {LINE_MAX}.

If the -d delim option is specified and delim consists of one single-byte character other than <newline>, the standard input shall contain zero or more characters, shall not contain any null bytes, and (if not empty) shall end with delim.

If the -d delim option is specified and delim is the null string, the standard input shall contain zero or more bytes and (if not empty) shall end with a null byte.

After page 3108 line 105167 section read (APPLICATION USAGE), add two new paragraphs:
The -d delim option enables reading up to an arbitrary single-byte delimiter. When delim is the null string, the delimiter is the null byte and this allows read to be used to process null-terminated lists of pathnames (as produced by the find -print0 primary), with correct handling of pathnames that contain <newline> characters. Note that in order to specify the null string as the delimiter, -d and delim need to be specified as two separate arguments. Implementations differ in their handling of <backslash> for line continuation when -d delim is specified (and delim is not <newline>); some treat <backslash>delim (or <backslash><NUL> if delim is the null string) as a line continuation, whereas others still treat <backslash><newline> as a line continuation. Consequently, portable applications need to specify -r whenever they specify -d delim (and delim is not <newline>).

When the current locale is not the C or POSIX locale, pathnames can contain bytes that do not form part of a valid character, and therefore portable applications need to ensure that the current locale is the C or POSIX locale when using read with arbitrary pathnames as input. (If IFS is not set to the null string this applies even when using -d "", because the field splitting performed by read is a character-based operation.)

On page 3108 line 105186 section read (RATIONALE), change:
Although the standard input is required to be a text file
to:
Although the standard input is required to be a text file (without the {LINE_MAX} limit) when the logical line delimiter is <newline>

On page 3365 line 114578 section xargs (SYNOPSIS), change:
[-E eofstr]
to:
[-E eofstr|-0]

On page 3365 line 114593 section xargs (DESCRIPTION), change:
The application shall ensure that arguments in the standard input are separated by unquoted <blank> characters, unescaped <blank> characters, or <newline> characters. A string of zero or more non-double-quote ('"') characters and non-<newline> characters can be quoted by enclosing them in double-quotes. A string of zero or more non-<apostrophe> ('\'') characters and non-<newline> characters can be quoted by enclosing them in <apostrophe> characters. Any unquoted character can be escaped by preceding it with a <backslash>. The utility named by utility shall be executed one or more times until the end-of-file is reached or the logical end-of file string is found. The results are unspecified if the utility named by utility attempts to read from its standard input.
to:
If the -0 option is not specified, the application shall ensure that arguments in the standard input are separated by unquoted <blank> characters, unescaped <blank> characters, or <newline> characters, and quoting characters shall be interpreted as follows:

  • A string of zero or more non-double-quote ('"') non-<newline> characters can be quoted by enclosing them in double-quotes.

  • A string of zero or more non-<apostrophe> ('\'') non-<newline> characters can be quoted by enclosing them in <apostrophe> characters.

  • Any unquoted character can be escaped by preceding it with a <backslash>.

If the -0 option is specified, the application shall ensure that arguments in the standard input are separated by null bytes.

The utility named by utility shall be executed one or more times until the end-of-file is reached or the logical end-of file string is found. The results are unspecified if the utility named by utility attempts to read from its standard input.

On page 3365 line 114612 section xargs (OPTIONS -E), change:
If -E is not specified
to:
If neither -E nor -0 is specified

On page 3365 line 114617 section xargs (OPTIONS -I), change:
Insert mode: utility is executed for each logical line from standard input. Arguments in the standard input shall be separated only by unescaped <newline> characters, not by <blank> characters. Any unquoted unescaped <blank> characters at the beginning of each line shall be ignored.
to:
Insert mode: invoke utility for each argument from standard input. If -0 is not specified, arguments in the standard input shall be separated only by unescaped <newline> characters, not by <blank> characters, and any unquoted unescaped <blank> characters at the beginning of each line shall be ignored.

On page 3366 line 114625 section xargs (OPTIONS -L), change:
The utility shall be executed for each non-empty number lines of arguments from standard input. The last invocation of utility shall be with fewer lines of arguments if fewer than number remain. A line is considered to end with the first <newline> unless the last character of the line is an unescaped <blank>; a trailing unescaped <blank> signals continuation to the next non-empty line, inclusive.
to:
Invoke utility for each set of number arguments from standard input. The last invocation of utility shall be with fewer arguments if fewer than number remain. If the -0 option is not specified, each line in the standard input shall be treated as containing one argument except that empty lines shall be ignored and a line ending with a trailing unescaped <blank> shall signal continuation to the next non-empty line, inclusive; such continuation shall result in removal of all trailing unescaped <blank> characters and all <newline> characters that immediately follow them from the argument.

On page 3366 line 114644 section xargs (OPTIONS -s), change:
The total number of lines exceeds that specified by the -L option.
to:
The total number of arguments exceeds that specified by the -L option.

After page 3366 line 114655 section xargs (OPTIONS), add:
-0
Use a null byte as the input argument delimiter and do not treat any other input bytes as special.
If the mutually exclusive -0 and -E eofstr options are both specified, the behavior is unspecified, except that if eofstr is the null string the behavior shall be the same as if -0 was specified without -E eofstr.
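
A rough sketch of the difference the -0 option makes to argument splitting, assuming an xargs that already supports -0 (GNU and BSD xargs do). The input strings are invented for illustration.

```shell
# Small script that reports how many arguments the invoked utility received.
count_args='echo "$#"'

# Default parsing: unquoted blanks and newlines both separate arguments.
n_default=$(printf 'a b\nc d\n' | xargs sh -c "$count_args" sh)

# With -0: only null bytes separate arguments; blanks are ordinary data.
n_nul=$(printf 'a b\0c d\0' | xargs -0 sh -c "$count_args" sh)

echo "default=$n_default nul=$n_nul"
```

The same four words arrive as four arguments by default and as two arguments with -0.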

On page 3367 line 114664 section xargs (STDIN), change:
The standard input shall be a text file. The results are unspecified if an end-of-file condition is detected immediately following an escaped <newline>.
to:
If the -0 option is not specified, the standard input shall be a text file and the results are unspecified if an end-of-file condition is detected immediately following an escaped <newline>.

If the -0 option is specified, the standard input need not be a text file, and xargs shall process the input as bytes, not characters.

On page 3368 line 114722 section xargs (APPLICATION USAGE), change:
Note that since input is parsed as lines, ...
to:
Note that since (if -0 is not specified) input is parsed as lines, ...

On page 3368 line 114726 section xargs (APPLICATION USAGE), change:
This can be solved by ...
to:
This can be solved by using the -print0 primary of find together with the xargs -0 option, or by ...


(0006092)
stephane (reporter)
2022-12-08 16:21
edited on: 2022-12-08 16:23

> find . ! -name \*'$\n'\* -print

Should be:

LC_ALL=C find . -name $'*\n*' -prune -o -print

> LC_ALL=POSIX read -d "" -r file

Should be:

IFS= read -rd '' file

I don't know of any shell where LC_ALL=POSIX will make a difference. The -r and IFS= are needed in all of them though.

Even in yash, the only shell that does care about proper text encoding:

$ printf 'a\200b\n' | { LC_ALL=C IFS= read r a; printf '<%s>\n' "$a"; }
read: cannot read input: Invalid or incomplete multibyte or wide character
<>

(0006093)
stephane (reporter)
2022-12-08 16:32
edited on: 2022-12-08 17:02

One of the issues with find -print0 | xargs -0 cmd, and one that can make it less safe than find -exec cmd {} +, is that if find is killed for some reason, or more generally if xargs' input is truncated, you may end up passing the wrong path to cmd, as current xargs implementations that support -0 don't mandate that the records be delimited.

For instance, a:

LC_ALL=C find /var/tmp -name '*.tmp' -type d -prune -print0 |
  xargs -r0 rm -rf

Could end up running rm -rf /var if find gets killed (for example because it exceeded some resource limit) just after it has output a block that happened to end on /var or /var/.

I don't know if we can do anything about that as it's likely mandating the 0 delimiter could break some existing applications.

(that -r should also be added IMO).
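
The truncation hazard can be reproduced harmlessly with a sketch like the following, which assumes an xargs that supports -0 and accepts a final record with no terminator (as the implementations discussed here are said to do). The helper name truncated_stream is hypothetical, and printf stands in for the real command so that nothing is deleted.

```shell
# Simulate find being killed mid-pathname: the last record lacks its
# terminating null byte.
truncated_stream() {
  printf '/var/tmp/a.tmp\0/var'
}

# Show each argument xargs would hand to the command, one per line.
seen=$(truncated_stream | xargs -0 printf '<%s>\n')
printf '%s\n' "$seen"
```

The partial record /var is passed on as if it were a complete pathname, which is exactly the case that find -exec cmd {} + cannot produce.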

(0006094)
geoffclare (manager)
2022-12-09 10:50

I have edited Note: 0006091 to address points raised in Note: 0006092 and Note: 0006093. The changes made were to change the example find -name and read -d "" commands along the lines suggested, to add a note there about the safety of find -print0, and to update the addition to read APPLICATION USAGE (at line 105167) to insert "If IFS is not set to the null string" in the last sentence.

The behaviour of yash seen in Note: 0006092 is likely not yash's fault: it is probably calling a non-conforming library function to do a multi-byte to wide character conversion.
(0006095)
stephane (reporter)
2022-12-09 12:09

To clarify my previous comment, I find that LC_ALL=C or LC_ALL=POSIX is not needed in the specific case of IFS= read -rd '' var, but that's not necessarily the case if $IFS is not empty or -r is not supplied or for other values of delimiters (even single byte ones). I find ksh93u+m (one of the ksh93 forks with read -d '' support, I've not tested others) and zsh are quite buggy, I'm busy raising bug reports ATM.

It may be worth specifying that IFS= read -rd '' var should be able to read arbitrary byte values into a variable.

About yash, I think it's rather or also that yash doesn't support changing locale charmap midway through a script (within a shell invocation). For a shell that works character-based always, that's hardly surprising. If, from within a UTF-8 locale, printf '\200' | LC_ALL=C read var worked, where 0x80 is not a defined character in most C locales, what would a subsequent printf %s "$var" output when charmap is back to UTF-8? The UTF-8 encoding of some undefined character? And there's the reverse problem if calling LC_ALL=C.UTF-8 read from within a locale where the charmap has fewer characters.

It's true though that on my system, printf '\200' | LC_ALL=C yash -c 'read var' fails as mbrtowc() fails with EILSEQ which is not allowed by POSIX.
 
In any case, yash can only be used with text data, encoded in the charmap of the locale that was in effect at the time yash was invoked. In the C locale, on GNU systems at least (where wchar_t uses the Unicode codepoint), it can only deal with ASCII.
(0006100)
geoffclare (manager)
2023-01-09 16:20
edited on: 2023-01-12 09:55

Page and line numbers are for Issue 8 draft 2.1.

On page 2763 line 91806 section find (OPERANDS), change:
-print
The primary shall always evaluate as true; it shall cause the current pathname to be written to standard output.
to:
-print
The primary shall always evaluate as true; it shall cause the current pathname to be written to standard output, followed by a <newline>.
-print0
The primary shall always evaluate as true; it shall cause the current pathname to be written to standard output, followed by a null byte.

On page 2765 line 91869 section find (STDOUT), change:
current pathnames to be written
to:
current pathname to be written

After page 2765 line 91871 section find (STDOUT), add:
The -print0 primary shall cause the current pathname to be written to standard output, followed by a null byte.

On page 2766 line 91911 section find (EXAMPLES), after:
They both write out the entire directory hierarchy from the current directory.
append:
With this output format, if any pathnames include <newline> characters, it is not possible to tell where each pathname begins and ends. This problem can be avoided by omitting such pathnames:
LC_ALL=POSIX find . -name $'*\n*' -prune -o -print
or by using a sentinel in the pathname that find would never otherwise produce, such as:
find .//. -print
or by using -print0 instead of -print and processing the output with a utility that can accept null-terminated pathnames as input, such as xargs with the -0 option or read with -d "", for example:
find . -print0 | while IFS= read -rd "" file
do
    # process "$file"
done
It should be noted that using find with -print0 to pipe input to xargs -0 is less safe than using find with -exec because if find -print0 is terminated after it has written a partial pathname, the partial pathname will be processed as if it was a complete pathname.

On page 2769 line 92033-92037 section find (RATIONALE), delete:
Other implementations [...] it would now be reading.

On page 3106 line 105084 section read (SYNOPSIS), change:
read [-r] var...
to:
read [-r] [-d delim] var...

On page 3106 line 105088 section read (DESCRIPTION), change:
By default, unless the -r option is specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of the following character, with the exception of a <newline>. If a <newline> follows the <backslash>, the read utility shall interpret this as line continuation. The <backslash> and <newline> shall be removed before splitting the input into fields.
to:
By default, unless the -r option is specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of the following character, with the exception of either <newline> or the logical line delimiter specified with the -d delim option (if it is used and delim is not <newline>); it is unspecified which. If this excepted character follows the <backslash>, the read utility shall interpret this as line continuation. The <backslash> and the excepted character shall be removed before splitting the input into fields.

On page 3106 line 105097 section read (DESCRIPTION), change:
The terminating <newline> (if any) shall be removed from the input
to:
The terminating logical line delimiter (if any) shall be removed from the input

After page 3106 line 105115 section read (DESCRIPTION), add:
If end-of-file is detected before a terminating logical line delimiter is encountered, the variables specified by the var operands shall be set as described above and the exit status shall be 1.
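
A minimal sketch of how a caller can rely on that exit status, assuming a shell that already implements read -d (bash is one). The function name complete_records is hypothetical. A while loop stops as soon as read returns non-zero, so an unterminated final record is never handed to the loop body.

```shell
# Print each null-terminated record from standard input on its own line.
# An unterminated trailing record makes read return 1, which ends the loop
# before the body runs, so the partial record is silently discarded.
complete_records() {
  while IFS= read -r -d '' path; do
    printf '%s\n' "$path"
  done
}

out=$(printf 'a\0b\0/var' | complete_records)
printf '%s\n' "$out"
```

Here the records a and b are processed, and the truncated /var is not.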

On page 3106 line 105118 section read (OPTIONS), change:
The following option is supported:
to:
The following options shall be supported:

-d delim
If delim consists of one single-byte character, that byte shall be used as the logical line delimiter. If delim is the null string, the logical line delimiter shall be the null byte. Otherwise, the behavior is unspecified.
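
For the single-byte delimiter case, a short sketch, again assuming a shell that already implements read -d (bash is one). The function name split_on_colon and the sample input are made up for illustration. Both -r and IFS= are used so that backslashes and blanks in the data are kept as they are.

```shell
# Read colon-terminated records from standard input and print each one
# in brackets. Blanks inside a record are preserved.
split_on_colon() {
  while IFS= read -r -d : field; do
    printf '[%s]\n' "$field"
  done
}

fields=$(printf 'alpha:beta gamma:' | split_on_colon)
printf '%s\n' "$fields"
```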

On page 3107 line 105125 section read (STDIN), change:
The standard input shall be a text file.
to:
If the -d delim option is not specified, or if it is specified and delim consists of one single-byte character, the standard input shall contain zero or more characters and shall not contain any null bytes.

If the -d delim option is specified and delim is the null string, the standard input shall contain zero or more bytes (which need not form valid characters).

After page 3108 line 105167 section read (APPLICATION USAGE), add two new paragraphs:
The -d delim option enables reading up to an arbitrary single-byte delimiter. When delim is the null string, the delimiter is the null byte and this allows read to be used to process null-terminated lists of pathnames (as produced by the find -print0 primary), with correct handling of pathnames that contain <newline> characters. Note that in order to specify the null string as the delimiter, -d and delim need to be specified as two separate arguments. Implementations differ in their handling of <backslash> for line continuation when -d delim is specified (and delim is not <newline>); some treat <backslash>delim (or <backslash><NUL> if delim is the null string) as a line continuation, whereas others still treat <backslash><newline> as a line continuation. Consequently, portable applications need to specify -r whenever they specify -d delim (and delim is not <newline>).

When the current locale is not the C or POSIX locale, pathnames can contain bytes that do not form part of a valid character, and therefore portable applications need to ensure that the current locale is the C or POSIX locale when using read with arbitrary pathnames as input. (If IFS is not set to the null string this applies even when using -d "", because the field splitting performed by read is a character-based operation.) When reading a pathname it is also inadvisable to use the contents of the first var operand, if non-empty, when the exit status of read is 1, as it is likely the result of the command used to generate the list of pathnames (for example find with -print or -print0) being terminated after it has written a partial pathname, and consequently using it could result in the wrong pathname being processed.

On page 3108 line 105186 section read (RATIONALE), change:
Although the standard input is required to be a text file, and therefore will always end with a <newline> (unless it is an empty file), the processing of continuation lines when the −r option is not used can result in the input not ending with a <newline>. This occurs if the last line of the input file ends with a <backslash> <newline>. It is for this reason that ``if any’’ is used in ``The terminating <newline> (if any) shall be removed from the input’’ in the description. It is not a relaxation of the requirement for standard input to be a text file.
to:
Earlier versions of this standard required the standard input to be a text file, and therefore the results were undefined if the input was not empty and end-of-file was detected before a <newline> character was encountered. However, all of the most popular shell implementations have been found to have consistent behavior in this case, and so the behavior is now specified and the requirement for standard input to be a text file has been relaxed to allow non-empty input that does not end with a <newline>.

On page 3365 line 114578 section xargs (SYNOPSIS), change:
[-E eofstr]
to:
[-E eofstr|-0]

On page 3365 line 114593 section xargs (DESCRIPTION), change:
The application shall ensure that arguments in the standard input are separated by unquoted <blank> characters, unescaped <blank> characters, or <newline> characters. A string of zero or more non-double-quote ('"') characters and non-<newline> characters can be quoted by enclosing them in double-quotes. A string of zero or more non-<apostrophe> ('\'') characters and non-<newline> characters can be quoted by enclosing them in <apostrophe> characters. Any unquoted character can be escaped by preceding it with a <backslash>. The utility named by utility shall be executed one or more times until the end-of-file is reached or the logical end-of file string is found. The results are unspecified if the utility named by utility attempts to read from its standard input.
to:
If the -0 option is not specified, the application shall ensure that arguments in the standard input are separated by unquoted <blank> characters, unescaped <blank> characters, or <newline> characters, and quoting characters shall be interpreted as follows:

  • A string of zero or more non-double-quote ('"') non-<newline> characters can be quoted by enclosing them in double-quotes.

  • A string of zero or more non-<apostrophe> ('\'') non-<newline> characters can be quoted by enclosing them in <apostrophe> characters.

  • Any unquoted character can be escaped by preceding it with a <backslash>.

If the -0 option is specified, the application shall ensure that arguments in the standard input are separated by null bytes.

The utility named by utility shall be executed one or more times until the end-of-file is reached or the logical end-of file string is found. The results are unspecified if the utility named by utility attempts to read from its standard input.

On page 3365 line 114612 section xargs (OPTIONS -E), change:
If -E is not specified
to:
If neither -E nor -0 is specified

On page 3365 line 114617 section xargs (OPTIONS -I), change:
Insert mode: utility is executed for each logical line from standard input. Arguments in the standard input shall be separated only by unescaped <newline> characters, not by <blank> characters. Any unquoted unescaped <blank> characters at the beginning of each line shall be ignored.
to:
Insert mode: invoke utility for each argument from standard input. If -0 is not specified, arguments in the standard input shall be separated only by unescaped <newline> characters, not by <blank> characters, and any unquoted unescaped <blank> characters at the beginning of each line shall be ignored.

On page 3366 line 114625 section xargs (OPTIONS -L), change:
The utility shall be executed for each non-empty number lines of arguments from standard input. The last invocation of utility shall be with fewer lines of arguments if fewer than number remain. A line is considered to end with the first <newline> unless the last character of the line is an unescaped <blank>; a trailing unescaped <blank> signals continuation to the next non-empty line, inclusive.
to:
Invoke utility for each set of number arguments from standard input. The last invocation of utility shall be with fewer arguments if fewer than number remain. If the -0 option is not specified, each line in the standard input shall be treated as containing one argument except that empty lines shall be ignored and a line ending with a trailing unescaped <blank> shall signal continuation to the next non-empty line, inclusive; such continuation shall result in removal of all trailing unescaped <blank> characters and all <newline> characters that immediately follow them from the argument.

On page 3366 line 114644 section xargs (OPTIONS -s), change:
The total number of lines exceeds that specified by the -L option.
to:
The total number of arguments exceeds that specified by the -L option.

After page 3366 line 114655 section xargs (OPTIONS), add:
-0
Use a null byte as the input argument delimiter and do not treat any other input bytes as special.
If the mutually exclusive -0 and -E eofstr options are both specified, the behavior is unspecified, except that if eofstr is the null string the behavior shall be the same as if -0 was specified without -E eofstr.

On page 3367 line 114664 section xargs (STDIN), change:
The standard input shall be a text file. The results are unspecified if an end-of-file condition is detected immediately following an escaped <newline>.
to:
If the -0 option is not specified, the standard input shall be a text file and the results are unspecified if an end-of-file condition is detected immediately following an escaped <newline>.

If the -0 option is specified, the standard input need not be a text file, and xargs shall process the input as bytes, not characters.

On page 3368 line 114722 section xargs (APPLICATION USAGE), change:
Note that since input is parsed as lines, ...
to:
Note that since (if -0 is not specified) input is parsed as lines, ...

On page 3368 line 114726 section xargs (APPLICATION USAGE), change:
This can be solved by ...
to:
This can be solved by using the -print0 primary of find together with the xargs -0 option, or by ...


(0006105)
geoffclare (manager)
2023-01-10 10:08

Reopening because, as discussed on the mailing list, the xargs DESCRIPTION text is not quite right.
(0006106)
geoffclare (manager)
2023-01-10 10:32

Revised proposal for the xargs DESCRIPTION change:
If the -0 option is not specified, the application shall ensure that arguments in the standard input are delimited by unquoted <blank> characters, unescaped <blank> characters, or <newline> characters, and quoting characters shall be interpreted as follows:

  • A string of zero or more non-double-quote ('"') non-<newline> characters can be quoted by enclosing them in double-quotes.

  • A string of zero or more non-<apostrophe> ('\'') non-<newline> characters can be quoted by enclosing them in <apostrophe> characters.

  • Any unquoted character can be escaped by preceding it with a <backslash>.

Multiple adjacent delimiter characters shall be treated as a single delimiter. If the standard input is not empty and does not end with a <newline>, the behavior is undefined (because the requirement in STDIN that the input is a text file is not met in that case).

If the -0 option is specified, the application shall ensure that arguments in the standard input are delimited by null bytes. If multiple adjacent null bytes occur in the input, each null byte shall be treated as a delimiter. If the standard input is not empty and does not end with a null byte, it is unspecified whether the trailing non-null bytes are ignored or are used as the last argument passed to utility.

The utility named by utility shall be executed one or more times until the end-of-file is reached or the logical end-of-file string is found. The results are unspecified if the utility named by utility attempts to read from its standard input.


The use of "separated" in the text for the -I option should also change to "delimited".

In addition, at the end of the find EXAMPLES addition, this text:
the partial pathname will be processed as if it was a complete pathname.

should say "may" instead of "will".
(0006107)
geoffclare (manager)
2023-01-10 14:46
edited on: 2023-01-10 15:55

Another point raised on the mailing list is that xargs -0 is typically used with -r, so it would make sense to add -r as well. Here are some suggested additional changes for that...

In the find EXAMPLES change, this text:
to pipe input to xargs -0
should instead be:
to pipe input to xargs -r0

In the last paragraph of the xargs DESCRIPTION change, this sentence:
The utility named by utility shall be executed one or more times until the end-of-file is reached or the logical end-of-file string is found.
should instead be these two:
The utility named by utility shall be executed zero or more times until the end-of-file is reached or the logical end-of-file string is found. If no arguments are supplied on standard input, the utility named by utility shall be executed zero times if the -r option is specified and shall be executed exactly once if the -r option is not specified.

Extra changes to add...

On page 3365 line 114578 section xargs (SYNOPSIS), change:
[-ptx]
to:
[-prtx]

After page 3366 line 114639 section xargs (OPTIONS), add:
-r
Do not execute the utility named by utility if no arguments are supplied on standard input.

On page 3368 line 114707 section xargs (EXIT STATUS), change:
All invocations of utility returned exit status zero.
to:
Successful completion.


(0006108)
dwheeler (reporter)
2023-01-10 16:00

First: My thanks to everyone for reconsidering and moving toward acceptance of this proposal! These changes will make it a little easier to write secure portable software.

It's a fair point that trailing data without a terminating \0 could suggest partial data & thus perhaps should be ignored. However, while the current text *allows* addressing this, it doesn't *encourage* addressing this, so I don't think it encourages safe implementations. I have a minor suggestion: use IETF-like language to clarify this, to encourage "better" behavior. That is, change this:

> If the standard input is not empty and does not end with a null byte, it is unspecified whether the trailing non-null bytes are ignored or are used as the last argument passed to utility.

Into this:

> If the standard input is not empty and does not end with a null byte, an implementation should ignore the trailing non-null bytes (as this can signal incomplete data) but may use them as the last argument passed to utility.

Thanks!
(0006110)
geoffclare (manager)
2023-01-12 09:56
edited on: 2023-01-12 16:22

The following is a copy of Note: 0006100 with the updates suggested in the subsequent notes applied.

Page and line numbers are for Issue 8 draft 2.1.

On page 2763 line 91806 section find (OPERANDS), change:
-print
The primary shall always evaluate as true; it shall cause the current pathname to be written to standard output.
to:
-print
The primary shall always evaluate as true; it shall cause the current pathname to be written to standard output, followed by a <newline>.
-print0
The primary shall always evaluate as true; it shall cause the current pathname to be written to standard output, followed by a null byte.
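
As an illustration of the difference between the two primaries, here is a minimal sketch. It assumes a find that already implements -print0 (GNU and BSD versions do) and the non-standard mktemp -d; the directory and file names are invented for the demonstration.

```shell
# Scratch directory holding one ordinary name and one name with an embedded <newline>.
demo=$(mktemp -d)
touch "$demo/plain" "$demo/two
lines"

# -print writes a <newline> after each pathname, so counting lines overcounts.
count_print() { find "$demo" -type f -print | wc -l | tr -d ' '; }

# -print0 writes a null byte after each pathname, so counting null bytes is exact.
count_print0() { find "$demo" -type f -print0 | tr -dc '\000' | wc -c | tr -d ' '; }

echo "lines from -print: $(count_print), null bytes from -print0: $(count_print0)"
```

With two files present, the line count comes out as 3 while the null-byte count is 2, which is the ambiguity the new primary removes.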

On page 2765 line 91869 section find (STDOUT), change:
current pathnames to be written
to:
current pathname to be written

After page 2765 line 91871 section find (STDOUT), add:
The -print0 primary shall cause the current pathname to be written to standard output, followed by a null byte.

On page 2766 line 91911 section find (EXAMPLES), after:
They both write out the entire directory hierarchy from the current directory.
append:
With this output format, if any pathnames include <newline> characters, it is not possible to tell where each pathname begins and ends. This problem can be avoided by omitting such pathnames:
LC_ALL=POSIX find . -name $'*\n*' -prune -o -print
or by using a sentinel in the pathname that find would never otherwise produce, such as:
find .//. -print
or by using -print0 instead of -print and processing the output with a utility that can accept null-terminated pathnames as input, such as xargs with the -0 option or read with -d "", for example:
find . -print0 | while IFS= read -rd "" file
do
    # process "$file"
done
It should be noted that using find with -print0 to pipe input to xargs -r0 is less safe than using find with -exec because if find -print0 is terminated after it has written a partial pathname, the partial pathname may be processed as if it was a complete pathname.
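
For comparison, the following sketch passes the same awkward pathnames to a utility through -exec ... {} + and through -print0 piped to xargs -r0. It assumes implementations that already support -print0, -0 and -r, plus the non-standard mktemp -d; the helper names via_exec and via_xargs are hypothetical.

```shell
# Scratch directory with a name containing a <blank> and one containing a <newline>.
demo=$(mktemp -d)
touch "$demo/a b" "$demo/c
d"

# Each helper prints how many arguments the inner sh received.
via_exec()  { find "$demo" -type f -exec sh -c 'echo "$#"' sh {} +; }
via_xargs() { find "$demo" -type f -print0 | xargs -r0 sh -c 'echo "$#"' sh; }

via_exec
via_xargs
```

Both forms deliver exactly two arguments here; the difference described above only shows up when find is terminated part way through writing a pathname.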

On page 2769 line 92033-92037 section find (RATIONALE), delete:
Other implementations [...] it would now be reading.

On page 3106 line 105084 section read (SYNOPSIS), change:
read [-r] var...
to:
read [-r] [-d delim] var...

On page 3106 line 105088 section read (DESCRIPTION), change:
By default, unless the -r option is specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of the following character, with the exception of a <newline>. If a <newline> follows the <backslash>, the read utility shall interpret this as line continuation. The <backslash> and <newline> shall be removed before splitting the input into fields.
to:
By default, unless the -r option is specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of the following character, with the exception of either <newline> or the logical line delimiter specified with the -d delim option (if it is used and delim is not <newline>); it is unspecified which. If this excepted character follows the <backslash>, the read utility shall interpret this as line continuation. The <backslash> and the excepted character shall be removed before splitting the input into fields.

On page 3106 line 105097 section read (DESCRIPTION), change:
The terminating <newline> (if any) shall be removed from the input
to:
The terminating logical line delimiter (if any) shall be removed from the input

After page 3106 line 105115 section read (DESCRIPTION), add:
If end-of-file is detected before a terminating logical line delimiter is encountered, the variables specified by the var operands shall be set as described above and the exit status shall be 1.
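
A small sketch of this end-of-file behavior, using only the default <newline> delimiter so that it does not depend on -d support. The helper name read_status is hypothetical.

```shell
# Print "<exit status>:<value read>" for one read from standard input.
read_status() {
    IFS= read -r line
    printf '%s:%s\n' "$?" "$line"
}

printf 'complete\n' | read_status   # input ends with the delimiter
printf 'partial'    | read_status   # end-of-file arrives before the delimiter
```

In the second call the variable is still set to the bytes that were read, but the exit status is 1, so a caller can tell that the data was not terminated.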

On page 3106 line 105118 section read (OPTIONS), change:
The following option is supported:
to:
The following options shall be supported:

-d delim
If delim consists of one single-byte character, that byte shall be used as the logical line delimiter. If delim is the null string, the logical line delimiter shall be the null byte. Otherwise, the behavior is unspecified.

On page 3107 line 105125 section read (STDIN), change:
The standard input shall be a text file.
to:
If the -d delim option is not specified, or if it is specified and delim consists of one single-byte character, the standard input shall contain zero or more characters and shall not contain any null bytes.

If the -d delim option is specified and delim is the null string, the standard input shall contain zero or more bytes (which need not form valid characters).

After page 3108 line 105167 section read (APPLICATION USAGE), add two new paragraphs:
The -d delim option enables reading up to an arbitrary single-byte delimiter. When delim is the null string, the delimiter is the null byte and this allows read to be used to process null-terminated lists of pathnames (as produced by the find -print0 primary), with correct handling of pathnames that contain <newline> characters. Note that in order to specify the null string as the delimiter, -d and delim need to be specified as two separate arguments. Implementations differ in their handling of <backslash> for line continuation when -d delim is specified (and delim is not <newline>); some treat <backslash>delim (or <backslash><NUL> if delim is the null string) as a line continuation, whereas others still treat <backslash><newline> as a line continuation. Consequently, portable applications need to specify -r whenever they specify -d delim (and delim is not <newline>).

When the current locale is not the C or POSIX locale, pathnames can contain bytes that do not form part of a valid character, and therefore portable applications need to ensure that the current locale is the C or POSIX locale when using read with arbitrary pathnames as input. (If IFS is not set to the null string this applies even when using -d "", because the field splitting performed by read is a character-based operation.) When reading a pathname it is also inadvisable to use the contents of the first var operand, if non-empty, when the exit status of read is 1, as it is likely the result of the command used to generate the list of pathnames (for example find with -print or -print0) being terminated after it has written a partial pathname, and consequently using it could result in the wrong pathname being processed.
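
The advice in these two paragraphs could be followed along the lines of the sketch below. It assumes a shell whose read already supports -d (bash, for example); the function name list_items and the sample input are hypothetical.

```shell
# Runs in a subshell so the locale change does not leak into the caller.
list_items() (
    LC_ALL=C; export LC_ALL   # treat the input as bytes, as recommended above
    count=0
    # IFS= disables field splitting; -r avoids the unspecified <backslash> handling.
    while IFS= read -r -d '' item; do
        count=$((count + 1))
        printf '[%s]' "$item"
    done
    # When read fails, $item may hold an unterminated fragment; it is deliberately ignored.
    printf ' count=%s\n' "$count"
)

# Two terminated items followed by a fragment with no terminating null byte.
printf 'a b\0c\\d\0trunc' | list_items
```

Because the loop body only runs when read succeeds, the unterminated fragment is never processed as if it were a complete pathname.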

On page 3108 line 105186 section read (RATIONALE), change:
Although the standard input is required to be a text file, and therefore will always end with a <newline> (unless it is an empty file), the processing of continuation lines when the -r option is not used can result in the input not ending with a <newline>. This occurs if the last line of the input file ends with a <backslash> <newline>. It is for this reason that "if any" is used in "The terminating <newline> (if any) shall be removed from the input" in the description. It is not a relaxation of the requirement for standard input to be a text file.
to:
Earlier versions of this standard required the standard input to be a text file, and therefore the results were undefined if the input was not empty and end-of-file was detected before a <newline> character was encountered. However, all of the most popular shell implementations have been found to have consistent behavior in this case, and so the behavior is now specified and the requirement for standard input to be a text file has been relaxed to allow non-empty input that does not end with a <newline>.

On page 3365 line 114578 section xargs (SYNOPSIS), change:
[-ptx]
to:
[-prtx]

On page 3365 line 114578 section xargs (SYNOPSIS), change:
[-E eofstr]
to:
[-E eofstr|-0]

On page 3365 line 114593 section xargs (DESCRIPTION), change:
The application shall ensure that arguments in the standard input are separated by unquoted <blank> characters, unescaped <blank> characters, or <newline> characters. A string of zero or more non-double-quote ('"') characters and non-<newline> characters can be quoted by enclosing them in double-quotes. A string of zero or more non-<apostrophe> ('\'') characters and non-<newline> characters can be quoted by enclosing them in <apostrophe> characters. Any unquoted character can be escaped by preceding it with a <backslash>. The utility named by utility shall be executed one or more times until the end-of-file is reached or the logical end-of-file string is found. The results are unspecified if the utility named by utility attempts to read from its standard input.
to:
If the -0 option is not specified, the application shall ensure that arguments in the standard input are delimited by unquoted <blank> characters, unescaped <blank> characters, or <newline> characters, and quoting characters shall be interpreted as follows:

  • A string of zero or more non-double-quote ('"') non-<newline> characters can be quoted by enclosing them in double-quotes.

  • A string of zero or more non-<apostrophe> ('\'') non-<newline> characters can be quoted by enclosing them in <apostrophe> characters.

  • Any unquoted character can be escaped by preceding it with a <backslash>.

Multiple adjacent delimiter characters shall be treated as a single delimiter. If the standard input is not empty and does not end with a <newline>, the behavior is undefined (because the requirement in STDIN that the input is a text file is not met in that case).

If the -0 option is specified, the application shall ensure that arguments in the standard input are delimited by null bytes. If multiple adjacent null bytes occur in the input, each null byte shall be treated as a delimiter. If the standard input is not empty and does not end with a null byte, xargs should ignore the trailing non-null bytes (as this can signal incomplete data) but may use them as the last argument passed to utility.

The utility named by utility shall be executed zero or more times until the end-of-file is reached or the logical end-of-file string is found. If no arguments are supplied on standard input, the utility named by utility shall be executed zero times if the -r option is specified and shall be executed exactly once if the -r option is not specified. The results are unspecified if the utility named by utility attempts to read from its standard input.
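
The different treatment of adjacent delimiters with and without -0 can be seen in a sketch like the following. It assumes an xargs that already implements -0; the helper name count_args is hypothetical.

```shell
# Print the number of arguments that xargs hands to the utility.
count_args() { xargs "$@" sh -c 'echo "$#"' sh; }

# Without -0, the run of <blank> and <newline> characters acts as a single delimiter.
printf 'a   b\n\n' | count_args

# With -0, every null byte delimits, so the adjacent pair produces an empty argument.
printf 'a\0\0b\0' | count_args -0
```

Under the wording above the first call should report 2 arguments and the second 3, the middle one being the null string.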

On page 3365 line 114612 section xargs (OPTIONS -E), change:
If -E is not specified
to:
If neither -E nor -0 is specified

On page 3365 line 114617 section xargs (OPTIONS -I), change:
Insert mode: utility is executed for each logical line from standard input. Arguments in the standard input shall be separated only by unescaped <newline> characters, not by <blank> characters. Any unquoted unescaped <blank> characters at the beginning of each line shall be ignored.
to:
Insert mode: invoke utility for each argument from standard input. If -0 is not specified, arguments in the standard input shall be delimited only by unescaped <newline> characters, not by <blank> characters, and any unquoted unescaped <blank> characters at the beginning of each line shall be ignored.
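
A short sketch of insert mode combined with -0, assuming an xargs that supports both options together. The helper name wrap and the replacement string {} are illustrative choices.

```shell
# Invoke echo once per null-terminated argument, substituting it for {}.
wrap() { xargs -0 -I '{}' echo '<{}>'; }

# The second argument starts with two <blank> characters, which -0 leaves alone.
printf 'one two\0  three\0' | wrap
```

Since the <newline> and leading-<blank> rules apply only when -0 is not specified, each argument reaches echo byte for byte.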

On page 3366 line 114625 section xargs (OPTIONS -L), change:
The utility shall be executed for each non-empty number lines of arguments from standard input. The last invocation of utility shall be with fewer lines of arguments if fewer than number remain. A line is considered to end with the first <newline> unless the last character of the line is an unescaped <blank>; a trailing unescaped <blank> signals continuation to the next non-empty line, inclusive.
to:
Invoke utility for each set of number arguments from standard input. The last invocation of utility shall be with fewer arguments if fewer than number remain. If the -0 option is not specified, each line in the standard input shall be treated as containing one argument except that empty lines shall be ignored and a line ending with a trailing unescaped <blank> shall signal continuation to the next non-empty line, inclusive; such continuation shall result in removal of all trailing unescaped <blank> characters and all <newline> characters that immediately follow them from the argument.

After page 3366 line 114639 section xargs (OPTIONS), add:
-r
Do not execute the utility named by utility if no arguments are supplied on standard input.
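
The effect of -r on empty input can be checked with a sketch such as this one, which assumes an xargs that accepts -r. The helper names are hypothetical.

```shell
# Count how many times the utility ran by counting its output lines.
runs_with_r()    { : | xargs -r echo ran | wc -l | tr -d ' '; }
runs_without_r() { : | xargs    echo ran | wc -l | tr -d ' '; }

echo "with -r: $(runs_with_r)"
echo "without -r: $(runs_without_r)"
```

With -r the count is 0. Without -r the DESCRIPTION wording above calls for exactly one execution, although some existing implementations (FreeBSD's xargs, according to its manual page) do not run the utility on empty input, so the second number may differ from system to system.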

On page 3366 line 114644 section xargs (OPTIONS -s), change:
The total number of lines exceeds that specified by the -L option.
to:
The total number of arguments exceeds that specified by the -L option.

After page 3366 line 114655 section xargs (OPTIONS), add:
-0
Use a null byte as the input argument delimiter and do not treat any other input bytes as special.
If the mutually exclusive -0 and -E eofstr options are both specified, the behavior is unspecified, except that if eofstr is the null string the behavior shall be the same as if -0 was specified without -E eofstr.
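
To show what "do not treat any other input bytes as special" means in practice, here is a minimal sketch that assumes an xargs with -0 support. The helper name show and the sample strings are invented.

```shell
# Print each argument received from xargs on its own line, in brackets.
show() { xargs -0 printf '[%s]\n'; }

# An apostrophe, a <backslash> and double-quotes, each terminated by a null byte.
printf '%s\0' "it's" 'back\slash' '"quoted"' | show
```

Without -0 the same bytes would be subject to the quoting and escaping rules in the DESCRIPTION, and the lone apostrophe would be an unterminated quote.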

On page 3367 line 114664 section xargs (STDIN), change:
The standard input shall be a text file. The results are unspecified if an end-of-file condition is detected immediately following an escaped <newline>.
to:
If the -0 option is not specified, the standard input shall be a text file and the results are unspecified if an end-of-file condition is detected immediately following an escaped <newline>.

If the -0 option is specified, the standard input need not be a text file, and xargs shall process the input as bytes, not characters.

On page 3368 line 114707 section xargs (EXIT STATUS), change:
All invocations of utility returned exit status zero.
to:
Successful completion.

On page 3368 line 114722 section xargs (APPLICATION USAGE), change:
Note that since input is parsed as lines, ...
to:
Note that since input is parsed as lines (if -0 is not specified), ...

On page 3368 line 114726 section xargs (APPLICATION USAGE), change:
This can be solved by ...
to:
This can be solved by using the -print0 primary of find together with the xargs -0 option, or by ...

On page 3370 line 114830 section xargs (FUTURE DIRECTIONS), change "None" to:
A future version of this standard may require that, when the -0 option is specified, if the standard input is not empty and does not end with a null byte, xargs ignores the trailing non-null bytes.



Issue History
Date Modified Username Field Change
2010-04-29 19:23 dwheeler New Issue
2010-04-29 19:23 dwheeler Status New => Under Review
2010-04-29 19:23 dwheeler Assigned To => ajosey
2010-04-29 19:23 dwheeler Name => David A. Wheeler
2010-04-29 19:23 dwheeler Organization => IDA
2010-04-29 19:23 dwheeler Section => find
2010-04-29 19:23 dwheeler Page Number => 2740
2010-04-29 19:23 dwheeler Line Number => 89194
2011-07-06 23:42 Don Cragun Relationship added related to 0000244
2011-07-06 23:42 Don Cragun Relationship added related to 0000245
2011-07-06 23:54 Don Cragun Note Added: 0000882
2011-11-16 18:22 dwheeler Note Added: 0001020
2015-03-12 16:15 Don Cragun Relationship added has duplicate 0000903
2022-12-08 15:39 geoffclare Note Added: 0006091
2022-12-08 15:40 geoffclare Note Edited: 0006091
2022-12-08 16:21 stephane Note Added: 0006092
2022-12-08 16:23 stephane Note Edited: 0006092
2022-12-08 16:32 stephane Note Added: 0006093
2022-12-08 17:02 stephane Note Edited: 0006093
2022-12-09 10:22 geoffclare Note Edited: 0006091
2022-12-09 10:30 geoffclare Note Edited: 0006091
2022-12-09 10:44 geoffclare Note Edited: 0006091
2022-12-09 10:50 geoffclare Note Added: 0006094
2022-12-09 11:21 geoffclare Note Edited: 0006091
2022-12-09 12:09 stephane Note Added: 0006095
2023-01-09 16:13 Don Cragun Relationship replaced has duplicate 0000244
2023-01-09 16:17 Don Cragun Relationship replaced has duplicate 0000245
2023-01-09 16:20 geoffclare Note Added: 0006100
2023-01-09 16:23 geoffclare Note Edited: 0006100
2023-01-09 16:24 geoffclare Note Edited: 0006100
2023-01-09 16:26 geoffclare Interp Status => ---
2023-01-09 16:26 geoffclare Final Accepted Text => Note: 0006100
2023-01-09 16:26 geoffclare Status Under Review => Resolved
2023-01-09 16:26 geoffclare Resolution Open => Accepted As Marked
2023-01-09 16:26 geoffclare Tag Attached: issue8
2023-01-09 17:07 geoffclare Note Edited: 0006100
2023-01-10 10:08 geoffclare Note Added: 0006105
2023-01-10 10:08 geoffclare Status Resolved => Under Review
2023-01-10 10:08 geoffclare Resolution Accepted As Marked => Reopened
2023-01-10 10:32 geoffclare Note Added: 0006106
2023-01-10 14:46 geoffclare Note Added: 0006107
2023-01-10 14:50 geoffclare Note Edited: 0006107
2023-01-10 15:55 geoffclare Note Edited: 0006107
2023-01-10 16:00 dwheeler Note Added: 0006108
2023-01-10 16:00 dwheeler Note Added: 0006109
2023-01-10 16:50 dwheeler Note Deleted: 0006109
2023-01-12 09:55 geoffclare Note Edited: 0006100
2023-01-12 09:56 geoffclare Note Added: 0006110
2023-01-12 16:22 geoffclare Note Edited: 0006110
2023-01-12 16:25 geoffclare Final Accepted Text Note: 0006100 => Note: 0006110
2023-01-12 16:25 geoffclare Status Under Review => Resolved
2023-01-12 16:25 geoffclare Resolution Reopened => Accepted As Marked
2023-01-12 17:34 dwheeler Issue Monitored: dwheeler
2023-01-17 12:11 geoffclare Status Resolved => Applied
2023-08-22 06:28 Don Cragun Relationship added related to 0000251
2024-06-11 08:53 agadmin Status Applied => Closed
2024-10-17 09:12 geoffclare Relationship added related to 0001861

