|ID||Category||Severity||Type||Date Submitted||Last Update|
|0001233||[1003.1(2016)/Issue7+TC2] Shell and Utilities||Objection||Enhancement Request||2019-03-08 17:32||2019-06-14 09:32|
|Priority||normal||Resolution||Accepted As Marked|
|Final Accepted Text||Note: 0004418|
|Summary||0001233: backslash needs to be escaped portably inside bracket expressions|
(line numbers are for ISBN#1-947754-05-08 (2018 edition, c181.pdf), not 1003.1(2016))
The spec says that backslash shall lose its special meaning within bracket expressions.
When it comes to fnmatch(), it says backslash escapes special characters, but it's not clear whether that is still the case inside bracket expressions.
For shell wildcards (in globs and in case statements), backslash is a quoting operator like '...' and "..." and quoting removes the special meaning of special characters (including the ^, ], - inside bracket expressions as recently clarified). The first paragraph of 2.13.1 also suggests backslash has a second meaning specific to pattern matching and separate from quoting (see bug:1190) though that doesn't match current practice in most shells. I'll cover that in a separate bug but that has some relevance here.
In practice backslash is still special inside bracket expressions in a number of utility implementations, either to remove the special meaning of ], ^, -, [:x:] or [.x.], or for ANSI C escapes like \n, \t... (whether that should happen in sed is not clear in the spec), or for extended operators like \d, \s as shorthands for [:digit:] or [:space:].
Most modern regular expressions (perl, python, tcl, php, .net...) treat backslash specially within bracket expressions. Not allowing it for POSIX RE locks them in the past and contributes to making them obsolete.
[\t] will not match on \ and t in:
- in shell globs where the pattern is literal, as \ is then just quoting the t
- in fnmatch() and find -name/-path, as backslash is a replacement for shell quoting
- in awk in text ~ "[\t]" where that \t is expanded to a TAB inside double quotes
- in awk in text ~ /[\t]/ where either \t is expanded to TAB, or \t is meant to match a TAB
- in some awk implementations in P='[\t]' awk '$0 ~ ENVIRON["P"]' (as now required by POSIX)
- in GNU sed (unless POSIXLY_CORRECT is in the environment) or busybox sed (where it matches a TAB instead)
- in vim (vi/ex) when not in compatibility mode
- as mentioned above, modern regexps.
[\^a] doesn't match on \ in the above and also in
p='[\!a]'; case "\\" in $p) echo match; esac in bash and some ash-based shells (that's different in those ash-based shells when the pattern is used for globbing instead)
Specify that within bracket expressions, backslash must be escaped or quoted for it to be matched literally. Application authors should write:
- ['\'] or [\\] in sh
- p='[\\]'; case $var in $p) ...;; esac in sh
- fnmatch(): find . -name '[\\]*' (already covered, but could be made less ambiguous)
- $0 ~ /[\\]/ in awk
- $0 ~ "[\\\\]" in awk
- BRE: grep '[\\]'
- ERE: grep -E '[\\]'
Unspecified otherwise (except for the other cases that are already specified to mean something special like awk's /[\t]/).
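The forms listed above can be tried side by side. The following is a minimal sketch, assuming an sh with grep and awk on PATH; the function names and the sample strings are made up for illustration and are not part of the proposal.

```shell
#!/bin/sh
# Hypothetical helpers: each prints "yes" if its argument contains a
# backslash, using the doubled form inside the bracket expression.

glob_has_bs() {
  case $1 in
    *[\\]*) echo yes ;;
    *)      echo no ;;
  esac
}

var_glob_has_bs() {
  p='*[\\]*'                # pattern stored in a variable, expanded unquoted
  case $1 in
    $p) echo yes ;;
    *)  echo no ;;
  esac
}

bre_has_bs() {
  if printf '%s\n' "$1" | grep '[\\]' >/dev/null; then echo yes; else echo no; fi
}

ere_has_bs() {
  if printf '%s\n' "$1" | grep -E '[\\]' >/dev/null; then echo yes; else echo no; fi
}

awk_has_bs() {
  printf '%s\n' "$1" | awk '{ print (($0 ~ /[\\]/) ? "yes" : "no") }'
}

# Try each helper on a string with a backslash and on one without.
for f in glob_has_bs var_glob_has_bs bre_has_bs ere_has_bs awk_has_bs; do
  printf '%s: %s %s\n' "$f" "$("$f" 'a\b')" "$("$f" 'ab')"
done
```

Whether an implementation treats the second backslash as redundant or as escaped by the first, each helper should report yes for a\b and no for ab.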
Also please clarify whether s/[\n]// matches a newline or not. If not, make it unspecified instead of requiring it to match on \ and n. In practice few implementations match on newline, which is unfortunate: in those that don't, one can't do things like s/[^\n]/x/ other than with something convoluted like: y/\n./.\n/;s/[^.]/x/;y/\n./.\n/
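The y-based workaround can be spelled out as follows. This is only a sketch: the helper name and the sample input are hypothetical, a g flag has been added to the substitution so that every character is replaced, and the commands are separated by newlines rather than semicolons.

```shell
#!/bin/sh
# Hypothetical helper: replace every non-newline character of a two-line
# pattern space with x, without relying on [^\n].
mask_non_newline() {
  sed 'N
y/\n./.\n/
s/[^.]/x/g
y/\n./.\n/'
}

# N joins the two input lines; the first y swaps newline and dot, so the
# only dot left in the pattern space marks where the newline was; the
# substitution spares that dot; the second y swaps them back.
printf 'a.b\ncd\n' | mask_non_newline
```

The swap relies on y treating \n as a newline, which is separate from the question of what \n means inside a bracket expression.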
The awk examples in the desired action don't match what you're asking for.
Since awk is specified as processing C-style escapes before the ERE is
interpreted as specified in XBD chapter 9, if XBD chapter 9 is changed to
say that backslash needs to be escaped in bracket expressions, then for
awk you would need:
$0 ~ /[\\\\]/
$0 ~ "[\\\\\\\\]"
So the behaviour of awk in practice is actually an argument against
changing XBD chapter 9 the way you are asking for.
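The two layers of escape processing can be illustrated with a small sketch. It assumes an awk on PATH; count_bs is a hypothetical helper that splices its argument into the awk program as the right-hand side of a match. The four- and eight-backslash forms are the ones given in this note; the shorter forms are the ones from the desired action.

```shell
#!/bin/sh
# Hypothetical helper: count how many of two fixed sample lines (a\b and ab)
# match the ERE given in $1, which is spliced into the awk program text.
count_bs() {
  printf '%s\n' 'a\b' 'ab' | awk "\$0 ~ $1 { n++ } END { print n + 0 }"
}

count_bs '/[\\]/'             # ERE token, backslash doubled once
count_bs '"[\\\\]"'           # string used as ERE, doubled for the string too
count_bs '/[\\\\]/'           # ERE token, doubled again
count_bs '"[\\\\\\\\]"'       # string used as ERE, doubled again
```

Since repeating a character inside a bracket expression changes nothing, all four forms should describe the same one-character set on implementations that accept them; the difference between the notes is which forms an implementation is required to accept.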
Another reason not to change XBD chapter 9 is that it specifies the
behaviour of regcomp() and regexec(). Are you claiming that there
are implementations of those functions which treat backslash as
special in bracket expressions? There shouldn't be, since those
functions were invented by POSIX.
So I believe XBD chapter 9 should remain as is, and instead changes
should be considered for individual use cases. For example, we could
consider specifying that sed may process C-style escapes the way awk does.
Re: Note: 0004291
> The awk examples in the desired action don't match what you're asking for.
> Since awk is specified as processing C-style escapes before the ERE is
> interpreted as specified in XBD chapter 9, if XBD chapter 9 is changed to
> say that backslash needs to be escaped in bracket expressions, then for
> awk you would need:
> $0 ~ /[\\\\]/
> $0 ~ "[\\\\\\\\]"
My point is that since several commands including awk treat \ specially within [...] (and it's not only about ANSI C escape sequences, /[\]]/ also matches on ] only in most awk implementations), the regexp syntax should mandate \ to be escaped inside bracket expressions to match a backslash literally.
I'm fine if instead of making the change to the regexp syntax you want to make it to all the standard utilities that use regexps.
An advantage of making the change to the regexp syntax is that it allows implementations to add extensions in a consistent fashion (as opposed to now, with different utilities adding their own independently).
The most I believe is needed to deal with this is a note in some
application usage type section (perhaps one might need to be added)
noting that some applications have extended the use of \ as an escape
character to allow it to work inside bracket expressions in matching, and
to advise using \\ whenever this is a possibility.
For applications that don't treat the \ in a bracket expression as anything
other than a character, this is harmless - having any character twice in a
bracket expression changes nothing - either it is there, and that character
in the string being matched against will match, or it is not there, and the
bracket expression will not match if the character in question appears at
the relevant position in the string.
For applications where \n \t (etc) are treated differently than as being
the two characters, '\' and 'n' or 't' inside a bracket expression,
explicitly writing \\n or \\t when that is the intent is harmless, and
works in both cases. On the other hand, when a newline or tab is to
be matched, it should be entered literally, rather than simply assuming
that \n or \t will work (and yes, I know some apps do not allow literal
newlines in patterns, so this is not always easy - but on the other hand,
sed excepted, most of those (grep, ...) also never attempt a match on a
string that can contain a newline, so it doesn't matter).
That the description notes:
- in GNU sed (unless POSIXLY_CORRECT is in the environment)
makes it clear that they know that they are making an extension to
standard regular expressions, and consequently a method is provided
to return to the standard behaviour.
So, I wouldn't go as far as making it unspecified whether [\n] might be
a match against a newline - in a standard regular expression it is not.
But adding a note for application (script) writers that writing [\\n] is
safe, and avoids issues, is not an unreasonable request.
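As a rough illustration of that advice, the sketch below assumes grep and printf on PATH; count_matches and its sample lines are hypothetical. It doubles the backslash when a backslash is meant, and enters the TAB literally rather than as \t.

```shell
#!/bin/sh
# Hypothetical helper: count lines of a fixed four-line sample that match
# the BRE in $1. The sample lines are: x\y, n, z and a<TAB>b.
count_matches() {
  printf '%s\n' 'x\y' 'n' 'z' "$(printf 'a\tb')" | grep -c -- "$1"
}

tab=$(printf '\t')            # a literal TAB, obtained without \t in the RE

count_matches '[\\n]'         # lines containing a backslash or an n
count_matches "[$tab]"        # lines containing a TAB
```

Both patterns mean the same thing whether or not the implementation gives backslash a special meaning inside the brackets.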
Let's not lose sight of the fact that POSIX is about specifying a portable API. When it comes to shell and utilities, it's about what people can safely put in their script so it can work everywhere. POSIX also constrains implementations.
Here by saying that \x is unspecified but [\x] or [x\] match on \ and x everywhere, first we're telling *applications* a lie because it's not the case in practice.
And then we're constraining *implementations* in a very counter-productive way. We're telling them: you can use \x as an extension, but not [^\x]. Note that it's not only about the \n, \b ANSI sequences. See also the \d, \s of perl (shortcuts for [[:digit:]], [[:space:]]) and of course \].
Those \s, \d are now found in many standard utility implementations (grep/sed/vi at least), but because POSIX requires [^\s] to match characters other than \ and s, those \s are often not recognised inside bracket expressions, which is an annoying difference from perl. I tend to use grep -P where available because that's IMO a saner regexp syntax.
Relaxing that POSIX requirement (and for what? so users can use [\/] instead of [\\/] which they won't do anyway as it often doesn't work like in shells, awk, fnmatch) would allow utilities (and regcomp()) to implement more useful extensions to the regex syntax.
On page 3222 line 108141 section sed, add a new paragraph to APPLICATION USAGE:
Some implementations of sed, when executed in a non-conforming environment, handle <backslash> escapes in regular expressions in a similar way to how awk handles them in the lexical token ERE (processing "\t" as a tab character, etc.). This is a compatible extension except that it conflicts with the requirements of this standard when <backslash> appears inside a bracket expression. A future version of this standard may allow this behavior, and therefore applications should use two <backslash> characters in bracket expressions instead of one in order to ensure future portability. On implementations conforming to the current standard, the second <backslash> is redundant. In the future (and in current non-conforming environments) the first <backslash> may escape the second.
On page 3224 line 108241 section sed, change FUTURE DIRECTIONS from:
A future version of this standard may allow sed to handle <backslash> escapes in regular expressions in a similar way to how awk handles them in the lexical token ERE. ("Similar" rather than "the same" because sed uses BREs whereas awk uses EREs.)
Re: Note: 0004418 Thanks. That is indeed an improvement and a step in the right direction IMO, but there are still at least two of the problems reported in this bug that are not addressed:
1. sed's \n
> * The escape sequence '\n' shall match a <newline> embedded in the
> pattern space. A literal <newline> shall not be used in the BRE of a
> context address or in the substitute function.
To me, that implies /[\n]/ is meant to match on newline (and that [\\n] is meant to match on either \ or newline! If you want to match on n or backslash you have to write it [n\], [n\\] in the future).
But that's not what traditional sed implementations or GNU sed with POSIXLY_CORRECT do:
$ printf '\\a\nn\n' | /usr/xpg4/bin/sed 'N;s/[\n]/x/g'
xa
x
$ printf '\\a\nn\n' | POSIXLY_CORRECT=1 gsed 'N;s/[\n]/x/g'
xa
x
Only GNU sed when not in POSIX mode is compliant here:
$ printf '\\a\nn\n' | sed 'N;s/[\n]/x/g'
\axn
But not in
$ printf '\\a\nn\n' | sed 'N;s/[\\n]/x/g'
xa
x
Again, the gsed behaviour is much more useful. Without it, it's impossible to match a newline in a bracket expression (especially useful in things like [^\n]*).
2. ex/vi
Like for sed, the most widely used implementation of ex/vi (vim) does some special handling of \ inside bracket expressions. While, like I said in the bug description, [\t] does match on \ and t in compatibility mode, [\] doesn't match on backslash (it matches on  instead); you do need [\\] to match on \.
3. more generally
But more generally, my point is that in practice, portably, we need to double the backslash inside bracket expressions for it to match on a backslash. It's already required by POSIX for shell globs, awk, find, pax patterns (not those in -s). We should probably add sed, ex, vi to the list.
That leaves expr, lex, ed, grep, csplit and pax's -s argument.
I'm not going to double check all variants of those implementations to see which of them work as POSIX specifies for patterns like [\t] or [^\]. I know *I* will use [\\t] or [^\\] if I want to match on \ and t or not-\ respectively, and $'[\t]' if I want to match on TAB, because I don't trust that all will follow that POSIX rule (which again is a restrictive rule that hinders progress, as it's much more natural to have \ play the same role inside and outside bracket expressions; I'd imagine many implementations could choose to ignore that rule), and I don't want to have to remember which applications follow it and which don't.
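For find, one of the utilities where the doubled form is already required, a minimal sketch could look like this. The helper name, the scratch directory and the file names are hypothetical, and mktemp is assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: list entries under directory $1 whose name starts
# with a backslash. -name takes an fnmatch()-style pattern, so the
# backslash is doubled inside the bracket expression.
find_bs_names() {
  ( cd "$1" && find . -name '[\\]*' | sort )
}

dir=$(mktemp -d) || exit 1    # scratch directory (mktemp assumed available)
: > "$dir/\\leading"          # a file named \leading
: > "$dir/plain"              # a file that should not be listed

found=$(find_bs_names "$dir")
printf '%s\n' "$found"

rm -rf "$dir"
```

Only the entry whose name begins with a backslash should be listed.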
Re: Note: 0004418
> ("Similar" rather than "the same" because sed uses BREs whereas awk uses EREs.)
Note 0000528 where issue8 sed will likely support -E for EREs. So starting with issue8, sed will support both BREs and EREs.
edited on: 2019-06-14 09:46
Re: Note: 0004419
> 2. ex/vi
> Like for sed, the most widely used implementation of ex/vi (vim) does
> some special handling of \ inside bracket expressions. While, like I
> said in the bug description, [\t] does match on \ and t in compatibility
> mode, [\] doesn't match on backslash (it matches on  instead), you do
> need [\\] to match on \.
FWIW, [\] doesn't work in /bin/vi, /bin/ex, /usr/xpg4/bin/vi, /usr/xpg4/bin/ex on Solaris either (it complains about an unmatched [ there). It works (matches a \) in nvi or FreeBSD vi/ex or Solaris /bin/ed or /usr/xpg4/bin/ed
|2019-03-08 17:32||stephane||New Issue|
|2019-03-08 17:32||stephane||Name||=> Stephane|
|2019-03-08 17:32||stephane||Section||=> 9.3.5|
|2019-03-08 17:32||stephane||Page Number||=> 184|
|2019-03-08 17:32||stephane||Line Number||=> 6087|
|2019-03-08 18:18||geoffclare||Note Added: 0004291|
|2019-03-08 20:29||stephane||Note Added: 0004292|
|2019-03-11 02:36||kre||Note Added: 0004295|
|2019-03-11 07:53||stephane||Note Added: 0004302|
|2019-06-13 15:19||geoffclare||Note Added: 0004418|
|2019-06-13 15:21||geoffclare||Interp Status||=> ---|
|2019-06-13 15:21||geoffclare||Final Accepted Text||=> Note: 0004418|
|2019-06-13 15:21||geoffclare||Status||New => Resolved|
|2019-06-13 15:21||geoffclare||Resolution||Open => Accepted As Marked|
|2019-06-13 15:22||geoffclare||Tag Attached: issue8|
|2019-06-14 06:36||stephane||Note Added: 0004419|
|2019-06-14 07:13||stephane||Note Added: 0004420|
|2019-06-14 09:32||stephane||Note Added: 0004422|
|2019-06-14 09:46||stephane||Note Edited: 0004422|