Austin Group Defect Tracker

ID 0001551
Category [Issue 8 drafts] Shell and Utilities
Severity Objection
Type Clarification Requested
Date Submitted 2022-01-14 05:39
Last Update 2022-04-28 15:15
Reporter calestyo
View Status public
Assigned To
Priority normal
Resolution Duplicate
Status Closed
Product Version Draft 2.1
Name Christoph Anton Mitterer
Organization
User Reference
Section Utilities, sed
Page Number 3132, ff. (in the draft)
Line Number see below
Final Accepted Text
Summary 0001551: sed: ambiguities in the how BREs/EREs are parsed/interpreted between delimiters (especially when these are special characters)
Description Hey.

First of all, I've already asked/reported all this on the mailing list:
"sed and delimiters that are also special characters to REs"
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33587&limit=100&offset=0&sid=
(unfortunately there seems to be no thread-view)

So far, no one could really answer the core questions (or if I just didn't understand, then my apologies for writing again here).


I was looking into using BREs/EREs within delimiters, which as far as POSIX is concerned should be only sed, and in:
- context addresses (e.g. /RE/ or \xREx with x being another delimiter, of which the 1st needs to be quoted if not / )
- s-command


(I made another ticket (https://www.austingroupbugs.net/view.php?id=1550) with respect to clarifications/ambiguities about context addresses and delimiters, which may be a bit related.)


This ticket covers presumed ambiguities in:
When BREs/EREs are used within delimiters...
AND
... the delimiter is a special character (or a character that would be special if quoted with a \).
(in the above 3 cases, though in my examples I use only the s-command).



As far as I can see, the documentation says with respect to the delimiters and their literal use in REs (or replacements) only:

[i] »If the character designated by c appears following a <backslash>, then it shall be considered to be that literal character, which shall not terminate the RE. For example, in the context address "\xabc\xdefx", the second x stands for itself, so that the RE is "abcxdef".«
(line 106088 et seq., in the draft)

[ii] »Within the RE and the replacement, the RE delimiter itself can be used
as a literal character if it is preceded by a <backslash>.«
(line 106204 et seq., in the draft)

[iii] for the s-command:
»Any character other than <backslash> or <newline> can be used instead of a <slash> to delimit the RE and the replacement. Within the RE and the replacement, the RE delimiter itself can be used as a literal character if it is preceded by a <backslash>.«
(line 106202 et seq., in the draft)



IMO, that leaves open a number of questions and ambiguities:



1) How are strings/commands with delimiters actually parsed (or split up)?

Consider the following example:
s(\\((X(


There are IMO at least two ways to parse that:

a) two stages
- 1st: splitting up into RE and replacement parts first, by going through the string and looking for any delimiter which is not immediately preceded by \ , which is here the 3rd ( .
- 2nd: taking the two parts (RE and replacement) and unquoting any quoted delimiter \(
  RE-part = \\(
  replacement-part = X
  unquoted:
  RE-part = \( (here the \( became a ( with respect to the RE)
  replacement-part = X

now parse the RE \( as usual... assuming a BRE we'd end up with \( as the sequence starting a sub-pattern.

So effectively here we'd get:
s/\(/X/
=> would as such be an error, but there could of course have been an 'abc\)' in the RE, making it valid.


b) one stage
going from left to right applying the varying rules (for REs and delimiters), whichever comes first, resulting in:
s( ah, an s command with ( as delimiter
  \\ parser first sees these, makes them a literal \
    (( ah, the 2nd and 3rd delimiter
      X( flags to the s command
=> would likely be an error, given the unknown flags


I couldn't find any place, where it really says clearly (or unclearly) how the parsing is to be done.
Just because (b) seems the more logical way to do it, doesn't make it mandatory.

Especially, (a) doesn't seem to be ruled out, take [i], [ii], [iii] which all effectively say that if the delimiter is preceded by \ it's taken literally... that wording points IMO actually more towards (a), because (b) would require a wording, that says something like:
"if the delimiter is preceded by \ THAT BY ITSELF IS NOT ESCAPED"

And both ways (and there might be more crazy ways to do it ;-) )... produce quite different results.
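
To make the difference between (a) and (b) concrete, here is a minimal Python sketch of the two readings. It is a model only, not sed and not taken from any implementation; the function names `split_two_stage` and `split_one_stage` are made up for this illustration, and bracket expressions are deliberately ignored.

```python
def split_two_stage(cmd):
    """Reading (a): find the delimiters first, unquote afterwards."""
    delim, body = cmd[1], cmd[2:]
    fields, start = [], 0
    for i, ch in enumerate(body):
        # A delimiter terminates a field unless the character immediately
        # before it is a backslash (whether or not that backslash is escaped).
        if ch == delim and (i == 0 or body[i - 1] != "\\"):
            fields.append(body[start:i])
            start = i + 1
            if len(fields) == 2:
                break
    if len(fields) < 2:
        raise ValueError("unterminated s command")
    # Second stage: unquote every "\<delimiter>" in both fields.
    regex, replacement = (f.replace("\\" + delim, delim) for f in fields)
    return regex, replacement, body[start:]


def split_one_stage(cmd):
    """Reading (b): one left-to-right pass; a backslash consumes the next character."""
    delim, body = cmd[1], cmd[2:]
    fields, current, i = [], "", 0
    while i < len(body) and len(fields) < 2:
        if body[i] == "\\" and i + 1 < len(body):
            current += body[i:i + 2]      # escape pair, kept as written
            i += 2
        elif body[i] == delim:
            fields.append(current)        # unescaped delimiter ends the field
            current = ""
            i += 1
        else:
            current += body[i]
            i += 1
    if len(fields) < 2:
        raise ValueError("unterminated s command")
    return fields[0], fields[1], body[i:]


cmd = r"s(\\((X("
print("RE=%s replacement=%s flags=%s" % split_two_stage(cmd))  # RE=\( replacement=X flags=
print("RE=%s replacement=%s flags=%s" % split_one_stage(cmd))  # RE=\\ replacement= flags=X(
```

Both functions agree on ordinary input such as `s/a/b/g`; they only diverge once a backslash that is itself escaped sits directly in front of a delimiter.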




2) What if the delimiter is a special character (assuming BREs here)?

[i], [ii], [iii] all effectively say, that if the delimiter is preceded by \ it's taken literally.


a) What does »literally« mean here?
- It's not taken as a delimiter, but "directly" used as RE?
So if one has:
s.\..X.
it would be used as:
s/./X/
(btw, this is what GNU sed does:
   $ printf '%s\n' '.' | sed 's.\..X.'
   X
   $ printf '%s\n' 'v' | sed 's.\..X.'
   X
)

or:

- It's not taken as delimiter AND in RE context it would also be literal, even if it would normally be a special character:
So if one has:
s.\..X.
it would be used as:
s/\./X/
(btw, this is what BusyBox sed does:
   $ printf '%s\n' '.' | busybox sed 's.\..X.'
   X
   $ printf '%s\n' 'v' | busybox sed 's.\..X.'
   v
)

And again, as above, the standard says "if the delimiter is preceded by \" ... it does not say, that the \ is by itself NOT escaped, which links to (1), that is: how the parsing is actually done?!

=> The standard should clarify this ambiguity. That two widely used implementations (GNU vs. BusyBox) already behave differently shows that there is something fishy.
And if it's undefined, the standard should also mention that (and probably describe it with an example) and warn against using any special characters as delimiters.
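
The two readings of »literally« in (2a) can be modelled in a few lines of Python. This is only a sketch: `effective_regex` is a hypothetical helper, and Python's `re` module stands in for a BRE engine, which is a safe substitution here only because `.` and `\.` mean the same thing in both. It covers delimiters that are special without a backslash, nothing more.

```python
import re


def effective_regex(regex, delim, reading):
    """Rewrite the RE as written between delimiters into what the RE parser sees.

    reading == "special": "\\<delim>" loses its backslash and keeps any
                          special meaning (what GNU sed was observed to do).
    reading == "literal": "\\<delim>" stays escaped, so a special character
                          becomes literal (what BusyBox sed was observed to do).
    """
    out, i = [], 0
    while i < len(regex):
        if regex[i] == "\\" and i + 1 < len(regex):
            pair = regex[i:i + 2]
            if pair[1] == delim and reading == "special":
                out.append(delim)
            else:
                out.append(pair)
            i += 2
        else:
            out.append(regex[i])
            i += 1
    return "".join(out)


# The RE part of  s.\..X.  is  \.  with delimiter  .
for reading in ("special", "literal"):
    pattern = effective_regex(r"\.", ".", reading)
    results = [re.sub(pattern, "X", line, count=1) for line in (".", "v")]
    print(reading, pattern, results)
```

Under the "special" reading both input lines become `X`; under the "literal" reading only the line containing `.` does, which matches the two transcripts above.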


b) Depending on the answer to (2a), there is no mention of whether one can get back the special meaning or the literal character, respectively.

In both cases, the question would arise:
Other than using a more sane delimiter ;-) ...

- »literally« means, it's no longer a delimiter, but other than that goes directly into the RE and may be special there:
then: s.\..X. would be effectively s/./X/
... can I get the literal . here and if so how?

- »literally« means, literal even with respect to the RE:
then: s.\..X. would be effectively s/\./X/
... can I get the special meaning . here and if so how?

=> even if it's simply not possible to get the other meaning (whichever is actually the "right" one), the standard should explicitly mention that.


c) (2a) and (2b) also affect characters that get their special meaning (in the RE) only when preceded by \ .

Consider:
s(\((X(

Unlike above in (1) (where s(\\((X( was used), there is no parsing ambiguity here, and the command should effectively be the same as:
s/<something>/X/

Again, for <something> the question from (2a) comes up:
What does the RE see?
- ( (the literal ( )
(btw, this is what GNU sed does:
   $ printf '%s\n' '(' | sed 's(\((X('
   X
   $ printf '%s\n' 'v' | sed 's(\((X('
   v
)

or:
- \( (the sequence \( which starts a subpattern)
(I know the resulting RE would lack a closing '\)' and something within the
subexpression ... but that could be easily added.)
(btw, this is what BusyBox sed does:
   $ printf '%s\n' 'anything' | busybox sed 's(\((X('
   sed: bad regex '\(': Unmatched ( or \(
)


So as one can see, the same questions as in (2a) and (2b) pop up for such characters that get their special meaning only when preceded by \ .
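
The character classes involved in (2a) to (2c) can be written down as a small Python sketch. The sets follow the BRE rules in XBD 9.3 (position-dependent cases such as a leading `*` or a non-anchoring `^` are ignored for brevity); the helper names `seen_by_re` and `meaning_in_bre` are hypothetical and exist only for this illustration.

```python
# Characters that are special in a BRE without a backslash (XBD 9.3.3).
BRE_SPECIAL = set(".[\\*^$")
# Ordinary characters for which POSIX defines a meaning when escaped:
# \( \) \{ \} and the back-references \1 .. \9 (XBD 9.3.2).
BRE_SPECIAL_WHEN_ESCAPED = set("(){}") | set("123456789")


def seen_by_re(delim, keep_backslash):
    """What reaches the RE parser for an escaped delimiter \\<delim>."""
    return "\\" + delim if keep_backslash else delim


def meaning_in_bre(token):
    """Classify a one- or two-character BRE token."""
    if token.startswith("\\") and len(token) == 2:
        c = token[1]
        if c in BRE_SPECIAL:
            return "literal"            # e.g. \. matches a period
        if c in BRE_SPECIAL_WHEN_ESCAPED:
            return "special"            # e.g. \( opens a subexpression
        return "undefined by POSIX"     # e.g. \w, free for extensions
    return "special" if token in BRE_SPECIAL else "literal"


for delim in ".(1w":
    dropped = meaning_in_bre(seen_by_re(delim, keep_backslash=False))
    kept = meaning_in_bre(seen_by_re(delim, keep_backslash=True))
    print(delim, "backslash dropped:", dropped, "| backslash kept:", kept)
```

The loop shows that dropping or keeping the backslash flips the meaning in opposite directions for `.` and for `(`, which is why the question has to be answered separately for both classes of characters.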



d) I found that e.g. GNU's sed (which, per (2a), uses the quoted delimiter that is a special character AS special character in the RE) allows the following workaround (to get the literal character):
s.[.].X.
which seems then to be used (by GNU sed) as:
s/[.]/X/

But again, at least to me the standard seems to be ambiguous with
respect to how the original form should be parsed (see point (1) above).

While the bracket expression itself is defined to take the . inside
literally, POSIX nowhere seems to say that this is even to be seen as a
'.' for the RE and not as the 2nd delimiter.

Instead, [i], [ii] and [iii] rather seem to imply that, because the (2nd) . is not preceded by \ , it *IS* taken as a delimiter.

(GNU sed:
   $ printf '%s\n' '.' | sed 's.[.].X.'
   X
   $ printf '%s\n' 'v' | sed 's.[.].X.'
   v
)

(BusyBox sed also works like that:
   $ printf '%s\n' '.' | busybox sed 's.[.].X.'
   X
   $ printf '%s\n' 'v' | busybox sed 's.[.].X.'
   v
but since BusyBox sed anyway seems to treat the quoted delimiter \. as literal . in the RE (unlike GNU sed):
   $ printf '%s\n' '.' | busybox sed 's.\..X.'
   X
   $ printf '%s\n' 'v' | busybox sed 's.\..X.'
   v
the "trick" is not really a workaround to get the "other meaning" (which would be the special meaning of . here).
)
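
The question in (2d), whether a delimiter inside a bracket expression terminates the RE, can be illustrated with a small scanner. The following Python sketch is an assumption about how such a scanner could look, not the code of GNU or BusyBox sed; `find_re_end` and `skip_bracket` are hypothetical names. With `brackets=False` it follows the strict reading of [i] to [iii]; with `brackets=True` it behaves like the implementations observed above.

```python
def skip_bracket(body, i):
    """Return the index just past the bracket expression opening at body[i]."""
    j = i + 1
    if j < len(body) and body[j] == "^":
        j += 1                      # non-matching list marker
    if j < len(body) and body[j] == "]":
        j += 1                      # a leading ] is an ordinary list member
    while j < len(body):
        if body[j] == "[" and body[j + 1:j + 2] in (".", ":", "="):
            # [.x.], [:class:] or [=x=]: jump past the closing ".]", ":]", "=]"
            close = body.find(body[j + 1] + "]", j + 2)
            if close == -1:
                break
            j = close + 2
        elif body[j] == "]":
            return j + 1
        else:
            j += 1
    raise ValueError("unterminated bracket expression")


def find_re_end(body, delim, brackets=True):
    """Return the index in body of the delimiter that terminates the RE."""
    i = 0
    while i < len(body):
        if body[i] == "\\":
            i += 2                  # an escaped character never terminates
        elif body[i] == delim:
            return i
        elif body[i] == "[" and brackets:
            i = skip_bracket(body, i)
        else:
            i += 1
    raise ValueError("unterminated RE")


body = "[.].X."                     # everything after "s." in  s.[.].X.
print(body[:find_re_end(body, ".")])                   # [.]
print(body[:find_re_end(body, ".", brackets=False)])   # [
```

With bracket expressions skipped, the RE is `[.]`; with the strict reading it would be the lone `[`, which is not a valid RE.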




3) Probably just a bug:

Not really an issue with POSIX, but just as an example of how confusing things apparently are:


(GNU sed:
   $ printf '%s\n' '9+' | sed 's+9\++X+'
   X
   $ printf '%s\n' '99+' | sed 's+9\++X+'
   9X
   $ printf '%s\n' '999+' | sed 's+9\++X+'
   99X
)
These results are IMO fine, regardless of my other questions above.

In BREs, + alone is never special, and whether one parses all at once
from left to right (as in (1b))... or first looks for unquoted delimiter characters
and splits the command there (as in (1a))...
... the RE should always be effectively the string 9+ ... which is (in
BREs) the literal 9 followed by the literal + .


However...


(BusyBox sed:
   $ printf '%s\n' '9+' | busybox sed 's+9\++X+'
   X+
   $ printf '%s\n' '99+' | busybox sed 's+9\++X+'
   X+
   $ printf '%s\n' '999+' | busybox sed 's+9\++X+'
   X+
)
somehow does both:
- makes the \+ a non-delimiter
- transforms the (wrt BRE) non-special character + into a special one.

Which I think is generally (regardless of the interpretation or any
ambiguities in POSIX) a bug (I'll report it there).


All the above should also at least partially apply to context addresses.

I'd guess none of it applies to the y-command, though... but I haven't really looked at it.
Desired Action see above, clarify things and resolve ambiguities


Thanks,
Chris.
Tags No tags attached.
Attached Files summary-of-literal-behaviour-gnu-vs-busybox.txt (373 bytes) 2022-01-14 22:15

Relationships
parent of 0001578 (Closed) sed y-command: error in description about the number of characters in string1 and string2
related to 0001550 (Closed) clarifications/ambiguities in the description of context addresses and their delimiters for sed
related to 0001662 (Closed) Delimiter issues in ed and ex

Notes
(0005602)
Don Cragun (manager)
2022-01-14 06:51

This was originally filed against the Issue 7 + TC2 project, but the page and line numbers are from Issue 8 draft 2.1. It has been moved to the Issue 8 project.
(0005607)
calestyo (reporter)
2022-01-14 15:52

Some additions, which I was pointed to by Oğuz in:
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33636&limit=100&offset=0&sid=



with respect to (2a) and (2b):

Not only special characters of the BRE/ERE are affected, but also special characters of the s-command's replacement (i.e. & ).

Consider:
s&x&_\&_&

which, depending on what "literal" means, could be either effectively:
s/x/_&_/
or:
s/x/_\&_/

(both GNU and BusyBox sed seem to do the latter:
   $ printf '%s\n' 'x' | sed 's&x&_\&_&'
   _&_
   $ printf '%s\n' 'a' | sed 's&x&_\&_&'
   a
   $ printf '%s\n' 'x' | busybox sed 's&x&_\&_&'
   _&_
   $ printf '%s\n' 'a' | busybox sed 's&x&_\&_&'
   a
)




with respect to (2c):

And just as above, but this time for characters which get their special meaning only when preceded by a \ :

Consider:
s1\(x\)1\11

which, depending on what "literal" means, could be either effectively:
s/\(x\)/\1/
(BusyBox sed seems to do this:
   $ printf '%s\n' 'oxo' | busybox sed 's1\(x\)1\11'
   oxo
   $ printf '%s\n' 'owo' | busybox sed 's1\(x\)1\11'
   owo
)

or:

s/\(x\)/1/
(GNU sed seems to do this:
   $ printf '%s\n' 'oxo' | sed 's1\(x\)1\11'
   o1o
   $ printf '%s\n' 'owo' | sed 's1\(x\)1\11'
   owo
)
(0005610)
calestyo (reporter)
2022-01-14 20:36

with respect to (2c):

The ambiguity with respect to what "literal" means AND when a delimiter is used that gets its special meaning only when preceded by \ goes even a bit further.

In the note above, I gave the example of using one of the digits 1-9 in the s-command's replacement.

But for BREs only (unless there are implementations which provide this also for EREs) the same may even happen in the RE part.

Consider:
s1\(xo\)\11X1

which, depending on what "literal" means, could be either effectively:
s/\(xo\)\1/X/
(BusyBox sed seems to do this:
   $ printf '%s\n' 'xoxo' | busybox sed 's1\(xo\)\11X1'
   X
   $ printf '%s\n' 'xo1' | busybox sed 's1\(xo\)\11X1'
   xo1
)

or:

s/\(xo\)1/X/
(GNU sed seems to do this:
   $ printf '%s\n' 'xoxo' | sed 's1\(xo\)\11X1'
   xoxo
   $ printf '%s\n' 'xo1' | sed 's1\(xo\)\11X1'
   X
)
(0005611)
calestyo (reporter)
2022-01-14 21:40

with respect to (3):

For GNU sed: \w has a special meaning that extends POSIX:
»Matches any "word" character. A "word" character is any letter or digit or the underscore character.«

(similarly, it gives \+ a special meaning in BREs, see the example s+9\++X+ above)

Considering:
sw9\wwXw

(GNU sed seems to do this:
   $ printf '%s\n' '99' | sed 'sw9\wwXw'
   99
   $ printf '%s\n' '9w' | sed 'sw9\wwXw'
   X
)
so it effectively takes the \w ... makes it a non-delimiter (and removes the escaping), being just left with the character w which is by itself literal.


However...


(BusyBox sed:
   $ printf '%s\n' '9' | busybox sed 'sw9\wwXw'
   9
   $ printf '%s\n' '99' | busybox sed 'sw9\wwXw'
   X
   $ printf '%s\n' '999' | busybox sed 'sw9\wwXw'
   X9
)
somehow does both:
- makes the \w a non-delimiter
- but still sees \w with respect to the RE, and gives it special meaning.

Which I think is generally (regardless of the interpretation or any
ambiguities in POSIX) a bug (I'll report it there).



So why do I bring up the same example from (3) again, just with \w instead of \+ ?

Because that shows how problematic and dangerous this whole set of ambiguities is:

One might argue that people using weird characters (like . or ( ) as delimiters don't deserve any better when they get unexpected results.

But this shows that even a plain letter 'x', which never has any special meaning (with or without preceding \ ), could cause such issues, if the implementation chooses to give special meaning to \x and if it does the un-delimitering in a bad way.
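
The `\w` case can be reproduced outside sed with a short Python sketch. Python's `re` module is used here only as a stand-in that, like GNU sed, happens to give `\w` the meaning "word character"; `drop_delimiter_escape` is a hypothetical helper, not a function of any sed implementation.

```python
import re


def drop_delimiter_escape(regex, delim):
    """Remove the backslash from every "\\<delim>" pair, one pass, left to right."""
    out, i = [], 0
    while i < len(regex):
        if regex[i] == "\\" and i + 1 < len(regex):
            nxt = regex[i + 1]
            out.append(nxt if nxt == delim else regex[i:i + 2])
            i += 2
        else:
            out.append(regex[i])
            i += 1
    return "".join(out)


written = r"9\w"                                  # RE part of  sw9\wwXw
dropped = drop_delimiter_escape(written, "w")     # backslash removed: 9w
kept = written                                    # backslash still seen by the RE

for line in ("9", "99", "9w", "999"):
    print(line,
          re.sub(dropped, "X", line, count=1),    # escape removed before matching
          re.sub(kept, "X", line, count=1))       # \w still active as extension
```

The second column of the output corresponds to the GNU sed transcript above, the third to the BusyBox one.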
(0005612)
calestyo (reporter)
2022-01-14 21:48
edited on: 2022-01-14 22:09

Just some summary on the different examples I gave so far in the notes above:


                GNU-sed         BusyBox-sed
s.\..X.         UD              TR
s(\((X(         UD[0a]          error+bug?[1a]
s.[.].X.        UD?[2a]         TR?[2b]
s+9\++X+        UD[0b]          bug?[1b]
s&x&_\&_&       TR              TR
s1\(x\)1\11     UD[0c]          bug?[1c]
s1\(xo\)\11X1   UD[0d]          bug?[1d]
sw9\wwXw        UD[0e]          bug?[1e]


Legend:
UD = un-delimitered, i.e. when the delimiter was . then a \. is seen by the RE as . WITH (if any!) special meaning kept
(un-delimitered meaning: the escaped delimiter character was made a non-delimiter and the escaping removed)

TR = truly literal, i.e. when the delimiter was . then a \. is seen by the RE as literal . with (if any!) special meaning REMOVED



[0]
a) I would count this as UD, because with GNU sed, the RE "sees" effectively ( ... that is the un-delimitered \( ... and that just also happens to be truly literal.
b) Same as (a) right above, GNU sed's RE "sees" effectively + ... that is the un-delimitered \+ ... and that just happens to be already literal
c) Same as (a)... just on the replacement side.... GNU sed's replacement "sees" effectively 1 ... that is the un-delimitered \1 .... and that is again already literal
d) Same as (c), just not on the replacement side, but on the RE side of a BRE (where \1 is backreference)
e) Same as (a), just with a letter character that should never be special by itself and usable as delimiter without any worries.


[1]
a) BusyBox sed's RE apparently sees here \( in the RE, which means that it *was* un-delimitered, but it's still kept as \( ... and BB sed complains about the closing \) ... IMO likely a bug.
b) Same as (a) right above, BB sed, seems to un-delimiter the \+ ... but still see \+ afterwards and uses it with special meaning. IMO a bug.
c) Same as (a) right above, BB sed makes the \1 a non-delimiter ... and the replacement still "sees" a \1 (i.e. the back-reference)
d) Same as (c), just not on the replacement side, but on the RE side of a BRE (where \1 is backreference)
e) Same as (a), just with a letter character that should never be special by itself and usable as a delimiter without any worries (but can't be, with busybox).


[2]
a) I'd also count that as UD, i.e. GNU-sed sees the special character . just that it has a TR meaning within the bracket expression
b) For BusyBox sed I'd count it as TR ... i.e. it might see the already literal character . which is again literal within the bracket expression



In conclusion:
GNU sed:
- seems to mostly follow the philosophy that the quoted delimiter character is just "un-delimitered" (with the preceding \ removed) but then treated special if it would normally be special ... and not treated special if it would normally (that is without preceding \ ) not be special.
- BUT... there is at least one deviation from that philosophy: \& in the replacement when & is the delimiter

BusyBox sed:
- seems to follow the philosophy of doing both: un-delimitering AND making the result non-special
- EXCEPT for any cases where the character ALONE would not be special, but WITH a preceding \ it would be. There it un-delimiters (effectively NOT removing the preceding \ ) yet still keeps the special meaning.

(0005613)
calestyo (reporter)
2022-01-14 22:17

sorry if you cannot properly read the table in the comment above... stupid Mantis folds multiple spaces and I had to use Unicode EN SPACE to get at least some readability.

I've attached the file summary-of-literal-behaviour-gnu-vs-busybox.txt which should render properly in any normal text editor.
(0005627)
calestyo (reporter)
2022-01-18 21:18

btw: in https://austingroupbugs.net/view.php?id=1556#c5626 I made some (hypothetical) example of what I personally would consider a proper definition of how things are parsed and to be interpreted.
(0005634)
calestyo (reporter)
2022-01-24 14:52
edited on: 2022-01-24 15:48

Just for the record, in BusyBox at least the erroneous behaviour of these cases has been fixed:
s(\((X(         UD[0a]          error+bug?[1a]
s+9\++X+        UD[0b]          bug?[1b]
s1\(xo\)\11X1   UD[0d]          bug?[1d]
sw9\wwXw        UD[0e]          bug?[1e]

But not (yet) that of:
s1\(x\)1\11     UD[0c]          bug?[1c]

See:
https://bugs.busybox.net/show_bug.cgi?id=14541
https://git.busybox.net/busybox/commit/?id=e998c7c032458a05a7afcc13ce0dc980b99ecc6c

(0005648)
calestyo (reporter)
2022-02-01 20:12

Just something more for the records (in case someone tries to reproduce the above with current versions):

Another commit in BusyBox (https://git.busybox.net/busybox/commit/?id=f12fb1e4092900f26f7f8c71cde44b1cd7d26439) seems to have "fully" aligned its behaviour with that of GNU sed (well "fully" in the sense of the test cases given above in https://www.austingroupbugs.net/view.php?id=1551#c5612 ).

In particular, BusyBox sed now also seems to consider the 2nd '.' in:
s.\..X.
as a special character and no longer as literal one, i.e. equivalent to:
s/./X/
(haven't tested other special characters)

So with respect to the most recent version, the table would look the same for BusyBox sed and GNU sed.




The main points of this issue remain however, i.e. that POSIX is ambiguous with respect to them.
(0005757)
geoffclare (manager)
2022-03-18 14:59

Here's my take on the issues raised:

1) The way the "\cREc" feature is specified for context addresses implies parsing is meant to be done in one stage. The end of the first paragraph of "Regular Expressions in sed" says:
Both BREs and EREs shall also support the following additions:

and one of these additions is the stuff about "\cREc". The fact that this is described as an addition to RE syntax means it is supposed to be recognised while parsing the RE. For s and y it is unclear, but for consistency they should be the same as context addresses. There have been plenty of cases in the past where the standard said "<backslash>" when it meant "unescaped <backslash>" and I think this is another such case. Inserting "unescaped" should be considered an editorial change.

2) Solaris and HP-UX behave the same as busybox, macOS behaves the same as GNU. Since there is no clear winner, POSIX should (explicitly) allow both behaviours.

3) Bug in busybox. (Solaris, HP-UX and macOS all behave the same as GNU.)
(0005774)
kre (reporter)
2022-04-02 09:15

Re Note: 0005757

   There have been plenty of cases in the past where the standard said
   "<backslash>" when it meant "unescaped <backslash>" and I think this is
   another such case. Inserting "unescaped" should be considered an editorial change.

About being an editorial change, I agree, but I think it would be better if
it were changed to be "escape character" rather than "unescaped <backslash>".
Let's stop overloading the use of <backslash> and use that only when we mean
the character as itself, and not when it is being treated as the escape
character.
(0005780)
calestyo (reporter)
2022-04-05 00:59
edited on: 2022-04-05 01:18

(My sincere apologies for this having become so long.)


Geoff, I've looked at your proposal at https://austingroupbugs.net/view.php?id=1550#c5761 and with respect to this ticket I'd say the following:


I) with respect to my original point (1) here (i.e. how the string is parsed, left-to-right in one pass vs. two passes):

People might argue that it would kinda implicitly follow from the fact that the delimiter character c can be in the string as just c (and thus being a delimiter) or '\c' (being not the delimiter). And since that could also be preceded by any further '\', which would then be part of the RE... they could argue that it would be "clear" that the string needs to be parsed in one pass.

However, in the draft (not Geoff's current proposal), page 3134, line 106087 merely says:
"If the character designated by c appears following a <backslash>, then it shall be considered to be that literal character, which shall not terminate the RE."

It doesn't use the terms "escape character/sequence"... it just says "c following <backslash>".
Vice versa that means, that c NOT following <backslash> is a delimiter, right?!

Nothing directly forbids, to first look for such c NOT following <backslash>, break up the string there and parse the parts, or is there anything that does?

Looking at e.g. 's(\\((X(' should show how this is ambiguous because the parsing is not defined:
a) going left to right in one pass:
s(
          RE: \\ (i.e. the literal '\')
 (
 replacement: <empty>
 (
       flags: X(

b) two stage parsing, with looking for '( not preceded by <backslash>'
s(
          RE: \\(
 (
 replacement: X
 (
       flags: <empty>


With a lot of thinking around some edges, it was already vaguely implied by the draft via:
- Page 3134, line 106085 "Both BREs and EREs shall also support the following additions", which is probably intended to mean, that the following bullet items (which included \n and \c) are all considered part of the RE-language ... and thus have to be parsed with them in one step.


Geoff's current proposal makes this a little bit clearer by the sentences:
- "The BRE and ERE syntax shall additionally support escaping occurrences of the delimiter within the RE with an unescaped <backslash> (except inside a bracket expression)."
- "Within the RE and the replacement, the delimiter shall not terminate the RE or replacement if it
is preceded by an unescaped <backslash> (that is not inside a bracket expression in the RE,
where the delimiter does not terminate the RE anyway - see [xref to Regular Expressions in
sed])"


=> Still, as I propose in https://austingroupbugs.net/view.php?id=1556#c5778 point (c), I'd make this clearer by directly saying that sed's additions '\n' (for newlines) and '\c' (for escaped delimiter) are, with respect to sed, considered part of the RE or replacement language, respectively... and that the whole command string (context address or s-command, respectively) is parsed in one go from left to right.




II) with respect to my original point (2a) here (i.e. what is an escaped delimiter \c with c being special to the RE - special or literal?):

Geoff's current proposal solves this question with the sentences:
- "If the character designated by c is not special in a BRE or ERE according to [xref to XBD 9.3] or [xref to XBD 9.4], respectively, the escape sequence <backslash>c shall be treated as that literal character; otherwise, it is unspecified whether the escape sequence <backslash>c is treated as the literal character or the special character."

- "If the delimiter character is not special in a BRE or ERE according to [xref to XBD 9.3] or [xref to XBD 9.4], respectively, the escape sequence <backslash>delimiter shall be treated as that literal character in the RE; otherwise, it is unspecified whether the escape sequence <backslash>delimiter is treated as the literal character or the special character. Likewise, if the delimiter character is not <ampersand> ('&'), the escape sequence <backslash>delimiter shall be treated as that literal character in the replacement; if it is <ampersand>, it is unspecified whether the escape sequence <backslash>delimiter is treated as the literal character or the special character (see below)."
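
Read literally, the two quoted sentences amount to a simple decision rule, sketched below in Python. This is one possible reading, with "special" taken to mean only the characters that XBD 9.3.3 and 9.4.3 list as special; the function names are hypothetical, and position-dependent cases (for example `*` at the start of a BRE) are ignored.

```python
# Special characters per XBD 9.3.3 (BRE) and 9.4.3 (ERE), context ignored.
BRE_SPECIAL = set(".[\\*^$")
ERE_SPECIAL = set(".[\\()*+?{|^$")


def check_delimiter(c):
    # <backslash> and <newline> cannot be used as delimiters at all.
    if c in ("\\", "\n"):
        raise ValueError("not a valid delimiter")


def escaped_delimiter_in_re(c, flavor="BRE"):
    """Meaning of \\<delimiter> inside the RE under the quoted wording."""
    check_delimiter(c)
    special = BRE_SPECIAL if flavor == "BRE" else ERE_SPECIAL
    return "unspecified" if c in special else "literal"


def escaped_delimiter_in_replacement(c):
    """Meaning of \\<delimiter> inside the replacement under the quoted wording."""
    check_delimiter(c)
    return "unspecified" if c == "&" else "literal"


for c in ".,+(&1":
    print(repr(c),
          "BRE:", escaped_delimiter_in_re(c, "BRE"),
          "ERE:", escaped_delimiter_in_re(c, "ERE"),
          "replacement:", escaped_delimiter_in_replacement(c))
```

Under this reading a delimiter such as `(` or `1` would have to be literal in a BRE; whether the wording is actually meant to cover characters that only become special when escaped is a separate question.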


Well, that "solves" (2a) (and the first half of my note 5607, which dealt with the same problem for the replacement) by making it implementation dependent...
So my usual question in that case:

Apart from GNU's vs. busybox' sed ... is it known whether any current (= not older than 5 years and still maintained) sed implementations differ in that behaviour?

BusyBox sed may simply change its behaviour (if persuaded ;-) )... I think they usually try to follow GNU... and so the difference might be simply some implementation coincidence.

In note #5757, Geoff mentioned HP-UX. Current versions? Are HP people here? Is this really documented behaviour? Would they be willing to change?

What "we" standardise here will stay there for many decades or forever (and cause much headache because of portability).

Therefore I think it would be worth to try whether the behaviour of any still relevant implementations could be unified.
And have only *one* way standardised.

I mean it's better to have this explicitly defined as implementation dependent than nothing... but would be even better if all implementations do just the same.




III) with respect to my original point (2b) here (i.e. should the standard tell whether or not one can get the other meaning (e.g. special if it's taken literal)?):

Geoff's current proposal already describes this for one direction in the APPLICATION USAGE:
"Applications that use a special RE character as a delimiter (for example, '.' or '*') and need to use the delimiter as a literal character in the RE should put it inside a bracket expression, as implementations differ regarding whether escaping it with a <backslash> removes its special meaning."


=> Perhaps adding something like:
"If an implementation considers such an escaped delimiter as the literal character (as opposed to the special character), it is not possible to give it its special meaning, except by using another delimiter."




IV) with respect to my original point (2c) here (i.e. what about characters that get their special meaning with respect to the RE *or* replacement only when escaped by a preceding backslash, e.g. for BREs '(' or '{'... and in implementation extensions e.g. '+' or 's'):

Geoff's current proposal doesn't mention that case, AFAIU. It uses the wording "special character", but AFAIU, '(' is not considered a "special character", right?

I gave some different kinds of example for this in:
- the original post of this ticket, point (2c) <-- example for the RE part
- note 5607 of this ticket, point (2c) <-- example for the replacement part
- the original post of this ticket, point (3) <-- busybox example for actually the same thing as in (2c)


=> I think, what Geoff's current proposal already "solves" for special characters (by making it implementation defined) *still needs* to be solved for such characters that become only special when escaped by a preceding '\', too.

AND

I think the specification must make it a MUST that implementations use that as the literal character.
E.g. for BREs, implementations MUST consider:
 s(\((X(
equivalent to:
 s/(/X/

Giving them the freedom to choose the special meaning, by making this implementation defined, would have IMO the following problems:
- POSIX defines the escape sequence '\x' (for all characters x except some like the specials or with sed '\n' and '\c' with c being the delimiter) to be undefined.
- thus implementations started using this for their own purposes, e.g. GNU's sed has '\s', '\S', '\W', '\>', '\+' and more.

If POSIX now allows e.g. '\W' in 'sW\WWxW' to be *either* special *or* literal (and not just the latter), then people basically lose the ability to use most possible delimiter characters in a portable way, because an implementation could have extended the meaning of any '\x' (x as "defined" above).

So even if one would follow just the POSIX rules... which effectively say:
- you may use 'W' as delimiter
- 'W' is not a special character
- 'W' is not a character like '(' that becomes special if escaped by '\'
- '\W' is undefined, except for the use as escaped delimiter

... and thereby rightfully assume, that '\W', when the delimiter is 'W', would become the literal 'W'... they could actually get a special 'W'.

And that for every character that *any* implementation might have given an extended meaning.
This is basically what I tried to describe in note #5611 above.

If the standard would allow the implementation to choose whether \c is the literal c or the special c for characters c, other than a very limited set (namely the respective POSIX defined special characters - excluding(!) any implementation defined characters that only become special when escaped), then I think no delimiters could be safely used anymore in a portable way (even '/'), because when it's used in the RE it's implementation dependent whether it becomes special or not.
The only way around that would be, to put any delimiter c in the RE into its own bracket expression... but I guess that would get quite ugly and many people would probably not know that this might be needed.


btw: For the replacement part, I think Geoff's current proposal already does this (rather implicitly though):
"Likewise, if the delimiter character is not <ampersand> ('&'), the escape sequence <backslash>delimiter shall be treated as that literal character in the replacement; if it is <ampersand>, it is unspecified whether the escape sequence <backslash>delimiter is treated as the literal character or the special character (see below)."

That gives the freedom to choose whether \c is special or literal, ONLY for the POSIX-defined special character & ... *any* other characters (including 1-9 and any additions an implementation might have made) are required to be literal.


Geoff's current proposal does not explicitly allow an implementation to choose the behaviour (literal vs. special) for delimiters whose character c would only become special if escaped - but it doesn't explicitly forbid it either (and that's what's IMO missing).

I just saw (at the end of my review) that Geoff's proposal already hints at this whole problem, in the added paragraph:
"Some historical sed implementations..."

Still I think we need to more explicitly rule this out outside of the "RATIONALE" section.



=> So, I'd propose:

* Implementations MUST consider '\c' with a delimiter 'c' ALWAYS as the literal character 'c', unless 'c' is a special character for BREs respectively ERE.
*If* the final accepted resolution leaves it implementation defined for special characters, then one could possibly amend that simply by saying "For any 'c' that are not special (including such which would become only special when escaped), '\c' MUST be considered as the literal 'c'."

=> e.g. with a sentence like "If a character 'c' is used as delimiter that is not a special character for BREs respectively EREs (as defined by POSIX), the escape sequence '\c' must be considered to be the literal character c, regardless of any special behaviour extending POSIX, that an implementation would give to '\c' if another delimiter was used."


* I'm however not sure what should be done about those characters which POSIX itself already mentions as characters that become special when escaped, e.g. for BREs: '(' or '{' ... (and with https://austingroupbugs.net/view.php?id=1546 also '\?' '\+' and '\|' which *may* be special)

For them one could allow implementation dependent behaviour, because they're defined by POSIX already, and so people know what they must exclude to stay portable. But they cannot know for any '\w', '\s' or '\~'.

I'd probably recommend against it, and allow implementation defined behaviour only for true special characters, which don't need to be escaped to become special.

[I should note again, that at least BusyBox' sed would already break this, see my original post here, point (2c).
Since it's conceptually the same as point (3), which has already been fixed, it may be already fixed in current BusyBox versions, too.]


* I'm also unsure for the cases '\c' where c is any digit from 1-9:
- this may be in replacements (BRE and ERE)
- the RE part, too (only for BRE) <--- see my note #5610 above for details/examples

I think for BOTH, we need to make it a MUST, that implementations treat e.g. the escape sequence '\1' as the literal '1' when '1' is the delimiter.
For the replacement this is already the case with Geoff's current proposal (which only allows the implementation to choose with '\&').
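
To show what is at stake, a minimal sketch with made-up input. Both commands avoid the problematic combination: the first uses back-references with '/' as delimiter, the second uses a digit as delimiter without that digit appearing anywhere in the RE or replacement.

```shell
# Ordinary case: '\1' and '\2' in the replacement are back-references.
swapped=$(printf 'ab\n' | sed 's/\(a\)\(b\)/\2\1/')
printf '%s\n' "$swapped"   # ba

# A digit is a valid delimiter as long as it is not needed elsewhere.
digit=$(printf 'abc\n' | sed 's1b1X1')
printf '%s\n' "$digit"     # aXc
```

As soon as '1' is the delimiter, '\1' can no longer mean "first subexpression", which is why back-references to that digit are lost for the whole command.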


=> I'd also propose to add something like the following sentence to the standard (maybe APPLICATION USAGE?).
"If a digit from 1-9 is used as delimiter, it cannot be used as back-reference in the replacement or a BRE's RE part."

=> I'd further suggest to add (probably also to APPLICATION USAGE) a list for BRE and ERE, respectively, which lists all those characters that are *not* safely usable as delimiter (because implementations may choose literal vs. special) *if* the character is also part of the RE (and the same for the replacement, where it's only '&').
If my above proposal is accepted, and depending on what's done for '\?' '\+' and '\|', these would be simply the list of all (truly) special characters for BREs respectively EREs.
Such lists would help people to more easily understand what they can use portably, without having to work it out from the rules and their complex meaning.




IV) with respect to my original point (2d) here (i.e. what about 's.[.].X.')

As said before, I think it's already a bit clearer with Geoff's current proposal, that '\c' with c being the delimiter would be considered part of the RE language.

But 's.[.].X.' is different... as the 2nd '.' is not escaped.

However, with:
- what I propose in https://austingroupbugs.net/view.php?id=1556#c5778 point (c)
and
- The following sentence added with Geoff's current proposal:
'The delimiter character that precedes and follows the RE shall not terminate the RE when it appears within a bracket expression. For example, the context address "/[/]/" is equivalent to "/\//".'

it becomes IMO fully clear. So that point is solved (I also like the paragraph you add in the APPLICATION USAGE, which describes how to use that).

=> I would however use a simpler example:
"\%[%]%" is equivalent to "/[%]/"
or alternatively:
"/[/]/" is equivalent to "\%[/]%"

Well, whether it's simpler might be a matter of personal taste, but it doesn't drop the bracket expression, which I think is better for showing what's going on.




V) With respect to the proposal at https://austingroupbugs.net/view.php?id=1550#c5761

a) As already said in the other tickets, I'd put down the sentences starting with "The BRE and ERE syntax shall additionally support escaping" to the "Regular Expressions in sed" section again.

Sorry for giving that bad idea earlier, that it should be in "Addresses in sed"

And perhaps overlapping parts of these sentences can be unified with the ones added to the s-command. (Won't work for the replacement part, though).


b) I wouldn't write:
    "is not special in a BRE or ERE" <--- this exists in two locations
but rather
    "is not special in __the__ BRE __respectively__ ERE"
or something better.

The "or" could be interpreted e.g. the following way: we're in a BRE, someone uses + as delimiter, while that isn't special in BRE it is in ERE... so the "or" kicks in,... at least in my English understanding, "respectively" would make it a tiny bit clearer that this (BRE vs. ERE) depends on the respective case.
And yes, I've seen the ", respectively, " but I'd rather interpret that to relate to BRE <-> [xref to XBD 9.3] and ERE <-> [xref to XBD 9.4].


c) Throughout your additions, you use e.g. "with an unescaped <backslash> (except inside a bracket expression)" or similar.
If we make it more clear (as I proposed above):
- that '\c' and '\n' are considered part of the RE language
and with
- the changes made through some other issue, that clearly define "escape sequence/character" for the RE language

... I think we could go back and just call that "escape 'c'" or "escape sequence 'c'", though I would personally prefer to retain the parentheses with a hint like "(there can't be escape characters/sequences inside bracket expressions)"


d) Cosmetics:
In some places the wording "escape sequence <backslash>c" is used... but in others e.g. "escape sequence '\n'".


e) Instead of:
"The delimiter character that precedes and follows the RE shall not terminate the RE when it appears within a bracket expression. For example, the context address "/[/]/" is equivalent to "/\//"."

I'd write:

"The delimiter character that precedes and follows the RE shall not terminate the RE when it appears within a bracket expression __but be that literal character for the bracket expression__. For example, the context address "/[/]/" is equivalent to "/\//"."

It's nitpicking, but AFAIU, the delimiter character (unlike the escaped delimiter character) is strictly speaking *not* part of the RE language.
So it's in principle still not 100% clear what happens with such character. Sure, it doesn't terminate the RE... but it could be... ignored?


f) "Within the RE and the replacement, the delimiter shall not terminate the RE or replacement if it is preceded by an unescaped <backslash> (that is not inside a bracket expression in the RE, where the delimiter does not terminate the RE anyway - see [xref to Regular Expressions in sed])."

In case this would be "unified" with the corresponding parts for the context address in "Regular Expressions in sed"... the part for the replacement would obviously need to stay.


g) "if it is <ampersand>, it is unspecified whether the escape sequence <backslash>delimiter is treated as the literal character or the special character (see below)."

=> one might just write '\&' here, since in that case "delimiter" is always '&'.


h) "Applications that use a special RE character as a delimiter (for example, '.' or '*') and need to use the delimiter as a literal character in the RE should put it inside a bracket expression, as implementations differ regarding whether escaping it with a <backslash> removes its special meaning."

=> If my proposal (III) above is accepted, then I'd also repeat here specifically e.g. "special RE character (which does not include such which become only special when escaped) as a delimiter".

=> And perhaps something like "should put it inside a bracket expression __with no other characters__" to make clear that one cannot merge it into an existing bracket expression, e.g. 'sX\X[0-9]XfooX' can NOT be written as 'sX[X0-9]XfooX' but only as 'sX[X][0-9]XfooX'.
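
Why the two forms differ can be seen independently of the delimiter question. A minimal sketch with made-up input, using '/' as delimiter so that only the RE semantics are shown:

```shell
# Two bracket expressions: an 'X' followed by one digit.
separate=$(printf 'X1 X2\n' | sed 's/[X][0-9]/foo/')
printf '%s\n' "$separate"   # foo X2

# One merged bracket expression: a single character, 'X' or a digit.
merged=$(printf 'X1 X2\n' | sed 's/[X0-9]/foo/')
printf '%s\n' "$merged"     # foo1 X2
```

The merged form matches one character instead of two, so it is a different RE and not just another spelling of the same one.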

Question:
Are the following bracket expressions well-defined and portable:
- [^]
- [\]
?
At least '[^]' would fall under the above sentence ("Applications that use a special RE character...")... '[\]' not really as '\' cannot be a delimiter.

I tried to find this in 9.3.5 RE Bracket Expression,... and I guess '[\]' is clearly well-defined and portable... but I cannot really follow this for '[^]'... it seems not to be mentioned and I guess I'll report it in a separate ticket.

[Edit: I guess '[^]' is ruled out by page 169, line 5873. Still I filed #1575 to make it even clearer that '^' is forbidden also as the *only* and not just the *first* character.]

=> But anyway,... the above sentence would need to exclude [^] then...

Or is there a way to safely escape this? I guess not, cause it's special and thus implementations would be allowed to choose whether to treat '\^' literal or special (when ^ is also the delimiter)... probably even depending on the position of that escape sequence.


i) "Some historical sed implementations did not support escaping '(', ')', '{', and '}' when used as a BRE"

Not sure, but this introduction with historical implementations gives kinda the feeling that this problem would only exist because of historical implementations and because of '(', ')', '{', and '}'.
However, AFAIU, we need to *generally* rule that out, and not just because of historical implementations.

And with "that" I mean, implementations must not be allowed to choose whether they give '\c' literal or special meaning, if 'c' is the delimiter, and if 'c' alone wouldn't be special, but 'c' preceded by an escaping '\' would be.




VI) not really related to this issue, but it would make things even more complex if I add it in a separate ticket:

The description of the y-command contains on page 3138, line 106249:
"If the number of characters in string1 and string2 are not equal, or if any of the characters in string1 appear more than once, the results are undefined."

That is strictly speaking wrong, namely in the case when string1 and/or string2 contain a '\'-escaped 'n' (for newline) or '\'-escaped delimiters, and the number of occurrences in both strings don't even out.

=> Perhaps simply write "If the number of characters (after resolving any escape sequences)..." or so?
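
A minimal sketch of such cases (made-up input): in both commands string1 is written with two characters in the script but denotes a single character once the escape sequence is resolved, while string2 is written with one.

```shell
# string1 is '\/' (escaped delimiter, i.e. one '/'), string2 is '-'.
slash=$(printf 'a/b\n' | sed 'y/\//-/')
printf '%s\n' "$slash"     # a-b

# string1 is '\n' (one <newline>), string2 is a single space.
# 'N' first appends the second line so the pattern space holds a newline.
joined=$(printf 'a\nb\n' | sed -e 'N' -e 'y/\n/ /')
printf '%s\n' "$joined"    # a b
```

Counted as written, string1 and string2 have different lengths in both commands, yet the result is well defined, which is what the suggested "(after resolving any escape sequences)" would cover.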

(0005825)
geoffclare (manager)
2022-04-28 15:15

This is being closed as a duplicate of bug 0001550 as Note: 0005816 includes changes to address it.

- Issue History
Date Modified Username Field Change
2022-01-14 05:39 calestyo New Issue
2022-01-14 05:39 calestyo Name => Christoph Anton Mitterer
2022-01-14 05:39 calestyo Section => Utilities, sed
2022-01-14 05:39 calestyo Page Number => 3132, ff. (in the draft)
2022-01-14 05:39 calestyo Line Number => see below
2022-01-14 06:34 Don Cragun Relationship added related to 0001550
2022-01-14 06:48 Don Cragun Project 1003.1(2016/18)/Issue7+TC2 => Issue 8 drafts
2022-01-14 06:51 Don Cragun Note Added: 0005602
2022-01-14 06:51 Don Cragun version => Draft 2.1
2022-01-14 15:52 calestyo Note Added: 0005607
2022-01-14 20:36 calestyo Note Added: 0005610
2022-01-14 21:40 calestyo Note Added: 0005611
2022-01-14 21:48 calestyo Note Added: 0005612
2022-01-14 22:07 calestyo Note Edited: 0005612
2022-01-14 22:08 calestyo Note Edited: 0005612
2022-01-14 22:09 calestyo Note Edited: 0005612
2022-01-14 22:15 calestyo File Added: summary-of-literal-behaviour-gnu-vs-busybox.txt
2022-01-14 22:17 calestyo Note Added: 0005613
2022-01-18 21:18 calestyo Note Added: 0005627
2022-01-24 14:52 calestyo Note Added: 0005634
2022-01-24 15:48 calestyo Note Edited: 0005634
2022-02-01 20:12 calestyo Note Added: 0005648
2022-03-18 14:59 geoffclare Note Added: 0005757
2022-04-02 09:15 kre Note Added: 0005774
2022-04-05 00:59 calestyo Note Added: 0005780
2022-04-05 01:18 calestyo Note Edited: 0005780
2022-04-28 15:15 geoffclare Note Added: 0005825
2022-04-28 15:15 geoffclare Status New => Closed
2022-04-28 15:15 geoffclare Resolution Open => Duplicate
2022-05-05 14:59 nick Relationship added parent of 0001578
2023-04-11 14:30 geoffclare Relationship added related to 0001662

