Austin Group Defect Tracker

Aardvark Mark IV


ID 0001036
Category [1003.1(2013)/Issue7+TC1] Shell and Utilities
Severity Objection
Type Error
Date Submitted 2016-03-22 03:18
Last Update 2017-06-10 00:58
Reporter kre View Status public  
Assigned To
Priority normal Resolution Open  
Status New  
Name Robert Elz
Organization
User Reference
Section 2.7.4
Page Number 2335-2336
Line Number 74235-74256
Interp Status ---
Final Accepted Text
Summary 0001036: Errors/Omissions in specification of here document redirection
Description Aside from the question of just which newline is the "next" newline, that
has been canvassed (without resolution I can see) elsewhere, there are
several problems with the specification of here documents.

First, given that the here doc is processed after encountering a newline
(which newline is the other issue), it must be largely processed as a side
effect of lexical processing: newlines (other than those that happen to be
literal) no longer exist in the scanned form of the shell input; they have
served as token delimiters and are not otherwise relevant. This would suggest
that the here document is processed during lexical analysis, and nothing in the
specification contradicts that. The spec does say that (given an unquoted
delimiter word) the text is subject to various expansions. It does not say that
those expansions should not be performed while reading the here doc text;
however, I believe that is (or should be) the intent. That is, if the here doc
is never used because it is attached to a command that is never executed (on
the "wrong" side of an && or similar), then the expansions in the here doc
should not be performed. It could be that I am missing something, but I cannot
see any text that says that the expansions in the here doc should be evaluated
in the context of the command that is about to use the data, immediately before
it is used (in the appropriate sequence of all applicable redirect operations).
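
A minimal sketch of the situation described in this first point, assuming a shell that defers here-document expansion until the redirection is actually performed. The scratch file named by `marker` is hypothetical and exists only to make the side effect of the command substitution observable.

```shell
# "marker" is a hypothetical scratch file, used only to observe the side effect.
marker="${TMPDIR:-/tmp}/heredoc-marker.$$"
rm -f "$marker"

# The command on the "wrong" side of && is never executed.
false && cat <<EOF
$(touch "$marker")
EOF

if [ -e "$marker" ]; then
    deferred=no     # the expansion ran even though cat never did
else
    deferred=yes    # the expansion was skipped along with the command
fi
echo "expansion deferred: $deferred"
rm -f "$marker"
```

If the marker file appears, the shell expanded the here document even though the command it was attached to never ran.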

Second, the text says ...

      If any character in word is quoted, the delimiter shall be...

and in the following paragraph ...

      If no characters in word are quoted, all lines of the ...

but I do not believe that is what is intended, and is not what is actually
implemented in any shell I can find. Consider ...

      cat << ""EOF
      lines of text
      EOF

The delimiter there is the string EOF, in which none of the characters were quoted. True, it was preceded by a quoted null string, but that contains no quoted characters. Hence no characters of the delimiter word were quoted and, according to the spec, "lines of text" should be subject to the various expansions. No-one implements it that way: what matters is not whether any characters in the word are quoted, but whether any quote characters were encountered while scanning word.
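
The following sketch makes the quoted-null-string case observable by putting an expansion in the body. The variable name `var` is purely illustrative. It assumes a shell (bash and dash are examples) that treats any quoting seen while scanning the delimiter word as suppressing expansion; a shell that follows the "any character is quoted" wording literally would print the variable's value instead.

```shell
var=expanded

# The delimiter word is ""EOF: quote characters are present,
# but no character of the resulting delimiter is itself quoted.
out=$(cat << ""EOF
$var
EOF
)

# Prints the body as the shell delivered it to cat.
printf '%s\n' "$out"
```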

Third, in cases where expansions are done, nothing makes it explicit that the end delimiter cannot be found as the result of an expansion. That is, in the
following, there is one here document that happens to contain the string
         echo foo <<EOF
and not two here documents

         end=EOF
         cat <<EOF
         lines of text
         $end
         echo foo <<EOF
         another line
         EOF

Of course, if the first question above is resolved to make it clear that expansions do not happen when the document is being read, this would be a moot point, as $end expanding to EOF would not be known while the here doc is being read, which I believe is the correct interpretation.
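
As a sketch of this third point, the same example can be wrapped in a command substitution so that the body handed to cat can be inspected. This assumes the delimiter is recognised on the raw input text, before any expansion, which is the interpretation argued for above.

```shell
end=EOF

# Only the literal line "EOF" ends the document; the line holding $end
# is expanded later and becomes part of the body.
out=$(cat <<EOF
lines of text
$end
echo foo <<EOF
another line
EOF
)

printf '%s\n' "$out"
```

Under that assumption the body has four lines, the second of which reads EOF after expansion, and no echo command is ever run.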

Fourth, I am totally confused by the relationship between double quoting and
backtick command expansions. Section 2.2.3 appears to say that if backticks appear inside double quotes, then the double-quote interpretation continues through the command expansion (if that were not true, it would not be possible for a double quoted string to start before a ` command substitution and end inside it, as a " inside the `...` would be the start of a new string, not the end of the previous one, the same as it is in $( ) command substitutions).
The relevance of this to here documents is illustrated by the following ...

        echo "` cat << EOF
        X = $(( 1 + 2 ))
        EOF
        `"

If things are as I have postulated, then the EOF is quoted (by the double
quotes that surround the command substitution) and hence the here document
should not be expanded, and echo should (eventually) output

       X = $(( 1 + 2 ))

and not

       X = 3

but again, I do not believe this is in accordance with what any shell does.
This again may be an artifact of the 2nd point above, and if the text is changed so that only quote characters encountered while scanning the delimiter word cause the expansion to be suppressed, and not whether "characters are quoted", then this issue will go away.
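
A small sketch for checking what a given shell does with the fourth example; it captures the result in a variable (the name `out` is illustrative) instead of passing it to echo. It assumes, as stated above, that shells do not treat the surrounding double quotes as quoting the delimiter word.

```shell
# The here document sits inside a backtick substitution that is itself
# inside double quotes; the delimiter word EOF carries no quotes of its own.
out="` cat << EOF
X = $(( 1 + 2 ))
EOF
`"

printf '%s\n' "$out"
```

A shell that expands the arithmetic here is treating the delimiter as unquoted.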

Fifth, and more minor I think, when the delimiter is not quoted,
the text states that backslashes work the way they do in double quoted strings, and references section 2.2.3 for the details. There we are informed that inside double quotes, \ is only special (only a quote character) when the following character is one of \ " ` $ and newline (so for example "\n" is a
two character string). But then (back to 2.7.4) the text goes on to say that
inside the here document " is not special. The problem is that it is not
clear whether \ continues to act as a quote character when followed by this non-special " or not (ie: is \" in a here document, with an unquoted delimiter word, one character, or two?) I believe two is correct.

Sixth, and perhaps most important of all, there is no discussion of what is expected to happen when the input string ends before the here document delimiter is encountered. It is most important because, unlike the previous issues, where I believe all shells (all I could find) actually agree on what should be done and the text just needs to be more clear, for this one there is a difference of opinion. Some shells treat end of file as equivalent to the
delimiter, and go ahead and execute whatever command the here document was attached to with as much input as they managed to gather (one issues a warning when it does this, but does it anyway; most that adopt this behaviour do it silently). Other shells consider this to be a redirect error, suppress execution of the command, and set $? to indicate failure. Personally I believe that the latter is the best approach, as it avoids situations where the
shell eats the entire rest of the script as the here document because of some error or other (the one that happens to me from time to time is that I cut & paste a script, or script segment, and the tabs that had been present get converted to spaces, and then the <<-EOF does not stop on space space..EOF
where it would have with tab EOF).
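
The tab-versus-space trap mentioned at the end of the sixth point can be sketched as follows. The two scripts are generated with printf so that the difference between tabs and spaces survives copying; the variable names are illustrative, and the second script carries an extra unindented EOF so that its behaviour does not depend on the unspecified end-of-input case.

```shell
# <<- strips leading tabs, so a tab-indented delimiter line is recognised.
with_tabs=$(printf 'cat <<-EOF\n\tbody\n\tEOF\n' | sh)

# Leading spaces are not stripped: "  EOF" is ordinary body text, and the
# document only ends at the unindented EOF on the last line.
with_spaces=$(printf 'cat <<-EOF\n  body\n  EOF\nEOF\n' | sh)

printf 'tab-indented delimiter:\n%s\n' "$with_tabs"
printf 'space-indented delimiter:\n%s\n' "$with_spaces"
```

Without that final unindented EOF, the second script would run into exactly the missing-delimiter situation this point is about.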
Desired Action Change the words "If any character in word is quoted" to "If any quoting character is encountered while scanning word", and "If no characters in word are quoted" to "If no quoting characters are encountered while scanning word".

At the end of the paragraph that currently starts "If no characters in word are quoted" add an extra sentence along the lines of ... "The expansions listed are performed in the context of the command about to be executed to which the here document contents are to be input, and at the appropriate time in the sequence of all redirect operators applying to that command - if that command is never executed the here document shall not be expanded."

Where it talks about <backslash> quoting in here documents with unquoted delimiter words, add some text to make it clear that even though a \ in ""
quotes a ", a \ in a here document does not, and the sequence \" is 2 chars.

Finally, add (somewhere) words to the effect "If the terminating line of the here document is not located before the shell exhausts its input, the behaviour is undefined, implementers are encouraged to treat this as a redirect error,
but applications should not rely upon this."
Tags No tags attached.
Attached Files

- Relationships
related to 0000583 Closed ajosey 1003.1(2008)/Issue 7 When is any character in the delimiter word quoted?
related to 0001037 New 1003.1(2013)/Issue7+TC1 The grammar for here documents misses the data body and the final EOF condition
related to 0001043 New 1003.1(2013)/Issue7+TC1 Which newline starts collection of here document data?

-  Notes
(0003097)
kre (reporter)
2016-03-22 06:19

When I look more closely, I think you can forget the fifth issue in the
list (and the third of the desired actions), I missed the phrase "when
considered special" in 2.2.3 (the description of \ inside "...").

But I think you can replace that one with a request that the "<newline>" in
"begins after the next <newline>" (2nd paragraph in 2.7.4) should be
changed to "NEWLINE token" (as is done in the description of token recognition
in section 2.3, 2nd paragraph). The difference is that in

        nl='
        '

there is a <newline> but no NEWLINE token. I think it is clear that in
a sequence like

        cat <<EOF; nl='
        '
        line 1
        line 2
        EOF

the here document starts at "line 1", not at the line containing just the '
character, even though that is the line after the next <newline>.

Similarly, there is a <newline> in the \
sequence (\ followed immediately by newline), that is also not a NEWLINE
token, and a here document would not start after that either.
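
A sketch for checking the quoted-newline case in a given shell, assuming the here document is collected after the next NEWLINE token (not the next <newline> character). The command substitution and the variable name `out` are only there to capture what cat receives.

```shell
# The newline inside nl='...' is part of a word, not a NEWLINE token,
# so the here document is expected to begin at "line 1".
out=$(cat <<EOF; nl='
'
line 1
line 2
EOF
)

printf '%s\n' "$out"
```

If the shell instead began collecting at the first <newline> character, the body would start with the line containing only the ' character.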

This still leaves the question of NEWLINE tokens embedded in command substitutions and similar, where the << operator was outside.
(0003098)
joerg (reporter)
2016-03-22 19:52
edited on: 2016-03-23 11:15

I did not yet check the other cases but you are mistaken with respect to

   cat << ""EOF
   $$
   EOF

as it expands to the process id of the Bourne Shell and even ksh88 and ksh93 document exactly this behavior.

From the ksh93 documentation:

   If any character of word is quoted,
   then no interpretation is placed upon the
   characters of the document. Otherwise,
   parameter expansion, command substitution, and
   arithmetic substitution occur,

so it seems that you discovered a ksh bug.

Check the original Bourne Shell at:

http://schilytools.sourceforge.net/bosh.html

to verify that your example causes parameter substitution.

Note that the documentation from the Bourne Shell, ksh88, ksh93,
bash, mksh and zsh clearly mention that no expansion occurs when
any of the characters from word is quoted. The dash man page is
not written clearly and thus does not help.

Looking at the ksh88 source, it is obvious that the deviating
behavior from ksh is an unintended side-effect of the rewritten
field splitting code that is used to strip off the quoting from
"word".

(0003099)
geoffclare (manager)
2016-03-23 09:29

We are already fixing the "any character in word" problem: TC2 changes it to "any part of word". See 0000583
(0003100)
joerg (reporter)
2016-03-23 10:57

Geoff, it seems that this was a mistake as from what I can tell,
the ksh behavior was changed unintentionally and the new text
is in conflict with both the documentation and the behavior
of the Bourne Shell.
(0003101)
geoffclare (manager)
2016-03-23 11:30

No it is not a mistake. That is how ksh88 behaves, and the POSIX shell was based on ksh88 not Bourne.
(0003103)
kre (reporter)
2016-03-24 00:04

Thanks for the pointer to issue 583 - I actually did a search (I looked
for references to 2.7.4) and the search did not produce that one...
But that resolution is fine for that point, though you might want to
amend the language just a little more to handle the case where the
whole redirection (including the delimiter word) is in a quoted environment
(making it clear that quotes need to be explicit in "word" itself to count
as quoting existing for this purpose). The suggestion (in 583) that
"if quote removal changes the word, it was quoted" seems about right to me,

kre
(0003104)
kre (reporter)
2016-03-24 00:18

To make the first point in the issue more clear (or one aspect of it
anyway), consider the following ...

unset X
cat <<EOF
${X=2}
EOF
echo "${X-1}"

No question but that the output from cat is "2" (a line containing 2),
but what value does the echo line print, 1 or 2 ? That all depends upon the
context in which the here doc is evaluated. If it is in the context of the
shell running the script, then the answer is 2. On the other hand if it is
in the context being established to run cat, then 1 would be the answer.

My testing shows shells (I have to test) about equally divided on this issue,
but I would have expected that 1 makes most sense, given that the here document
is processed at the correct point in the sequence of redirections.
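
Since shells are divided here, a probe like the following can show which way each installed shell goes. It is only a sketch: the list of shell names is an assumption about what might be installed, and cat's own output is discarded so that only the answer from echo is recorded.

```shell
# The snippet from above, with cat's output sent to /dev/null.
script='unset X
cat <<EOF >/dev/null
${X=2}
EOF
echo "${X-1}"'

results=""
# Candidate shell names only; any that are not installed are skipped.
for candidate in sh bash dash ksh mksh zsh yash; do
    command -v "$candidate" >/dev/null 2>&1 || continue
    answer=$("$candidate" -c "$script" 2>/dev/null) || answer="error"
    results="$results$candidate=$answer "
done

echo "$results"
```

An answer of 2 means the assignment made during here-document expansion was visible to the invoking shell; 1 means it happened in the context set up for cat.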

kre
(0003105)
kre (reporter)
2016-03-24 00:28

For the sixth point, consider this example

cat > File1 <<EOF 2>File2
lines of text
but no line containing "EOF"
and the script ends right here.

There are several possibilities here, one is that the here doc with no
end delimiter is a lexical or parser level syntax error, and nothing else
is done with the command at all (a non-interactive shell would exit).
That is the one I prefer...

Another is that the faulty here document is discovered during redirect
processing, after File1 is created, before File2 is created, and things
stop in that state, again with a syntax error, but later in the processing.

Third is that it isn't an error at all, cat is run, the data present is
sent to File1, File2 is created, but empty, as there are no errors. Exit
status would be 0. This solution (though seemingly quite common) I do
not like at all, as this almost always indicates some kind of user error.
Giving either half the intended data or, just as likely, more than
intended when the end delimiter is entered in an incorrect way, is just as
bad as being unable to open a file in a normal '<' redirection and just
going ahead and substituting some other file (/dev/null maybe) instead.
Not sane.
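
Which of the three possibilities a particular shell implements can be observed with a probe along these lines. This is a sketch under assumptions: `dir` is a hypothetical scratch directory, and `sh` stands for whichever shell is being examined. No particular outcome is claimed, since that is exactly what differs between shells.

```shell
# Write the unterminated example into a scratch directory and run it there.
dir=$(mktemp -d)
printf '%s\n' 'cat > File1 <<EOF 2>File2' \
    'lines of text' \
    'but no line containing "EOF"' \
    'and the script ends right here.' > "$dir/script"

status=0
(cd "$dir" && sh ./script) >/dev/null 2>&1 || status=$?

# Record which redirections were actually performed.
file1=absent; [ -e "$dir/File1" ] && file1=present
file2=absent; [ -e "$dir/File2" ] && file2=present

echo "exit status: $status, File1: $file1, File2: $file2"
rm -rf "$dir"
```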

kre
(0003124)
chet_ramey (reporter)
2016-04-04 19:52

I'm interested in what the group would like to do about the
EOF-as-here-document-delimiter issue.

As kre says, just about every shell I looked at allows EOF to delimit a
here document (mksh is the notable exception). It's clearly existing
practice, but the standard is silent. Does this render all these shells
non-conformant?

The other interesting case is whether or not a shell allows an instance of
the delimiter immediately followed by the end of a command substitution
(`)' or ``') to delimit a here-document. For example, what should the
following output?

x=$(cat <<EOF
a
b
EOF)
echo "$x"
echo after

There are varying behaviors.

Shells allowing delimiter+right paren to delimit here document in $(...):
ksh93, bash, mksh, zsh, posh
Shells that do not: dash, BSD(s) sh

Shells allowing delimiter+backquote to delimit here document in `...`:
ksh93, SVR4.2 sh, bash, mksh, zsh, posh, dash, BSD(s) sh

Even in this there is varying behavior: dash uses EOF (in the form of the
end of the `...` command substitution) as the delimiter and includes the
delimiter word as part of the here document.
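
Because the behaviour varies, the example is safest to try in a child shell, so that a syntax error does not take the calling script down with it. In the sketch below, `sh` stands for whichever shell is under test and no specific result is assumed.

```shell
# The example from above, held in a string so the calling shell never parses it.
script='x=$(cat <<EOF
a
b
EOF)
echo "$x"
echo after'

status=0
out=$(sh -c "$script" 2>&1) || status=$?

printf 'exit status: %s\n%s\n' "$status" "$out"
```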

Is it worthwhile to add text saying the behavior is unspecified if the
shell encounters end-of-file before finding the here-document delimiter?
What about the command substitution case?
(0003125)
jilles (reporter)
2016-04-04 21:48

Given that $(case x in x) : ;; esac) is a single valid command substitution, I would expect the following two to be as well:

$(cat <<EOF &&
EOF)
EOF
:)

$(if :; then cat <<EOF
EOF)
EOF
fi)

If that is accepted but:

$(cat <<EOF
x
EOF)

is also to be a single valid command substitution, detecting whether the end marker with closing parenthesis is an end marker is rather complicated.

There is no such issue with:

`cat <<EOF
x
EOF`
(0003126)
joerg (reporter)
2016-04-05 12:20

`cat <<EOF
x
EOF`

is parsed in a different way than

$(cat <<EOF
x
EOF)

The first one is parsed on a lexical base, i.e. the next unescaped "`"
is searched for, before any instance in the shell tries to understand
the here document.

In the second form, there is a need to recursively call the parser in
order to understand where the end of the $() command is. This is caused
by the fact that the number of opening and closing parentheses in a
command is not always equal. For this reason, with $(), the here document
is read in during the lexical scan already, because the lexical scan calls
a recursive parser.

Note that this is a POSIXLY correct command:

echo $(if cat <<EOF
1
2
3
EOF)
EOF
then
echo a
fi)

but it is not accepted by ksh93 because ksh93 implements a funny
recognition of "EOF)".
(0003127)
kre (reporter)
2016-04-05 13:58

Re note 3124:
    Is it worthwhile to add text saying the behavior is unspecified if the
    shell encounters end-of-file before finding the here-document delimiter?

I would say yes - along with an admonition on applications to always supply
the end delimiter.

The NetBSD shell is another which (now) complains about here docs without an
end delim (treats it as a syntax error), which I think is far and away the
best thing for shells to do - and I would encourage all of you who are
implementers to do that. Since we made that change, we have (as far as I
know) encountered exactly 1 script which was working only because of the
"eof terminates a here doc" behaviour - and that one was almost certainly an
accident (it was one of many scripts, several others of which also had here
docs that ran to the end of the script, and only that one was missing the
end delimiter). So, I would not be too worried that you will be breaking
large numbers of scripts that are relying on EOF delimiting here docs, the
users don't seem to know about this, or find it simple enough to add the
string just before EOF...

On the other hand, accidentally getting the here doc end delimiter incorrect
is easy to do - say a space amongst the leading tabs that are to be stripped,
or a simple typo - having that silently cause the rest of the script to be
treated as here doc content, silently, and simply carrying on working, is
not friendly.

On:
      What about the command substitution case?

That one should simply be regarded as incorrect, the spec is already clear
on what makes an end delimiter, and a ) following the string is not it, the
newline before the ) is required. (And yes, I agree, the `` case is
quite different, in all ways, even though it seems initially to be just a
different, more difficult to nest, equivalent.)
(0003134)
kre (reporter)
2016-04-07 00:46

Since it seems agreed that there is no consistency on what happens if
a here doc is not terminated, and that applications neither need to, nor
ever should, rely upon any particular behaviour there (and in practice,
do not seem to), I suggest that the following wording be added to the end
of the normative text in section 2.7.4 (just before the informative example).
[Sorry, I do not have page or line numbers - someone else will need to
add those.]

    The effect of failing to detect the here-document delimiter before the
    shell exhausts its input stream is unspecified. Applications shall
    ensure the delimiter is present.

And perhaps in a rationale section somewhere...

    Traditional shell behaviour has been to treat "end of file" as being
    equivalent to the delimiter of a here document, terminating the here
    document, usually without any indication, and continuing as if the
    delimiter had been recognised. This can cause problems where the
    delimiter had been intended to occur much earlier in the script, but
    was incorrectly entered - a mistake which for many other errors would
    have resulted in a syntax error, and an aborted script, instead simply
    generates incorrect results. Because of this some shell implementations
    have changed to reporting an undelimited here document as a syntax error.
    Other implementations are encouraged to do the same.

or maybe something less wordy with similar effect...

The other issues still need resolution.
(0003135)
kre (reporter)
2016-04-07 01:05

Another of my original issues, in which I believe (hope) there is no
dissent (as in, I believe all shells act this way, it is just not, yet,
written in the standard), also add to the normative text of section 2.7.4,
in the paragraph that begins:

    If no characters in word are quoted, ...

add the following sentence after the initial sentence of the paragraph
(again, sorry, page/line numbers not available to me):

    This expansion happens after the here document delimiter has been
    recognised and the here document extracted from the input stream,
    and thus the end delimiter for the here document cannot be generated
    as a by-product of the expansion.

And to make things even clearer (I believe this is how shells behave, but
am less certain that this is universal), at the end of that same paragraph
add

    When an unquoted backslash is followed by a newline, line joining occurs,
    and the backslash newline combination is removed. This occurs while the
    here document is being scanned, the end delimiter will not be recognised
    immediately after a newline that has been deleted in this way.
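
A short sketch of the line-joining behaviour described in the proposed wording, contrasting an unquoted delimiter with a quoted one. The variable names are illustrative.

```shell
# Unquoted delimiter: backslash-newline is removed and the lines are joined.
joined=$(cat <<EOF
one \
two
EOF
)

# Quoted delimiter: the backslash and the newline are both kept literally.
literal=$(cat <<'EOF'
one \
two
EOF
)

printf 'unquoted delimiter: %s\n' "$joined"
printf 'quoted delimiter:\n%s\n' "$literal"
```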
(0003572)
stephane (reporter)
2017-02-25 14:12

About your sixth (last) point, I'm not sure I see a problem. It's currently unspecified, so applications have to make sure their delimiter is provided, and implementations can do what they want when it's not: issue a warning or error message, or ignore the problem, all of which are valid approaches to me.

There's also the case of

eval "cat << EOF"
xxx
EOF

(and the same with the "." command) that may need to be covered.
(0003592)
kre (reporter)
2017-03-03 15:57

Re note 3572 ... I agree that the behaviour is unspecified in the literal
sense (in that the spec says nothing about it at all), I do not agree that
is adequate however - if unspecified behaviour is what is expected in this
case (and I'd certainly accept that as an outcome for that point), it ought
to be explicitly unspecified, not just literally.

kre
(0003690)
kre (reporter)
2017-05-11 13:46

When this issue reaches the head of the queue, it might be worthwhile
spending a minute or two on the subject of \newline continuation lines
in here docs, and their effect on tab suppression, and end-string recognition.

For the first of those, given

cat <<-EOF
        \
        X
EOF

(where the white space is supposed to represent a tab character, but is
spaces here, as I cannot seem to input a tab in the form...) what exactly
is expected to be written to stdout? That is, is the tab before X a
leading tab, or not?

For the second, the following script (with the same caveat about spaces and
tabs...)

EOF() { printf 'EOF executed as a command\n'; }
cat <<-EOF
        \
EOF
EOF

executes differently in different shells, in some it simply says "EOF"
(where the second EOF is the end-string) and in others it says
"EOF executed as a command" where the first EOF is the end string.
(Here similar things happen without the tab stripping if the \ line
contains only a \ character).
(0003691)
joerg (reporter)
2017-05-11 14:01
edited on: 2017-05-11 14:05

Re: Note: 0003690

With your first example, all shells except ksh93 seem to print just a
X

With your second example, the historic Bourne Shell, ksh88, bash,
bosh, mksh print

"EOF executed as a command"

While ksh93, dash, yash, zsh are non-compliant.

You discovered an interesting aspect.

(0003756)
kre (reporter)
2017-06-10 00:58

There is another case worth considering ...

cat <<'EOF
REALLY'
Hello
EOF
REALLY

Is this supposed to work, or not? I see nothing in the text that prohibits
a \n as one of the characters of the end delimiter (obviously, like spaces,
and other operator type characters, it can only occur in a quoted delimiter)

My tests show that yash simply forbids it. Nothing else I tested does, though
when the here doc delimiter contains a \n, none of bash, zsh, mksh, or bosh
seem to recognise anything as the delimiter. The ash derived shells (dash,
FreeBSD, NetBSD) and ksh93 all just say "Hello" when given the command above
(the two line end delimiter is handled just fine .. I didn't test more than
two, but I am fairly confident that at least the FreeBSD and NetBSD shells
would handle as many embedded \n's as you want to give - probably the others
as well.)

- Issue History
Date Modified Username Field Change
2016-03-22 03:18 kre New Issue
2016-03-22 03:18 kre Name => Robert Elz
2016-03-22 03:18 kre Section => 2.7.4
2016-03-22 03:18 kre Page Number => unknown
2016-03-22 03:18 kre Line Number => unknown
2016-03-22 06:19 kre Note Added: 0003097
2016-03-22 19:52 joerg Note Added: 0003098
2016-03-23 09:29 geoffclare Note Added: 0003099
2016-03-23 09:30 geoffclare Relationship added related to 0000583
2016-03-23 10:55 joerg Note Edited: 0003098
2016-03-23 10:57 joerg Note Added: 0003100
2016-03-23 11:13 joerg Note Edited: 0003098
2016-03-23 11:15 joerg Note Edited: 0003098
2016-03-23 11:30 geoffclare Note Added: 0003101
2016-03-24 00:04 kre Note Added: 0003103
2016-03-24 00:18 kre Note Added: 0003104
2016-03-24 00:28 kre Note Added: 0003105
2016-04-04 19:52 chet_ramey Note Added: 0003124
2016-04-04 21:48 jilles Note Added: 0003125
2016-04-05 12:20 joerg Note Added: 0003126
2016-04-05 13:58 kre Note Added: 0003127
2016-04-07 00:46 kre Note Added: 0003134
2016-04-07 01:05 kre Note Added: 0003135
2016-04-07 05:42 Don Cragun Page Number unknown => 2335-2336
2016-04-07 05:42 Don Cragun Line Number unknown => 74235-74256
2016-04-07 05:42 Don Cragun Interp Status => ---
2017-02-25 14:12 stephane Note Added: 0003572
2017-03-02 16:18 nick Relationship added related to 0001037
2017-03-03 15:57 kre Note Added: 0003592
2017-03-23 16:11 nick Relationship added related to 0001043
2017-05-11 13:46 kre Note Added: 0003690
2017-05-11 14:01 joerg Note Added: 0003691
2017-05-11 14:05 joerg Note Edited: 0003691
2017-06-10 00:58 kre Note Added: 0003756

