Austin Group Defect Tracker

Aardvark Mark IV

ID 0001043
Category [1003.1(2013)/Issue7+TC1] Shell and Utilities
Severity Objection
Type Omission
Date Submitted 2016-04-07 13:16
Last Update 2017-03-23 16:11
Reporter kre View Status public  
Assigned To
Priority normal Resolution Open  
Status New  
Name Robert Elz
User Reference
Section 2.7.4
Page Number 2335-2336
Line Number 74235-74256
Interp Status ---
Final Accepted Text
Summary 0001043: Which newline starts collection of here document data?
Description The spec for a here doc says that the here doc will begin
after the next newline.

First, let's assume that really means after the next NEWLINE
token, as is written elsewhere, that is in

        sed << FILE_END '
                s/$/: EOL/
                /foo/s/bar/& bletch/

none of the newlines in the quoted string is intended to be the
"next newline" in question. Since none of those is a NEWLINE token
making that simple change avoids problems there. To the best of
my belief, there is no shell that doesn't do it this way already.
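
A minimal sketch of that quoted-string case, assuming a POSIX sh and sed. The function name quoted_word_demo and the two input lines are made up for illustration. The newlines inside the single-quoted sed script belong to a word, so the here document body is only read after the line on which the command ends.

```shell
# Hypothetical demo: the sed script spans several lines inside one quoted
# word; the here-document body starts after the line that closes the quote.
quoted_word_demo() {
    sed << FILE_END '
        s/$/: EOL/
        /foo/s/bar/& bletch/
'
foo bar
plain
FILE_END
}

quoted_word_demo
```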

That still leaves two unresolved issues, probably highly related to
each other, but seemingly different in a sense, and both relate to
subshell environments used in relationship with here docs.

One easy way to see this is to use command substitution to make
the subshell environment, so let's concentrate on that first.

The first issue is: Does a newline token in a command substitution
that starts on the same line (hence no earlier NEWLINE token)
as a here doc redirection operator, count as the NEWLINE token
(meaning the here doc would appear in the middle of the command
substitution, even though it is not used with it in any way) or
is the search for a NEWLINE token interrupted while processing
the text of the command substitution, meaning that the here
document starts not at the "next" NEWLINE token, but at the
NEWLINE token that next appears at the "same parser level"
(which I am sure is not the correct way to say what I mean.)

And second, if a here document redirect operator appears within
a command substitution, does the here document also have to appear
within the same command substitution, even in cases where otherwise
the command substitution would contain no NEWLINE token at all.

Examples to illustrate:

First, given

        cat $( find . -name text-file* -mtime +3
                        -ctime -1 )

is intended to cat a bunch of files found by find (for the purpose
of this example, let's ignore the filename issues raised by doing find
this way, that isn't relevant to the point.)

If we then assume that we also want to prefix the output with a
standard message, we might want to include that in what cat reads
and prints, one way would be
        printf "%s\n" message > /tmp/file

and then

        cat /tmp/file $( find ...
                ... )

but this is a perfect use of here documents, so ...
        cat - $( find ...
                ... ) <<EOF

will clearly work. But that separates the << from the "-" that
uses it, so we may prefer to write

        cat - <<EOF $( find ...

and at this point we need to answer the question, is what follows

        message
        EOF
                ... )

or is what follows

                ... )
        message
        EOF

This one seems to have been implemented both ways by different shells.

A literal reading of the current specification would suggest that the
first way is correct - the "next NEWLINE token" (or even just the next
newline - in this example they are the same) is the one in the middle
of the command substitution, so the here document should start there.

But most people are likely to find that form difficult to comprehend,
and probably even more difficult to write correctly.
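
A sketch of the form that sidesteps the question, assuming a POSIX sh: the redirection operator is written after the command substitution, so the only newline that can start the body is the one ending the command. The name prefix_demo is hypothetical, and printf naming /dev/null stands in for the find command of the example.

```shell
# Hypothetical demo: the command substitution (with its embedded newline)
# is already complete when <<EOF appears, so the body placement is unambiguous.
prefix_demo() {
    cat - $( printf '%s\n' /dev/null
           ) <<EOF
message
EOF
}

prefix_demo
```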

For the second issue consider

        printf "%s\n" $( cat << EOF )
        line 1
        line 2
        EOF

Is that valid, or not? Again, according to a literal reading of the
specification, it is - the next newline token is the one that appears
after the closing ')' of the command substitution. However, many
shells expect that command (inside the $( ) ) to be complete by itself,
and treat the here document referenced there as being empty (delimited
by the "end of file" which is the end of the command substitution string,
and either simply pass an empty file to cat as its stdin, or generate
a syntax error - which of those is appropriate is one of the issues
of 0001036). Shells that don't abort because of a syntax error and
which act this way then go on to attempt to execute "line" and "EOF"
as commands. Other shells simply keep looking for a "next newline"
outside the command substitution, and pass "line 1" and "line 2" to
cat, which eventually gives those lines to printf.

Which of those is correct ?
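
Whatever the answer, the form with the body inside the command substitution avoids the question. A minimal sketch, assuming a POSIX sh; the name inside_demo is made up. Because the substitution is unquoted, its output is split into four words and printf prints each on its own line.

```shell
# Hypothetical demo: operator and body both live inside $( ), so the
# command substitution is complete by itself.
inside_demo() {
    printf "%s\n" $( cat << EOF
line 1
line 2
EOF
    )
}

inside_demo
```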

I should also say that for these, it should make no difference
whether $( ) (new style) command substitution, or `...` (old style)
is used, the same issues arise. I will also say that when old
style is used, no shell I know of (until I changed the NetBSD shell
within the past week) parsed

        cat ` sed 's/-/_/' <<FILENAME ) `
        file-name FILENAME

as the author of that script clearly intended it to be parsed
(the actual script where this was detected was a little more
complicated, and had a better reason to be written in this kind
of way - though it could easily have moved the closing ` to
after the line containing FILENAME).

The same issues arise with ( ) sub-shells

        cat << FOO | ( while read line
                       do
                                whatever || break
                                if [ "$line" = something ]
                                then
                                        something -else
                                fi
                       done )

In that, where is the correct spot to put the here document data?

This one doesn't even have the easy answer "just move the << to later"
that exists in the earlier case. However, it could be written

        cat << FOO |
        (while read line

which I suspect all shells would parse as intended. Much the
same issue arises if the ( ) are not used, as in

        cat << FOO | while read line
        Here doc data here, or not ??
        Or is the here doc data here?
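
A sketch of the "cat << FOO |" form mentioned above, where the line ends right after the pipe so the body follows immediately and the subshell comes afterwards. It assumes a POSIX sh; the name loop_demo, the data lines and the loop body are made up for illustration.

```shell
# Hypothetical demo: the first line ends after '|', so the here-document
# body comes next and the ( while ... ) subshell follows the delimiter.
loop_demo() {
    cat << FOO |
alpha
beta
FOO
    ( while read line
      do
          printf '<%s>\n' "$line"
      done )
}

loop_demo
```
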
And for the other issue
        (cat << FILE1; cat << FILE2) | wc -l
        data for file1
        data for file2
I know, an unlikely command, but still... Is this correct, or should
it be written as

        (cat << FILE1; cat << FILE2
        data for file1
        data for file2
        ) | wc -l
I have no doubt that the second form is correct, but is the first
correct as well?
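
A runnable sketch of the second form, assuming a POSIX sh. The delimiter lines FILE1 and FILE2 are written out in full, and the name count_demo is hypothetical. With two operators on one line, the body for FILE1 comes first and the body for FILE2 follows it.

```shell
# Hypothetical demo: both here-document bodies sit inside the ( ) subshell,
# in the same order as their operators.
count_demo() {
    (cat << FILE1; cat << FILE2
data for file1
FILE1
data for file2
FILE2
    ) | wc -l
}

count_demo
```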

Desired Action For the second issue, I believe a suitable solution is clear.

Add words like

        It is unspecified whether the here document data for
        a here document redirection operator is required to
        occur in the same subshell environment as the operator.
        Applications shall ensure that when a here document
        redirection operator occurs in a subshell environment
        the data is also placed in that same environment.

Though what that does to pipelines, and similar, I am not sure.
That is
        cat << EOF | wc -l
Would that remain valid? If not, how should the wording be fixed
to allow that one to work, while requiring applications to keep
here docs inside () and $( ) and `..` ?
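
For reference, a sketch of the pipeline case that any revised wording would presumably need to keep working, assuming a POSIX sh. The name pipeline_demo and the three data lines are made up.

```shell
# Hypothetical demo: here-document on the first command of a pipeline,
# with the body following the line that contains the whole pipeline.
pipeline_demo() {
    cat << EOF | wc -l
one
two
three
EOF
}

pipeline_demo
```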

For the first issue, the solution appears to be less clear.
I suspect that the best that can be done is ...

        It is unspecified whether a NEWLINE token that appears
        within shell input that is to be executed within a
        sub-shell environment, where the redirection operator
        occurs outside that sub-shell, is the "next NEWLINE token"
        which starts collection of data for the here document.
        Applications shall be coded so as to avoid this.

But that is ugly...
Tags No tags attached.
Attached Files

- Relationships
related to 0001036 New Errors/Omissions in specification of here document redirection
related to 0001037 New The grammar for here documents misses the data body and the final EOF condition

-  Notes
(0003141)
joerg (reporter)
2016-04-08 13:33

Some of your questions are easy to answer once you understand that a
command substitution with $(..) or `..` always is a word or part of a
word.

Given that the shell needs to first collect all characters that form the
word, it is obvious that "the next NEWLINE" must be seen locally first,
in case of a here document that appears to be inside a command substitution.

(0003144)
kre (reporter)
2016-04-09 11:42

Sorry, I have no idea what "must be seen locally first" means.

The point here is that shells interpret these things in different
ways. Perhaps there is something in the spec which makes it clear
which is correct, but personally, I cannot see it.

Perhaps it is obvious which should be correct - and maybe this is
a case where what should be correct (rather than what is actually
implemented) might be specified (since it is rather an outlier in
the syntax) but if that is the case, I cannot come to a conclusion
about what should be correct, and what should not. I know what the
NetBSD shell does in these cases, and I have done some testing of
other shells, but none of that has blessed me with magic enlightenment
of correctness.

ps: I do understand that command substitution is part of a word, but
I cannot fathom how that helps - the actual here document, and the
here document operator that creates it, are separated lexically in the
input. What matters is just how that is to be resolved in some kind
of consistent manner that is more or less in accordance with what works

(0003145)
jilles (reporter)
2016-04-09 15:33

That a command substitution is always part of a WORD or similar token implies that any newlines part of the command substitution are not NEWLINE tokens on that level and do not start here-document contents. For example:

cat - <<EOF $(find .

is a valid command.

A different situation is where the << redirection is within the command substitution and the here-document contents are outside of it. Historically, ash variants have used their implementation technique that fully parses command substitutions when encountered to allow things like:

v=$(cat <<EOF)

in addition to the standard

v=$(cat <<EOF

The ash-specific form violates the statement in XCU 2.6.3 Command Substitution that "all characters following the open parenthesis to the matching closing parenthesis constitute the command", since the here-document contents are outside the parentheses.

More practically, the ash-specific form is hard to parse for implementations that only parse command substitutions to the minimal level necessary to find their end while parsing the outer command and only fully parse them just before execution. I think both implementation techniques (ash-style immediate full parse and bash/ksh93-style minimal immediate parse) should be valid. Changing from the latter to the former technique is likely to break existing scripts that contain invalid command substitutions that are not executed.

The same special form with `...` command substitution:

v=`cat <<EOF`

seems to have no historical basis.
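
By contrast, the `...` form with the here-document contents inside the backquotes is the ordinary one. A minimal sketch, assuming a POSIX sh; the body text "hello" is made up for illustration.

```shell
# Hypothetical demo: operator, body and delimiter are all inside the
# backquotes, so nothing has to be found outside the substitution.
v=`cat <<EOF
hello
EOF
`
printf '%s\n' "$v"
```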

(0003147)
kre (reporter)
2016-04-10 02:27

Perhaps I erred by concentrating so much on command substitution in
the original filing of this issue, it is just that that is where it
first really came to my attention, so ...

But this, from note 3145

    That a command substitution is always part of a WORD or similar token
    implies that any newlines part of the command substitution are not NEWLINE
    tokens on that level

gets right to the crux of the issue, and is what led to the title of this
bug report "which newline ..."

That is, from where, in the standard, do you get the qualification
"on that level". I do not see that anywhere.

If we take that same example, and re-cast it slightly to:

cat - <<EOF ; if find .

then (ignoring the command args, and whether this is a sane way to write
the command) is that a legal command sequence, or not (this time using "if"
and "fi" as the bracketing operators rather than $( ) ).

If this is correct, upon what basis is the newline after "." being ignored?

What if we made it a simple subshell instead ..

cat - <<EOF ; ( find .

Is that one correct? And if so, the same question. There is no "one word"
or even "same level" argument to use here.
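
To make the "if" variant concrete, here is a sketch of what a literal "next NEWLINE token" reading would give: the body directly after the line containing "if", before "then". This assumes a shell that treats that newline as an ordinary NEWLINE token; whether every shell does so is exactly what is being asked, and nothing here settles it. The name if_demo, the use of "true" in place of find, and the body text are made up.

```shell
# Hypothetical demo: under the literal reading, the here-document body is
# read right after the first line, in the middle of the if command.
if_demo() {
    cat - <<EOF ; if true
body
EOF
    then
        echo yes
    fi
}

if_demo
```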

And if those forms are not valid, how exactly do you explain to script
writers how those (particularly the sub-shell version) are different from
the command substitution example in a way they can comprehend?

And while doing this also explain how

     (cat << EOF) | cmd

works in a consistent way (which I am assuming we agree is how it should work).
Or is it required to be written

    (cat << EOF
    ) | cmd

? And if that is required, where is that written? The spec just says that
here doc data comes after the next newline (token) - and we are back to the
topic of the bug report - "which newline (token)" ?

And wrt:

    Historically, ash variants have used their implementation technique that
    fully parses command substitutions when encountered to allow things like:

    v=$(cat <<EOF)

    in addition to the standard

    v=$(cat <<EOF

I have no problem with considering the second of those "standard", but
I am by no means convinced that the first is not just as standard. I
see nothing written currently that makes it so - maybe the ash technique is
how those things should be parsed? Or maybe the doc is just deficient
and needs fixing?

Note: I have no particular axe to grind here, I am not advocating one result
over another (which the wording I proposed adding, as poor and sloppy as it
was, should, I hope, make clear.) What I would like to see happen is for
some resolution to be reached so that this same discussion doesn't have to
happen again sometime in the future, when perhaps there is actually something
important riding on the outcome.

Lastly, I agree that the form:

    v=`cat <<EOF`

seems to have never been implemented (previously) anywhere. However I saw it
used in an actual script (one I did not write - rather, one I got bug reports
about when I made NetBSD's sh start to object to, rather than simply ignore,
missing here document data - previously the script had been parsed without
error, after my earlier change, it no longer was, and that was brought to my
attention as a problem caused by my first change.)

Now the script in question had other errors, it could never have actually
worked as intended, so it is not really a good example to use, but when I
thought about it, I could find nothing in the standard to forbid this (that
the command actually embedded in the `...` did not do what its author intended
was not material), if anything the "next newline (token)" wording seems to
explicitly allow it.

It turned out to be easy to "fix" (and looked to be something of an oversight
caused by the way that `...` type command substitutions are parsed, that it
had not worked all along) so I did. That handled the "bug report" ... the
script in question still doesn't work, but that doesn't matter, it has no
syntax error any more, so it parses "correctly" (even if differently than
before) and the actual command sequence is, in practice, never used anyway.
So, everyone was happy...

- Issue History
Date Modified Username Field Change
2016-04-07 13:16 kre New Issue
2016-04-07 13:16 kre Name => Robert Elz
2016-04-07 13:16 kre Section => 2.7.4
2016-04-07 13:16 kre Page Number => 2335-2336
2016-04-07 13:16 kre Line Number => 74235-74256
2016-04-08 13:33 joerg Note Added: 0003141
2016-04-09 11:42 kre Note Added: 0003144
2016-04-09 15:33 jilles Note Added: 0003145
2016-04-10 02:27 kre Note Added: 0003147
2017-03-23 16:11 nick Relationship added related to 0001036
2017-03-23 16:11 nick Relationship added related to 0001037
