0001043: Which newline starts collection of here document data?

ID	Project	Category	View Status	Date Submitted	Last Update

0001043	1003.1(2013)/Issue7+TC1	Shell and Utilities	public	2016-04-07 13:16	2022-01-06 17:24

Reporter	kre	Assigned To
Priority	normal	Severity	Objection	Type	Omission
Status	Closed	Resolution	Duplicate

Name	Robert Elz
Organization
User Reference
Section	2.7.4
Page Number	2335-2336
Line Number	74235-74256
Interp Status	---
Final Accepted Text


Summary	0001043: Which newline starts collection of here document data?
Description	The spec for a here doc says that the here doc will begin after the next newline.

First, let's assume that really means after the next NEWLINE token, as is written elsewhere. That is, in

    sed << FILE_END '
    s/$/: EOL/
    22i\
    --------------
    /foo/s/bar/& bletch/
    '

none of the newlines in the quoted string is intended to be the "next newline" in question. Since none of those is a NEWLINE token, making that simple change avoids problems there. To the best of my belief, there is no shell that doesn't do it this way already.

That still leaves two unresolved issues, probably highly related to each other, but seemingly different in a sense, and both relate to subshell environments used in relationship with here docs. One easy way to see this is to use command substitution to make the subshell environment, so let's concentrate on that first.

The first issue is: does a NEWLINE token in a command substitution that starts on the same line (hence no earlier NEWLINE token) as a here doc redirection operator count as the NEWLINE token (meaning the here doc would appear in the middle of the command substitution, even though it is not used with it in any way)? Or is the search for a NEWLINE token interrupted while processing the text of the command substitution, meaning that the here document starts not at the "next" NEWLINE token, but at the NEWLINE token that next appears at the "same parser level" (which I am sure is not the correct way to say what I mean)?

And second, if a here document redirect operator appears within a command substitution, does the here document also have to appear within the same command substitution, even in cases where otherwise the command substitution would contain no NEWLINE token at all?

Examples to illustrate. First, suppose that

    cat $( find .
           -name text-file* -mtime +3 -ctime -1 )

is intended to cat a bunch of files found by find (for the purpose of this example, let's ignore the filename issues raised by doing find this way, that isn't relevant to the point). If we then assume that we also want to prefix the output with a standard message, we might want to include that in what cat reads and prints. One way would be

    printf "%s\n" message > /tmp/file

and then

    cat /tmp/file $( find ...
                     ... )

but this is a perfect use of here documents, so

    cat - $( find ...
             ... ) <<EOF
    message
    EOF

will clearly work. But that separates the << from the "-" that uses it, so we may prefer to write

    cat - <<EOF $( find ...

and at this point we need to answer the question: is what follows that

    message
    EOF
    ... )

or is what follows

    ... )
    message
    EOF

This one seems to have been implemented both ways by different shells. A literal reading of the current specification would suggest that the first way is correct - the "next NEWLINE token" (or even just the next newline - in this example they are the same) is the one in the middle of the command substitution, so the here document should start there. But most people are likely to find that form difficult to comprehend, and probably even more difficult to write correctly.

For the second issue consider

    printf "%s\n" $( cat << EOF )
    line 1
    line 2
    EOF

Is that valid, or not? Again, according to a literal reading of the specification, it is - the next NEWLINE token is the one that appears after the closing ')' of the command substitution.
However, many shells expect that command (inside the $( ) ) to be complete by itself, and treat the here document referenced there as being empty (delimited by the "end of file" which is the end of the command substitution string), and either simply pass an empty file to cat as its stdin, or generate a syntax error (which of those is appropriate is one of the issues of 0001036). Shells that don't abort because of a syntax error and which act this way then go on to attempt to execute "line" and "EOF" as commands. Other shells simply keep looking for a "next newline" outside the command substitution, and pass "line 1" and "line 2" to cat, which eventually gives those lines to printf. Which of those is correct?

I should also say that for these, it should make no difference whether $( ) (new style) command substitution, or `...` (old style) is used, the same issues arise. I will also say that when old style is used, no shell I know of (until I changed the NetBSD shell within the past week) parses

    cat ` sed 's/-/_/' <<FILENAME ) `
    file-name
    FILENAME

as the author of that script clearly intended it to be parsed (the actual script where this was detected was a little more complicated, and had a better reason to be written in this kind of way - though it could easily have moved the closing ` to after the line containing FILENAME).

The same issues arise with ( ) sub-shells

    cat << FOO | ( while read line
    do
        whatever || break
    done
    if [ "$line" = something ]
    then
        something -else
    fi )

In that, where is the correct spot to put the here document data? This one doesn't even have the easy answer "just move the << to later" that exists in the earlier case. However, it could be written

    cat << FOO |
    data
    data
    data
    FOO
    (while read line
    do
        ...
    )

which I suspect all shells would parse as intended. Much the same issue arises if the ( ) are not used, as in

    cat << FOO | while read line
    Here doc data here, or not ??
    do
        ...
    done
    Or is the here doc data here?
And for the other issue

    (cat << FILE1; cat << FILE2) | wc -l
    data for file1
    FILE1
    data for file2
    FILE2

I know, an unlikely command, but still... Is this correct, or should it be written as

    (cat << FILE1; cat << FILE2
    data for file1
    FILE1
    data for file2
    FILE2
    ) | wc -l

I have no doubt that the second form is correct, but is the first correct as well?
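As a hedged illustration of the two placements described above as uncontroversial, the sketch below keeps each here-document body directly after the line holding its << operator, and inside the subshell when the operator is inside one. The function names `enclosed_form` and `pipeline_form` and the sample data are illustrative assumptions, not part of the report, and this shows behaviour believed common to existing shells rather than anything the standard states explicitly.

```shell
# Form 1: both here-document bodies sit inside the ( ) that holds the operators.
enclosed_form() {
    (cat <<FILE1; cat <<FILE2
data for file1
FILE1
data for file2
FILE2
    ) | wc -l
}

# Form 2: the body follows the line ending in "|", before the compound command.
pipeline_form() {
    cat <<FOO |
alpha
beta
FOO
    ( while read -r line; do printf '[%s]' "$line"; done )
}

enclosed_form    # counts the two data lines
pipeline_form    # wraps each data line in brackets
```

The disputed forms (body after the closing parenthesis, or body in the middle of a compound command) are left out on purpose, since their result depends on the shell.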
Desired Action	For the second issue, I believe a suitable solution is clear. Add words like

    It is unspecified whether the here document data for a here document redirection operator is required to occur in the same subshell environment as the operator. Applications shall ensure that when a here document redirection operator occurs in a subshell environment the data is also placed in that same environment.

Though what that does to pipelines, and similar, I am not sure. That is

    cat << EOF | wc -l
    data
    EOF

Would that remain valid? If not, how should the wording be fixed to allow that one to work, while requiring applications to keep here docs inside () and $( ) and `..` ?

For the first issue, the solution appears to be less clear. I suspect that the best that can be done is ...

    It is unspecified whether a NEWLINE token that appears within shell input that is to be executed within a sub-shell environment, where the redirection operator occurs outside that sub-shell, is the "next NEWLINE token" which starts collection of data for the here document. Applications shall be coded so as to avoid this.

But that is ugly...
Tags	No tags attached.

joerg 2016-04-08 13:33 reporter bugnote:0003141	Some of your questions are easy to answer once you understand that a command substitution with $(..) or `..` always is a word or part of a word. Given that the shell needs to first collect all characters that form the word, it is obvious that "the next NEWLINE" must be seen locally first, in case of a here document that appears to be inside a command substitution.

kre 2016-04-09 11:42 reporter bugnote:0003144	Sorry, I have no idea what "must be seen locally first" means. The point here is that shells interpret these things in different ways. Perhaps there is something in the spec which makes it clear which is correct, but personally, I cannot see it. Perhaps it is obvious which should be correct - and maybe this is a case where what should be correct (rather than what is actually implemented) might be specified (since it is rather an outlier in the syntax) but if that is the case, I cannot come to a conclusion about what should be correct, and what should not. I know what the NetBSD shell does in these cases, and I have done some testing of other shells, but none of that has blessed me with magic enlightenment of correctness. ps: I do understand that command substitution is part of a word, but I cannot fathom how that helps - the actual here document, and the here document operator that creates it, are separated lexically in the input. What matters is just how that is to be resolved in some kind of consistent manner that is more or less in accordance with what works today.

jilles 2016-04-09 15:33 reporter bugnote:0003145	That a command substitution is always part of a WORD or similar token implies that any newlines part of the command substitution are not NEWLINE tokens on that level and do not start here-document contents. For example:

    cat - <<EOF $(find .
    )
    message
    EOF

is a valid command.

A different situation is where the << redirection is within the command substitution and the here-document contents are outside of it. Historically, ash variants have used their implementation technique that fully parses command substitutions when encountered to allow things like:

    v=$(cat <<EOF)
    &
    EOF

in addition to the standard

    v=$(cat <<EOF
    &
    EOF
    )

The ash-specific form violates the statement in XCU 2.6.3 Command Substitution that "all characters following the open parenthesis to the matching closing parenthesis constitute the command", since the here-document contents are outside the parentheses. More practically, the ash-specific form is hard to parse for implementations that only parse command substitutions to the minimal level necessary to find their end while parsing the outer command and only fully parse them just before execution.

I think both implementation techniques (ash-style immediate full parse and bash/ksh93-style minimal immediate parse) should be valid. Changing from the latter to the former technique is likely to break existing scripts that contain invalid command substitutions that are not executed.

The same special form with `...` command substitution:

    v=`cat <<EOF`
    &
    EOF

seems to have no historical basis.
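A minimal sketch of the form this note calls standard, with the here-document body and its delimiter both placed before the closing parenthesis. The variable name `v` comes from the note; the `printf` is an addition here, only to display the result. The ash-specific form is not shown as runnable code because whether it parses depends on the shell.

```shell
# Here-document contents lie between "$(" and the matching ")".
v=$(cat <<EOF
&
EOF
)
printf '%s\n' "$v"    # prints &
```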

kre 2016-04-10 02:27 reporter bugnote:0003147	Perhaps I erred by concentrating so much on command substitution in the original filing of this issue, it is just that that is where it first really came to my attention, so ...

But this, from note 3145

    That a command substitution is always part of a WORD or similar token implies that any newlines part of the command substitution are not NEWLINE tokens on that level

gets right to the crux of the issue, and which led to the title of this bug report "which newline ...". That is, from where, in the standard, do you get the qualification "on that level"? I do not see that anywhere.

If we take that same example, and re-cast it slightly to:

    cat - <<EOF ; if find .
    fi
    message
    EOF

then (ignoring the command args, and whether this is a sane way to write the command) is that a legal command sequence, or not (this time using "if" and "fi" as the bracketing operators rather than $( ) )? If this is correct, upon what basis is the newline after "." being ignored here?

What if we made it a simple subshell instead ..

    cat - <<EOF ; ( find .
    )
    message
    EOF

Is that one correct? And if so, the same question. There is no "one word" or even "same level" argument to use here.

And if those forms are not valid, how exactly do you explain to script writers how those (particularly the sub-shell version) are different from the command substitution example in a way they can comprehend? And while doing this also explain how

    (cat << EOF) | cmd
    data
    EOF

works in a consistent way (which I am assuming we agree is how it should work). Or is it required to be written

    (cat << EOF
    data
    EOF
    ) | cmd

? And if that is required, where is that written? The spec just says that here doc data comes after the next newline (token) - and we are back to the topic of the bug report - "which newline (token)" ?
And wrt:

    Historically, ash variants have used their implementation technique that fully parses command substitutions when encountered to allow things like:

        v=$(cat <<EOF)
        &
        EOF

    in addition to the standard

        v=$(cat <<EOF
        &
        EOF
        )

I have no problem with considering the second of those "standard", but I am by no means convinced that the first is not just as standard. I see nothing written currently that makes it so - maybe the ash technique is how those things should be parsed? Or maybe the doc is just deficient and needs fixing?

Note: I have no particular axe to grind here, I am not advocating one result over another (which the wording I proposed adding, as poor and sloppy as it was, should, I hope, make clear.) What I would like to see happen is for some resolution to be reached so that this same discussion doesn't have to happen again sometime in the future, when perhaps there is actually something important riding on the outcome.

Lastly, I agree that the form:

    v=`cat <<EOF`
    &
    EOF

seems to have never been implemented (previously) anywhere. However I saw it used in an actual script (one I did not write - rather, one I got bug reports about when I made NetBSD's sh start to object to, rather than simply ignore, missing here document data - previously the script had been parsed without error, after my earlier change, it no longer was, and that was brought to my attention as a problem caused by my first change.) Now the script in question had other errors, it could never have actually worked as intended, so it is not really a good example to use, but when I thought about it, I could find nothing in the standard to forbid this (that the command actually embedded in the `...` did not do what its author intended was not material), if anything the "next newline (token)" wording seems to explicitly allow it.
It turned out to be easy to "fix" (and looked to be something of an oversight caused by the way that `...` type command substitutions are parsed, that it had not worked all along) so I did. That handled the "bug report" ... the script in question still doesn't work, but that doesn't matter, it has no syntax error any more, so it parses "correctly" (even if differently than before) and the actual command sequence is, in practice, never used anyway. So, everyone was happy...
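For comparison with the `...` form discussed in this note, here is a sketch of the alternative placement mentioned in the Description, where the closing backquote is moved to after the delimiter line. The variable name and data mirror the example from note 3145. That this placement is accepted by existing shells is an assumption based on common practice, not on explicit wording in the standard.

```shell
# The here-document body and delimiter both precede the closing backquote.
v=`cat <<EOF
&
EOF
`
printf '%s\n' "$v"    # prints &
```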

geoffclare 2021-12-17 15:05 manager bugnote:0005563	See bug 1036 0001036:0005561 for a proposed resolution.

Date Modified	Username	Field	Change
2016-04-07 13:16	kre	New Issue
2016-04-07 13:16	kre	Name	=> Robert Elz
2016-04-07 13:16	kre	Section	=> 2.7.4
2016-04-07 13:16	kre	Page Number	=> 2335-2336
2016-04-07 13:16	kre	Line Number	=> 74235-74256
2016-04-08 13:33	joerg	Note Added: 0003141
2016-04-09 11:42	kre	Note Added: 0003144
2016-04-09 15:33	jilles	Note Added: 0003145
2016-04-10 02:27	kre	Note Added: 0003147
2017-03-23 16:11	nick	Relationship added	related to 0001036
2017-03-23 16:11	nick	Relationship added	related to 0001037
2021-09-10 08:47	geoffclare	Relationship added	related to 0001521
2021-12-17 15:05	geoffclare	Note Added: 0005563
2022-01-06 17:22	Don Cragun	Interp Status	=> ---
2022-01-06 17:22	Don Cragun	Status	New => Closed
2022-01-06 17:22	Don Cragun	Resolution	Open => Duplicate
2022-01-06 17:24	Don Cragun	Relationship replaced	duplicate of 0001036

Relationships

duplicate of	0001036	Closed	1003.1(2013)/Issue7+TC1	Errors/Omissions in specification of here document redirection
related to	0001037	Closed	1003.1(2013)/Issue7+TC1	The grammar for here documents misses the data body and the final EOF condition
related to	0001521	Closed	1003.1(2016/18)/Issue7+TC2	here document processing is underspecified