Austin Group Defect Tracker

ID 0001561
Category [Issue 8 drafts] Shell and Utilities
Severity Editorial
Type Enhancement Request
Date Submitted 2022-02-01 00:10
Last Update 2022-11-30 16:35
Reporter calestyo
View Status public
Assigned To
Priority normal
Resolution Accepted As Marked
Status Applied
Product Version Draft 2.1
Name Christoph Anton Mitterer
Organization
User Reference
Section various
Page Number N/A
Line Number N/A
Final Accepted Text Note: 0005795
Summary 0001561: clarify what kind of data shell variables need to be able to hold
Description In:
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33722&limit=100&offset=0&sid=

I raised the question of which data shell variables are required to be able to hold.

In the replies that followed, it became clear that there is some ambiguity with respect to that question:


In:
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33723&limit=100&offset=0&sid=
Geoff Clare brought up that:
»but POSIX clearly requires that a variable can be
assigned any value obtained from a command substitution that does not
include a NUL byte, and specifies utilities that can be used to
generate arbitrary byte values, therefore a variable can contain any
sequence of bytes that does not include a NUL byte.«

Which AFAIU means that shell variables are expected to hold any bytes except NUL, and only the use of these shell variables in certain other constructs (e.g. ${#var}) interprets them as characters according to the current locale.
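
A minimal sketch of that reading, assuming a shell such as bash or dash that stores the bytes unchanged (other shells may behave differently). The variable names v and nbytes are hypothetical and only serve this illustration:

```shell
# Build a value from raw bytes: 'a', the byte 0xFF (never valid in UTF-8), 'b'.
# Octal escapes in the printf format operand are specified by POSIX.
v=$(printf 'a\377b')

# Count the bytes that survive the round trip through the variable.
# tr removes the padding that some wc implementations add.
nbytes=$(printf '%s' "$v" | wc -c | tr -d ' ')

echo "bytes stored in v: $nbytes"
```

If the shell keeps the value byte for byte, the count equals the three bytes that were written.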


It was also pointed out that yash, for example, discards any bytes from shell variables that do not form a validly encoded character:
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33724&limit=100&offset=0&sid=


In:
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33725&limit=100&offset=0&sid=
Chet Ramey pointed out that shell variables are initialised from environment variables, which themselves may contain anything except NUL as their value, as long as everything before the "=" is a valid Name (in the sense of POSIX).
And in the later:
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33731&limit=100&offset=0&sid=
that:
»applications can obviously put whatever they want into the value of an environment variable in envp and call execve.«


In:
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33730&limit=100&offset=0&sid=
Harald van Dijk countered that:
»That is not what POSIX says. It says "The value of an environment variable is a string of characters" (8.1 Environment Variable Definition), and "character" is defined as "a sequence of one or more bytes representing a single graphic symbol or control code" (3 Definitions), with a note that says it corresponds to what C calls a multi-byte character. Environment variables are not specified to allow arbitrary bytes.«


There was some further discussion on whether the definition of command substitution implies that any bytes other than NUL must be storable in shell variables.
One argument was that the wording "<newline> character" is used there; a counter-argument was that this clearly refers *only* to the <newline> itself, which is by definition the same (byte) in every locale.
(For that particular part see also the proposed clarifications in https://www.austingroupbugs.net/view.php?id=1560 ).



In:
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33736&limit=100&offset=0&sid=
I brought up that in addition to what Harald pointed out earlier, in 8.1 Environment Variables it says:
»These strings have the form name=value; names shall not contain the
character '='. For values to be portable across systems conforming to
POSIX.1-2017, the value shall be composed of characters from the
portable character set (except NUL and as indicated below).«

but a bit further down it says, in contradiction:
»The values that the environment variables may be assigned are not
restricted except that they are considered to end with a null byte and
the total space used to store the environment and the arguments to the
process is limited to {ARG_MAX} bytes.«


And in:
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33737&limit=100&offset=0&sid=
I brought up:
»3.368 Standard Output
"An output stream usually intended to be used for primary data output."

And:
3.370 Stream
"Appearing in lowercase, a stream is a file access object that allows access to an ordered sequence of characters, as described by the ISO C standard. Such objects can be created by the fdopen(), fmemopen(), fopen(), open_memstream(), or popen() functions, and are associated with a file descriptor. A stream provides the additional services of user-selectable buffering and formatted input and output; see also STREAM."


This, however, links to Standard I/O Streams ( file:///usr/share/doc/susv4/susv4-2018/functions/V2_chap02.html#tag_15_05 )
which does name byte output modes (fputc and so on).«
Desired Action 1) All the above should be clarified, i.e. which values shell variables hold (bytes vs. characters?) and which of them are *not only allowed*... but *must* be supported by any compliant shell (any byte except NUL)?

Ideally there would be one central place where this is clearly defined (and not just indirectly), e.g. in 2.5 Parameters and Variables

Probably at least the following places are also affected and need some work (see above):
- 2.6.3 Command Substitution
- perhaps (but rather not):
  3.267 Parameter
  3.440 Variable
- 8. Environment Variables (there are at least two places here, which are contradictory)


2) In combination with (1) above, it should also be clarified in 8. Environment Variables whether implementations MUST initialise shell variables from the environment (where the portion before the '=' is a Name) with values "as is" (i.e. with exactly the bytes that were found in char **environ), or whether an implementation would be allowed to transform them (this idea was brought up on help-bash within some discussion) or, e.g., to skip variables that contain an invalid character encoding.
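
To make the question concrete, here is a rough sketch of how one could check what a given shell does with such a value. It assumes that sh on the test system is a shell like bash or dash; RAW and imported are hypothetical names used only here:

```shell
# RAW holds 'a', the byte 0xFF (not a valid UTF-8 sequence), then 'b'.
RAW=$(printf 'a\377b')
export RAW

# Ask a child shell how many bytes it sees in the variable it imported
# from its environment. tr strips padding that some wc versions emit.
imported=$(sh -c 'printf "%s" "$RAW" | wc -c' | tr -d ' ')

echo "bytes seen by the child shell: $imported"
```

A child shell that takes the value over "as is" reports three bytes; a shell that transforms or drops such values would report something else.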


3) Command substitution refers to standard output (but presumably in the sense of it being binary, with NUL causing undefined behaviour). Standard output is defined in 3.368 Standard Output to be a stream, and a stream is defined in 3.370 Stream as working on characters (while e.g. the definitions of fdopen() or fputc() allow for binary data).

So something probably needs to be resolved in at least 3.370 Stream.


4) In 2.5.3 Shell Variables and/or 8.1 Environment Variable Definition it should be clarified what happens to assignments in char **environ whose portion before the first '=' is not a valid 3.235 Name, i.e.:
- is it unspecified
- do they have to be ignored
- may an implementation transform the name somehow (e.g. replace all invalid chars with '_')
- anything else

Thanks,
Chris
Tags tc3-2008
Attached Files

- Relationships
related to 0001560 Applied clarify wording of command substitution
related to 0001562 Applied printf utility: clarify what is (byte) string an what is character string
related to 0001564 Applied clariy on what (character/byte) strings pattern matching notation should work

-  Notes
(0005645)
mirabilos (reporter)
2022-02-01 19:33

For 4) I suggest unspecified; many shells allow arrays to be imported from the environment, and GNU bash even has imported functions… I’d be very much against a requirement to transform them somehow (especially as that would open the door to more attacks), so (shell-locally) extending the permitted values and ignoring the rest is a sane way to go.
(0005647)
calestyo (reporter)
2022-02-01 19:44

I'd agree with that.
(0005649)
chet_ramey (reporter)
2022-02-01 20:52

Re: 4) This has come up before and the current standard contains text to the effect that "applications shall tolerate the presence of such names." The consensus seems to be to pass them through to child processes in their environment but not attempt to create shell variables from them.

It seems to me that arrays don't affect this: they're normal shell variables and have normal shell variable name restrictions.

Shell functions do, since they can have nearly arbitrary names, the bash function name encoding aside. Bash tolerates such names by -- you guessed it -- creating shell functions from them.
(0005650)
kre (reporter)
2022-02-01 23:07

Re 4) again, I agree, unspecified. But re Note: 0005649 I thought that it
was unspecified whether such things are passed through to children. Our shell
creates vars from the environment (exported ones of course) when the names are
valid, and ignores anything else it sees there. When running a child process
all exported sh variables (and nothing else) are passed to the child as its
environment.

I suspect that the question about arrays was assuming that the environment
might contain something like:
    ARRAY[4]=7
if some shell decided to permit exporting individual elements of an array,
rather than the whole array (in the case that a whole array is exported, how
does bash, when importing it, distinguish that from a scalar var whose
contents just happen to be identical to whatever serialized format the
array contents are exported in?)

Lastly, and I'm not sure if this should be considered as part of 2) or added
as a new 5) - but is it anywhere specified what the shell is supposed to do
if it receives an environment which contains
    X=1
    X=2
    X=3
?? Shells will never create such a thing, and I don't believe the env
command can do it either, but purpose written C code can easily set that up.

And POSIX only requires POSIX names to work as shell function names, though
that is a truly stupid limitation (it permits implementations to extend it,
and most do I think). We (NetBSD sh) don't allow exported functions, but
we do allow for every shell to read a startup file (not just interactive shells)
so functions that one might want to export can be written there, and then
are available to all who follow (a script can make its own file and arrange for
its children to read that one, if it wants to).

Lastly, when reading the standard, I believe that we must sometimes use care
in assuming that "characters" means locale specific chars, rather than bytes.
Much of the text is quite old, and has its origins in even older material, and
that far ago, for most concerned people, characters and bytes were the same
thing (which is why what we might now call an int8_t in C started out (and
still is though sometimes as a uint8_t) also called "char").
(0005652)
chet_ramey (reporter)
2022-02-02 15:15

Re: https://www.austingroupbugs.net/view.php?id=1561#c5650

Bash supports both old and newer behavior wrt environment strings. If an environment string doesn't contain `=' or starts with `=' (no name), bash skips it (way, way back in the day, such strings would cause some applications -- notably `sh' -- to seg fault). An environment string that contains an `=', doesn't follow the naming rules for an exported shell function, and has a name that is not a valid shell variable name, gets passed to child processes in their environment but is otherwise ignored.

Bash doesn't export or import array variables. There is code in there to do it, just commented out. It encodes the value as a compound variable assignment, similar to the output of `declare -p'. I've never released a version of bash with that code enabled. One of the concerns has been distinguishing it from a scalar variable with value `( some list of words )'.

There aren't any shells that export arrays as arrays. ksh93 and mksh export element 0 as a scalar variable. yash exports the values as a colon-separated list, but doesn't import the variable as an array. bash and zsh don't allow it.

Bash does export and import shell functions, encoding the names in a way that allows the shell to recognize them as functions.
(0005653)
calestyo (reporter)
2022-02-02 16:39

Re: https://www.austingroupbugs.net/view.php?id=1561#c5650

8.1 Environment Variable Definition says:
"If more than one string in an environment of a process has the same name, the consequences are undefined."
(0005654)
kre (reporter)
2022-02-02 18:44

Re Note: 0005653

Thanks.

Undefined seems a little harsh however, I'd have thought unspecified
should be sufficient for this.
(0005662)
mirabilos (reporter)
2022-02-06 11:18

It absolutely needs to stay unspecified whether “bad names” are passed to child processes. I thought we had fought about this already?

In mksh, the child environment is created from scratch, using exported shell parameters and nothing else. Requiring passthrough here would be insane, a weird hell of trying to deal with setenv/unsetenv/whatever… and explode _more_ in the face of duplicates than the current method.
(0005665)
chet_ramey (reporter)
2022-02-06 18:18

Re: "fighting about it". We may have, I don't remember when.

I don't understand the second paragraph, unless mksh uses setenv and its siblings to create the child environment. Surely you already reject the bad names when importing variables from the environment, so you know when they occur. You could save them in some table and add them to the child environment when you create it.
(0005666)
kre (reporter)
2022-02-06 23:17

Re Note: 0005662 Note: 0005665

ash derived shells work the same way. Valid names in the incoming environment
(except a few like IFS) are imported into the shell's variable table, and exported.

The outgoing environment is built from the exported variables in the shell, and
nothing else.

We could keep the discarded trash from the imported environment, but don't,
and I see no reason to ever do that.
(0005668)
calestyo (reporter)
2022-02-08 15:14

Not sure whether that's relevant, but 2.12. Shell Execution Environment says:

"Variables with the export attribute, along with those explicitly exported for the duration of the command, shall be passed to the utility environment variables"

Doesn't that kinda rule out, that the shell may pass on any variables in its own environment that haven't had a valid Name (in the sense of POSIX) to any executed programs?


- "Variables with the export attribute" => those are all shell variables
- "along with those explicitly exported for the duration of the command" => those are the variable assignments on the command.

- it doesn't rule out explicitly that no others "shall be passed" on... but one could implicitly deduce that (as those environment variables with non-Names are already part of POSIX and not just some vendor extension... so why should POSIX not list them as to be exported or at least call it "unspecified" ... if it were so?)
(0005669)
kre (reporter)
2022-02-09 01:58

Re Note: 0005668:

     but one could implicitly deduce

No, that's not how it works. If the standard requires something to be
done, it will say so. If it requires something not to be done, it will
say so. If it says nothing on an issue, then nothing is required.

If this causes an interoperability problem, then the standard is broken,
and should be fixed.

Here, nothing needs to be done (or perhaps beyond what has already been done)
as it is already unspecified what happens in this case (which means applications
cannot depend upon either behaviour):

From 2.9.1.6 in draft 2.1:

       It is unspecified whether environment variables that were passed
       to the shell when it was invoked, but were not used to initialize
       shell variables (see Section 2.5.3) because they had invalid names,
       are included in the environment passed to execl( ) and (if execl( )
       fails as described above) to the new shell.

Don't get all uptight that it says "environment variables" - as far as the
standard is concerned, everything in the environment is an environment
variable, as that's all the standard defines to go there. Anything with
a valid name (which must be terminated by '=' to be valid) gets turned into
a shell variable, and exported. Everything else has an invalid name, and
can be ignored by the shell (or used for any other purpose the shell desires).
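
Since the behaviour is unspecified, a script can at best observe what a particular shell does. A rough sketch follows; it assumes that the env utility accepts an assignment whose name is not a valid Name (common in practice, but not something the standard promises), and bad-name, GOOD, seen and good are made-up names for this illustration:

```shell
# Start sh with one invalidly named and one validly named environment
# entry, then look at what a child process of that shell receives.
# '|| true' keeps the line from failing when grep counts zero matches.
seen=$(env 'bad-name=1' GOOD=2 sh -c 'env' | grep -c '^bad-name=' || true)

# The validly named entry must become a shell variable.
good=$(env 'bad-name=1' GOOD=2 sh -c 'printf "%s" "$GOOD"')

echo "GOOD as seen by the shell: $good"
echo "bad-name entries passed to the child: $seen"
```

The second echo prints 0 for shells that rebuild the child environment from exported variables only, and 1 for shells that pass unusable entries through.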
(0005795)
geoffclare (manager)
2022-04-11 13:52

Since field splitting is performed on the results of (unquoted) parameter expansions, it is also affected by this issue, but the fix is included in my suggested changes for bug 0001560 so is not repeated here.

Suggested changes...

On page 155 line 5331 section 8.1 Environment Variable Definition, change:
The value of an environment variable is a string of characters.
to:
The value of an environment variable is an arbitrary sequence of bytes, except for the null byte.

On page 155 line 5335 section 8.1 Environment Variable Definition, change:
names shall not contain the character '='.
to:
names shall not contain any bytes that have the encoded value of the character '='.

On page 155 line 5336 section 8.1 Environment Variable Definition, change:
shall be composed of characters from the portable character set
to:
shall be composed of bytes that have the encoded value of characters from the portable character set

On page 155 line 5342 section 8.1 Environment Variable Definition, change:
Other characters may be permitted by an implementation
to:
Other characters, and byte sequences that do not form valid characters, may be permitted by an implementation

On page 2314 line 74531 section 2.5 Parameters and Variables, add a new paragraph:
Parameters can contain arbitrary byte sequences, except for the null byte. The shell shall process their values as characters only when performing operations that are described in this standard in terms of characters.

On page 2316 line 74612 section 2.5.3 Shell Variables, insert:
Shell variables shall be initialized only from environment variables that have valid names.
before:
If a variable is initialized from the environment, ...

On page 2321 line 74857 section 2.6.2 Parameter Expansion, change:
... for substring processing.
to:
... for character substring processing.

After page 3626 line 125374 section C.2.5.3 Shell Variables, add a new paragraph:
Since shell variables are parameters denoted by a name, the shell cannot initialize shell variables from environment variables that do not have a valid name. However, the shell may initialize parameters that do not have valid names from such environment variables.
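
As an illustration of "processed as characters only when the operation is described in terms of characters", the following sketch compares ${#v}, which is specified as a length in characters, across locales. It assumes the C locale uses a single-byte encoding; v and len_c are hypothetical names:

```shell
# Two bytes that together encode one character (U+00E9) in UTF-8.
v=$(printf '\303\251')

# Evaluate ${#v} in a subshell whose locale is C, where each byte
# counts as one character.
len_c=$(LC_ALL=C; printf '%s' "${#v}")

echo "length in the C locale: $len_c"
echo "length in the current locale: ${#v}"
```

The stored bytes are the same in both cases; only the character-oriented operation depends on the locale.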
(0005807)
calestyo (reporter)
2022-04-15 23:38

Re: https://www.austingroupbugs.net/view.php?id=1561#c5795

Some of the experts which all have undoubtedly much more knowledge than me should probably rather review that... nevertheless...


> Since field splitting is performed on the results of
> (unquoted) parameter expansions, it is also affected
> by this issue, but the fix is included in my suggested
> changes for bug 0001560 so is not repeated here.

I assume with fix you mean the introduction of "bytes"/"byte sequence" there?


> Parameters can contain arbitrary byte sequences

I assume that this wording implies in the language of the standard, that conforming shells *MUST* support that, right?


> Shell variables shall be initialized only from
> environment variables that have valid names.

What exactly is the intention of this?

Cause if it's that a shell must not "take over" e.g. '+=whatAWeirdVariable', then this was (at least indirectly) already clear before, by variables being parameters and these, as per 2.5, having names (in the sense of 3.235 Name).

If your intention is, however, to forbid environment variables with invalid names from being "taken over" by transcribing their invalid name into something valid (e.g. replacing any invalid chars with '__<unicode code point>__' - which would of course be prone to collisions), then I think this wording is too unclear/indirect and it should rather be something like:
"Environment variables that don't have valid names must not be made shell variables, not even by transcribing the invalid names or similar means."

However some shells may e.g. wish to provide such invalidly named env vars in a special shell var (like an associative array)... so that should perhaps be allowed?

At least, I personally think, that the current sentence is a bit vague what it exactly means.


> The shell shall process their values as characters
> only when performing operations that are described
> in this standard in terms of characters.

That one is quite nice!


> ... for character substring processing.

May I suggest to rather directly write something like:
"The following four varieties of parameter expansion provide for substring processing, with each of them requiring the value to be a character string."

Especially the "substring" makes it IMO a tiny bit vague... like in:
foo="${binaryString}."
cmdSubstWithTrailingNL=${foo%.}

One could claim: "well,... the '.' is a substring of $foo... and only that is processed as character and matched... so fine"
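
For context, the two lines above resemble the usual sentinel idiom for keeping trailing newlines that command substitution would otherwise strip. A small sketch of that idiom, with out and n as hypothetical variable names:

```shell
# Command substitution removes trailing newlines, so emit a sentinel '.'
# after the real output and strip it again with ${out%.}.
out=$(printf 'line\n\n'; printf '.')
out=${out%.}

# Count the bytes to confirm both newlines survived:
# four letters plus two newlines.
n=$(printf '%s' "$out" | wc -c | tr -d ' ')

echo "bytes kept: $n"
```

The pattern in ${out%.} only has to match the final '.', yet the expansion is applied to the whole value, which is why the wording about character strings matters here.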

Further, this basically assumes the current proposal of https://www.austingroupbugs.net/view.php?id=1562 ... i.e. that pattern matching notation would always be on character strings.
However, on the mailing list, Harald van Dijk offered to look more deeply into that matter, ... So may I kindly request that you defer voting on this particular part of the above changes until Harald has had a chance to do so?



> However, the shell may initialize parameters that do not have valid names from such environment variables.

What's that intended for? To allow positional/special parameters to be initialised from the environment variable?
(0005808)
calestyo (reporter)
2022-04-16 00:22

I tried to go through all the possible points to deal with, that have come up so far in this ticket.

I'd say that no adaptions (like re-iterations) are needed in these:
- 3.239 Parameter, page 62
- 3.393 Variable, page 85
- 2.6.3 Command Substitution, page 2323 (unlike assumed in the description of this issue, I no longer think anything needs to be done there that hasn't already been dealt with in other tickets).




As far as I can see all have been dealt with, except for the following:


- My original point (3) from the "Desired Actions" (the definition of "Stream" using "characters" although it can be bytes)... shall I open another ticket for this to be dealt with?


- My original point (2) from the "Desired Actions" (would a shell be allowed to transform the value or to skip env vars which are not valid characters), is kinda still open.
I mean it's specified now, that any byte values (except NUL) need to be supported, but not ruled out whether shells might still do any fancy transformations (e.g. mapping any such bytes that do not form characters into special Unicode regions).

Should something be done about that? Like excluding it or declaring it explicitly unspecified - or should it simply be left out?


- https://www.austingroupbugs.net/view.php?id=1561#c5669
That solves point (4) (in the sense that it's explicitly unspecified)... perhaps with the following to be considered by some expert:

I've noticed that the lines kre was quoting (page 2335, lines 75446-75449 and their counterparts on page 2336, lines 75469-75472) are all "only" about "Non-built-in Utility Execution".

Does that open any holes with respect to regular built-in utilities?
Utilities are not only the 3.369 Standard Utilities (none of which would use any such strangely named environment variables, I guess)... so a shell could, AFAIU it, make *any* program a built-in utility, right?
Such program may however either expect that such ill-named variables are present - or the opposite - not present.

Does anyone think that the same (i.e. that it's unspecified) should be included for regular built-ins, too?


- With respect to https://www.austingroupbugs.net/view.php?id=1561#c5668

Should page 2351, line 76081-76082 include a note with respect to what's been said in page 2335, lines 75446-75449 and their counterparts on page 2336, lines 75469-75472.... namely that in addition to the variables with export attribute, also such with invalid names *might* be passed on (respectively that it's unspecified whether or not)?

- Issue History
Date Modified Username Field Change
2022-02-01 00:10 calestyo New Issue
2022-02-01 00:10 calestyo Name => Christoph Anton Mitterer
2022-02-01 00:10 calestyo Section => various
2022-02-01 00:10 calestyo Page Number => N/A
2022-02-01 00:10 calestyo Line Number => N/A
2022-02-01 19:33 mirabilos Note Added: 0005645
2022-02-01 19:44 calestyo Note Added: 0005647
2022-02-01 20:52 chet_ramey Note Added: 0005649
2022-02-01 23:07 kre Note Added: 0005650
2022-02-02 15:15 chet_ramey Note Added: 0005652
2022-02-02 16:39 calestyo Note Added: 0005653
2022-02-02 18:44 kre Note Added: 0005654
2022-02-06 11:18 mirabilos Note Added: 0005662
2022-02-06 18:18 chet_ramey Note Added: 0005665
2022-02-06 23:17 kre Note Added: 0005666
2022-02-08 15:14 calestyo Note Added: 0005668
2022-02-09 01:58 kre Note Added: 0005669
2022-04-07 16:29 geoffclare Relationship added related to 0001560
2022-04-07 16:30 geoffclare Relationship added related to 0001562
2022-04-07 16:30 geoffclare Relationship added related to 0001564
2022-04-11 13:52 geoffclare Note Added: 0005795
2022-04-15 23:38 calestyo Note Added: 0005807
2022-04-16 00:22 calestyo Note Added: 0005808
2022-10-31 16:25 geoffclare Final Accepted Text => Note: 0005795
2022-10-31 16:25 geoffclare Status New => Resolved
2022-10-31 16:25 geoffclare Resolution Open => Accepted As Marked
2022-10-31 16:25 geoffclare Tag Attached: tc3-2008
2022-11-30 16:35 geoffclare Status Resolved => Applied

