Austin Group Defect Tracker

Aardvark Mark IV


Viewing Issue Simple Details
ID 0001269
Category [1003.1(2016/18)/Issue7+TC2] Shell and Utilities
Severity Objection
Type Omission
Date Submitted 2019-07-10 21:23
Last Update 2020-04-29 15:18
Reporter Clausecker
View Status public
Assigned To
Priority normal
Resolution Accepted As Marked
Status Applied
Name Robert Clausecker
Organization Zuse Institute Berlin
User Reference
Section yacc
Page Number 3415, 3425–3426
Line Number 15000–15006, 115422–115425, 115447–115452
Interp Status ---
Final Accepted Text Note: 0004520
Summary 0001269: yacc: yychar is mentioned but not further specified
Description In both historical and current yacc implementations, the variable yychar holds the lookahead token when a rule is reduced (yylval holds its value). This is primarily useful for handling parser errors; without yychar, the yyclearin action is rather pointless.

The existence of this variable is suggested by the description of the -p option which reads (in IEEE 1003.1-2017):

> Use sym_prefix instead of yy as the prefix for all external names produced
> by yacc. The names affected shall include the functions yyparse(), yylex(),
> and yyerror(), and the variables yylval, yychar, and yydebug. (In the
> remainder of this section, the six symbols cited are referenced using their
> default names only as a notational convenience.) Local names may also be
> affected by the -p option; however, the -p option shall not affect #define
> symbols generated by yacc.

However, no further specification of yychar is provided. I believe that it was the intent of the committee to provide a specification of yychar, but such a specification was forgotten or accidentally removed during editing. I request that a description of yychar and the associated constants YYEMPTY and YYEOF be added.
Desired Action Either remove the mention of yychar from the list of variables in lines 15000--15006 or add a description of yychar. I suggest the following description, to be added as its own paragraph before the paragraph starting on line 115422 (or perhaps one paragraph earlier):

> When a parser action is executed, the external int yychar holds either
> the token number of the lookahead token, or YYEMPTY indicating the
> absence of a lookahead token, or YYEOF indicating the end of the token
> stream. yylval holds the value of the lookahead symbol if any.

If a description of YYEOF is added, also change the first sentence of the paragraph starting at line 115447 from:

> The end of the input is marked by a special token called the endmarker,
> which has a token number that is zero or negative. (These values are
> invalid for any other token.)

to

> The end of the input is marked by a special token YYEOF called the
> endmarker, which has a token number that is zero or negative. (These
> values are invalid for any other token.)
Tags issue8
Attached Files

- Relationships

-  Notes
(0004479)
joerg (reporter)
2019-07-11 08:21

Neither YYEOF nor YYEMPTY exist in yacc.

With YYDEBUG, the following code is used in the file "yaccpar":

      if ( yychar == 0 )
           printf( "end-of-file\n" ); 
      else if ( yychar < 0 ) 
           printf( "-none-\n" ); 


Do you refer to that?
(0004480)
Clausecker (reporter)
2019-07-11 09:33

RE Note: 0004479,

While historical yacc does not have the YYEOF and YYEMPTY macros, other implementations (like Berkeley yacc and GNU bison) do. Some mechanism is required to recognise “no lookahead” and “end of input”; otherwise the yychar mechanism is rather hard to use. Since different yacc implementations assign different token numbers to these two cases (and POSIX merely gives us that end-of-input is negative or zero [contrary to e.g. Plan 9 yacc]), it seems hard to distinguish these cases otherwise.
(0004505)
shware_systems (reporter)
2019-08-01 21:23
edited on: 2019-08-05 15:00

I can see adding YYEOF as a symbolic reference to the required endmarker.

I do not see the necessity of keeping yychar in that list, or defining YYEMPTY. It is on yyparse() to manage the mechanism required, and noting no lookahead may be done by using a local boolean variable that yyclearin as a macro just sets true or false, instead of a dedicated token value in addition to error and endmarker. A similar local variable can be set when the return value of yylex(), when yyparse() requests lookahead, is the endmarker; for use by the YYERROR macro and related processing to determine what the return value of yyparse() should be. This lookahead value doesn't need a dedicated variable holding it to begin with; it can easily be the top value of the parse stack if NSTATES+1 slots for holding the return value and associated yylval data are actually allocated.

As to Note 4479, it requires each call to an external yylex() to be checked for being <= zero, so zero as the single YYEOF value gets stored in yychar. Even then, allowing endmarker to be 0 precludes the <NUL> character as a literal from being returned as a token. While not relevant for parsing C source files, this does complicate parsers for languages that consider <NUL> a member of charclass <blank>.

(0004518)
Clausecker (reporter)
2019-08-14 10:03
edited on: 2019-08-14 21:10

By request of Joerg, I'd like to clarify the intent of this report:

An LALR(1) parser as generated by yacc performs a lookahead of up to one token to determine the correct parse action. Due to the way the parser is constructed, such a lookahead token does not exist at all times. The parser needs to keep track of whether a lookahead token exists because that determines whether a new token must be read before the current token can be shifted.

Traditionally, and in all yacc implementations I know, the lookahead token's number is stored in yychar while its value is stored in yylval. Two special token numbers with implementation-dependent values exist: one to indicate that the end of the input was reached and another to indicate that no lookahead token exists. The symbolic constants YYEOF and YYEMPTY are used for these two in yacc implementations by Robert Corbett (i.e. bison and byacc); other implementations seem to have no symbolic constants for these.

yacc provides one means to modify the lookahead token: the macro yyclearin discards the lookahead token. Before discarding a lookahead token, it is useful to know what the lookahead token is and whether it exists; otherwise yyclearin may or may not discard an arbitrary token. To do this, we can check the value of yychar and compare it with the number for “no lookahead token.”

This is possible in all yacc implementations known to me as they all work the same: yychar and yylval hold the lookahead token with yychar holding a special value (often -1 or -2 but I think I recall 0 being used, too) to indicate no lookahead. Since the state of the lookahead is only accounted for through these two variables, actions can modify them arbitrarily.

It is however not possible to do this portably as there is no way to find out what the number for “no lookahead” is. This is why I am asking to specify a symbolic constant for such a number, following the precedent in two popular yacc implementations.
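
The check described above can be sketched in plain C, outside of any real grammar file. This is only an illustration under stated assumptions: the value -1 for YYEMPTY is assumed (implementations differ, which is the point of this report), yychar is declared locally as a stand-in for the parser's variable, and discard_lookahead() is a hypothetical helper, not part of any yacc implementation. The assignment inside it mirrors how bison and byacc define yyclearin.

```c
/* Assumed values for illustration; a real parser supplies its own. */
#define YYEMPTY (-1)   /* no lookahead token is pending */
#define YYEOF   0      /* end of input was reached */

/* Stand-in for the parser's lookahead token number. */
static int yychar = YYEMPTY;

/*
 * Hypothetical helper: discard the lookahead only when a real token
 * is pending. Returns 1 if a token was discarded, 0 otherwise.
 */
static int discard_lookahead(void)
{
    if (yychar == YYEMPTY || yychar == YYEOF)
        return 0;          /* nothing to discard, or end of input */
    yychar = YYEMPTY;      /* same effect as yyclearin in bison/byacc */
    return 1;
}
```

Without a named constant, the first comparison would have to hard-code an implementation-specific number, which is what makes such code non-portable.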

(0004519)
geoffclare (manager)
2019-08-14 14:06

I used searchcode.com to look for open source projects that use yychar in .y files:

https://searchcode.com/?q=yychar&lan=56

(NB pay attention to the line numbers in the search results; the juxtaposition of lines from different parts of a file can be misleading).

The most common code that looks for special values of yychar seems to be to fetch a token when there isn't one. This is either done without using macros:
if (yychar < 0)
    yychar = yylex();

or with them:
if (yychar == YYEMPTY)
    yychar = YYLEX;

Not using the macros is currently more portable, but code that uses YYEMPTY is more easily understandable, so I believe we should add at least that one. If we add YYEMPTY we might as well also add YYEOF. (I assume YYLEX is locally defined by the applications that use it, as I don't see it in the online bison manual.)

I noticed quite a lot of code that assumed yychar is zero at end of input, so I think we should also consider requiring this (currently we allow the end-of-input token number to be zero or negative).
(0004520)
geoffclare (manager)
2019-08-15 09:28
edited on: 2019-08-15 10:10

Suggested changes ...

On page 3456 line 116730 section yacc (Code File), change:
It shall contain code for the various semantic actions with macro substitution performed on them as described in the EXTENDED DESCRIPTION section. It also shall contain a copy of the #define statements in the header file. If a %union declaration is used, the declaration for YYSTYPE shall also be included in this file.
to:
It shall contain code for the various semantic actions with macro substitution performed on them as described in the EXTENDED DESCRIPTION section. Preceding this code it shall contain an <tt>extern int yychar</tt> declaration or <tt>int yychar</tt> definition, and #define statements for the following macros:

YYEMPTY
Token number indicating there is no lookahead token. This macro shall expand to an integer constant with a value less than zero, protected by parentheses.
YYEOF
Token number indicating the end of input. This macro shall expand to the value 0.
It also shall contain a copy of the #define statements in the header file. If a %union declaration is used, the declaration for YYSTYPE and an <tt>extern YYSTYPE yylval</tt> declaration or <tt>YYSTYPE yylval</tt> definition shall also be included in this file.

On page 3456 line 116738 section yacc (Header File), change:
the declaration for YYSTYPE and an extern YYSTYPE yylval declaration
to:
the declaration for YYSTYPE and an <tt>extern YYSTYPE yylval</tt> declaration

On page 3464 line 117105 section yacc (Interface to the Lexical Analyzer), change:
The yylex() function is an integer-valued function that returns a token number representing the kind of token read. If there is a value associated with the token returned by yylex() (see the discussion of tag above), it shall be assigned to the external variable yylval.
to:
The application shall ensure that the yylex() function is an integer-valued function that returns a token number greater than zero representing the kind of token read, or a value less than or equal to zero when the end of input is reached. When the parser generated by yacc calls yylex(), it shall assign the returned value, if greater than zero, to the external variable yychar. If there is a value associated with the returned token (see the discussion of tag above), it shall be assigned to the external variable yylval. If the return value from yylex is less than or equal to zero, the parser shall assign the value YYEOF to yychar.

On page 3464 line 117122 section yacc (Interface to the Lexical Analyzer), add a new paragraph:
When a parser action is executed, yychar shall hold either the token number of the lookahead token, or YYEMPTY indicating that there is no lookahead token, or YYEOF indicating the end of input. If yychar holds the token number of the lookahead token, yylval shall hold the value associated with that token, if any.

On page 3464 line 117123 section yacc (Interface to the Lexical Analyzer), change:
The end of the input is marked by a special token called the endmarker, which has a token number that is zero or negative. (These values are invalid for any other token.) All lexical analyzers shall return zero or negative as a token number upon reaching the end of their input. If the tokens up to, but excluding, the endmarker form ...
to:
The application shall ensure that when the end of input is reached, the yylex() function returns a value that is zero or negative. The parser shall treat this as the token number YYEOF for a special token called the endmarker. If the tokens up to, but excluding, the endmarker form ...
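
The rule for storing the result of yylex() in yychar can be illustrated with a small C sketch. It is not the text of any generated parser: the value -1 for YYEMPTY is an assumption (the wording above only requires a value less than zero), the yylex() shown is a hypothetical token source that replays a fixed array, and read_lookahead() is an invented name for the step the parser performs.

```c
#include <stddef.h>

#define YYEMPTY (-1)   /* assumed; only "less than zero" is required */
#define YYEOF   0      /* fixed at 0 by the proposed text */

/* Stand-in for the parser's lookahead token number. */
static int yychar = YYEMPTY;

/* Hypothetical token stream; the final -5 signals end of input. */
static const int script[] = { 258, 43, 258, -5 };
static size_t pos = 0;

/* Hypothetical lexer: replays the script, then keeps returning 0. */
static int yylex(void)
{
    size_t n = sizeof script / sizeof script[0];
    return pos < n ? script[pos++] : 0;
}

/*
 * Sketch of the parser-side rule: a positive return value is stored
 * unchanged; zero or any negative value is stored as YYEOF.
 */
static void read_lookahead(void)
{
    int tok = yylex();
    yychar = (tok > 0) ? tok : YYEOF;
}
```

With this normalization, actions can compare yychar against YYEOF directly, regardless of which non-positive value the lexer used to signal the end of input.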


(0004521)
Clausecker (reporter)
2019-08-15 11:59

Looks good to me.

Thank you for the editing.

- Issue History
Date Modified Username Field Change
2019-07-10 21:23 Clausecker New Issue
2019-07-10 21:23 Clausecker Name => Robert Clausecker
2019-07-10 21:23 Clausecker Organization => Zuse Institute Berlin
2019-07-10 21:23 Clausecker Section => yacc
2019-07-10 21:23 Clausecker Page Number => 3415, 3425–3426
2019-07-10 21:23 Clausecker Line Number => 15000–15006, 115422–115425, 115447–115452
2019-07-11 08:21 joerg Note Added: 0004479
2019-07-11 09:33 Clausecker Note Added: 0004480
2019-08-01 21:23 shware_systems Note Added: 0004505
2019-08-05 15:00 shware_systems Note Edited: 0004505
2019-08-14 10:03 Clausecker Note Added: 0004518
2019-08-14 14:06 geoffclare Note Added: 0004519
2019-08-14 21:10 Clausecker Note Edited: 0004518
2019-08-15 09:28 geoffclare Note Added: 0004520
2019-08-15 10:00 geoffclare Note Edited: 0004520
2019-08-15 10:10 geoffclare Note Edited: 0004520
2019-08-15 11:59 Clausecker Note Added: 0004521
2019-08-15 15:16 geoffclare Interp Status => ---
2019-08-15 15:16 geoffclare Final Accepted Text => Note: 0004520
2019-08-15 15:16 geoffclare Status New => Resolved
2019-08-15 15:16 geoffclare Resolution Open => Accepted As Marked
2019-08-15 15:16 geoffclare Tag Attached: issue8
2020-04-29 15:18 geoffclare Status Resolved => Applied

