Austin Group Defect Tracker

Aardvark Mark IV


Viewing Issue Simple Details Jump to Notes ] Issue History ] Print ]
ID Category Severity Type Date Submitted Last Update
0001329 [1003.1(2013)/Issue7+TC1] Base Definitions and Headers Objection Clarification Requested 2020-03-29 20:54 2020-03-30 10:27
Reporter rhialto View Status public  
Assigned To
Priority normal Resolution Open  
Status New  
Name Olaf 'Rhialto' Seibert
Organization
User Reference
Section Vol 1. 9.4.6, Vol. 1. 13. regex.h, Vol 2. regcomp(), Vol. 4. A.9.2
Page Number 190
Line Number 6195
Interp Status ---
Final Accepted Text
Summary 0001329: Problem in resolution of 0000793: "Regular Expressions: add REG_MINIMAL and a minimum repitition modifier"
Description Issue 793 (which has status Applied) contains this text to apply:

     - Vol. 1: Base Definitions, Chapter 9, «Regular Expressions».

  9.4.6 EREs Matching Multiple Characters, p. 190, line 6195:
  insert after

        6. Each of the duplication symbols ('+', '*', '?', and intervals) may
           be suffixed by the minimal repitition modifier '?' <question-mark>,
           in which case matching behaviour is changed from the «leftmost
           longest possible match» to the «leftmost shortest possible match»,
           including the null match
           (see [reference to A.9, p. 3500 ff.]). For example, the ERE ".*c"
           matches the last character ('c') in the string "abc abc", whereas
           the ERE ".*?c" matches the first character 'c', the third character
           in the string.

There seems to be a mistake in this text:
                                                                                
     For example, the ERE ".*c" matches the last character ('c') in the
     string "abc abc",
                                                                                
which should likely be
                                                          
     For example, the ERE ".*c" matches all characters up to the last
     character ('c') in the string "abc abc",
                               
As an existing implementation, NetBSD's egrep matches the whole string:

    $ echo abc-abc | egrep --color ".*c"
    abc-abc
                                              
with the whole of abc-abc coloured red.

           
                                                                     
The text that follows seems incorrect for a similar but potentially more serious reason:
                                                                                
    whereas the ERE ".*?c" matches the first character 'c', the third
    character in the string. [[ "abc abc" ]]
                                                                                
Possibly it should match "abc", because that is leftmost; just matching
"c" would be shortest but I don't see it as being leftmost. Which of these two is meant?



Furthermore, "repitition" (used several times) seems to contain a typo.
Desired Action Correct the examples, and/or indicate whether "leftmost" or "shortest" is more important in a match with a minimal repetition modifier.
Tags No tags attached.
Attached Files

- Relationships
child of 0000793Appliedajosey Regular Expressions: add REG_MINIMAL and a minimum repitition modifier 

-  Notes
(0004804)
kre (reporter)
2020-03-30 02:57

The major issue here is whether in "leftmost shortest possible match" which
takes priority, "leftmost" or "shortest possible" when the results would be
different (as in the example given) - the text of the example suggests that
the shortest match should be preferred over leftmost, but I suspect that's
probably just a similar example error than the obvious earlier one.

In case of "leftmost longest possible match" this issue cannot arise, as
moving the match further to the left necessarily makes it longer,

It is likely that the text from the definition of "matched" in 9.1:

    The search for a matching sequence starts at the beginning of a string
    and stops when the first sequence matching the expression is found,
    where ``first'' is defined to mean ``begins earliest in the string''.

will cover this case, but that's only if this is what the implementations
that currently implement minimal matches implement it that way, and if the
example is corrected to make it clear that is what happens. Additional
words to make it clear that when there is an apparent conflict, the leftmost
possible match is chosen rather than a shorter one beginning further to the
right might be useful.
(0004805)
kre (reporter)
2020-03-30 03:14
edited on: 2020-03-30 05:59

0000073 should probably also have amended the text in 9.1 that immediately
follows the quote given in Note: 0004804 ...

     If the pattern permits a variable number of matching characters and
     thus there is more than one such sequence starting at that point, the
     longest such sequence is matched.

Since that will no longer be universally true.

Further, when dealing with sub-matches, we need to know how the "minimal
match" operator applies to submatches, both when those sub-matches do, and
do not, also contain the minimal match operator.

That is, in each of

     (([abc]*)([abc]*)([abc]*))+?x
     (([abc]*?)([abc]*)([abc]*))+?x
     (([abc]*)([abc]*?)([abc]*))+?x
     (([abc]*?)([abc]*?)([abc]*))+?x

when applied to the string

     abcaabbccaaabbbcccx

what is \1 \2 \3 and \4 in each case ?

That will partly turn upon whether the added text

        If the REG_MINIMAL flag, as defined in the <regex.h>[REF] header,
           is used when compiling an ERE via regcomp(3)[REF], the «leftmost
           shortest possible match» is the default, and the minimal repitition
           modifier ’?’ can be used to select the «leftmost longest possible
           match».

(added in the new s.6 on p 190 of 7-TC1) is intended to apply to the
? operator as well as the REG_MINIMAL flag. That is, when evaluating
an ERE for which the "minimal match" operator applies, do sub-expressions
inherit the mimimal match property, or not? And if they do, does the
minimal match operator do what it does when REG_MINIMAL is used, and
reverse its effect, to become a longest match operator?

I don't know the answers to any of this, but I do know it all needs to be
made clear. There might be more, RE's in general tend to be things I simply
use, I generally prefer not to try and bend my mind around their definitions.

(0004806)
geoffclare (manager)
2020-03-30 08:36

When I applied bug 0000793 I spotted that "up to" was missing and inserted it. I should have added a note to the bug that I had done this - sorry.

The source currently has:
For example, the ERE
.sG ".*c"
matches up to the last character (\c
.cH c )
in the string
.sG "abc abc" ,
whereas the ERE
.sG ".*?c"
matches up to the first character
.cH c ,
the third character in the string.

I also fixed the spelling of "repetition" (and would not have added a note for that - minor editorial corrections like that are often needed when applying bugs).

Re: Note: 0004805 I agree that a change to the definition of "matched" in 9.1 is needed, and we should address that problem here.
(0004808)
kre (reporter)
2020-03-30 10:27

I'd also like to see a better definition of what all of this really
means, particularly REG_MINIMAL and the (new, as opposed to old) '?'
operator when that is in use, and how all of this applies in some of
the more difficult cases. I know I couldn't implement this in a
way I would expect to be compatible with other implementations based
only on what has been added in 0000793 (even assuming that we fix
the definition of "matched").

- Issue History
Date Modified Username Field Change
2020-03-29 20:54 rhialto New Issue
2020-03-29 20:54 rhialto Name => Olaf 'Rhialto' Seibert
2020-03-29 20:54 rhialto Section => Vol 1. 9.4.6, Vol. 1. 13. regex.h, Vol 2. regcomp(), Vol. 4. A.9.2
2020-03-29 20:54 rhialto Page Number => 190
2020-03-29 20:54 rhialto Line Number => 6195
2020-03-29 23:03 Don Cragun Relationship added child of 0000793
2020-03-30 02:57 kre Note Added: 0004804
2020-03-30 03:14 kre Note Added: 0004805
2020-03-30 03:23 kre Note Edited: 0004805
2020-03-30 05:59 kre Note Edited: 0004805
2020-03-30 08:36 geoffclare Note Added: 0004806
2020-03-30 10:27 kre Note Added: 0004808


Mantis 1.1.6[^]
Copyright © 2000 - 2008 Mantis Group
Powered by Mantis Bugtracker