View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001919 | 1003.1(2024)/Issue8 | Base Definitions and Headers | public | 2025-04-19 21:01 | 2025-05-05 20:47 |
Reporter | dwheeler | Assigned To | ajosey | ||
Priority | normal | Severity | Editorial | Type | Clarification Requested |
Status | Resolved | Resolution | Accepted As Marked | ||
Name | David A. Wheeler | ||||
Organization | |||||
User Reference | |||||
Section | 9. Regular Expressions | ||||
Page Number | 1 | ||||
Line Number | 1 | ||||
Interp Status | --- | ||||
Final Accepted Text | see 0001919:0007154 | ||||
Summary | 0001919: Add \A and \z to regular expressions (at least EREs) | ||||
Description | I propose adding \A and \z to regular expressions (regexes) to reduce the likelihood of incorrectly copied regular expressions leading to security vulnerabilities. At least do this for EREs, and preferably for both EREs and BREs.
Regexes are widely used, and a common use is to implement security checks: regexes check that all inputs match constrained patterns *before* the inputs are accepted. The usual way to use regexes for security is to begin the regex with “anchor at string beginning” (^ is the closest in POSIX) and end it with “anchor at string end” ($ is the closest in POSIX).
Unfortunately, ecosystems do NOT agree on a standardized way to spell these anchors, as “^” and “$” do NOT have the same meanings across all ecosystems. For example, “$” means “allow optional newlines as well” in languages like Perl, Python, PHP, and C#. The “^” means “begin any line” in Ruby (and similar for “$”). For guidance on how to handle this variation between ecosystems, see “Correctly Using Regular Expressions for Secure Input Validation” https://best.openssf.org/Correctly-Using-Regular-Expressions
This variation is a serious problem, as shown by Davis et al.’s “Why Aren’t Regular Expressions a Lingua Franca?…” (2019) https://arxiv.org/abs/2105.04397 . Of surveyed developers, 94% reuse regexes, 50% reuse regexes at least half the time, and 47% incorrectly believed that regex notation is the same everywhere. LLMs will only make this error worse, as they will notice patterns of “almost” correct results that seem to repeat, and copy this subtle pattern incorrectly. If humans often make this mistake, systems trained on bad data and generalizing it will make it worse.
In addition, if REG_NEWLINE (multi-line) mode is enabled, POSIX “^” and “$” change meaning, and there is currently no standard mechanism to always match only at the beginning or end of a string.
It would be great to “heal the rift” between regex notations for this common case, so that people could write simple regexes that really WOULD be interpreted the same way across ecosystems. This healing would also provide missing functionality. In short, make it possible to ALWAYS use \A for beginning of string in all cases (even for multi-line matches) and \z for end of string in all cases (even for multi-line matches). This capability doesn’t exist at all in POSIX regex (though ^ and $ are similar). This solution was recommended in “Correctly Using Regular Expressions for Secure Input Validation - Rationale” https://best.openssf.org/Correctly-Using-Regular-Expressions-Rationale
GNU extends the POSIX regex syntax with this functionality, but it’s spelled differently: ‘\`’ matches the beginning of the whole input, and ‘\'’ matches the end of the whole input. This suggests that there is a desire by some to have this functionality. However, few other systems have copied this syntax. Most other ecosystems use \A and \z instead. For more see: https://www.gnu.org/software/findutils/manual/html_node/find_html/posix_002dextended-regular-expression-syntax.html
Note that \A and \z are widely implemented across many ecosystems. It would take some work to implement in POSIX systems, but this is not expected to be significant work. | ||||
Desired Action | Modify base definitions “9. Regular Expressions” in https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html as follows:
Modify the ERE text as follows:
In 9.4.8 ERE Precedence, modify “Anchoring” so “^ $” changes to “^ $ \A \z”
In 9.4.9 ERE Expression Anchoring:
* Change “The <circumflex> and <dollar-sign> special characters” to “The <circumflex> and <dollar-sign> special characters, and the expressions \A and \z,”
* At the end of points 1 and 2, append “This meaning changes if regcomp is given the flag REG_NEWLINE as described there.”
* Add: “3. When not inside a bracket expression, a \A shall anchor the expression or subexpression it begins to the beginning of a string; such an expression or subexpression can match only a sequence starting at the first character of a string. This meaning is unchanged by REG_NEWLINE. 4. When not inside a bracket expression, a \z shall anchor the expression or subexpression it ends to the end of a string; such an expression or subexpression can match only a sequence ending at the last character of a string. This meaning is unchanged by REG_NEWLINE.”
In 9.5 Regular Expression Grammar: In QUOTED_CHAR add \A and \z
In 9.5.3 ERE Grammar: In ERE_expression after “| ‘$’” add: | \A | \z
… and similarly for BRE. I thought I’d start by making proposals for ERE, and if that seems reasonable, go back for BRE. My understanding is that there’s more hesitation to change BREs, so while I’d prefer to do this for both EREs and BREs, I’d rather get 50% than 0%.
Modify regcomp in https://pubs.opengroup.org/onlinepubs/9799919799/ as follows:
Modify “REG_NOTBOL The first character of the string pointed to by string is not the beginning of the line. Therefore, the <circumflex> character ('^'), when taken as a special character, shall not match the beginning of string.” and append “This behavior is modified by REG_NEWLINE as described below.
Similarly \A, when taken as an anchor, shall not match the beginning of string, and its behavior is not modified by REG_NEWLINE.”
Modify “REG_NOTEOL The last character of the string pointed to by string is not the end of the line. Therefore, the <dollar-sign> ('$'), when taken as a special character, shall not match the end of string.” and append “This behavior is modified by REG_NEWLINE as described below. Similarly \z, when taken as an anchor, shall not match the end of string, and its behavior is not modified by REG_NEWLINE.”
Append to RATIONALE: “The \A and \z anchors were added to make it easier to reuse regular expressions between different ecosystems and to provide a mechanism that ALWAYS means “exactly anchor at the beginning and end of string” even in the presence of REG_NEWLINE.”
Note: I don't see the PDF, so the page# and line# aren't quite right, but I hope my section references will make this clear. I welcome corrections on this or other issues. Thank you. | ||||
Tags | tc1-2024 |
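The validation pitfall raised in the Description can be reproduced in Python, one of the ecosystems where “$” also matches before a trailing newline. This is a minimal sketch and not part of the proposal: the helper `accepts` and the sample inputs are hypothetical, `re.MULTILINE` serves only as a rough stand-in for REG_NEWLINE, and the end anchor is spelled \Z because that is Python's long-standing notation for strict end-of-string.

```python
import re

# Hypothetical validation patterns for "lowercase letters only".
LOOSE = r"^[a-z]+$"       # anchors whose meaning depends on ecosystem and flags
STRICT = r"\A[a-z]+\Z"    # Python's spelling of strict start/end of string


def accepts(pattern: str, text: str, flags: int = 0) -> bool:
    """Return True if pattern is found in text (hypothetical helper)."""
    return re.search(pattern, text, flags) is not None


# In Python, "$" also matches just before a trailing newline.
print(accepts(LOOSE, "abc\n"))                          # True
print(accepts(STRICT, "abc\n"))                         # False

# In multi-line mode, "^" and "$" match at any line boundary.
print(accepts(LOOSE, "abc\n<script>", re.MULTILINE))    # True
print(accepts(STRICT, "abc\n<script>", re.MULTILINE))   # False

# Well-formed input is accepted by both patterns.
print(accepts(LOOSE, "abc"), accepts(STRICT, "abc"))    # True True
```

The strict pattern gives the same answer with and without the multi-line flag, which is the property the proposal asks POSIX to provide through \A and \z.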
|
See also this proposed glibc patch to add \A and \z as synonyms to its existing extension of \` and \': https://sourceware.org/pipermail/libc-alpha/2025-April/166098.html |
|
Add to BRE Expression Anchoring (P 187 after line 6674) as small NOTE: NOTE: Given the widespread adoption of some sort of special character sequence for matching the beginning and end of string in other languages, a future version of this standard is likely to require such characters to be supported in a regular expression. Implementors are encouraged to provide this as an extension using "\A" for the beginning and "\z" for the end of strings as they are already in widespread use for this purpose in other languages. Add the same note to ERE Expression Anchoring page 190 after line 6811. |
|
Thank you! That's fantastic! I plan to add a note about this to the OpenSSF document about regular expressions. Obviously this agreement doesn't mean it's *in* POSIX, but it's a key step. |
|
\A is less controversial. In response to my query to glibc, Paul Eggert argued that Python's \Z is better than Perl's \z, and that Perl's \Z can be emulated in other languages by \n?\Z (at least, where \n represents newline). But he did agree that glibc's existing \` and \' have the same impact as what other languages have in \A and \Z. |
|
No, using only \Z for "end of string" would be a TERRIBLE result. Using \Z would solve NOTHING, because there would *still* not be a standard pair of regex markers that work on most platforms. That's the problem I'm trying to solve: there should be ONE way to indicate this in a regex that works everywhere. The vast majority of existing platforms use \Z as a synonym for \n?\z. That includes Java, .NET/C#, Perl, PCRE, PHP (using PCRE), and Ruby. Some use \z as end-of-string and don't assign a meaning to \Z. This includes RE2 <https://github.com/google/re2/wiki/Syntax>, which is widely used by Go, and the Rust crate regex, which is widely used by Rust. JavaScript doesn't currently support \A and \z, but there is a stage 2 proposal for them (note that it does NOT use \Z): https://github.com/tc39/proposal-regexp-buffer-boundaries . I know of no proposal for supporting \Z in JavaScript. For more info, see: https://best.openssf.org/Correctly-Using-Regular-Expressions-Rationale Python is the *ONLY* platform I know of that uses \Z for end-of-string, which is incompatible with almost every other platform. I'm in the process of drafting a fix to Python to add \z. If the group wants to also add \Z as a synonym for \n?\z, that would be fine; it would add yet another anchor that is widely supported, though one that's not nearly as important. The point is to have a *single* expression for beginning & ending of a string that works everywhere. Almost all platforms use \A ... \z, and the other platforms have no *conflict* with those markers, making this pair the only practical option. Thanks! |
|
Just to be clear: the set of languages and platforms that use \z (and NOT \Z) as "end of string" are: Java, .NET/C#, Perl, PCRE, PHP (using PCRE), Ruby, RE2 (which is widely used by Go), and the Rust crate regex (which is widely used by Rust). This is why "end of string" needs to be \z; that's what almost everyone else does. If POSIX adds \z to mean "end of string", we are *much* closer to having a single term that means "end of string" everywhere. The term \Z cannot *ever* be a common term for end-of-string across platforms, because most platforms already use \Z for something else ("end of string optionally preceded by a newline"). The term "$" can't mean end-of-string across platforms either, because many platforms don't treat it that way. The term \z *CAN* be a common term for end-of-string across platforms; it's already used that way in most platforms, and it doesn't conflict with another meaning on the rest. |
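The difference between the two end anchors discussed in the notes above can be sketched in Python. This is an illustration under stated assumptions: Python has no lenient end anchor of its own, so the "end of string optionally preceded by a newline" meaning is emulated here with the lookahead `(?=\n?\Z)`, and the names `END_STRICT`, `END_LENIENT`, and `validates` are hypothetical. Python's \Z plays the role that \z plays in the proposal.

```python
import re

# Strict end-of-string: what the proposal calls \z (spelled \Z in Python).
END_STRICT = r"\Z"
# Emulation of "end of string, optionally preceded by a newline",
# i.e. the meaning most other platforms give to \Z.
END_LENIENT = r"(?=\n?\Z)"


def validates(body: str, end_anchor: str, text: str) -> bool:
    """Check text against \\A + body + end_anchor (hypothetical helper)."""
    return re.search(r"\A" + body + end_anchor, text) is not None


digits = r"[0-9]+"

print(validates(digits, END_STRICT, "42"))        # True
print(validates(digits, END_STRICT, "42\n"))      # False
print(validates(digits, END_LENIENT, "42"))       # True
print(validates(digits, END_LENIENT, "42\n"))     # True
print(validates(digits, END_LENIENT, "42\n\n"))   # False
```

Only the strict anchor rejects the input with a trailing newline, which is why the choice of spelling matters for input validation.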
|
I have reached out to the Python community: https://discuss.python.org/t/proposal-add-z-as-a-synonym-for-z-in-python-res-for-standardization/90378 The responses so far have been favorable, and I have been asked to open an Issue for this: https://github.com/python/cpython/issues/133306 |
|
FYI: The Python community has decided to add \z for end-of-string (they already had \A for beginning-of-string), aligning Python with this proposal. We are getting ever closer to having a single notation for beginning-of-string and end-of-string across platforms! The Python developers rejected the \` and \' alternatives as they felt they were hard to read, difficult to use on GitHub, and conflicted with the general avoidance of backquotes in Python syntax. My thanks to msbrown for reaching out to the Python community! |
|
Something I just realized: how should it behave with the ed/sed/grep utilities? For example, should \A in their case end up meaning start-of-file (and start-of-selection for ed and sed when addresses are used), or should it end up being equivalent to start-of-line, or be undefined behavior (like start of the read buffer would be)? Thinking as a user, I think \A being start-of-file would be interesting for grep(1), but as an implementer, I could see it being rather annoying to support. |
|
A lot of tools process data a line at a time. In those cases I think \A and \z should mean "beginning and ending of the current line being processed", since that's the data being processed. grep normally reads a line at a time, so I'd expect \A to mean the same as ^ and \z to mean the same as $ (presuming there's no multi-line mode enabled), since within its processing the "beginning of string" and "end of string" would be within the read line. More generally, I would want this to be *easy* to implement, or there's a risk it won't happen everywhere. In the case of ed and sed, again, commands like "s" work a line at a time, so \A should mean ^ and \z should mean $. They can already notate "first line" and "last line" anyway, so there's no strong need to do things "the hard way". |
|
Right, re-reading their specs more closely, they are defined as line-matching (including for ed/sed addresses). So implementations might need to be a bit more careful and properly pass a line rather than a whole buffer (especially for \z). |
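The point about passing a line rather than a whole buffer can be sketched as follows. This is a hypothetical grep-like loop in Python, not a description of how any particular grep is implemented; `matching_lines` and its `strip_delimiter` parameter are invented for illustration, and Python's \Z stands in for the proposed \z.

```python
import io
import re


def matching_lines(pattern: str, stream, strip_delimiter: bool = True) -> list[str]:
    """Return the lines of stream that match pattern (hypothetical helper)."""
    regex = re.compile(pattern)
    hits = []
    for raw in stream:
        # The subject handed to the matcher is either the bare line or the
        # line with its trailing newline still attached.
        subject = raw.rstrip("\n") if strip_delimiter else raw
        if regex.search(subject):
            hits.append(raw.rstrip("\n"))
    return hits


data = "a\nb\nab\n"

# With the delimiter removed, the strict anchors behave like "^" and "$".
print(matching_lines(r"\Ab\Z", io.StringIO(data)))                         # ['b']
print(matching_lines(r"^b$", io.StringIO(data)))                           # ['b']

# If the newline is left on the subject, the strict end anchor fails.
print(matching_lines(r"\Ab\Z", io.StringIO(data), strip_delimiter=False))  # []
```

A strict end anchor only agrees with "$" when the subject really is the line without its delimiter.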
|
sed's regexps are matched against the pattern space, which is initialised from the contents of the current line without the line delimiter but can be modified at will by the sed script, including by adding newline characters (for example by pulling additional lines from the input with the N command). awk's regexps can be applied on anything. grep works on lines (the "subject" is the line currently being considered, again not including the line delimiter). ex/vi's, AFAICT, are meant to be on lines (when not in vi-compatibility mode, vim regexes can have matches spanning several lines, and \A there is for "non-alphabetic character" and \z is a prefix for several additional regex operators, so in any case \A and \z cannot be added to vim regexps). In any case, none of them use REG_NEWLINE, so in all POSIX utilities \A is equivalent to ^ and \z to $, and they match at the beginning and end of the subject, not at the beginning/end of each line within the subject.
    $ printf 'a\nb\n' | sed 'N; s/^b/X/; s/^a.b$/<&>/'
    <a
    b>
^b did not match because the pattern space never started with b; after N, it started with a and ended in b, with one character (newline) in the middle. \A and \z would only become relevant/useful there if POSIX added support for perl's (?m-s) operators (more or less the equivalent of REG_NEWLINE). |
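The sed example above can be mimicked in Python to show where \A and \z would start to matter. This is only a rough model: the variable `pattern_space` stands for sed's pattern space after N, `re.DOTALL` is used so that "." matches the newline as it does in the sed run, and `re.MULTILINE` is an approximate stand-in for REG_NEWLINE or Perl's (?m), which the POSIX utilities do not use. Python's \A is the same anchor the proposal describes.

```python
import re

# Pattern space after sed's N command has appended the second line.
pattern_space = "a\nb"

# Without multi-line mode, "^" anchors only at the start of the subject,
# so "^b" finds nothing to replace.
step1 = re.sub(r"^b", "X", pattern_space)
print(repr(step1))    # 'a\nb'

# "^a.b$" spans the whole subject when "." may match the newline.
step2 = re.sub(r"^a.b$", r"<\g<0>>", step1, flags=re.DOTALL)
print(repr(step2))    # '<a\nb>'

# In multi-line mode, "^" also matches after the embedded newline ...
multi = re.sub(r"^b", "X", pattern_space, flags=re.MULTILINE)
print(repr(multi))    # 'a\nX'

# ... while \A still refers only to the start of the whole subject.
strict = re.sub(r"\Ab", "X", pattern_space, flags=re.MULTILINE)
print(repr(strict))   # 'a\nb'
```

The last two substitutions differ only in the anchor, which is the situation that would arise in the utilities if a multi-line mode were ever added.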
Date Modified | Username | Field | Change |
---|---|---|---|
2025-04-19 21:01 | dwheeler | New Issue | |
2025-04-19 21:01 | dwheeler | Status | New => Under Review |
2025-04-19 21:01 | dwheeler | Assigned To | => ajosey |
2025-04-24 16:20 | eblake | Note Added: 0007153 | |
2025-04-24 16:26 | nick | Note Added: 0007154 | |
2025-04-24 16:28 | nick | Status | Under Review => Resolved |
2025-04-24 16:28 | nick | Resolution | Open => Accepted As Marked |
2025-04-24 16:28 | nick | Category | Front Matter => Base Definitions and Headers |
2025-04-24 16:28 | nick | Interp Status | => --- |
2025-04-24 16:28 | nick | Final Accepted Text | => see 0001919:0007154 |
2025-04-24 16:28 | nick | Tag Attached: tc1-2024 | |
2025-04-24 16:29 | geoffclare | Project | 1003.1(2008)/Issue 7 => 1003.1(2024)/Issue8 |
2025-04-28 18:55 | dwheeler | Note Added: 0007157 | |
2025-04-28 21:16 | eblake | Note Added: 0007159 | |
2025-04-29 12:55 | dwheeler | Note Added: 0007161 | |
2025-04-29 14:05 | dwheeler | Note Added: 0007162 | |
2025-05-02 15:42 | msbrown | Note Added: 0007170 | |
2025-05-05 14:32 | dwheeler | Note Added: 0007174 | |
2025-05-05 14:48 | lanodan | Note Added: 0007175 | |
2025-05-05 15:05 | lanodan | Note Edited: 0007175 | |
2025-05-05 17:45 | dwheeler | Note Added: 0007179 | |
2025-05-05 18:16 | lanodan | Note Added: 0007180 | |
2025-05-05 20:47 | stephane | Note Added: 0007181 |