0001919: Add \A and \z to regular expressions (at least EREs)

ID	Project	Category	View Status	Date Submitted	Last Update

0001919	1003.1(2024)/Issue8	Base Definitions and Headers	public	2025-04-19 21:01	2025-07-08 14:30

Reporter	dwheeler	Assigned To	geoffclare
Priority	normal	Severity	Editorial	Type	Clarification Requested
Status	Applied	Resolution	Accepted As Marked

Name	David A. Wheeler
Organization
User Reference
Section	9. Regular Expressions
Page Number	1
Line Number	1
Interp Status	---
Final Accepted Text	see 0001919:0007154


Summary	0001919: Add \A and \z to regular expressions (at least EREs)
Description	I propose adding \A and \z to regular expressions (regexes) to reduce the likelihood of incorrectly copied regular expressions leading to security vulnerabilities. At least do this for EREs, and preferably for both EREs and BREs.
Regexes are widely used, and a common use is to implement security checks: verifying that all inputs match constrained patterns before the inputs are accepted. The usual way to use regexes for security is to begin the regex with “anchor at string beginning” (^ is the closest in POSIX) and end it with “anchor at string end” ($ is the closest in POSIX). Unfortunately, ecosystems do NOT agree on a standardized way to spell these anchors, as “^” and “$” do NOT have the same meanings across all ecosystems. For example, “$” means “allow optional newlines as well” in languages like Perl, Python, PHP, and C#. The “^” means “begin any line” in Ruby (and similar for “$”). For guidance on how to handle this variation between ecosystems, see “Correctly Using Regular Expressions for Secure Input Validation” https://best.openssf.org/Correctly-Using-Regular-Expressions
This is a real problem. Davis et al.’s “Why Aren’t Regular Expressions a Lingua Franca?…” (2019) https://arxiv.org/abs/2105.04397 found that, of surveyed developers, 94% reuse regexes, 50% reuse regexes at least half the time, and 47% incorrectly believed that regex notation is the same everywhere. LLMs will only make this error worse, as they will notice patterns of “almost” correct results that seem to repeat, and copy this subtle pattern incorrectly. If humans often make this mistake, systems trained on bad data and generalizing it will make it worse.
In addition, if REG_NEWLINE (multi-line) mode is enabled, POSIX “^” and “$” change meaning, and there is currently no standard mechanism to always match only at the beginning or end of a string.
It would be great to “heal the rift” between regex notations for this common case, so that people could write simple regexes that really WOULD be interpreted the same way across ecosystems. This healing would also provide missing functionality. In short, make it possible to ALWAYS use \A for beginning of string in all cases (even for multi-line matches) and \z for end of string in all cases (even for multi-line matches). This capability doesn’t exist at all in POSIX regex (though ^ and $ are similar). This solution was recommended in “Correctly Using Regular Expressions for Secure Input Validation - Rationale” https://best.openssf.org/Correctly-Using-Regular-Expressions-Rationale
GNU extends the POSIX regex syntax with this functionality, but it’s spelled differently:
‘\`’ matches the beginning of the whole input
‘\'’ matches the end of the whole input
This suggests that there is a desire by some to have this functionality. However, few other systems have copied this syntax. Most other ecosystems use \A and \z instead. For more see: https://www.gnu.org/software/findutils/manual/html_node/find_html/posix_002dextended-regular-expression-syntax.html
Note that \A and \z are widely implemented across many ecosystems. It would take some work to implement in POSIX systems, but this is not expected to be significant work.
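To illustrate the validation pitfall described above, here is a minimal sketch in Python, one of the ecosystems where “$” also matches just before a trailing newline. The validator names and the lowercase-word pattern are hypothetical, chosen only for illustration. Python's re module has long spelled the end-of-string anchor as \Z, so the sketch uses that spelling in order to run on versions that predate \z support.

```python
import re

# Hypothetical validators: accept only a lowercase ASCII word.
LOOSE = re.compile(r"^[a-z]+$")     # '$' also matches just before a trailing newline
STRICT = re.compile(r"\A[a-z]+\Z")  # in Python, \Z matches only at the very end


def is_valid_loose(value):
    """Return True if the loosely anchored pattern accepts the value."""
    return LOOSE.search(value) is not None


def is_valid_strict(value):
    """Return True only if the whole string is a lowercase word."""
    return STRICT.search(value) is not None


if __name__ == "__main__":
    print(is_valid_loose("abc\n"))   # accepted despite the trailing newline
    print(is_valid_strict("abc\n"))  # rejected
```

A regex copied from an ecosystem where “$” is a strict end-of-string anchor would silently behave like the loose validator here, which is the kind of reuse error the proposal aims to prevent.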
Desired Action	Modify base definitions “9. Regular Expressions” in https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html as follows:
Modify the ERE text as follows:
In 9.4.8 ERE Precedence, modify “Anchoring” so “^ $” changes to “^ $ \A \z”
In 9.4.9 ERE Expression Anchoring:
* Change “The <circumflex> and <dollar-sign> special characters” to “The <circumflex> and <dollar-sign> special characters, and the expressions \A and \z,”
* At the end of points 1 and 2, append “This meaning changes if regcomp is given the flag REG_NEWLINE as described there.”
* Add: “3. When not inside a bracket expression, a \A shall anchor the expression or subexpression it begins to the beginning of a string; such an expression or subexpression can match only a sequence starting at the first character of a string. This meaning is unchanged by REG_NEWLINE. 4. When not inside a bracket expression, a \z shall anchor the expression or subexpression it ends to the end of a string; such an expression or subexpression can match only a sequence ending at the last character of a string. This meaning is unchanged by REG_NEWLINE.”
In 9.5 Regular Expression Grammar: In QUOTED_CHAR add \A and \z
In 9.5.3 ERE Grammar: In ERE_expression after “| ‘$’” add: | \A | \z
… and similarly for BRE. I thought I’d start by making proposals for ERE, and if that seems reasonable, go back for BRE. My understanding is that there’s more hesitation to change BREs, so while I’d prefer to do this for both EREs and BREs, I’d rather get 50% than 0%.
Modify regcomp in https://pubs.opengroup.org/onlinepubs/9799919799/ as follows:
Modify “REG_NOTBOL The first character of the string pointed to by string is not the beginning of the line. Therefore, the <circumflex> character ('^'), when taken as a special character, shall not match the beginning of string.” and append “This behavior is modified by REG_NEWLINE as described below. Similarly \A, when taken as an anchor, shall not match the beginning of string, and its behavior is not modified by REG_NEWLINE.”
Modify “REG_NOTEOL The last character of the string pointed to by string is not the end of the line. Therefore, the <dollar-sign> ('$'), when taken as a special character, shall not match the end of string.” and append “This behavior is modified by REG_NEWLINE as described below. Similarly \z, when taken as an anchor, shall not match the end of string, and its behavior is not modified by REG_NEWLINE.”
Append to RATIONALE: “The \A and \z anchors were added to make it easier to reuse regular expressions between different ecosystems and to provide a mechanism that ALWAYS means “exactly anchor at the beginning and end of string” even in the presence of REG_NEWLINE.”
Note: I don't see the PDF, so the page# and line# aren't quite right, but I hope my references to sections will make this clear. I welcome corrections on this or other issues. Thank you.
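The distinction the proposed wording draws (“^” and “$” change meaning under REG_NEWLINE, while \A and \z do not) can be sketched with Python's re module. This is only a rough analogue and not the POSIX API: re.MULTILINE is assumed here to stand in for the anchor-related part of REG_NEWLINE, Python's \Z stands in for the proposed \z, and the sample input is made up.

```python
import re

# Made-up input whose first line looks harmless but whose second line does not.
text = "safe\nrm -rf /"

# With re.MULTILINE, '^' and '$' match at the start and end of each line,
# so the check succeeds as soon as any single line matches.
per_line = re.search(r"^[a-z]+$", text, flags=re.MULTILINE)

# \A and \Z keep their whole-string meaning even when re.MULTILINE is set.
whole_string = re.search(r"\A[a-z]+\Z", text, flags=re.MULTILINE)

if __name__ == "__main__":
    print(per_line.group(0) if per_line else None)
    print(whole_string)
```

Here the line-anchored pattern accepts the input because the first line matches, while the string-anchored pattern rejects it.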
Tags	tc1-2024

eblake 2025-04-24 16:20 manager bugnote:0007153	See also this proposed glibc patch to add \A and \z as synonyms to its existing extension of \` and \': https://sourceware.org/pipermail/libc-alpha/2025-April/166098.html

nick 2025-04-24 16:26 manager bugnote:0007154	Add to BRE Expression Anchoring (P 187 after line 6674) as small NOTE: NOTE: Given the widespread adoption of some sort of special character sequence for matching the beginning and end of string in other languages, a future version of this standard is likely to require such characters to be supported in a regular expression. Implementors are encouraged to provide this as an extension using "\A" for the beginning and "\z" for the end of strings as they are already in widespread use for this purpose in other languages. Add the same note to ERE Expression Anchoring page 190 after line 6811.

dwheeler 2025-04-28 18:55 reporter bugnote:0007157	Thank you! That's fantastic! I plan to add a note about this to the OpenSSF document about regular expressions. Obviously this agreement doesn't mean it's in POSIX, but it's a key step.

eblake 2025-04-28 21:16 manager bugnote:0007159	\A is less controversial. In response to my query to glibc, Paul Eggert argued that Python's \Z is better than Perl's \z, and that Perl's \Z can be emulated in other languages by \n?\Z (at least, where \n represents newline). But he did agree that glibc's existing \` and \' have the same impact as what other languages have in \A and \Z.

dwheeler 2025-04-29 12:55 reporter bugnote:0007161	No, using only \Z for "end of string" would be a TERRIBLE result. Using \Z would solve NOTHING, because there would still not be a standard pair of regex markers that work on most platforms. That's the problem I'm trying to solve: there should be ONE way to indicate this in a regex that works everywhere. The vast majority of existing platforms use \Z as a synonym for \n?\z. That includes Java, .NET/C#, Perl, PCRE, PHP (using PCRE), and Ruby. Some use \z as end-of-string and don't assign a meaning to \Z. These include RE2 <https://github.com/google/re2/wiki/Syntax>, which is widely used by Go, and the Rust crate regex, which is widely used by Rust. JavaScript doesn't currently support \A and \z, but there is a stage 2 proposal for them (note that it does NOT use \Z): https://github.com/tc39/proposal-regexp-buffer-boundaries . I know of no proposal for supporting \Z in JavaScript. For more info, see: https://best.openssf.org/Correctly-Using-Regular-Expressions-Rationale Python is the ONLY platform I know of that uses \Z for end-of-string, which is incompatible with almost every other platform. I'm in the process of drafting a fix to Python to add \z. If the group wants to also add \Z as a synonym for \n?\z, that would be fine; it would add yet another anchor that is widely supported, though one that's not nearly as important. The point is to have a single expression for beginning & ending of a string that works everywhere. Almost all platforms use \A ... \z, and the other platforms have no conflict with those markers, making this pair the only practical option. Thanks!

dwheeler 2025-04-29 14:05 reporter bugnote:0007162	Just to be clear: the set of languages and platforms that use \z (and NOT \Z) as "end of string" are: Java, .NET/C#, Perl, PCRE, PHP (using PCRE), Ruby, RE2 which is widely used by Go and Rust crate regex which is widely used by Rust. This is why "end of string" needs to be \z; that's what almost everyone else does. If POSIX adds \z to mean "end of string", we are much closer to having a single term that means "end of string" everywhere. The term \Z cannot ever be a common term for end-of-string across platforms, because most platforms already use \Z for something else ("end of string optionally preceded by a newline"). The term "$" can't mean end-of-string across platforms, because many platforms also don't treat it that way. The term \z CAN be a common term for end-of-string across platforms; it's already used that way in most platforms, and it doesn't conflict with another meaning on the rest.
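The two meanings of \Z discussed in the notes above can be compared with a short Python sketch. Python's own \Z means strict end-of-string, so the other meaning ("end of string optionally preceded by a newline") is emulated here with a lookahead, which keeps it zero-width. The pattern names and the literal "abc" are hypothetical.

```python
import re

# Python's \Z: matches only at the very end of the string.
END_ONLY = re.compile(r"abc\Z")

# Emulation of the "\n?\z" meaning as a zero-width assertion:
# end of string, optionally preceded by exactly one newline.
OPTIONAL_NEWLINE_END = re.compile(r"abc(?=\n?\Z)")

if __name__ == "__main__":
    for sample in ("abc", "abc\n", "abc\n\n"):
        print(repr(sample),
              END_ONLY.search(sample) is not None,
              OPTIONAL_NEWLINE_END.search(sample) is not None)
```

The two patterns agree on "abc" and on "abc\n\n", and differ only on "abc\n", which is exactly the case where the choice of anchor matters for input validation.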

msbrown 2025-05-02 15:42 manager bugnote:0007170	I have reached out to the Python community: https://discuss.python.org/t/proposal-add-z-as-a-synonym-for-z-in-python-res-for-standardization/90378 The responses so far have been favorable, and I have been asked to open an Issue for this: https://github.com/python/cpython/issues/133306

dwheeler 2025-05-05 14:32 reporter bugnote:0007174	FYI: The Python community has decided to add \z for end-of-string (they already had \A for beginning-of-string), aligning Python with this proposal. We are getting ever closer to having a single notation for beginning-of-string and end-of-string across platforms! The Python developers rejected the \` and \' alternatives as they felt they were hard to read, difficult to use on GitHub, and conflicted with the general avoidance of backquotes in Python syntax. My thanks to msbrown for reaching out to the Python community!

lanodan 2025-05-05 14:48 reporter bugnote:0007175 Last edited: 2025-05-05 15:05	Something I just realized, how should it behave with ed/sed/grep utilities? For example should \A in their case end up meaning start-of-file (and start-of-selection for ed and sed when addresses are used), or should it end up being equivalent to start-of-line, or be undefined-behavior (like start of the read buffer would be)? Thinking as a user I think \A being start-of-file would be interesting for grep(1), but as an implementer, I could see it being rather annoying to support.

dwheeler 2025-05-05 17:45 reporter bugnote:0007179	A lot of tools process data a line at a time. In those cases I think \A and \z should mean "beginning and ending of the current line being processed", since that's the data being processed. grep normally reads a line at a time, so I'd expect \A to mean the same as ^ and \z to mean the same as $ (presuming there's no multi-line mode enabled), since within its processing the "beginning of string" and "end of string" would be within the read line. More generally, I would want this to be easy to implement, or there's a risk it won't happen everywhere. In the case of ed and sed, again, commands like "s" work a line at a time, so \A should mean ^ and \z should mean $. They can already notate "first line" and "last line" anyway, so there's no strong need to do things "the hard way".
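The line-at-a-time behaviour described in this note can be sketched as a toy grep-like filter in Python. The function name grep_lines is hypothetical, Python's re module is used in place of a POSIX regex implementation, and \Z stands in for the proposed \z. Because each line is matched as its own string with the newline removed, the string anchors and the line anchors select the same lines.

```python
import re


def grep_lines(pattern, data):
    """Return the lines of data (split on newline only) that match pattern."""
    regex = re.compile(pattern)
    lines = data.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a final newline
    # Each line is matched on its own, without its line delimiter.
    return [line for line in lines if regex.search(line)]


if __name__ == "__main__":
    data = "a\nb\nab\n"
    print(grep_lines(r"^b$", data))
    print(grep_lines(r"\Ab\Z", data))
```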

lanodan 2025-05-05 18:16 reporter bugnote:0007180	Right, re-reading their specs more closely, they are defined as line-matching (including for ed/sed addresses). So implementations might need to be a bit more careful and properly pass a line rather than a whole buffer (especially for \z).

stephane 2025-05-05 20:47 reporter bugnote:0007181	sed's regexps are matched against the pattern space, which is initialised from the contents of the current line without the line delimiter but can be modified at will by the sed script, including by adding newline characters and by pulling additional lines from the input with the N command. awk's regexps can be applied on anything, grep works on lines (the "subject" is the line currently being considered, again not including the line delimiter), ex/vi's AFAICT are meant to be on lines (when not in vi-compatibility mode, vim regexes can have matches spanning several lines and \A there is for "non-alphabetic character", and \z a prefix for several additional regex operators; so in any case those \A, \z cannot be added to vim regexps). In any case, none of them use REG_NEWLINE, so in all POSIX utilities, \A is equivalent to ^ and \z to $ and match at the beginning and end of the subject, not at the beginning/end of each line within the subject.
$ printf 'a\nb\n' | sed 'N; s/^b/X/; s/^a.b$/<&>/'
<a
b>
^b did not match because the pattern space never started with b; after N, it started with a and ended with b, with one character (newline) in the middle. \A and \z would only become relevant/useful there if POSIX added support for perl's (?m-s) operators (more or less the equivalent of REG_NEWLINE).
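The sed example in this note can be approximated in Python as a sketch of why ^b fails on a two-line pattern space. This is an emulation under assumptions, not sed itself: Python's re module replaces the POSIX regex engine, re.DOTALL is used so that "." matches the newline as it does in POSIX without REG_NEWLINE, and the variable names are hypothetical.

```python
import re

# What sed holds in the pattern space after N joins the two input lines.
pattern_space = "a\nb"

# s/^b/X/ : without a multi-line mode, '^' matches only at the start of the
# subject, and the subject starts with 'a', so nothing is replaced.
after_first = re.sub(r"^b", "X", pattern_space, count=1)

# s/^a.b$/<&>/ : the whole pattern space matches, with '.' matching the newline.
after_second = re.sub(r"^a.b$", lambda m: "<" + m.group(0) + ">",
                      after_first, count=1, flags=re.DOTALL)

if __name__ == "__main__":
    print(after_second)
```

In this setting, replacing ^ with \A and $ with \Z gives the same result, which matches the observation that the string anchors only differ from the line anchors once a multi-line mode is in play.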

Date Modified	Username	Field	Change
2025-04-19 21:01	dwheeler	New Issue
2025-04-19 21:01	dwheeler	Status	New => Under Review
2025-04-19 21:01	dwheeler	Assigned To	=> ajosey
2025-04-24 16:20	eblake	Note Added: 0007153
2025-04-24 16:26	nick	Note Added: 0007154
2025-04-24 16:28	nick	Status	Under Review => Resolved
2025-04-24 16:28	nick	Resolution	Open => Accepted As Marked
2025-04-24 16:28	nick	Category	Front Matter => Base Definitions and Headers
2025-04-24 16:28	nick	Interp Status	=> ---
2025-04-24 16:28	nick	Final Accepted Text	=> see 0001919:0007154
2025-04-24 16:28	nick	Tag Attached: tc1-2024
2025-04-24 16:29	geoffclare	Project	1003.1(2008)/Issue 7 => 1003.1(2024)/Issue8
2025-04-28 18:55	dwheeler	Note Added: 0007157
2025-04-28 21:16	eblake	Note Added: 0007159
2025-04-29 12:55	dwheeler	Note Added: 0007161
2025-04-29 14:05	dwheeler	Note Added: 0007162
2025-05-02 15:42	msbrown	Note Added: 0007170
2025-05-05 14:32	dwheeler	Note Added: 0007174
2025-05-05 14:48	lanodan	Note Added: 0007175
2025-05-05 15:05	lanodan	Note Edited: 0007175
2025-05-05 17:45	dwheeler	Note Added: 0007179
2025-05-05 18:16	lanodan	Note Added: 0007180
2025-05-05 20:47	stephane	Note Added: 0007181
2025-07-03 16:13	geoffclare	Assigned To	ajosey => geoffclare
2025-07-08 14:30	geoffclare	Status	Resolved => Applied
