View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001941 | 1003.1(2024)/Issue8 | Shell and Utilities | public | 2025-08-30 21:51 | 2025-09-20 02:32 |
Reporter | dwheeler | Assigned To | ajosey | ||
Priority | normal | Severity | Objection | Type | Enhancement Request |
Status | Under Review | Resolution | Open | ||
Name | David A. Wheeler | ||||
Organization | |||||
User Reference | |||||
Section | grep | ||||
Page Number | 1 | ||||
Line Number | 1 | ||||
Interp Status | |||||
Final Accepted Text | |||||
Summary | 0001941: Add widely-implemented options to grep | ||||
Description | DESCRIPTION: The "grep" utility is widely used on POSIX systems. However, POSIX doesn't include a lot of widely-available options that many people depend on. It'd be better if these grep options were standardized in POSIX so that people could count on their presence on POSIX systems. This will cause no change in many implementations, and at most a few tweaks in others, since many implementations already support the options proposed here. If they don't currently implement them, they're easy to add. I compared the current POSIX specification of grep: https://pubs.opengroup.org/onlinepubs/9799919799/ to a few sample implementations of grep: GNU - https://man7.org/linux/man-pages/man1/grep.1.html FreeBSD - https://man.freebsd.org/cgi/man.cgi?grep(1) MacOS - https://www.unix.com/man_page/osx/1/grep/ Busybox - https://busybox.net/downloads/BusyBox.html Obviously there are other POSIX implementations, but if something is widely implemented by multiple common implementations, I think that's a good argument for standardization. I'm only listing options in at least 3, and nearly all are in all, suggesting they are widely implemented. BusyBox targets small systems; when even BusyBox implements something, that's evidence people want and expect it. Anyway, here are my proposed additional options in brief, along with the rationale for adding each one. My proposed more formal wording is in the "desired action" section: Control filename printing: -L List names of files that do not match. This is a useful negation of -l. This is NOT the same as -v when there are multiple files, and -v can't replace it. Implemented in GNU, FreeBSD, MacOS, and BusyBox. -H Print the file name for each match. This is the default when there is more than one file to search, and that's a pain when you don't know for sure how many files will be processed. You can work around this by reading an empty file, but a standard option is cleaner. Implemented in GNU, FreeBSD, MacOS, and BusyBox. 
-h Suppress the prefixing of file names on output. This is the default when there is only one file (or only standard input) to search. You can use cat first, but this seems clearer and removes the need for an extra process to remove the possibility of filenames. Implemented in GNU, FreeBSD, MacOS, and BusyBox. Control context: -A NUM Print NUM lines of context AFTER each match. Often you want to have a sense of what is "around" the match, and this controls how much afterwards. I note that Claude Code often uses -A and -B to get a sense of what it's matching (it won't want to read a whole file, that burns up tokens), so being able to isolate context is a *today* issue. I've tried to nail down the semantics in more detail on GNU and MacOS, and they seem consistent, so here's what I've learned from those two platforms. Both use '--' to separate context results. When there's an overlap of ranges due to -A or -B, both GNU and MacOS just continue the lines (emitting each line only once), creating one longer sequence of lines and omitting the '--' separator when they overlap. When filenames and line numbers are added (-H, -n), the separators of that information are ":" for the matches and "-" for the context lines that are not matches. Implemented in GNU, FreeBSD, MacOS, and BusyBox. -B NUM Print NUM lines of context BEFORE each match. Again, this helps you determine what you'll see "around" the match. Implemented in GNU, FreeBSD, MacOS, and BusyBox. Other: -m NUM Immediately stop reading a file after NUM findings (matches, or in the case of -v, non-matches). You can pipe to `head -n NUM` if there's only one file, but this can be far less efficient. That depends on writing and eventually getting a SIGPIPE, which may be a while if there are buffers or a lot of processing before the next write. Counting and immediately stopping is far more efficient on large datasets, and it's a nicer interface too. 
If there's more than one file, there's no easy way to have the same effect, you have to grep and pipe in a loop. Implemented in GNU, FreeBSD, MacOS, and BusyBox. -o Print only the matched (non-empty) part(s) of a matching line, with each matching part displayed on a separate output line. If it matches multiple times, it's output multiple times. GNU and MacOS differ on semantics when you combine -o with -v, but that's a pretty odd combination, and it's useful even if you allow for variation when -v is used. Implemented in GNU, FreeBSD, BusyBox (it's not clear how it handles multiple matches in a line), MacOS. -G Interpret pattern(s) as a basic regular expression (default). Many tools don't use basic regex any more, so it can be helpful to remind people that this is what it's doing. It's also helpful if someone aliases grep to be "grep -E" normally but they still want to use basic regex sometimes. Not in BusyBox, but trivial to add. Implemented in GNU, FreeBSD, MacOS. There are other widely-used options that I'm not including in this proposal: All implement recursively following directories using at least -r, and sometimes -R and --recursive and -d recurse. However, there are some differences in how they handle symbolic links. That idea could be taken up separately. All implement a "whole word" match with -w. However, that raises complications on defining word boundaries, especially since POSIX doesn't define the underlying construct. This may be quite doable, but since that discussion is complicated, maybe that's for another day. The grep document specifically notes that pathnames can be processed by grep, but there's no way to handle all pathnames because linefeed is allowed to be in filenames and there's no standard way to use nulls for filenames. All actual implementations I've investigated implement ways to use null bytes to handle filenames, but there are differences in the option flags they use, and I imagine there will be controversy in any proposal. 
So I won't try that here, maybe another day. All have a "-C" option that's equivalent to `-A NUM -B NUM`. However, in some the NUM is optional and must be right after the C (no space), in others it's not optional and must be separated by a space. How annoying. For portability a new letter would have to be devised. For now, if POSIX adds -A and -B, then people can portably get the same effect without too much hassle. In the process, I also found what I think is a minor error in the current spec. The combination -ql shouldn't print any lines, but the current text describing the output format seems to imply otherwise. As an example of the context semantics described above, the following commands produce the same result on both GNU and MacOS:
seq 1 25 > seq1-25
grep -A 2 -B 1 -Hn 1 seq1-25
The result for both:
seq1-25:1:1
seq1-25-2-2
seq1-25-3-3
--
seq1-25-9-9
seq1-25:10:10
seq1-25:11:11
seq1-25:12:12
seq1-25:13:13
seq1-25:14:14
seq1-25:15:15
seq1-25:16:16
seq1-25:17:17
seq1-25:18:18
seq1-25:19:19
seq1-25-20-20
seq1-25:21:21
seq1-25-22-22
seq1-25-23-23
My apologies for the placeholder "page number" and "line number", I normally use the HTML version, and I don't have a PDF version of the latest specification. | ||||
Desired Action | Changes: SYNOPSIS: Change this: ~~~~ grep [-E|-F] [-c|-l|-q] [-insvx] -e pattern_list [-e pattern_list]... [-f pattern_file]... [file...] grep [-E|-F] [-c|-l|-q] [-insvx] [-e pattern_list]... -f pattern_file [-f pattern_file]... [file...] grep [-E|-F] [-c|-l|-q] [-insvx] pattern_list [file...] ~~~~ to: ~~~~ grep [-E|-F|-G] [-c|-l|-L|-q] [-Hhinosvx] [-A NUM] [-B NUM] [-m NUM] -e pattern_list [-e pattern_list]... [-f pattern_file]... [file...] grep [-E|-F|-G] [-c|-l|-L|-q] [-Hhinosvx] [-A NUM] [-B NUM] [-m NUM] [-e pattern_list]... -f pattern_file [-f pattern_file]... [file...] grep [-E|-F|-G] [-c|-l|-L|-q] [-Hhinosvx] [-A NUM] [-B NUM] [-m NUM] pattern_list [file...] ~~~~ Under "OPTIONS" add these (retaining the case-sensitive sort): -A NUM Print NUM lines of context *after* each line selected. If none of those later lines after nor the line after them is selected, print the group separator ('--') after those context lines. If any of those lines after or the following line *is* selected, display all those lines in sequence without duplication. This has no effect with the -o option. See also -B. -B NUM Print NUM lines of context *before* each line selected. If none of those lines before nor the line before them is selected, print the group separator ('--') before those context lines. If any of those lines before or the previous line before them *is* selected, continue to display all those lines in sequence without duplication. This has no effect with the -o option. See also -A. -G Interpret all pattern(s) as basic regular expression(s). This is the default. -H Print the file name for each selection. This is the default when there is more than one file given to search. If the -o option is provided, the file name is provided on each selection. See the -h option. -L List names of files that were processed but no lines were selected. -h Suppress the prefixing of file names on output. 
This is the default when there is only one file (or only standard input) to search. See the -H option. -m NUM Immediately stop reading a file after NUM selections. -o Display only the selected (non-empty) part(s) of a matching line. If there are multiple matches, each selection is displayed on a separate output line. These outputs may be prefixed by filename and/or line number depending on the number of files examined, the -H option, and the -n option. The use of -o with -v is not portable. Change STDOUT to: If -q is specified, there is no output to standard out regardless of other options. Otherwise, if the -l or -L option is in effect, the following shall be written for each file containing at least one selected input line: "%s\n", <file> Otherwise, if the -H option is selected or more than one file argument appears (and -q, -l, and -L are not specified), the grep utility shall prefix each output line by this, where <context-marker> is ':' if this line is selected and '-' if this line was not selected but is only being displayed as part of context: "%s%s", <file>, <context-marker> The remainder of each output line shall depend on the other options specified: If the -c option is in effect, the remainder of each output line shall contain: "%d\n", <count> Otherwise, if -c is not in effect and the -n option is in effect, the following shall be written to standard output, where <context-marker> is the same as described above: "%d%s", <line number>, <context-marker> Finally, if the '-o' option was selected, the following shall be written to standard output, with multiple lines for each selected-part that matches: "%s\n", <selected-part> Otherwise, if the -o option was not selected, the following shall be written to standard output, which may either be a selected line, or one of the context lines before or after a selected line: "%s\n", <line contents> | ||||
Tags | No tags attached. |
|
Add to -m: "For a portable result the number must be 1 or greater or -1 to indicate 'no maximum'". GNU and MacOS, at least, differ on the interpretation of 0 (GNU=stop immediately, MacOS= no limit). |
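Until -m is available everywhere, the "grep and pipe in a loop" workaround mentioned in the description can be sketched as follows. This is only an approximation under stated assumptions: the function name `first_matches` is hypothetical, NUM must be 1 or greater (matching the portability note above, since `head -n` requires a positive count), and grep may still read past the NUMth match until it receives SIGPIPE, which is exactly the inefficiency -m would remove.

```shell
# first_matches NUM PATTERN FILE...   (hypothetical helper, not part of any grep)
# Print at most NUM selected lines from each FILE, using only options that
# POSIX already specifies for grep and head.
first_matches() {
    num=$1
    pattern=$2
    shift 2
    for f in "$@"; do
        # The extra /dev/null operand forces the "file:" prefix on each line,
        # mirroring what grep prints when several files are searched.
        grep -e "$pattern" -- "$f" /dev/null | head -n "$num"
    done
}
```

For example, `first_matches 2 1 a.txt b.txt` prints up to two matching lines per file. Unlike a real -m, the exit status here comes from head, not grep, so it cannot be used to tell whether anything matched.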
|
Whups, I just noticed a mistake in mine, I didn't handle -h. Change: Otherwise, if the -H option is selected or more than one file argument appears (and -q, -l, and -L are not specified), the grep utility shall prefix each output line by this, where <context-marker> is ':' if this line is selected and '-' if this line was not selected but is only being displayed as part of context: to: Otherwise, if the -H option is selected, or more than one file argument appears and the -h option was not specified (and -q, -l, and -L are not specified), the grep utility shall prefix each output line by this, where <context-marker> is ':' if this line is selected and '-' if this line was not selected but is only being displayed as part of context: |
|
I have conflicting options. > -H Print the file name for each selection. This is the default when there is more > than one file given to search. If the -o option is provided, the file name is > provided on each selection. See the -h option. versus: -H If -R is specified, follow symbolic links only if they were explicitly listed on the command line. The default is not to follow symbolic links. And: > -o Display only the selected (non-empty) part(s) of a matching line. If there are versus: -o Always print filename headers with output lines. Also: > All implement recursively following directories using at least -r, and sometimes > -R and --recursive and -d recurse. However, there are some differences in how > they handle symbolic links. That idea could be taken up separately. Nope, BSD has -R but not -r: -R Recursively search subdirectories listed. BSD has had -o and -H assigned like this for over two decades, so I’m not sure whether occupying them with a GNU extension with different semantics (which I am aware of as Debian Developer) is wise. Addendum: > -L List names of files that were processed but no lines were selected. This is more specific (again from the BSD manpage): -L Only the names of files not containing selected lines are written to standard output. Pathnames are listed once per file searched. If the standard input is searched, the string "(standard input)" is written. (“once per file searched” lists it twice if the same input file is given twice, of course) |
|
Oh, wait, I missed one note: The STANDARDS section of BSD grep says: Historic versions of the grep utility also supported the flags [-ruy]. This implementation supports those options; however, their use is strongly discouraged. So apparently it supports a -r but doesn’t even document what it does; that is how strongly it has been deprecated for decades. |
|
I hate it when different implementations of the same utility have conflicting meanings for the same option flag. If only there were a standard :-) :-). Which BSD are you talking about? I looked at FreeBSD: https://man.freebsd.org/cgi/man.cgi?grep(1) I also looked at MacOS, which is BSD-ish-derived. If we must, we can drop some flags of course. |
|
Regarding -L, I proposed: -L List names of files that were processed but no lines were selected. Counter-proposal is: -L Only the names of files not containing selected lines are written to standard output. Pathnames are listed once per file searched. If the standard input is searched, the string "(standard input)" is written. I can confirm that both GNU and MacOS also produce "(standard input)" if - is the filename. Unfortunately, GNU and MacOS *do* list pathnames twice, if provided twice. I think that's the sort of thing that the standard can simply allow variance for. I can't imagine it's common to provide the same path multiple times, usually it's a glob or find result or some such. So here is revised wording: -L List names of files that were processed but no lines were selected. If standard input was processed its name will be considered "(standard input)". If the same filename was processed multiple times, and it is to be output, it's unspecified if its name will be output only once or once for each occurrence. |
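To make the "-L is not -v" point concrete, here is a minimal sketch of how -L has to be emulated today with only the options POSIX already specifies. The function name `files_without_match` is hypothetical, and the sketch does not handle standard input or the "(standard input)" label discussed above.

```shell
# files_without_match PATTERN FILE...   (hypothetical helper)
# Print the name of each FILE in which no line is selected, relying only on
# grep -q and its exit status (0: selected, 1: none selected, >1: error).
files_without_match() {
    pattern=$1
    shift
    for f in "$@"; do
        if grep -q -e "$pattern" -- "$f"; then
            continue                # at least one line was selected
        elif [ "$?" -eq 1 ]; then
            printf '%s\n' "$f"      # no line selected: this is what -L lists
        fi                          # status >1: error, grep has reported it
    done
}
```

By contrast, `grep -lv PATTERN FILE...` lists every file that contains at least one non-matching line, so a file holding both matching and non-matching lines is listed by `-lv` but not by -L. This loop also prints a name once per operand, so a file given twice is listed twice.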
|
Note that the reason those are widespread is that for the longest time the BSDs actually shipped with GNU grep (some version thereof). Many have since re-implemented theirs while trying to keep the GNU API, but they sometimes diverged among themselves and from GNU grep after that point (and even before, as they stayed on an older version of GNU grep). Recursive grep is very unportable, in part because in GNU grep it has changed greatly over the years with regard to the handling of symlinks and non-regular files. The divergences are hard to reconcile now. -L/-H as in other recursive tools can't be used because they're used for other things in GNU grep, and anyway we'd need a third one for whether or not to process symlinks to regular files when doing a traversal that doesn't itself follow symlinks. ast-open grep (which can also be made the grep builtin of ksh93, though I don't know whether there's any system where that's the case) is another grep implementation that has incorporated many of GNU grep's features (https://github.com/ksh93/ast-open-archive/blob/2014-12-24/src/lib/libcmd/grep.c#L23-L113). Its -o prints empty matches (a bug, as it means it runs into infinite loops in that case). |
|
-o would be more useful with an option to print what was matched by capture groups, as in pcre2grep (formerly pcregrep):
$ echo foobar | pcre2grep -o1 -o2 -o3 --om-separator=, '(.)(.)(.)'
f,o,o
b,a,r
That's one reason I often use pcre2grep instead of grep. Doing it with perl (the p in pcre2grep) is not much more difficult here:
$ echo foobar | perl -C -lne 'print "$1,$2,$3" while /(.)(.)(.)/g'
f,o,o
b,a,r
But that's a different matter when you need to combine with other features of grep such as -r. |
|
> Note that the reason those are widespread is that for the longest time BSDs actually shipped with GNU grep (some version thereof), many have since eventually re-implemented theirs trying to keep the GNU API, but sometimes diverging between themselves and from GNU grep after (and even before as they stayed on an older version of GNU grep) that point. That actually gives me hope - that makes it more likely we can gain more agreement on *some* things. Shared ancestry & compatibility gave us POSIX in the first place. > Recursive grep is very unportable in part because in GNU grep it has changed greatly over the years with regards to handling of symlinks and non-regular files. Divergences are hard to reconciliate now. -L/-H like in other recursive tools can't be used because they're used for other things in GNU grep, and anyway we'd need a third one for whether or not to process symlinks to regular files when doing a traversal that doesn't itself follow symlinks. As I noted, I didn't propose a recursive option in this issue. I do think it'd be possible to add one. In a lot of cases it's known there are no symlinks, so the difference is irrelevant, and options could be added to control "where it matters". But if you want to discuss recursive grep, let's take that up in a different issue where it's being proposed. > ast-open grep (which can also be made the grep builtin of ksh93, though I don't know whether there's any system where that's the case) is another grep implementation that has incorporated many of GNU grep's features (https://github.com/ksh93/ast-open-archive/blob/2014-12-24/src/lib/libcmd/grep.c#L23-L113). Its -o prints empty matches (a bug as it means it runs into infinite loops in that case). That's obviously a bug. We don't need to require bugs. Nobody else does that, and I can't imagine why anyone would want that. I can't imagine anyone *depending* on that behavior on a portable script, since that would fail in so many systems. 
Also, I'm sure nobody wants unintentional infinite loops :-). > -o would be more useful with an option to print what was matched by capture groups like in pcre2grep (formerly pcregrep)'s... > echo foobar | pcre2grep -o1 -o2 -o3 --om-separator=, '(.)(.)(.)' Printing capture groups does sound useful. I don't know an existing standard grep that supports printing capture groups, but if the spec added that *functionality* that'd be great. The syntax would have to be different. Nobody's current `-o` permits an optional parameter. Enabling an optional parameter would make existing scripts fail, e.g., `grep -oE...`. I can imagine something like -g NUMBER that can be repeated, each one causing the display of a captured group (in order requested). Another option could change the separator. However, I suggest that be a *separate* issue. Feel free to propose it! However, this issue was more about capturing options that are already in wide use in scripts that *already* work across a variety of systems. Creating an option not currently implemented by existing greps is obviously possible, but let's capture those discussions in separate groups :-). |
|
I asked Claude Code to analyze the support of OpenBSD and NetBSD for these proposed new grep options. I had previously looked at FreeBSD. That way at least FreeBSD, OpenBSD, and NetBSD are considered. I tweaked what it reported; here's the tweaked version:
| Option | OpenBSD | NetBSD | Notes |
|--------|----------|----------|----------------------|
| -A NUM | supports | supports | Identical |
| -B NUM | supports | supports | Identical |
| -G | supports | supports | Default behavior |
| -H | supports | supports | Identical |
| -L | supports | supports | Identical |
| -h | supports | supports | Identical |
| -m NUM | variance | variance | Slight variation on context |
| -o | supports | supports | Identical |
There's extraordinary agreement in general on these grep options. I think everyone uses group separator '--' for -A and -B by default (where they are supported), which is happy news. The main variance appears to be with the -m option: If -m stops output, OpenBSD's implementation doesn't output trailing context, while NetBSD's does. I think OpenBSD's is the better semantic, because maybe we *don't* want to keep reading after that for some particular reason. That also seems to be what everyone else does (MacOS, GNU). However, I'm fine with making that a permitted variance if that's necessary for standardization. So if that variance must be permitted, modify -m to say: > -m NUM Immediately stop reading a file after NUM selections. It is permitted (but not recommended) to read and print context lines afterwards if context lines afterwards were requested. |
|
I think we should do our best to avoid conflicting options. However, once a massive consensus emerges among many widely-used systems, we should consider that. Not having grep -o in the standard (to extract just what was matched) is *crippling*. People just use it anyway. The lack of -H can be worked around with /dev/null, but it's an ugly hack that isn't necessary in the BSD-ish systems of OpenBSD, NetBSD, FreeBSD, and MacOS, nor in GNU nor in BusyBox. I propose standardizing on these, as these are the "general consensus", even among many widely used BSD systems. Reasonable minds can differ obviously, but here I state my position, and discussion can continue :-). |
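For reference, the /dev/null workaround for the missing -H, and the cat workaround for the missing -h mentioned in the description, look roughly like this. Both function names are hypothetical, and both are sketches that use only options POSIX already specifies.

```shell
# grep_with_names PATTERN FILE...   (hypothetical stand-in for grep -H)
# Adding /dev/null as an extra operand means grep always sees more than one
# file, so every output line gets the "file:" prefix. /dev/null never matches.
grep_with_names() {
    pattern=$1
    shift
    grep -e "$pattern" -- "$@" /dev/null
}

# grep_without_names PATTERN [FILE...]   (hypothetical stand-in for grep -h)
# Concatenating first means grep only ever sees standard input, so no prefix
# is printed, at the cost of an extra process.
grep_without_names() {
    pattern=$1
    shift
    cat -- "$@" | grep -e "$pattern"
}
```

Neither is a full replacement: `grep_with_names` with no file operands searches only /dev/null instead of standard input, and `grep_without_names` loses per-file behavior (for example, -n would number lines across the concatenated input).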
|
One system where -o is different is MirOS/MirBSD, I guess mirabilos could chime in to add some information: http://www.mirbsd.org/htman/i386/man1/grep.htm:
MirBSD being derived from OpenBSD which matches the behavior of the proposal. https://man.openbsd.org/grep.1:
|
|
I was asked to do some research; below are my results. In summary: * Oracle Solaris implements some of these grep options. It doesn't implement some, but there are no flag conflicts and it should be easy to implement them. * MirBSD has some incompatibilities. However, this is a niche hobby project that self-describes as "not suitable for installation" and only supports obsolete chips (32-bit i386 and sparc). It has incompatible -H and -o options. I think that should NOT hold back POSIX. Let's standardize on what MacOS, GNU/Linux, FreeBSD, BusyBox, and all agree on. * Regarding -C ("context"): A -Cnum option *might* work across existing implementations. I'll investigate further; if anyone knows anything about this, please speak up. * I will separately collect the tweaks (based on feedback) to create an updated proposal, formatted using HTML instead of Markdown. However, I want to track down information on -C first. First, regarding Oracle (nee Sun) Solaris: The documentation on its grep implementation appears to be here (this is for version 11.4, which appears to be the current version): https://docs.oracle.com/cd/E88353_01/html/E37839/grep-1.html Solaris grep already supports, with the same semantics: -G: Interpret patterns as basic regular expressions (this is already the default behavior in Solaris) -h: Suppress the prefixing of file names on output. This is the default when there is only one file (or only standard input) to search. See the -H option. 
Solaris grep does NOT support, and has no conflicting flags, for: -A NUM: Print NUM lines of context after each match -B NUM: Print NUM lines of context before each match -H: Print the file name for each selection (force filename display) -L: List names of files that were processed but no lines were selected -m NUM: Stop reading after NUM selections -o: Display only the selected (non-empty) parts of matching lines -C NUM: -A NUM and -B NUM [not in original proposal] There's no conflict, and these should all be really easy to implement in an existing implementation. Regarding MirBSD ("mirabilos’ Open Source playground): As they explain in http://www.mirbsd.org/about.htm - "MirOS BSD is a niche operating system from the BSD family for 32-bit i386 and sparc systems. It is based on 4.4BSD-Lite (mostly OpenBSD, some NetBSD®)... MirBSD pretty much is a hobby project with interesting side projects and subprojects in wide use (such as mksh), but which by itself… is not suitable for installation, except by people actually wanting to work on it." I don't think POSIX should be beholden to MirBSD. It's self-described as a hobby project "not suitable for installation" and focused on obsolete chips. 32-bit i386 chips are no longer even being made. SPARC chips are made but new work has been discontinued & there's a general transition away from them. For completeness, I reviewed MirBSD grep docs here: http://www.mirbsd.org/man.cgi?q=grep#direct MirBSD already supports, with the same semantics: -A NUM: Print NUM lines of context after each match -B NUM: Print NUM lines of context before each match -L: List names of files that were processed but no lines were selected. Pathnames are listed once per file searched. If the standard input is searched, the string "(standard input)" is written. It also supports: -C[num] : -A and -B. No whitespace may be given between option and argument. Not in original proposal. 
It does not support (but could add, since it has no incompatible flag): -m NUM: Stop reading after NUM selections There are two MirBSD incompatibilities: -H If -R is specified, follow symbolic links only if they were explicitly listed on the command line. The default is not to follow symbolic links. vs. proposal: -H: Print the file name for each selection (force filename display) -o: Display only the selected (non-empty) parts of matching lines vs. proposal -o: Extract match Regarding -C: * On MacOS, "-C[num]" is supported. The num is *optional* (default 2), and *must* be added *without* a space. * On GNU, "-C NUM" is supported. The man page suggests NUM is required (not optional). HOWEVER, it appears that in PRACTICE, GNU quietly implements -CNUM as well. (where there's no space after the number). So it might be possible to agree on an isolated "-Cnum" where there's no space after the "C". Handling this requires somewhat irregular option parsing, and I haven't verified this works everywhere. Note that just "-C" without a number doesn't work. I'm trying to investigate further, but it's hampered because this doesn't appear to be well-documented. |
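As a point of comparison for the -C discussion, the "same effect without too much hassle" route mentioned in the description can be sketched as a small wrapper. This assumes the local grep already implements -A and -B as proposed in this issue (GNU, the BSDs, MacOS, and BusyBox reportedly do; current POSIX does not require it), and `grep_context` is a hypothetical name.

```shell
# grep_context NUM GREP-ARGS...   (hypothetical helper)
# Equivalent in intent to "-C NUM": NUM lines of context on both sides.
# ASSUMPTION: the underlying grep supports -A NUM and -B NUM, which are
# widely implemented but not yet in POSIX.
grep_context() {
    num=$1
    shift
    grep -A "$num" -B "$num" "$@"
}
```

For example, `grep_context 1 -n pattern file` sidesteps the question of whether a space is allowed between -C and its argument, because NUM is always passed as a separate argument to -A and -B.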
|
> All implement a "whole word" match with -w. However, that
> raises complications on defining word boundaries, especially
> since POSIX doesn't define the underlying construct. This may
> be quite doable, but since that discussion is complicated,
> maybe that's for another day.
Actually POSIX does already specify the \< and \> regexp operators for the ex utility: https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/utilities/ex.html#tag_20_40_13_58
> \<
> Match the beginning of a word. (See the definition of word
> at the beginning of Command Descriptions in ex.)
> \>
> Match the end of a word.
That's the wrong reference, btw, looks like it should be a reference to "Input Editing in ex" (I'll raise a bug about that):
> word
>
> In the POSIX locale, a word consists of a maximal sequence of
> letters, digits, and underscores, delimited at both ends by
> characters other than letters, digits, or underscores, or by
> the beginning or end of a line or the edit buffer.
And the initial implementation of grep -w (AFAIK from BSD in the late 70s, ex being also a BSD utility) was implemented by adding \<...\> around the regex to match. https://github.com/dspinellis/unix-history-repo/blob/BSD-2/src/grep.c#L105-L106 That's however not necessarily the best approach and not what all implementations do these days. For example, with GNU grep (and its clones):
$ echo 'a -b- c' | grep '\<-b-\>'
$ echo 'a -b- c' | grep -we -b-
a -b- c
That is grep -w word being more like grep -P '(?<!\w)word(?!\w)' regardless of whether "word" itself starts and/or ends with \w or not. Sounds like a better approach.
$ echo 'a--b--c' | grep -we -b-
a--b--c
That one may be more debatable. The fact that there's no agreement in practice between grep implementations may mean it's best to leave it out for now. 
Another issue with \<, \> if they were to be specified is that we'd likely want to also specify the REG_STARTEND BSD flag for regcomp() and sed/grep -o to use it, or we'd get into issues such as:
$ echo aaa | sed 's/\<a/<a/g'
<a<a<a
$ echo aaa | grep -o '\<a'
a
a
a
As each "a" ends up being at the start of the subject upon successive match. For the record, and for what it's worth, I otherwise support your proposal. |
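For scripts that need something like -w today, the lookaround-style interpretation described above can be approximated with a POSIX extended regular expression. This is only a sketch: `grep_word` is a hypothetical name, "word character" is taken to be `[[:alnum:]_]` in the current locale, and it only reproduces which lines are selected (it would not give the right substrings with -o, since the delimiter characters become part of the match).

```shell
# grep_word PATTERN [FILE...]   (hypothetical approximation of grep -w)
# Select lines where PATTERN (an ERE) occurs with no word character directly
# before or after it; line start and line end also count as delimiters.
grep_word() {
    pattern=$1
    shift
    grep -E -e "(^|[^[:alnum:]_])($pattern)([^[:alnum:]_]|\$)" -- "$@"
}
```

With this definition, `echo 'a -b- c' | grep_word '-b-'` and `echo 'a--b--c' | grep_word '-b-'` both select their input line, matching the GNU behavior quoted above, while `echo foobar | grep_word foo` selects nothing.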
|
Here’s my analysis on <tt>grep -C</tt>. Does anyone have guidance on what should go in to POSIX? I believe at least <tt>grep -CNUM</tt> (with NUM required) should be standardized. The <tt>grep -CNUM</tt> format is widely used and essentially universal. 8 of 9 systems I examined support the grep <tt>-C</tt> option, and all of those support the <tt>-CNUM</tt> format (no space). The one exception was Oracle Solaris, which lacks <tt>-C</tt> but could add it. Solaris hasn’t had an update in 7 years and is mostly-unmaintained, so its failure to support it shouldn’t block standardization. There’s a strong case to be made for <tt>grep -C NUM</tt> as well. This is widely supported (7 of 9), and Solaris could add it. Many systems’ man pages are misleading, for example, Apple MacOS 13.7.8 says “no whitespace may be given between the argument and its option” but in fact this is false; whitespace is supported between <tt>-C</tt> and the <tt>NUM</tt> argument. I suspect GNU did this first, and most of the BSDs have added support for it over time. I’ve verified through testing that this works on GNU, MacOS, BusyBox, ToyBox, FreeBSD, and NetBSD. The Illumos source code makes me believe it works there too. However, the <tt>grep -C NUM</tt> syntax does NOT work on OpenBSD (I verified through testing), and it would be incompatible with their existing use (where it is truly optional). It’d be cleaner to support <tt>grep -C NUM</tt> everywhere, but I understand avoiding unnecessary backwards incompatibility. I could live with just <tt>grep -CNUM</tt> being in the standard by itself. I’d love feedback/guidance on this. Systems I analyzed:
I don’t have a Solaris or Illumos system; I installed and ran a test script on everything else. Here’s my test script: https://dwheeler.com/misc/test_grep_c.sh Notes:
|
|
See https://www.ibm.com/docs/en/aix/7.3.0?topic=g-grep-command for the AIX man page. -w, -h, -H, -L, -m, -o are useful in scripts where portability matters, -A/-B/-C more interactively, so I wouldn't be upset myself if those were not added to POSIX. Note that new options with optional arguments have been rejected in the past. |
|
Thanks for the link. I see that -A/-B/-C aren't included in AIX, but the good news is that there's no *conflict* either. Adding support for -A/-B/-C is fairly easy. I believe -A/-B/-C are important. They're WIDELY implemented (at least 8 systems support them). What's more, I think AI Code Assistants are becoming hugely important, and being able to "show context" is really important to them. Reading tokens consumes resources, and they have maximum context windows. -A/-B/-C are easy ways to provide them a larger view without overwhelming them. Since they work across different systems, it's valuable to have a *standard* way for these computer systems to request context. |
|
I use -A and -B quite frequently, and my experience is that it is widely supported enough that I do not have to question "Am I using GNU grep?". Therefore, I am +1 adding it. |
|
MirBSD’s is what OpenBSD had for the longest time, so it’s been in use for ages. But since people want a -o with arguments anyway, why not use -O for that and leave -o explicitly unspecified? |
|
Re 0001941:0007263:
> MirBSD’s is what OpenBSD had for the longest time, so it’s been in use for ages.
OpenBSD dropped it in 4.8 (released in 2010) and added the -o option requested here in 5.0 (released in 2011). The removal was accompanied by the message "For the most part, our grep tries to be compatible with the defacto gnu standard for grep, but there's no need to blaze our own trail." If you've got an implementation that tries to be compatible with GNU grep, but isn't, that should just be regarded as a bug, not standardised as a feature, and OpenBSD had no issue changing that.
> But since people want a -o with arguments anyway, why not use -O for that and leave -o explicitly unspecified?
That requires changes on every implementation, but let's imagine. Imagine POSIX did that, GNU grep changed today, and we're twenty years down the line. How many legacy systems do you imagine would be out there where -O does not work? How many MirBSD systems do you imagine would be out there where -o does not work? I imagine the former would still far exceed the latter and because of that, I imagine it would be unlikely that people would switch to that new -O option. |
|
I agree with hvd. I think POSIX should not be *gratuitously* incompatible with historical and hobbyist systems. However, at this point, there's a general consensus that there needs to be a way to extract *just* the match from grep, and that "-o" is the option that does that. OpenBSD intentionally switched and implemented that change in 2011, which is *14* years ago. Changing the meaning of a flag is pretty strong evidence of a general consensus among the systems that are in wide use today. If there was agreement on a functionality but not the option flag, then yes, I think it's reasonable to propose an unused option flag. However, that is not the situation today. There's already a general consensus. POSIX should make it easy to find that consensus so that application developers can confidently build on it. |
|
OK, I had not seen that it has been that long since OpenBSD switched their -o. I guess for this specific case… I did a quick recursive grep on the MirBSD tree (which won’t find things split by linebreaks, but that would be very rare for options) and didn’t find anything obviously using -o or -H, so I guess I’ll agree to change. |
|
The "easy" button would be to require "-Cnum" support, where num is cuddled after -C, as that has practically universal support, with a strong encouragement to also support "-C num". OpenBSD is the only system that supports -Cnum but not "-C num". Would that pass muster in POSIX? Or is a different solution preferred? |
|
As far as the standard itself is concerned, requiring "-Cnum" with num mandatory but not separated from -C would be far from easy. There is currently no way to specify this in a SYNOPSIS section. See XBD 12.1 Utility Argument Syntax. If we stick to what's in 12.1, the choices are:
grep [-C[num]]
and:
grep [-C num]
The latter conflicts with OpenBSD and the former conflicts with GNU (and probably others). |
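To make the conflict between the two synopsis forms concrete, here is a minimal Python sketch of the two argument conventions. It is only a model: the function names are hypothetical, the default of 2 lines for a bare -C is an assumption for illustration, and neither function is any implementation's actual parser.

```python
DEFAULT_CONTEXT = 2  # assumed default for a bare -C; illustrative only


def parse_required(argv):
    """Model of `-C num`: the argument is mandatory, attached or separate."""
    context, rest = None, []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith("-C"):
            if len(arg) > 2:
                # Attached form: -C3
                context = int(arg[2:])
            else:
                # Separate form: -C 3 (the next word is consumed)
                if i + 1 >= len(argv):
                    raise ValueError("option -C requires an argument")
                i += 1
                context = int(argv[i])
        else:
            rest.append(arg)
        i += 1
    return context, rest


def parse_optional(argv):
    """Model of `-C[num]`: the argument is optional and only seen if attached."""
    context, rest = None, []
    for arg in argv:
        if arg.startswith("-C"):
            # A bare -C never consumes the next word.
            context = int(arg[2:]) if len(arg) > 2 else DEFAULT_CONTEXT
        else:
            rest.append(arg)
    return context, rest
```

Both models agree on `-C3 foo file`. They disagree on `-C 3 foo file`: the required-argument model reads 3 as the context count, while the optional-argument model leaves `3` as the first operand, so it would be taken as the pattern and `foo` as a file name.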
|
Thank you, you've answered my question about "what's more acceptable to POSIX". The `grep -C[num]` format with an OPTIONAL value isn't widely supported. OpenBSD supports it, and I think others *used* to as evidenced by MacOS' man page, but nearly all implementations (other than OpenBSD) have switched to making -C have a REQUIRED num instead of an optional num. I don't think it's *illegal* to have an unusual "-Cnum" format for a special case, but I completely understand that the group would prefer to avoid special cases, *especially* when 7 of 9 systems tested support the "easy and obvious" syntax of a required number. I've emailed Theo de Raadt (OpenBSD) to hear his thoughts. |
|
As requested, I've tried to rewrite this to a single proposed change. I tried to make it sound "POSIX-like". Improvements welcome.
Desired Action
Update the grep utility specification to add the following widely-implemented options that improve grep's functionality across systems, specifically -A -B -C -G -H -L -h -m -o with majority semantics, extending the existing POSIX grep specification here: https://pubs.opengroup.org/onlinepubs/9799919799/utilities/grep.html
Synopsis Changes:
Modify the synopsis to include the new options:
grep [-E|-F|-G] [-c|-l|-L|-q] [-H|-h] [-inosvx] [-A num] [-B num] [-C num] [-m num] -e pattern_list [-e pattern_list]... [-f pattern_file]... [file...]
grep [-E|-F|-G] [-c|-l|-L|-q] [-H|-h] [-inosvx] [-A num] [-B num] [-C num] [-m num] [-e pattern_list]... -f pattern_file [-f pattern_file]... [file...]
grep [-E|-F|-G] [-c|-l|-L|-q] [-H|-h] [-inosvx] [-A num] [-B num] [-C num] [-m num] pattern [file...]
Option Descriptions:
Add the following option descriptions to the OPTIONS section:
-A num Write num lines of context after each selected line.
-B num Write num lines of context before each selected line.
-C num Write num lines of context before and after each selected line. Equivalent to specifying both -A num and -B num.
-G Interpret patterns as basic regular expressions. This is the default behavior.
-H Prefix each output line with the filename followed by a colon (or dash if it is a context line and is not selected). Default when multiple files are specified.
-L Write only names of files containing no selected lines, one per line. Empty files are included if processed.
-h Suppress filename prefixes. Default when reading standard input or a single file.
-m num Stop reading each file for matching after num selected lines. Complete context for the last match should be written after stopping, but any lines written afterwards are not considered matches. Implementations may instead stop writing lines for that file after writing the corresponding information about the last match, without the trailing context. This is applied independently to each file. num shall be a non-negative decimal integer; if zero, behavior is implementation-defined.
-o Write only matching non-empty portions of selected lines, possibly prefixed with filename and line number information (see STDOUT).
Multiple non-overlapping matches for one pattern are written as separate lines, with the same filename and/or line number if they are written. Combining -o with context options (-A, -B, -C) or -v is implementation-defined.
For context options (-A, -B, -C), if there are overlapping or adjacent regions of context, the lines are written in sequence without duplication, and the group separator ('--') is written to separate non-adjacent context regions. When writing a filename and/or line number, non-matching context lines use dash separators while selected lines use colon separators (see STDOUT section). If a context line also matches the pattern, it shall be displayed as a selected line and counted toward the -m limit. A context does not extend before the beginning or after the end of a file.
For all numeric arguments, num shall be a non-negative decimal integer.
STDOUT Changes:
If -q is specified, there is no output to standard output regardless of other options. Otherwise, if the -l or -L option is in effect, the following shall be written for each file containing at least one selected input line:
"%s\n", <file>
Otherwise, if the -H option is selected or more than one file argument appears (and -q, -l, and -L are not specified), the grep utility shall prefix each output line by this, where <context-marker> is ':' if this line is selected and '-' if this line was not selected but is instead only being displayed as a context line:
Context group separators shall consist of a single line containing only "--" when context from different matches would otherwise be adjacent and there is at least one non-context line between the context regions in the original file.
Option Interactions:
Error Conditions: Invalid numeric arguments shall cause grep to write a diagnostic message to standard error and exit with status greater than 1. |
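The context rules in the proposed text (merging overlapping or adjacent regions, the "--" group separator, and the ':' versus '-' markers) can be sketched as a small Python reference model. This is a sketch under stated assumptions, not any implementation's code: `grep_context` and its parameters are hypothetical names, Python's `re` stands in for POSIX basic regular expressions, the `name` parameter stands in for -H, and -m, -n and -o are not modelled.

```python
import re


def grep_context(lines, pattern, before=0, after=0, name=None):
    """Return the output lines for one file under the proposed context rules."""
    if before < 0 or after < 0:
        raise ValueError("context counts must be non-negative")
    rx = re.compile(pattern)  # assumption: Python regex, not a POSIX BRE
    selected = {i for i, line in enumerate(lines) if rx.search(line)}
    out = []
    last = None  # index of the last input line written so far
    for i in sorted(selected):
        # A context never extends past the start or end of the file.
        lo = max(0, i - before)
        hi = min(len(lines) - 1, i + after)
        for j in range(lo, hi + 1):
            if last is not None and j <= last:
                continue  # already written by an overlapping region
            # Assumption: separators appear only when context was requested.
            if last is not None and j > last + 1 and (before or after):
                out.append("--")
            # Selected lines use ':'; context-only lines use '-'.
            marker = ":" if j in selected else "-"
            prefix = f"{name}{marker}" if name is not None else ""
            out.append(prefix + lines[j])
            last = j
    return out
```

For example, with the input lines `a x b c d e x f`, the pattern `x`, and one line of context on each side, the model writes `a x b`, then `--`, then `e x f`. With the input `x a b x` the two regions are adjacent, so all four lines are written with no separator.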
Date Modified | Username | Field | Change |
---|---|---|---|
2025-08-30 21:51 | dwheeler | New Issue | |
2025-08-30 21:51 | dwheeler | Status | New => Under Review |
2025-08-30 21:51 | dwheeler | Assigned To | => ajosey |
2025-08-30 21:56 | dwheeler | Note Added: 0007240 | |
2025-08-30 21:59 | dwheeler | Note Added: 0007241 | |
2025-08-31 00:07 | mirabilos | Note Added: 0007242 | |
2025-08-31 00:10 | mirabilos | Note Added: 0007243 | |
2025-08-31 21:52 | dwheeler | Note Added: 0007244 | |
2025-08-31 22:01 | dwheeler | Note Added: 0007245 | |
2025-09-01 05:57 | stephane | Note Added: 0007246 | |
2025-09-01 06:05 | stephane | Note Added: 0007247 | |
2025-09-01 15:36 | dwheeler | Note Added: 0007249 | |
2025-09-01 17:10 | dwheeler | Note Added: 0007250 | |
2025-09-01 17:18 | dwheeler | Note Added: 0007251 | |
2025-09-11 15:31 | lanodan | Note Added: 0007253 | |
2025-09-11 15:36 | lanodan | Note Edited: 0007253 | |
2025-09-11 15:37 | lanodan | Note Edited: 0007253 | |
2025-09-11 15:37 | lanodan | Note Edited: 0007253 | |
2025-09-11 15:50 | geoffclare | Project | 1003.1(2008)/Issue 7 => 1003.1(2024)/Issue8 |
2025-09-11 18:01 | dwheeler | Note Added: 0007256 | |
2025-09-12 16:28 | stephane | Note Added: 0007258 | |
2025-09-15 02:03 | dwheeler | Note Added: 0007259 | |
2025-09-15 02:04 | dwheeler | Note Edited: 0007259 | |
2025-09-15 02:08 | dwheeler | Note Edited: 0007259 | |
2025-09-15 09:14 | stephane | Note Added: 0007260 | |
2025-09-16 13:47 | dwheeler | Note Added: 0007261 | |
2025-09-16 16:55 | collinfunk | Note Added: 0007262 | |
2025-09-17 00:06 | mirabilos | Note Added: 0007263 | |
2025-09-17 07:29 | hvd | Note Added: 0007264 | |
2025-09-17 14:17 | dwheeler | Note Added: 0007265 | |
2025-09-17 23:47 | mirabilos | Note Added: 0007266 | |
2025-09-18 15:26 | dwheeler | Note Added: 0007267 | |
2025-09-18 16:14 | geoffclare | Note Added: 0007269 | |
2025-09-18 16:14 | geoffclare | Note Edited: 0007269 | |
2025-09-18 18:02 | dwheeler | Note Added: 0007271 | |
2025-09-20 02:32 | dwheeler | Note Added: 0007272 |