ID | Category | Severity | Type | Date Submitted | Last Update | ||
0000251 | [1003.1(2008)/Issue 7] Base Definitions and Headers | Objection | Enhancement Request | 2010-05-03 18:49 | 2024-06-11 08:53 | ||
Reporter | dwheeler | View Status | public | ||||
Assigned To | ajosey | ||||||
Priority | normal | Resolution | Accepted As Marked | ||||
Status | Closed | ||||||
Name | David A. Wheeler | ||||||
Organization | |||||||
User Reference | |||||||
Section | XBD 3.170 Filename | ||||||
Page Number | 60 | ||||||
Line Number | 1781 | ||||||
Interp Status | --- | ||||||
Final Accepted Text | See Note: 0006561. | ||||||
Summary | 0000251: Forbid newline, or even bytes 1 through 31 (inclusive), in filenames | ||||||
Description |
Forbid bytes 1 through 31 (inclusive) in filenames.

POSIX.1-2008 page 60 lines 1781-1786 states that filenames (aka "pathname component") may contain all characters except <slash> and the null byte, and this has historically been true. However, this excessive permissiveness has resulted in numerous security vulnerabilities and erroneous programs. It also increases the effort to write correct programs, because correctly processing filenames that include characters like newline is very difficult (even the expert POSIX developers have trouble; see 0000248). The "Unix-Haters Handbook" specifically notes the problems caused by control characters (such as newlines) in filenames (see pages 156-157), so this is not a new problem!

A key offender, of course, is the <newline> character. This is widely used as a filename separator, even though it is strictly speaking not a valid filename separator (since a filename may include it). But other control characters also cause problems. Another common problematic character is the <tab> character; files with records terminating in newline, and fields separated by tab, are extremely common, and are encouraged by some tools (e.g., by the default delimiter of "cut" and "paste").

Some terminals and terminal emulators accept control characters (range 1-31), e.g., via the escape character. Simply *displaying* filenames with bytecodes in this range can cause problems on such systems. (Granted, there is no requirement that terminals must only accept control characters in the range 1-31, but if 1-31 could not be in filenames, that would be a reasonable configuration to move to.)

In practice, the primary use of filenames with control characters appears to be to enable security vulnerabilities and to cause errors. Other than "we've always done it that way", there seems to be little justification for them.

By forbidding control characters, many programs that are currently erroneous (e.g., because they process filenames one line at a time) become correct. The number of applications this change would *fix* is far larger than the vanishingly small number of programs that non-portably *depend* on control characters being permitted in filenames. Any program that depends on control characters in filenames is *already* not portable, since control characters are not in the "portable character set" of filenames (Page 77, line 2194, XBD 3.276 Portable Filename Character Set). What's more, NTFS (and probably other filesystems) already forbid bytes 1-31 in filenames, so this is already an extant limitation. But merely making it "not portable" is not enough; it needs to be *forbidden* before correct programs can count on it. This proposal suggests a new error, [ENAME], though other options are possible.

I'm well aware that this may be a controversial proposal. But I think most readers will understand *why* I am proposing this, and that while this is a dramatic approach, we are now in an era where POSIX systems are routinely used to manage important information worldwide. The ability to include control characters in filenames has rarely been a help, and instead has been a hindrance to the use of these systems. We've had ample time to see that this is a problem. It's time to jettison this misfeature. |
||||||
Desired Action |
Vol. 1, Page 60, lines 1782-1784: Change:
"The characters composing the name may be selected from the set of all character values excluding the <slash> character and the null byte."
to:
"The characters composing the name may be selected from the set of all character values excluding the <slash> character, the null byte, and character values 1 through 31 inclusive. Attempts to create filenames containing bytes 1 through 31 must be rejected, and conforming implementations must not return such filenames to applications or users."

Vol. 1, Page 77, append after 2199:
The set of character values 1 through 31, inclusive, are expressly forbidden. Attempts to create such filenames must be rejected, and conforming implementations must not return such filenames to applications or users.

Vol. 2, page 480:
[ENAME] The path name is not a permitted name, e.g., it contains bytes 1 through 31 (inclusive).

Vol. 2, page 1382, in "open()", after line 45297, state:
"If open() or openat() is passed a path containing byte 1 through 31 (inclusive), it must reply with error ENAME instead of opening the file."

Vol. 2, page 1382, in open(), before line 45319, state:
[ENAME] The path name is not a permitted name, e.g., it contains bytes 1 through 31 (inclusive).

Vol. 2, page 1744, lines 55680-55682: Append:
"The readdir( ) function shall not return directory entries of names containing bytes 1 through 31, inclusive."

There would need to be other changes in the interfaces, but there's no point in identifying these other changes if this proposal will be flatly rejected. Since this involves filenames, which occur all over in the spec, it may be possible to have this change described in a few places instead of many. I've identified open() and readdir() here, because many other interfaces are built from these two. |
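As an illustration only, the rule in the Desired Action can be approximated from the shell. This sketch is not part of the proposal: the function name `name_is_permitted` is hypothetical, and it assumes the name arrives as a single argument (so it can never contain a null byte) and does not check for <slash>. It deletes bytes 1 through 31 with tr and reports whether anything was removed.

```shell
# Hypothetical helper (not from the bug report): succeed only if the
# filename component "$1" contains none of the bytes 1 through 31.
name_is_permitted() {
    # Delete bytes 001-037 (octal); LC_ALL=C makes tr operate on bytes.
    stripped=$(printf '%s' "$1" | LC_ALL=C tr -d '\001-\037')
    # If nothing was deleted, the name held no control bytes.
    # (A trailing newline in "$1" is also caught: it is removed from
    # the stripped copy, so the two strings differ.)
    [ "$stripped" = "$1" ]
}
```

A caller could use it as `name_is_permitted "$f" || echo "rejected"`; an implementation following the proposal would instead fail the underlying call with the suggested [ENAME] error.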
||||||
Tags | applied_after_i8d3, issue8 | ||||||
Notes | |
(0000412) dwheeler (reporter) 2010-05-03 19:37 |
Note that if values 1..31 were forbidden in filenames, widely-used constructs like these would become correct:

    # Only correct if tab and newline cannot be in filenames:
    IFS=`printf '\n\t'` # Remove 'space', so filenames with spaces work well.
    for file in `find . -type f` ; do
      COMMAND "$file"
    done

More rationale for this proposal can be found here:
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
http://www.dwheeler.com/essays/filenames-in-shell.html |
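For comparison, here is a hedged sketch of how the same per-file loop can be written so that it is correct with any filename today, without relying on the proposed restriction. It lets find hand each pathname to the command as one intact argument through `-exec ... {} \;`. The name `for_each_file` is hypothetical; the command to run is whatever follows the directory argument.

```shell
# Hypothetical helper: run a command once per regular file under "$1".
# The remaining arguments are the command and its leading arguments.
for_each_file() {
    dir=$1
    shift
    # find substitutes each pathname for {} as a single argument, so
    # spaces, tabs and newlines in names are never split on.
    find "$dir" -type f -exec "$@" {} \;
}
```

With this helper, `for_each_file . COMMAND` plays the role of the loop above. The trade-off is one process per file and no shell variables shared with the loop body, which is part of why the shorter IFS-based form remains popular.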
(0000689) msbrown (manager) 2011-03-10 16:04 |
Another data point: The encodings 0x00-0x1f are "control characters" in the "standard" z/OS UNIX System Services code page IBM-1047. Note that "line feed" is 0x25 in EBCDIC/IBM-1047, but the C language '\n' is 0x15 (EBCDIC "new line"). |
(0000739) eblake (manager) 2011-04-11 20:28 |
I have retitled the bug; ongoing discussion is still debating the best solution, but any solution is more likely to gain consensus if it only forbids newlines (which is the main source of ambiguous output on any utility that produces line-oriented listings of file names) than to forbid all control characters. |
(0000740) dwheeler (reporter) 2011-04-11 21:17 |
Forbidding just newline would be better than the current state. So if that more limited goal is truly necessary to make progress, then sure!

That said, perhaps there's an alternative in the middle. As noted above, there are two other control characters that cause heartburn: tab (widely used in tab-delimited fields and as IFS whitespace) and ESC (widely used for terminal escapes). If *all* control characters can't be forbidden, can it be at least those three? That would simplify processing (tab-delimited fields work, and IFS of \n\t works nicely), and display of filenames is easier too (no ESC).

What are the critically-important use cases where control characters in filenames are REQUIRED for applications to work properly? Especially since several filesystems (such as NTFS) essentially forbid them already? I haven't seen a lot of critically-important use cases documented, and control characters have *never* been in the portable character set for filenames. |
(0000887) user27 2011-07-07 15:31 |
This proposal is fine for applications that depend on the POSIX portable character set, but leaves others in the lurch. It would be preferable to restrict file names to printable characters, space, and NBSP. |
(0001140) oiaohm (reporter) 2012-02-22 06:48 |
Let's look at this differently: why do we have to forbid anything?

Programs feeding data to the shell cannot use \0 (null) as a line break. This is a double-edged sword. We know that stdin and stdout are still open after getting a \0 char.

    IFS=`printf '\0'`
    for file in `find . -type f` ; do
      COMMAND "$file"
    done

The command should have no issue taking a null-terminated string. Why? Because of int main (int argc, char *argv[]): that is exactly what a C argument is. IFS becomes a problem because applications don't support accepting \0. In this case they are depending on string length given by null-terminated strings instead of on when the file handle feeding them information ends.

The reason is that we don't have two of them: one IFS for applications to feed information to the shell, and an IFS for the screen. IFS=\0 is good for applications feeding into other applications, but not that great for feeding to the screen. \0=\n on screen would be ideal in lots of cases. Also, on screen \n (newline) could be displayed as \n, and so on for the other control chars.

What happens when we have a UTF8 char that contains 1-31 as one of its bytes and it gets damaged? Basically, to support UTF8 in case of malfunction this is not possible, because we cannot say that a file name displaying 31 might not turn up due to file system damage. It's bad enough with \0.

The issue is the shell's handling of input and output. The "-" issue exists as well; it is also a shell issue, but it's harder to solve. Applications don't have a clear marking in int main (int argc, char *argv[]) of what is a directive to do something and what is data to be used. Flags on argv contents would be handy in this regard. The simplest method I can think of is an environment variable containing it that is cleared automatically by the next exec command.

All chars bar the <slash> character and the null byte should be able to be handled in file names, because we have two chars that cannot be in files.

${VAR} exists; $(type){VAR} could also be very handy, i.e., with $(safe){VAR}, control chars and so on become their \ equivalents.

Basically there are tons of options without restricting filenames. Big one is improve shells. Add better options to shells. Another is to work out how to type argv inputs, so applications can know the difference when rm -i * just happens to find a file called -rf and deletes everything without asking.

We have to remember that this wildcard char problem does not just apply to filenames. Bash can read data from keyboard input, so your problem here does not end with file names.

    read in
    COMMAND $in

This might just happen to send a control char into a command. I might type anything, or it might be the raw contents of a file.

Fixing filenames does not fix Posix Shell bad handling of chars; this makes POSIX shells a case of "here be dragons". |
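The `rm -i *` hazard raised in this note does not involve control characters at all, and the usual mitigation is small enough to sketch. The example below is an illustration under assumptions, not something proposed in the thread: `count_optionlike` is a hypothetical helper that counts how many of its arguments begin with "-" and could therefore be read as options by a utility.

```shell
# Hypothetical helper: print how many arguments start with "-",
# i.e. how many a utility could mistake for options.
count_optionlike() {
    n=0
    for arg in "$@"; do
        case $arg in
            -*) n=$((n + 1)) ;;
        esac
    done
    printf '%s\n' "$n"
}
```

In a directory holding a file named -rf, `count_optionlike *` reports that file, while `count_optionlike ./*` reports none, because every expanded name then starts with "./". Writing `rm -i ./*` (or `rm -i -- *`) relies on the same idea.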
(0001141) dwheeler (reporter) 2012-02-22 19:25 |
Comment #1140 says: "Fixing filenames does not fix Posix Shell bad handling of chars"

This is NOT just a problem with POSIX shell. The problem is systemic. Almost all programs that deal with filenames have trouble, regardless of language, because POSIX permits filenames to include control chars. For example, it's common to store filenames in text files. The "obvious" way is by using newline for each row (tab-separated if there is more than one field). Of course, this fails. Even tools like "pax" are fundamentally broken - if they list filenames, you can't tell if the newline is part of a filename or not. Yes, you can fix this using escape sequences, or by using \0 separators, but most people don't do this because people generally have no *use* for control chars (e.g., newline, tab, ESC) in filenames.

As I said earlier, the primary use of filenames with control characters is "to enable security vulnerabilities and to cause errors". Other than "we've always done it that way", there seems to be little justification for them. Filenames already can't include "/" and \0, so their charset is ALREADY limited.

The comment asks, "What happens when we have a UTF8 char that contains 1-31 as one of its bytes and it gets damaged?" I think that's easily answered:
* If an application is trying to *create* a filename with 1-31, then a compliant POSIX system should reject the request. POSIX already *permits* this; such chars have never been in the portable character set.
* If the filename is damaged internally, then there are many options. For example, separate repair tools could be used to fix it (we already have to do this if a filename contains '/'). A system could provide them, escaped. Less good would be a system that permitted people to read and rename them, just not create them. POSIX could allow a range of options, or require one in particular.

The comment said: "Basically there are tons of options without restricting filenames. Big one is improve shells. Add better options to shells."

Actually, I agree that we should improve some POSIX utilities. I think it's important to also add \0 support to a few key utilities (find, xargs, read). But adding \0 capability to *everything*, so that full filenames can be handled, is absurd; there's a huge list of tools that process text files. It's better to ease transition (by improving a few selected tools) and create a better place to transition to. |
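The \0 support mentioned for find and xargs can be sketched as follows. This assumes an implementation where `find -print0` and `xargs -0` are available; they were extensions in GNU and BSD tools at the time of this note, not requirements of POSIX.1-2008. The helper name `each_file_nul` is hypothetical.

```shell
# Hypothetical helper: run a command once per regular file under "$1",
# passing names NUL-terminated so newlines inside names are harmless.
# Assumes find -print0 and xargs -0 exist on this system.
each_file_nul() {
    dir=$1
    shift
    # -print0 ends each pathname with a NUL byte; xargs -0 splits only
    # on NUL, and -n 1 runs the command with one pathname at a time.
    find "$dir" -type f -print0 | xargs -0 -n 1 "$@"
}
```

This keeps the pipeline shape that people already use, which is the transition path the note argues for, but every tool between find and xargs would also need to understand NUL-separated records.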
(0001142) wlerch (reporter) 2012-02-22 19:38 |
For the record, the only UTF-8 characters that contain the bytes 1-31 are the control characters 1-31 themselves. Any Unicode character outside of the ASCII range is encoded with bytes greater than 127. |
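This property can be spot-checked from the shell. The sketch below is an illustration, not a proof: `count_ascii_bytes` is a hypothetical helper, and the sample characters mentioned afterwards are chosen here only as examples. It deletes every byte in the range 128 through 255 and counts what is left; for a string made solely of non-ASCII UTF-8 characters nothing should remain.

```shell
# Hypothetical helper: print the number of bytes in "$1" that lie in
# the range 1-127, i.e. those left after deleting bytes 128-255.
count_ascii_bytes() {
    # LC_ALL=C makes tr treat the input as bytes, not characters.
    # The final tr strips the padding some wc implementations add.
    printf '%s' "$1" | LC_ALL=C tr -d '\200-\377' | wc -c | tr -d ' '
}
```

For instance, the UTF-8 encoding of U+00E9 is the two bytes 0xC3 0xA9 (`printf '\303\251'`), and the helper reports 0 for it; prefixing an ASCII letter makes it report 1.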
(0001143) oiaohm (reporter) 2012-02-23 00:55 |
wlerch, in case of bit errors it still turns up. Something in a UTF-8 string that should be above 127 ends up under it because one bit has flipped. Being able to see a file, no matter how damaged, is critical.

dwheeler, the big questions here are: why does the shell ever have to show the non-escaped form of a file name? Why do applications ever need to see the non-escaped form of filenames from directory listings or other lookups?

dwheeler: "Even tools like "pax" are fundamentally broken". You say it's fundamentally broken. This is the point: it is fundamentally broken and should be fixed, not hidden.

You should see the hell you cause to a Windows system when you write files containing \n into NTFS. The physical limit of NTFS is the same as ext2/3/4: everything bar slash and NUL can be written. It's the win32 subsystem that prevents you from using control chars. Yes, you can ruin Windows the same way. An attacker does not have to play by the rules; if the filesystem can store it, they can do it. POSIX is not the OS kernel, so it really cannot set what file names we actually get.

Would it not be simpler, and more complete in coverage, to demand that all file names must be in a particular escape sequence by default? This cures the issue. The filesystem can store whatever char it likes with no error. When we handle filenames they are always escaped.

The correct solution really is escaping. This can fix a lot of things like rm -i *, so that a file -rf is escaped, rm sees \-rf, and so does not malfunction on the user by thinking rm -i -rf <plus everything else * found> when the user does rm -i *. There are a lot of malfunctions like this.

A little example of the space error: ls `echo *` with a file "this that". Due to no escaping, ls comes back that it cannot find this and that. The reason is that the shell finds breaks in the command line by where spaces exist, so it thinks this and that are different arguments and passes them to ls that way, going against what the code would appear to do. This is a far more likely error for a person to hit than control chars in file names. It will be hidden until a script runs over a directory with a filename containing a space. With escaping, this\ that is returned by *, and as long as it doesn't get processed out by echo there are no problems. This is the reason why the % escaping of URIs (Uniform Resource Identifiers) might have to be considered: it's something normal POSIX tools are not going to mess with.

You request a directory list. libc serves it up pre-escaped. This just kills any chance of a control char, newline, or tab causing trouble, or of whatever char is going to cause your program trouble making it through. Even a space in filenames is no longer a problem by default. Why? You escape the space so that filenames are always one constant string with nothing evil in it. libc needs to get a standard escaping function or use a pre-existing one.

A person can pick any split char they like; it does not have to be \n, \t, or a control char. A person could pick - or space or any other stupid splitter like a comma. The important part is that it cannot be in the file name their application sees, but that filename must be able to exist on disc. This is where escaping fits in: hide from the program what it cannot handle seeing.

Simple 100% cure: a set-escaping function. A function that tells libc "I want my file names escaped on all calls by this particular method", with a default that the file names are escaped by something set in the standard. To see raw and take raw binary filenames, you have to call the set-escaping function to turn it into a no-op, because by default it is on. This is transparent escaping. Could it possibly break a few programs? Really no, if the libc requirements are right, so that open will take a non-escaped string, realize "oops, we have a problem, those chars should not be in there", and open the non-escaped version. Also, for old programs you could include an option to force the program to run with escaping off. \0 support is required to work without escaping, so making it a mandatory option on quite a few programs would be a good idea.

dwheeler, to be truthful, the errors you are talking about come from one simple fact: programmers are forgetting to do secure programming. So they don't escape inputs or filenames, and things go badly wrong. Also, because file names and program directives currently can look identical to programs, we get malfunctions.

dwheeler, you are looking too small. The problem of filenames is huge; you are only addressing a small corner case. Escaping can address almost all cases with one set of alterations to core parts. Currently non-escaped is the default, and you have to remember to escape stuff or things go badly wrong and security flaws appear. With escaping as the default, the programmer cannot make the mistake of forgetting to escape, and lots of issues disappear, including non-control-char ones like - starting filenames and filenames containing spaces. That is the reason escaping gets on top of the problem. Shells will have to be made to handle taking user-typed file names and performing the escape on them.

Really, what need is there for a portable charset for filenames on my own system if the tools can handle anything on my system? The portable character set should really only be for cross-OS use. If you need to use it to limit the user, you have a problem with the tools' security. |
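To make the escaping idea concrete, here is a rough sketch of URI-style percent-escaping done in the shell. It is an assumption-laden illustration, not the scheme from this note or from bug 545: the name `percent_escape` is hypothetical, and it escapes only bytes 1 through 31 and "%" itself, leaving spaces and a leading "-" alone (the note argues for escaping those too).

```shell
# Hypothetical helper: print "$1" with bytes 1-31 and "%" replaced by
# %xx (two lowercase hex digits); all other bytes pass through as-is.
percent_escape() {
    # od prints every byte as two hex digits; tr puts one per line.
    printf '%s' "$1" | LC_ALL=C od -An -v -tx1 | tr -s ' ' '\n' |
    while read -r hex; do
        [ -n "$hex" ] || continue
        case $hex in
            0[0-9a-f]|1[0-9a-f]|25)
                # Control byte or "%": emit the escaped form.
                printf '%%%s' "$hex" ;;
            *)
                # Any other byte: turn the hex back into the raw byte
                # through an octal escape in the printf format.
                printf "\\$(printf '%03o' "0x$hex")" ;;
        esac
    done
}
```

A name consisting of foo, a newline, and bar comes out as foo%0abar, which is safe to store one per line. Escaping "%" as %25 keeps the mapping reversible, which any real scheme of this kind would need.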
(0001147) oiaohm (reporter) 2012-02-25 02:25 |
dwheeler, I have opened this new one with my solution: http://austingroupbugs.net/view.php?id=545

I think my solution is the best. I am removing a limit from POSIX as well as solving a problem once and for all. Scarily enough, my change will allow filenames on disc containing the NULL byte and slash without upsetting anything long term. Applications using URIs will have to get used to the idea that they don't have to encode and decode any more, since POSIX does that bit. Applications working with raw names will have to get used to the idea that they have to encode, but they will still be usable to users, just with a little horrid typing of %stuff.

The big issue is that you say that if a file name contains a particular char, it is forbidden to be opened, created, or otherwise used. The only party that can see what is allowed to exist on disc is the party that creates the file system. From my point of view this one should be closed and 545 pushed forwards, since 545 is a more platform-neutral solution. Yes, there will be some overhead. Even better, 545 will most likely be doable at kernel level: alter the file-system drivers. |
(0001437) user229 2012-12-23 05:12 |
oiaohm, if a filename is "damaged" by bit errors, it might contain a slash or null byte. Is dealing with a damaged filesystem really within the scope of POSIX? Such filenames can be fixed (moved to a different name with offending bytes removed) by a system-specific "fsck" tool. And anyway, it's already legal for an implementation to forbid control characters in filenames; the proposed change is to make it mandatory. |
(0001438) oiaohm (reporter) 2013-01-07 23:29 |
random832, if you look at MS Windows and its forbidden chars, with the | in particular, damage due to bit errors is not the only issue. Software viruses have also exploited the fact that | would cause a filename to be hidden from user view and from anti-virus software. Newer versions of Windows are reducing the number of forbidden chars. This is why I particularly do not like the idea. On XFS and EXT3, along with other filesystems, it is valid to have filenames containing those chars. This proposal contains no migration plan to cope with what could possibly be out there. http://austingroupbugs.net/view.php?id=545

I have come to the point of view that the shell in the standard needs to be fixed. fopen .... I have not really found a POSIX ABI call that has an issue with these chars.

random832 the idea that filenames don't exist in the. Also to be O F. Is the fact that under Fat file systems 0,7-10,13 dec or 0,7-A,D hex are the only ones that are not printable on screen as single chars. http://www.jimprice.com/jim-asc.shtml This is what DOS ANSI allows. Yes, Microsoft's two primary file systems don't agree on what chars you can use in a file name.

One of the other resolutions to this problem might be to follow how DOS deals with it: assign 1-31 printable symbols and add a mode to printf for processing out filenames. random832, can you promise me that no future OS will do the DOS method again and assign printable chars to control chars, so allowing file names to contain them? |
(0001439) oiaohm (reporter) 2013-01-07 23:47 |
random832, the big question here is why dosfsck should fix a file system that is not technically broken. The idea that the fsck tools can fix it is basically saying "let's fix something that is not broken" instead of fixing the fault in POSIX.

If you want simple printable chars for the control codes, Unicode defines one: a simple squarish box with the number of the char inside. A lot of fonts fail to provide this; same with terminals. So printing control chars to the screen should be fairly issue-free as long as a print function exists in the API that disregards control chars and uses the replacement.

Basically we have a historic terminal design fault. clocal and other settings can be pushed to the terminal to disable particular control chars, like modem control chars, from working. So yes, another solution is a simple flag. |
(0001711) jrincayc (reporter) 2013-08-14 02:43 |
oiaohm, I agree with you that it would be possible to make it possible to have newlines work fine in filenames. However, from what I can see of your proposal it seems that your suggestions would require substantial changes. Can you summarize what you want changed, and what changes would be needed in kernel and userspace? Thank you. |
(0001712) dwheeler (reporter) 2013-08-14 03:23 |
An encoding system would make sense if there was a user need to include newlines (or tabs) in filenames. But there is no such need; the only significant use of control characters in filenames is to attack systems, and support for such characters in filenames has NEVER been required as part of the POSIX specification. It would be possible to encode such names... but to what end? It would be far simpler to prevent them. |
(0001713) dalias (reporter) 2013-08-14 05:41 |
The issue is that you have a huge base of existing systems where such malicious filenames may already exist in the filesystem, and you have implementations that deal with external media, networked filesystem, etc. which might present malicious names. Requiring an implementation to reject names containing newlines or control characters would make it impossible to access (or even delete) such files. I believe this is the motivation for encoding. |
(0001714) oiaohm (reporter) 2013-08-14 07:34 |
dwheeler, bad news.

--An encoding system would make sense if there was a user need to include newlines (or tabs) in filenames. But there is no such need;--

I do have a few links and files with newlines in them. It was a workaround on some X11 window manager desktops that did not word-wrap file names yet would support \n in filenames. So historically there has been a need for newline in filenames at different times.

jrincayc: "Can you summarize what you want changed, and what changes would be needed in kernel and userspace?"

The fact of the matter is that the Linux, BSD, and OS X kernels already work with these chars. A lot of programs don't have an issue with these chars. So it's all user-space changes, like making ls not be stupid. This "hello world" one exposes an issue:

    ln -s ~ Hello$'\n'world

Yes, this will create one. This basically works on all existing POSIX systems other than NT. Most graphical file managers on Linux handle this without issue; they already do substitution. LibreOffice and OpenOffice see a file like Hello$'\n'world and it becomes Helloworld: the \n is just not displayed, and you can access the document as if it's not there. KDE displays a filler char. The big thing is that there is no standard substitution defined.

Really, the simplest would be to make it mandatory, when printing special chars in file names, to print them as per Code page 437 (http://en.wikipedia.org/wiki/Code_page_437). 1-31 in Code page 437 are in fact printable chars.

--Interpretation of code points 1–31 and 127: Code points 1–31 and 127 (00–1Fhex and 7Fhex) may be interpreted as either control or graphic characters, depending on the context.--

It is the existence of Code page 437 that brings the real nightmare. What is Code page 437? MS-DOS, or FAT filesystems. dwheeler, basically you are forbidding valid printable chars. The missing item is really a way to print with control chars disabled.

Shell scripting does need to be altered as well, so that arrays are used to call applications, removing control-char hell.

If you pick up some old DOS discs, some people did use the sub-31 chars to make their file names look fancy. Basically it is too late to prevent them, dwheeler: filenames with sub-31 chars exist, some from DOS, some from Unix and Linux; DOS to look fancy, Unix and Linux to work around some application bugs. Forbidding also gives malicious programs somewhere to hide, given that the kernels support these chars. Windows forbids creating new files containing 1-31, but handles processing 1-31 and displays a filler if they happen to be on the file system.

dwheeler, basically the complete base of your idea is off. Whatever we change, we still have to be able to handle as much as Windows. You are directly putting forward not being compatible enough. |
(0001715) jrincayc (reporter) 2013-08-15 02:33 |
dalias and oiaohm,
Yes, existing filesystems with control characters are an issue. For local non-readonly filesystems, a reasonable possibility is that the kernel can just not return them in readdir and not allow them to be created with creat and rename. The next fsck that is done can rename all the files with invalid characters in them. What to do with invalid filenames in a networked filesystem is a more complicated issue.

oiaohm,
Besides Code page 437, another possibility is the control pictures starting at U+2400 (Symbol for Line Feed is U+240A). I tried out Gnome's gedit, and it could read and save files with newlines, but so far as I could tell it did not have a way to create a new file with newlines in the filename. I presume that most GUI applications are the same way. I noticed that ls on my Linux system actually lists foo\nbar as foo?bar.

dwheeler,
I agree that there do not seem to be significant uses of control codes in filenames. I think the security considerations would be more than sufficient as motivation to forbid them. |
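The foo?bar display mentioned in this note is easy to imitate, which shows how small a "substitute on output" rule can be. This is only a sketch: `display_name` is a hypothetical helper, and it covers bytes 1 through 31 only, not DEL or invalid multibyte sequences.

```shell
# Hypothetical helper: print "$1" on one line with every byte in the
# range 1-31 shown as "?", so a listing stays one name per line.
display_name() {
    # '[?*]' tells tr to repeat "?" for the whole first range.
    printf '%s' "$1" | LC_ALL=C tr '\001-\037' '[?*]'
    # Add the line terminator only after translating, so it is not
    # itself turned into "?".
    printf '\n'
}
```

The substitution is lossy (two different names can display identically), so it suits terminal output rather than anything another program will parse.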
(0001716) dalias (reporter) 2013-08-15 02:48 |
I agree that security considerations are the most important factor here, but I think it's a real possibility that more security problems will arise out of attempts to encode/escape invalid filenames than presently exist. Basically the only software vulnerable to filenames containing newlines is sloppily-written shell scripts, and it's fairly well-known that you should not run shell scripts on untrusted directory trees unless you're confident they're robust. The other main point I'm concerned about is fragmenting the standard. While I don't particularly agree with their attitude on matters like this, I think you'll find the Linux kernel maintainers 100% unwilling to "impose policy" at the kernel level. Thus you'll either have Linux-based systems intentionally non-conforming, or you'll have userspace hacks attempting to enforce the POSIX rules, which creates an even worse security situation since malicious users can just bypass the userspace code and make syscalls directly to produce malicious filenames. |
(0001717) jrincayc (reporter) 2013-08-15 03:31 |
I think that encoding bad names is a problem. I think only three programs should ever have to deal with the bad names: the kernel, fsck, and another program called rename_invalid, which goes through a directory tree and renames every file that has an invalid filename. Every other program on the system should probably not see the invalid filenames, encoded or otherwise. For the kernel and fsck, they already read the filesystem at the disk block level, so they already can see what is actually in the directory even if readdir doesn't return the names. I am not sure how to make it so that programs like a hypothetical rename_invalid would get access to the complete directory, and ordinary programs would not. |
(0001721) jrincayc (reporter) 2013-08-16 02:43 |
I can think of two different approaches for letting programs that are designed for weird filenames handle them. The first would be at the kernel level and would add a flag to open or fcntl that switches readdir to provide all the filenames instead of just the filenames with no forbidden characters. The second would be to modify the default meaning of various shell commands and features (for example, * would not expand to include filenames with forbidden characters by default, similar to how it currently does not expand to dot files). So shell programs that are designed to handle newlines and other weirdness would need to opt in, instead of opt out.

Lastly, from what I can tell from RFC 1813 for NFS 3, a / is a perfectly valid filename character; it is the server's choice. See 4.6: """Many servers will also forbid the use of names that contain certain characters, such as the path component separator used by the server operating system. For example, the UFS file system will reject a name which contains "/", while "." and ".." are distinguished in UFS, and may not be specified as the name when creating a file system object.""" I am curious how current UNIX systems would handle an NFS server returning filenames with "/"'s in them. |
(0001722) oiaohm (reporter) 2013-08-16 05:03 |
jrincayc, I do have a horrible source of control chars. One of my old mp3 players is full Code page 437, with control chars being the alternative chars, and you can rename files to them. There are a few cameras and other things out there that do the same.

jrincayc: "I tried out Gnome's gedit, and it could read and save files with newlines, but so far as I could tell it did not have a way to create a new file with newlines in the filename. I presume that most GUI applications are the same way."

Yes, it's fine to create a rule that we cannot create them simply. Open up charmap: the only two under 31 that you cannot insert into a file name graphically are Null and Newline. Yes, copy and paste from a charmap tool or a document somewhere into a rename, and horrible things can happen. So currently, yes, you are free to create new files containing everything bar /, Null, and newline using graphical programs. Yes, some people may already have created these files by mistake. I have also had control chars appear in filenames that had none, due to bad-sector effects on filesystems.

On most POSIX systems, most programs don't call the kernel readdir directly; they call the libc version. All the shells I know of don't go direct to the syscall. The kernel syscalls could be left untouched; if you are mad enough to be calling the kernel directly, you should be ready to cope with whatever it throws up.

jrincayc, the NFS solution for odd, strange, bad, evil chars is a "character translation file". You define something to substitute for the likes of / . .. being real file names. So you would remap to some odd Unicode char that you don't use. Yes, it is horrible, but it works around the problem.

jrincayc, I would have no issue with an opt-in limitation on shell scripts, other than the staring-you-in-the-face fact: how long before they say "hey, spaces in file names are giving us trouble, let's make that a forbidden char as well"? Something to become aware of very quickly is that most bash scripts with control-char issues also have issues with file names containing spaces.

jrincayc: "Yes, existing filesystems with control characters are an issue. For local non-readonly filesystems, a reasonable possibility is that the kernel can just not return them in readdir and not allow them to be created with creat and rename. The next fsck that is done can rename all the files with invalid characters in them."

This could be really bad. If someone ever does something like a firmware piece on a FAT partition on a device requiring a control char, and you go and nuke it, in one move you have just bricked a device. Do not presume device makers are always sane. So the answer has to be: cope. We can place a rule on create; we can place a warning in fsck that they have been created, to make sure it was the user's intention. We cannot place an alter rule, because we don't know what we will run into in future and whether any device will be mad enough to use control chars in file names as a key requirement.

Basically, programs must cope somehow with the existence of strange chars in file names. The two major options are to substitute them or to ignore their existence.

dwheeler, jrincayc and dalias, you are forgetting: if you take away the system calls, what stops a malicious program that has breached deep from directly interfacing with the block devices and doing exactly what it pleases? The answer is nothing. This is why I see no reason to mess with kernels. If the fix is not workable in userspace, it's not workable, full stop.

Remember, the source of the strange char in filenames might be a USB key that has come from a completely different OS outside your control, like a photocopier that can scan to USB, because its RAM has gone a bit sus. So should the user not see their files? This is why I am very sure that library functions should be provided to do opt-in, but the programs should opt in to be filtered. gedit and most graphical file managers on Linux, for example, don't need to be filtered; they already cope with strange chars. The programs that cope need to be left alone. Programs that cannot cope need to have solutions made. A shell that cannot handle strange chars also needs to abort if it happens to be passed a list of arguments containing them.

jrincayc, opt-in with shell scripts most likely has to be done at interpreter start. Basically shell scripts are horrible at handling input, and equally horrible at doing predictable and safe string transformations. Yes, C defined valid program arguments as any char bar Null, so it's not just filesystems that can send shell programs off the deep end. The POSIX shell was not designed to cope. There is a lack of protection around shell scripts from malicious issues. It all comes back to shell scripts being horrible at handling strings. In a lot of ways it might be simpler to plan the end of life of the POSIX shell. |
(0001723) a brouwer (reporter) 2013-08-16 10:46 |
jrincayc wonders: > I am curious how current UNIX systems would handle a NFS server returning filenames with "/"'s in them. Kernel code might test, but it is not absolutely necessary. The / plays a role in path resolution. If you have a CD-ROM with a filename like "a/b" with embedded slash then the corresponding file cannot be opened for reading because path resolution will interpret the string "a/b" not as a filename but as a pathname. jrincayc considers: > add a flag to open or fcntl This mess was started because it was thought that security could be improved. Now the simpler a system, the better for security. Adding complications never makes a system more secure. Programmers that have to write secure setuid software have to consider all possibilities. If you introduce new failure modes for programs (-EBADNAME) then that is yet another detail. No doubt the first effect would be that security was broken because some software was written without considering the possibility that a program might fail for this reason. |
(0001724) dwheeler (reporter) 2013-08-17 15:03 |
I agree that being *simple* is *critically* *important*. The fundamental problem with approaches like filename encoding, open flags, and fcntl, is that they create new complications. Most developers will ignore them, and this problem will continue to be a source of vulnerabilities. There is nothing simple about the current situation. It's difficult to handle filenames with control chars correctly, and the "standard" approaches for filename handling are actually security vulnerabilities. E.G., everyone "pipes a list of filenames to {program}" - yet this is actually wrong, since filenames can include newlines. This is NOT just a shell problem. Newlines have *never* been in the *portable* character set, so *portable* programs (including security-related ones) must already assume that (for example) creating a file might fail if it has a newline in the name. So adding a limitation changes nothing if you've written your program portably. And while some badly-written program might break, there are probably thousands of other incorrect programs that would suddenly work correctly. A fair trade. If we need to back off this proposal further to gain consensus, we could simply forbid *creating* filenames with newlines (and TAB and ESC ideally). That would be adequate for many servers/clouds/etc. where untrusted users cannot directly manipulate a filesystem that would later be mounted. If we have to back off even further, an alternative could be an "implementations SHOULD forbid creating filenames with newline/tab/esc" instead of "MUST". But if it's not mandatory, that creates its own problems... it means that portable applications must be able to handle implementations that fail to prevent the problem. In that case, I think the POSIX spec *REALLY* needs enhancements to handle such filenames (e.g., byte-0-termination options for find, xargs, and shell read), so that developers can have standard tools for dealing with these filenames. |
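As an illustration of the point above that piping a list of filenames, one per line, is wrong when filenames can include newlines, here is a minimal Python sketch. The function names and the sample filenames are made up for this example; it only compares the two list formats and is not taken from any proposal in this thread.

```python
def split_newline_list(data: bytes) -> list[bytes]:
    """Parse a list written as one filename per line."""
    return [name for name in data.split(b"\n") if name]


def split_nul_list(data: bytes) -> list[bytes]:
    """Parse a list in which every filename is followed by a NUL byte."""
    return [name for name in data.split(b"\0") if name]


# Two files; the second has a newline embedded in its name.
names = [b"report.txt", b"evil\n/etc/passwd"]

newline_stream = b"".join(n + b"\n" for n in names)
nul_stream = b"".join(n + b"\0" for n in names)

# The newline format splits the second name into two bogus names.
print(split_newline_list(newline_stream))
# The NUL format returns exactly the names that were written.
print(split_nul_list(nul_stream))
```

The reader of the newline-separated stream sees three names, one of which the writer never listed, whereas the NUL-terminated stream round-trips because NUL cannot occur in a filename.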
(0001725) dalias (reporter) 2013-08-17 16:13 |
There is nothing simple about imposing a policy that will be unpopular with implementors and result in real-world systems behaving in ways that deviate from the standard. That's even worse than the situation we have now. As it stands, implementations are free to reject any characters they like outside the portable character set. This allows a "hardened" implementation to reject newline and even to advertise this rejection as an additional hardening feature. I would be happy with adding text in appropriate places remarking that historical implementations generally treat filenames as abstract byte sequences (minus \0 and /) but that for the sake of preventing security issues and other unwanted behavior in applications which are processing information in unsafe ways, implementors may want to place restrictions on the characters that can be used in filenames, such as forbidding newlines. My view is that, in matters of security, the only time that imposing a new requirement on implementations actually makes security easier for portable applications is when all historical implementations already have the desired property, and you're just writing it down for the first time. If there are historical implementations that lack the property, and wrongly assuming they have it would result in a vulnerable application, then all security-conscious developers will be faced with the complexity of supporting both. One further issue I just realized with any sort of escaping is symbolic links: what do you do with link content? If the escaping form is filesystem-type-specific (for instance, only used on network-mounted devices) and not used on the local system that forbids all unwanted characters, you would run into situations where the proper translation of the symbolic link contents depends on where it points to. This is of course unacceptable since symbolic link contents are pure strings and aren't necessarily even being used to access files they point to... 
As for the topic of enhancing find, etc., the fundamental problem is that find and xargs don't speak the same language. xargs reads a sequence of shell-escaped words, whereas find outputs raw filenames. If find had an option to output properly escaped results, using it with xargs would be trivial, and using it with read would at least be possible (albeit a bit more work). This can actually be achieved already with the -exec option, and appropriate shell script passed to exec to do the escaping, but it's painful. |
(0001726) dwheeler (reporter) 2013-08-18 02:01 |
Clearly, some people believe that newlines should be forbidden in the spec, while others do not. It's not clear to me what the committee will decide. However... if newlines (and other nasty characters) are not forbidden by the standard, it should at least be easier to *portably* handle such filenames. Currently it's extremely difficult to write programs that handle them correctly... so nearly all programs incorrectly handle filenames. In addition, the fact that "find" and "xargs" aren't easily made compatible in the standard is an old problem; it's past time to fix that. Although not currently in POSIX, there's a widely-used solution: lists of filenames terminated by \0. Adding support for them for a few utilities (e.g., find, xargs, shell "read", and probably pax) would make it much easier to handle dangerous filenames safely. There's no need for *every* text-handling tool to handle \0-termination; adding this capability to just a few tools is enough (e.g., this would allow the easy creation of encoding tools to handle other cases). Proposals to do so for find, xargs, and read are here: http://austingroupbugs.net/view.php?id=243 http://austingroupbugs.net/view.php?id=244 http://austingroupbugs.net/view.php?id=245 It would probably be wise to also clearly define an error code when opening is rejected specifically because of the name. That way, hardened systems could be consistent on the error they return. |
(0001727) oiaohm (reporter) 2013-08-18 05:22 |
dwheeler there is a problem. Bad programming methods for file-names. Are you going to ban , and = as well. There is a long list of chars in file names that can cause issue. "everyone "pipes a list of filenames to {program}" - yet this is actually wrong, since filenames can include newlines. This is NOT just a shell problem." Can this be handled by masking dwheeler the answer is yes. Remember dos linux and apple have different end of line. Reality is that pipe of filenames to a program is mostly a shell problem. Why graphical programs have a habit of not taking piped. Take example VLC. Each file name is 1 argument after the program and if you need to add more it uses IPC that can send null terminated strings. DOS—uses two characters at the end of the line: <cr><lf>, in that order. Macintosh—uses one character at the end of the line: <cr> Unix—uses one character at the end of the line: <lf> So a old Macintosh with lf in its file-name will pipe perfectly. Can the NFS solution of a mask out be used. So a file on a NFS share contain lf or cr may not be a problem is NFS is masking that to another. That is the thing dwheeler 243 244 and 245 should be implemented or equal. The reality is dangerous file-names exist and we have to handle them. The other sad fact to remember with pipe using \n is remember the http://xkcd.com/327/ [^] example. Just because you forbid it from the file-system does not mean a human or machine error still will not find away to enter it. dwheeler yes I agree better handling is required. dwheeler this is my problem there is a clear divide between the terminal and the graphic when it comes to this problem. Majority of graphical applications don't have issues. They are not using pipe. They are using better IPC. This issue has a lot in common with SQL injection. In a lot of ways pipe should be deprecated. pipe never included the idea of data blocks. dwheeler is not too hard to write a black list of features not to use to be safe. 
Pipe is one of those features in administration tools that should be blacklisted. Processing stdin, stderr and standard out to find out what is going on also programs should not be doing full stop. Parsing is insanely hard. In fact is simpler to use something like dbus for security than pipe. The passed data is formatted. Less and less graphical programs are using the command line or pipe or any of the trouble causing bits. Instead they are using libraries and dbus to talk to other programs using better IPC. dwheeler this is why I have such a big problem. Like \0 filename chars are possible to be handled by dbus and other methods without issues. dwheeler this becomes my serous question thinking its too hare todo safe parsing of stdin, stdout, stderr and pipe. Should we not consider discontinuing their default existence. Remember syslog that some linux distributions are doing away with has the same issue. Tainted input ie application send a error to syslog with a \n in it this would appear as 2 syslog entries possible with a fake program name. Something like journal and other syslog replacements would see this as a string send by the program as one block. Miss handling of \n is not only happening with filenames. Same with all those other control chars. Replacement be something far more formal. Something that has define blocks. dwheeler I know my idea is most likely more shocking. The story is on the wall. PHP scripts are using less exec and pipe using more library options. There need to be a clear split between data application has received that might be tainted line breaks and the applications own intended ends of transmissions. dwheeler is the problem not that you cannot tell what is what. A messeging protocol sending sized blocks of data as filenames really would not have to care what chars were in the filename. This could appear as a formal null terminated array at the other end. posix was never designed to be safe. 
Text documents are really not a good way of listing filenames either. There are many ways of packing nul-terminated strings for file storage. Yes, there is a historic issue: most text editors believe nul is end of file, so they included no simple way to insert a nul. dwheeler, can you see it now? It is not really forbidding chars that is required. Improve the tools so nul can be used in user space as the delimiter it is in kernel space. Basically, why does the shell process by end-of-line char in the first place when the input is not user input? This historic, huge mistake keeps coming back and biting us. |
(0001728) oiaohm (reporter) 2013-08-18 06:10 |
Think about this. If you could create a shell script that was based around nul as the separator a lot problems would disappear. single nul replaces space between items double nul replaces end of line. This cures space in filename and end of line in filename. Most of those other char issues are cured as well. Basically not tab or coma separated but nul separated. pipe can be cleaned up by null separator being default. If you don't want to encode the other option is stop using the chars in the parsing of shell scripts that can be in file-names. This also has the knock on. Changing the IFS was included in posix shell requirements. So changing spaces to nul should not be a problem if programs are well behaved. Less of a change to make secure shell scripts not use chars that can exist on file systems. Nul is a universal not to exist in any file-system file name or directory name in existence. The problem remains a lot of existing programs were built wrong to be portable. They presume the end of line is \n when in fact this should have been looked up from a environmental var or equal. Like making a file for msdos should be <cr><lf> end of line and IFS changed to nul nul and nul repectively can fix so many problems. It also will expose lots of programs that are not portability coded. Like programs using execv to call command line will not be effected by change. But programs using "ls -l" string execution would bust. In fact busting them would be a good thing. Those are some of the worst exploits. "ls -l <userstring>" dwheeler I bet you have also forgot that in bash ; is also end of line. Usestring "joke;rm -rf ~" is not going to be particularly nice. The other option nul filling. That is also quite simple. nul<space> nul<newline> nul; Those chars have to have a lead in nul before them for the shell parsing them. Same char without a nul is not to be parsed. Nul could be quite useful for harden script against user input attacks as well as filename issues. 
This makes taint-protecting input simple. Delete all nuls and taint protection is basically done. I am not deleting the usage of chars. Alter the shell to harden it. Let's see if we can use what is already forbidden to fix the issues. Nul is universal. It's one of the few items that is pretty much forbidden everywhere. Yes, it would mean changing editors to support programmers controlling nul locations. |
(0001730) jrincayc (reporter) 2013-08-18 13:12 |
oiaohm, for a "character translation file", which as I understand it is basically a translation from one character to another (ie. '\n' -> '_','?' -> '_' and so forth), what happens if the translated filename already is in the directory? For example, what happens if we have a directory with $'hi\nthere' and hi_there and the translation is '\n' -> '_'? Then we translate into a directory with two files named hi_there and hi_there, which is a new problem. |
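The collision described above is easy to reproduce. The following Python sketch assumes a translation table of the form given in the note ('\n' -> '_', '?' -> '_'); the function name `translate_name` and the way collisions are detected are illustrative only and do not describe how any NFS implementation actually behaves.

```python
def translate_name(name: str, table: dict[str, str]) -> str:
    """Apply a character-for-character translation to one filename."""
    return name.translate(str.maketrans(table))


# Translation table from the note: newline and '?' both become '_'.
table = {"\n": "_", "?": "_"}

# A directory that holds both a name with a newline and its translated form.
directory = ["hi\nthere", "hi_there"]

translated = [translate_name(name, table) for name in directory]

# Names that more than one original file maps to after translation.
collisions = {name for name in translated if translated.count(name) > 1}

print(translated)
print(collisions)
```

Both entries end up as hi_there, so a program looking at the translated view can no longer tell the two files apart; any translation that maps several characters onto one has this property.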
(0001731) dwheeler (reporter) 2013-08-18 19:09 |
Comment: "Are you going to ban , and = as well. There is a long list of chars in file names that can cause issue." Reply: No, I did not propose that. Control chars (esp. newlines and tabs) cause FAR more problems than "," or "=". Let's focus on the problems at hand! Comment: "Reality is that pipe of filenames to a program is mostly a shell problem." Reply: No, many GUI programs also need to store and exchange filenames. And even if they use a "better" IPC, they often have to talk with other programs which use different IPC conventions. I do *NOT* think that trying to ban pipes, and simple storage, is practical... or even desirable. You propose "systems like dbus"... but many POSIX systems do not *have* dbus. Instead, we have many incompatible nonstandard systems. Having a simple *standard* way to safely store and exchange filename information is valuable. I'm fully aware that different systems have different newline conventions, but we are discussing POSIX - which specifically says newline starts a new line. And I'm well aware that ";" separates shell commands. But handling filenames with ";" and "=" is easy; filenames with newline are harder to *correctly* handle. It would be possible to interpret "NUL" as field separator and "NUL NUL" as row separator, but then it'd be impossible to distinguish between an empty field and a new row. Also, that is not a common convention. In contrast, simple NUL-termination *is* widely supported and used. Comment: "That is the thing dwheeler 243 244 and 245 should be implemented or equal. The reality is dangerous file-names exist and we have to handle them." Reply: Thanks!! Here we completely agree. Even if filenames-with-newlines were banned, and implementors all agreed, it would take a long time to get implementations to follow suit, and there would not be enough standards-based support for the transition. 
Thus, I propose that the POSIX community at *least* add basic support for \0 termination (e.g., 243, 244, and 245), and get started whether or not newlines (or more) are banned. |
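The objection above to using "NUL" as field separator and "NUL NUL" as row separator can be checked with a few lines of Python. The `encode` function below is only one possible reading of that scheme (separators placed between fields and between rows), written for this illustration; it is not part of any proposal.

```python
def encode(rows: list[list[bytes]]) -> bytes:
    """Join fields with a single NUL and rows with a double NUL."""
    return b"\0\0".join(b"\0".join(fields) for fields in rows)


# One row whose middle field is empty ...
one_row_with_empty_field = [[b"a", b"", b"b"]]
# ... and two rows with one field each.
two_rows = [[b"a"], [b"b"]]

# Both structures produce the same byte string, so a reader cannot
# recover which one the writer meant.
print(encode(one_row_with_empty_field))
print(encode(two_rows))
```

Under this reading both inputs encode to the same five bytes, which is the ambiguity between an empty field and a new row that the reply describes. Plain NUL termination of each name does not have this problem because it has only one separator.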
(0001732) geoffclare (manager) 2013-08-19 09:25 |
The notes recently added to this bug seem to be mostly rehashing discussions that occurred on the mailing list over two years ago. In particular, Note: 0001726 says "Clearly, some people believe that newlines should be forbidden in the spec, while others do not. It's not clear to me what the committee will decide." The core team made a decision on this over two years ago. There was a thread on the mailing list in April 2011 in which we started to flesh out the details. However, we came to the realisation that there was a lot of overlap with other changes being made in TC1 and decided to postpone finalisation of the changes. The basic idea of the direction the core team decided to take is: * Attempts to create files with names containing newlines fail. * In "normal use", readdir(), readdir_r(), and utilities that list filenames in a "one per line" format all report an error if they encounter a filename containing a newline. (This brings these files to the attention of the user.) * New features should be added that provide administrators and power users with a way to deal with filenames containing newlines (i.e. to find them and then either delete or rename the files). The mechanism chosen for this is a new file descriptor flag which readdir() can query and certain utilities (such as find) can turn on when told, via new options, to do so. |
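The direction summarised above (an error in "normal use", with an opt-in for tools that need to see the offending names) can be sketched in Python as follows. Everything named here is hypothetical: the note describes a new file descriptor flag that readdir() can query, not this function, and it does not say which error is reported. The sketch works on a plain list of names so that it does not depend on what the local filesystem permits.

```python
class NewlineInFilenameError(OSError):
    """Hypothetical error; the note does not name a real error code."""


def report_entries(names, include_newline_names=False):
    """Return directory entry names, mimicking the proposed default.

    names: an iterable of entry names, e.g. the result of os.listdir().
    include_newline_names: hypothetical stand-in for the proposed
    file descriptor flag that utilities such as find could turn on.
    """
    reported = []
    for name in names:
        if "\n" in name and not include_newline_names:
            # "Normal use": bring the file to the user's attention.
            raise NewlineInFilenameError(
                f"filename contains a newline: {name!r}"
            )
        reported.append(name)
    return reported


entries = ["notes.txt", "bad\nname", "todo.txt"]

try:
    report_entries(entries)
except NewlineInFilenameError as exc:
    print("normal use:", exc)

# An administrator's tool opts in and sees every name, so it can
# delete or rename the offending file.
print("opt-in:", report_entries(entries, include_newline_names=True))
```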
(0001733) a brouwer (reporter) 2013-08-19 10:48 |
geoff: rehash - yes; decision - ach It is ironic to see that what was intended to be a security enhancement has now been decided in a way that diminishes security on all POSIX systems. I hope Linux will not follow. (Did anyone ask?) > readdir() reports an error ... So it is proposed to audit all software in the world and fix the readdir use. Until now:

    while (1) {
        dp = readdir(d);
        if (dp == NULL && (errno == EAGAIN || errno == EINTR))
            continue;
        if (dp == NULL)
            break;
        /* .. do something with the entry dp .. */
    }

Now some code has to be added. Of course this doesn't happen, and a hacker gets an entirely new tool: filenames with newline. Earlier only useful to attack sloppy shell scripts. Now useful to attack all systems software. One can make files appear and disappear from the sight of readdir by putting a file with newline in its name earlier in the directory. Very interesting. |
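The concern in the note above, that an unmodified loop treats any NULL from readdir() as end of directory and therefore loses every entry after the one that triggers the error, can be simulated in Python. This is a model, not the real interface: `make_readdir`, the `"ERR"` marker, and the assumption that the scan can continue after an error are all inventions for this sketch.

```python
def make_readdir(entries):
    """Return a readdir-like callable over a fixed list of names.

    It models the proposed behaviour: a name containing a newline is
    reported as an error (None plus an error marker), not as a name.
    """
    it = iter(entries)

    def readdir():
        name = next(it, None)
        if name is None:
            return None, None      # genuine end of directory
        if "\n" in name:
            return None, "ERR"     # error for this entry (marker is made up)
        return name, None

    return readdir


def legacy_listing(entries):
    """Loop in the style quoted above: any None result ends the scan."""
    readdir = make_readdir(entries)
    seen = []
    while True:
        name, _err = readdir()
        if name is None:
            break
        seen.append(name)
    return seen


def careful_listing(entries):
    """Loop that tells an error apart from end of directory."""
    readdir = make_readdir(entries)
    seen, skipped = [], 0
    while True:
        name, err = readdir()
        if name is None and err is None:
            break
        if err is not None:
            skipped += 1           # count the entry and keep scanning
            continue
        seen.append(name)
    return seen, skipped


entries = ["a.txt", "x\ny", "secret.key"]
print(legacy_listing(entries))
print(careful_listing(entries))
```

In this model the legacy loop never reports secret.key, which is the "files appear and disappear" effect described in the note; only a loop that was rewritten to inspect the error sees the later entries.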
(0001734) jrincayc (reporter) 2013-08-19 12:34 |
a brouwer, For what it is worth, OpenBSD, NetBSD and Linux all currently allow filenames with newlines to be created by unprivileged users. (Probably other POSIX systems as well, but those are the ones I tried.) As I understand it, only users with the privilege to directly edit the directory blocks could create filenames with newlines once "Attempts to create files with names containing newlines fail" so only an attacker that already has elevated privileges could use this to attack software. (If you can directly edit the blocks in a directory, you probably can directly edit the blocks in a suid program.) NFS mounted filesystems might be an exception however. |
(0001735) oiaohm (reporter) 2013-08-19 14:55 |
"Comment: "Reality is that pipe of filenames to a program is mostly a shell problem." Reply: No, many GUI programs also need to store and exchange of filenames." Reality its a shell problem GUI programs don't have the problem. Freedesktop.org specs define all transfers of filenames will be url encoded unless in a nul terminated string. Graphical applications are not using \n terminated with raw. You want \n terminated its url encoded with graphical. So graphical programs on Posix systems don't store or exchange anything else if they are sticking to the common spec. So don't have the problem. url encoded in fact does not have a problem with filenames containing strange chars. This is where it gets problematic. Shell and Graphical are in dispute. Graphical has solved this problem years ago with a simple choice to encode all filenames in a uniform way for transfers between programs. Calling graphical tools with find and other posix tools should not be problem but the reality is url encoding is not an option. You find the upper graphical toolkits like GTK and QT doing URL Encoding wrapping. Its good quality URL Encoding wrapping. geoffclare the big problem I have is why on posix systems is the graphical and shell handling so different. Make shell follow graphical and the problem disappears. Yes following graphical model find and other items would only return URL encoded unless in special mode where \n is not used and nul is used to terminate. geoffclare as jrincayc stated no kernel has taken the 2011 alteration on board. Since no adoption its not that successful. Reality the graphical programs don't have a problem they can record file names with \n in them they can access files with \n in them. Exactly why can we not have readdir open and so on just support url encoding out box. Control chars are simple a issue to those not using graphical libraries. geoffclare did you guys bother looking at what the graphical side could cope with. 
Graphical applications don't have the issue because of many reasons. Nul terminated strings used when reading from the libc. Url encoded used internally. Graphical life could be made simpler if url encoding could be support by normal c functions. I have mentioned url encoding before and I was basically told don't be stupid. Command line supporting url encode as option would be coming into line with graphical also cures problem. jrincayc remember if you system is highly infected the malware will be using any method it can find. One of the current ones is monitoring accessing of directories and disconnecting an reconnecting files from the directory. This is not exactly the most dependable. Really the \n terminating to split filenames has always been wrong. Fixing \n does not fix the other problem chars. Hardening of shell scripts also need to be considered. To be conforming with graphical applications its url encoded or nul terminated no other form of file-name can exist. Why both of these options are fully cross platform. Since graphical does not have a problem is highly unlikely that banning moves are going to be accepted by the kernels. All alterations to address this problem have to be restricted to user-space libc and up. So yes unprivileged users. If any idea has the requirement for privilage you can fairly much write it off being accepted. List of applications that can support these strange filenames is long. geoffclare --* New features should be added that provide administrators and power users with a way to deal with filenames containing newlines (i.e. to find them and then either delete or rename the files). The mechanism chosen for this is a new file descriptor flag which readdir() can query and certain utilities (such as find) can turn on when told, via new options, to do so.-- What about on read only media. Yes no matter what you think that case does exist. We are past the point of being able to say hey we forbid. 
We cannot have one section of the world have no issues and another section having issues then saying we will limit the other one. Reality this is is fully Shell related. Graphical has it fixed. Now the choice is simple. Do we follow the graphical or do we do something different. Sanity for simpler coding would say follow the Graphical. |
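The approach argued for above, percent-encoding ("url encoding") each filename before placing it in a newline-separated list, can be sketched with Python's standard urllib.parse module. The helper names `encode_list` and `decode_list` are made up for this example, and the sketch makes no claim about what the Freedesktop.org specifications or the GTK and Qt toolkits do in detail.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def encode_list(names: list[bytes]) -> bytes:
    """Write one percent-encoded filename per line."""
    lines = (quote_from_bytes(name, safe="/") + "\n" for name in names)
    return "".join(lines).encode("ascii")


def decode_list(data: bytes) -> list[bytes]:
    """Read the list back, restoring the original bytes of each name."""
    lines = data.decode("ascii").split("\n")
    return [unquote_to_bytes(line) for line in lines if line]


names = [b"plain.txt", b"evil\nname", b"100%.txt", b"\xff\xfe"]

stream = encode_list(names)
print(stream)
print(decode_list(stream) == names)
```

Because the newline inside a name becomes %0A and a literal percent sign becomes %25, the only raw newlines left in the stream are the separators, so line-oriented tools can split it safely. The cost is that both ends must agree that the list is encoded.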
(0001736) oiaohm (reporter) 2013-08-19 22:40 |
geoffclare, the read-only media requirement is for forensics after a security breach. So welcome to the devil in the details. This is why renaming, or running at higher privilege to access what some people call unsafe filenames, is not really an option. Normal userspace has to be able to handle them. A special mode in normal userspace could be tolerable. Forensics work is something you want to do with the minimum privilege possible. There is a set of needs with very limited ways of addressing them. URL encoding can in fact handle filenames with nul and / or \. Yes, this is the problem: the graphical design can handle every char in existence, yet the shell design is an attempt to say "hey, we cannot". |
(0001737) user229 2013-08-19 22:57 |
oiaohm, the thing is - if url encoding is forced at the open/readdir/etc level, then the files don't _really_, from the point of view of any program written to the existing API, contain control characters, they contain percent sign and hex. And when your gui programs try to apply url-encoding, they'll end up presenting the filenames with %25xx because they don't know the kernel has already applied url-encoding. |
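The double-encoding effect described above can be shown with Python's urllib.parse. The variable names are illustrative: `kernel_view` stands for what a hypothetical encoding open/readdir layer would hand to programs, and `gui_view` for what a GUI program would produce by applying its own encoding on top.

```python
from urllib.parse import quote, unquote

raw_name = "hi\nthere"

# Hypothetical: the kernel or libc has already percent-encoded the name.
kernel_view = quote(raw_name)

# The GUI program does not know that and encodes again; the '%' that
# the first pass introduced is itself encoded as %25.
gui_view = quote(kernel_view)

print(kernel_view)
print(gui_view)

# One round of decoding only gets back to the encoded form, not to
# the real filename.
print(unquote(gui_view) == kernel_view)
print(unquote(gui_view) == raw_name)
```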
(0001738) dalias (reporter) 2013-08-19 23:21 |
oiaohm, read-only media also includes CD-ROM, DVD-ROM, NFS, CIFS (SMB), etc. Definitely not just forensics. geoffclare, I don't believe the "fragmenting the standard" issue was sufficiently discussed before. Personally, I would love to have both newlines and EILSEQ banned from filenames, if such a requirement would be universally adopted. The problem is that some 90% or more of the deployed machines to which this standard is relevant (even though they lack certification) are based on Linux, and there is zero chance of getting Linus to accept anything that conflicts with the holy gospel of "filenames are abstract byte sequences" into the official Linux kernel. So we would be left with one of the following situations: 1. Completely ignoring the POSIX requirements, and reducing the relevance of POSIX, as it fails to reflect actual real-world systems. 2. Implementing the POSIX requirements as a Linux Security Module. This would allow systems to be configured in a conforming way (note that this option is already available as a conforming enhancement even without POSIX making any new requirements or allowances), but in practice they rarely would be, leading to application developers having to continue to support the possibility that filenames contain newlines or risk having vulnerabilities. 3. Implementing the POSIX requirements in userspace. This would just be a mess, as malicious users could make system calls directly to create filenames containing newlines, which would either be exposed to applications (if readdir failed to filter them) or which would be completely hidden from all users, including administrators, and could thereby prevent "rm -rf" from working. 
Anyway, my opposition to the proposed change is not based on any belief that filenames should be allowed to contain newlines, but on a (well-justified, I believe) concern that victory over newlines would be a pyrrhic victory, leaving us with a much worse security situation, and much more fragmentation, than what we have now... |
(0001739) shware_systems (reporter) 2013-08-19 23:48 |
@oiaohm "What about on read only media. Yes no matter what you think that case does exist." Not really, IMO... From the perspective of the standard and majority of implementations that object is no longer "read only media", it is "non-compliant hardware", as in "coaster" or "paperweight". @geoff Two things the current discussion has brought up for me... 1) Is the standard explicit enough that portable applications can expect the file system, and file names, to adhere to the limitations of the"C" locale, or are file systems supposed to be compatible with the global implementation-defined "" locale? 2) The point raised about hardware failures and single bit errors corrupting file names stored in DIR sectors on a disk points out to me that most implementations rely on the ECC code used by the hardware at the sector level to indicate a portion of a directory may have been corrupted. To assist in error recovery attempts of this type, I'm wondering if it's plausible that each dirent should have a chksum field to localize which entry may be the one corrupted, when multiple entries fit into a single sector? I realize it wouldn't be backwards compatible with current file systems, but as an explicit implementation option it would be able to coexist with those partitions, I believe. For those file systems that didn't maintain them the relevant field added to dirent as returned by readdir() would be zeroed, and those that did a bit in the statvfs struct for the partition could indicate the support. I don't think additional fields a file system might use to make that reliable would need to be reflected in dirent. Food for thought, anyways. |
(0001740) user229 2013-08-20 02:40 |
Since people are using Linux (and Linus Torvalds personally) as a bogeyman against this ever being implemented, does anyone have any links to posts by him (or other Linux kernel people or important developers of other systems) actually commenting on the issue? |
(0001741) dalias (reporter) 2013-08-20 03:23 |
Not exactly the same issue, but close: here is the classic email thread where Linus Torvalds put forth the "filenames are abstract byte sequences" viewpoint against having any UTF-8 policy in the kernel: http://yarchive.net/comp/linux/utf8.html [^] At one point he mentions '\n' as an analogous case that applications must deal with escaping, just like invalid UTF-8 sequences. Admittedly that is rather old, and his position may have evolved since then; I'd welcome further research on the matter. If I felt confident that we could get real-world implementations to adopt a policy banning \n in filenames, I would not be against it as long as the technical difficulties can be overcome. |
(0001742) shware_systems (reporter) 2013-08-20 06:38 |
As far as evolving goes, while his view that a file name is more of a locale-less blob than not still holds with many interfaces, the standard C interfaces for collecting and comparing user input to those blobs do now rely on LC_CTYPE and LC_COLLATE to give semantic context to the bits in his byte-sized char containers, which by extension of usage now includes file names. I think the only way to avoid this for the kernel is if it's compiled with a freestanding C implementation and custom libraries, which I don't believe is the case. I haven't gotten a kernel.org snapshot in a while, so I'm not sure what's current. His argument that file systems should use UTF-8 over some locale-specific single-byte encoding has foundered, IMO, on the simple economic fact that no company I know of markets a UTF- or Unicode-specific keyboard to enter those file names with, and from what I've seen few implementations are going to spend the extra time doing the conversions from the 7- or 8-bit code set a keyboard does support to UTF-8 in the device drivers. |
(0001743) dalias (reporter) 2013-08-20 07:02 |
shware_systems, please stay on-topic. I cited that thread not to discuss UTF-8 (which, by the way, all mainstream Linux distributions have used by default since c. 2007, and which does not require a "Unicode specific keyboard" to use) but as a look into Linus's position on filenames as raw byte sequences, which random832 requested. |
(0001744) dwheeler (reporter) 2013-08-20 14:52 |
It would be fairly easy to add to a kernel (e.g., the Linux kernel) the ability to configure which bytes/characters are allowed (or not) at the beginning, end, and middle of a filename, and whether or not UTF-8-ness of filename is enforced. I suggest only doing this check at creation time. Then the kernel provides a mechanism to enforce policy, while the actual *policy* is under control of the administrator. I looked over Linus' (old) comments, and he clearly didn't like the idea that ALL systems would forbid non-UTF-8 files. But if this were run-time-configurable, it might be just fine. |
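The mechanism/policy split described in this note could be sketched roughly as follows. This is a userspace illustration only, not kernel code; the `FilenamePolicy` class, its field names, and the default byte sets are hypothetical choices an administrator might make.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FilenamePolicy:
    """Administrator-chosen policy; the check itself is the mechanism."""
    forbidden_anywhere: frozenset = frozenset(range(1, 32))  # control bytes 1-31
    forbidden_first: frozenset = frozenset(b"-")             # e.g. a leading '-'
    forbidden_last: frozenset = frozenset(b" ")              # e.g. a trailing space
    require_utf8: bool = False

    def allows(self, name: bytes) -> bool:
        """Would this name be accepted at creation time?"""
        # '/' and NUL can never appear in a pathname component.
        if not name or b"/" in name or b"\0" in name:
            return False
        if name[0] in self.forbidden_first or name[-1] in self.forbidden_last:
            return False
        if any(byte in self.forbidden_anywhere for byte in name):
            return False
        if self.require_utf8:
            try:
                name.decode("utf-8")
            except UnicodeDecodeError:
                return False
        return True

default_policy = FilenamePolicy()
strict_policy = FilenamePolicy(require_utf8=True)
```

Because the check is applied only when a name is created, names that already exist on disk stay readable whatever the policy says.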
(0001745) oiaohm (reporter) 2013-08-21 12:23 |
random832 "oiaohm, the thing is - if url encoding is forced at the open/readdir/etc level, then the files don't _really_, from the point of view of any program written to the existing API, contain control characters, they contain percent sign and hex. And when your gui programs try to apply url-encoding, they'll end up presenting the filenames with %25xx because they don't know the kernel has already applied url-encoding." In fact, with URL encoding you do know: file:// is at the start. Yes, if you do have "file://something" [^] as a real file name, then to get some graphical programs to use that file name as an argument you have to encode it; this is the only issue graphical programs currently have with strange names. Arguments passed to programs are NUL-terminated. There is no reason why open/readdir/etc. could not take URL encoding as a flag. Following the freedesktop guidelines, open/readdir accept a NUL-terminated string, so the freedesktop requirement is met since the name is NUL-terminated or URL-encoded. Note the "or". Something can accept both; if it does accept both, it is ideally meant to have a flag to say which it is accepting. Remember that the freedesktop guidelines place two requirements. First, NUL-terminated strings; these have zero problems with the strange chars. Second, URL encoding for cases where you want to use something like \n to make a list of files. So, following the freedesktop guidelines, all filenames in the shell should be URL-encoded, since you are using \n, space and other things as separators. |
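As a rough illustration of both sides of this exchange, the sketch below percent-encodes filenames so that a newline-separated list stays unambiguous, and then shows the double-encoding effect random832 describes. It uses Python's `urllib.parse`; the helper names `encode_name` and `decode_name` are made up for the example.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

def encode_name(name: bytes) -> str:
    """Percent-encode every byte that is not an unreserved URL character."""
    return quote_from_bytes(name, safe="")

def decode_name(text: str) -> bytes:
    """Invert encode_name, returning the original bytes."""
    return unquote_to_bytes(text)

names = [b"plain.txt", b"two\nlines", b"100% done", b"caf\xe9"]

# A newline-separated listing is safe once each name is encoded.
listing = "\n".join(encode_name(n) for n in names)
decoded = [decode_name(line) for line in listing.split("\n")]

# Encoding an already encoded name turns '%' into '%25'.
once = encode_name(b"two\nlines")
twice = encode_name(once.encode("ascii"))
print(once, twice)
```

The second half is why a flag, or some other out-of-band signal, is needed to say whether a given string has already been encoded.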
(0001746) oiaohm (reporter) 2013-08-21 12:25 |
dwheeler "It would be fairly easy to add to a kernel (e.g., the Linux kernel) the ability to configure which bytes/characters are allowed (or not) at the beginning, end, and middle of a filename, and whether or not UTF-8-ness of filename is enforced. I suggest only doing this check at creation time. Then the kernel provides a mechanism to enforce policy, while the actual *policy* is under control of the administrator." dwheeler, no. Read http://yarchive.net/comp/linux/utf8.html [^] again. Linus Torvalds: "The kernel is _agnostic_ in what it does. As it should be. It doesn't really care AT ALL what you feed it, as long as it is a byte-stream." This is the key line you are missing, dwheeler. To fix this problem you are not allowed to use kernel space if you want it to work on the Linux kernel, end of story. The Linux kernel is the base of the dominant POSIX operating systems out there. Next problem, dwheeler: you want to perform filtering. There is a reason why the Linux kernel keeps it simple. The Linux kernel avoids transforming as much as possible. So in file systems that support UTF-8, like the Linux core file systems (the ext file systems, xfs, btrfs and others), only the bare minimum of checks is performed. Performing the checks you are talking about increases code size in kernel space and increases the places where string handling bugs can occur. dwheeler, think Denial of Service attack: an administrator makes a typo and forbids some char that is required. Is it possible, by forbidding chars that can be written to the file system, to leave a system you cannot log into? The answer is yes. The userspace locale setting can already forbid creating file names containing particular chars on Linux. Yes, anyone running particular backup programs from cron finds this out, because the backup will not create particular files: the default is not UTF-8 but ASCII, i.e. 0-127. This restriction does not stop files with invalid chars from being displayed. dwheeler, so you are proposing double filtering, and so a double set of bug locations.
"Is it the kernel or the userspace locale refusing to create a file?" is the question your idea will cause. This is another reason why your filtering idea is so dead. An administrator can edit the locale tables as well, to forbid chars without touching the kernel, today. Yes, run-time-configurable file name char limiting exists today on Linux, in userspace. This is what you all have to get: kernel space is off limits for solving this problem. |
(0001747) oiaohm (reporter) 2013-08-21 12:49 |
shware_systems "Not really, IMO... From the perspective of the standard and majority of implementations that object is no longer "read only media", it is "non-compliant hardware", as in "coaster" or "paperweight"." Sorry, no. Do you know what write blockers are? For every type of storage media out there, a matching write blocker exists. http://www.forensicswiki.org/wiki/Write_Blockers [^] Every storage item can be in read-only mode. Some have a switch to do it; some have a block of hardware you attach. Sorry, that is the reality. shware_systems, sometimes you should only raise things if you know them. "2) The point raised about hardware failures and single bit errors corrupting file names stored in DIR sectors on a disk points out to me that most implementations rely on the ECC code used by the hardware at the sector level to indicate a portion of a directory may have been corrupted." At what point do hard drives (same for spinning and SSD) stop being able to correct with ECC? When they run out of spare sectors for SMART to replace with. Yes, even the newest hard drives can end up in a failed state spitting up bit errors. Hard drives stop correcting once they are out of replacement sectors. At this point the corrupted sector might be flagged as such but will be passed to the OS in its damaged state. There is a requirement to handle whatever turns up when the crap hits the fan, whether through hardware failure or having been attacked. Write blockers are used when maintaining evidence or performing data recovery. Why when performing data recovery? To make sure you don't destroy more data. So how are you going to feel when a hard drive is starting to pack it in, and there is that very important business file you could not copy off because the OS would not allow it over some stupid "I don't allow X chars" rule, and then the drive dies completely?
"1) Is the standard explicit enough that portable applications can expect the file system, and file names, to adhere to the limitations of the"C" locale, or are file systems supposed to be compatible with the global implementation-defined "" locale?" Linus Torvalds: "The kernel is _agnostic_ in what it does. As it should be. It doesn't really care AT ALL what you feed it, as long as it is a byte-stream." This answers that. Native file systems in a system like Linux have no locale, just UTF-8 byte-stream filenames. Linux was not the first to do this; it is common behaviour for most of the Unix OSes in existence. Applications are expected to cope with this fact. POSIX is failing to meet what is required. The default BSD and Solaris file systems also have no locale. The dominant kernels of the POSIX world have locale-less file systems. Agnostic is how the Unix, BSD and Linux worlds treat file systems. There are debates from BSD that state the same thing as Linus. Agnostic is defined by the major parties who will implement the POSIX spec. Windows has a very poor POSIX implementation, so it really does not need to be shown special treatment. |
(0001748) steffen (reporter) 2013-08-21 13:12 |
I've just read Don Cragun's note `884' as of 2011-07-06 in [1]: The current plan is to add a set of byte values (based on single-byte characters in the C Locale) that will not be allowed in newly created filenames using 0000251 as the bug to make the changes. If consensus is reached on a resolution for bug 251, the plan is to reject and close bugs 243, 244, and 245. These three bugs will remain open until bug 251 is resolved [1] <http://austingroupbugs.net/view.php?id=245> [^] and just want to point out that imho this is a weird, non backward-compatible plan, just as useful as trying to add an URL encoding layer or any other plan that imposes loss of backward-compatibility and adds restrictions which require multiple parties to agree in the steps taken. POSIX standardized tools in common use have already taken the necessary steps to overcome this problem on a per-utility base via options like -print0, -z (BSD sort(1)), and many shells (mksh, ksh, bash) have adopted the -d option, as in this example from the mksh(1) manual: find . -type f -print0 | while IFS= read -d '' […] This is common practice in all over the Unix community for many years. (I must however admit that i also have written programs and tools that can be fooled by using control characters.) Thank you. |
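For comparison with the elided shell loop above, here is a hedged sketch of a consumer of NUL-delimited `find` output, written in Python. It assumes a `find` that supports `-print0` (as the GNU and BSD versions do); the helper name `find_files` is illustrative.

```python
import os
import subprocess
import tempfile

def find_files(root: str) -> list:
    """Return regular files under root as bytes paths, split on NUL only."""
    out = subprocess.run(
        ["find", root, "-type", "f", "-print0"],
        check=True,
        stdout=subprocess.PIPE,
    ).stdout
    # Every path is followed by one NUL, so drop the empty final piece.
    return [path for path in out.split(b"\0") if path]

with tempfile.TemporaryDirectory() as d:
    # One ordinary name and one containing an embedded newline.
    for name in (b"plain", b"new\nline"):
        with open(os.path.join(os.fsencode(d), name), "wb"):
            pass
    found = sorted(os.path.basename(p) for p in find_files(d))
    print(found)  # [b'new\nline', b'plain']
```

Splitting on newline instead would report three names here, which is the failure mode the per-utility options are meant to avoid.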
(0001749) oiaohm (reporter) 2013-08-21 14:02 |
steffen, locale blocking remains outside kernel space. Also, locale blocking can be overridden if required. Adding URL encoding would be about alignment between shell and graphical programs. Yes, graphical programs follow freedesktop with NUL and URL encoding, and the POSIX way was the broken separator. For example, find being able to URL-encode for -exec functions would be useful for the bug cases that can affect graphical programs: file://something [^] issues, where the graphical program can get confused about whether that is a file or a URL pointing to a file. Yes, a different bug. Graphical programs have a different issue completely. Of course it is solvable on both ends: one, the graphical application adds a flag to say that the argument is not URL-encoded; two, the shell adds a means to send URL-encoded names simply. The graphical and shell interlink is not great. steffen, it's not just backwards compatibility; better compatibility going forwards has to be considered as well. Yes, URL encoding is more of a solution that helps the compatibility between the shell and graphical programs. At least there is an existing population of applications that can take advantage of URL encoding and that are not existing security risks. URL encoding does not require kernel agreement. Any plan wanting kernel agreement is pretty much up the creek. Basically, I am willing to tolerate some loss of backwards compatibility if the result going forwards does not increase the risk of data loss and makes graphical programs and the shell more friendly with each other. Shell and graphical programs are in conflict over how to address the issue. The graphical solution of URL-encoded and NUL-terminated names at least works without altering kernel space or restricting the chars that can be in file names (this is backwards-compatible access to data). Applications can be replaced; data not always. steffen, I would not take breaking shell scripting to fix it off the cards. |
(0001779) oiaohm (reporter) 2013-08-30 04:44 |
Something I failed to mention. Please look closer at what is happening inside Linux. More and more init system solutions do not use the shell. Graphical programs use calls like execv to start other applications if they have to, so avoiding the shell interpreter. There are plans in the Linux kernel to kick text console processing out to userspace, to the point that it is optional for it even to be on the system. This is the problem: this bug is about fixing an issue that could perfectly well fix itself because the shell is given up on completely. So it means adding a restriction to the standard to fix something that ceases to exist. |
(0002480) safinaskar (reporter) 2014-12-05 14:45 |
At least, please add -print0 option to find and similar options. This is my proposal: http://austingroupbugs.net/view.php?id=903 [^] |
(0002730) mirabilos (reporter) 2015-06-22 16:48 |
Changing something fundamental like this is bad. I see three causes of trouble here: ① data interchange with systems that do not implement this yet (think mounting foreign-OS filesystems, network, but even just tarballs) – this is problematic, but didn’t prompt me to comment yet… ② malware hiding in files with “forbidden” names, which other readers pointed out, is also a strong issue, one which I found while re-reading this report… ③ but the addition of new interfaces to access such files will be the cause of big and utter pain; another LWN reader commented on this, and this prompted me to re-read this and actually comment… before I was just thinking “oh no, but, meh, we can just ignore this”. Honestly, I fully agree with the need to escape some kinds of filenames. Widely deployed, and easily later-added-to-existing-systems, measures like “print -0” and “read -d ''” exist. Actually, make that… ④ a further concern is that code doing 'foo `ls`' (instead of “xargs -0 foo --”) becoming “correct”, which will lead to more people writing code like that, which will lead to more bad (lacking enough proper escaping to be generally useful) code… I mean, ok, you can say “just implement POSIX”, but there’s enough existing systems around that that is… a rather unkind attitude to have, so I urge you to not do that. Escape mechanisms for bad Unicode sequences in full Unicode APIs exist. I know of two: ⒈ MirBSD’s maps invalid Unicode into the range EF80‥EFFF when converting from 8-bit to Unicode (we use 16-bit Unicode, but this also works with 32-bit Unicode), and the other way back. On-disc “valid Unicode EF80” will be treated as invalid and converted to three codepoints (EFEE EFEB EF80), and be reconstructed from it – it’s in the PUA, so applications shouldn’t have been using it. The range used is actually coordinated with and allocated by the CSUR (ConScript Unicode Registry, the largest PUA users around). 
MirBSD uses this because there is only one locale which uses only one charset/encoding, which is UTF-8 (well CESU-8, but it’s the same for 16-bit Unicode systems) but 8-bit transparent, so we need to be able to round-trip arbitrary octets through wchar_t. ⒉ Python 3’s maps them as high surrogates which are (in the UCS-2/UTF-16 case) not preceded by low surrogates (and in the UCS-4 case, surrogates are invalid anyway). Py3k uses this because their string data type is actually Unicode (of an OS-dependent width) and used by all APIs, but required to be able to pass things like OS filenames (which may be encoded in something not UTF-8) through to e.g. the open function. Dealing with NUL is also perfectly documented, and all those \n cases are just people using bad shell scripting and the standard not providing officialness stamps to the existing, widely deployed (over one *billion* Android devices come with mksh as the system – and only – shell installed; my contact at Google said in 2013? 2014? that they’re working on the second billion), perfectly usable, easily retrofitted, measures to fix them (see the other three issues about read -d, find -print0, xargs -0). On that note, please do *not* standardise on GNU sort (-z for NUL-terminated), but on BSD’s: mirbsd$ printf 'foo\0bar\0' | sort -R '' | hd 00000000 62 61 72 00 66 6F 6F 00 - |bar.foo.| mirbsd$ printf 'foo/bar/' | sort -R / | hd 00000000 62 61 72 2F 66 6F 6F 2F - |bar/foo/| … is much more flexible than GNU’s… -z, --zero-terminated line delimiter is NUL, not newline … for the same reason the shell read built-in command has -d for the delimiter (first octet of $OPTARG is used) instead of -0. |
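The Python 3 mechanism mentioned in item ⒉ is the `surrogateescape` error handler from PEP 383. A small sketch of the round trip follows, assuming UTF-8 as the filesystem encoding; each undecodable byte 0x80 to 0xFF becomes a lone surrogate code point in the range U+DC80 to U+DCFF and is turned back into the same byte on encoding.

```python
# Latin-1 encoded name: 0xE9 is not valid UTF-8 on its own.
raw = b"caf\xe9.txt"

# Decoding smuggles the bad byte through as the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", "surrogateescape")

# A strict encode refuses the lone surrogate instead of guessing.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(text), round_tripped == raw, strict_ok)
```

On POSIX systems `os.fsdecode()` and `os.fsencode()` apply this handler with the filesystem encoding, which is how such names pass through `open()` unchanged.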
(0006146) dwheeler (reporter) 2023-02-10 18:14 |
I was fully aware when I proposed this that this was a big change, and might not be accepted. Still, there's been a problem that POSIX *allows* control characters in filenames (particularly newline), yet hasn't provided standard mechanisms to deal with them, so it's often impractical to process filenames securely (a sad state of affairs). The latest effort to allow \0 (NUL) terminated filename lists is a big step forward. It's now at least possible to portably process filename lists, using a mechanism that is already widely implemented. mirabilos : > Honestly, I fully agree with the need to escape some kinds of filenames. Widely deployed, and easily later-added-to-existing-systems, measures like “print -0” and “read -d ''” exist. Actually, make that ... > On that note, please do *not* standardise on GNU sort (-z for NUL-terminated), but on BSD’s: > mirbsd$ printf 'foo\0bar\0' | sort -R '' | hd 00000000 62 61 72 00 66 6F 6F 00 - |bar.foo.| > ... > is much more flexible than GNU’s… > -z, --zero-terminated > line delimiter is NUL, not newline > > for the same reason the shell read built-in command has -d for the delimiter (first octet of $OPTARG is used) instead of -0. Sorting on filenames is useful, especially when you're trying to create reproducible builds. That said, don't use "-R" in the standard for sort to specify a record delimiter, as GNU sort already uses -R for "random-sort". I think it should be possible to find another option letter or two that is either already used for its purpose *or* doesn't conflict with an existing implementation. I would prefer that sort support both a simple option meaning "use NUL terminator" and a more general option meaning "use this delimiter (empty means NUL)". The resulting scripts are a little easier to read when there's a simple option, especially when the script is already quoted. In addition, different implementations have seen value in each, so it makes sense to support them both.
I believe the sort specification could add "-z" for saying "use NUL terminators for input/output record separators"; the "-z" option is used in GNU sort, MacOS sort, and probably others. I think you could use "-D" as the option for an arbitrary delimiter, similar to read -d DELIMITER where an empty delimiter means NUL. As with other cases, in at LEAST the case where NUL is used, it SHOULD be an error if there are 1+ bytes at the end that are not terminated by the terminator, as this suggests partial data. |
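The delimiter semantics proposed in this note can be sketched as a small record splitter. This is an illustration of the proposal only, not of any existing sort(1) implementation; the function name `split_records` is hypothetical. An empty delimiter means NUL, only the first octet of the delimiter is used (as with `read -d`), and unterminated trailing bytes are treated as an error.

```python
def split_records(data: bytes, delimiter: bytes = b"") -> list:
    """Split data into records that are each ended by the terminator."""
    # Empty delimiter selects NUL; otherwise only the first octet counts.
    terminator = delimiter[:1] or b"\0"
    if data and not data.endswith(terminator):
        # Bytes after the last terminator suggest truncated input.
        raise ValueError("unterminated trailing record")
    # split() yields one empty piece after the final terminator; drop it.
    return data.split(terminator)[:-1] if data else []

nul_sorted = sorted(split_records(b"foo\0bar\0"))
slash_sorted = sorted(split_records(b"foo/bar/", b"/"))
print(nul_sorted, slash_sorted)
```

Treating a missing final terminator as an error, rather than silently accepting a partial last record, follows the SHOULD in the note above.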
(0006154) mirabilos (reporter) 2023-02-18 20:19 |
@ (0006146) dwheeler OK, I agree, adding 'z' and 'D:' both seems sensible. In BSD we can just alias D: to R: and z to D with nullstring optarg. I look at moving the focus from forbidding things with embedded control bytes to providing ways to deal with them that are compatiblish enough with existing tools to deal with them very favourably. Thanks! |
(0006481) kre (reporter) 2023-09-20 16:17 |
I know I have ignored this issue, and much seems to have been decided before I was in any way involved, but IMO there's really no option available that is consistent with the rest of the system but to completely reject this bug. Filenames need to be considered as uninterpreted byte streams, just terminated with a \0 byte as that's how they're passed through the kernel interfaces. Even the use of '/' to separate directory components isn't essential (though the leading '/' is, to switch between relative and full path names). To be compatible with the various locales that POSIX allows, filenames simply cannot be interpreted as characters, as there's no guarantee anywhere that a process attempting to access a file will be in the same locale as the one which created it. The accessing process can read the directory tree, and build the byte sequence needed to access the path name desired, even if that process has no idea how to represent any of that as characters. If we can accept that filenames cannot (in general) be interpreted as characters, then we cannot really attempt to name a bunch of characters, or even just one of them, and disallow it. That's absurd. Further, if the character newline is forbidden, what does that mean in a locale where 2 byte encoding of characters is in use, and one of the bytes in some of the characters happens to have the binary value 10? None of those characters is a newline, but when interpreted byte by byte it looks as if there are newlines in the path. It is fine to have what we do now, and demand that implementations support a fixed set of encoded characters as filename paths, so that portable applications can be sure of working anywhere, but that's as far as it is reasonable to take things. Had, 50 years ago, the original designers decided to adopt a more limited filename representation, or had unix systems followed plan9 and fully adopted UTF-8 as the one and only blessed encoding scheme, things might be different.
But neither of those happened. We have been living with the effects of those decisions now for decades - and (despite the occasional badly written piece of code, or more frequently, script, showing up) we have been coping. Any implementation which feels like it isn't, is already allowed to restrict the filenames it supports. Most have not, for which there are plenty of reasons. Attempting to force those implementations to change, with the threat of making them non-conformant if they don't, is simply a very very poor choice to make - and will simply lead to far fewer posix conformant implementations, and as soon as that happens, we'll be heading back to the bad old days, when every system differed from the others, in an attempt to lock in clients, and be (in their view) better than the others. That is a much worse outcome than needing to deal with a few broken (mostly) scripts from time to time. Particularly as that problem will remain, as the non-conforming systems will start out being non-conforming by ignoring the restriction intended to help with that issue. Simply reject this. |
(0006484) eblake (manager) 2023-09-21 19:02 edited on: 2023-09-21 19:04 |
Responding to Note: 0006481 > If we can accept that filenames cannot (in general) be interpreted as characters, then we cannot really attempt to name a bunch of characters, or even just one of them, and disallow it. That's absurd. Further, if the character newline is forbidden, what does that mean in a locale where 2 byte encoding of characters is in use, and one of the bytes in some of the characters happens to have the binary value 10? None of those characters is a newline, but when interpreted byte by byte it looks as if there are newlines in the path. POSIX XBD 6.1 (in issue 8 draft 3 at page 119; in issue 7-tc3 2017 edition at page 127, but not in the 2008 edition) already says that <slash> and <newline> MUST be single-byte characters with invariant values across all possible viable locales in the implementation. You are right that some portable characters (such as 'A' or '\') have an encoding which also occurs as a tail byte in certain multibyte sequences of some encodings (Big5, anyone?). But when we added these limitations on <slash> and <newline> in XBD 6.1 in one of the TCs to issue 7, no one could identify any common locale where a multibyte sequence could contain one of those bytes as an embedded byte of any larger multi-byte sequence. So your worries about mis-identifying a byte value of 10 as a newline in a different locale is impossible within POSIX. That said, maybe the wording in 6.1 could still be tweaked to be more explicit that the four special invariant characters (and maybe more: the recent discussions on 0001649 about IFS could add <space> and <tab> to the list of characters that would be useful as invariants) also must not appear in the tail of any multibyte character. |
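The tail-byte claim can be checked empirically for a given codec. The sketch below brute-forces every two-byte sequence in Python's `big5` codec and collects the second bytes that occur inside a valid character. It says nothing about encodings Python does not ship or about future ones, and the helper name `tail_bytes` is illustrative.

```python
def tail_bytes(encoding: str) -> set:
    """Collect every second byte of a valid two-byte character."""
    tails = set()
    for lead in range(0x81, 0xFF):
        for tail in range(0x00, 0x100):
            try:
                decoded = bytes([lead, tail]).decode(encoding)
            except UnicodeDecodeError:
                continue
            # Exactly one character means the pair formed one multibyte unit.
            if len(decoded) == 1:
                tails.add(tail)
    return tails

big5_tails = tail_bytes("big5")
print(0x0A in big5_tails)  # newline as a tail byte?
print(0x2F in big5_tails)  # slash as a tail byte?
print(0x5C in big5_tails)  # backslash as a tail byte?
```

For this codec the backslash value does occur as a tail byte while the newline and slash values do not, which matches the distinction drawn in the note.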
(0006486) kre (reporter) 2023-09-21 20:05 |
Re Note: 0006484 It might very well be true that there are no current known encodings used by various character sets that can contain an embedded newline (and perhaps space and tab) - but POSIX have no control over such things, and just as with the printf(3) %b nonsense and how that seemed to affect printf(1), other bodies which produce the relevant standards don't necessarily concern themselves with how POSIX will be affected by their decisions, and that is particularly likely with national standards bodies perhaps defining a new encoding for their local (perhaps unique) character set. I much prefer not to simply trust that no such encoding will ever appear, as unlike the %b nonsense, such an addition would actually affect things. It is OK to require that / and \n have invariant encodings (though I am not sure how one enforces that, particularly in an environment where users are permitted to define their own locales, and not simply use an implementation provided one) but that of itself doesn't mean that / and \n cannot occur as the part of some other encoding, just that all locales must encode the / character as the value 47 (say) and \n as 10. Just saying "outside the standard, non-conformant" doesn't help when we're talking about the filesystem, that has to work (on any real implementation) for everyone. Further, the only point of this change is to promise applications that they can go ahead and simply ignore the possibility that a file name might contain an embedded \n - which the proposed resolution as I understand it does not even guarantee, only newly created file names would be affected, not existing ones - and in such an environment, one can easily disable posix conformance for a while, create files with \n embedded, and then go back to being posix conforming again for the applications to read the directories now which contain existing files with embedded \n That is, the solution being proposed here benefits no-one. 
Further it isn't needed, as if an implementation wants to implement that "no \n in filenames" restriction, it already can, and remain conformant. The only point of this "bug" is to delude script writers (mostly) into believing that they can ignore the possibility that a \n might appear in a filename, and hence allow them to make more broken scripts. That hardly seems like any kind of win to me. |
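For concreteness, an implementation-side "no \n in filenames" restriction of the kind discussed here might look like the following userspace sketch. It mirrors the [EILSEQ] wording proposed for mkdir() in Note 0006561; the wrapper name `checked_mkdir` is hypothetical, and only the last pathname component is examined.

```python
import errno
import os
import tempfile

def checked_mkdir(path, mode=0o777):
    """mkdir() that refuses a newline byte in the last path component."""
    # Strip trailing slashes so "dir/" still yields "dir" as last component.
    last = os.path.basename(os.fsencode(path).rstrip(b"/"))
    if b"\n" in last:
        raise OSError(errno.EILSEQ, os.strerror(errno.EILSEQ), path)
    os.mkdir(path, mode)

with tempfile.TemporaryDirectory() as d:
    checked_mkdir(os.path.join(d, "ok"))
    created = os.path.isdir(os.path.join(d, "ok"))
    try:
        checked_mkdir(os.path.join(d, "bad\nname"))
        refused = None
    except OSError as exc:
        refused = exc.errno
    leftover = sorted(os.listdir(d))
    print(created, refused == errno.EILSEQ, leftover)
```

As the note points out, a check like this only affects newly created names; directories that already contain such names are still returned unchanged by readdir().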
(0006549) geoffclare (manager) 2023-10-19 16:32 |
> That is, the solution being proposed here benefits no-one. Further it isn't > needed, as if an implementation wants to implement that "no \n in filenames" > restriction, it already can, and remain conformant. The only point of this > "bug" is to delude script writers (mostly) into believing that they can ignore > the possibility that a \n might appear in a filename, and hence allow them to > make more broken scripts. This seems to be a reaction to earlier proposals to mandate the EILSEQ errors. The current plan for Issue 8 is only to encourage implementations to give those errors. As you point out, implementations can already do it if they choose to, so encouraging it is not much of a change. And I think this will have the opposite effect to the one you predict; mentioning the issue in the standard will increase awareness of it, and thus may help some script writers who were previously unaware of the issue to write better scripts. |
(0006551) kre (reporter) 2023-10-20 20:23 |
Re Note: 0006549 Sorry, I must have missed something somewhere, the only proposed text I can see for this bug is the "Desired Action" and it seems clear from the notes that have followed that it is not going to be adopted as stated. The most recent hint as to what will happen I can find here is from Note: 0001748 (and yes, I know, more than 10 years ago) which quotes a note by Don Cragun from another bug (Note: 0000884) from 2 years previous to that, which said (apparently, I am copying from Note: 0001748): The current plan is to add a set of byte values (based on single-byte characters in the C Locale) that will not be allowed in newly created filenames using 0000251 as the bug to make the changes. 251 is this one... The meeting minutes just say that you're continually awaiting Don's proposed wording, which I assumed was to implement that plan. Perhaps if someone were to post some new proposed wording sometime this decade, we might be able to make more relevant comments. But if the actual intent (unstated anywhere I can find) is, as you suggest, to simply add something encouraging implementations to reject some particular byte values (or ranges) without actually requiring anything, then I would suspect that might be a fruitless endeavour, and I really doubt enough script writers read the POSIX standard for it to make any measurable difference at all to that audience. Given that it has taken 13 years now, and there's not even any proposed wording to do even that much, I'd still suggest simply rejecting this bug, or perhaps retargetting it to Issue 9. I also have no idea what the effect of accepting this bug, with a change that really changes nothing, has on the other bugs which are waiting to see if this one is accepted or not ... Note: 0000884 (as quoted in Note: 0001748) apparently planned on simply closing those other bugs if this one is accepted, but is that still true, if this is accepted, but in such a watered down state, that nothing is really changed?
If not, then those other 3 (or 4 or whatever it was) bugs are going to need attention again - and while I know you're already applying for extensions to the deadline for this project, if you're going to go there, you might want to be asking for an extension until about 2030. |
(0006552) geoffclare (manager) 2023-10-23 09:25 |
> Sorry, I must have missed something somewhere I guess you haven't been reading the meeting minutes. Minutes from 16th Oct said: We started discussion on this item. Minutes from 19th Oct said: We continued discussion on this item. > I also have no idea what the effect of accepting this bug [...] has on the > other bugs which are waiting to see if this one is accepted or not ... > Note: 0000884 (as quoted in Note: 0001748) apparently planned on simply > closing those other bugs if this one is accepted, but is that still true If you go and look at those bugs, you'll see the status of 243 is Applied and the other two were closed as duplicates of 243. |
(0006553) steffen (reporter) 2023-10-24 00:30 |
After having read posix.rhansen.org, I am happy that this bug only boils down to newline aka U+000A LF aka \n.

I wonder what is with Unicode "control" characters, like direction markers left-to-right and vice versa, Unicode paragraph separator etc. For example:

2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
202A;LEFT-TO-RIGHT EMBEDDING;Cf;0;LRE;;;;;N;;;;;
202B;RIGHT-TO-LEFT EMBEDDING;Cf;0;RLE;;;;;N;;;;;
202C;POP DIRECTIONAL FORMATTING;Cf;0;PDF;;;;;N;;;;;
202D;LEFT-TO-RIGHT OVERRIDE;Cf;0;LRO;;;;;N;;;;;
202E;RIGHT-TO-LEFT OVERRIDE;Cf;0;RLO;;;;;N;;;;;

I note that here

printf 'echo abc\u200f\u0646\u0631old 1 2\u200edef' > .T

prints

echo abcنرold 1 2def

(actually firefox browser gets the arabic "right" however, i presume)

If i touch such a file and GNU ls(1) it (truncated):

# ll 'echo abcنرold 1 2def'

(again firefox does "arabic right", whereas my st->tmux->bash stack does not)

# LC_ALL=C ll 'echo abc'$'\342\200\217\331\206\330\261''old 1 2'$'\342\200\216''def |
(0006554) steffen (reporter) 2023-10-24 00:39 |
Forgot

200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;;
200F;RIGHT-TO-LEFT MARK;Cf;0;R;;;;;N;;;;;

"Cf" Unicode (at least). For ls(1)? |
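For illustration only, here is a rough sketch of how a tool might flag the kinds of characters listed in the two notes above. It is written in Python and is not part of any proposal on this bug. The function name `suspicious_characters` and the choice of general categories (Cc, Cf, Zl, Zp) are assumptions made for the example; the only library call used is the standard `unicodedata.category()`.

```python
import unicodedata

# Assumed policy for this sketch: general categories that are invisible or
# that change layout (control, format, line separator, paragraph separator).
SUSPICIOUS_CATEGORIES = {"Cc", "Cf", "Zl", "Zp"}


def suspicious_characters(name):
    """Return (index, "U+XXXX", category) for each flagged code point in name."""
    return [
        (index, "U+%04X" % ord(ch), unicodedata.category(ch))
        for index, ch in enumerate(name)
        if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES
    ]


if __name__ == "__main__":
    # The filename from the printf example, written with escapes so that the
    # directional marks stay visible in the source.
    name = "echo abc\u200f\u0646\u0631old 1 2\u200edef"
    for index, code_point, category in suspicious_characters(name):
        print(index, code_point, category)
```

The Arabic letters themselves (category Lo) are not flagged; only the two directional marks are. Whether something like this belongs in ls(1) or anywhere else is exactly the open question.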
(0006561) Don Cragun (manager) 2023-10-30 17:24 edited on: 2023-10-31 16:50 |
Proposed changes (All page and line numbers refer to Issue 8 Draft 3) to be applied after the changes for 0000253 have been applied:

Change P266 L23296 (XSH bind rationale) from:
None.
to:
Implementations are encouraged to have bind() report an [EILSEQ] error if the last component of the address to be bound to an AF_UNIX family socket contains any bytes that have the encoded value of a <newline> character.

Move the [EEXIST] error on P1027, L35244-35245 (XSH fopen errors) before the [EILSEQ] error on P1027, L35241-35243 to put the errors in alphabetical order.

Add a new paragraph after P977, L33357 (XSH fopen rationale):
Implementations are encouraged to have fopen() and freopen() report an [EILSEQ] error if mode begins with 'w' or 'a', the file did not previously exist, and the last component of pathname contains any bytes that have the encoded value of a <newline> character.

Add a new paragraph after P1330, L44792 (XSH link rationale):
Implementations are encouraged to have link() and linkat() report an [EILSEQ] error if the file named by path2 did not previously exist, and the last component of that pathname contains any bytes that have the encoded value of a <newline> character.

Add a new paragraph after P1407, L47277 (XSH mkdir rationale):
Implementations are encouraged to have mkdir() and mkdirat() report an [EILSEQ] error if the last component of path contains any bytes that have the encoded value of a <newline> character.

Add a new paragraph after P1411, L47406 (XSH mkdtemp rationale):
Implementations are encouraged to have mkdtemp(), mkostemp() and mkstemp() report an [EILSEQ] error if the last component of the pathname in template contains any bytes that have the encoded value of a <newline> character.

Add a new paragraph after P1414, L47532 (XSH mkfifo rationale):
Implementations are encouraged to have mkfifo() and mkfifoat() report an [EILSEQ] error if the last component of path contains any bytes that have the encoded value of a <newline> character.

Add a new paragraph after P1419, L47699 (XSH mknod rationale):
Implementations are encouraged to have mknod() and mknodat() report an [EILSEQ] error if the last component of path contains any bytes that have the encoded value of a <newline> character.

Add a new paragraph after P1514, L50793 (XSH open rationale):
Implementations are encouraged to have open() and openat() report an [EILSEQ] error if oflag contains O_CREAT, the file did not previously exist, and the last component of path contains any bytes that have the encoded value of a <newline> character.

Add a new paragraph after P1891, L62567 (XSH rename rationale):
Implementations are encouraged to have rename() and renameat() report an [EILSEQ] error if the file named by new does not already exist and the last component of that pathname contains any bytes that have the encoded value of a <newline> character.

Add a new paragraph after P2183, L71316 (XSH symlink rationale):
Implementations are encouraged to have symlink() and symlinkat() report an [EILSEQ] error if the last component of path2 contains any bytes that have the encoded value of a <newline> character.

After P2454 L79567 XCU section 1.4 (Utility Description Defaults: CONSEQUENCES OF ERRORS), add to the first bullet item:
<small>Note: If the requested action is to write one or more pathnames in a format that has <newline> as a terminator or separator, and a pathname to be written contains any bytes that have the encoded value of a <newline> character, this should be treated as an action that cannot be performed. A future version of this standard may require that utilities treat this as an error.</small>

Start of editor's notes for changes below to XCU:

Replace each occurrence of the string "PARAGRAPH DELIM" with the paragraph:
If this utility is directed to display a pathname that contains any bytes that have the encoded value of a <newline> character when <newline> is a terminator or separator in the output format being used, implementations are encouraged to treat this as an error. A future version of this standard may require implementations to treat this as an error.

Replace each occurrence of the string "PARAGRAPH DIRENT" with the paragraph:
If this utility is directed to create a new directory entry that contains any bytes that have the encoded value of a <newline> character, implementations are encouraged to treat this as an error. A future version of this standard may require implementations to treat this as an error.

End of editor's notes.

Change P2562, L83722 section admin FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM PARAGRAPH DIRENT
Change P2573, L84133 section ar FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM PARAGRAPH DIRENT
Add after P2624, Lx86260 section awk FUTURE DIRECTIONS: PARAGRAPH DIRENT
Change P2629, L86431 section basename FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Add after P2664, L87886 section c17 FUTURE DIRECTIONS: PARAGRAPH DELIM PARAGRAPH DIRENT
Change P2678, L88346 section cd FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P2701, L89252 section cksum FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Add after P2705, L89390 section cmp FUTURE DIRECTIONS: None. to: PARAGRAPH DELIM
Change P2715, L89771 section command FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Add after P2721, L89999 section compress FUTURE DIRECTIONS: PARAGRAPH DIRENT
Change P2730, L90318 section cp FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P2737, L90620 section csplit FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P2742, L90803 section ctags FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM PARAGRAPH DIRENT
Change P2750, L91075 section cxref FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Add after P2767, L91587 section dd FUTURE DIRECTIONS: PARAGRAPH DIRENT
Change P2771, L91742 section delta FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM PARAGRAPH DIRENT
Change P2775, L91898 section df FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P2784, L92243 section diff FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P2787, L92359 section dirname FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P2791, L92498 section du FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM

Change P2796, L92681 section ed OPERANDS from:
file If the file argument is given, ed shall simulate an e command on the file named by the pathname, file, before accepting commands from the standard input.
to:
file If the file argument is given, ed shall perform the effect of an e command on the pathname file before accepting commands from the standard input, except that file can contain a <newline>, even though this is not possible for the argument to the e command.

Change P2811, L93294 section ed FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P2888, L96380 section ex FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P2916, L97389 section file FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P2926, L97857 section find FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P2935, L98138 section fuser FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P2947, L98555 section get FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM PARAGRAPH DIRENT
Change P2970, L99423 section grep FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P2974, L99526 section hash FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P2977, L99637 section head FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P2994, L100247 section ipcs FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3025, L101427 section link FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3029, L101596 section ln FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3040, L102002 section localedef FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT

Add a new paragraph after P3053, L102472 section ls OPTIONS -C description:
<small>Note: Since the output from this option may use separator characters that include characters that might appear in filenames (in addition to the problems related to <newline>s in filenames), -C should not be used when filenames might be extracted from the output by a script.</small>

Change P3055, L102530-102531 section ls OPTIONS -q description from:
Force each instance of non-printable filename characters and <tab> characters to be written as the <question-mark> ('?') character.
to:
Force each instance of non-printable filename characters (including <newline>, <tab>, and other control characters) to be written as the <question-mark> ('?') character.

Add after P3062, L102840 section ls FUTURE DIRECTIONS: PARAGRAPH DELIM
Change P3072, L103313 section m4 FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3101, L104386 section mailx FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Add after P3133, L105820 section make FUTURE DIRECTIONS: PARAGRAPH DIRENT
Change P3140, L106025 section man FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3146, L106228 section mkdir FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3148, L106320 section mkfifo FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3169, L107117 section msgfmt FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3174, L107320 section mv FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3192, L107932 section nm FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3216, L108834 section patch FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3259, L110622 section pax FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM PARAGRAPH DIRENT

Add a new paragraph after P3266, L110876, section pr APPLICATION USAGE:
If a file operand contains <newline>, <form-feed>, or <vertical-tab> characters, or is overly long, and the pr utility is instructed to include the name of that file in the header, pagination may not be handled correctly. Applications can guard against this by using the -h option (for example, passing a sanitized, truncated form of the pathname with -h).

Change P3280, L111438 section prs FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3289, L111832 section pwd FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3299, L112192 section realpath FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3308, L112527 section rm FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3310, L112621 section rmdel FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3315, L112792 section sact FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3350, L114226 section sh FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3361, L114651 section sort FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3365, L114788 section split FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3396, L115906 section tee FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3421, L116838 section touch FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3441, L117545 section type FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3461, L118235 section unget FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3466, L118382 section uniq FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3471, L118593 section uucp FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3475, L118703 section uudecode FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3490, L119265 section val FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3544, L121335 section vi FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT
Change P3553, L121654 section wc FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Change P3556, L121752 section what FUTURE DIRECTIONS from: None. to: PARAGRAPH DELIM
Add after P3575, L122506 section xgettext FUTURE DIRECTIONS: PARAGRAPH DIRENT
Change P3594, L123258 section yacc FUTURE DIRECTIONS from: None. to: PARAGRAPH DIRENT |
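As an illustration of the two behaviours the note above encourages, here is a minimal user-space sketch in Python. It is not text from the standard and not how any particular implementation does it. The helper names (`last_component`, `check_new_entry`, `write_pathnames`) are hypothetical, and the sketch assumes '/'-separated pathnames. The first check mirrors the "last component contains a <newline> byte, report [EILSEQ]" rule; the second mirrors the XCU note about output formats that use <newline> as a terminator.

```python
import errno
import os

NEWLINE = b"\n"


def last_component(path):
    """Return the last pathname component as bytes, ignoring trailing slashes."""
    raw = os.fsencode(path).rstrip(b"/")
    return raw.rsplit(b"/", 1)[-1]


def check_new_entry(path):
    """Raise OSError(EILSEQ) if the entry to be created has a newline in its name.

    Only the last component is examined, matching the wording of the note;
    earlier components are assumed to already exist.
    """
    if NEWLINE in last_component(path):
        raise OSError(errno.EILSEQ, os.strerror(errno.EILSEQ), os.fsdecode(path))


def write_pathnames(paths, out):
    """Write newline-terminated pathnames to a binary stream.

    A pathname containing a newline byte cannot be represented in this format,
    so it is skipped and counted as a failure instead of being written.
    """
    failures = 0
    for path in paths:
        raw = os.fsencode(path)
        if NEWLINE in raw:
            failures += 1
            continue
        out.write(raw + NEWLINE)
    return failures
```

A caller would run `check_new_entry(path)` before creating a file, and treat a non-zero return from `write_pathnames()` as grounds for a non-zero exit status. Both are only approximations of what an implementation inside open() or a utility would do.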
Issue History | |||
Date Modified | Username | Field | Change |
2010-05-03 18:49 | dwheeler | New Issue | |
2010-05-03 18:49 | dwheeler | Status | New => Under Review |
2010-05-03 18:49 | dwheeler | Assigned To | => ajosey |
2010-05-03 18:49 | dwheeler | Name | => David A. Wheeler |
2010-05-03 18:49 | dwheeler | Section | => XBD 3.170 Filename |
2010-05-03 18:49 | dwheeler | Page Number | => 60 |
2010-05-03 18:49 | dwheeler | Line Number | => 1781 |
2010-05-03 19:37 | dwheeler | Note Added: 0000412 | |
2011-03-10 16:04 | msbrown | Note Added: 0000689 | |
2011-04-11 20:28 | eblake | Interp Status | => --- |
2011-04-11 20:28 | eblake | Note Added: 0000739 | |
2011-04-11 20:28 | eblake | Summary | Forbid bytes 1 through 31 (inclusive) in filenames => Forbid newline, or even bytes 1 through 31 (inclusive), in filenames |
2011-04-11 21:17 | dwheeler | Note Added: 0000740 | |
2011-07-07 15:31 | user27 | Note Added: 0000887 | |
2011-12-17 21:43 | dwheeler | Issue Monitored: dwheeler | |
2012-02-22 06:48 | oiaohm | Note Added: 0001140 | |
2012-02-22 19:25 | dwheeler | Note Added: 0001141 | |
2012-02-22 19:38 | wlerch | Note Added: 0001142 | |
2012-02-23 00:55 | oiaohm | Note Added: 0001143 | |
2012-02-25 02:25 | oiaohm | Note Added: 0001147 | |
2012-02-25 02:40 | eblake | Relationship added | related to 0000545 |
2012-03-29 23:10 | Don Cragun | Relationship replaced | has duplicate 0000545 |
2012-08-03 19:35 | eblake | Relationship added | related to 0000291 |
2012-08-03 19:36 | eblake | Relationship added | related to 0000293 |
2012-08-03 19:36 | eblake | Relationship added | related to 0000573 |
2012-12-23 05:12 | user229 | Note Added: 0001437 | |
2013-01-07 23:29 | oiaohm | Note Added: 0001438 | |
2013-01-07 23:47 | oiaohm | Note Added: 0001439 | |
2013-08-14 02:43 | jrincayc | Note Added: 0001711 | |
2013-08-14 03:23 | dwheeler | Note Added: 0001712 | |
2013-08-14 05:41 | dalias | Note Added: 0001713 | |
2013-08-14 07:34 | oiaohm | Note Added: 0001714 | |
2013-08-15 02:33 | jrincayc | Note Added: 0001715 | |
2013-08-15 02:48 | dalias | Note Added: 0001716 | |
2013-08-15 03:31 | jrincayc | Note Added: 0001717 | |
2013-08-16 02:43 | jrincayc | Note Added: 0001721 | |
2013-08-16 05:03 | oiaohm | Note Added: 0001722 | |
2013-08-16 10:46 | a brouwer | Note Added: 0001723 | |
2013-08-17 15:03 | dwheeler | Note Added: 0001724 | |
2013-08-17 16:13 | dalias | Note Added: 0001725 | |
2013-08-18 02:01 | dwheeler | Note Added: 0001726 | |
2013-08-18 05:22 | oiaohm | Note Added: 0001727 | |
2013-08-18 06:10 | oiaohm | Note Added: 0001728 | |
2013-08-18 13:12 | jrincayc | Note Added: 0001730 | |
2013-08-18 19:09 | dwheeler | Note Added: 0001731 | |
2013-08-19 09:25 | geoffclare | Note Added: 0001732 | |
2013-08-19 10:48 | a brouwer | Note Added: 0001733 | |
2013-08-19 12:34 | jrincayc | Note Added: 0001734 | |
2013-08-19 14:55 | oiaohm | Note Added: 0001735 | |
2013-08-19 22:40 | oiaohm | Note Added: 0001736 | |
2013-08-19 22:57 | user229 | Note Added: 0001737 | |
2013-08-19 23:21 | dalias | Note Added: 0001738 | |
2013-08-19 23:48 | shware_systems | Note Added: 0001739 | |
2013-08-20 02:40 | user229 | Note Added: 0001740 | |
2013-08-20 03:23 | dalias | Note Added: 0001741 | |
2013-08-20 06:38 | shware_systems | Note Added: 0001742 | |
2013-08-20 07:02 | dalias | Note Added: 0001743 | |
2013-08-20 14:52 | dwheeler | Note Added: 0001744 | |
2013-08-21 12:23 | oiaohm | Note Added: 0001745 | |
2013-08-21 12:25 | oiaohm | Note Added: 0001746 | |
2013-08-21 12:49 | oiaohm | Note Added: 0001747 | |
2013-08-21 13:12 | steffen | Note Added: 0001748 | |
2013-08-21 14:02 | oiaohm | Note Added: 0001749 | |
2013-08-30 04:44 | oiaohm | Note Added: 0001779 | |
2014-12-05 14:45 | safinaskar | Note Added: 0002480 | |
2015-06-22 16:48 | mirabilos | Note Added: 0002730 | |
2016-03-27 10:43 | dannyniu | Issue Monitored: dannyniu | |
2023-02-10 18:14 | dwheeler | Note Added: 0006146 | |
2023-02-10 18:15 | dwheeler | Note Added: 0006147 | |
2023-02-10 18:16 | dwheeler | Note Added: 0006148 | |
2023-02-10 21:16 | dwheeler | Note Deleted: 0006147 | |
2023-02-10 21:16 | dwheeler | Note Deleted: 0006148 | |
2023-02-18 20:19 | mirabilos | Note Added: 0006154 | |
2023-02-18 20:21 | mirabilos | Note Added: 0006155 | |
2023-08-22 06:28 | Don Cragun | Relationship added | related to 0000243 |
2023-08-22 06:29 | Don Cragun | Relationship added | related to 0000244 |
2023-08-22 06:30 | Don Cragun | Relationship added | related to 0000245 |
2023-09-20 16:17 | kre | Note Added: 0006481 | |
2023-09-21 19:02 | eblake | Note Added: 0006484 | |
2023-09-21 19:04 | eblake | Note Edited: 0006484 | |
2023-09-21 20:05 | kre | Note Added: 0006486 | |
2023-09-22 06:25 | safinaskar | Issue Monitored: safinaskar | |
2023-09-22 06:25 | safinaskar | Issue End Monitor: safinaskar | |
2023-10-19 16:32 | geoffclare | Note Added: 0006549 | |
2023-10-20 20:23 | kre | Note Added: 0006551 | |
2023-10-23 09:25 | geoffclare | Note Added: 0006552 | |
2023-10-24 00:30 | steffen | Note Added: 0006553 | |
2023-10-24 00:39 | steffen | Note Added: 0006554 | |
2023-10-30 15:26 | eblake | Relationship added | related to 0000248 |
2023-10-30 17:24 | Don Cragun | Note Added: 0006561 | |
2023-10-30 17:27 | Don Cragun | Note Edited: 0006561 | |
2023-10-30 17:35 | Don Cragun | Final Accepted Text | => See Note: 0006561. |
2023-10-30 17:35 | Don Cragun | Status | Under Review => Resolved |
2023-10-30 17:35 | Don Cragun | Resolution | Open => Accepted As Marked |
2023-10-30 17:39 | Don Cragun | Tag Attached: issue8 | |
2023-10-31 16:50 | Don Cragun | Note Edited: 0006561 | |
2023-11-02 15:16 | eblake | Relationship added | related to 0001786 |
2023-11-21 10:28 | geoffclare | Status | Resolved => Applied |
2023-11-21 10:28 | geoffclare | Tag Attached: applied_after_i8d3 | |
2024-06-11 08:53 | agadmin | Status | Applied => Closed |