|Anonymous | Login||2020-04-10 12:20 UTC|
|Main | My View | View Issues | Change Log | Docs|
|Viewing Issue Simple Details|
|ID||Category||Severity||Type||Date Submitted||Last Update|
|0000684||[1003.1(2013)/Issue7+TC1] System Interfaces||Comment||Error||2013-04-27 05:14||2020-03-24 15:48|
|Priority||normal||Resolution||Accepted As Marked|
|Final Accepted Text||See Note: 0001561|
|Summary||0000684: posix_fallocate() should be allowed to return ENOTSUP|
Not every filesystem can (or can easily) support posix_fallocate().
Copy-on-write filesystems in particular (e.g., ZFS, BTRFS, BSD4.4's LFS), need to reserve much more than just the number of blocks needed to hold the requested file length: they must be able to write all the blocks in the path from the data block(s) to the filesystem/volume root block(s). It can be difficult to predict the block sizes that will be needed (since in ZFS block sizes are variable, with files' block sizes growing as the file grows up and past to the maximum block size), or even the number of blocks that will be required (since that depends on the number of indirect block layers needed). In BSD4.4 LFS-type filesystems the blocks that might need writing are as follows: the data (leaf) blocks, indirect blocks, the inode, the data block of the inode file where the file's inode number falls, the indirect blocks of the inode file, the inode file inode, and the root. ZFS is LFS-like, but with more layers (datasets within datasets within volumes) and variable block sizes. It's those extra layers that make it difficult to predict how much space to reserve for the requested allocation: the block sizes and indirect block depth of layers above the given file can change drastically prior to the application writing to the given (offset, length). Moreover, the given file itself may have grown much larger by the time the application writes to the given (offset, length). This means that either the filesystem must constantly adjust reservations or must reserve blocks for the absolute worst-case needs at transaction-commit time, and this can be much, much more than the space the application requested.
Applications need to know whether a filesystem supports posix_fallocate(). Two methods are possible: an appropriate failure error code (ENOTSUP), and/or a pathconf()/fpathconf().
Note that AIX's posix_fallocate() manpage indicates failure with ENOTSUP for signalling lack of support.
Note that Solaris and OpenSolaris derivatives (Illumos, SmartOS, ...) fail with EINVAL instead of ENOTSUP, though Illumos and derivatives likely will soon return ENOTSUP instead of EINVAL.
Note that other systems (e.g., FreeBSD) that implement ZFS may also currently be failing with EINVAL.
Add ENOTSUP as a permitted error for posix_fallocate().
Alternatively or additionally, add a pathconf() for detecting posix_fallocate() support on a per-file basis.
Preferably both. Features like posix_fallocate() should generally be optional, and applications need a way to detect whether the feature is implemented.
A NOTES section could (should?) also be added noting that some implementations fail with EINVAL even though offset >= 0 and length > 0, and that this indicates that posix_fallocate() is not supported for the given file descriptor.
Note that there exist applications that either treat all posix_fallocate() as fatal (e.g., SQLite3 220.127.116.11), or have code to fake it up if either posix_fallocate() does not exist (SQLite3 18.104.22.168) or fails (glibc; see comments in SQLite3 22.214.171.124) source code.
Note too that SQLite3 apparently will stop using posix_fallocate() (see http://www.sqlite.org/src/info/1bbb4be1a2 [^] ).
One option might be for filesystems to treat posix_fallocate() as advisory rather than mandatory, given that this is often how applications handle it (either ignoring its failures or calling a stub that fakes up fallocation on failure, like glibc apparently does).
|Also, applications that use posix_fallocate() still have to check for and handle (and recover from) write()/pwrite() failures. Perhaps posix_fallocate() should be considered purely advisory, and should not fail when not implemented.|
If a robust application needs to know that writes will not fail due to insufficient space, it would be harmful for posix_fallocate to report false success. I do not think the standard should permit such behavior, but that's not to say that implementations can't offer it as an option. The simplest solution would be for implementations to simply document that they become non-conforming (and possibly dangerous) if you use filesystems that cannot or will not honor posix_fallocate. Ideally, such filesystems would offer an option to reserve storage (possibly by reserving an upper bound much larger than they'll actually need) so that they can still be used, albeit suboptimally, in deployments with strong robustness requirements.
The closest analogy I can come up with is memory overcommit for malloc and writable mappings. Clearly it's non-conforming for a correct application to crash when physical memory can't be instantiated for an allocation for which the implementation already reported success, but implementations do it anyway. To be conforming, and to meet the needs of robust applications, they need to provide a way to turn off overcommit, and need to document the dangerous behavior that can occur when it's enabled. Falsely reporting success for posix_fallocate is basically just overcommit of filesystem storage space, so my view is that it should be treated the same way.
I agree that whether posix_fallocate() lies should be a filesystem option, and in fact I'd already proposed as much on the illumos-zfs list.
HOWEVER, a) the application still has to be able to recover from write(2) failures (e.g., EIO) after a posix_fallocate(), b) it is already the case that posix_fallocate() often lies (see glibc). Add to this the idea of a filesystem option to allow posix_fallocate() to lie. This all means that applications that really want to count on posix_fallocate()... can't.
The inescapable conclusion is that today a portable application can only count on posix_fallocate() being advisory. It'd be nice if apps that want posix_fallocate() could tell if it will lie (so they can warn or fail). We can't fix it by adding "and we really mean it". But we can fix it by adding a pathconf() to indicate whether posix_fallocate() is advisory or mandatory for the given file.
Therefore I think we need not only to allow ENOTSUP and add a NOTES section, but we must also add a pathconf().
|The "Desired Action" should be updated (I don't think I can edit it) to say that we need both, to list ENOTSUP as a permitted failure case, and to specify a pathconf for determining whether the filesystem supports posix_fallocate() for a given file (and, if it does, then posix_fallocate() may not lie).|
edited on: 2013-10-03 15:32
The standard does not speak to this issue, and as such no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor.
If a system conforms to the Advisory Information Option it must also provide a filesystem that will support that option (see XSH 2.1.5). However, such systems may also support other, non-conforming filesystems, and the standard currently requires an EINVAL error for these. ENOTSUP is preferred so that this error can be distinguished from the other EINVAL cases.
Notes to the Editor (not part of this interpretation):
On page 1425 line 47084 delete (from the EINVAL error):
, or the underlying file system does not support this operation
Add on page 1425 after line 47088 (to the "shall fail" errors):
[ENOTSUP] The underlying file system does not support this operation.
Add to APPLICATION USAGE (page 1426, after the sentence on 47096):
Not all filesystems are capable of supporting posix_fallocate(), in which case the function will return ENOTSUP. However, if a system supports the option, there must be at least one filesystem that is capable of supporting this function and is available for use in conforming environments.
|Interpretation Proposed 6 Sep 2013|
edited on: 2013-10-02 21:43
I was looking over the pending Interps and this caught my eye. I don't agree with it and I feel if it is applied it will have to be rolled back for Issue 8, but that's doable as part of ER #687.
Re: #1561 Suggested changes, for clarity, if this interpretation stands:
First off, it's missing the closing " at end of text.
"However, if a system supports the option, there must be at least one filesystem that is capable of supporting this function."
"However, if a system supports the option, there also must be at least one filesystem that is capable of supporting this function provided with the implementation, though at application execution time an instance of this filesystem may not be accessible."
At Line 47084:
Remove ", or the underlying file system does not support this operation" from EINVAL description
Change "underlying file system" to
"file was opened for writing but has been marked read only since and so".
Note: That last shouldn't happen, but chmod( ) appears to allow it for any writable file system, according to its App Usage section. The 'this will not happen...' part is not normative so can't be relied on.
This report seems to me a file system quality-of-implementation issue more than a standard defect, though. Adding ENOTSUP here opens a can of worms for other write interfaces, and not just those that permit arbitrary seeks past end-of-file before attempting writes, which is why I feel the proposed text is a stopgap measure. Using it to indicate just that the posix_fadvise( ) optimizations aren't supported is plausible, as a warning rather than outright error, but the space should still be allocated on the media if other errors aren't detected. I think the Interpretation should be:
The standard is clear enough and right; implementations exhibiting other behaviors are non-conforming and shall not claim to support the Advisory Information Option.
The posix_fallocate( ) routine can be implemented using, except one, only the functionality of interfaces from <stdio.h> that all writable file systems are required to support, with the caveat that a thread private FILE record instance allocated as part of the thread state is needed to avoid conflicts with per-process open FILE * limits. This is necessary because there is no 'int flush(int fildes)' interface to complement 'int fflush(FILE *stream)' used as part of the Line 47064, '...ensure that any required storage for regular file data starting at offset and continuing for len bytes is allocated on the file system storage media' specification. The exception is also required to be supported as part of the Base conformance requirements.
Such a routine is usable to provide the required behavior whether or not a file system supports optimizing the allocation request in accordance with any posix_fadvise( ) settings that may have been applied. Conforming implementations therefore have no need to return EINVAL due to a file system not supporting the interface; only implementations providing non-conforming extensions require this.
Notes to the Editor (not part of this interpretation):
At Line 47084:
Remove ", or the underlying file system does not support this operation" from EINVAL description.
Change "underlying file system" to
"file was opened for writing but has been marked read only since and so".
After Line 47096, in Application Usage, add:
It is up to the application to write lock the segment being allocated using the same handle with fcntl( ) if it is unknown whether a handle open in another process might invalidate the effect of this interface before the application has a chance to write final data to this allocated segment and close that handle. This behavior is not normatively part of the interface so as to not slow down applications where it doesn't apply.
Replace in Future Directions, Line 47100, "None." with:
"It is recognized that some implementations currently permit use of this interface with some file systems in a manner that does not guarantee the behavior specified here and by other required interfaces because those file system implementations are unreliable when media usage is near its intrinsic or allocated capacity. A future version of the standard may require such file systems to be accessed as if it were read only media by conforming implementations to avoid data loss due to this unreliability."
In See Also, Line 47102, insert:
", fcntl( )" after "creat( )"
In other words, "yes we really mean it" isn't optional or needs to be added.
As to details, though I'm not providing the full code, it only needs from <stdio.h>:
int fseek(FILE *stream, long offset, int whence);
size_t fwrite(const void *restrict ptr, size_t size, size_t nitems, FILE *restrict stream);
int fclose(FILE *stream);
int fflush(FILE *stream);
and as CX extensions:
FILE *fdopen(int fildes, const char *mode);
int fseeko(FILE *stream, off_t offset, int whence);
From <unistd.h> it would use:
int ftruncate(int fildes, off_t length);
for a nominally complete basic implementation. All of these are part of the Base requirements.
Taking into account posix_fadvise( ) is implementation and file system dependent still, but defaulting to a basic mode when a file system doesn't support the optimizations is supposed to be transparent to all applications, that I can see (Line 47070-1). An internal fdopen( ) that uses a FILE reference as parameter rather than allocating one from the pool would be needed, but the mechanics of filling in the FILE record would be the same; it just wouldn't return allocation errors. If there was a 'int flush(int fildes)' interface xxx(int fildes) versions of these interfaces could be used directly without fdopen/fclose. An internal fdtruncate( ) that operates the same as ftruncate( ) but is passed a FILE * parameter might be useful also, but isn't really needed - ftruncate( ) can be attempted before the fdopen( ) to extend the file if necessary.
The cases specified in the description are for when the file systems are caching data prior to a flush. After a flush or file close if there is a physical space problem because of metadata reallocation that's an issue for that file system and the hosting environment so is beyond the scope of the standard. What overcommits or undercommits are required is not the standards concern. Such activities are expected to be transparent to an application once the interface reports success, even if optimal performance suffers accommodating this. A range lock might be needed to prevent another thread or process from shrinking the file too much, but that applies with any file system.
As I see it, when such problems exist the implementation shouldn't allow any writable mounts of that file system because it doesn't support the expectations of the basic c99 FILE * based stream operations, much less all the POSIX fildes operations where writes are concerned. A copy-on-write file system should return ENOSPC or EIO if it can't fully flush the file extension to disk as the result of one of the internal write( ) commits of the fflush( ) interface. Even for a file system that does sparse allocations when large blocks of zeros are detected the routine can succeed by writing a data value or pattern that disables this behavior.
So, IMO those implementations that lie about success are simply non-conforming for Issue 7. Prediction of space requirements isn't necessary; the ability to rollback a flush attempt is, even for a single fputc( ) that appends to a file. They may return from fflush( ) what c99 expects as errors, but not the CX errors POSIX defines and requires to enable rollback detection and act accordingly, so it seems.
This affects ER #687 as well. Read only media wouldn't support this, regardless of filesystem, and this should be discernable via f/pathconf( ) separate from the strictly advisory interfaces, as proposed there, before an open( ) of files on that media are attempted. Those file systems that do have issues would be candidates for mounting in a read only mode instead of read write rather than return ENOTSUP, and as proposed here I think the language now should reflect that requirement instead as a future direction. If it does the interface also becomes a candidate for moving to the Base in Issue 8. Then the pathconf( ) could use 'media read only mounted, file is read only, supports basic write modes[ADV], supports fadvise( ) optimizations' as its return values.
|Interpretation Proposed 21 Feb 2014|
|Interpretation Approved: 25 March 2014|
|2013-04-27 05:14||nico||New Issue|
|2013-04-27 05:14||nico||Name||=> Nicolas Williams|
|2013-04-27 05:14||nico||Organization||=> Cryptonector LLC|
|2013-04-27 05:14||nico||Section||=> posix_fallocate|
|2013-04-27 05:14||nico||Page Number||=> 1|
|2013-04-27 05:14||nico||Line Number||=> 1|
|2013-04-27 05:40||nico||Note Added: 0001554|
|2013-04-27 05:40||nico||Issue Monitored: nico|
|2013-04-27 16:56||nico||Note Added: 0001555|
|2013-04-27 21:36||dalias||Note Added: 0001556|
|2013-04-29 18:15||nico||Note Added: 0001557|
|2013-04-30 00:13||nico||Note Added: 0001558|
|2013-05-02 16:18||nick||Page Number||1 => 1425|
|2013-05-02 16:18||nick||Line Number||1 => 47078|
|2013-05-02 16:18||nick||Interp Status||=> ---|
|2013-05-02 16:18||nick||Note Added: 0001561|
|2013-05-02 16:19||nick||Interp Status||--- => Pending|
|2013-05-02 16:19||nick||Final Accepted Text||=> See Note: 0001561|
|2013-05-02 16:19||nick||Status||New => Interpretation Required|
|2013-05-02 16:19||nick||Resolution||Open => Accepted As Marked|
|2013-05-02 16:20||nick||Note Edited: 0001561|
|2013-05-02 16:21||nick||Note Edited: 0001561|
|2013-05-02 16:25||nick||Tag Attached: tc2-2008|
|2013-05-02 17:10||nick||Issue cloned||0000687|
|2013-05-02 17:10||nick||Relationship added||related to 0000687|
|2013-09-06 04:56||ajosey||Interp Status||Pending => Proposed|
|2013-09-06 04:56||ajosey||Note Added: 0001813|
|2013-09-27 19:06||shware_systems||Note Added: 0001857|
|2013-10-02 21:43||shware_systems||Note Edited: 0001857|
|2013-10-03 15:26||geoffclare||Tag Detached: tc2-2008|
|2013-10-03 15:26||geoffclare||Tag Attached: issue8|
|2013-10-03 15:31||geoffclare||Note Edited: 0001561|
|2013-10-03 15:32||geoffclare||Note Edited: 0001561|
|2013-10-03 15:32||geoffclare||Interp Status||Proposed => Pending|
|2014-02-21 15:41||ajosey||Interp Status||Pending => Proposed|
|2014-02-21 15:41||ajosey||Note Added: 0002161|
|2014-03-25 13:42||ajosey||Interp Status||Proposed => Approved|
|2014-03-25 13:42||ajosey||Note Added: 0002202|
|2020-03-24 15:48||geoffclare||Status||Interpretation Required => Applied|
|Mantis 1.1.6[^] Copyright © 2000 - 2008 Mantis Group|