0001635: iconv: please be more explicit in input-not-convertible case

ID	Project	Category	View Status	Date Submitted	Last Update

0001635	1003.1(2016/18)/Issue7+TC2	Base Definitions and Headers	public	2023-02-21 00:14	2024-06-13 16:37

Reporter	steffen	Assigned To
Priority	normal	Severity	Editorial	Type	Clarification Requested
Status	New	Resolution	Open

Name	steffen
Organization
User Reference
Section	iconv
Page Number	1123
Line Number	38014
Interp Status
Final Accepted Text


Summary	0001635: iconv: please be more explicit in input-not-convertible case
Description	Issue 1007 resolves this to:

    If iconv() encounters a character in the input buffer that is valid, but for which an identical character does not exist in the output codeset:

    * If either the //IGNORE or the //NON_IDENTICAL_DISCARD indicator suffix was specified when the conversion descriptor cd was opened, the character shall be discarded but shall still be counted in the return value of the iconv() call.
    * If the //TRANSLIT indicator suffix was specified when the conversion descriptor cd was opened, an implementation-defined transliteration shall be performed, if possible, to convert the character into one or more characters of the output codeset that best resemble the input character. The character shall be counted as one character in the return value of the iconv() call, regardless of the number of output characters.
    * If no indicator suffix was specified when the conversion descriptor cd was opened, or the //TRANSLIT indicator suffix was specified but no transliteration of the character is possible, iconv() shall perform an implementation-defined conversion on the character and it shall be counted in the return value of the iconv() call.

However, as Martin Sebor stated in the issue description:

    The specification for the iconv() function assumes that every input sequence that is valid in the source codeset is convertible to some sequence in the destination codeset. In particular, the specification doesn't allow the function to fail when a valid sequence in the source codeset cannot be represented in the destination codeset. As an example where this assumption doesn't hold, consider a conversion from UTF-8 to ISO-8859 where a large number of source characters don't have equivalents in the destination codeset. A survey of a subset of existing implementations shows that they fail with EILSEQ in such cases, despite the specification defining the error condition as "Input conversion stopped due to an input byte that does not belong to the input codeset."

And this is true: the GNU C library and GNU libiconv seem to fail output conversion immediately, with the same EILSEQ error that denotes invalid input data. (A much more drastic error, .. is it!?!)
Desired Action	Please be more explicit and denote that implementations exist which behave like GNU C-lib iconv / libiconv. That is to say that "implementation-defined conversion" may mean no conversion at all, but an immediate stop.

It would be tremendous if the standard could define a means that programmers can react to, because, due to the restrictions of the iconv interface, it is impossible to decide what the error was. A programmer knows nothing of the input or output character set, how many bytes may make up a character, how many were consumed / produced, or whether conversion replacements were stored. (In practice all others known to me do place some character and continue.)

This refers to GNU library bug report https://sourceware.org/bugzilla/show_bug.cgi?id=29913 where the honourable author of GNU iconv, and YES!, the GNU approach has lots of merits!, but it should be possible to differentiate between the errors. Better even would be an explicit //CONVERR-STOP-WITH-ENODATA modifier.

refers to gnulib source files where the same approach is implemented portably, it seems, and the cost is tremendous, because of all the shortcomings of the iconv interface! Like approaching cautiously byte-by-byte until a conversion succeeds!

    for (insize = 1; inptr + insize <= inptr_end; insize++)
      {
        res = iconv (cd, (ICONV_CONST char *) &inptr, &insize, &outptr, &outsize);
        if (!(res == (size_t)(-1) && errno == EINVAL))
          break;
        /* iconv can eat up a shift sequence but give EINVAL while
           attempting to convert the first character.  E.g. libiconv does
           this.  */
        if (inptr > inptr_before)
          {
            res = 0;
            break;
          }
      }

This is ridiculous!
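For readers who want to try the byte-by-byte technique in isolation, the following is a minimal self-contained sketch of it. It is not the gnulib code: `convert_one_char` is a hypothetical helper name, and the sketch assumes an implementation that reports an incomplete multibyte character with EINVAL without consuming it.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: feed growing prefixes of the input to iconv()
   until one character (or a shift sequence) has been consumed.
   Returns the number of input bytes consumed, or 0 with errno set. */
static size_t convert_one_char(iconv_t cd, char **inp, size_t avail,
                               char **outp, size_t *outleft)
{
    char *start = *inp;

    for (size_t insize = 1; insize <= avail; insize++) {
        size_t left = insize;
        size_t res = iconv(cd, inp, &left, outp, outleft);

        if (res != (size_t)-1)
            return (size_t)(*inp - start);  /* a character was converted */
        if (errno != EINVAL)
            return 0;                       /* e.g. EILSEQ or E2BIG */
        if (*inp > start)
            return (size_t)(*inp - start);  /* only a shift sequence was eaten */
        /* Incomplete character: retry with one more input byte. */
    }
    errno = EINVAL;                         /* input ends mid-character */
    return 0;
}
```

A caller would invoke this once per character and inspect the output after each call, which is exactly the per-character overhead complained about above.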
Tags	No tags attached.

steffen 2023-02-21 18:20 reporter bugnote:0006164	P.S.: to elaborate some more, any implementation which wants to conform to the upcoming POSIX standard needs to pass the configured state (//IGNORE, //TRANSLIT ..) along its internal code path, so this addition would be an "easy" one. Adding this very helpful addition to make the behaviour of the most widely used (and standards-incompatible) implementation explicitly addressable allows future programmers to know what they are doing. (By other means than performing compile-time tests and fixing behaviour according to the compile-time-tested environment.) I truly believe in the honourable Bruno Haible's approach, if it is addressable like that. The only other implementation that allows programmers to realize their own output conversion failure handling is that of AIX (looking at gnulib), because that uses the NUL byte as an implementation-defined conversion, but that, of course, will fail for data which contains embedded NULs (which for one is otherwise graspable by iconv(3), and second will fail for all multi-byte as in 16-bit or 32-bit encodings per se). In general the real-life situation is very bad, just look at the efforts of character set conversion in the widely used and widely available libarchive. This addition could make things a bit better.

bhaible 2024-06-11 23:42 reporter bugnote:0006812	Regarding the case "when a valid sequence in the source codeset cannot be represented in the destination codeset", here is how the various implementations behave:

* GNU libc and GNU libiconv and win-iconv (https://github.com/win-iconv/win-iconv):
  - They fail the conversion with EILSEQ, when the to_codeset did not have a //TRANSLIT or //IGNORE suffix.
  - If the to_codeset had a //IGNORE suffix, the character is discarded, i.e. produces 0 bytes in the output.
  - If the to_codeset had a //TRANSLIT suffix, then a transliteration is attempted. It may do substitutions such as ½ → 1/2 or å → aa. Transliterations between scripts (e.g. from Cyrillic to Latin script) are generally not done.
* musl libc produces a '*' character in the output. FreeBSD, NetBSD produce a '?' character in the output.
* Solaris attempts a transliteration if enabled, otherwise it produces a '?' character in the output.
* IRIX produces a NUL character in the output.
* macOS 14 iconv always does transliteration,
  - regardless whether a //TRANSLIT suffix was present in to_codeset or not,
  - regardless whether a //IGNORE suffix was present in to_codeset or not,
  - regardless whether iconvctl ICONV_SET_TRANSLITERATE was done on the conversion descriptor,
  - regardless whether iconvctl ICONV_SET_DISCARD_ILSEQ was done on the conversion descriptor,
  - regardless whether iconvctl ICONV_SET_ILSEQ_INVALID was done on the conversion descriptor.
  The transliteration result depends on the input character. In some cases, the result is merely a '?' character. And the return value (count of "non-identical conversions") is always 0.
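To check which of the behaviours listed above a given system exhibits, a small probe along the following lines may help. This is only a sketch and is not taken from any of the implementations discussed: `probe_unconvertible` and `struct probe_result` are hypothetical names, and converting U+00E9 from UTF-8 to ASCII is an assumed example of a valid but unrepresentable character.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Outcome of converting one valid-but-unrepresentable character. */
struct probe_result {
    int    opened;       /* 0 if iconv_open() failed                  */
    size_t ret;          /* iconv() return value                      */
    int    err;          /* errno if ret == (size_t)-1, otherwise 0   */
    size_t in_consumed;  /* number of input bytes consumed            */
    size_t out_len;      /* number of output bytes produced           */
    char   out[16];      /* output bytes, NUL-terminated              */
};

/* Convert U+00E9 (UTF-8 bytes C3 A9) to `tocode`, for example
   "ASCII", "ASCII//TRANSLIT" or "ASCII//IGNORE". */
static struct probe_result probe_unconvertible(const char *tocode)
{
    struct probe_result r;
    memset(&r, 0, sizeof r);

    iconv_t cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t)-1)
        return r;
    r.opened = 1;

    char in[] = "\xC3\xA9";
    char *inp = in;
    size_t inleft = 2;
    char *outp = r.out;
    size_t outleft = sizeof r.out - 1;   /* keep room for the NUL */

    errno = 0;
    r.ret = iconv(cd, &inp, &inleft, &outp, &outleft);
    r.err = (r.ret == (size_t)-1) ? errno : 0;
    r.in_consumed = 2 - inleft;
    r.out_len = (sizeof r.out - 1) - outleft;

    iconv_close(cd);
    return r;
}
```

Going by the survey above, GNU libc would be expected to report EILSEQ with no output for a plain "ASCII" target, while the substituting implementations would report success with a replacement byte in `out`. The probe itself assumes neither.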

bhaible 2024-06-11 23:47 reporter bugnote:0006813	The documentation for GNU libc had not been up-to-date for many years, but has been extended in 2023: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=470b97d59797b040e28103a0ba0f616d95f0ed93 https://man.archlinux.org/man/iconv.3

bhaible 2024-06-12 00:16 reporter bugnote:0006814	Let me explain
* why in GNU libc, GNU libiconv, and win-iconv, when //TRANSLIT or //IGNORE is not specified, the iconv() conversion stops when it encounters an unconvertible character,
* why this would also be a useful behaviour for the other iconv() implementations.

The output of iconv() conversion is generally not piped to /dev/null, but instead becomes visible to a user in one form or the other. Therefore, applications that make use of iconv must not use low-quality conversions, only high-quality conversions. For example, the string "Frédéric", when mapped to ASCII, could become "Frederic" (humanly acceptable, high quality) or "Frdric" or "Fr?d?ric" (not acceptable, because too low quality).

An application that wants to use iconv() for character set conversion therefore needs an easy way to be alerted by the iconv() implementation that some "non-identical conversion" has been performed, so that it can test whether that non-identical conversion is of high or of low quality.

With an implementation that, like GNU libc or GNU libiconv, reports the situation by stopping the conversion loop and leaving *inbuf pointing at the unconvertible character, the application can call iconv() on the entire input at once, and profit from the efficient conversion loop inside the iconv() engine.

With an implementation that, as POSIX currently specifies, always returns the conversion result for the entire input, the application does get an indication that some non-identical conversion was performed (through the return value of iconv()), but
- it does not get an indication where the unconvertible character is,
- it cannot back up to that point, because in the presence of stateful encodings you cannot just "jump back" to an arbitrary input pointer.

The only way the application has is to feed the input bytes to iconv() slowly (the first byte, then the first 2 bytes, then the first 3 bytes, and so on), until iconv() converts a character. This is the only way to do a conversion and check its quality as it goes. Of course, as Steffen Nurpmeso noted, this is inefficient. Implementations like the one in FreeBSD are optimized for converting as much as possible in one swoop, and invoking iconv() as many times as there are bytes in the input does not make use of this optimized implementation.

The net result is that the application may make use of the optimized implementation only for conversions to UTF-8 or UCS; for the other ones (conversions to any encoding that is not full Unicode) the optimizations in the implementation are not used.

Conclusion: For all iconv implementations, it would be useful to stop when a "non-identical conversion" is about to happen, when neither //TRANSLIT nor //IGNORE has been specified.
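The one-shot usage that a stopping implementation allows can be sketched as follows. This is an illustration under stated assumptions, not code from any project mentioned here: `convert_with_fallback` and `utf8_seq_len` are hypothetical names, the input is assumed to be UTF-8, and the target is assumed to be a stateless, ASCII-compatible encoding in which '?' is a single byte. Because EILSEQ is currently ambiguous, the sketch inspects the input bytes itself to guess whether the stop was caused by invalid input or by an unrepresentable character.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Length of the UTF-8 sequence at p (n bytes available), or 0 if the
   bytes do not form one. Checks the lead byte and continuation bytes
   only, not overlong forms or surrogates. */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    size_t len;
    if (n == 0) return 0;
    if (p[0] < 0x80) len = 1;
    else if (p[0] >= 0xC2 && p[0] <= 0xDF) len = 2;
    else if (p[0] >= 0xE0 && p[0] <= 0xEF) len = 3;
    else if (p[0] >= 0xF0 && p[0] <= 0xF4) len = 4;
    else return 0;
    if (len > n) return 0;
    for (size_t i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80) return 0;
    return len;
}

/* Convert UTF-8 input to `tocode`, calling iconv() on all remaining
   input each time. When iconv() stops with EILSEQ at a structurally
   valid character, write '?' and skip that character.
   Returns the number of output bytes, or (size_t)-1 with errno set. */
static size_t convert_with_fallback(const char *tocode,
                                    const char *in, size_t inlen,
                                    char *out, size_t outcap,
                                    size_t *substituted)
{
    iconv_t cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;      /* iconv() does not modify the input */
    size_t inleft = inlen;
    char *outp = out;
    size_t outleft = outcap;
    int err = 0;
    *substituted = 0;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;                               /* everything converted */
        if (errno != EILSEQ) { err = errno; break; }   /* E2BIG, EINVAL */

        size_t len = utf8_seq_len((const unsigned char *)inp, inleft);
        if (len == 0) { err = EILSEQ; break; }   /* genuinely invalid input */
        if (outleft == 0) { err = E2BIG; break; }

        *outp++ = '?';                           /* application's own choice */
        outleft--;
        inp += len;
        inleft -= len;
        ++*substituted;
    }

    iconv_close(cd);
    if (err) { errno = err; return (size_t)-1; }
    return outcap - outleft;
}
```

On an implementation that substitutes on its own, the loop simply finishes after the first call and `*substituted` stays 0, so the application never learns where the low-quality conversions happened.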

bhaible 2024-06-12 00:24 reporter bugnote:0006815	The reporter of this defect is right regarding defect 1007: The intent of defect 1007 was to allow applications to make a more efficient use of the iconv() engine, while distinguishing high-quality conversions from low-quality conversions. But the resolution of defect 1007, to include the facilities that are present in Solaris iconv(), does not resolve the problem: It does not allow iconv() to fail upon non-identical conversions, and thus it does not allow the application to make use of an optimized conversion loop, while distinguishing high-quality conversions from low-quality conversions.

eblake 2024-06-13 16:37 manager bugnote:0006817	Based on the discussions of the 2024-06-13 call, the Austin Group understands the desire to have a means for an iconv() implementation that stops early when a transliteration is not possible, despite recognizing valid characters in the input. Would it work to utilize a different errno in this situation, perhaps ENOTSUP or EPROTO, to make it easier for applications to distinguish between a stop because of unrecognized input (EILSEQ) vs unrepresentable output (the new errno)?

0001635:0006812 mentioned ICONV_SET_* flags on macOS; that appears to be used with a non-standard interface iconvctl(), and while it may be possible to standardize that interface and a new ICONV_SET_* flag as a means for opting into the new errno value behavior, it seems like a much bigger request at this time.

Regarding the different defaults, we could add //NOTRANSLIT and make it unspecified whether //TRANSLIT or //NOTRANSLIT is the default.
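How an application might act on a distinct errno can be sketched as below. Nothing here is specified behaviour: no errno value has been chosen, so the value is passed in as a parameter (for example ENOTSUP or EPROTO, the two candidates floated above), and `classify_iconv_stop` and `enum iconv_stop` are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>

/* Why an iconv() call stopped, from the application's point of view. */
enum iconv_stop {
    STOP_NONE,              /* call succeeded                          */
    STOP_INVALID_INPUT,     /* EILSEQ: bytes not in the input codeset  */
    STOP_UNREPRESENTABLE,   /* proposed new errno: valid input, but no
                               equivalent in the output codeset        */
    STOP_INCOMPLETE_INPUT,  /* EINVAL: input ends mid-character        */
    STOP_OUTPUT_FULL,       /* E2BIG: output buffer exhausted          */
    STOP_OTHER
};

/* `res` is the iconv() return value and `err` the errno observed right
   after the call. `unrepresentable_errno` is the (hypothetical) value
   an implementation would use for unrepresentable output. */
static enum iconv_stop classify_iconv_stop(size_t res, int err,
                                           int unrepresentable_errno)
{
    if (res != (size_t)-1)
        return STOP_NONE;
    if (err == unrepresentable_errno)
        return STOP_UNREPRESENTABLE;
    switch (err) {
    case EILSEQ: return STOP_INVALID_INPUT;
    case EINVAL: return STOP_INCOMPLETE_INPUT;
    case E2BIG:  return STOP_OUTPUT_FULL;
    default:     return STOP_OTHER;
    }
}
```

With today's GNU behaviour both stop reasons arrive as EILSEQ, so the first two failure cases cannot be told apart by errno alone; a distinct value would make the STOP_UNREPRESENTABLE branch reachable without inspecting the input bytes.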

Date Modified	Username	Field	Change
2023-02-21 00:14	steffen	New Issue
2023-02-21 00:14	steffen	Name	=> steffen
2023-02-21 00:14	steffen	Section	=> iconv
2023-02-21 00:14	steffen	Page Number	=> 1123
2023-02-21 00:14	steffen	Line Number	=> 38014
2023-02-21 18:20	steffen	Note Added: 0006164
2023-03-06 16:35	nick	Relationship added	related to 0001007
2024-06-11 23:42	bhaible	Note Added: 0006812
2024-06-11 23:47	bhaible	Note Added: 0006813
2024-06-12 00:16	bhaible	Note Added: 0006814
2024-06-12 00:24	bhaible	Note Added: 0006815
2024-06-13 16:37	eblake	Note Added: 0006817
