0001008: 1. clarify iconv(3) reset usage; 2. truly support Unicode character input

ID	Project	Category	View Status	Date Submitted	Last Update

0001008	1003.1(2013)/Issue7+TC1	System Interfaces	public	2015-11-16 22:24	2024-06-11 09:02

Reporter	steffen	Assigned To	ajosey
Priority	normal	Severity	Objection	Type	Clarification Requested
Status	Closed	Resolution	Accepted As Marked

Name	steffen
Organization
User Reference
Section	Vol.2, System Interfaces, iconv
Page Number	1109
Line Number	37302 ff.
Interp Status	---
Final Accepted Text	0001008:0003326


Summary	0001008: 1. clarify iconv(3) reset usage; 2. truly support Unicode character input
Description	For the (iconv_t, NULL, NULL, &OBUF, &OBUF_LEN) use case, POSIX says: "When iconv( ) is called in this way,.. [it] shall place, into the output buffer, the byte sequence to change the output buffer to its initial shift state."
POSIX states in a different place (Vol. 1, Base Definitions, 6.4.1 State-Dependent Character Encodings, 2.; p. 133, l. 3830 ff.): "A utility that divides, truncates, or extracts substrings from statefully encoded data shall produce output that contains locking shifts at the beginning or end of the resulting data, if appropriate, to retain correct state information."
Effectively a string must be "atomic" regarding its locking state, otherwise it could not be used by itself. Therefore a reset sequence has to be placed if "normal US-ASCII" text is about to follow (e.g., after an RFC 2047 encoded word in an e-mail header that uses a stateful encoding). I wonder whether placing this reset sequence shouldn't be a mandatory task before iconv_close(), since only then would
$ cat file1 file2 > file3
work according to the above wording if file1 has been created from strings that have been converted via iconv() - POSIX doesn't say that a newline character causes a locking shift state reset.
Making the (iconv_t, NULL, NULL, &OBUF, &OBUF_LEN) case mandatory before iconv_close() would enable character set conversion to reliably detect and compose ISO 10646 / Unicode constructs like decomposed character sequences and even grapheme cluster boundaries. These "techniques" are basic concepts of Unicode and their understanding may be mandatory in order to be able to perform a correct input charset to output charset conversion. On the mailing list examples have been given, one replication of which can be found in [1].
It has to be noted that today "hacks" exist to work around the absence of the envisaged new requirement, e.g., many iconv implementations ship with a special "UTF-8-MAC" character set that I think does nothing but support decomposed characters. ..Ok, it seems Apple has chosen not to honour the Unicode standard completely, and not to decompose some character ranges due to some internal compatibility problems [2]. Besides that, it is decomposed Unicode.
[1] http://austingroupbugs.net/view.php?id=249#c2923
[2] https://developer.apple.com/library/mac/qa/qa1173/_index.html
Desired Action	1. It should be clarified whether it is necessary to explicitly place a reset sequence after input processing is complete, before iconv_close(). Since iconv() doesn't know that the end of the input has been reached, it could otherwise not ensure that the resulting data is valid according to the POSIX specification.
2. If the above is true and the clarification will be applied in the envisaged way, POSIX should enhance the iconv description so that it not only talks about state-dependent encodings but also considers Unicode / ISO 10646 text processing requirements, since output character set character composition may be possible only after applying Unicode composition and grapheme cluster boundary detection to input data, which may require holding back data output until text processing detects a true boundary that can be emitted to the output character set.
Tags	tc3-2008

steffen 2015-11-17 16:59 reporter bugnote:0002966	1. Excuse me please, this should have been committed to Issue7+TC1; I searched before I started editing, and obviously I forgot to switch back the form. I don't know how this could be changed except by opening another issue.
2. I also think, apart from the above, that the text at lines 37302-37306,
"For state-dependent encodings, the conversion descriptor cd is placed into its initial shift state by a call for which inbuf is a null pointer, or for which inbuf points to a null pointer. When iconv( ) is called in this way, and if outbuf is not a null pointer or a pointer to a null pointer, and outbytesleft points to a positive value, iconv( ) shall place, into the output buffer, the byte sequence to change the output buffer to its initial shift state."
should be changed to
"For state-dependent encodings, the conversion descriptor cd is placed into its initial shift state by a call for which inbuf is a null pointer, or for which inbuf points to a null pointer. When iconv( ) is called in this way, and if outbuf is not a null pointer or a pointer to a null pointer, and outbytesleft points to a positive value, iconv( ) shall place, into the output buffer, the byte sequence to change the output buffer to its initial shift state, if the former state of the conversion descriptor cd mandates so."

geoffclare 2016-08-04 16:35 manager bugnote:0003326 Last edited: 2016-08-11 15:22	Add to application usage as a new paragraph after P1110, L37364: It is the responsibility of the application to ensure that, if the output codeset has a locking-shift encoding, the output buffer is returned to its initial shift state when conversion is completed. This can be accomplished by calling iconv() with inbuf as a null pointer, or with inbuf pointing to a null pointer, before calling iconv_close(). Since the standard does not provide a way to query whether a codeset has a locking-shift encoding, it is recommended that applications always call iconv() in this way before calling iconv_close().
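
The call sequence recommended in this note could look roughly like the following minimal sketch. The helper name convert_and_reset and its parameters are hypothetical and appear neither in the standard nor in this report; it assumes the whole input and output fit into the caller's buffers, and it treats every iconv() error (including E2BIG) as fatal instead of retrying with a larger buffer.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of POSIX): convert inlen bytes at `in`
 * from codeset `from` to codeset `to`, writing into `out`.
 * Returns the number of bytes written, or (size_t)-1 on any failure. */
static size_t convert_and_reset(const char *to, const char *from,
                                const char *in, size_t inlen,
                                char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv() takes char **inbuf; it advances the pointer but does not
     * write to the input bytes, so the cast is safe here. */
    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    int ok = 1;

    /* Step 1: convert the input data. */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        ok = 0;

    /* Step 2: inbuf is a null pointer, outbuf is not. For a
     * state-dependent output codeset this emits the byte sequence that
     * returns the output to its initial shift state. It is done
     * unconditionally, because the application cannot query whether the
     * codeset has a locking-shift encoding. */
    if (ok && iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1)
        ok = 0;

    /* Step 3: only now release the conversion descriptor. */
    if (iconv_close(cd) == -1)
        ok = 0;

    return ok ? (size_t)(outp - out) : (size_t)-1;
}
```

A caller would pass codeset names accepted by its iconv_open(), for example convert_and_reset("UTF-8", "ASCII", "hello", 5, buf, sizeof buf); the set of valid names is implementation-defined. With stateless codesets such as these the reset call is expected to add no bytes, which is why always making it costs little.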

geoffclare 2016-08-11 15:23 manager bugnote:0003335	0001008:0003326 was edited during the 2016-08-11 teleconference to add the last sentence.

Date Modified	Username	Field	Change
2015-11-16 22:24	steffen	New Issue
2015-11-16 22:24	steffen	Status	New => Under Review
2015-11-16 22:24	steffen	Assigned To	=> ajosey
2015-11-16 22:24	steffen	Name	=> steffen
2015-11-16 22:24	steffen	Section	=> Vol.2, System Interfaces, iconv
2015-11-16 22:24	steffen	Page Number	=> 1109
2015-11-16 22:24	steffen	Line Number	=> 37302 ff.
2015-11-17 16:59	steffen	Note Added: 0002966
2015-11-17 17:03	geoffclare	Project	1003.1(2008)/Issue 7 => 1003.1(2013)/Issue7+TC1
2016-08-04 16:35	geoffclare	Note Added: 0003326
2016-08-04 16:35	geoffclare	Interp Status	=> ---
2016-08-04 16:35	geoffclare	Final Accepted Text	=> 0001008:0003326
2016-08-04 16:35	geoffclare	Status	Under Review => Resolved
2016-08-04 16:35	geoffclare	Resolution	Open => Accepted As Marked
2016-08-04 16:36	geoffclare	Tag Attached: tc3-2008
2016-08-11 15:22	geoffclare	Note Edited: 0003326
2016-08-11 15:23	geoffclare	Note Added: 0003335
2019-10-21 13:42	geoffclare	Status	Resolved => Applied
2024-06-11 09:02	agadmin	Status	Applied => Closed
