Austin Group Defect Tracker

Aardvark Mark IV


ID 0000601
Category [1003.1(2008)/Issue 7] System Interfaces
Severity Comment
Type Clarification Requested
Date Submitted 2012-08-15 15:57
Last Update 2019-06-10 08:55
Reporter nick
View Status public
Assigned To ajosey
Priority normal
Resolution Accepted As Marked
Status Closed
Name Nick Stoughton
Organization USENIX
User Reference nms-mbsnrtowcs-001
Section mbsnrtowcs
Page Number 1277
Line Number 41975
Interp Status Approved
Final Accepted Text Note: 0001568
Summary 0000601: mbsnrtowcs clarification
Description In austin-group-l:archive/latest/17532 Matthew Dempsky posed the following question:

On Ubuntu 10.04, the code below prints "0 2".  This is the behavior
that I think logically makes sense (and that I was intending to
implement for OpenBSD).

However, my reading of mbsnrtowcs() description in Issue 7 is that the
correct output (assuming "en_US.UTF-8" is a valid UTF-8 based locale)
should be "0 0".

Issue 7 says:

"""
If dst is not a null pointer, the pointer object pointed to by src
shall be assigned either a null pointer (if conversion stopped due to
reaching a terminating null character) or the address just past the
last character converted (if any).
"""

However, in my test program, mbs+2 is in the *middle* of a
[multi-byte] character, not "just past" a [multi-byte] character.
Ubuntu 10.04's behavior would be consistent if the description was
"just past the last input byte consumed".

Am I misunderstanding something?  Or is there a bug in either Ubuntu
10.04's implementation or the POSIX wording?


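#include <wchar.h>
#include <locale.h>
#include <string.h>
#include <stdio.h>

wchar_t wcs[100];
char mbs[100];

int main(void)
{
        setlocale(LC_CTYPE, "en_US.UTF-8");
        /* U+754C as three UTF-8 bytes, plus the terminating null */
        memcpy(mbs, "\xe7\x95\x8c", 4);
        const char *s = mbs;
        /* nmc = 2: the input buffer ends in the middle of the character */
        printf("%u ", (unsigned)mbsnrtowcs(wcs, &s, 2, 100, NULL));
        /* number of bytes by which src was advanced */
        printf("%u\n", (unsigned)(s - mbs));
        return 0;
}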


Further discussion noted that 'C99 does in fact state that
mbstate_t's conversion state includes tracking "the position within a
multibyte character", so multibyte character string inputs do not
necessarily need to be processed exclusively at multibyte character
boundaries. E.g., it's okay to call mbrtowc() to process one byte at
a time of a multibyte string.'

But more importantly, do any implementations of mbsnrtowcs() print "0
0"? Glibc, FreeBSD, and OS X all print "0 2". If no implementation
actually prints "0 0", then I think it makes sense to revise the
wording for mbsnrtowcs() to "just past the last byte processed"
instead of "just past the last multibyte character converted".
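
The byte-at-a-time use of mbrtowc() mentioned in the quoted discussion can be illustrated with a minimal sketch. It is not part of the original report. The helper names use_utf8_locale() and feed_bytewise() are hypothetical, and the sketch assumes that a UTF-8 locale under one of the listed names is installed.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
   Returns 1 if one could be selected, 0 otherwise. */
static int use_utf8_locale(void)
{
    static const char *const names[] = {
        "C.UTF-8", "en_US.UTF-8", "C.utf8", "en_US.utf8"
    };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Hypothetical helper: feed len bytes to mbrtowc() one byte per call,
   storing each return value in rets[]. Returns the last wide character
   that was completed, or 0 if none was. */
static wchar_t feed_bytewise(const char *bytes, size_t len, size_t *rets)
{
    mbstate_t st;
    wchar_t wc = 0, last = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    for (size_t i = 0; i < len; i++) {
        rets[i] = mbrtowc(&wc, bytes + i, 1, &st);
        if (rets[i] == (size_t)-1)      /* encoding error: stop */
            break;
        if (rets[i] != (size_t)-2)      /* a character was completed */
            last = wc;
    }
    return last;
}
```

Called as feed_bytewise("\xe7\x95\x8c", 3, rets) in a UTF-8 locale, the first two calls would be expected to return (size_t)-2 (incomplete character, position recorded in the mbstate_t object) and the third to return 1, the number of bytes in that call that completed the character.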

---
Given that a number of implementations do not follow the apparent requirements of the standard to process the src string character by character rather than byte by byte, I believe a formal interpretation is required.
Desired Action EITHER: Add explicit permission for mbsnrtowcs to advance the src pointer to the middle of a multibyte character when the nmc value does not include all the bytes of the final character. (This matches existing behavior on at least Glibc, FreeBSD, and OS X.)

OR: Make a partial conversion (i.e. where (*src)[nmc] is in the middle of a multi-byte character) into an illegal sequence (as if the src string was actually truncated at this point).

OR: further explain why the current wording is correct in the light of several implementations that implement something else.

This is change 1 above, and the preferred approach:

At page 1277 line 41975-41979 change:

If dst is not a null pointer, the pointer object pointed to by src shall be assigned either a null
pointer (if conversion stopped due to reaching a terminating null character) or the address just
past the last character converted (if any). If conversion stopped due to reaching a terminating
null character, and if dst is not a null pointer, the resulting state described shall be the initial
conversion state.


to


If dst is not a null pointer, the pointer object pointed to by src shall be assigned either a null
pointer (if conversion stopped due to reaching a terminating null character) or the address just
past the last byte converted (if any). If conversion stopped due to reaching a terminating
null character or nmc bytes into src, and if dst is not a null pointer, the resulting state described
shall be the initial conversion state.
Tags c99, tc2-2008
Attached Files

- Relationships
parent of 0000616 Applied ajosey mbsnrtowcs clarification

-  Notes
(0001327)
mdempsky (reporter)
2012-08-15 17:01

"If conversion stopped due to reaching a terminating null character or nmc bytes into src, and if dst is not a null pointer, the resulting state described shall be the initial conversion state."

Hm, I don't agree with this change. I think the conversion state should be left as is if the conversion stops after nmc bytes.
(0001381)
geoffclare (manager)
2012-09-26 15:38
edited on: 2013-05-09 15:19

Initial Interpretation response (superseded by Note: 0001568)
-------------------------------
The standard is unclear on this issue, and no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor.

Rationale:
-------------
None.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------

At page 1277 line 41977 change:

    past the last character converted (if any)

to:

    past the last byte processed (if any)

At page 1277 line 41986 change:

    ... limited to at most nmc bytes (the size of the input buffer).

to (all within the CX shading):

    ... limited to at most nmc bytes (the size of the input buffer).
    If the input buffer ends with an incomplete character, it is
    unspecified whether conversion stops at the end of the previous
    character (if any), or at the end of the input buffer. In the
    latter case, a subsequent call to mbsnrtowcs() with an input
    buffer that starts with the remainder of the incomplete character
    shall correctly complete the conversion of that character.
    
At page 1278 line 42008 change FUTURE DIRECTIONS from:

    None.
    
to:

    A future version may require that when the input buffer ends with
    an incomplete character, conversion stops at the end of the input buffer.

(0001383)
mdempsky (reporter)
2012-09-26 16:50

Minor nit: Is saying "past the last byte converted" the same as "past the last byte processed"? I ask because other wording in the standard (e.g., mbrtowc()'s (size_t)-2 return value description) uses "processed" to refer to input bytes that have been consumed for conversion but don't yield a complete character, but "converted" only seems to be used to refer to complete characters.
(0001384)
geoffclare (manager)
2012-09-27 07:27

I agree that it should say "processed", and I believe this will easily achieve consensus, so I have updated Note: 0001381 accordingly. I have also updated the desired action in 0000616 to match.
(0001526)
ajosey (manager)
2013-03-29 08:06

Interpretation Proposed 29 Mar 2013
(0001552)
geoffclare (manager)
2013-04-26 14:03

Antoine Leca pointed out in comp.std.c that the page 1277 line 41977
change from "past the last character converted (if any)" to
"past the last byte processed (if any)" applies to mbsrtowcs() as
well as mbsnrtowcs(), and therefore is potentially introducing a
conflict with the C Standard. Perhaps we should leave that line
alone, and deal with this in the later CX shaded text so that
it only applies to mbsnrtowcs().

Note that if we do this, we should also change 0000616 to match.
(0001553)
mdempsky (reporter)
2013-04-26 18:53

I think adding CX shading is fine to be careful, but my interpretation is that saying "past the last byte processed" is the same as "past the last character converted" for mbsrtowcs().

The conversions are done by mbrtowc(), and my reading of mbrtowc() is that if it fails with EILSEQ, then the bytes have only been inspected, not processed.
(0001568)
geoffclare (manager)
2013-05-03 09:42
edited on: 2013-05-09 15:17

Interpretation response
------------------------
The standard is unclear on this issue, and no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor.

Rationale:
-------------
None.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------

At page 1277 line 41986 change:

    except that the conversion of characters pointed to by src is
    limited to at most nmc bytes (the size of the input buffer).

to (all within the CX shading):

    except that the conversion of characters indirectly pointed to by
    src is limited to at most nmc bytes (the size of the input buffer),
    and under conditions where mbsrtowcs() would assign the address
    just past the last character converted (if any) to the pointer
    object pointed to by src, mbsnrtowcs() shall instead assign the
    address just past the last byte processed (if any) to that pointer
    object. If the input buffer ends with an incomplete character, it
    is unspecified whether conversion stops at the end of the previous
    character (if any), or at the end of the input buffer. In the
    latter case, a subsequent call to mbsnrtowcs() with an input
    buffer that starts with the remainder of the incomplete character
    shall correctly complete the conversion of that character.
    
At page 1278 line 42008 change FUTURE DIRECTIONS from:

    None.
    
to:

    A future version may require that when the input buffer ends with
    an incomplete character, conversion stops at the end of the input buffer.

(0001703)
shware_systems (reporter)
2013-08-06 22:05

Just a note, for the example cited: If the function is behaving as if calls to mbrtowc() are being done, the return values should be similar... In this case it should have returned (size_t) -2 to indicate incomplete character, not 2 to indicate how many partially processed chars were done, or the value reserved for such indication. A successive call with the same parameters would return 1 to indicate three characters, i.e. nmc from first call + num<=nmc in second, contributed to the one wide character which was successfully converted and stored and src would be set to null to indicate a terminating '\0' was found before nmc bytes exhausted in the second call.

(size_t)-2 is not listed as one of the possible return values, however, which is more the bug in this case for POSIX as it's a lack of consistency. mbrtowc() needs this value because it isn't allowed to modify src to indicate '\0' found, it uses the 0 return value for this. However, glibc, et al, should be returning 0 as the substitute for (size_t)-2, even though it isn't documented for this purpose explicitly, on the first call to indicate that no complete wide character has been processed, but a successive call still might produce a wide char, since (size_t)-1 not warranted. Returning > 0 indicates, falsely in this case, that at least one wide char was successfully stored, as that is how mbrtowc() interprets it. Returning >0 from these functions is supposed to mean 1 or more wchars was stored and that IS explicit.

The src pointer should still be pointing to the first byte after the first call and the fact that ps has been modified would be used to advance the pointer to try for a successful conversion on that subsequent call. This is consistent with the wording that
"Conversion shall stop early in either of the following cases:
• A sequence of bytes is encountered that does not form a valid character."

A partial character is not fully valid yet, so the conversion should leave src at the start of the sequence, with ps set to continue the conversion, not at the end of the buffer pointing at a '\0'. This leaves src so it can be immediately copied to the head of that same buffer and the remaining chars making up that wide char can be appended from elsewhere to continue the conversion. src gets reset to the buffer head position value before attempting completing that conversion and doing any others. Also, a restart as the 'r' in mbrtowc() stands for implies start pos stays the same. Otherwise it would be 'c' for continuable, most likely. This would force the extra step of adding the bytes processed returned to the buffer head pointer to get back to the start of the fragment. Both mbsnr and mbsr should return 0 on such a fragment.

So, both are nominally buggy, but POSIX less so than the others, as consistent behavior can be inferred by cross referencing to mbrtowc() in this way. I think this would be enough to make a conformance distinction too, as is, though the language should be cleaned up to specify that is the function of 0 as return value and how this relates to ending position of src, rather than having to infer it.

mbrtowc() could be considered buggy in that a successful conversion should always return from 1 to MB_CUR_MAX, not 1 to n, after all bytes processed so that a wchar could be stored, whether in a single call or multiple calls were required, to show how much of src was used so it can be advanced properly to the next char beginning a possible wchar by the caller. Otherwise the application has to do the same num of chars partially processed managed by ps already when (size_t)-2 is returned, since src is not advanced by the interface, which to me doesn't make sense as a requirement. It complicates the implementation of these two functions, at least, as shown above and the use by applications directly.

Also, the return value (size_t)-2 should have text explicitly indicating that it is the value to be returned when a null '\0' byte is encountered and more chars are needed to finish the current pending wchar. This condition is an interruption of the conversion process, whether n bytes reached or not, so neither 0 nor (size_t)-1 particularly appropriate. 0 is appropriate when '\0' is the first byte examined, after a possible shift code to return ps to init state. Otherwise should just mean 'more data required' as for the tail fragments of mbs(n)r. (size_t)-1 should only be returned when the encoding specification has assigned a particular byte sequence as invalid explicitly.
(0001704)
dalias (reporter)
2013-08-07 00:37

The correct return value for mbrtowc when the mbstate_t object has recorded the position in a partially decoded character (*) and the next byte to be processed cannot continue or complete that character is (size_t)-1. This is true whether the next byte to be processed is '\0' or something else.

If the next byte to be processed is '\0', returning (size_t)-2 is non-conforming because ISO C forbids the null byte from appearing as part of any multibyte character except itself. A return value of (size_t)-2 would indicate that the null byte is contributing to a complete character but the character has not yet been completed; this is forbidden.

The poster of note 1703 should be aware that none of this is relevant to this issue report at hand. Moreover, this tracker is for the work of the Austin Group in clarifying issues that arise attempting to interpret the standard and maintaining/developing the standard, not for use as a personal blog. I do not speak for the group, but I believe I'm not being unreasonable to ask you to stop posting long off-topic ramblings here, especially when the information contained in your posts is often of questionable accuracy and posted as fact rather than as asking for an interpretation. I believe it is distracting from the standards process myself and others are trying to participate in, and confusing to others who are trying to follow the Austin Group's work.

(*) Note that position within a multibyte character is different from a shift state. See C99 7.24.6 Extended multibyte/wide character conversion utilities, paragraph 4.
(0001705)
shware_systems (reporter)
2013-08-07 03:32

Excuse Me? This was left as "No conformance distinction can be made" and I'm trying to show that such distinctions can be made so implementations are more reliably portable. I fail to see how that's not on-topic. That the logic chains to reconcile discrepancies may get long-winded is regrettable, but I'm trying to be proactive, not "blogging". What I am saying in this case is that the '\0' is indicating 'end-of-buffer', not participating in the decoding process.

This allows the mbsrtowcs() function to stop at the last fragment, as it's supposed to, not report every string with a trailing fragment as (size_t)-1 EILSEQ. Does this mean that other interfaces that currently require only -1 be returned be examined as being under-specified. Probably. I'm not proposing normative wording though because the issue has been closed, and giving possible reasons the group as a whole may want to reopen it. If you're not one of them, fine, but don't get all holier-than-thou either. You've shown some of your own misconceptions already.
(0001706)
dalias (reporter)
2013-08-07 04:24

OK, I understand better what you're trying to say now, but it's wrong and contrary to the normative requirements of ISO C. If it were correct, it would make it impossible for ISO C's multibyte facilities to support UTF-8, because they could not detect encoding errors or distinguish the strings "abc\xc2" and "abc"; both would convert to the same wide character sequence with no indication that the first contains an encoding error. I already cited the C requirements the return value of mbrtowc so I don't think I need to cite them again.

Your claim of relevance is also incorrect. The issue at hand is the mbsnrtowcs function, not mbsrtowcs. Per ISO C, mbsrtowcs is required to return (size_t)-1 if the string ends with an incomplete character, which is an encoding error. The mbsrtowcs function is for processing complete strings; it cannot process fragments.
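
The mbsrtowcs() requirement described in this note can be checked with a small sketch. It is an illustration only, not text from the standard. The helper names use_utf8_locale() and convert_whole() are hypothetical, and a UTF-8 locale under one of the listed names is assumed to be installed.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
   Returns 1 if one could be selected, 0 otherwise. */
static int use_utf8_locale(void)
{
    static const char *const names[] = {
        "C.UTF-8", "en_US.UTF-8", "C.utf8", "en_US.utf8"
    };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Hypothetical helper: convert a complete, null-terminated multibyte
   string (shorter than 64 characters) with mbsrtowcs(), starting in the
   initial conversion state. Returns what mbsrtowcs() returns: the number
   of wide characters stored (excluding the terminator), or (size_t)-1
   on an encoding error. */
static size_t convert_whole(const char *s)
{
    wchar_t out[64];
    mbstate_t st;
    const char *src = s;

    memset(&st, 0, sizeof st);
    return mbsrtowcs(out, &src, sizeof out / sizeof out[0], &st);
}
```

In a UTF-8 locale, convert_whole("abc") would be expected to return 3, while convert_whole("abc\xc2") would be expected to return (size_t)-1, because the lead byte 0xc2 is followed by the terminating null byte instead of a continuation byte. This is how the two strings remain distinguishable.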

POSIX added the mbsnrtowcs function for the purpose of processing fragments. The interpretation response that "No conformance distinction can be made" does not indicate inaction. It means that the existing text in the standard was insufficient to resolve the issue, and that textual changes (which follow) are being proposed to resolve the issue. As noted in the proposed "future directions", a future version of the standard may require processing of a final partial character, but the text fixing the ambiguity allows either historical behavior so as not to make incompatible changes in a technical corrigendum.

So for now, conforming applications which use mbsnrtowcs to process fragments must keep any unprocessed input bytes (of which there may be zero, if the fragment does not end with a partial character or if the implementation takes the option to process these bytes and store the partial character in the mbstate_t object) and prepend them to the next fragment when calling mbsnrtowcs. This has no negative impact on application portability, and in fact applications were already doing this anyway, since (as I understand it) most implementations do not process a final partial character. The possible change mentioned in "future directions" has nothing to do with the ability to write portable programs; all it would achieve is allowing slightly simpler and more efficient program logic (especially for programs which may be processing text from read-only buffers).
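
The carry-over technique described in the paragraph above could be sketched roughly as follows. This is one possible arrangement under stated assumptions, not code from any implementation or from the standard. The names struct frag_conv, frag_init(), frag_convert(), use_utf8_locale() and the FRAG_MAX cap are all hypothetical, and the caller is assumed to supply a dst with room for at least FRAG_MAX + MB_LEN_MAX wide characters so that conversion never stops because dst is full.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for mbsnrtowcs() */
#endif
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#define FRAG_MAX 256              /* hypothetical cap on one fragment */

struct frag_conv {
    mbstate_t st;                 /* conversion state kept across calls */
    char pending[MB_LEN_MAX];     /* bytes the last call left unprocessed */
    size_t npending;
};

static void frag_init(struct frag_conv *fc) { memset(fc, 0, sizeof *fc); }

/* Hypothetical helper: try a few common UTF-8 locale names. */
static int use_utf8_locale(void)
{
    static const char *const names[] = {
        "C.UTF-8", "en_US.UTF-8", "C.utf8", "en_US.utf8"
    };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Convert one fragment of n bytes. Unprocessed bytes from the previous
   call are prepended first. Returns the number of wide characters
   stored in dst, or (size_t)-1 on error. */
static size_t frag_convert(struct frag_conv *fc, const char *frag,
                           size_t n, wchar_t *dst)
{
    char buf[MB_LEN_MAX + FRAG_MAX];
    const char *src = buf;
    size_t total, got, left;

    if (n > FRAG_MAX)
        return (size_t)-1;
    memcpy(buf, fc->pending, fc->npending);
    memcpy(buf + fc->npending, frag, n);
    total = fc->npending + n;

    got = mbsnrtowcs(dst, &src, total, total, &fc->st);
    if (got == (size_t)-1)
        return got;

    /* src is null after a terminating null character; otherwise it
       marks the first byte not processed. Either permitted behavior
       works: the leftover is zero bytes if the implementation stored
       the partial character in the mbstate_t object. */
    left = src ? (size_t)(buf + total - src) : 0;
    if (left > MB_LEN_MAX)
        return (size_t)-1;
    if (left > 0)
        memcpy(fc->pending, src, left);
    fc->npending = left;
    return got;
}
```

With the three bytes from the original test program split as two bytes and then one, the first frag_convert() call would be expected to return 0 and the second to return 1, whether the implementation stops at the end of the previous character or at the end of the input buffer.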
(0001707)
mdempsky (reporter)
2013-08-07 05:23

"This has no negative impact on application portability, and in fact applications were already doing this anyway, since (as I understand it) most implementations do not process a final partial character."

My recollection is the exact opposite actually: that all existing mbsnrtowcs() implementations *do* process the final partial character, which was contrary to the current standard's wording, hence this defect report.
(0001708)
shware_systems (reporter)
2013-08-08 08:35
edited on: 2013-08-08 08:52

Apologies, there was something I overlooked in the wording on mbrtowc( ), so I
withdraw objection about language of (size_t)-2 return. It can happen properly
in a robust implementation as stands. The standard does say:
If s is not a null pointer, the mbrtowc( ) function shall inspect at most n bytes beginning at the
byte pointed to by s to determine the number of bytes needed to complete the next character...

This 'at most' implies it can also do its own strlen( ) of s internally and use
the lesser of this len or n as the effective count to try and process. This
wouldn't stop a check for '\0' if after this effective count, less than n, was
processed and ps in initial state so 0 can be returned also. When count = n and
ps is holding a shift back to initial state, i.e. the '\0' is at position n+1,
I think the intent is (size_t)-2 should still be returned. I believe this
describes how (size_t)-1 can be avoided on ending fragments given the current
language. When '\0' is the initial char pointed at by s, and ps is neither in
initial state or a pending return to initial state, then would be appropriate
to return (size_t)-1.

So it can be done, I think, but the language doesn't specify this strlen( )
should be done also. That it can be done does give more point to the intent is
that mbsn( ) and mbsnr( ) should be stopping at the beginning of a trailing
fragment, though, as this is what best matches the semantics of mbrtowc( ), and
for mbsnr, an initial fragment when n<MB_CUR_MAX should be returning 0.

It also allows, as an extension, and possibly a normative option, CX or XSI, a
singleton '\0' to be examined at the start of a passed in buffer if ps isn't in
initial state and a code set is specified as having a nul as a possible
intermediate or trailing character. For something like this the description
could be reworded in the vendor's documentation of extensions to require
returning with (size_t)-2 after the one '\0' byte, rather than (size_t)-1, to
prevent buffer overruns from corrupted data. It wouldn't prevent GIGO, but it
would give the application a chance to use a different routine to validate the
'\0' was legitimate as a non-buffer delimiter. An extension might define an
ismbslastnul(mbstate_t ps) interface to facilitate checking for this and keep
mbstate_t opaque. This would be, I think, consistent with how it's supposed to
check for nul anyways to complete a pending 'return to initial state' shift
sequence and return 0 as 'end of mbs' as a special case.

I think some additional language would be needed to support it as a normative
option, though, and this extra interface makes it unsuitable for TC scope.
Unless is___( ) type interfaces are permitted. I forget offhand.

(0001709)
dalias (reporter)
2013-08-08 15:38

Note 1708 presumes something false. The null byte can never be a possible intermediate or trailing character. This is specified in ISO C, to which POSIX defers. See C99 5.2.1.2 Multibyte characters:

"A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character."

Note, however, that such a byte can occur after bytes that represent a change in shift state. This would not be an encoding error, but would discard the shift state and result in the null wide character.

One might ask the question "Why must UTF-8 lead bytes be partial characters and not shift states?" In fact, as far as I could tell, it would be possible to make an implementation in which there is a locale that behaves similar to a UTF-8 based locale, but for which lead bytes are treated as shift states rather than partial characters. There would be observable differences in the behavior of mbrtowc. The important thing to realize is that this would not be UTF-8. While such an implementation would be implementing _a_ multibyte encoding, the encoding it would implement does not conform to the requirements of ISO-10646 or Unicode regarding the interpretation of UTF-8, and thus it would not be correct to call it UTF-8. In particular, "abc\xc3" and "abc" would both convert to the string L"abc".
(0001821)
ajosey (manager)
2013-09-06 08:31

Interpretation Proposed 6 Sep 2013
(0001883)
ajosey (manager)
2013-10-14 13:03

Interpretation approved 14 October 2013
(0001902)
shware_systems (reporter)
2013-10-14 14:59

Per the mailing list, I still object; it's creating a new function, not clarifying the issue... The implementations are non-conforming, that's all!

- Issue History
Date Modified Username Field Change
2012-08-15 15:57 nick New Issue
2012-08-15 15:57 nick Status New => Under Review
2012-08-15 15:57 nick Assigned To => ajosey
2012-08-15 15:57 nick Name => Nick Stoughton
2012-08-15 15:57 nick Organization => USENIX
2012-08-15 15:57 nick User Reference => nms-mbsnrtowcs-001
2012-08-15 15:57 nick Section => mbsnrtowcs
2012-08-15 15:57 nick Page Number => 1277
2012-08-15 15:57 nick Line Number => 41975
2012-08-15 15:57 nick Interp Status => ---
2012-08-15 17:01 mdempsky Note Added: 0001327
2012-09-26 15:10 nick Desired Action Updated
2012-09-26 15:38 geoffclare Note Added: 0001381
2012-09-26 15:39 geoffclare Interp Status --- => Pending
2012-09-26 15:39 geoffclare Final Accepted Text => Note: 0001381
2012-09-26 15:39 geoffclare Status Under Review => Interpretation Required
2012-09-26 15:39 geoffclare Resolution Open => Accepted As Marked
2012-09-26 15:39 geoffclare Tag Attached: tc2-2008
2012-09-26 15:41 geoffclare Note Edited: 0001381
2012-09-26 15:47 nick Issue cloned 0000616
2012-09-26 15:47 nick Relationship added parent of 0000616
2012-09-26 16:50 mdempsky Note Added: 0001383
2012-09-27 07:25 geoffclare Note Edited: 0001381
2012-09-27 07:27 geoffclare Note Added: 0001384
2013-03-29 08:06 ajosey Interp Status Pending => Proposed
2013-03-29 08:06 ajosey Note Added: 0001526
2013-04-26 14:03 geoffclare Note Added: 0001552
2013-04-26 18:53 mdempsky Note Added: 0001553
2013-05-02 15:30 nick Tag Attached: c99
2013-05-03 09:42 geoffclare Note Added: 0001568
2013-05-09 15:17 geoffclare Note Edited: 0001568
2013-05-09 15:19 geoffclare Note Edited: 0001381
2013-05-09 15:19 geoffclare Interp Status Proposed => Pending
2013-05-09 15:19 geoffclare Final Accepted Text Note: 0001381 => Note: 0001568
2013-08-06 22:05 shware_systems Note Added: 0001703
2013-08-07 00:37 dalias Note Added: 0001704
2013-08-07 03:32 shware_systems Note Added: 0001705
2013-08-07 04:24 dalias Note Added: 0001706
2013-08-07 05:23 mdempsky Note Added: 0001707
2013-08-08 08:35 shware_systems Note Added: 0001708
2013-08-08 08:52 shware_systems Note Edited: 0001708
2013-08-08 15:38 dalias Note Added: 0001709
2013-09-06 08:31 ajosey Interp Status Pending => Proposed
2013-09-06 08:31 ajosey Note Added: 0001821
2013-10-14 13:03 ajosey Interp Status Proposed => Approved
2013-10-14 13:03 ajosey Note Added: 0001883
2013-10-14 14:59 shware_systems Note Added: 0001902
2019-06-10 08:55 agadmin Status Interpretation Required => Closed

