[PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
Stephan Bergmann via libstdc++
libstdc++@gcc.gnu.org
Mon May 8 07:52:00 GMT 2017
On 05/05/2017 07:05 PM, Jonathan Wakely wrote:
> As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
> hinted at by http://wg21.link/lwg1200) there's a problem with
> char_traits<char16_t>::eof() because it returns int_type(-1) which is
> the same value as u'\uFFFF', a valid UTF-16 code point.
>
> i.e. because all values of int_type are also valid values of char_type
> we cannot meet the requirement that:
>
> "The member eof() shall return an implementation-defined constant
> that cannot appear as a valid UTF-16 code unit."
>
> I've reported this as a defect, suggesting that the wording above
> needs to change.
>
> One consequence is that basic_streambuf<char16_t>::sputc(u'\uFFFF')
> always returns the same value, whether it succeeds or not. On success
> it returns to_int_type(u'\uFFFF') and on failure it returns eof(),
> which is the same value. I think that can be solved with the attached
> change, which preserves the invariant in [char.traits.require] that
> eof() returns:
>
> "a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for
> all values c."
>
> This can be true if we ensure that to_int_type never returns the eof()
> value. http://www.unicode.org/faq/private_use.html#nonchar10 suggests
> doing something like this.
>
> It means that when writing u'\uFFFF' to a streambuf we write that
> character successfully, but return u'\uFFFD' instead; and when reading
> u'\uFFFF' from a streambuf we return u'\uFFFD' instead. This is
> asymmetrical, as we can write that character but not read it back. It
> might be better to refuse to write u'\uFFFF' and write it as the
> replacement character instead, but I think I prefer to write the right
> character when possible. It also doesn't require any extra changes.
>
> All tests pass with this, does anybody see any problems with this
> approach?
Sounds scary to me. As an application programmer, I'd expect to be able
to use char16_t-based streams to read and write arbitrary sequences of
Unicode code points (encoded as sequences of UTF-16 code units). (Think
of an application temporarily storing internal strings to a disk file.)

Also, I'd be surprised to find this asymmetric behavior only for U+FFFF
and not for other noncharacters, and only for char16_t and not for char32_t.

To me, the definition of char16_t's int_type and eof() sounds like a bug
that needs fixing, not working around?
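
To make the concern concrete, here is a minimal sketch of the behavior
as I understand it from the description above. It uses a made-up traits
struct rather than the real char_traits<char16_t>: model_traits,
to_int_type_plain, to_int_type_patched and round_trip are names invented
for illustration, this is not the libstdc++ source, and it assumes
uint_least16_t is exactly 16 bits wide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for char_traits<char16_t>; illustration only.
struct model_traits
{
  using char_type = char16_t;
  using int_type  = std::uint_least16_t;   // assumed to be 16 bits wide

  // eof() as described in the quoted mail: int_type(-1), i.e. 0xFFFF.
  static constexpr int_type eof() noexcept
  { return static_cast<int_type>(-1); }

  // Plain conversion: u'\uFFFF' becomes indistinguishable from eof().
  static constexpr int_type to_int_type_plain(char_type c) noexcept
  { return static_cast<int_type>(c); }

  // Proposed conversion: never return eof(), substitute U+FFFD instead.
  static constexpr int_type to_int_type_patched(char_type c) noexcept
  {
    return static_cast<int_type>(c) == eof()
             ? static_cast<int_type>(0xFFFD)
             : static_cast<int_type>(c);
  }

  static constexpr char_type to_char_type(int_type i) noexcept
  { return static_cast<char_type>(i); }
};

// What a caller gets back after one trip through int_type.
constexpr char16_t round_trip(char16_t c) noexcept
{
  return model_traits::to_char_type(model_traits::to_int_type_patched(c));
}

// Runtime restatement of the properties discussed in this thread.
inline void check_model()
{
  // Without the change, a successful result for U+FFFF looks like eof().
  assert(model_traits::to_int_type_plain(u'\uFFFF') == model_traits::eof());
  // With the change, the invariant holds, but U+FFFF does not survive.
  assert(model_traits::to_int_type_patched(u'\uFFFF') != model_traits::eof());
  assert(round_trip(u'\uFFFF') == u'\uFFFD');
  // Other noncharacters such as U+FFFE pass through unchanged.
  assert(round_trip(u'\uFFFE') == u'\uFFFE');
}
```

In this model only U+FFFF is altered on the way through int_type, while
U+FFFE and every other code unit come back unchanged, which is the
special-casing that would surprise me as a user of such streams.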