This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.



Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()


On 08/05/17 09:52 +0200, Stephan Bergmann via libstdc++ wrote:
On 05/05/2017 07:05 PM, Jonathan Wakely wrote:
As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
hinted at by http://wg21.link/lwg1200) there's a problem with
char_traits<char16_t>::eof() because it returns int_type(-1) which is
the same value as u'\uFFFF', a valid UTF-16 code point.

i.e. because all values of int_type are also valid values of char_type
we cannot meet the requirement that:

"The member eof() shall return an implementation-defined constant
that cannot appear as a valid UTF-16 code unit."
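
A minimal sketch of the collision described above. `naive_traits` is a hypothetical stand-in written for illustration, not the libstdc++ source, and it assumes that `std::uint_least16_t` is exactly 16 bits wide, so every int_type value is also a char16_t value:

```cpp
#include <cstdint>

// Hypothetical stand-in for char_traits<char16_t>, for illustration only.
struct naive_traits {
    using char_type = char16_t;
    using int_type  = std::uint_least16_t;  // assumed to be 16 bits wide

    // eof() is int_type(-1), i.e. 0xFFFF when int_type has 16 bits.
    static constexpr int_type eof() noexcept { return static_cast<int_type>(-1); }

    // Plain value conversion, with no special case for any code unit.
    static constexpr int_type to_int_type(char_type c) noexcept { return int_type(c); }

    static constexpr bool eq_int_type(int_type a, int_type b) noexcept { return a == b; }
};

// u'\uFFFF' converts to the same int_type value that eof() returns.
static_assert(naive_traits::eq_int_type(naive_traits::eof(),
                                        naive_traits::to_int_type(u'\uFFFF')),
              "eof() collides with U+FFFF in this sketch");
```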

I've reported this as a defect, suggesting that the wording above
needs to change.

One consequence is that basic_streambuf<char16_t>::sputc(u'\uFFFF')
always returns the same value, whether it succeeds or not. On success
it returns to_int_type(u'\uFFFF') and on failure it returns eof(),
which is the same value. I think that can be solved with the attached
change, which preserves the invariant in [char.traits.require] that
eof() returns:

"a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for
all values c."
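
To make the sputc ambiguity concrete, here is a toy model of its return value. The namespace `toy` and the function `sputc_result` are hypothetical names invented for this sketch; it does not use a real basic_streambuf, and it again assumes a 16-bit `std::uint_least16_t`:

```cpp
#include <cstdint>

namespace toy {

using int_type = std::uint_least16_t;  // assumed to be 16 bits wide

constexpr int_type eof() noexcept { return static_cast<int_type>(-1); }
constexpr int_type to_int_type(char16_t c) noexcept { return int_type(c); }

// Models what sputc reports: to_int_type(c) on success, eof() on failure.
constexpr int_type sputc_result(char16_t c, bool write_ok) noexcept {
    return write_ok ? to_int_type(c) : eof();
}

}  // namespace toy

// For an ordinary code unit the caller can tell success from failure.
static_assert(toy::sputc_result(u'A', true) != toy::sputc_result(u'A', false),
              "success and failure differ for u'A'");

// For u'\uFFFF' both outcomes produce the same value.
static_assert(toy::sputc_result(u'\uFFFF', true) == toy::sputc_result(u'\uFFFF', false),
              "success and failure are indistinguishable for U+FFFF");
```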

This can be true if we ensure that to_int_type never returns the eof()
value. http://www.unicode.org/faq/private_use.html#nonchar10 suggests
doing something like this.

It means that when writing u'\uFFFF' to a streambuf we write that
character successfully, but return u'\uFFFD' instead; and when reading
u'\uFFFF' from a streambuf we return u'\uFFFD' instead. This is
asymmetrical, as we can write that character but not read it back.  It
might be better to refuse to write u'\uFFFF' and write it as the
replacement character instead, but I think I prefer to write the right
character when possible. It also doesn't require any extra changes.
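
A minimal sketch of the remapping idea, assuming a 16-bit `std::uint_least16_t`. `remapping_traits` is a hypothetical name and this is not the attached patch; it only illustrates how special-casing U+FFFF in to_int_type restores the invariant and where the asymmetry comes from:

```cpp
#include <cstdint>

// Hypothetical traits sketch, for illustration only.
struct remapping_traits {
    using char_type = char16_t;
    using int_type  = std::uint_least16_t;  // assumed to be 16 bits wide

    static constexpr int_type eof() noexcept { return static_cast<int_type>(-1); }

    // U+FFFF would collide with eof(), so report U+FFFD (the replacement
    // character) instead. Every other code unit converts unchanged.
    static constexpr int_type to_int_type(char_type c) noexcept {
        return c == char_type(0xFFFF) ? int_type(0xFFFD) : int_type(c);
    }

    static constexpr char_type to_char_type(int_type i) noexcept { return char_type(i); }

    static constexpr bool eq_int_type(int_type a, int_type b) noexcept { return a == b; }
};

// The invariant holds again: eof() differs from to_int_type(c), even for U+FFFF.
static_assert(!remapping_traits::eq_int_type(
                  remapping_traits::eof(),
                  remapping_traits::to_int_type(u'\uFFFF')),
              "eof() no longer collides with U+FFFF");

// The asymmetry: converting U+FFFF to int_type and back yields U+FFFD.
static_assert(remapping_traits::to_char_type(
                  remapping_traits::to_int_type(u'\uFFFF')) == u'\uFFFD',
              "U+FFFF comes back as the replacement character");
```

Only to_int_type changes in this sketch, which is why a streambuf could still store u'\uFFFF' itself while the value reported to the caller is u'\uFFFD'.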

All tests pass with this, does anybody see any problems with this
approach?

Sounds scary to me. As an application programmer, I'd expect to be able to use char16_t based streams to read and write arbitrary sequences of Unicode code points (encoded as sequences of UTF-16 code units). (Think of an application temporarily storing internal strings to a disk file.)

Also, I'd be surprised to find this asymmetric behavior only for U+FFFF and not for other noncharacters, and only for char16_t and not for char32_t.

To me, the definition of char16_t's int_type and eof() sounds like a bug that needs fixing, not working around?

Fixing that would require changing the standard and breaking the ABI
of all existing implementations. I've opened a defect report against
that standard, but a change that requires an ABI break isn't likely to
be popular.

Changing the semantics of to_int_type for U+FFFF is far less likely to
affect any ABIs (it's a constexpr function so it's possible somebody
is using the value of to_int_type(char_type(-1)) as a template
argument, but it seems unlikely). It's a much smaller change, "allowed"
by http://www.unicode.org/faq/private_use.html#nonchar10 and it only
affects a noncharacter that is not intended for interchange anyway.

I'm not claiming it's ideal, but it fixes a bug today.

