This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.
On 08/05/17 09:52 +0200, Stephan Bergmann via libstdc++ wrote:
> On 05/05/2017 07:05 PM, Jonathan Wakely wrote:
>> As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda hinted at by http://wg21.link/lwg1200) there's a problem with char_traits<char16_t>::eof() because it returns int_type(-1) which is the same value as u'\uFFFF', a valid UTF-16 code point. i.e. because all values of int_type are also valid values of char_type we cannot meet the requirement that:
>>
>> "The member eof() shall return an implementation-defined constant that cannot appear as a valid UTF-16 code unit."
>>
>> I've reported this as a defect, suggesting that the wording above needs to change.
>>
>> One consequence is that basic_streambuf<char16_t>::sputc(u'\uFFFF') always returns the same value, whether it succeeds or not. On success it returns to_int_type(u'\uFFFF') and on failure it returns eof(), which is the same value.
>>
>> I think that can be solved with the attached change, which preserves the invariant in [char.traits.require] that eof() returns:
>>
>> "a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for all values c."
>>
>> This can be true if we ensure that to_int_type never returns the eof() value. http://www.unicode.org/faq/private_use.html#nonchar10 suggests doing something like this.
>>
>> It means that when writing u'\uFFFF' to a streambuf we write that character successfully, but return u'\uFFFD' instead; and when reading u'\uFFFF' from a streambuf we return u'\uFFFD' instead. This is asymmetrical, as we can write that character but not read it back. It might be better to refuse to write u'\uFFFF' and write it as the replacement character instead, but I think I prefer to write the right character when possible. It also doesn't require any extra changes.
>>
>> All tests pass with this, does anybody see any problems with this approach?
>
> Sounds scary to me. As an application programmer, I'd expect to be able to use char16_t based streams to read and write arbitrary sequences of Unicode code points (encoded as sequences of UTF-16 code units). (Think of an application temporarily storing internal strings to a disk file.)
>
> Also, I'd be surprised to find this asymmetric behavior only for U+FFFF and not for other noncharacters, and only for char16_t and not for char32_t.
>
> To me, the definition of char16_t's int_type and eof() sounds like a bug that needs fixing, not working around?
Fixing that would require changing the standard and breaking the ABI of all existing implementations. I've opened a defect report against the standard, but a change that requires an ABI break isn't likely to be popular.

Changing the semantics of to_int_type for U+FFFF is far less likely to affect any ABIs (it's a constexpr function so it's possible somebody is using the value of to_int_type(char_type(-1)) as a template argument, but it seems unlikely). It's a much smaller change, "allowed" by http://www.unicode.org/faq/private_use.html#nonchar10, and it only affects a noncharacter that is not intended for interchange anyway. I'm not claiming it's ideal, but it fixes a bug today.
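
For readers following along, here is a minimal sketch of the collision and of the remapping being proposed. It is not the attached patch and not the libstdc++ source: naive_traits, remapping_traits, noncharacter, eof_is_distinct_for_all_code_units and self_check are hypothetical names made up for illustration, and the sketch assumes std::uint_least16_t is exactly 16 bits wide, as it is on common platforms.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for char_traits<char16_t> as described in the thread:
// every int_type value is also a valid char_type value.
struct naive_traits {
    using char_type = char16_t;
    using int_type  = std::uint_least16_t;  // assumption: exactly 16 bits wide

    static constexpr int_type eof() noexcept { return static_cast<int_type>(-1); }
    static constexpr int_type to_int_type(char_type c) noexcept {
        return static_cast<int_type>(c);
    }
    static constexpr bool eq_int_type(int_type a, int_type b) noexcept {
        return a == b;
    }
};

// Hypothetical variant with the remapping discussed above: to_int_type never
// yields eof(); U+FFFF is reported as U+FFFD (REPLACEMENT CHARACTER) instead.
struct remapping_traits : naive_traits {
    static constexpr int_type to_int_type(char_type c) noexcept {
        return static_cast<int_type>(c) == eof() ? int_type(0xFFFD)
                                                 : static_cast<int_type>(c);
    }
};

constexpr char16_t noncharacter = static_cast<char16_t>(0xFFFF);  // U+FFFF

// Without the remapping, to_int_type(U+FFFF) is indistinguishable from eof().
static_assert(naive_traits::eq_int_type(naive_traits::eof(),
                                        naive_traits::to_int_type(noncharacter)),
              "naive traits: U+FFFF collides with eof()");

// With the remapping, eof() differs from to_int_type(U+FFFF).
static_assert(!remapping_traits::eq_int_type(
                  remapping_traits::eof(),
                  remapping_traits::to_int_type(noncharacter)),
              "remapping traits: U+FFFF no longer collides with eof()");

// Checks the [char.traits.require] invariant for every 16-bit code unit:
// eq_int_type(eof(), to_int_type(c)) must be false for all c.
inline bool eof_is_distinct_for_all_code_units() {
    for (unsigned long v = 0; v <= 0xFFFF; ++v) {
        const char16_t c = static_cast<char16_t>(v);
        if (remapping_traits::eq_int_type(remapping_traits::eof(),
                                          remapping_traits::to_int_type(c)))
            return false;
    }
    return true;
}

// Convenience wrapper for callers that just want a runtime assertion.
inline void self_check() { assert(eof_is_distinct_for_all_code_units()); }
```

Calling self_check() from a test program exercises the invariant at run time. The cost of the remapping is the asymmetry already mentioned in the thread: in this sketch U+FFFF and U+FFFD both map to the same int_type value, so the original code unit cannot be recovered from it.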