This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.



Re: 27.8.1.4/19: "reconstruct the original contents of the file"?


> A block compression scheme would have to encode the actual amount of data in each block to even a valid unshift sequence mid-file. Once it has that, though, codecvt::length will never overrun the because it's not even looking at the next block.

correction: … to even HAVE a valid unshift sequence…
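As a rough illustration of the point about codecvt::length, here is a minimal sketch of how the max argument bounds what the facet looks at. The helper name external_bytes_for is hypothetical, and the sketch assumes the classic "C" locale with plain ASCII input; it says nothing about a real block compression facet.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: how many external bytes would the locale's
// wchar_t/char codecvt consume to produce at most max_internal characters?
std::size_t external_bytes_for(const std::locale& loc,
                               const std::string& bytes,
                               std::size_t max_internal)
{
    using Facet = std::codecvt<wchar_t, char, std::mbstate_t>;
    const Facet& cvt = std::use_facet<Facet>(loc);

    std::mbstate_t state{};  // initial (unshifted) conversion state
    const char* first = bytes.data();

    // length() stops once max_internal characters are accounted for,
    // so bytes beyond that point are not consumed.
    int n = cvt.length(state, first, first + bytes.size(), max_internal);
    return n < 0 ? 0 : static_cast<std::size_t>(n);
}
```

Calling external_bytes_for(std::locale::classic(), "abcdef", 3) asks for at most three internal characters, so the trailing "def" is left alone.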

> But wait — won't a UTF-8 codecvt greedily consume the next byte if the top couple bits are set, resulting in much incompatibility? It depends how one approaches UTF-8. Any multibyte sequence could be said to consist of a shift sequence followed by a one-byte unshift sequence. A codecvt following this concept would be able to consume one external byte at a time, would return encoding() == -1, and would suffer this problem. The more intuitive way, though, is to set encoding() == 0 and be stateless.
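For reference, the "top couple bits" test can be sketched as follows. The function name utf8_sequence_length is hypothetical; it only classifies a lead byte by its high bits and does not validate the continuation bytes, overlong forms, or code point ranges.

```cpp
// Hypothetical helper: number of bytes in a UTF-8 sequence, judged only
// from the high bits of its first byte. Returns 0 for a byte that cannot
// start a sequence (a continuation byte or an invalid lead byte).
int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: single byte
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: one continuation byte follows
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: two continuation bytes follow
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: three continuation bytes follow
    return 0;                             // 10xxxxxx or 11111xxx
}
```

A stateless codecvt can use this to decide up front whether the whole sequence is available, instead of consuming the lead byte and remembering it in the state.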

Looking at config/locale/generic/codecvt_members.cc, GCC does in fact synthesize a stateless translation from the standard C functions. Although the mbstate_t variable is properly maintained, encoding() returns 0. I wonder whether returning -1 would actually be more conservative, in case the underlying OS strips a BOM?
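To check what a given implementation reports, something like the following sketch can be used. The struct CodecvtInfo and the function describe_codecvt are hypothetical names; the values returned depend on the locale and on the library, so none are assumed here.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical summary of what a locale's wchar_t/char codecvt reports.
struct CodecvtInfo {
    int  encoding;       // -1: state-dependent, 0: variable width, N > 0: fixed N bytes per character
    int  max_length;     // upper bound on external bytes per internal character
    bool always_noconv;  // true if the conversion is the identity
};

CodecvtInfo describe_codecvt(const std::locale& loc)
{
    using Facet = std::codecvt<wchar_t, char, std::mbstate_t>;
    const Facet& cvt = std::use_facet<Facet>(loc);
    return CodecvtInfo{cvt.encoding(), cvt.max_length(), cvt.always_noconv()};
}
```

Calling describe_codecvt(std::locale("")) on a system with a UTF-8 user locale shows which of the two approaches to encoding() the library has taken.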

Following up on N-to-M by looking through defect reports:

#76 (which oddly didn't make it into C++03) mentions N-to-M being incompatible with basic_filebuf, but the resolution doesn't forbid it.

#393 as of 2008-07 (or Kona 2007?): addresses N-to-M directly, and guarantees that such an encoding may keep the internal pointer stationary, which is an important requirement. Mentions that related commentary in #76 is in error.

#382 as of 2009-07: "codecvt is meant to be a 1-to-N to N-to-1 conversion. It does not work well for N-to-M conversions. wbuffer_convert now exists, and handles N-to-M cases. Also, there is a new specialization of codecvt that permits UTF-16 <-> UTF-8 conversions." This is odd because UTF-16 <-> UTF-8 *is* N-to-M, and it isn't something a codecvt should do anyway, since both sides are multibyte encodings rather than arrays of fixed-width characters useful as an internal representation. So that is a job for wbuffer_convert, but this does not address the DR. (The DR is, in any case, somewhat in error.)

So, the N-to-M situation has gotten marginally better in C++0x, although such conversions remain unpopular.

---

There is a real problem, of course, and the solution is to replace mbstate_t with a pimpl idiom, such as through an abstract base class. There's no reason for fpos_t to contain a fundamentally opaque implementation-defined type, and being opaque and implementation-defined, no reason the Standard can't better specify it. (ABI issues notwithstanding. And the same symbol is used by C in the global namespace, but really, how much confusion could that cause?) Unfortunately, for now, std::mbstate_t is defined as ::mbstate_t, and it's not reasonable to hijack the C ABI for a C++ extension.

