This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.



Re: 27.8.1.4/19: "reconstruct the original contents of the file"?


On Sep 26, 2010, at 1:20 AM, Jerry Schwarz wrote:

> On Sep 25, 2010, at 2:13 PM, David Krauss wrote:
> 
>> OK, that's good news. I just can't be too sure :v) . In practice, encoding transitions are never solved by the converter itself. Instead you have some sequence like XML's "<![CDATA[...]]>" which you can lock onto and synchronize with. That's what is important to support, not the most general case.
> 
> Ok.  The block encodings I mentioned (encryption, compression, ...)  were speculative and you're saying that nobody is doing these operations using codecvt's. I can believe that. They really are the wrong way to do that kind of operation, but the standard didn't offer any alternative (and I think the new standard doesn't either.)

I'm pretty new to codecvts, so I don't know what people have done. Looking online, I don't find much, but that's not exactly conclusive. (We should get input from the author of the Boost Locale project… including references to other prior work.)

You have to jump through a few hoops and assume mbstate_t is big enough for a pointer, but codecvts should be able to do anything. I'm just talking about what people certainly want to do with UTF-8 and XML. I think the new standard didn't update iostreams customization because there is little interest: few people do it, since it is little understood and poorly supported. That could perhaps be changed.

>> For the general case, however, assuming the file was written using C++ (or reasonably at all), a preceding encryption encoding will terminate the block when imbue() is called when writing the file, so assuming the file is at the same position when imbue() is called when reading, skipping the expected unshift sequence will reach the correct position just after the block. As I mentioned, this will fail (perhaps quietly) if the encodings are inherently incompatible but then the user deserves it. I'm writing up such a testcase now.
> 
> I may be missing something, but I don't see how you can ask a codecvt facet to skip the unshift sequence.  Are you expecting that to be handled by the new codecvt?  Is that what you mean by "compatible encoding"?  


The old codecvt translates some signal sequence which is recognized by the program. The program knows that an unshift may follow that signal. Positioning the stream at the next character will consume at least the unshift sequence.

In terms of the codecvt, the operation is

    post_unshift_offset = length( state, current_external_ptr, external_buffer_end, 0 );

in other words, the codecvt should return a pointer to the first byte after the unshift if you ask it to consume nothing. The encodings are incompatible if the first byte after the unshift is mistaken for part of another shift/unshift, and consumed. (Discussion below on what constitutes a shift in a multibyte or block encoding.)
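To make the length() call above concrete, here is a minimal sketch. The toy encoding is purely an assumption for illustration: the byte 0x0F is treated as a pure unshift code that yields no internal character, and every other byte maps to one char. The names skip_unshift_codecvt and post_unshift_offset are hypothetical, and the facet ignores the state argument, which a real shift encoding would not.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical toy encoding: 0x0F is a pure unshift byte producing no
// internal character; any other byte converts to exactly one char.
class skip_unshift_codecvt : public std::codecvt<char, char, std::mbstate_t> {
protected:
    int do_length(std::mbstate_t&, const char* from, const char* from_end,
                  std::size_t max) const override {
        const char* p = from;
        // Leading unshift bytes are consumed even when max == 0,
        // because they represent zero internal characters.
        while (p != from_end && *p == '\x0F') ++p;
        // Then count at most max characters, skipping embedded unshifts.
        std::size_t produced = 0;
        while (p != from_end && produced < max) {
            if (*p != '\x0F') ++produced;
            ++p;
        }
        return static_cast<int>(p - from);
    }
};

// Offset of the first byte after any leading unshift sequence.
std::ptrdiff_t post_unshift_offset(const char* cur, const char* end) {
    skip_unshift_codecvt cvt;
    std::mbstate_t state = std::mbstate_t();
    return cvt.length(state, cur, end, 0);  // ask it to consume nothing
}
```

With a buffer holding two unshift bytes followed by ordinary text, post_unshift_offset returns 2. With no leading unshift it returns 0.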

Of course, an unshift sequence is anything written by codecvt::unshift, so it may itself bear the signal sequence. In that case (which is the common one, as encodings with pure shift-unshift codes are rare), it is sufficient simply to write the unshift sequence, and read the file as normal.

A block compression scheme would have to encode the actual amount of data in each block to even write a valid unshift sequence mid-file. Once it has that, though, codecvt::length will never overrun the block because it's not even looking at the next block.

In terms of streambuf::imbue implementation, all I plan to do is add sgetc() as the first operation. This will advance the position to the next input character, but not consume it, such that it is re-translated by the new codecvt.
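The property this relies on is that sgetc() makes the next character available without consuming it. A small sketch of the idea, using a std::stringbuf subclass rather than the real basic_filebuf (so no conversion or file positioning is involved). The class peek_on_imbue_buf and its peeked member are hypothetical names for illustration only, not the patch itself.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical illustration: peek at the next input character as the
// first step of imbue(), leaving it in place for later reads.
class peek_on_imbue_buf : public std::stringbuf {
public:
    explicit peek_on_imbue_buf(const std::string& s)
        : std::stringbuf(s, std::ios_base::in), peeked(traits_type::eof()) {}

    int_type peeked;  // what sgetc() returned during the last imbue()

protected:
    void imbue(const std::locale& loc) override {
        peeked = sgetc();             // positions at the next char, no consume
        std::stringbuf::imbue(loc);   // then proceed as usual
    }
};
```

After pubimbue() on a buffer holding "ab", peeked is 'a' and the next sbumpc() still returns 'a', so in the filebuf case the character would be re-translated by the new codecvt.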

---

But wait — won't a UTF-8 codecvt greedily consume the next byte if the top couple bits are set, resulting in much incompatibility? It depends how one approaches UTF-8. Any multibyte sequence could be said to consist of a shift sequence followed by a one-byte unshift sequence. A codecvt following this concept would be able to consume one external byte at a time, would return encoding() == -1, and would suffer this problem. The more intuitive way, though, is to set encoding() == 0 and be stateless.

Although more intuitive, this approach is less common because C99 mbsrtowcs doesn't work that way. C and POSIX provide a user-friendly service with the specific purpose of transcoding multibyte encodings in arbitrarily divided byte substrings. We're different — a codecvt doesn't need to be user-friendly, and it's free not to consume a few bytes, returning with from_next != from_end, to_next != to_end. Moreover, codecvt may be used for constant-width translation, in which case such behavior is expected.

The specification of do_length seems to discriminate between encoding() == 0 and -1, but it's ambiguous which is favored:

(quote) Returns: (from_next-from) where from_next is the largest value in the range [from,from_end] such that the sequence of values in the range [from,from_next) represents max or fewer valid complete characters of type internT. (endquote)

The key word here is "complete." A stateless codecvt must set from_next to a point at the initial state. Then the range [from, from_next) is a representation of valid, complete characters. A stateful version greedily finds the maximal subsequence, such that consuming one more external char would yield exactly max+1 characters. The range may then end inside an incomplete character, but it still holds max or fewer complete ones, with the remainder accounted for by the state change. I think the former interpretation is more reasonable.
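As a sketch of the stateless interpretation for UTF-8: the function below stops at the last complete-character boundary and never counts a truncated trailing sequence. The names utf8_seq_len and stateless_utf8_length are hypothetical, and the code is deliberately simplified: it trusts the lead byte and does not validate continuation bytes or reject overlong forms.

```cpp
#include <cstddef>

// Bytes in the UTF-8 sequence introduced by lead byte b, or 0 if b cannot
// start a sequence (simplified: 0xC0/0xC1 and out-of-range leads not rejected).
inline std::size_t utf8_seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

// Stateless reading of do_length: from_next always lands on a character
// boundary, so [from, from_next) holds max or fewer complete characters.
inline std::ptrdiff_t stateless_utf8_length(const char* from,
                                            const char* from_end,
                                            std::size_t max) {
    const char* p = from;
    for (std::size_t n = 0; n < max && p != from_end; ++n) {
        std::size_t len = utf8_seq_len(static_cast<unsigned char>(*p));
        // Stop before an invalid lead byte or an incomplete trailing sequence.
        if (len == 0 || static_cast<std::size_t>(from_end - p) < len) break;
        p += len;
    }
    return p - from;
}
```

For the bytes 'a', 0xC3, 0xA9, 0xE2, 0x82 (the last two being a truncated three-byte sequence), this returns 3 for any max of 2 or more, leaving the incomplete bytes unconsumed.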

Extending this to block encoding, we must re-introduce state, but that doesn't change much. Being stateful doesn't require being like mbsrtowcs; it can still be finicky. Attempting to codecvt::in less than the entire block will no-op. But that's OK, since codecvt::max_length will return the block size. The state can simply be the position of the desired character, and sub-block I/O may involve decoding the block into a temporary buffer, querying/modifying that, and reencoding (for output).
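A rough sketch of the "finicky" block behavior, under heavy assumptions: a fixed 4-byte block and a one-byte XOR standing in for a real cipher or compressor. block_xor_codecvt and kBlockSize are hypothetical names. Unlike the scheme described above, this toy is stateless and does not support sub-block access. It only shows in() refusing to consume a partial block and max_length() reporting the block size.

```cpp
#include <cwchar>
#include <locale>

constexpr int kBlockSize = 4;  // assumed fixed block size for illustration

// Hypothetical block codec: decodes only whole blocks, otherwise no-ops.
class block_xor_codecvt : public std::codecvt<char, char, std::mbstate_t> {
protected:
    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, char* to, char* to_end,
                 char*& to_next) const override {
        from_next = from;
        to_next = to;
        // Consume input strictly one full block at a time.
        while (from_end - from_next >= kBlockSize &&
               to_end - to_next >= kBlockSize) {
            for (int i = 0; i < kBlockSize; ++i)
                *to_next++ = static_cast<char>(*from_next++ ^ 0x5A);
        }
        // Leftover bytes (less than a block) are left unconsumed.
        return from_next == from_end ? ok : partial;
    }
    int do_max_length() const noexcept override { return kBlockSize; }
    bool do_always_noconv() const noexcept override { return false; }
};
```

Offering in() three bytes yields partial with from_next == from and to_next == to. Offering the full four bytes decodes the block and returns ok.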

For that matter, a UTF decoder which strips byte-order markers would require encoding() == -1 despite being stateless. Mixing supposed statefulness with finicky consumption is key to making this work.

---

Here is a demo (unfinished testcase) implementing a block encoded segment in the middle of a file. This requires the big patch I posted, but only the imbue/__check_facet change is relevant.

Attachment: 1.cc
Description: Binary data

