This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.


27.8.1.4/19: "reconstruct the original contents of the file"?

From: David Krauss <potswa at gmail dot com>
To: libstdc++ at gcc dot gnu dot org
Cc: libstdc++ at gcc dot gnu dot org, potswa at mac dot com
Date: Fri, 24 Sep 2010 20:24:51 -0700 (PDT)
Subject: 27.8.1.4/19: "reconstruct the original contents of the file"?
Newsgroups: comp.std.c++

(cross-post: please reply to libstdc++ mailing list due to latency
issues, although comp.std.c++ is the preferable venue.)

I'm currently studying and editing the GNU basic_filebuf
implementation, and I'm confused by the requirements of
basic_filebuf::imbue (27.8.1.4/17-19), particularly 19 and 17.

There are two very different ways to interpret the Standard on imbue.
One is that it enables support for files containing multiple
encodings. This seems more reasonable. Another is that it implements
conversion of a file from one uniform encoding to another. In other
words, a call to imbue will read the whole file into memory, convert
it, and write it back out. This is less reasonable but seems to fit
the text better.

Paragraph 19 says:
(quote) Note: [imbue] may require reconversion of previously converted
characters. This in turn may require the implementation to be able to
reconstruct the original contents of the file. (endquote)

If some input has been fetched and converted but not yet extracted,
I'll just discard the incorrectly converted part and redo. So long as
the original input is kept, this is foolproof and bulletproof. What
would necessitate "reconstruction"? Does simply having a backup
constitute reconstruction, for these purposes? A codecvt may implement
compression, in which case an underflow operation may overflow the get
area, so keeping a backup more or less needs to be implemented
anyway.
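
To make the "discard and redo" idea concrete, here is a minimal sketch,
not the GNU implementation. It assumes the raw bytes are retained, that
the caller already knows the raw offset of the first unextracted
character, and that the facet never produces more internal characters
than it consumes external bytes (a compressing codecvt would need a
growing output buffer). The name redo_conversion is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

typedef std::codecvt<char, char, std::mbstate_t> Codecvt;

// Hypothetical helper: throw away whatever was converted past raw_pos and
// convert the retained raw bytes again, this time with the facet `cvt`.
// Assumption: the new encoding starts in its initial state at raw_pos.
std::string redo_conversion(const std::string& raw, std::size_t raw_pos,
                            const Codecvt& cvt)
{
    assert(raw_pos <= raw.size());
    const char* from = raw.data() + raw_pos;
    const char* from_end = raw.data() + raw.size();

    // Assumption: at most one internal character per external byte.
    std::string out(raw.size() - raw_pos, '\0');
    if (out.empty())
        return out;

    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const char* from_next = from;
    char* to_next = &out[0];
    std::codecvt_base::result r =
        cvt.in(state, from, from_end, from_next,
               &out[0], &out[0] + out.size(), to_next);

    // A facet that needs no conversion reports noconv and copies nothing,
    // so the raw bytes are themselves the converted characters.
    if (r == std::codecvt_base::noconv)
        return std::string(from, from_end);

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

With the codecvt facet of std::locale::classic(), calling
redo_conversion("hello", 2, facet) should hand back the bytes from
offset 2 onward unchanged.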

On the other hand, the paragraph could be read as meaning that all
characters subsequently extracted are interpreted according to the new
encoding, and furthermore the whole file is in the new encoding.
"Reconstruct" and "reconversion" are apt terms for rewriting the
entire file.

Paragraph 17 says:
(quote) Precondition: If the file is not positioned at its beginning
and the encoding of the current locale as determined by
a_codecvt.encoding() is state-dependent (22.2.1.5.2) then that facet
is the same as the corresponding facet of loc. (endquote)
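
For reference, the precondition can be spelled out as a predicate. This
is only a sketch: it assumes char streams, reads "the same facet" as
object identity (one possible interpretation), and takes the file
position as a bool supplied by the caller. StatefulCodecvt and
imbue_precondition_holds are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>

typedef std::codecvt<char, char, std::mbstate_t> CharCodecvt;

// Illustrative facet that merely claims to be state-dependent:
// encoding() returns -1 for a state-dependent encoding.
struct StatefulCodecvt : CharCodecvt {
    explicit StatefulCodecvt(std::size_t refs = 0) : CharCodecvt(refs) {}
protected:
    int do_encoding() const noexcept override { return -1; }
};

// Hypothetical check of the precondition quoted above.
// `current` is the locale the filebuf holds now; `loc` is the imbue argument.
bool imbue_precondition_holds(bool at_beginning,
                              const std::locale& current,
                              const std::locale& loc)
{
    const CharCodecvt& a_codecvt = std::use_facet<CharCodecvt>(current);

    // The condition only bites midstream with a state-dependent encoding.
    if (at_beginning || a_codecvt.encoding() != -1)
        return true;

    // Assumption: "the same facet" means the very same facet object.
    return std::has_facet<CharCodecvt>(loc)
        && &std::use_facet<CharCodecvt>(loc) == &a_codecvt;
}
```

Under this reading, switching away from a locale built as
std::locale(std::locale::classic(), new StatefulCodecvt) is allowed only
at the beginning of the file, while switching away from the classic
locale is allowed anywhere.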

How is it difficult to change encodings midstream? Why would it make a
difference whether the preceding encoding is stateful? The user must
ensure that the first input to the *succeeding* encoding begins in its
initial state, but for the preceding one, I just finalize its output
and query the file position, whether it's stateful or not. There are a
couple of minor issues, but they aren't specific to stateful encodings.

1. The point of the encoding change needs to be definite. This is
obviously the user's responsibility, even if it's not an easy one. For
a stateful encoding, you have the possibility of a one-to-many
correlation between raw and stream characters. However, it's
reasonable to ask the user to imbue only after a termination (e.g.
unshift) sequence has been read or written. As for output, imbue can
simply write the unshift sequence and validity is guaranteed.

2. The pathological case of a file consisting of shift and unshift
sequences. Relating to #1, returning the preceding input to the
initial state is insufficient to guarantee that the next input is in
the next encoding. However, this isn't specific to stateful encodings.
Consider an encoding where every byte is encoded as-is, except a NUL
character encodes zero bytes. No state involved, yet a large file can
encode nothing. The solution is simply to position the file before the
next character of input. If the first character of the next encoding
is accepted and skipped over by the current encoding, then tough
cookies! But that has nothing to do with statefulness, and there's no
reason to forbid it.
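
The stateless pathological encoding from #2 can be sketched as a facet.
This is an illustration only: NulSkippingCodecvt and decode are
hypothetical names, only the input direction is implemented, and decode
sizes its output on the assumption that decoding never expands the
input.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Every external byte decodes to itself, except NUL, which decodes to
// nothing. The conversion state is never consulted.
struct NulSkippingCodecvt : std::codecvt<char, char, std::mbstate_t> {
    typedef std::codecvt<char, char, std::mbstate_t> Base;
    explicit NulSkippingCodecvt(std::size_t refs = 0) : Base(refs) {}
protected:
    result do_in(std::mbstate_t&,
                 const char* from, const char* from_end, const char*& from_next,
                 char* to, char* to_end, char*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (*from_next == '\0') {   // consumes input, produces nothing
                ++from_next;
                continue;
            }
            if (to_next == to_end)
                return partial;         // output full, input remains
            *to_next++ = *from_next++;
        }
        return ok;
    }
    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }  // variable width, not state-dependent
};

// Hypothetical helper: decode a whole buffer in one call.
std::string decode(const std::string& raw)
{
    NulSkippingCodecvt cvt;
    std::string out(raw.size(), '\0');
    if (raw.empty())
        return out;
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = raw.data();
    char* to_next = &out[0];
    cvt.in(state, raw.data(), raw.data() + raw.size(), from_next,
           &out[0], &out[0] + out.size(), to_next);
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

A buffer consisting only of NUL bytes decodes to an empty string however
long it is, which is the "large file can encode nothing" case, with no
shift state involved.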

On the other hand, being in the middle of the file would seriously
complicate the process of converting from a stateful encoding. The
state is truly part of the file position, so it may be hard to
identify the right position in the new file. For example, consider an
encoding which compresses runs of zeroes. Translating a position
within such a run from old to new would require comparing fpos
objects. Although fpos is required to be EqualityComparable, that
doesn't guarantee that state participates in the equivalence relation.
In the GNU implementation, at least, it doesn't.
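
To illustrate the fpos point, here is a small sketch using a made-up
state type in place of mbstate_t, since mbstate_t is opaque. RunState
and the helper names are hypothetical; the only library facilities
assumed are fpos construction from a streamoff, conversion back to
streamoff, and the state() accessors.

```cpp
#include <cassert>
#include <ios>

// Hypothetical conversion state for a run-length encoding: how many
// zeroes of the current run have already been delivered.
struct RunState {
    int zeros_emitted;
};

typedef std::fpos<RunState> RunPos;

RunPos make_pos(std::streamoff offset, int zeros_emitted)
{
    RunPos p(offset);
    p.state(RunState{zeros_emitted});
    return p;
}

// Comparing the byte offsets alone cannot tell two positions inside the
// same compressed run apart.
bool same_offset(const RunPos& a, const RunPos& b)
{
    return std::streamoff(a) == std::streamoff(b);
}

// A comparison that also looks at the state has to be written by hand.
bool same_position(const RunPos& a, const RunPos& b)
{
    return same_offset(a, b)
        && a.state().zeros_emitted == b.state().zeros_emitted;
}
```

Two positions built as make_pos(10, 1) and make_pos(10, 3) share an
offset but differ in state. Whether operator== distinguishes them
depends on whether the library's fpos comparison consults the state.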

---

So, paragraphs 17 and 19 seem to be meaningless and pointless in the
context of the most reasonable functionality for imbue. They are quite
meaningful if you decide that imbue might mean something completely
different. What's up?
