This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.
Re: The mysterious _M_output_unshift
- From: Nathan Myers <ncm at cantrip dot org>
- To: libstdc++ at gcc dot gnu dot org
- Date: Mon, 21 Apr 2003 10:43:41 -0700
- Subject: Re: The mysterious _M_output_unshift
- References: <3EA3BB16.5000502@unitus.it>
On Mon, Apr 21, 2003 at 11:34:14AM +0200, Paolo Carlini wrote:
> I'm currently trying to understand in better detail basic_filebuf
> (in particular seekoff) in order to simplify _M_*_cur_move, the
> last remaining CPU hog in the implementation of sputc(), sbumpc(),
> etc.
>
> As part of this I would really appreciate some help about the
> rationale behind basic_filebuf::_M_output_unshift, currently
> completely empty, and also about this code in basic_filebuf::close:
>
> #if 0
>   // XXX not done
>   if (_M_last_overflowed)
>     {
>       _M_output_unshift();
>       _M_really_overflow(__eof);
>     }
> #endif
Do you know about shifted encodings? An example in common use is
JIS, a Japanese encoding. They define a "shift" marker that changes
the meaning of the characters after it, until an "unshift" sequence.
For example, the base state might be ASCII, and then when the shift
character comes along, characters after it are two bytes long and
encode Kanji characters. When a two-byte unshift is seen, characters
after it are ASCII again. (This does not necessarily describe JIS,
but it's something similar.) A subtle complication is that there
might be any number of shift/unshift sequences with no actual
characters in between, which is why codecvt's max_length() isn't
meaningful for such encodings.
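
To make the idea concrete, here is a minimal sketch of a decoder for a made-up shifted encoding. Everything in it is an assumption for illustration: the one-byte markers 0x0E (shift) and 0x0F (unshift), the name toy_decode, and the two-byte layout. It is not JIS and not anything libstdc++ implements.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy encoding, invented for illustration only:
//   0x0E  shift:   characters after it are two bytes each
//   0x0F  unshift: characters after it are one byte each (ASCII)
constexpr unsigned char kShift = 0x0E;
constexpr unsigned char kUnshift = 0x0F;

// Decode a complete byte sequence into character codes.
std::vector<std::uint32_t> toy_decode(const std::string& bytes)
{
  std::vector<std::uint32_t> out;
  bool shifted = false;  // the shift state; the base state is unshifted
  for (std::size_t i = 0; i < bytes.size(); ++i)
    {
      const unsigned char b = static_cast<unsigned char>(bytes[i]);
      if (b == kShift)
        shifted = true;            // consumes a byte, produces no character
      else if (b == kUnshift)
        shifted = false;           // likewise
      else if (!shifted)
        out.push_back(b);          // one byte, one character
      else if (i + 1 < bytes.size())
        {
          const unsigned char lo = static_cast<unsigned char>(bytes[++i]);
          out.push_back((static_cast<std::uint32_t>(b) << 8) | lo);
        }
      // A truncated two-byte character at the very end is dropped here.
    }
  return out;
}
```

Note that a run of shift/unshift pairs consumes input without producing a single character, so no fixed bound on bytes per character can be given for this scheme.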
The C standard requires that all strings and sequences begin and end
in an unshifted state. At the end of a sequence of shifted characters,
we have to write out the unshift sequence before closing, or seeking
somewhere else.
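
As a rough sketch of what "write out the unshift sequence" means on the output side, consider an encoder for a made-up scheme in which 0x0E shifts into two-byte mode and 0x0F shifts back. The class ToyEncoder and its members are hypothetical names; the standard's counterpart for this job is the codecvt facet's unshift() member, which this toy only imitates in spirit. The toy also makes no attempt to avoid two-byte characters whose bytes collide with the markers.

```cpp
#include <cstdint>
#include <string>

// Hypothetical encoder for an invented shifted encoding:
//   0x0E shifts into two-byte mode, 0x0F shifts back to ASCII.
class ToyEncoder
{
public:
  // Append the encoding of one character, switching modes when needed.
  void put(std::uint32_t c, std::string& out)
  {
    const bool wide = c > 0x7F;
    if (wide && !shifted_)
      {
        out += static_cast<char>(0x0E);
        shifted_ = true;
      }
    else if (!wide && shifted_)
      {
        out += static_cast<char>(0x0F);
        shifted_ = false;
      }
    if (wide)
      out += static_cast<char>((c >> 8) & 0xFF);
    out += static_cast<char>(c & 0xFF);
  }

  // Return to the base state. This is what has to happen before a
  // close or a seek; it writes nothing if we are already unshifted.
  void unshift(std::string& out)
  {
    if (shifted_)
      {
        out += static_cast<char>(0x0F);
        shifted_ = false;
      }
  }

private:
  bool shifted_ = false;  // the base state is unshifted
};
```

If the last character written was a two-byte one, the output so far ends in shifted mode, and only the call to unshift() makes it a well-formed sequence on its own.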
Generally mbstate_t is supposed to be able to absorb enough information
from the initial bytes of a partial character that, when the rest of the
character comes along, the conversion can produce the full character.
This is made more complicated by shift states, because the mbstate_t has
to record the shift state too.
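
A small sketch may help show why the state has to carry both pieces of information. The struct ToyState below is a hypothetical stand-in for mbstate_t (the real type is opaque), and the encoding is again an invented one with 0x0E as shift and 0x0F as unshift. Input arrives in arbitrary chunks, as it would from a file buffer.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for mbstate_t in an invented shifted encoding.
struct ToyState
{
  bool shifted = false;    // current shift state
  bool have_lead = false;  // first byte of a two-byte character is pending
  unsigned char lead = 0;  // that pending byte
};

// Feed one chunk of bytes, appending every character that is completed.
// Anything left incomplete is remembered in `st` for the next call.
void toy_feed(ToyState& st, const std::string& chunk,
              std::vector<std::uint32_t>& out)
{
  for (char ch : chunk)
    {
      const unsigned char b = static_cast<unsigned char>(ch);
      if (st.have_lead)
        {
          // Second half of a two-byte character: now it is complete.
          out.push_back((static_cast<std::uint32_t>(st.lead) << 8) | b);
          st.have_lead = false;
        }
      else if (b == 0x0E)
        st.shifted = true;
      else if (b == 0x0F)
        st.shifted = false;
      else if (st.shifted)
        {
          st.lead = b;  // absorb the partial character
          st.have_lead = true;
        }
      else
        out.push_back(b);
    }
}
```

If a chunk boundary falls in the middle of a two-byte character, the first call produces nothing for it and the second call completes it, but only because the state remembered both the pending byte and the fact that we were shifted.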
The standard doesn't require that we handle any particular shifted
encodings, so unless there are some users who need it (and can send
patches) we can leave most of the code to handle it as stubs, which
seems to be what you found.
Most importantly, Unicode (UTF-8), EUC, and Shift-JIS are not shifted
encodings. (Shift-JIS uses a prefix on each character, instead, not
unlike UTF-8.)
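
For contrast, here is a sketch of why UTF-8 needs no shift state: the first byte of each character says by itself how long that character is. The function name utf8_sequence_length is made up for this example, and it only classifies the lead byte; it does not validate the continuation bytes or reject lead bytes such as 0xC0 that no valid UTF-8 text contains.

```cpp
// Number of bytes in the UTF-8 sequence introduced by `lead`, judged
// from its bit pattern alone. Returns 0 for a continuation byte
// (10xxxxxx) or for the pattern 11111xxx, which starts no sequence.
int utf8_sequence_length(unsigned char lead)
{
  if (lead < 0x80)
    return 1;  // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0)
    return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0)
    return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0)
    return 4;  // 11110xxx
  return 0;    // 10xxxxxx or 11111xxx
}
```

Nothing outside the character itself is consulted, so a decoder can start at any character boundary without knowing what came before.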
Nathan Myers
ncm at cantrip dot org