This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.



Re: The mysterious _M_output_unshift


On Mon, Apr 21, 2003 at 11:34:14AM +0200, Paolo Carlini wrote:
> I'm currently trying to understand in better detail basic_filebuf
> (in particular seekoff) in order to simplify _M_*_cur_move, the
> last remaining CPU hog in the implementation of sputc(), sbumpc(),
> etc.
> 
> As part of this I would really appreciate some help about the
> rationale behind basic_filebuf::_M_output_unshift, currently
> completely empty, and also about this code in basic_filebuf::close:
> 
> #if 0
>      // XXX not done
>      if (_M_last_overflowed)
>        {
>          _M_output_unshift();
>          _M_really_overflow(__eof);
>        }
> #endif

Do you know about shifted encodings?  An example in common use is 
JIS, a Japanese encoding.  They define a "shift" marker that changes 
the meaning of the characters after it, until an "unshift" sequence.  
For example, the base state might be ASCII, and then when the shift 
character comes along, characters after it are two bytes long and 
encode Kanji characters.  When a two-byte unshift is seen, characters 
after it are ASCII again.  (This does not necessarily describe JIS, 
but it's something similar.)  A subtle complication is that there 
might be any number of shift/unshift sequences with no actual 
characters in between, which is why codecvt::max_length() isn't 
meaningful for such encodings.
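
To make that concrete, here is a minimal sketch of a decoder for a made-up shifted encoding. Everything in it is an assumption for illustration: the byte values 0x0E (shift) and 0x0F (unshift), the names decode_toy and DecodeResult, and the two-byte layout. It is not JIS and not anything in libstdc++.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy encoding, invented for illustration (not JIS):
//   0x0E  shift:   characters after it are two bytes each
//   0x0F  unshift: characters after it are one byte each again
constexpr unsigned char kShiftIn  = 0x0E;
constexpr unsigned char kShiftOut = 0x0F;

struct DecodeResult {
    std::vector<std::uint32_t> chars;  // decoded character codes
    bool ends_unshifted;               // true if the input ends in the base state
    bool ok;                           // false if a two-byte character was truncated
};

DecodeResult decode_toy(const std::string& bytes) {
    DecodeResult r{{}, true, true};
    bool shifted = false;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        // Shift markers change the state but produce no character.
        if (b == kShiftIn)  { shifted = true;  continue; }
        if (b == kShiftOut) { shifted = false; continue; }
        if (!shifted) {
            r.chars.push_back(b);  // base state: one byte per character
        } else {
            if (i + 1 >= bytes.size()) { r.ok = false; break; }
            const unsigned char lo = static_cast<unsigned char>(bytes[++i]);
            r.chars.push_back((static_cast<std::uint32_t>(b) << 8) | lo);
        }
    }
    r.ends_unshifted = !shifted;
    return r;
}
```

Feeding it a run of shift/unshift pairs yields zero characters no matter how many bytes go in, so no fixed bound on bytes per character can be given for an encoding like this.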

The C standard requires that every multibyte string or sequence begin 
and end in the initial (unshifted) state.  So at the end of a sequence 
of shifted characters, we have to write out the unshift sequence before 
closing or seeking somewhere else.
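
The standard hook for producing those bytes is codecvt::unshift().  The following is only a sketch of how a caller might use it, not the libstdc++ implementation; the helper name append_unshift, the Codecvt alias, and the 16-byte buffer size are assumptions of mine.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

using Codecvt = std::codecvt<wchar_t, char, std::mbstate_t>;

// Appends to `out` whatever bytes the facet says are needed to bring
// `state` back to the initial shift state.  Returns false on error.
// (Hypothetical helper; the buffer size is an arbitrary choice.)
bool append_unshift(const Codecvt& cvt, std::mbstate_t& state, std::string& out) {
    char buf[16];
    for (;;) {
        char* next = buf;
        const std::codecvt_base::result r =
            cvt.unshift(state, buf, buf + sizeof buf, next);
        if (r == std::codecvt_base::error)
            return false;
        if (r == std::codecvt_base::noconv)
            return true;  // no unshift sequence is needed for this state
        out.append(buf, static_cast<std::size_t>(next - buf));
        if (r == std::codecvt_base::ok)
            return true;
        // partial: the buffer was too small; retry, but give up if stuck.
        if (next == buf)
            return false;
    }
}
```

With an encoding that has no shift states, such as the classic "C" locale facet, the call should succeed and append nothing.  A filebuf would do something along these lines before its final flush in close(), and before a seek.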

Generally mbstate_t is supposed to be able to absorb enough information 
from the initial bytes of a partial character and, when the rest of the 
character comes along, produce the full character.  This is made more 
complicated by shift states, because the mbstate_t has to record the
shift state too.
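
As a rough illustration of what such a state has to carry, here is a hand-rolled stand-in for mbstate_t.  It assumes a made-up encoding in which byte 0x0E shifts into two-byte characters and 0x0F unshifts; the names ToyState, feed, and is_initial are hypothetical, and a real mbstate_t is opaque.

```cpp
#include <cstdint>

// Hypothetical conversion state for a made-up shifted encoding
// (0x0E = shift, 0x0F = unshift).  It plays the role of mbstate_t.
struct ToyState {
    bool shifted = false;    // which shift state we are in
    bool have_lead = false;  // true once the first byte of a two-byte
                             // character has been absorbed
    unsigned char lead = 0;  // that absorbed first byte
};

// Feeds one byte.  Returns true and sets `out` when the byte completes
// a character; returns false when it was only absorbed into the state.
bool feed(ToyState& st, unsigned char byte, std::uint32_t& out) {
    if (st.have_lead) {  // second half of a two-byte character
        out = (static_cast<std::uint32_t>(st.lead) << 8) | byte;
        st.have_lead = false;
        return true;
    }
    if (byte == 0x0E) { st.shifted = true;  return false; }
    if (byte == 0x0F) { st.shifted = false; return false; }
    if (st.shifted) {    // first half: remember it and wait for the rest
        st.lead = byte;
        st.have_lead = true;
        return false;
    }
    out = byte;          // base state: one byte, one character
    return true;
}

// True only if nothing is pending and we are unshifted.
bool is_initial(const ToyState& st) {
    return !st.shifted && !st.have_lead;
}
```

Note that after a complete two-byte character the state is still not initial, because the shift is still in effect; only the unshift byte restores it.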

The standard doesn't require that we handle any particular shifted
encodings, so unless there are some users who need it (and can send
patches) we can leave most of the code to handle it as stubs, which
seems to be what you found.

Most importantly, Unicode (UTF-8), EUC, and Shift-JIS are not shifted
encodings.  (Shift-JIS instead marks each multibyte character with a
prefix byte, not unlike UTF-8.)
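
For contrast with a shifted encoding, this small sketch shows the prefix idea for UTF-8 only: the first byte of each character alone says how long the character is, so no shift state has to be carried from one character to the next.  The function name utf8_sequence_length is my own.

```cpp
// Returns the number of bytes in the UTF-8 sequence that starts with
// `b`, or 0 if `b` cannot start a sequence (a continuation byte or a
// value that is never a valid lead byte).
int utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;  // 0xxxxxxx: ASCII
    if (b < 0xC2) return 0;  // continuation byte or overlong lead
    if (b < 0xE0) return 2;  // 110xxxxx
    if (b < 0xF0) return 3;  // 1110xxxx
    if (b < 0xF5) return 4;  // 11110xxx, up to U+10FFFF
    return 0;                // 0xF5..0xFF are never valid
}
```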

Nathan Myers
ncm at cantrip dot org

