This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.


Re: [PATCH] Preliminary fix for codecvt_members_unicode_wchar_t

From: Benjamin Kosnik <bkoz at redhat dot com>
To: Paolo Carlini <pcarlini at unitus dot it>
Cc: libstdc++ at gcc dot gnu dot org
Date: Mon, 25 Mar 2002 14:42:19 -0800 (PST)
Subject: Re: [PATCH] Preliminary fix for codecvt_members_unicode_wchar_t


> However, there is something which keeps puzzling me. Why in 
> config/locale/ieee_1003.1-2001/codecvt_specializations.h there are these 
> lines:
> 
>     explicit __enc_traits(const locale& __loc)
>     : _M_in_desc(0), _M_out_desc(0), _M_ext_bom(0), _M_int_bom(0)
>     {
>       // __intc_end = whatever we are using internally, which is
>       // UCS4 (linux, solaris)
>       // UCS2 == UNICODE  (microsoft, java, aix, whatever...)
>       // XXX Currently don't know how to get this data from target system...
>       strcpy(_M_int_enc, "UCS4");
> 
> as if only "UCS4" were supported?!?

These tests only work on linux at the moment, so I'm fudging the details. 
(Although any os that can build libiconv and use it should be able to do 
this stuff too. Time to smooth out these details has thus far eluded me.)

The deal is that sizeof(wchar_t) == 2 on Microsoft operating systems, and 
the default encoding for wide characters there is not the same as on linux, 
solaris, etc. On linux, sizeof(wchar_t) == 4, and the default encoding is 
WCHAR_T. 
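
As a rough illustration of the width difference, here is a minimal sketch 
that reports how wide wchar_t is on the platform it is compiled for. The 
helper names wchar_bits and wchar_holds_any_code_point are made up for this 
example and are not part of libstdc++.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Hypothetical helper: number of bits in one wchar_t on this platform.
constexpr std::size_t wchar_bits()
{
  return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true if a single wchar_t can hold every Unicode
// code point (the largest is U+10FFFF). With a 2-byte wchar_t this is
// false, with a 4-byte wchar_t it is true.
constexpr bool wchar_holds_any_code_point()
{
  return WCHAR_MAX >= 0x10FFFF;
}
```

Printing wchar_bits() from a small main() on each target is a quick way to 
see which of the two cases applies.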

At the moment, there is no clean way to get the native encoding from an 
OS. There's CODESET, from langinfo.h, but there is no consistent locale 
name that indicates wide-character capability. (For narrow, you could pick 
the standard "C" locale and query CODESET.) 
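
The narrow query mentioned above might look something like the sketch 
below, assuming a POSIX system that provides langinfo.h. The function name 
narrow_codeset is hypothetical, and the string it returns is whatever the 
C library chooses to report, so it differs between systems.

```cpp
#include <clocale>
#include <langinfo.h>
#include <string>

// Hypothetical helper: return the codeset name that nl_langinfo reports
// for the given locale, or an empty string if the locale cannot be set.
// The previous LC_CTYPE setting is restored before returning.
std::string narrow_codeset(const char* locale_name)
{
  std::string saved;
  if (const char* current = std::setlocale(LC_CTYPE, nullptr))
    saved = current;

  if (!std::setlocale(LC_CTYPE, locale_name))
    return std::string();

  // CODESET names the narrow (multibyte) encoding of the active locale.
  std::string result = nl_langinfo(CODESET);

  std::setlocale(LC_CTYPE, saved.c_str());
  return result;
}
```

Calling narrow_codeset("C") gives the narrow encoding name. Nothing 
comparable is shown for the wide side, since that is exactly the part that 
is missing.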

To be honest, I'm not even sure this ctor should exist. There is no way 
to set the internal encoding reliably. If you are interested in this, you 
might try deleting this ctor and seeing what breaks, if anything. 

From Uli:
The internal encoding is *not* UCS4 on linux.  UCS4 includes a certain byte
order (big endian) which would mean a lot of additional work if it
would be the encoding.  WCHAR_T is UCS4 but with the native byte
order.
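
To make Uli's distinction concrete, here is a small sketch that lays out 
one code point both ways: big endian, as he describes UCS4, and in native 
byte order, as he describes WCHAR_T. The type alias and function names are 
invented for the example.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using Bytes4 = std::array<unsigned char, 4>;

// Big-endian layout: most significant byte first, on every machine.
Bytes4 to_ucs4_be(std::uint32_t cp)
{
  return {{ static_cast<unsigned char>((cp >> 24) & 0xFF),
            static_cast<unsigned char>((cp >> 16) & 0xFF),
            static_cast<unsigned char>((cp >> 8) & 0xFF),
            static_cast<unsigned char>(cp & 0xFF) }};
}

// Native layout: whatever order this machine stores a 32-bit value in.
Bytes4 to_native(std::uint32_t cp)
{
  Bytes4 out;
  std::memcpy(out.data(), &cp, sizeof cp);
  return out;
}

// True when the native layout happens to match the big-endian one.
bool native_is_big_endian()
{
  return to_native(1)[3] == 1;
}
```

On a little-endian machine such as x86 the two layouts differ, so treating 
the internal data as big endian would require a byte swap on every 
character. That is the additional work Uli refers to.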


So, in summary, it looks like this is the deal, even if this directly 
contradicts my earlier email.

1) UCS4, UCS2 need a byte-order marker (bom) to indicate endianness. 
If there is no bom, the encoding is assumed to use the native byte order. 
This varies per machine, as has been found out with the x86/powerpc 
divergence.

2) UCS4-BE, UCS2-BE should not need a bom to indicate endianness, as the 
byte order is explicitly specified in the name. 
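
A minimal sketch of the rule in point 1, for the four-byte case only: look 
for a bom (U+FEFF) at the start of the buffer, and fall back to native byte 
order if none is found. The enum and function names are hypothetical, and 
this is not how the library's codecvt code is written.

```cpp
#include <cstddef>

enum class ByteOrder { Big, Little, Native };

// Inspect the first four bytes of a UCS4 buffer for a byte-order marker.
// U+FEFF serialized big endian is 00 00 FE FF, little endian FF FE 00 00.
ByteOrder detect_ucs4_order(const unsigned char* data, std::size_t size)
{
  if (size >= 4)
    {
      if (data[0] == 0x00 && data[1] == 0x00
          && data[2] == 0xFE && data[3] == 0xFF)
        return ByteOrder::Big;
      if (data[0] == 0xFF && data[1] == 0xFE
          && data[2] == 0x00 && data[3] == 0x00)
        return ByteOrder::Little;
    }
  // No bom: assume native byte order, as in point 1.
  return ByteOrder::Native;
}
```

For an encoding named as in point 2, this check would be skipped entirely, 
because the name already fixes the byte order.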

I hope this helps explain the situation. If I'm wrong, please let me know
and I'll try to confuse the situation some more. I realize this sounds
really complicated at the moment. Writing docs that explain this is on my
TODO list for May. 

-benjamin

