This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



why using wchar_t?


I am not sure if this is a bug or if this is the way it's supposed to be,
but GCC does behave in a way that I, for one, did not expect with regard
to the handling of the wchar_t data type.

On my machine (an IBM PC running Linux) wchar_t is defined to be a
4-byte (32-bit) data type, so it can in principle hold any UCS-4
character, including the full set of Unicode characters in the range
0..0x10ffff. However, when defining a wide character string holding
non-ASCII characters, this is not taken advantage of. Take, for example,
a declaration like the one below in a source file:

        const wchar_t str[] = L"Lørdag";

The problematic character here is LATIN SMALL LETTER O WITH STROKE,
encoded as U+00f8 in Unicode. It is stored in the string using TWO
wchar_t elements containing the UTF-8 encoding of the character, i.e.
the string is equivalent to the following declaration:

const wchar_t str[] = { L'L', wchar_t(0xc3), wchar_t(0xb8), L'r',
        L'd', L'a', L'g', L'\0' };

So here we use 8 32-bit entries to store 8 numbers that all fit in a
byte. On the good side, conversion to UTF-8 is immediate:

p = buf;                                    /* char *p: narrow destination    */
wp = wbuf;                                  /* const wchar_t *wp: wide source */
while ((*p++ = (unsigned char)*wp++) != 0)
        ;   /* narrowing copy; works only because wbuf already holds UTF-8 bytes */

will successfully convert a wchar_t buffer wbuf into a char buffer buf.

However, it seems kinda pointless to have wchar_t when you never use it
as a wchar_t.

Of course, this is probably meant to be locale dependent, and for
narrow characters being locale aware is very important. But I thought
one of the points of using wide characters was that text is converted to
Unicode or UCS-4 or something similar, is locale independent from that
point on, and only involves the locale again when converting back to
narrow characters and so on.
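
For comparison, the runtime conversion functions do behave the way I
expected. A small sketch like the one below (my own test code; it
assumes a UTF-8 locale such as en_US.UTF-8 is installed, which may not
hold on every system) puts the single code point 0x00f8 into one wchar_t
element:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *narrow = "L\xc3\xb8rdag";   /* the UTF-8 bytes for "Lørdag" */
    wchar_t wide[16];
    size_t n;

    /* assumes a UTF-8 locale is installed; the name may differ per system */
    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
        return 1;

    n = mbstowcs(wide, narrow, 16);         /* narrow -> wide, locale driven */
    if (n == (size_t)-1)
        return 1;

    /* I would expect 6 characters with wide[1] == 0x00f8,
       not two elements holding the raw UTF-8 bytes 0xc3 and 0xb8 */
    printf("%lu characters, wide[1] = 0x%04lx\n",
           (unsigned long)n, (unsigned long)wide[1]);
    return 0;
}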

So, the question is: is this the way it's supposed to be, or is
something set up very wrong somewhere in my settings? I am very doubtful
that reading a source code file should be locale dependent. After all,
if a Chinese programmer writes source code with wide character string
literals containing Chinese characters, and I then download the file and
compile it, I would expect Chinese characters to appear in my output,
provided my output program (xterm or similar) is capable of displaying
them.

In other words: the source code file MUST be independent of locale, and
the compiler must use its own locale (the C locale, probably) while
reading source files. So I would expect it to translate the UTF-8
sequences in the original source file into sensible Unicode (*)
character data in the wchar_t string.
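
Concretely, what I would have expected the declaration above to compile
to is seven entries, with the ø stored as a single code point:

/* what I would expect L"Lørdag" to produce: 7 entries, not 8 */
const wchar_t expected[] = { L'L', 0x00f8, L'r', L'd', L'a', L'g', L'\0' };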

(*) Since ISO-10646 and Unicode are synchronized, and neither of them
will go beyond the ranges 0..0xd7ff and 0xe000..0x10ffff, it makes sense
that wchar_t should follow one of the following two possible
definitions:

if (sizeof(wchar_t) == 4) then a wchar_t is a data object capable of
holding one character in the above mentioned set.

if (sizeof(wchar_t) == 2) (my Cygwin version of gcc has this) then a
wchar_t is a data object capable of holding one character in the range
0..0xd7ff or 0xe000..0xfffd, while two such objects together can hold a
character with code point >= 0x10000 using UTF-16 encoding, as sketched
below.
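
To make the 2-byte case concrete, the UTF-16 split of a code point above
0xffff is plain arithmetic (just a sketch, with 0x10400 as an arbitrary
example code point):

unsigned long  c  = 0x10400UL;              /* arbitrary code point >= 0x10000 */
unsigned long  v  = c - 0x10000UL;          /* 20 bits left to encode          */
unsigned short hi = (unsigned short)(0xd800 + (v >> 10));    /* high surrogate */
unsigned short lo = (unsigned short)(0xdc00 + (v & 0x3ff));  /* low surrogate  */
/* a 2-byte wchar_t string would store this character as the pair { hi, lo } */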

You could argue that also when sizeof(wchar_t) == 4 you should be able
to handle UTF-16 encoded input, but personally I think it is cleaner to
say that such surrogate codes should be resolved away while decoding
UTF-16 into a wchar_t buffer.

In general, handling of wchar_t should be locale INDEPENDENT; the locale
should only enter the picture when converting to multibyte character
sequences that are not Unicode related (those vary with locale). In
particular, conversion from wchar_t to UTF-8 should be locale
independent, even if the locale happens to specify UTF-8 as its
encoding.
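
As an illustration of why no locale is needed in that direction, here is
a sketch of a plain UCS-4 to UTF-8 encoder (my own throwaway code, not
anything taken from glibc):

/* encode one UCS-4 code point c in the range 0..0x10ffff (excluding
   surrogates) into UTF-8; returns the number of bytes written to out */
int ucs4_to_utf8(unsigned long c, unsigned char *out)
{
    if (c < 0x80) {                         /* 1 byte: plain ASCII     */
        out[0] = (unsigned char)c;
        return 1;
    } else if (c < 0x800) {                 /* 2 bytes: U+0080..U+07FF */
        out[0] = (unsigned char)(0xc0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3f));
        return 2;
    } else if (c < 0x10000) {               /* 3 bytes: U+0800..U+FFFF */
        out[0] = (unsigned char)(0xe0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (c & 0x3f));
        return 3;
    } else {                                /* 4 bytes: up to U+10FFFF */
        out[0] = (unsigned char)(0xf0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (c & 0x3f));
        return 4;
    }
}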

So, you could call the behavior of the current gcc (version 3.2) and
glibc (version 2.2) a bug, but I want to hear a second opinion before I
claim it as such.

Further, I am unsure whether you would call this a bug in the compiler,
a bug in glibc, a bug in my editor (emacs 21.2.1), or a fault of the
programmer. The compiler itself, when using the C locale, should detect
that the input source file contains characters it does not accept, for
example if it does not recognize UTF-8 encoding but UTF-8 appears in the
input stream. If the compiler is SUPPOSED to understand UTF-8, it should
recognize the input data as UTF-8 and convert it to wchar_t when storing
it in the string literal.

Another possible culprit could be glibc, which might fail to recognize
the UTF-8 sequence and convert the characters to wchar_t; however, this
is unlikely, since the compiler probably doesn't expect wchar_t on
input, even when reading data for a wchar_t literal.

However, making Unicode the base character set for source code files (so
that the compiler reads wchar_t characters from the source file) would
be considered a good thing by some, since it would allow people to
define variables whose names contain non-ASCII characters.

A third possible culprit could be the editor, which stores non-ASCII
characters in UTF-8 encoding but doesn't properly mark the file as
UTF-8, for example by inserting a BOM at the beginning of the file.

A fourth culprit could of course be the programmer, who "should have
known" that writing string literals with non-ASCII characters is
non-portable and should rather be done using an escape syntax such as
\x, \u or \U, so that the string L"Lørdag" is replaced by
L"L\x00f8rdag" or some such; see the sketch below.
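
For completeness, the escaped forms I have in mind would look like this
(the \u form needs a compiler that accepts universal character names,
which I have not verified for gcc 3.2, so take that line as an
assumption):

const wchar_t str_hex[] = L"L\x00f8rdag";   /* hex escape, pure ASCII source      */
const wchar_t str_ucn[] = L"L\u00f8rdag";   /* universal character name (C99/C++) */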

Above all, I would like to hear from developers of GCC what their view
on this is.

Alf