[Bug c/15500] New: gcc ignores locale when converting wide string literals to wchar_t strings
Markus dot Kuhn at cl dot cam dot ac dot uk
gcc-bugzilla@gcc.gnu.org
Tue May 18 11:19:00 GMT 2004
The gcc-3.4.0 manual says in
http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Environment-Variables.html#Environment%20Variables
under the second(!) entry for "LANG":
If LANG is not defined, or if it has some other value [than C-JIS,
C-SJIS or C-EUCJP], then the compiler will use mblen and mbtowc as
defined by the default locale to recognize and translate multibyte
characters.
When I try to compile a *.c source file that is encoded in UTF-8 and that
contains the wide string literal L"Schöne Grüße", then gcc interprets that
literal as if the locale were using ISO 8859-1, even though I have called gcc
with LANG=en_GB.UTF-8 and LC_ALL=en_GB.UTF-8. The resulting wchar_t * string in
the object file consists of the byte sequence
000005e0 53 00 00 00 63 00 00 00 68 00 00 00 c3 00 00 00 S...c...h.......
000005f0 b6 00 00 00 6e 00 00 00 65 00 00 00 20 00 00 00 ....n...e... ...
00000600 47 00 00 00 72 00 00 00 c3 00 00 00 bc 00 00 00 G...r...........
00000610 c3 00 00 00 9f 00 00 00 65 00 00 00 00 00 00 00 ........e.......
that is, TWO 32-bit wchar_t words have been assigned to each of the non-ASCII
UTF-8 sequences instead of one, as one would expect if the compiler had
honoured the locale.
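
To make the difference concrete, here is a minimal sketch that is independent
of any compiler or locale setup. The helper names widen_bytewise and
widen_utf8 and the constant GREETING_UTF8 are hypothetical, introduced only
for illustration; the sketch assumes a wchar_t wide enough to hold the code
points involved (32 bits in the dump above) and well-formed UTF-8 input. The
first helper reproduces the observed behaviour (one wchar_t per byte, as for
ISO 8859-1), the second the expected one (one wchar_t per UTF-8 sequence).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* The bytes of "Schöne Grüße" in UTF-8, written with escapes so that the
   example does not depend on the encoding of this source file. The literal
   is split before the final "e" so that 'e' is not parsed as a hex digit. */
static const char GREETING_UTF8[] = "Sch\xc3\xb6ne Gr\xc3\xbc\xc3\x9f" "e";

/* Observed behaviour: widen every byte to one wchar_t (ISO 8859-1 view).
   Returns the number of wide characters written, excluding the terminator. */
size_t widen_bytewise(const char *src, wchar_t *dst, size_t dst_len)
{
    assert(src != NULL && dst != NULL && dst_len > 0);
    size_t n = 0;
    while (src[n] != '\0' && n + 1 < dst_len) {
        dst[n] = (wchar_t)(unsigned char)src[n];
        n++;
    }
    dst[n] = L'\0';
    return n;
}

/* Expected behaviour: decode each UTF-8 sequence to one code point.
   Stops at the first invalid lead byte or truncated sequence. */
size_t widen_utf8(const char *src, wchar_t *dst, size_t dst_len)
{
    assert(src != NULL && dst != NULL && dst_len > 0);
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;
    while (*p != 0 && n + 1 < dst_len) {
        unsigned long cp;
        int extra;  /* number of continuation bytes still expected */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else break;                   /* invalid lead byte */
        p++;
        while (extra > 0 && (*p & 0xC0) == 0x80) {
            cp = (cp << 6) | (*p & 0x3F);
            p++;
            extra--;
        }
        if (extra != 0)
            break;                    /* truncated sequence */
        dst[n++] = (wchar_t)cp;
    }
    dst[n] = L'\0';
    return n;
}
```

Applied to GREETING_UTF8, widen_bytewise yields 15 wide characters, matching
the dump above, whereas widen_utf8 yields the 12 characters of the literal.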
(Tested on SuSE Linux 9.1 (gcc-3.3.3) for ix86, but judging from the manual,
this has not been fixed in 3.4.0.)
It appears as if gcc ignores the locale when converting L"..." literals into
wchar_t * strings.
The special handling of C-JIS, C-SJIS and C-EUCJP looks like an anachronism.
Shouldn't this ad-hoc conversion hack for merely three Japanese encodings be
replaced by a proper LC_CTYPE-dependent call to mbtowc(), so that wide string
literals finally work for all character sets supported by the C library? GCC
should not contain any special code for just three Japanese encodings if all
this can be handled properly via libc and its multi-byte encoding locale
support.
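
As a rough illustration of the kind of LC_CTYPE-dependent conversion meant
here, the following sketch converts a multibyte string with mbtowc() under a
given locale. It is not gcc code: the function name widen_with_locale is
hypothetical, and whether a particular locale name (such as en_GB.UTF-8) is
installed depends on the system. The sketch changes the global LC_CTYPE
setting for the duration of the call and then restores it, so it is not
thread-safe.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string src to wide characters
   according to the LC_CTYPE category of the locale named locale_name.
   Returns the number of wide characters written (excluding the terminator),
   or (size_t)-1 if the locale is unavailable or src is not valid in it. */
size_t widen_with_locale(const char *locale_name, const char *src,
                         wchar_t *dst, size_t dst_len)
{
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);

    if (dst_len == 0 || current == NULL || strlen(current) >= sizeof saved)
        return (size_t)-1;
    strcpy(saved, current);           /* copy: later calls may overwrite it */

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;            /* requested locale not available */

    size_t result = 0;
    size_t remaining = strlen(src);
    (void)mbtowc(NULL, NULL, 0);      /* reset any conversion shift state */

    while (remaining > 0 && result + 1 < dst_len) {
        int len = mbtowc(&dst[result], src, remaining);
        if (len <= 0) {               /* invalid or incomplete sequence */
            result = (size_t)-1;
            break;
        }
        src += len;
        remaining -= (size_t)len;
        result++;
    }
    if (result != (size_t)-1)
        dst[result] = L'\0';

    setlocale(LC_CTYPE, saved);       /* restore the caller's locale */
    return result;
}
```

With a UTF-8 locale, each non-ASCII sequence in the literal from this report
would then map to a single wchar_t; with a Shift_JIS or EUC-JP locale, the
same loop would handle those encodings without any special-case code.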
--
Summary: gcc ignores locale when converting wide string literals
to wchar_t strings
Product: gcc
Version: 3.4.0
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: Markus dot Kuhn at cl dot cam dot ac dot uk
CC: gcc-bugs at gcc dot gnu dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15500