[Bug c/15500] New: gcc ignores locale when converting wide string literals to wchar_t strings

Markus dot Kuhn at cl dot cam dot ac dot uk gcc-bugzilla@gcc.gnu.org
Tue May 18 11:19:00 GMT 2004


The gcc-3.4.0 manual says in

http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Environment-Variables.html#Environment%20Variables

under the second(!) entry for "LANG":

    If LANG is not defined, or if it has some other value [than C-JIS,
    C-SJIS or C-EUCJP], then the compiler will use mblen and mbtowc as
    defined by the default locale to recognize and translate multibyte
    characters. 

When I try to compile a *.c source file that is encoded in UTF-8 and that
contains the wide string literal L"Schöne Grüße", then gcc interprets that
literal as if the locale were using ISO 8859-1, even though I have called gcc
with LANG=en_GB.UTF-8 and LC_ALL=en_GB.UTF-8. The resulting wchar_t * string in
the object file consists of the byte sequence

000005e0  53 00 00 00 63 00 00 00  68 00 00 00 c3 00 00 00  S...c...h.......
000005f0  b6 00 00 00 6e 00 00 00  65 00 00 00 20 00 00 00  ....n...e... ...
00000600  47 00 00 00 72 00 00 00  c3 00 00 00 bc 00 00 00  G...r...........
00000610  c3 00 00 00 9f 00 00 00  65 00 00 00 00 00 00 00  ........e.......

that is, TWO 32-bit wchar_t words have been assigned to each of the non-ASCII
UTF-8 sequences instead of one, as one would expect if the compiler had
honoured the locale.
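
To make the difference concrete, here is a minimal sketch that contrasts the
two conversions on the same UTF-8 bytes. The helper names (sample_utf8,
widen_bytewise, widen_utf8_2byte) are hypothetical and are not taken from gcc.
The decoder handles only 1- and 2-byte UTF-8 sequences, which is all this
example needs, and it assumes that wchar_t values are Unicode code points.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* "Schöne Grüße" as UTF-8 bytes. The literal is split before the final "e"
   so that the 'e' is not parsed as part of the \x9f escape. */
static const char sample_utf8[] = "Sch\xc3\xb6ne Gr\xc3\xbc\xc3\x9f" "e";

/* Widen every byte to its own wchar_t: the behaviour seen in the dump. */
static size_t widen_bytewise(const char *s, wchar_t *out)
{
    size_t n = 0;

    assert(s != NULL && out != NULL);
    for (; *s != '\0'; s++)
        out[n++] = (wchar_t)(unsigned char)*s;
    out[n] = L'\0';
    return n;
}

/* Decode 1- and 2-byte UTF-8 sequences into one wchar_t each.
   Illustration only: longer sequences are not handled and input is
   not validated beyond the continuation-byte check. */
static size_t widen_utf8_2byte(const char *s, wchar_t *out)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    assert(s != NULL && out != NULL);
    while (*p != 0) {
        if ((p[0] & 0xe0) == 0xc0 && (p[1] & 0xc0) == 0x80) {
            /* 110xxxxx 10yyyyyy -> xxxxxyyyyyy */
            out[n++] = (wchar_t)(((p[0] & 0x1f) << 6) | (p[1] & 0x3f));
            p += 2;
        } else {
            out[n++] = (wchar_t)*p++;
        }
    }
    out[n] = L'\0';
    return n;
}
```

For the sample string, widen_bytewise() yields 15 wide characters with 0xc3
and 0xb6 stored separately, matching the dump above, while widen_utf8_2byte()
yields 12, with c3 b6 combined into 0xf6 (ö), c3 bc into 0xfc (ü) and c3 9f
into 0xdf (ß).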

(Tested on SuSE Linux 9.1 (gcc-3.3.3) for ix86, but judging from the manual,
this has not been fixed in 3.4.0.)

It appears as if gcc ignores the locale when converting L"..." literals into
wchar_t * strings.

The special handling of C-JIS, C-SJIS and C-EUCJP looks like an anachronism.
Shouldn't this ad-hoc conversion hack for merely three Japanese encodings be
replaced by a proper LC_CTYPE-dependent call to mbtowc(), so that wide string
literals finally work for all character sets supported by the C library? GCC
should not contain any special code for just three Japanese encodings if all
of this can be handled properly via libc and its multi-byte encoding locale
support.
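
As a rough sketch of what such a locale-driven conversion could look like,
the following uses only setlocale() and mbtowc() from the C library. The
function names literal_to_wide and use_utf8_locale are hypothetical, this is
not how gcc is structured internally, and the list of locale names is only an
assumption about what might be installed on a given system.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Try a few UTF-8 locale names (assumed, may not be installed).
   Returns 1 if one of them could be selected for LC_CTYPE, else 0. */
static int use_utf8_locale(void)
{
    static const char *const names[] = {
        "en_GB.UTF-8", "C.UTF-8", "en_US.UTF-8"
    };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Convert the multibyte string src to wide characters according to the
   current LC_CTYPE locale. Returns the number of wide characters stored
   (excluding the terminator), or (size_t)-1 on an invalid sequence or if
   dst is too small. */
static size_t literal_to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    size_t left, n = 0;

    assert(src != NULL && dst != NULL);
    if (dst_len == 0)
        return (size_t)-1;

    left = strlen(src);
    mbtowc(NULL, NULL, 0);          /* reset any shift state */
    while (left > 0) {
        wchar_t wc;
        int used = mbtowc(&wc, src, left);

        if (used <= 0 || n + 1 >= dst_len)
            return (size_t)-1;      /* bad sequence or no room left */
        dst[n++] = wc;
        src += used;
        left -= (size_t)used;
    }
    dst[n] = L'\0';
    return n;
}
```

With a UTF-8 locale selected, the bytes of "Schöne" should come out as six
wide characters rather than seven. The same loop would work unchanged for any
other multibyte encoding that the C library's locales support, which is the
point of the suggestion above.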

-- 
           Summary: gcc ignores locale when converting wide string literals
                    to wchar_t strings
           Product: gcc
           Version: 3.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: Markus dot Kuhn at cl dot cam dot ac dot uk
                CC: gcc-bugs at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15500


