This is GCC Bugzilla Version 2.20+
The gcc-3.4.0 manual says in http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Environment-Variables.html#Environment%20Variables under the second(!) entry for "LANG":

    If LANG is not defined, or if it has some other value [than C-JIS, C-SJIS or C-EUCJP], then the compiler will use mblen and mbtowc as defined by the default locale to recognize and translate multibyte characters.

When I try to compile a *.c source file that is encoded in UTF-8 and that contains the wide string literal L"Schöne Grüße", gcc interprets that literal as if the locale were using ISO 8859-1, even though I have called gcc with LANG=en_GB.UTF-8 and LC_ALL=en_GB.UTF-8. The resulting wchar_t * string in the object file consists of the byte sequence

    000005e0  53 00 00 00 63 00 00 00  68 00 00 00 c3 00 00 00  S...c...h.......
    000005f0  b6 00 00 00 6e 00 00 00  65 00 00 00 20 00 00 00  ....n...e... ...
    00000600  47 00 00 00 72 00 00 00  c3 00 00 00 bc 00 00 00  G...r...........
    00000610  c3 00 00 00 9f 00 00 00  65 00 00 00 00 00 00 00  ........e.......

That is, TWO 32-bit wchar_t words have been assigned to each of the non-ASCII UTF-8 sequences, instead of one, as one would expect if the compiler had honoured the locale. (Tested on SuSE Linux 9.1 (gcc-3.3.3) for ix86, but judging from the manual, this has not been fixed in 3.4.0.)

It appears as if gcc ignores the locale when converting L"..." literals into wchar_t * strings. The special handling of C-JIS, C-SJIS and C-EUCJP looks like an anachronism. Shouldn't this ad-hoc conversion hack for merely three Japanese encodings be replaced by a proper LC_CTYPE-dependent call to mbtowc(), such that wide string literals finally work for all character sets supported by the C library? GCC should not contain any special code for just three Japanese encodings if all this can be handled properly via libc and its multi-byte encoding locale support.
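To make the difference between the two interpretations concrete, here is a minimal sketch in C. It is not GCC's code: the helper names `widen_bytewise` and `widen_utf8` and the array `UTF8_SAMPLE` are made up for illustration, and the UTF-8 decoder is deliberately simplified (it does not reject overlong forms or surrogates). The first helper reproduces the pattern seen in the dump above (one wide unit per input byte); the second shows the result one would expect if the UTF-8 encoding were honoured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The UTF-8 bytes of "Schöne Grüße". The literal is split before the final
   "e" so that it is not absorbed into the preceding \x9f escape. */
static const unsigned char UTF8_SAMPLE[] = "Sch\xc3\xb6ne Gr\xc3\xbc\xc3\x9f" "e";

/* Hypothetical helper: widen every input byte to one 32-bit unit, i.e.
   treat the input as if it were ISO 8859-1. Returns the unit count. */
size_t widen_bytewise(const unsigned char *src, size_t len, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    return len;
}

/* Hypothetical helper: decode UTF-8 sequences of 1 to 4 bytes into code
   points. Simplified: overlong forms and surrogates are not rejected.
   Returns the number of code points, or (size_t)-1 on malformed input. */
size_t widen_utf8(const unsigned char *src, size_t len, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned char b = src[i];
        uint32_t cp;
        size_t extra;                       /* continuation bytes expected */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;             /* stray continuation byte etc. */
        if (extra > len - i - 1)
            return (size_t)-1;              /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = src[i + k];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (c & 0x3F);
        }
        dst[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

Called on the 15 bytes of `UTF8_SAMPLE` (excluding the terminating NUL), `widen_bytewise` yields 15 units, including the separate 0xC3 and 0xB6 seen in the dump, while `widen_utf8` yields 12 code points with U+00F6 in place of that pair.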
Read <http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Preprocessor-Options.html> and -fwide-exec-charset. Though this is fixed for 3.4.0 by a different means.
I said fixed.
Created an attachment (id=6319)
Trivial source code example to reproduce problem

The attached UTF-8 source code, if compiled and executed under LC_ALL=en_GB.UTF-8, produces garbled text as output. The resulting output as a hex dump is

    00000000  53 63 68 c3 83 c2 b6 6e  65 20 47 72 c3 83 c2 bc  Sch....ne Gr....
    00000010  c3 83 c2 9f 65 0a                                 ....e.

and looks like the UTF-8 string "Schöne Grüße" has erroneously gone through an ISO 8859-1 to UTF-8 conversion step somewhere along the line (or more likely an ISO 8859-1 -> UTF-32 -> UTF-8 conversion chain, where it should have been a UTF-8 -> UTF-32 -> UTF-8 conversion chain instead).

I tested this under both SuSE Linux 8.2 (gcc-3.3.1, i586-suse-linux) and SuSE Linux 9.1 (gcc 3.3.3, x86_64-suse-linux).
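The suspected ISO 8859-1 -> UTF-32 -> UTF-8 chain can be checked by hand with a small sketch. This is not the attached test case (which is not reproduced in this thread); the function names `encode_utf8` and `reencode_as_latin1` are hypothetical and exist only to illustrate the chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (assumed <= U+10FFFF) as UTF-8.
   Returns the number of bytes written to out (1 to 4). */
size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* The suspected faulty chain: take each input byte as an ISO 8859-1
   character (code point == byte value), then encode it as UTF-8.
   dst must have room for 2 * len bytes. Returns the bytes written. */
size_t reencode_as_latin1(const unsigned char *src, size_t len,
                          unsigned char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += encode_utf8(src[i], dst + n);
    return n;
}
```

Feeding the two UTF-8 bytes of "ö" (c3 b6) through `reencode_as_latin1` gives c3 83 c2 b6, which is exactly the four-byte group in the dump above. That is consistent with the conversion chain guessed in the comment, though it does not show where in the toolchain the step happens.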
Subject: Re: gcc ignores locale when converting wide string literals to wchar_t strings

On Mon, 17 May 2004, pinskia at gcc dot gnu dot org wrote:

> Read <http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Preprocessor-Options.html> and -fwide-exec-charset.
> Though this is fixed for 3.4.0 by a different means.

The compiler is fixed, but the documentation pointed out in this bug report is out-of-date and misleading (probably relating to the old --enable-c-mbchar). It needs to be changed to point to the relevant documentation you refer to before this bug report can be considered properly fixed.
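For readers who cannot control how the compiler is invoked, one way to sidestep the source-encoding question altogether is to spell the non-ASCII characters as universal character names. The sketch below assumes a C99-capable compiler; the names `GREETING`, `greeting_length` and `check_greeting` are illustrative only. With GCC 3.4 or later, the alternative is to keep the UTF-8 source as it is and state its encoding explicitly, for example `gcc -finput-charset=UTF-8 file.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* "Schöne Grüße" written with universal character names, so the wide
   literal does not depend on the encoding of the source file or on the
   locale in effect when the compiler runs. */
static const wchar_t GREETING[] = L"Sch\u00f6ne Gr\u00fc\u00dfe";

/* Number of wide characters in the greeting, excluding the terminator. */
size_t greeting_length(void)
{
    return wcslen(GREETING);
}

/* Self-check: one wchar_t per character, not one per UTF-8 byte. */
void check_greeting(void)
{
    assert(greeting_length() == 12);
    assert(GREETING[3] == 0x00F6);   /* ö */
    assert(GREETING[9] == 0x00FC);   /* ü */
    assert(GREETING[10] == 0x00DF);  /* ß */
}
```

All three characters are in the Basic Multilingual Plane, so the count of 12 should hold whether wchar_t is 16 or 32 bits wide.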
Reopening for documentation problems.
Confirmed.
Testing a documentation patch.
Fixed on trunk.
Subject: Bug 15500

Author: tromey
Date: Fri Apr 18 17:53:34 2008
New Revision: 134441

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=134441
Log:
PR libcpp/15500:
* doc/cpp.texi (Implementation-defined behavior): Mention -finput-charset.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/cpp.texi
On further reflection I think this is very minor and I'm unlikely to back-port the fix. So, I am closing this.