Bug#: 15500
Status: RESOLVED
Resolution: FIXED
Assigned To: Tom Tromey <tromey@gcc.gnu.org>
Reporter: Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk>
Summary: gcc ignores locale when converting wide string literals to wchar_t strings

Attachment: utest.c
Description: Trivial source code example to reproduce problem
Type: text/plain; charset=UTF-8
Created: 2004-05-17 19:31
Size: 197 bytes


Description:   Last confirmed: 2008-04-18 17:45 Opened: 2004-05-17 19:22
The gcc-3.4.0 manual says in

http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Environment-Variables.html#Environment%20Variables

under the second(!) entry for "LANG":

    If LANG is not defined, or if it has some other value [than C-JIS,
    C-SJIS or C-EUCJP], then the compiler will use mblen and mbtowc as
    defined by the default locale to recognize and translate multibyte
    characters. 

When I try to compile a *.c source file that is encoded in UTF-8 and that
contains the wide string literal L"Schöne Grüße", then gcc interprets that
literal as if the locale were using ISO 8859-1, even though I have called gcc
with LANG=en_GB.UTF-8 and LC_ALL=en_GB.UTF-8. The resulting wchar_t * string in
the object file consists of the byte sequence

000005e0  53 00 00 00 63 00 00 00  68 00 00 00 c3 00 00 00  S...c...h.......
000005f0  b6 00 00 00 6e 00 00 00  65 00 00 00 20 00 00 00  ....n...e... ...
00000600  47 00 00 00 72 00 00 00  c3 00 00 00 bc 00 00 00  G...r...........
00000610  c3 00 00 00 9f 00 00 00  65 00 00 00 00 00 00 00  ........e.......

that is, TWO 32-bit wchar_t words have been assigned to each non-ASCII UTF-8
sequence, rather than the single word one would expect if the compiler had
honoured the locale.

(Tested on SuSE Linux 9.1 (gcc-3.3.3) for ix86, but judging from the manual,
this has not been fixed in 3.4.0.)

It appears as if gcc ignores the locale when converting L"..." literals into
wchar_t * strings.

The special handling of C-JIS, C-SJIS and C-EUCJP looks like an anachronism.
Shouldn't this ad-hoc conversion hack for just three Japanese encodings be
replaced by a proper LC_CTYPE-dependent call to mbtowc(), so that wide string
literals finally work for all character sets supported by the C library? GCC
should not contain special code for three Japanese encodings if all of this can
be handled properly via libc and its multi-byte locale support.

------- Comment #1 From Andrew Pinski 2004-05-17 19:29 -------
Read <http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Preprocessor-Options.html>
and -fwide-exec-charset.  Though this is fixed for 3.4.0 by a different means.

------- Comment #2 From Andrew Pinski 2004-05-17 19:29 -------
I said fixed.

------- Comment #3 From Markus Kuhn 2004-05-17 19:31 -------
Created an attachment (id=6319)
Trivial source code example to reproduce problem

The attached UTF-8 source code, if compiled and executed under
LC_ALL=en_GB.UTF-8, produces garbled text as output. The resulting output as a
hex dump is

00000000  53 63 68 c3 83 c2 b6 6e  65 20 47 72 c3 83 c2 bc  Sch....ne Gr....
00000010  c3 83 c2 9f 65 0a                                 ....e.

and looks like the UTF-8 string "Schöne Grüße" has gone erroneously through an
ISO 8859-1 to UTF-8 conversion step somewhere along the line (or more likely an
ISO 8859-1 -> UTF-32 -> UTF-8 conversion chain, where it should have been a
UTF-8 -> UTF-32 -> UTF-8 conversion chain instead).

I tested this under both SuSE Linux 8.2 (gcc-3.3.1, i586-suse-linux) and SuSE
Linux 9.1 (gcc 3.3.3, x86_64-suse-linux).

------- Comment #4 From Joseph S. Myers 2004-05-17 19:33 -------
Subject: Re:  gcc ignores locale when converting wide
 string literals to wchar_t strings

On Mon, 17 May 2004, pinskia at gcc dot gnu dot org wrote:

> Read <http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Preprocessor-Options.html>
> and -fwide-exec-charset.  Though this is fixed for 3.4.0 by a different means.

The compiler is fixed, but the documentation pointed out in this bug
report is out-of-date and misleading (probably relating to the old
--enable-c-mbchar).  It needs to be changed to point to the relevant
documentation you refer to before this bug report can be considered
properly fixed.


------- Comment #5 From Andrew Pinski 2004-05-17 19:36 -------
Reopening for documentation problems

------- Comment #6 From Andrew Pinski 2004-05-17 19:36 -------
Confirmed.

------- Comment #7 From Tom Tromey 2008-04-18 17:45 -------
Testing a documentation patch.

------- Comment #8 From Tom Tromey 2008-04-18 17:54 -------
Fixed on trunk.

------- Comment #9 From Tom Tromey 2008-04-18 17:54 -------
Subject: Bug 15500

Author: tromey
Date: Fri Apr 18 17:53:34 2008
New Revision: 134441

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=134441
Log:
        PR libcpp/15500:
        * doc/cpp.texi (Implementation-defined behavior): Mention
        -finput-charset.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/cpp.texi

------- Comment #10 From Tom Tromey 2008-05-09 23:05 -------
On further reflection I think this is very minor and I'm unlikely
to back-port the fix.  So, I am closing this.
