This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug other/61896] Wrong documentation for -finput-charset

From: "tom at honermann dot net" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Wed, 25 May 2016 14:03:51 +0000
Subject: [Bug other/61896] Wrong documentation for -finput-charset
Auto-submitted: auto-generated
References: <bug-61896-4 at http dot gcc dot gnu dot org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61896

Tom Honermann <tom at honermann dot net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tom at honermann dot net

--- Comment #1 from Tom Honermann <tom at honermann dot net> ---
Created attachment 38565
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38565&action=edit
Source file with ill-formed UTF-8 code unit sequences

GCC's incorrect documentation regarding the default input character set
continues to be a source of confusion.  See the discussion on the C++
std-proposals list at the following link (search for 'locale').

https://groups.google.com/a/isocpp.org/forum/#!searchin/std-proposals/Draft$20proposal$20of$20file$20string/std-proposals/tKioR8OUiAw/85NCUmojBwAJ

The current gcc 6.1.0 documentation for -finput-charset can be found here:

https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/Preprocessor-Options.html#Preprocessor-Options

The relevant text is:

  -finput-charset=charset
    Set the input character set, used for translation from the character set of
    the input file to the source character set used by GCC. If the locale does
    not specify, or GCC cannot get this information from the locale, the
    default is UTF-8. This can be overridden by either the locale or this
    command-line option. Currently the command-line option takes precedence if
    there's a conflict. charset can be any encoding supported by the system's
    iconv library routine. 

The patch proposed in attachment 33179 in comment 0 is an improvement in that
it removes the incorrect references to use of the current locale in determining
the input character set.  However, the proposed documentation is still
incorrect, or at least imprecise, with regard to use of UTF-8 as the default
input character set since gcc does not reject (or even emit a warning for)
ill-formed UTF-8 text.

An example follows.  The attached test code (attached to prevent mutation of
the contents) contains ill-formed UTF-8 code unit sequences.  Compilation with
gcc 6.1.0 (on a Linux system) succeeds despite the ill-formed input.

# To demonstrate that the text is ill-formed:
$ iconv -f utf-8 -t utf-8 t.cpp
#include <cstdio>
int main()
{
    printf("narrow string: (well-formed UTF-8)\n");
    for (unsigned char c : "£") { // 0xC2 0xA3
        printf("  0x%X\n", (unsigned int)c);
    }
    printf("narrow string: (ill-formed UTF-8)\n");
    for (unsigned char c : "iconv: illegal input sequence at position 261

$ g++ --version
g++ (GCC) 6.1.0
...

$ g++ -Wall -Wextra -pedantic t.cpp -o t; echo $?
0

$ ./t
narrow string: (well-formed UTF-8)
  0xC2
  0xA3
  0x0
narrow string: (ill-formed UTF-8)
  0xA3
  0x0
narrow string (hex escape):
  0xA3
  0x0
UTF-8 string: (well-formed UTF-8)
  0xC2
  0xA3
  0x0
UTF-8 string: (ill-formed UTF-8)
  0xA3
  0x0
UTF-8 string (hex escape):
  0xA3
  0x0

As shown above, ill-formed code unit sequences are passed through without being
transcoded to the execution character set (I would expect an error or
translation to a replacement character for the ill-formed sequences).
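
For reference, here is a minimal sketch of the kind of well-formedness check
that is evidently not applied in the default configuration.  It assumes the
byte ranges given in the Unicode Standard's table of well-formed UTF-8 byte
sequences (Table 3-7).  The name is_well_formed_utf8 is hypothetical and is
not taken from GCC or libcpp.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of GCC): returns true if `s` is a well-formed
// UTF-8 code unit sequence, i.e. no stray continuation bytes, no truncated or
// overlong sequences, no surrogates, and nothing above U+10FFFF.
bool is_well_formed_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;      // total length of the current sequence
        unsigned char lo = 0x80;  // allowed range for the second byte
        unsigned char hi = 0xBF;

        if (b0 <= 0x7F) {
            len = 1;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;  // excludes overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;  // excludes surrogates U+D800..U+DFFF
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;  // excludes overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
        } else {
            return false;  // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence
        }

        if (len > n - i) {
            return false;  // sequence is truncated at the end of the input
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < lo || b > hi) {
                return false;
            }
            lo = 0x80;  // bytes after the second use the full continuation range
            hi = 0xBF;
        }
        i += len;
    }
    return true;
}
```

With this sketch, is_well_formed_utf8("\xC2\xA3") returns true, while
is_well_formed_utf8("\xA3") returns false because a lone continuation byte
cannot begin a sequence.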

Note that validation is performed if a non-UTF-8 execution character set is
specified.

$ g++ -Wall -Wextra -pedantic -fexec-charset=iso8859-1 t.cpp -o t
t.cpp: In function ‘int main()’:
t.cpp:9:28: error: converting to execution character set: Invalid or incomplete
multibyte or wide character
     for (unsigned char c : "ï") { // 0xA3
                            ^~~

I propose the documentation be updated to reflect this behavior:

  -finput-charset=charset
    Set the input character set, used for translation from the character set of
    the input file to the source character set used by GCC.  The default input
    character set is UTF-8.  charset can be any encoding supported by the
    system's iconv library routine.  If the input character set matches the
    execution character set, then ill-formed code unit sequences are passed
    through without validation or translation.  Otherwise, ill-formed code unit
    sequences will result in an error during transcoding to the execution
    character set.
