This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug other/61896] Wrong documentation for -finput-charset
- From: "tom at honermann dot net" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 25 May 2016 14:03:51 +0000
- Subject: [Bug other/61896] Wrong documentation for -finput-charset
- Auto-submitted: auto-generated
- References: <bug-61896-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61896
Tom Honermann <tom at honermann dot net> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |tom at honermann dot net
--- Comment #1 from Tom Honermann <tom at honermann dot net> ---
Created attachment 38565
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38565&action=edit
Source file with ill-formed UTF-8 code unit sequences
GCC's incorrect documentation regarding the default input character set
continues to be a source of confusion. See the discussion on the C++
std-proposals list at the following link (search for 'locale').
https://groups.google.com/a/isocpp.org/forum/#!searchin/std-proposals/Draft$20proposal$20of$20file$20string/std-proposals/tKioR8OUiAw/85NCUmojBwAJ
The current gcc 6.1.0 documentation for -finput-charset can be found here:
https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/Preprocessor-Options.html#Preprocessor-Options
The relevant text is:
-finput-charset=charset
Set the input character set, used for translation from the character set of
the input file to the source character set used by GCC. If the locale does
not specify, or GCC cannot get this information from the locale, the
default is UTF-8. This can be overridden by either the locale or this
command-line option. Currently the command-line option takes precedence if
there's a conflict. charset can be any encoding supported by the system's
iconv library routine.
The patch proposed in attachment 33179 in comment 0 is an improvement in that
it removes the incorrect references to use of the current locale in determining
the input character set. However, the proposed documentation is still
incorrect, or at least imprecise, with regard to use of UTF-8 as the default
input character set since gcc does not reject (or even emit a warning for)
ill-formed UTF-8 text.
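For readers unsure what "ill-formed" means here, the following is a minimal
sketch of a well-formedness check based on the byte ranges the Unicode
Standard lists for UTF-8. The function name is_well_formed_utf8 is
hypothetical; this is not code from gcc or from the attachment, only an
illustration of the kind of validation that is not applied to the input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of gcc): returns true if `s` consists only of
// well-formed UTF-8 code unit sequences. Overlong encodings, surrogates,
// values above U+10FFFF, stray continuation bytes, and truncated sequences
// are all rejected.
bool is_well_formed_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) {            // single-byte (ASCII) sequence
            ++i;
            continue;
        }
        std::size_t len = 0;         // total length of this sequence
        unsigned char lo = 0x80;     // allowed range for the second byte
        unsigned char hi = 0xBF;
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; } // no overlongs
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; } // no surrogates
        else if (b0 == 0xEE || b0 == 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; } // no overlongs
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; } // <= U+10FFFF
        else { return false; }       // 0x80..0xC1 and 0xF5..0xFF never lead
        if (n - i < len) {
            return false;            // sequence truncated at end of input
        }
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) {
            return false;
        }
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) {
                return false;        // not a continuation byte
            }
        }
        i += len;
    }
    return true;
}
```

Under this check, the two bytes 0xC2 0xA3 are accepted, while a lone 0xA3 is
rejected because a continuation byte cannot start a sequence.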
An example follows. The attached test code (attached to prevent mutation of
the contents) contains ill-formed UTF-8 code unit sequences. Compilation with
gcc 6.1.0 (on a Linux system) succeeds despite the ill-formed input.
# To demonstrate that the text is ill-formed:
$ iconv -f utf-8 -t utf-8 t.cpp
#include <cstdio>
int main()
{
printf("narrow string: (well-formed UTF-8)\n");
for (unsigned char c : "£") { // 0xC2 0xA3
printf(" 0x%X\n", (unsigned int)c);
}
printf("narrow string: (ill-formed UTF-8)\n");
for (unsigned char c : "iconv: illegal input sequence at position 261
$ g++ --version
g++ (GCC) 6.1.0
...
$ g++ -Wall -Wextra -pedantic t.cpp -o t; echo $?
0
$ ./t
narrow string: (well-formed UTF-8)
0xC2
0xA3
0x0
narrow string: (ill-formed UTF-8)
0xA3
0x0
narrow string (hex escape):
0xA3
0x0
UTF-8 string: (well-formed UTF-8)
0xC2
0xA3
0x0
UTF-8 string: (ill-formed UTF-8)
0xA3
0x0
UTF-8 string (hex escape):
0xA3
0x0
As shown above, ill-formed code unit sequences are passed through without being
transcoded to the execution character set (I would expect an error or
translation to a replacement character for the ill-formed sequences).
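The hex escape cases in the output can be reproduced without depending on how
the source file itself is encoded. The following is a minimal sketch, assuming
a hypothetical helper named code_units that is not taken from the attachment.
It collects the elements of a narrow string literal the same way the
range-based for loops in the test do, which is also why each group of output
ends with 0x0: the loop visits the terminating NUL of the literal's array.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the attached test): collect every element of
// a narrow string literal's array, including the terminating NUL, as unsigned
// values. This mirrors `for (unsigned char c : "...")` in the test program.
template <std::size_t N>
std::vector<unsigned int> code_units(const char (&literal)[N])
{
    std::vector<unsigned int> out;
    for (unsigned char c : literal) {
        out.push_back(c);
    }
    return out;
}
```

For example, code_units("\xA3") yields 0xA3 followed by 0x0, and
code_units("\xC2\xA3") yields 0xC2, 0xA3, 0x0, matching the hex escape and
well-formed cases shown in the output.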
Note that validation is performed if a non-UTF-8 execution character set is
specified.
$ g++ -Wall -Wextra -pedantic -fexec-charset=iso8859-1 t.cpp -o t
t.cpp: In function ‘int main()’:
t.cpp:9:28: error: converting to execution character set: Invalid or incomplete
multibyte or wide character
for (unsigned char c : "ï") { // 0xA3
^~~
I propose the documentation be updated to reflect this behavior:
-finput-charset=charset
Set the input character set, used for translation from the character set of
the input file to the source character set used by GCC. The default input
character set is UTF-8. charset can be any encoding supported by the
system's iconv library routine. If the input character set matches the
execution character set, then ill-formed code unit sequences are passed
through without validation or translation. Otherwise, ill-formed code unit
sequences will result in an error during transcoding to the execution
character set.