Compiling files not encoded with system settings
Nicolas De Rico
nicolas.derico@sand.com
Wed May 24 20:17:00 GMT 2006
Hello,
A few weeks ago, I posted on the general GCC mailing-list an issue that
arises when using GCC on Linux to compile files that were created on
Windows and saved as "Unicode".
The problem seems to be that when CPP reads a source file, it uses the
charset passed with the -finput-charset option (or, failing that,
LC_CTYPE) as the input encoding, which is fine. But it then uses the
same encoding to read system header files, which fails if those headers
are not encoded in the charset passed with -finput-charset.
To make a reproducible test, I created a simple hello world program that
includes stdio.h. The file hi-utf16.c, created with Notepad and saved as
"Unicode", contains a BOM (byte order mark), which is, in essence, a
small header at the beginning of the file that indicates the encoding.
nicolas:~> gcc -finput-charset=UTF-16 hi-utf16.c
hi-utf16.c:1:19:failure to convert UTF-16 to UTF-8
It appears that CPP is telling libiconv to convert the source file from
UTF-16 to UTF-8, which works, but as soon as it hits the include file,
it fails. Of course, stdio.h is stored in UTF-8 on the system, so trying
to convert it from UTF-16 fails right away.
It would be nice if every file used the same Unicode encoding, but
that's not always possible, especially when source control is involved.
This issue touches on interoperability between Windows and UNIX, and
also on "legacy" (i.e., pre-UTF-8) source files in general. My
suggestion is to have CPP open a file and read up to its first 4 bytes
to figure out whether there is a BOM. If so, determine the encoding from
it and pass that to libiconv. I believe that's what vim does, btw. In
short, I suggest that the encoding be detected in the following order:
1. BOM
2. -finput-charset option
3. LC_CTYPE
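The BOM check itself is small. Here is a rough sketch (a hypothetical
helper, not existing CPP code; the function name and the idea of
returning an iconv-style charset name are my own assumptions). Note that
the UTF-32 signatures must be tested before the UTF-16 ones, since the
UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: map the first bytes of a source file to an
 * iconv-compatible charset name.  Returns NULL when no BOM is found,
 * in which case CPP would fall back to -finput-charset or LC_CTYPE. */
static const char *detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    /* UTF-32 before UTF-16: FF FE 00 00 also starts with FF FE. */
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0)
        return "UTF-32LE";
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0)
        return "UTF-32BE";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return "UTF-16LE";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return "UTF-16BE";
    return NULL;  /* no BOM: defer to -finput-charset / LC_CTYPE */
}
```

A file produced by Notepad's "Unicode" option starts with FF FE, so this
would hand "UTF-16LE" to libiconv for the main file while still letting
system headers without a BOM be read as UTF-8.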
I would appreciate some feedback on the subject, including how to proceed.
Thank you in advance,
Nicolas