This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Patch: libcpp -vs- UTF-8 BOM
- From: Paolo Bonzini <bonzini at gnu dot org>
- To: tromey at redhat dot com
- Cc: Gcc Patch List <gcc-patches at gcc dot gnu dot org>
- Date: Thu, 17 Apr 2008 17:46:00 +0200
- Subject: Re: Patch: libcpp -vs- UTF-8 BOM
- References: <m3bq48sg57.fsf@fleche.redhat.com>
> This implementation is a little funny in that it explicitly looks for
> the UTF-8 BOM after decoding.  This should be harmless, though.  Note
> that we can't rely on iconv() to handle the BOM (as we do with UTF-16)
> -- the glibc iconv() does not appear to handle a UTF-8 BOM, and
> furthermore we often bypass iconv in this case.
Does not seem too awful at all after thinking about it. The comment
could instead remind the reader that SOURCE_CHARSET *is* "UTF-8" if the
host charset is ASCII, and that's why we look at the BOM as UTF-8. In
other words, we don't look for a UTF-8 BOM because PR33415 is about
UTF-8 BOMs, but because it's the only sequence we can find after iconv.
Something like this (I don't like my own text much actually...):
  /* Ignore the BOM if we see one.  If iconv has not stripped it (as
     of glibc 2.7, for example, iconv does not ignore a UTF-8 BOM)
     it has been converted to SOURCE_CHARSET (i.e. UTF-8), and that's
     what we test for.  We would also find the BOM if we are in the
     'convert_no_conversion' case.  */
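
To make the check concrete, here is a rough standalone sketch of what
"test for the BOM in SOURCE_CHARSET" amounts to when SOURCE_CHARSET is
UTF-8.  The helper name utf8_bom_length is made up for illustration and
is not taken from the patch or from libcpp; the only fact it relies on
is that U+FEFF encodes in UTF-8 as the three bytes EF BB BF.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the actual libcpp code: return the number
   of leading bytes to skip if BUF (already converted to UTF-8) starts
   with the UTF-8 encoding of U+FEFF, and 0 otherwise.  */
static size_t
utf8_bom_length (const unsigned char *buf, size_t len)
{
  /* U+FEFF encoded as UTF-8.  */
  static const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };

  /* Check the length first so a short or empty buffer is never
     read past its end.  */
  if (len >= sizeof bom && memcmp (buf, bom, sizeof bom) == 0)
    return sizeof bom;
  return 0;
}
```

A caller would presumably advance its buffer pointer by the returned
count and shrink the length by the same amount before lexing.  Because
the test runs on the converted buffer, the same check covers both the
iconv path and the no-conversion path.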
> libcpp/ChangeLog:
Missing entry for charset.c.
Paolo