This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Patch: libcpp -vs- UTF-8 BOM



> This implementation is a little funny in that it explicitly looks for
> the UTF-8 BOM after decoding.  This should be harmless, though.  Note
> that we can't rely on iconv() to handle the BOM (as we do with UTF-16)
> -- the glibc iconv() does not appear to handle a UTF-8 BOM, and
> furthermore we often bypass iconv in this case.

This does not seem too awful at all after thinking about it. The comment could instead remind the reader that SOURCE_CHARSET *is* "UTF-8" when the host charset is ASCII, and that is why we look for the BOM in its UTF-8 form. In other words, we look for a UTF-8 BOM not because PR33415 is about UTF-8 BOMs, but because that is the only form of the BOM we can still find after iconv has run.


Something like this (I don't like my own text much actually...):

  /* Ignore the BOM if we see one.  If iconv has not stripped it (as
     of glibc 2.7, for example, iconv does not ignore a UTF-8 BOM)
     it has been converted to SOURCE_CHARSET (i.e. UTF-8), and that's
     what we test for.  We would also find the BOM if we are in the
     'convert_no_conversion' case.  */
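
To make the point concrete, here is a minimal standalone sketch of the check the comment describes. It is not the code from the patch: the name skip_utf8_bom and its signature are hypothetical, and it only assumes that the buffer has already been converted to SOURCE_CHARSET (UTF-8 on an ASCII host), so the BOM can only appear as the three bytes EF BB BF.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libcpp.  BUF holds LEN bytes that
   are assumed to be in SOURCE_CHARSET (UTF-8) already, either because
   iconv converted them or because no conversion was needed.  Return
   the number of leading bytes to skip: 3 if the buffer starts with a
   UTF-8 BOM, otherwise 0.  */
static size_t
skip_utf8_bom (const unsigned char *buf, size_t len)
{
  static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

  if (len >= sizeof bom && memcmp (buf, bom, sizeof bom) == 0)
    return sizeof bom;
  return 0;
}
```

A caller would advance its read pointer by the returned count before lexing. Buffers shorter than three bytes are left untouched.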

> libcpp/ChangeLog:

Missing entry for charset.c.


Paolo

