This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: Patch: libcpp -vs- UTF-8 BOM

From: Paolo Bonzini <bonzini at gnu dot org>
To: tromey at redhat dot com
Cc: Gcc Patch List <gcc-patches at gcc dot gnu dot org>
Date: Thu, 17 Apr 2008 17:46:00 +0200
Subject: Re: Patch: libcpp -vs- UTF-8 BOM
References: <m3bq48sg57.fsf@fleche.redhat.com>

> This implementation is a little funny in that it explicitly looks for
> the UTF-8 BOM after decoding.  This should be harmless, though.  Note
> that we can't rely on iconv() to handle the BOM (as we do with UTF-16)
> -- the glibc iconv() does not appear to handle a UTF-8 BOM, and
> furthermore we often bypass iconv in this case.

Does not seem too awful at all after thinking about it. The comment could instead remind the reader that SOURCE_CHARSET *is* "UTF-8" if the host charset is ASCII, and that's why we look for the BOM in its UTF-8 form. In other words, we don't look for a UTF-8 BOM because PR33415 is about UTF-8 BOMs, but because UTF-8 is the only form in which a BOM can still be present after iconv.

Something like this (I don't like my own text much actually...):

  /* Ignore the BOM if we see one.  If iconv has not stripped it (as
     of glibc 2.7, for example, iconv does not ignore a UTF-8 BOM)
     it has been converted to SOURCE_CHARSET (i.e. UTF-8), and that's
     what we test for.  We would also find the BOM if we are in the
     'convert_no_conversion' case.  */
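
For illustration only, a minimal sketch of the check that comment describes, assuming the buffer has already been through conversion (or the no-conversion path) and so holds UTF-8. The helper name skip_utf8_bom and its signature are hypothetical and not taken from the patch:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the libcpp code: return the number of
   leading bytes to skip if BUF (of length LEN) starts with the
   UTF-8 encoding of U+FEFF, and 0 otherwise.  */
static size_t
skip_utf8_bom (const unsigned char *buf, size_t len)
{
  /* U+FEFF encoded as UTF-8.  */
  static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

  if (len >= sizeof bom && memcmp (buf, bom, sizeof bom) == 0)
    return sizeof bom;
  return 0;
}
```

A caller would advance its start pointer by the returned count before lexing. Since the test runs after conversion, it should not need to know what the input charset was.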

> libcpp/ChangeLog:

Missing entry for charset.c.

Paolo

Follow-Ups:
- Re: Patch: libcpp -vs- UTF-8 BOM
  - From: Tom Tromey

References:
- Patch: libcpp -vs- UTF-8 BOM
  - From: Tom Tromey
