This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: UTF-8 quotation marks in diagnostics
- From: Joseph Myers <joseph at codesourcery dot com>
- To: "D. Hugh Redelmeier" <hugh at mimosa dot com>
- Cc: <gcc at gcc dot gnu dot org>
- Date: Wed, 21 Oct 2015 23:25:52 +0000
- Subject: Re: UTF-8 quotation marks in diagnostics
- References: <alpine dot LRH dot 2 dot 02 dot 1510211705080 dot 3681 at redclaw dot mimosa dot com>
On Wed, 21 Oct 2015, D. Hugh Redelmeier wrote:
> The LC_CTYPE environment variable specifies character
> classification. GCC uses it to determine the character
> boundaries in a string; this is needed for some multibyte
> encodings that contain quote and escape characters that are
> otherwise interpreted as a string end or escape.
That's inaccurate: GCC does not use LC_CTYPE to determine the source
character set. The default source encoding is always UTF-8. See the
comment in libcpp/charset.c:
  /* We disable this because the default codeset is 7-bit ASCII on
     most platforms, and this causes conversion failures on every
     file in GCC that happens to have one of the upper 128 characters
     in it -- most likely, as part of the name of a contributor.
     We should definitely recognize in-band markers of file encoding,
     like:
     - the appropriate Unicode byte-order mark (FE FF) to recognize
       UTF16 and UCS4 (in both big-endian and little-endian flavors)
       and UTF8
     - a "#i", "#d", "/ *", "//", " #p" or "#p" (for #pragma) to
       distinguish ASCII and EBCDIC.
     - now we can parse something like "#pragma GCC encoding <xyz>
       on the first line, or even Emacs/VIM's mode line tags (there's
       a problem here in that VIM uses the last line, and Emacs has
       its more elaborate "local variables" convention).
     - investigate whether Java has another common convention, which
       would be friendly to support.
     (Zack Weinberg and Paolo Bonzini, May 20th 2004)  */
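
To illustrate the first item in that comment, here is a minimal sketch of
recognizing a byte-order mark at the start of a buffer. It is not code from
libcpp: the function name detect_bom and its interface are hypothetical,
chosen only for this example. The byte sequences are the standard encodings
of U+FEFF in UTF-8, UTF-16 and UTF-32.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libcpp: guess an encoding from a
   leading byte-order mark (U+FEFF).  Returns a static encoding name and
   stores the BOM length in *bom_len, or returns NULL (and stores 0) if
   the buffer does not start with a known BOM.  */
static const char *
detect_bom (const unsigned char *buf, size_t len, size_t *bom_len)
{
  /* Longer signatures come first: the UTF-32LE BOM (FF FE 00 00)
     begins with the UTF-16LE BOM (FF FE).  */
  static const struct
  {
    const char *name;
    unsigned char bytes[4];
    size_t n;
  } boms[] = {
    { "UTF-32BE", { 0x00, 0x00, 0xFE, 0xFF }, 4 },
    { "UTF-32LE", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
    { "UTF-8",    { 0xEF, 0xBB, 0xBF, 0x00 }, 3 },
    { "UTF-16BE", { 0xFE, 0xFF, 0x00, 0x00 }, 2 },
    { "UTF-16LE", { 0xFF, 0xFE, 0x00, 0x00 }, 2 },
  };

  assert (buf != NULL && bom_len != NULL);

  for (size_t i = 0; i < sizeof boms / sizeof boms[0]; i++)
    if (len >= boms[i].n && memcmp (buf, boms[i].bytes, boms[i].n) == 0)
      {
        *bom_len = boms[i].n;
        return boms[i].name;
      }

  *bom_len = 0;
  return NULL;
}
```

A caller would skip *bom_len bytes and convert the rest from the returned
encoding, falling back to a default such as UTF-8 when the result is NULL.
One caveat of this sketch: a UTF-16LE file whose first character after the
BOM is U+0000 is indistinguishable from UTF-32LE by its first four bytes.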
I haven't checked whether that documentation, and the matching
documentation for -finput-charset, was once accurate in this regard,
i.e. whether it dates from a time when LC_CTYPE did determine the
source character set.
--
Joseph S. Myers
joseph@codesourcery.com