This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
UTF-8 quotation marks in diagnostics
- From: "D. Hugh Redelmeier" <hugh at mimosa dot com>
- To: gcc at gcc dot gnu dot org
- Date: Wed, 21 Oct 2015 17:23:34 -0400 (EDT)
- Subject: UTF-8 quotation marks in diagnostics
- Authentication-results: sourceware.org; auth=none
- Reply-to: "D. Hugh Redelmeier" <hugh at mimosa dot com>
Several of us don't want UTF-8 quotation marks in diagnostics in our
environment (Jove subshells). We'd like a way to turn them off. We don't
think that they are a bad idea but they are bad in our environment.
<https://gcc.gnu.org/gcc-4.0/changes.html>
English-language diagnostic messages will now use Unicode
quotation marks in UTF-8 locales. (Non-English messages
already used the quotes appropriate for the language in
previous releases.) If your terminal does not support UTF-8
but you are using a UTF-8 locale (such locales are the default
on many GNU/Linux systems) then you should set LC_CTYPE=C in
the environment to disable that locale. Programs that parse
diagnostics and expect plain ASCII English-language messages
should set LC_ALL=C. See Markus Kuhn's explanation of Unicode
quotation marks for more information.
This suggests that LC_CTYPE=C would do what we want: go back to ` and
' instead of \342\200\230 and \342\200\231.
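For reference, those two byte sequences are the UTF-8 encodings of U+2018 (LEFT SINGLE QUOTATION MARK) and U+2019 (RIGHT SINGLE QUOTATION MARK), written as octal escapes. A minimal Python sketch that derives them; the helper name octal_escape is made up for illustration:

```python
def octal_escape(text: str) -> str:
    """Return the UTF-8 bytes of text as backslash-octal escapes."""
    return "".join(f"\\{byte:03o}" for byte in text.encode("utf-8"))


LEFT_QUOTE = "\u2018"   # LEFT SINGLE QUOTATION MARK
RIGHT_QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK

print(octal_escape(LEFT_QUOTE))   # \342\200\230
print(octal_escape(RIGHT_QUOTE))  # \342\200\231
```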
I find that a little confusing and scary. I would expect that setting
LC_CTYPE=C would have the affect of changing the lexing done by the C
compiler. For one thing, valid characters in strings would be
different. This we don't want.
gcc(1) says:
The LC_CTYPE environment variable specifies character
classification. GCC uses it to determine the character
boundaries in a string; this is needed for some multibyte
encodings that contain quote and escape characters that are
otherwise interpreted as a string end or escape.
The LC_MESSAGES environment variable specifies the language to
use in diagnostic messages.
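As background for how these variables interact, POSIX resolves each locale category by checking LC_ALL first, then the category's own variable, then LANG, skipping any that are unset or empty. The sketch below models only that lookup order. The function name effective_locale is hypothetical, and the final fallback to "C" is an assumption (POSIX leaves the default implementation-defined).

```python
def effective_locale(category: str, env: dict) -> str:
    """Model the POSIX lookup order for one locale category."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name, "")
        if value:  # unset and empty are treated the same
            return value
    return "C"  # assumed default; POSIX says implementation-defined


env = {"LANG": "en_CA.UTF-8"}
print(effective_locale("LC_CTYPE", env))                           # en_CA.UTF-8
print(effective_locale("LC_CTYPE", {**env, "LC_CTYPE": "C"}))      # C
print(effective_locale("LC_MESSAGES", {**env, "LC_CTYPE": "C"}))   # en_CA.UTF-8
```

Under this model, an unset LC_CTYPE still resolves to a UTF-8 locale whenever LANG names one, and overriding LC_CTYPE leaves LC_MESSAGES untouched.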
An experiment on my Fedora 20 system shows:
- LANG=en_CA.UTF-8 [correct]
- LC_CTYPE isn't set by default
- setting LC_CTYPE to C gets rid of the UTF-8 quotes in GCC diagnostics.
  That's surprising because the manpage doesn't say that it affects diagnostics.
- setting LC_MESSAGES to C DOES NOT get rid of the UTF-8 quotes in GCC diagnostics.
  That's surprising because the manpage does say that it affects diagnostics.
I hope that it only affects compile-time diagnostics.
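An experiment of this kind could be repeated with a short script along the following lines. This is a sketch under assumptions, not the procedure actually used: it assumes a compiler named gcc on PATH, and the helper names and the broken test program are invented for illustration. If LC_ALL is set in the inherited environment it takes precedence over the per-category overrides, so it would need to be unset first.

```python
import os
import shutil
import subprocess
import tempfile

# UTF-8 encodings of U+2018 and U+2019, the quotation marks in question.
UTF8_QUOTES = ("\u2018".encode("utf-8"), "\u2019".encode("utf-8"))


def has_utf8_quotes(diagnostics: bytes) -> bool:
    """Report whether raw compiler output contains either UTF-8 quote."""
    return any(quote in diagnostics for quote in UTF8_QUOTES)


def gcc_diagnostics(overrides: dict, compiler: str = "gcc") -> bytes:
    """Compile a deliberately broken file and return the raw stderr bytes."""
    env = {**os.environ, **overrides}
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "bad.c")
        with open(source, "w") as handle:
            # An undeclared identifier makes GCC quote the name in its error.
            handle.write("int main(void) { return undeclared_name; }\n")
        result = subprocess.run(
            [compiler, "-fsyntax-only", source],
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
        )
    return result.stderr


# Only run the comparison when a gcc executable is actually available.
if __name__ == "__main__" and shutil.which("gcc"):
    for overrides in ({}, {"LC_CTYPE": "C"}, {"LC_MESSAGES": "C"}):
        found = has_utf8_quotes(gcc_diagnostics(overrides))
        label = overrides or "(inherited environment)"
        print(label, "->", "UTF-8 quotes" if found else "no UTF-8 quotes")
```

The script deliberately inspects raw bytes rather than decoded text, so the result does not depend on the locale Python itself runs under.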
That sure sounds like we should NOT set LC_CTYPE=C because of bad
side-effects: it changes how the program is lexed. And the
documentation gives no basis for thinking that it would suppress those
UTF-8 quotes in messages (even though testing shows that this works).
That sure sounds like we should set LC_MESSAGES=C, but that doesn't work.
In our environment, our tool doesn't know that gcc is being invoked.
So the solution needs to be targeted. That's why a solution like
GCC_COLORS would be good. In fact, it could probably be hacked into GCC_COLORS.
Man pages in section 1 that explicitly reference LC_CTYPE:
enca
enconv
find
gcc
gnroff
grep
jove
koi8rxterm
less
locale
localedef
nroff
perl5004delta
perl5160delta
perl58delta
perlfunc
perllocale
perltoc
pico
pilot
sh
systemd
time
tree
uxterm
xterm
So I feel uncomfortable setting it.
Man pages in section 1 that explicitly reference LC_MESSAGES:
apropos
aspell
awk
bash
enca
enconv
find
gawk
gcc
grep
hunspell
install-tl
locale
localectl
localedef
lynx
man
nmcli
perllocale
perltoc
sh
systemd
systemd-firstboot
time
whatis
xdg-desktop-icon
xdg-desktop-menu
So setting this would hardly be safer.