This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


UTF-8 quotation marks in diagnostics


Several of us don't want UTF-8 quotation marks in diagnostics in our 
environment (Jove subshells).  We'd like a way to turn them off.  We don't 
think that they are a bad idea but they are bad in our environment.

<https://gcc.gnu.org/gcc-4.0/changes.html>

	English-language diagnostic messages will now use Unicode
	quotation marks in UTF-8 locales. (Non-English messages
	already used the quotes appropriate for the language in
	previous releases.) If your terminal does not support UTF-8
	but you are using a UTF-8 locale (such locales are the default
	on many GNU/Linux systems) then you should set LC_CTYPE=C in
	the environment to disable that locale. Programs that parse
	diagnostics and expect plain ASCII English-language messages
	should set LC_ALL=C. See Markus Kuhn's explanation of Unicode
	quotation marks for more information.

This suggests that LC_CTYPE=C would do what we want: go back to ` and
' instead of \342\200\230 and \342\200\231.
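For anyone who wants to check what those octal byte sequences are, here is a small sketch in Python. The names OPEN_QUOTE, CLOSE_QUOTE and describe are mine, chosen for illustration; the byte values are the ones quoted above.

```python
import unicodedata

# The octal escapes from the diagnostics, written as Python bytes literals.
OPEN_QUOTE = b"\342\200\230"
CLOSE_QUOTE = b"\342\200\231"


def describe(raw: bytes) -> str:
    """Decode a one-character UTF-8 byte sequence and return its Unicode name."""
    return unicodedata.name(raw.decode("utf-8"))


print(describe(OPEN_QUOTE))   # LEFT SINGLE QUOTATION MARK
print(describe(CLOSE_QUOTE))  # RIGHT SINGLE QUOTATION MARK
```

So each three-byte sequence is a single curly quote (U+2018 and U+2019) encoded in UTF-8.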

I find that a little confusing and scary.  I would expect that setting
LC_CTYPE=C would have the effect of changing the lexing done by the C
compiler.  For one thing, the set of valid characters in strings would
be different.  This we don't want.

gcc(1) says:

	The LC_CTYPE environment variable specifies character
	classification.  GCC uses it to determine the character
	boundaries in a string; this is needed for some multibyte
	encodings that contain quote and escape characters that are
	otherwise interpreted as a string end or escape.

	The LC_MESSAGES environment variable specifies the language to
	use in diagnostic messages.


An experiment on my Fedora 20 system shows:

- LANG=en_CA.UTF-8 [correct]

- LC_CTYPE isn't set by default

- setting LC_CTYPE to C gets rid of the UTF-8 quotes in GCC diagnostics.
  That's surprising because the manpage doesn't say that it affects diagnostics.

- setting LC_MESSAGES to C DOES NOT get rid of the UTF-8 quotes in GCC diagnostics.
  That's surprising because the manpage does say that it affects diagnostics.
  I hope that it only affects compile-time diagnostics.
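The experiment above can be repeated with something like the following Python sketch. It is only a sketch under assumptions: it assumes a gcc on PATH, it assumes that a reference to an undeclared identifier is enough to produce a diagnostic containing quotes, and it drops LC_ALL from the child environment because LC_ALL would take precedence over the per-category variables. The helper names (env_with, has_utf8_quotes, gcc_diagnostics) are made up for this example.

```python
import os
import shutil
import subprocess

# UTF-8 encodings of U+2018 and U+2019, the quotes seen in the diagnostics.
UTF8_QUOTES = (b"\xe2\x80\x98", b"\xe2\x80\x99")

# A deliberately broken program: 'x' is undeclared, so gcc must name it.
SOURCE = b"int main(void) { return x; }\n"


def env_with(**overrides):
    """Copy the current environment, drop LC_ALL, and apply the overrides."""
    env = dict(os.environ)
    env.pop("LC_ALL", None)  # LC_ALL would override LC_CTYPE and LC_MESSAGES
    env.update(overrides)
    return env


def has_utf8_quotes(output: bytes) -> bool:
    """Return True if the bytes contain a UTF-8 encoded curly single quote."""
    return any(quote in output for quote in UTF8_QUOTES)


def gcc_diagnostics(env) -> bytes:
    """Compile SOURCE from stdin with gcc and return whatever it wrote to stderr."""
    result = subprocess.run(
        ["gcc", "-fsyntax-only", "-x", "c", "-"],
        input=SOURCE,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.stderr


if __name__ == "__main__" and shutil.which("gcc"):
    cases = [
        ("default", {}),
        ("LC_CTYPE=C", {"LC_CTYPE": "C"}),
        ("LC_MESSAGES=C", {"LC_MESSAGES": "C"}),
    ]
    for label, overrides in cases:
        found = has_utf8_quotes(gcc_diagnostics(env_with(**overrides)))
        print(f"{label}: UTF-8 quotes present = {found}")
```

I have not listed expected output, since the result depends on the GCC version and on the locales installed on the machine.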

That sure sounds like we should NOT set LC_CTYPE=C because of bad
side-effects: it changes how the program is lexed.  And the
documentation gives no basis for thinking that it would suppress those
UTF-8 quotes in messages (even though testing shows that this works).

That sure sounds like we should set LC_MESSAGES=C, but that doesn't work.

In our environment, our tool doesn't know that gcc is being invoked.
So the solution needs to be targeted at GCC alone.  That's why a
GCC-specific environment variable like GCC_COLORS would be good.  In
fact, it could probably be hacked into GCC_COLORS.
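Until something like that exists, one workaround that avoids touching any locale variable is to filter the output. This is not what I am asking for, only a sketch of a stopgap: it assumes the tool can pipe the subshell's output through a filter, and the names asciify_quotes and filter_stream are hypothetical.

```python
import io

# Map the UTF-8 curly quotes back to the ASCII pair GCC used before 4.0.
QUOTE_MAP = {
    b"\xe2\x80\x98": b"`",  # U+2018 LEFT SINGLE QUOTATION MARK
    b"\xe2\x80\x99": b"'",  # U+2019 RIGHT SINGLE QUOTATION MARK
}


def asciify_quotes(data: bytes) -> bytes:
    """Replace UTF-8 curly single quotes with ` and ', leaving other bytes alone."""
    for utf8_quote, ascii_quote in QUOTE_MAP.items():
        data = data.replace(utf8_quote, ascii_quote)
    return data


def filter_stream(src, dst) -> None:
    """Copy src to dst line by line, rewriting the quotes on the way.

    Working line by line is safe because a three-byte quote sequence
    never contains a newline byte, so it cannot straddle two lines.
    """
    for line in src:
        dst.write(asciify_quotes(line))


# Small demonstration on an in-memory stream.
sample = "t.c:1:25: error: \u2018x\u2019 undeclared\n".encode("utf-8")
out = io.BytesIO()
filter_stream(io.BytesIO(sample), out)
print(out.getvalue().decode("ascii"), end="")
```

The drawback is obvious: it rewrites those bytes everywhere, including in program output that happens to contain them, which is why a switch inside GCC would be better.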

Man pages in section 1 that explicitly reference LC_CTYPE:
	enca
	enconv
	find
	gcc
	gnroff
	grep
	jove
	koi8rxterm
	less
	locale
	localedef
	nroff
	perl5004delta
	perl5160delta
	perl58delta
	perlfunc
	perllocale
	perltoc
	pico
	pilot
	sh
	systemd
	time
	tree
	uxterm
	xterm
So I feel uncomfortable setting it.

Man pages in section 1 that explicitly reference LC_MESSAGES:
	apropos
	aspell
	awk
	bash
	enca
	enconv
	find
	gawk
	gcc
	grep
	hunspell
	install-tl
	locale
	localectl
	localedef
	lynx
	man
	nmcli
	perllocale
	perltoc
	sh
	systemd
	systemd-firstboot
	time
	whatis
	xdg-desktop-icon
	xdg-desktop-menu
So setting this would hardly be safer.

