102099 – Diagnostics do not consider the user's locale when printing source lines

Bug 102099 - Diagnostics do not consider the user's locale when printing source lines

Summary: Diagnostics do not consider the user's locale when printing source lines

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	c++ (show other bugs)
Version:	12.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	diagnostic

Depends on:
Blocks:

Reported:	2021-08-27 15:13 UTC by Lewis Hyatt
Modified:	2021-08-30 07:39 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Tested patch (9.88 KB, patch) 2021-08-27 15:13 UTC, Lewis Hyatt	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Lewis Hyatt 2021-08-27 15:13:12 UTC

If the user has a non-UTF-8 locale configured, they will currently still receive UTF-8 output from GCC's stderr under some conditions:

-If a filename for which diagnostics are issued contains extended characters
-If a source line for which diagnostics are issued contains extended characters.

When a source line contains identifier names with extended characters, the C/C++ front ends take care to convert them to the user's locale, by always passing them to identifier_to_locale() before output. However, this only affects the diagnostics messages generated by the front end, and does not affect the source line itself which is printed separately.

Example:

$ cat á.cpp
int á = 0;
int á = 1;

#in UTF-8 locale, looks fine
$ gcc -c á.cpp
á.cpp:2:5: error: redefinition of ‘int á’
2 | int á = 1;
| ^
á.cpp:1:5: note: ‘int á’ previously defined here
1 | int á = 0;
| ^

#in C locale, only partially converted to UCNs
$ LC_ALL=C gcc -c á.cpp
á.cpp:2:5: error: redefinition of 'int \U000000e1'
2 | int á = 1;
| ^
á.cpp:1:5: note: 'int \U000000e1' previously defined here
1 | int á = 0;
| ^

The attached patch arranges for the output to be rather:

#corrected by this patch
$ LC_ALL=C gcc -c á.cpp
\U000000e1.cpp:2:5: error: redefinition of 'int \U000000e1'
2 | int \U000000e1 = 1;
| ^
\U000000e1.cpp:1:5: note: 'int \U000000e1' previously defined here
1 | int \U000000e1 = 0;
| ^

In the above example I showed C locale, where the extended characters need to be escaped, but the patch would also handle e.g. latin1 locale, where it would output as expected, using iconv to convert to the output charset.

The patch is pretty complete, and bootstraps all languages with no regression. However there are a couple potential issues with it that may need to be discussed before it's ready to be used, so I have held off submitting to gcc-patches for now. The two main points of concern are:

1. Diagnostics recently acquired a lot of infrastructure to know the correct display width of extended characters, so that things like carets and label lines show up at the correct place. This infrastructure however is not currently able to handle locale dependence of the display width. Changing that is rather complicated, because determining that the display width of "á" is actually 10 columns instead of 1 (in case of UCN escape), requires attempting to convert the character to the user's locale (perhaps with iconv), and determining if it can be displayed or requires an escape. So the process of determining the display width becomes an expensive operation that should be optimized and performed once for the line, not something that can be computed on the fly as is done now. This breaks the assumptions in the design of the current approach and so would require it to be redone. That is certainly doable but it seems unfortunate to make that process much more complicated, for what's not probably a commonly needed use case. I suspect, that in many cases, users with a C locale configured actually still see UTF-8 output fine in their terminal anyway... The output with UCN escapes already looks bad, so perhaps having misaligned labels and carets is not a big deal and it's fine as it is.

2. The testsuite always runs with LC_ALL=C currently. Therefore, after the change in this patch, a test is no longer able to test for UTF-8 output in diagnostics, it will be UCN escaped instead. There is one such test currently (gcc.dg/diagnostic-input-charset-1.c). It doesn't seem suitable to change that test to look for UCN escapes, because the purpose of that test is to confirm that correct UTF-8 is generated when an input file is in another charset. So I instead added a new option -fdiagnostics-format=force-utf8. This is the same as -fdiagnostics-format=text except it disables the conversion to the user's locale and restores the previous behavior. That seemed more simple than adding ability to change the locale in the testsuite, plus I thought users may want this option for themselves for some reason, if say they do not have access to a UTF-8 locale somehow but their terminal still displays it fine. So that much seems fine, however there is a wrinkle here that I am not sure how to fix. The user probably expects that this new option will cause all diagnostics output to be UTF-8 regardless of locale. But some of the output is not generated by the diagnostics infrastructure at all. For example, localized strings are converted by libintl and always come out in the current locale. I am not sure how hard/easy it is to change this. But as an example, suppose you configure a latin1 German locale and compile the above test case:

á.cpp:2:5: Fehler: Redefinition von »int á«
2 | int á = 1;
| ^
á.cpp:1:5: Anmerkung: »int á« wurde bereits hier definiert
1 | int á = 0;
| ^

In the above output, the quote characters "»«" are generated by the internationalization library and are already converted to latin1 when diagnostics infrastructure sees them. So currently, with the new -fdiagnostics-format=force-utf8 option, the output is not what's expected... the á is encoded in UTF-8, but the quote stays latin-1. I am not sure what's the best way to address this. One option would be to make this new option undocumented, and reserve it just for use by the testsuite, for which this would not be an issue. Otherwise need to find a way to either disable the conversion to the locale from the translation library, or translate it back when this option is in effect.

Comment 1 Lewis Hyatt 2021-08-27 15:13:49 UTC

Created attachment 51365 [details]
Tested patch