Bug 91843 - pretty printer mangles extended characters
Summary: pretty printer mangles extended characters
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: c (show other bugs)
Version: unknown
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-20 15:42 UTC by Lewis Hyatt
Modified: 2019-12-11 14:57 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
Patch to implement the 3rd option (922 bytes, patch)
2019-09-20 19:17 UTC, Lewis Hyatt
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Lewis Hyatt 2019-09-20 15:42:25 UTC
There seem to be some issues with identifiers containing extended characters going through pretty-printer.c. For example, this test:

=====
int int_π = 3;
int π() {
    return itn_π;
}
=====

(note: if testing older trunk that doesn't have extended identifier support, the same behavior is seen using UCNs).

In either C or C++ mode, the output is wrong:

t8.cpp: In function ‘int\xcf\x80()’:
t8.cpp:3:12: error: ‘itn_π’ was not declared in this scope; did you mean ‘int\xcf\x80’?
    3 |     return itn_π;
      |            ^~~~~
      |            int_π

(in C it's the same except for minor changes in the text.)

So extended characters are printed 5 times, and 2 of them get mangled with hex escape codes and 3 come out OK. Of the 3 that work, 2 from diagnostic-show-locus.c are just output directly from the source, and the other one (the error: 'itn_π') is printed using %qD, which ends up in pp_c_identifier, which ends up calling pp_identifier in pretty-print.h, which calls pp_string, which does not do any hex escaping.

For the two wrong ones, the code path is different for C and C++, but they end up in pretty-printer.c being processed as a %qs directive either way.

(BTW, incidental to this bug report, the "did you mean" part is missing a call to identifier_to_locale(). But that isn't the reason it gets misprinted.)

It seems there are lots of code paths that may end up printing an identifier via %qs, so I tried to look at this common element, and the situation seems straightforward enough, but I don't understand why it is the way it is, so I'm not sure what's the correct fix. The source of the issue is that the %qs seems to apply quoting in two unrelated senses... It surrounds the string with quotation marks, and it also prints the string with pp_quoted_string() instead of pp_string(). The pp_quoted_string() then applies the hex escape to all non-printable bytes it comes across. Seems there would be a few options:

-Change %qs to only surround with quotes, not also do hex escapes. This is a simple one-line fix but I am not sure why it does this hex escaping or if it's still necessary for other use cases.

-Maybe there is some alternative to %qs that is already there that's supposed to be used for this, and needs to be added in various places? This test case reveals two of them, but there must be others among 2000 different uses of %qs, so not sure about this...

-Change pp_quoted_string() to check if the bytes it would escape form a valid UTF-8 sequence, in which case, don't escape them. That also seems relatively simple and would handle all uses cases of %qs wherever they may be.


I am happy to work on it but not sure how to decide on the best path. Thanks!

-Lewis
Comment 1 Lewis Hyatt 2019-09-20 19:17:29 UTC
Created attachment 46905 [details]
Patch to implement the 3rd option

The last option from my previous message seems, at least, unlikely to break anything. Attached patch resolves this issue.

It also fixes a small unrelated bug that prevents the first byte prior to any hex-escaped byte from being output, which was revealed by new self-test cases that I have added.
Comment 2 Lewis Hyatt 2019-10-28 22:47:48 UTC
Patch was sent for review:

https://gcc.gnu.org/ml/gcc-patches/2019-10/msg00766.html

Thanks!
Comment 3 Lewis Hyatt 2019-12-11 14:57:56 UTC
Fixed for GCC 10.
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=279226