Patch to support extended characters in C/C++ identifiers

Wed Sep 11 14:32:00 GMT 2019

On Tue, Sep 10, 2019 at 7:47 PM Joseph Myers <joseph@codesourcery.com> wrote:
>
> Thanks, I think this is OK with a few updates to the documentation.

Thanks for looking through this; I'm glad it will be acceptable. I
will make the documentation adjustments as you suggest.

Speaking of documentation, one other thing occurred to me. When I made
these changes, I tried to make them as minimally disruptive as
possible, so they are the smallest changes I could find to the current
overall architecture to make this work. As a result there are some
things that may be a little surprising. For instance, you can take a
UTF-8 encoded file and insert a backslash line continuation in the
middle of a multibyte sequence, and gcc will happily paste it back
together and then interpret the resulting UTF-8. I think it's
technically OK standardwise since the conversion from extended
characters to the source character set is implementation-defined, but
it's hardly a straightforward definition. It is sort of consistent
with the treatment of undefined behavior with UCN escapes though,
which gcc already permits to be pasted together over a line
continuation. Anyway, should this behavior be documented as well? I
doubt anyone would be happy with a full-blown solution that involves
doing the UTF-8 conversion at initial parse time, given how much of
the libcpp code is devoted to optimizing the performance of scanning
input files, so I presume this is how it will end up working.

> I should also note that a few of the tests added by the patch are testing
> things that are properties of the implementation that might arguably be
> bugs, rather than standard features, and so perhaps should at least have
> comments added saying they are testing those implementation properties.
>
> gcc/testsuite/gcc.dg/cpp/ucnid-7-utf8.c, testing invalid UTF-8, is relying
> on GCC, in its default -finput-charset=utf-8 mode, not actually checking
> that the input is valid UTF-8.  It's clear that avoiding such a check
> makes sense in strings and comments, both as a matter of efficiency and
> because it's likely to do the right thing for a lot of user programs that
> use non-UTF-8 character sets in those places and just need the bytes in
> the strings to be passed through to the compiler output (rather than
> requiring users to specify -finput-charset and -fexec-charset for those
> programs).  Outside those contexts it's less obvious what's the best way
> to behave (this sort of test, where the stray non-UTF-8 bytes are in text
> that disappears as a result of macro expansion, is certainly a corner
> case).
>

My main reason for including this test was to demonstrate that
existing behavior is unchanged by the patch. If you think it makes
more sense, I could omit the test altogether; otherwise I will add a
comment as you suggested.

> gcc/testsuite/g++.dg/cpp/ucnid-2-utf8.C and
> gcc/testsuite/g++.dg/cpp/ucnid-3-utf8.C are testing double stringizing in
> C++, where strictly the results they expect show that GCC does not conform
> to the C++ standard requirement to convert all extended characters to UCNs
> (because C++ does not have the special C rule making it
> implementation-defined whether the \ of a UCN in a string literal is
> doubled when stringizing).

Thanks. I didn't mean to ignore this point when you made it in the PR
comments; I just wasn't sure of the best way to handle it. Would you
prefer that I just add a comment, or should I change the test to look
for the standard-conforming output and make it an XFAIL?

Finally, one general question: when I submit these last changes, is it
better to send them as a new patch relative to what I already sent, or
to send the whole thing updated from scratch? Thanks again.

-Lewis
