As discussed in the sub-thread starting at <https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585710.html> "Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026]", -Wbidi-chars should not emit warnings when the problematic characters are written as UCNs rather than verbatim. For example, the line > aText = u"\u202D" + aText; found in the LibreOffice source code should not cause a warning (which couldn't even be silenced with a local `#pragma GCC diagnostic ignored "-Wbidi-chars"` due to bug 53431).
Mine then. Regression because it rejects previously accepted code.
Either we drop the UCN support altogether, or make -Wbidi-chars a 2 level warning, -Wbidi-chars mapping to -Wbidi-chars=1 which doesn't warn about UCNs and -Wbidi-chars=2 that does. UCNs indeed don't have the problem that a user in an editor sees something different than what it actually is (unless some editor interprets UCNs and shows them as unicode chars), but one reason to warn about UCNs was to make sure that even what the program prints doesn't suffer from such problems. Of course, if something like libreoffice (I bet) carefully ensures it is paired, but constructs it from smaller separate literals, then it is fine.
Or e.g. for identifiers with UCNs in them, if they aren't paired, then as/ld/readelf/demangler at runtime etc. can show weird things.
(In reply to Jakub Jelinek from comment #2) > Either we drop the UCN support altogether, or make -Wbidi-chars a 2 level > warning, -Wbidi-chars mapping to -Wbidi-chars=1 which doesn't warn about > UCNs and > -Wbidi-chars=2 that does. Exactly. Except I'm not sure how well that will play with the rest of the -Wbidi-chars= suboptions. :/ Like, -Wbidi-chars=any -Wbidi-chars=2 should probably warn about any bidi chars, including UCNs, but the latter option might cause it to be just -Wbidi-chars=unpaired, but warn about UCNs.
So maybe add -Wbidi-chars-ucn, which is off by default.
Or support -Wbidi-chars=unpaired,ucn or -Wbidi-chars=any,ucn ?
(In reply to Jakub Jelinek from comment #2) > Of course, if something like libreoffice (I bet) carefully ensures it is > paired, but constructs it from smaller separate literals, then it is fine. (Or doesn't even need to ensure that e.g. a LRO is paired with a PDF, as my understanding of <https://www.unicode.org/reports/tr9/tr9-44.html> "Unicode Bidirectional Algorithm" is that such an LRO doesn't require a matching PDF, in which case its effect extends to the end of the paragraph, which appears to be what the example LibreOffice code in comment 0 makes use of.)
The trunk branch has been updated by Marek Polacek <mpolacek@gcc.gnu.org>: https://gcc.gnu.org/g:ae36f839632ddb67a53c26e9c7e73b0f56c4c11b commit r12-6850-gae36f839632ddb67a53c26e9c7e73b0f56c4c11b Author: Marek Polacek <polacek@redhat.com> Date: Wed Jan 19 19:05:22 2022 -0500 preprocessor: -Wbidi-chars and UCNs [PR104030] Stephan Bergmann reported that our -Wbidi-chars breaks the build of LibreOffice because we warn about UCNs even when their usage is correct: LibreOffice constructs strings piecewise, as in: aText = u"\u202D" + aText; and warning about that is overzealous. Since no editor (AFAIK) interprets UCNs to show them as Unicode characters, there's less risk in misinterpreting them, and so perhaps we shouldn't warn about them by default. However, identifiers containing UCNs or programs generating other programs could still cause confusion, so I'm keeping the UCN checking. To turn it on, you just need to use -Wbidi-chars=unpaired,ucn or -Wbidi-chars=any,ucn. The implementation is done by using the new EnumSet feature. PR preprocessor/104030 gcc/c-family/ChangeLog: * c.opt (Wbidi-chars): Mark as EnumSet. Also accept =ucn. gcc/ChangeLog: * doc/invoke.texi: Update documentation for -Wbidi-chars. libcpp/ChangeLog: * include/cpplib.h (enum cpp_bidirectional_level): Add bidirectional_ucn. Set values explicitly. * internal.h (cpp_reader): Adjust warn_bidi_p. * lex.cc (maybe_warn_bidi_on_close): Don't warn about UCNs unless UCN checking is on. (maybe_warn_bidi_on_char): Likewise. gcc/testsuite/ChangeLog: * c-c++-common/Wbidi-chars-10.c: Turn on UCN checking. * c-c++-common/Wbidi-chars-11.c: Likewise. * c-c++-common/Wbidi-chars-14.c: Likewise. * c-c++-common/Wbidi-chars-16.c: Likewise. * c-c++-common/Wbidi-chars-17.c: Likewise. * c-c++-common/Wbidi-chars-4.c: Likewise. * c-c++-common/Wbidi-chars-5.c: Likewise. * c-c++-common/Wbidi-chars-6.c: Likewise. * c-c++-common/Wbidi-chars-7.c: Likewise. * c-c++-common/Wbidi-chars-8.c: Likewise. * c-c++-common/Wbidi-chars-9.c: Likewise. * c-c++-common/Wbidi-chars-ranges.c: Likewise. * c-c++-common/Wbidi-chars-18.c: New test. * c-c++-common/Wbidi-chars-19.c: New test. * c-c++-common/Wbidi-chars-20.c: New test. * c-c++-common/Wbidi-chars-21.c: New test. * c-c++-common/Wbidi-chars-22.c: New test. * c-c++-common/Wbidi-chars-23.c: New test.
Fixed.