Bug 104030 - [12 Regression] -Wbidi-chars should not warn about UCNs
Summary: [12 Regression] -Wbidi-chars should not warn about UCNs
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: preprocessor (show other bugs)
Version: 12.0
: P1 normal
Target Milestone: 12.0
Assignee: Marek Polacek
URL:
Keywords: diagnostic
Depends on:
Blocks:
 
Reported: 2022-01-14 14:51 UTC by Stephan Bergmann
Modified: 2022-01-24 22:50 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-01-14 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Stephan Bergmann 2022-01-14 14:51:47 UTC
As discussed in the sub-thread starting at <https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585710.html> "Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026]", -Wbidi-chars should not emit warnings when the problematic characters are written as UCNs rather than verbatim.  For example, the line

>         aText = u"\u202D" + aText;

found in the LibreOffice source code should not cause a warning (which couldn't even be silenced with a local `#pragma GCC diagnostic ignored "-Wbidi-chars"` due to bug 53431).
Comment 1 Marek Polacek 2022-01-14 14:55:26 UTC
Mine then.  Regression because it rejects previously accepted code.
Comment 2 Jakub Jelinek 2022-01-14 14:59:02 UTC
Either we drop the UCN support altogether, or make -Wbidi-chars a 2 level warning, -Wbidi-chars mapping to -Wbidi-chars=1 which doesn't warn about UCNs and
-Wbidi-chars=2 that does.
UCNs indeed don't have the problem that a user in an editor sees something different than what it actually is (unless some editor interprets UCNs and shows them as unicode chars), but one reason to warn about UCNs was to make sure that even what the program prints doesn't suffer from such problems.  Of course, if something like libreoffice (I bet) carefully ensures it is paired, but constructs it from smaller separate literals, then it is fine.
Comment 3 Jakub Jelinek 2022-01-14 15:00:30 UTC
Or e.g. for identifiers with UCNs in them, if they aren't paired, then as/ld/readelf/demangler at runtime etc. can show weird things.
Comment 4 Marek Polacek 2022-01-14 15:05:53 UTC
(In reply to Jakub Jelinek from comment #2)
> Either we drop the UCN support altogether, or make -Wbidi-chars a 2 level
> warning, -Wbidi-chars mapping to -Wbidi-chars=1 which doesn't warn about
> UCNs and
> -Wbidi-chars=2 that does.

Exactly.  Except I'm not sure how well that will play with the rest of the -Wbidi-chars= suboptions.  :/

Like,

-Wbidi-chars=any -Wbidi-chars=2

should probably warn about any bidi chars, including UCNs, but the latter option might cause it to be just -Wbidi-chars=unpaired, but warn about UCNs.
Comment 5 Marek Polacek 2022-01-14 15:07:21 UTC
So maybe add -Wbidi-chars-ucn, which is off by default.
Comment 6 Jakub Jelinek 2022-01-14 15:30:15 UTC
Or support -Wbidi-chars=unpaired,ucn or -Wbidi-chars=any,ucn ?
Comment 7 Stephan Bergmann 2022-01-14 15:41:32 UTC
(In reply to Jakub Jelinek from comment #2)
> Of course, if something like libreoffice (I bet) carefully ensures it is
> paired, but constructs it from smaller separate literals, then it is fine.

(Or doesn't even need to ensure that e.g. a LRO is paired with a PDF, as my understanding of <https://www.unicode.org/reports/tr9/tr9-44.html> "Unicode Bidirectional Algorithm" is that such an LRO doesn't require a matching PDF, in which case its effect extends to the end of the paragraph, which appears to be what the example LibreOffice code in comment 0 makes use of.)
Comment 8 GCC Commits 2022-01-24 22:49:28 UTC
The trunk branch has been updated by Marek Polacek <mpolacek@gcc.gnu.org>:

https://gcc.gnu.org/g:ae36f839632ddb67a53c26e9c7e73b0f56c4c11b

commit r12-6850-gae36f839632ddb67a53c26e9c7e73b0f56c4c11b
Author: Marek Polacek <polacek@redhat.com>
Date:   Wed Jan 19 19:05:22 2022 -0500

    preprocessor: -Wbidi-chars and UCNs [PR104030]
    
    Stephan Bergmann reported that our -Wbidi-chars breaks the build
    of LibreOffice because we warn about UCNs even when their usage
    is correct: LibreOffice constructs strings piecewise, as in:
    
      aText = u"\u202D" + aText;
    
    and warning about that is overzealous.  Since no editor (AFAIK)
    interprets UCNs to show them as Unicode characters, there's less
    risk in misinterpreting them, and so perhaps we shouldn't warn
    about them by default.  However, identifiers containing UCNs or
    programs generating other programs could still cause confusion,
    so I'm keeping the UCN checking.  To turn it on, you just need
    to use -Wbidi-chars=unpaired,ucn or -Wbidi-chars=any,ucn.
    
    The implementation is done by using the new EnumSet feature.
    
            PR preprocessor/104030
    
    gcc/c-family/ChangeLog:
    
            * c.opt (Wbidi-chars): Mark as EnumSet.  Also accept =ucn.
    
    gcc/ChangeLog:
    
            * doc/invoke.texi: Update documentation for -Wbidi-chars.
    
    libcpp/ChangeLog:
    
            * include/cpplib.h (enum cpp_bidirectional_level): Add
            bidirectional_ucn.  Set values explicitly.
            * internal.h (cpp_reader): Adjust warn_bidi_p.
            * lex.cc (maybe_warn_bidi_on_close): Don't warn about UCNs
            unless UCN checking is on.
            (maybe_warn_bidi_on_char): Likewise.
    
    gcc/testsuite/ChangeLog:
    
            * c-c++-common/Wbidi-chars-10.c: Turn on UCN checking.
            * c-c++-common/Wbidi-chars-11.c: Likewise.
            * c-c++-common/Wbidi-chars-14.c: Likewise.
            * c-c++-common/Wbidi-chars-16.c: Likewise.
            * c-c++-common/Wbidi-chars-17.c: Likewise.
            * c-c++-common/Wbidi-chars-4.c: Likewise.
            * c-c++-common/Wbidi-chars-5.c: Likewise.
            * c-c++-common/Wbidi-chars-6.c: Likewise.
            * c-c++-common/Wbidi-chars-7.c: Likewise.
            * c-c++-common/Wbidi-chars-8.c: Likewise.
            * c-c++-common/Wbidi-chars-9.c: Likewise.
            * c-c++-common/Wbidi-chars-ranges.c: Likewise.
            * c-c++-common/Wbidi-chars-18.c: New test.
            * c-c++-common/Wbidi-chars-19.c: New test.
            * c-c++-common/Wbidi-chars-20.c: New test.
            * c-c++-common/Wbidi-chars-21.c: New test.
            * c-c++-common/Wbidi-chars-22.c: New test.
            * c-c++-common/Wbidi-chars-23.c: New test.
Comment 9 Marek Polacek 2022-01-24 22:50:35 UTC
Fixed.