Bug 103026 - Implement warning for Unicode bidi override characters [CVE-2021-42574]
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: preprocessor
Version: 12.0
Importance: P3 normal
Target Milestone: ---
Assignee: Marek Polacek
URL: https://gcc.gnu.org/pipermail/gcc-pat...
Keywords: diagnostic, patch
Depends on:
Blocks: new-warning, new_warning
 
Reported: 2021-11-01 15:03 UTC by Marek Polacek
Modified: 2021-11-18 14:35 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-11-01 00:00:00


Description Marek Polacek 2021-11-01 15:03:15 UTC
An issue was discovered in the Bidirectional Algorithm in the Unicode Specification through 14.0. It permits the visual reordering of characters via control sequences, which can be used to craft source code that renders different logic than the logical ordering of tokens ingested by compilers and interpreters. Adversaries can leverage this to encode source code for compilers accepting Unicode such that targeted vulnerabilities are introduced invisibly to human reviewers.

We ought to have a warning in the preprocessor that warns about the potentially misleading Unicode bidirectional characters.

More info:
https://nvd.nist.gov/vuln/detail/CVE-2021-42574
https://trojansource.codes/
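
As a rough illustration of the kind of check being requested, the following Python sketch lists every bidirectional control character found in a piece of source text. It is not GCC code: the function name find_bidi_controls and the exact character set are assumptions made for this example, and columns are counted in code points rather than bytes.

```python
import unicodedata

# Bidirectional control characters from the Unicode Bidirectional Algorithm.
# The exact set checked here is an assumption made for this sketch.
BIDI_CONTROLS = {
    "\u202a",  # LEFT-TO-RIGHT EMBEDDING (LRE)
    "\u202b",  # RIGHT-TO-LEFT EMBEDDING (RLE)
    "\u202c",  # POP DIRECTIONAL FORMATTING (PDF)
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE (LRO)
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE (RLO)
    "\u2066",  # LEFT-TO-RIGHT ISOLATE (LRI)
    "\u2067",  # RIGHT-TO-LEFT ISOLATE (RLI)
    "\u2068",  # FIRST STRONG ISOLATE (FSI)
    "\u2069",  # POP DIRECTIONAL ISOLATE (PDI)
    "\u200e",  # LEFT-TO-RIGHT MARK (LRM)
    "\u200f",  # RIGHT-TO-LEFT MARK (RLM)
}


def find_bidi_controls(text):
    """Return (line, column, name) for each bidi control character.

    Lines and columns are 1-based; columns count code points, not bytes.
    """
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for column, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, column, unicodedata.name(ch)))
    return hits


if __name__ == "__main__":
    # The control characters are written as escapes so this file itself
    # contains no invisible characters.
    sample = "int x;\n/* \u202e hidden \u2066 */\n"
    for lineno, column, name in find_bidi_controls(sample):
        print(f"{lineno}:{column}: {name}")
```

Writing the characters as \u escapes keeps the scanner's own source free of the characters it looks for.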
Comment 1 Marek Polacek 2021-11-01 15:03:37 UTC
I have a patch.
Comment 2 Marek Polacek 2021-11-01 16:38:27 UTC
Patch posted:
https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583031.html
Comment 3 Jakub Jelinek 2021-11-01 17:50:08 UTC
This also affects other language front ends (Fortran, Go, D, ...).
For Go, for example, it was already raised a few years ago:
https://github.com/golang/go/issues/20209
Some testcases from the reporters:
https://github.com/nickboucher/trojan-source
Comment 4 GCC Commits 2021-11-17 03:01:38 UTC
The trunk branch has been updated by Marek Polacek <mpolacek@gcc.gnu.org>:

https://gcc.gnu.org/g:51c500269bf53749b107807d84271385fad35628

commit r12-5331-g51c500269bf53749b107807d84271385fad35628
Author: Marek Polacek <polacek@redhat.com>
Date:   Wed Oct 6 14:33:59 2021 -0400

    libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026]
    
    From a link below:
    "An issue was discovered in the Bidirectional Algorithm in the Unicode
    Specification through 14.0. It permits the visual reordering of
    characters via control sequences, which can be used to craft source code
    that renders different logic than the logical ordering of tokens
    ingested by compilers and interpreters. Adversaries can leverage this to
    encode source code for compilers accepting Unicode such that targeted
    vulnerabilities are introduced invisibly to human reviewers."
    
    More info:
    https://nvd.nist.gov/vuln/detail/CVE-2021-42574
    https://trojansource.codes/
    
    This is not a compiler bug.  However, to mitigate the problem, this patch
    implements -Wbidi-chars=[none|unpaired|any] to warn about possibly
    misleading Unicode bidirectional control characters the preprocessor may
    encounter.
    
    The default is =unpaired, which warns about improperly terminated
    bidirectional control characters; e.g. a LRE without its corresponding PDF.
    The level =any warns about any use of bidirectional control characters.
    
    This patch handles both UCNs and UTF-8 characters.  UCNs designating
    bidi characters in identifiers are accepted since r204886.  Then r217144
    enabled -fextended-identifiers by default.  Extended characters in C/C++
    identifiers have been accepted since r275979.  However, this patch still
    warns about mixing UTF-8 and UCN bidi characters; there seems to be no
    good reason to allow mixing them.
    
    We warn in different contexts: comments (both C and C++-style), string
    literals, character constants, and identifiers.  Expectedly, UCNs are ignored
    in comments and raw string literals.  The bidirectional control characters
    can nest so this patch handles that as well.
    
    I have not included nor tested this at all with Fortran (which also has
    string literals and line comments).
    
    Dave M. posted patches improving diagnostics involving Unicode characters.
    This patch does not make use of this new infrastructure yet.
    
            PR preprocessor/103026
    
    gcc/c-family/ChangeLog:
    
            * c.opt (Wbidi-chars, Wbidi-chars=): New option.
    
    gcc/ChangeLog:
    
            * doc/invoke.texi: Document -Wbidi-chars.
    
    libcpp/ChangeLog:
    
            * include/cpplib.h (enum cpp_bidirectional_level): New.
            (struct cpp_options): Add cpp_warn_bidirectional.
            (enum cpp_warning_reason): Add CPP_W_BIDIRECTIONAL.
            * internal.h (struct cpp_reader): Add warn_bidi_p member
            function.
            * init.c (cpp_create_reader): Set cpp_warn_bidirectional.
            * lex.c (bidi): New namespace.
            (get_bidi_utf8): New function.
            (get_bidi_ucn): Likewise.
            (maybe_warn_bidi_on_close): Likewise.
            (maybe_warn_bidi_on_char): Likewise.
            (_cpp_skip_block_comment): Implement warning about bidirectional
            control characters.
            (skip_line_comment): Likewise.
            (forms_identifier_p): Likewise.
            (lex_identifier): Likewise.
            (lex_string): Likewise.
            (lex_raw_string): Likewise.
    
    gcc/testsuite/ChangeLog:
    
            * c-c++-common/Wbidi-chars-1.c: New test.
            * c-c++-common/Wbidi-chars-2.c: New test.
            * c-c++-common/Wbidi-chars-3.c: New test.
            * c-c++-common/Wbidi-chars-4.c: New test.
            * c-c++-common/Wbidi-chars-5.c: New test.
            * c-c++-common/Wbidi-chars-6.c: New test.
            * c-c++-common/Wbidi-chars-7.c: New test.
            * c-c++-common/Wbidi-chars-8.c: New test.
            * c-c++-common/Wbidi-chars-9.c: New test.
            * c-c++-common/Wbidi-chars-10.c: New test.
            * c-c++-common/Wbidi-chars-11.c: New test.
            * c-c++-common/Wbidi-chars-12.c: New test.
            * c-c++-common/Wbidi-chars-13.c: New test.
            * c-c++-common/Wbidi-chars-14.c: New test.
            * c-c++-common/Wbidi-chars-15.c: New test.
            * c-c++-common/Wbidi-chars-16.c: New test.
            * c-c++-common/Wbidi-chars-17.c: New test.
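
The difference between the =unpaired and =any levels described in the commit message above can be sketched with a small stack-based checker. This is a simplified Python illustration, not the libcpp implementation: the names count_unpaired and should_warn are hypothetical, the pairing rules (PDF closes LRE/RLE/LRO/RLO, PDI closes LRI/RLI/FSI) are taken from the Unicode Bidirectional Algorithm, and the text passed in is assumed to be the contents of a single comment, string literal, or identifier.

```python
# Opening controls, grouped by the character that terminates them.
EMBEDDINGS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATES = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING, closes an embedding/override
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE, closes an isolate
ALL_CONTROLS = EMBEDDINGS | ISOLATES | {PDF, PDI}


def count_unpaired(context_text):
    """Count bidi contexts still open at the end of one lexical context."""
    stack = []
    for ch in context_text:
        if ch in EMBEDDINGS or ch in ISOLATES:
            stack.append(ch)  # contexts may nest, so keep a stack
        elif ch == PDF:
            # PDF only closes an embedding or override on top of the stack.
            if stack and stack[-1] in EMBEDDINGS:
                stack.pop()
        elif ch == PDI:
            # PDI closes the innermost isolate and anything opened inside it.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i] in ISOLATES:
                    del stack[i:]
                    break
    return len(stack)


def should_warn(context_text, level="unpaired"):
    """Decide whether to warn, mirroring the three levels named in the patch."""
    if level == "none":
        return False
    if level == "any":
        return any(ch in ALL_CONTROLS for ch in context_text)
    if level == "unpaired":
        return count_unpaired(context_text) > 0
    raise ValueError(f"unknown level: {level!r}")


if __name__ == "__main__":
    # Comment modeled on line 6 of Wbidi-chars-1.c, written with escapes.
    comment = "/*\u202e } \u2066if (isAdmin)\u2069 \u2066 begin admins only */"
    print(count_unpaired(comment), should_warn(comment))
```

In this sketch a properly terminated sequence such as LRE ... PDF triggers a warning only at the "any" level, while a context that ends with controls still open triggers it at "unpaired" as well.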
Comment 5 Marek Polacek 2021-11-17 03:05:45 UTC
Added to GCC 12.
Comment 6 GCC Commits 2021-11-17 22:33:53 UTC
The master branch has been updated by David Malcolm <dmalcolm@gcc.gnu.org>:

https://gcc.gnu.org/g:1a7f2c0774129750fdf73e9f1b78f0ce983c9ab3

commit r12-5355-g1a7f2c0774129750fdf73e9f1b78f0ce983c9ab3
Author: David Malcolm <dmalcolm@redhat.com>
Date:   Tue Nov 2 09:54:32 2021 -0400

    libcpp: escape non-ASCII source bytes in -Wbidi-chars= [PR103026]
    
    This flags rich_locations associated with -Wbidi-chars= so that
    non-ASCII bytes will be escaped when printing the source lines
    (using the diagnostics support I added in
    r12-4825-gbd5e882cf6e0def3dd1bc106075d59a303fe0d1e).
    
    In particular, this ensures that the printed source lines will
    be pure ASCII, and thus the visual ordering of the characters
    will be the same as the logical ordering.
    
    Before:
    
      Wbidi-chars-1.c: In function ‘main’:
      Wbidi-chars-1.c:6:43: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=]
          6 |     /*â® } â¦if (isAdmin)⩠⦠begin admins only */
            |                                           ^
      Wbidi-chars-1.c:9:28: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=]
          9 |     /* end admins only â® { â¦*/
            |                            ^
    
      Wbidi-chars-11.c:6:15: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=]
          6 | int LRE_âª_PDF_\u202c;
            |               ^
      Wbidi-chars-11.c:8:19: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=]
          8 | int LRE_\u202a_PDF_â¬_;
            |                   ^
      Wbidi-chars-11.c:10:28: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=]
         10 | const char *s1 = "LRE_âª_PDF_\u202c";
            |                            ^
      Wbidi-chars-11.c:12:33: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=]
         12 | const char *s2 = "LRE_\u202a_PDF_â¬";
            |                                 ^
    
    After:
    
      Wbidi-chars-1.c: In function ‘main’:
      Wbidi-chars-1.c:6:43: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=]
          6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
            |                                                                           ^
      Wbidi-chars-1.c:9:28: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=]
          9 |     /* end admins only <U+202E> { <U+2066>*/
            |                                            ^
    
      Wbidi-chars-11.c:6:15: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=]
          6 | int LRE_<U+202A>_PDF_\u202c;
            |                       ^
      Wbidi-chars-11.c:8:19: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=]
          8 | int LRE_\u202a_PDF_<U+202C>_;
            |                   ^
      Wbidi-chars-11.c:10:28: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=]
         10 | const char *s1 = "LRE_<U+202A>_PDF_\u202c";
            |                                    ^
      Wbidi-chars-11.c:12:33: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=]
         12 | const char *s2 = "LRE_\u202a_PDF_<U+202C>";
            |                                 ^
    
    libcpp/ChangeLog:
            PR preprocessor/103026
            * lex.c (maybe_warn_bidi_on_close): Use a rich_location
            and call set_escape_on_output (true) on it.
            (maybe_warn_bidi_on_char): Likewise.
    
    Signed-off-by: David Malcolm <dmalcolm@redhat.com>
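
The escaping described in this commit can be approximated in a few lines of Python. This is only a sketch of the idea, operating on already-decoded text: the function name escape_non_ascii is hypothetical, and the real diagnostics code works on source bytes and covers cases (such as invalid UTF-8) that are ignored here.

```python
def escape_non_ascii(line):
    """Replace every non-ASCII code point with a <U+XXXX> escape.

    The result is pure ASCII, so its visual order matches its logical order.
    """
    return "".join(
        ch if ord(ch) < 0x80 else f"<U+{ord(ch):04X}>"
        for ch in line
    )


if __name__ == "__main__":
    # Declaration modeled on line 6 of Wbidi-chars-11.c: a UTF-8 encoded
    # U+202A followed by a UCN spelled out in ASCII, which is left untouched.
    source_line = "int LRE_\u202a_PDF_\\u202c;"
    print(escape_non_ascii(source_line))
```

Because the UCN \u202c is already plain ASCII in the source, only the UTF-8 encoded character is rewritten.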
Comment 7 GCC Commits 2021-11-17 22:35:10 UTC
The master branch has been updated by David Malcolm <dmalcolm@gcc.gnu.org>:

https://gcc.gnu.org/g:bef32d4a28595e933f24fef378cf052a30b674a7

commit r12-5356-gbef32d4a28595e933f24fef378cf052a30b674a7
Author: David Malcolm <dmalcolm@redhat.com>
Date:   Tue Nov 2 15:45:22 2021 -0400

    libcpp: capture and underline ranges in -Wbidi-chars= [PR103026]
    
    This patch converts the bidi::vec to use a struct so that we can
    capture location_t values for the bidirectional control characters.
    
    Before:
    
      Wbidi-chars-1.c: In function ‘main’:
      Wbidi-chars-1.c:6:43: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=]
          6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
            |                                                                           ^
      Wbidi-chars-1.c:9:28: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=]
          9 |     /* end admins only <U+202E> { <U+2066>*/
            |                                            ^
    
    After:
    
      Wbidi-chars-1.c: In function ‘main’:
      Wbidi-chars-1.c:6:43: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
          6 |     /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
            |       ~~~~~~~~                                ~~~~~~~~                    ^
            |       |                                       |                           |
            |       |                                       |                           end of bidirectional context
            |       U+202E (RIGHT-TO-LEFT OVERRIDE)         U+2066 (LEFT-TO-RIGHT ISOLATE)
      Wbidi-chars-1.c:9:28: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=]
          9 |     /* end admins only <U+202E> { <U+2066>*/
            |                        ~~~~~~~~   ~~~~~~~~ ^
            |                        |          |        |
            |                        |          |        end of bidirectional context
            |                        |          U+2066 (LEFT-TO-RIGHT ISOLATE)
            |                        U+202E (RIGHT-TO-LEFT OVERRIDE)
    
    Signed-off-by: David Malcolm <dmalcolm@redhat.com>
    
    gcc/testsuite/ChangeLog:
            PR preprocessor/103026
            * c-c++-common/Wbidi-chars-ranges.c: New test.
    
    libcpp/ChangeLog:
            PR preprocessor/103026
            * lex.c (struct bidi::context): New.
            (bidi::vec): Convert to a vec of context rather than unsigned
            char.
            (bidi::ctx_at): Rename to...
            (bidi::pop_kind_at): ...this and reimplement for above change.
            (bidi::current_ctx): Update for change to vec.
            (bidi::current_ctx_ucn_p): Likewise.
            (bidi::current_ctx_loc): New.
            (bidi::on_char): Update for usage of context struct.  Add "loc"
            param and pass it when pushing contexts.
            (get_location_for_byte_range_in_cur_line): New.
            (get_bidi_utf8): Rename to...
            (get_bidi_utf8_1): ...this, reintroducing...
            (get_bidi_utf8): ...as a wrapper, setting *OUT when the result is
            not NONE.
            (get_bidi_ucn): Rename to...
            (get_bidi_ucn_1): ...this, reintroducing...
            (get_bidi_ucn): ...as a wrapper, setting *OUT when the result is
            not NONE.
            (class unpaired_bidi_rich_location): New.
            (maybe_warn_bidi_on_close): Use unpaired_bidi_rich_location when
            reporting on unpaired bidi chars.  Split into singular vs plural
            spellings.
            (maybe_warn_bidi_on_char): Pass in a location_t rather than a
            const uchar * and use it when emitting warnings, and when calling
            bidi::on_char.
            (_cpp_skip_block_comment): Capture location when kind is not NONE
            and pass it to maybe_warn_bidi_on_char.
            (skip_line_comment): Likewise.
            (forms_identifier_p): Likewise.
            (lex_raw_string): Likewise.
            (lex_string): Likewise.
    
    Signed-off-by: David Malcolm <dmalcolm@redhat.com>
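
The idea of storing a location with each open context, so that the diagnostic can point at every unpaired character, can be sketched as follows. This Python version is an illustration only: the Context class, open_contexts, describe, and warning_text are hypothetical names, columns are 1-based code point offsets rather than location_t values, and the pairing rules are the simplified ones from the Unicode Bidirectional Algorithm.

```python
import unicodedata
from dataclasses import dataclass

EMBEDDINGS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATES = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF = "\u202c"
PDI = "\u2069"


@dataclass
class Context:
    char: str    # the control character that opened the context
    column: int  # 1-based column, counted in code points


def open_contexts(text):
    """Return the contexts still open at the end of text, with locations."""
    stack = []
    for column, ch in enumerate(text, start=1):
        if ch in EMBEDDINGS or ch in ISOLATES:
            stack.append(Context(ch, column))
        elif ch == PDF:
            if stack and stack[-1].char in EMBEDDINGS:
                stack.pop()
        elif ch == PDI:
            # Close the innermost isolate and everything nested inside it.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i].char in ISOLATES:
                    del stack[i:]
                    break
    return stack


def describe(ctx):
    """Label in the style used by the diagnostic, e.g. 'U+202E (NAME)'."""
    return f"U+{ord(ctx.char):04X} ({unicodedata.name(ctx.char)})"


def warning_text(contexts):
    """Pick the singular or plural spelling of the warning."""
    plural = "s" if len(contexts) > 1 else ""
    return f"unpaired UTF-8 bidirectional control character{plural} detected"


if __name__ == "__main__":
    contexts = open_contexts("/* \u202e { \u2066*/")
    print(warning_text(contexts))
    for ctx in contexts:
        print(f"  column {ctx.column}: {describe(ctx)}")
```

Keeping the column next to the character is what lets each still-open context be labeled individually instead of reporting only the position where the comment or literal ends.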
Comment 8 David Malcolm 2021-11-18 14:35:04 UTC
For reference, here's a demo showing the colorized output on Compiler Explorer:
  https://godbolt.org/z/1sohErzWz