[PATCH v2] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026]
David Malcolm
dmalcolm@redhat.com
Tue Nov 16 23:00:58 GMT 2021
> On Mon, Nov 15, 2021 at 06:15:40PM -0500, David Malcolm wrote:
> > > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote:
> > > > Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine,
> > > > but changing the name is a trivial operation.
> > >
> > > Here's a patch with a better name (suggested by Jonathan W.). Otherwise no
> > > changes.
> >
> > Thanks for implementing this.
> >
> > >
> > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > >
> > > -- >8 --
> > > From a link below:
> > > "An issue was discovered in the Bidirectional Algorithm in the Unicode
> > > Specification through 14.0. It permits the visual reordering of
> > > characters via control sequences, which can be used to craft source code
> > > that renders different logic than the logical ordering of tokens
> > > ingested by compilers and interpreters. Adversaries can leverage this to
> > > encode source code for compilers accepting Unicode such that targeted
> > > vulnerabilities are introduced invisibly to human reviewers."
> > >
> > > More info:
> > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> > > https://trojansource.codes/
> > >
> > > This is not a compiler bug. However, to mitigate the problem, this patch
> > > implements -Wbidi-chars=[none|unpaired|any] to warn about possibly
> > > misleading Unicode bidirectional characters the preprocessor may encounter.
[...snip...]
> >
> > Terminology nit:
> > The patch is referring to "bidirectional characters", but I think the
> > term "bidirectional control characters" would be better.
>
> Adjusted.
Thanks.
I wonder if the warning should be -Wbidi-control-chars, but I don't
care enough to insist on it being changed.
>
> > For example, a passage of text containing both numbers and characters
> > in a right-to-left script could be considered "bidirectional", since
> > the numbers are written from left-to-right.
> >
> > Specifically, the patch looks for these specific characters:
> > * U+202A LEFT-TO-RIGHT EMBEDDING
> > * U+202B RIGHT-TO-LEFT EMBEDDING
> > * U+202C POP DIRECTIONAL FORMATTING
> > * U+202D LEFT-TO-RIGHT OVERRIDE
> > * U+202E RIGHT-TO-LEFT OVERRIDE
> > * U+2066 LEFT-TO-RIGHT ISOLATE
> > * U+2067 RIGHT-TO-LEFT ISOLATE
> > * U+2068 FIRST STRONG ISOLATE
> > * U+2069 POP DIRECTIONAL ISOLATE
> >
> > However, the following characters could also be considered as
> > "bidirectional control characters":
> > * U+200E LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E)
> > * U+200F RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F)
> > but aren't checked for in the patch. Should they be? I can imagine
> > ways in which they could be abused, so I think so.
>
> I'd only intended to check the bidi chars described in the original
> trojan source pdf, but I added checking for U+200E/U+200F too, since
> it was easy enough. AFAIK they aren't popped by a PDF/PDI like the
> rest, so don't need to go on the vec, and so we only warn with =any.
> Tests: Wbidi-chars-16.c + Wbidi-chars-17.c
Thanks. I took a look through the revised patch and I think you
updated things correctly.
[...snip...]
> > > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > new file mode 100644
> > > index 00000000000..9fd4bc535ca
> > > --- /dev/null
> > > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > @@ -0,0 +1,166 @@
> > > +/* PR preprocessor/103026 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */
> > > +/* Test all bidi chars in various contexts (identifiers, comments,
> > > + string literals, character constants), both UCN and UTF-8. The bidi
> > > + chars here are properly terminated, except for the character constants. */
> > > +
> > > +/* a b c LRE 1 2 3 PDF x y z */
> > > +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
> > > +/* a b c RLE 1 2 3 PDF x y z */
> > > +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
> > > +/* a b c LRO 1 2 3 PDF x y z */
> > > +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
> > > +/* a b c RLO 1 2 3 PDF x y z */
> > > +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
> > > +/* a b c LRI 1 2 3 PDI x y z */
> > > +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
> > > +/* a b c RLI 1 2 3 PDI x y */
> > > +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
> > > +/* a b c FSI 1 2 3 PDI x y z */
> > > +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
> >
> > AIUI the Unicode bidirectionality algorithm works at the line level,
> > and so each line in a block comment should be checked individually for
> > unclossed bidi control chars, rather than a block comment as a whole.
> > Hence I think the test case needs to have block comment test coverage
> > for:
> > - single line blocks
> > - first line of a multiline block comment
> > - middle line of a multiline block comment
> > - final line of a multiline block comment
> > but I think the patch as it stands is only checking for the first of
> > these four cases.
>
> The patch handles all of them, because of:
> 1534 if (warn_bidi_p)
> 1535 maybe_warn_bidi_on_close (pfile, cur);
> in _cpp_skip_block_comment, but I was lacking some more testing, so I've
> added some testing, and included a new test: Wbidi-chars-15.c.
All of the cases in Wbidi-chars-15.c only test for unparired chars in a
middle line of a multiline block comment; I don't think the patch has
any explicit coverage for unpaired control chars happening in the first
line and last lines of *multiline* block comments. So it would be good
if Wbidi-chars-15.c could gain some coverage for that (don't have to
handle all the different chars).
[...snip...]
> > > diff --git a/libcpp/lex.c b/libcpp/lex.c
> > > index fa2253d41c3..3fb518e202b 100644
> > > --- a/libcpp/lex.c
> > > +++ b/libcpp/lex.c
> > > @@ -1164,6 +1164,300 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment)
> > > }
> > > }
> > >
> > > +namespace bidi {
> > > + enum class kind {
> > > + NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI
> > > + };
> > > +
> > > + /* All the UTF-8 encodings of bidi characters start with E2. */
> > > + constexpr uchar utf8_start = 0xe2;
> >
> > Is there a difference between "constexpr" vs "const" here? (sorry for
> > my ignorance)
>
> I just wanted to make sure that utf8_start will be usable in contexts where
> an integral constant expression is required. 'const' objects need not be
> initialized with compile-time constants, but 'constexpr' objects do.
Thanks.
> > > +
> > > + /* A vector holding currently open bidi contexts. We use a char for
> > > + each context, its LSB is 1 if it represents a PDF context, 0 if it
> > > + represents a PDI context. The next bit is 1 if this context was open
> > > + by a bidi character written as a UCN, and 0 when it was UTF-8. */
> > > + semi_embedded_vec <unsigned char, 16> vec;
> > > +
> > > + /* Close the whole comment/identifier/string literal/character constant
> > > + context. */
> > > + void on_close ()
> > > + {
> > > + vec.truncate (0);
> > > + }
> > > +
> > > + /* Pop the last element in the vector. */
> > > + void pop ()
> > > + {
> > > + unsigned int len = vec.count ();
> > > + gcc_checking_assert (len > 0);
> > > + vec.truncate (len - 1);
> > > + }
> > > +
> > > + /* Return the context of the Ith element. */
> > > + kind ctx_at (unsigned int i)
> > > + {
> > > + return (vec[i] & 1) ? kind::PDF : kind::PDI;
> > > + }
> > > +
> > > + /* Return which context is currently opened. */
> > > + kind current_ctx ()
> > > + {
> > > + unsigned int len = vec.count ();
> > > + if (len == 0)
> > > + return kind::NONE;
> > > + return ctx_at (len - 1);
> > > + }
> > > +
> > > + /* Return true if the current context comes from a UCN origin, that is,
> > > + the bidi char which started this bidi context was written as a UCN. */
> > > + bool current_ctx_ucn_p ()
> > > + {
> > > + unsigned int len = vec.count ();
> > > + gcc_checking_assert (len > 0);
> > > + return (vec[len - 1] >> 1) & 1;
> > > + }
> > > +
> > > + /* We've read a bidi char, update the current vector as necessary. */
> > > + void on_char (kind k, bool ucn_p)
> > > + {
> > > + switch (k)
> > > + {
> > > + case kind::LRE:
> > > + case kind::RLE:
> > > + case kind::LRO:
> > > + case kind::RLO:
> > > + vec.push (ucn_p ? 3u : 1u);
> > > + break;
> > > + case kind::LRI:
> > > + case kind::RLI:
> > > + case kind::FSI:
> > > + vec.push (ucn_p ? 2u : 0u);
> > > + break;
> >
> > I don't like the hand-coded bit fields here, where bit 1 and bit 2 in
> > the above have special meaning, but aren't clearly labelled as such.
> >
> > Please can you at least use some kind of constant/define to make clear
> > the meaning of the bits. Even clearer would be bitfields; is there a
> > performance reason for not using them? (though this code is only
> > called on bidi control chars, which presumably is a rare occurrence).
> > My patch here:
> > "[PATCH 2/2] Capture locations of bidi chars and underline ranges"
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583160.html
> > did some refactoring of this patch, replacing hand-coded bit
> > manipulation with bitfields in a struct (as well as then using that as
> > a good place to stach location_t values, and then using these
> > locations).
> > Would it be helpful if I split that part of my patch out?
>
> I think they are just fine here, given they are used only in bidi:: and
> not outside of it. And I could just use a simple unsigned char in
> semi_embedded_vec instead of inventing a new struct.
>
> Your diagnostic patch changes it because you need to remember a location
> too, so we're changing it anyway, so I left it be so that you have fewer
> conflicts.
Fair enough; I'll post an updated version of my followup once yours
goes in.
> > [...snip...]
> >
> > > +/* Parse a sequence of 3 bytes starting with P and return its bidi code. */
> > > +
> > > +static bidi::kind
> > > +get_bidi_utf8 (const unsigned char *const p)
> > > +{
> > > + gcc_checking_assert (p[0] == bidi::utf8_start);
> > > +
> > > + if (p[1] == 0x80)
> > > + switch (p[2])
> >
> > get_bidi_utf8 accesss up to 2 bytes beyond "p"...
> >
> > [...snip...]
> >
> > ...and is called in various places such as...
> >
> > > @@ -1218,6 +1519,13 @@ _cpp_skip_block_comment (cpp_reader *pfile)
> > >
> > > cur = buffer->cur;
> > > }
> > > + /* If this is a beginning of a UTF-8 encoding, it might be
> > > + a bidirectional character. */
> > > + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p)
> > > + {
> > > + bidi::kind kind = get_bidi_utf8 (cur - 1);
> > > + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false);
> > > + }
> >
> > Are we guaranteed to have a '\n' at the end of the buffer? (even for
> > the final line of the file) That would ensure that we don't read past
> > the end of the buffer.
>
> We've discussed this in our internal thread; I think I said that you
> will always get a '\n' so this is not going to read past the end of
> the buffer. To be sure...
>
> > Can we have testcases involving malformed UTF-8, in which, say:
> > - the final byte of the input file is 0xe2
> > - the final two bytes of the input file are 0xe2 0x80
> > for each of block comment, C++-style comment, string-literal,
> > identifier, etc?
> > (or is that overkill?)
>
> ...I'd crafted a malformed text file using hexedit but couldn't get
> it to crash. I'd rather not include it in the testsuite though.
Fair enough.
> > [...snip...]
> >
> > > @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
> > > {
> > > /* Slower version for identifiers containing UCNs
> > > or extended chars (including $). */
> > > - do {
> > > - while (ISIDNUM (*pfile->buffer->cur))
> > > - {
> > > - NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> > > - pfile->buffer->cur++;
> > > - }
> > > - } while (forms_identifier_p (pfile, false, nst));
> > > + do
> > > + {
> > > + while (ISIDNUM (*pfile->buffer->cur))
> > > + {
> > > + NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> > > + pfile->buffer->cur++;
> > > + }
> > > + }
> > > + while (forms_identifier_p (pfile, false, nst));
> >
> > Is the above purely a whitespace change?
>
> Yes.
If I'm reading things correctly, these lines in the existing code were
correctly indented, so is there a purpose to this change? If not,
please can you remove this change from the patch (to minimize the
change to the history).
[...snip...]
> +/* We're closing a bidi context, that is, we've encountered a newline,
> + are closing a C-style comment, or are at the end of a string literal,
> + character constant, or identifier. Warn if this context was not
> + properly terminated by a PDI or PDF. P points to the last character
> + in this context. */
> +
> +static void
> +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p)
> +{
> + if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired
> + && bidi::vec.count () > 0)
> + {
> + const location_t loc
> + = linemap_position_for_column (pfile->line_table,
> + CPP_BUF_COLUMN (pfile->buffer, p));
> + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0,
> + "unpaired UTF-8 bidirectional character "
> + "detected");
> + }
Sorry, I missed this one in my initial review, should be "control
character" here.
[...snip...]
OK for trunk with the above nits fixed.
Thanks
Dave
More information about the Gcc-patches
mailing list