[PATCH] Initial implementation of -Whomoglyph [PR preprocessor/103027]

Martin Sebor msebor@gmail.com
Tue Nov 2 19:49:42 GMT 2021


On 11/1/21 3:14 PM, David Malcolm via Gcc-patches wrote:
> [Resending to get around mailing list size limit; see notes below]
> 
> This patch implements a new -Whomoglyph diagnostic, enabled by default.
> 
> Internally it implements the "skeleton" algorithm from:
>    http://www.unicode.org/reports/tr39/#Confusable_Detection
> so that every new identifier is mapped to a "skeleton", and if
> the skeleton is already in use by a different identifier, issue
> a -Whomoglyph diagnostic.
> It uses the data from:
>    https://www.unicode.org/Public/security/13.0.0/confusables.txt
> to determine which characters are confusable.
> 
> For example, given the example of CVE-2021-42694 at
> https://trojansource.codes/, with this patch we emit:
> 
> t.cc:7:1: warning: identifier ‘sayНello’ (‘say\u041dello’)... [CWE-1007] [-Whomoglyph]
>      7 | void say<U+041D>ello() {
>        | ^~~~
> t.cc:3:1: note: ...confusable with non-equal identifier ‘sayHello’ here
>      3 | void sayHello() {
>        | ^~~~
> 
> (the precise location of the token isn't quite right; the
> identifiers should be underlined, rather than the "void" tokens)
> 
> This takes advantage of:
>    "diagnostics: escape non-ASCII source bytes for certain diagnostics"
>      https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583020.html
> to escape non-ASCII characters when printing a source line for -Whomoglyph,
> so that we print "say<U+041D>ello" when quoting the source line, making it
> clearer that this is not "sayHello".
> 
> In order to implement "skeleton", I had to implement NFD support, so the
> patch also contains some UTF-32 support code.
> 
> Known issues:
> - I'm doing an extra hash_table lookup on every identifier lookup.
>    I haven't yet measured the impact on the speed of the compiler.
>    If this is an issue, is there a good place to stash an extra
>    pointer in every identifier?
> - doesn't yet bootstrap, as the confusables.txt data contains ASCII
>    to ASCII confusables, leading to warnings such as:
> ../../.././gcc/options.h:11273:3: warning: identifier ‘OPT_l’... [CWE-1007] [-Whomoglyph]
> ../../.././gcc/options.h:9959:3: note: ...confusable with non-equal identifier ‘OPT_I’ (‘OPT_I’) here
>    Perhaps the option should have levels, where we don't complain about
>    pure ASCII confusables at the default level?
> - no docs yet
> - as noted above the location_t of the token isn't quite right
> - map_identifier_to_skeleton and map_skeleton_to_first_use aren't
>    yet integrated with the garbage collector
> - some other FIXMEs in the patch
> 
> [I had to trim the patch for space to get it to pass the size filter on the
> mailing list; I trimmed:
>    contrib/unicode/confusables.txt,
>    gcc/testsuite/selftests/NormalizationTest.txt
> which can be downloaded from the URLs in the ChangeLog, and:
>    gcc/confusables.inc
>    gcc/decomposition.inc
> which can be generated using the scripts in the patch ]
> 
> Thoughts?

None from me on the actual feature -- even after our discussion
this morning I remain comfortably ignorant of the problem :)
I just have a quick comment on the two new string classes:

...
> +
> +/* A class for manipulating UTF-32 strings.  */
> +
> +class utf32_string
> +{
...
> + private:
...
> +  cppchar_t *m_buf;
> +  size_t m_alloc_len;
> +  size_t m_len;
> +};
> +
> +/* A class for constructing UTF-8 encoded strings.
> +   These are not NUL-terminated.  */
> +
> +class utf8_string
> +{
...
> + private:
> +  uchar  *m_buf;
> +  size_t m_alloc_sz;
> +  size_t m_len;
> +};

There are container abstractions both in C++ and in GCC that
these classes look like they could be implemented in terms of:
I'm thinking of std::string, std::vector, vec, and auto_vec.
They have the additional advantage of being safely copyable
and assignable, and of course, of having already been tested.
I see that the classes in your patch provide additional
functionality that the abstractions don't.  I'd expect it
be doable on top of the abstractions and without
reimplementing all the basic buffer management.

Martin


More information about the Gcc-patches mailing list