[PATCH] Initial implementation of -Whomoglyph [PR preprocessor/103027]

Tue Nov 2 11:56:52 GMT 2021

On Mon, Nov 01, 2021 at 05:14:12PM -0400, David Malcolm via Gcc-patches wrote:
> Thoughts?

Resending my previously internally posted mail:

Thanks for working for this, but I'm afraid this is done at a wrong
location.

Consider attached testcases Whomoglyph1.C and Whomoglyph2.C.
On Whomoglyph1.C testcase, I'd expect a warning, because there is a clear
confusion for the reader, something that isn't visible in any of emacs, vim,
joe editors or on the terminal, when f3 uses scope identifier, the casual
reader will expect that it uses N1::N2::scope, but there is no such
variable, only one N1::N2::ѕсоре that visually looks the same, but has
different UTF-8 chars in it.  So, name lookup will instead find N1::scope
and use that.
But Whomoglyph2.C will emit warnings that are IMHO not appropriate,
I believe there is no confusion at all there, e.g. for both C and C++,
the f5/f6 case, it doesn't really matter how each of the function names its
own parameter, one can never access another function's parameter.
Ditto for different namespace provided that both namespaces aren't searched
in the same name lookup, or similarly classes etc.
So, IMNSHO that warning belongs to name-lookup (cp/name-lookup.c for the C++
FE).
And, another important thing is that most users don't really use unicode in
identifiers, I bet over 99.9% of identifiers don't have any >= 0x80
characters in it and even when people do use them, confusable identifiers
during the same lookup are even far more unlikely.
So, I think we should optimize for the common case, ASCII only identifiers
and spend as little compile time as possible on this stuff.

Another thing we've discussed on IRC is that the Unicode confusables.txt has
some canonicalizations that I think most users will not appreciate, or at
least not appreciate if it is warned by default.  In particular, the
canonicalizations from ASCII to ASCII:
0 -> O
1 -> l
I -> l
m -> rn
I think most users use monospace fonts when working on C/C++ etc. source
code, anything else ruins indentation (ok, except for projects that indent
everything by tabs only).  In monospace fonts, I think m can't be confused
with rn.  And in most monospace fonts, 0 and O and 1/l/I are also fairly
well distinguishable.  So, I think we should warn about f00 vs. fOO or
home vs. horne or f1 vs. fl vs. fI only with -Whomoglyph=2, a non-default
extra level, rather than by default.
But, given the way confusables.txt is generated, while it is probably
easy to undo the m -> rn canonicalizations (assume everything that maps
to rn actually maps to m), for the 0 vs. O or 1 vs. l vs. I case I'm afraid
trying to differentiate to what it actually maps will be harder.
So, my proposal would be by default not to warn for the 0 vs. O, 1 vs. l vs. I
and m vs. rn differences if neither of the identifiers contains any UTF-8
chars and do warn otherwise.

For spending as short compile time as possible on this, I think libcpp
knows well if a UTF-8 or UCN character has been encountered in an identifier
(it uses the libcpp/charset.c entrypoints then), so perhaps we could mark
the tokens with a new flag that stands for "identifier contains at least one
UTF-8 (non-ASCII) or UCN character.  Or if we don't, perhaps just quickly
scanning identifier for characters with MSB set might be tollerable too.
So, that for how to identify identifiers for which we should compute the
canonical homoglyph form at all.
How to store it, again as we should IMHO optimize for the usual case where even
identifiers with UTF-8/UCN characters canonicalize to itself, I think best
representation for that would be not to waste whole pointer on each
IDENTIFIER_NODE for it, but instead reserve for it one bit on the
identifiers, "identifier has a canonical homoglyph version different from
itself", and that bit would mean we have some hash map etc. on the side
that maps it to the canonical identifier.
As for the actual uses of confusables.txt transformations, do you think
we need to work on UTF-32?  Can't it all be done on UTF-8 instead?  Let
whatever confusable.txt -> something.h generator we write prepare a decision
function that can use UTF-8 in both what to replace and certainly just a
UTF-8 string literal in what to replace it with.
Note, the above would mean we don't compute them for those 0 -> O, [1I] -> l
or m -> rn canonicalizations for the default -Whomoglyph mode.
Perhaps we can use some bit/flag on the C FE scopes or C++ namespaces
and on the identifiers we map to through the hash map.  On the scopes etc.
it would mean this scope has some identifiers in it that have the homoglyphs
alternatives and the homoglyph canonical forms have been already added to
wherever the name lookup can find them.  And on the canonical forms perhaps
stand for this canonical form has any O, l or rn sequences in it.
And then during actual name lookup, if current identifier doesn't the
has alt homoglyph canonicalization flag set and the scope doesn't have
the new bit set either, there would be nothing further to do, for
name-lookup where the identifier we search for has the new bit set
and the scope has it too we'd do two lookups, one normal and one for the
alt homoglyph canonical form, and if the scope doesn't have the new bit
set, we'd perform whatever is needed to figure that out.

	Jakub
-------------- next part --------------
// { dg-do compile }
// { dg-options "-Whomoglyph" }

namespace N1
{
  int scope; // ASCII
  int f1 () { return scope++; } // ASCII
  namespace N2
  {
    int ѕсоре; // Cyrillic
    int f2 () { return ѕсоре++; } // Cyrillic
    int f3 () { return scope++; } // ASCII	{ dg-warning "Whomoglyph" }
  }
}
-------------- next part --------------
// { dg-do compile }
// { dg-options "-Whomoglyph" }

namespace N1
{
  int scope; // ASCII
  int f1 () { return scope++; } // ASCII
}
namespace N2
{
  int ѕсоре; // Cyrillic	{ dg-bogus "Whomoglyph" }
  int f2 () { return ѕсоре++; } // Cyrillic
}
struct S1
{
  constexpr static int Hello = 1; // ASCII
  int f3 () { return Hello; } // ASCII
};
struct S2
{
  constexpr static int Ηello = 2; // Greek+ASCII	{ dg-bogus "Whomoglyph" }
  int f4 () { return Ηello; } // Greek+ASCII
};
int f5 (int s) // ASCII
{
  return s; // ASCII
}
int f6 (int ѕ) // Cyrillic	{ dg-bogus "Whomoglyph" }
{
  return ѕ; // Cyrillic
}