[PATCH] Initial implementation of -Whomoglyph [PR preprocessor/103027]

Jakub Jelinek jakub@redhat.com
Tue Nov 2 12:06:05 GMT 2021

On Tue, Nov 02, 2021 at 12:56:53PM +0100, Jakub Jelinek wrote:
> Consider attached testcases Whomoglyph1.C and Whomoglyph2.C.
> For the Whomoglyph1.C testcase I'd expect a warning, because the reader is
> clearly misled in a way that isn't visible in emacs, vim, joe or on the
> terminal: when f3 uses the identifier scope, the casual reader will expect
> it to refer to N1::N2::scope, but there is no such variable, only
> N1::N2::ѕсоре, which looks the same but consists of different UTF-8
> characters.  So name lookup will instead find N1::scope and use that.
> But Whomoglyph2.C will emit warnings that are IMHO not appropriate;
> I believe there is no confusion at all there.  E.g. in the f5/f6 case, for
> both C and C++, it doesn't matter what each function names its own
> parameter, since one can never access another function's parameter.
> Ditto for different namespaces, provided both namespaces aren't searched
> in the same name lookup, and similarly for classes etc.
> So, IMNSHO that warning belongs in name lookup (cp/name-lookup.c for the
> C++ FE).
> Another important point is that most users don't use Unicode in
> identifiers at all; I bet over 99.9% of identifiers don't contain any
> characters >= 0x80, and even when people do use them, confusable
> identifiers within the same lookup are far more unlikely still.
> So I think we should optimize for the common case, ASCII-only identifiers,
> and spend as little compile time as possible on this stuff.
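
To make the first case concrete, here is a minimal sketch of the kind of
situation described for Whomoglyph1.C.  It is not the attached testcase: the
layout, the values and the check_lookup helper are assumptions made for
illustration, and it needs a compiler that accepts UTF-8 in identifiers.

```cpp
#include <cassert>

// Hypothetical reconstruction, not the actual Whomoglyph1.C attachment.
namespace N1
{
  int scope = 1;                // ASCII letters: s c o p e

  namespace N2
  {
    // Cyrillic look-alikes: U+0455 U+0441 U+043E U+0440 U+0435.
    int ѕсоре = 2;

    // A casual reader expects this to return the N2 variable, but the
    // ASCII name is not declared in N2, so lookup falls back to N1::scope.
    int f3 () { return scope; }
  }
}

// Hypothetical helper showing which variable f3 really reads.
void check_lookup ()
{
  assert (N1::N2::f3 () == N1::scope);
  assert (N1::N2::f3 () != N1::N2::ѕсоре);
}
```

Both spellings are reachable from the same unqualified lookup in f3, which
is why a warning there would be useful.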

If we keep doing it in the stringpool, then e.g. one couldn't
#include <zlib.h>
without triggering the warning in a program with Russian/Ukrainian/Serbian
etc. identifiers if some parameter or automatic variable in some function in
that file is called с (Cyrillic small letter es), just because one of the
parameters in one of the zlib.h function prototypes is called c (Latin small
letter c).
I'm afraid most users who actually want to use UTF-8 or UCNs in their
identifiers would then just have to disable this warning...
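
A minimal sketch of that situation, assuming a made-up prototype in place of
a real zlib.h declaration (lib_put and user_fn are hypothetical names, not
zlib API):

```cpp
#include <cassert>

// Stand-in for a prototype from a third-party header; the parameter is
// Latin small letter c (U+0063).
int lib_put (int c);

int lib_put (int c) { return c + 1; }

// User code whose automatic variable is Cyrillic small letter es (U+0441).
// It can never be looked up in the same scope as lib_put's parameter, so
// there is nothing for the reader to confuse.
int user_fn ()
{
  int с = 41;
  return lib_put (с);
}
```

A check keyed on the stringpool sees both spellings in one translation unit
and would warn here, while a check done during name lookup would not,
because the two names are never candidates in the same lookup.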

