Bug 103027 - Implement warning for homoglyphs in identifiers
Summary: Implement warning for homoglyphs in identifiers
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: preprocessor
Version: 12.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL: https://gcc.gnu.org/pipermail/gcc-pat...
Keywords: diagnostic, patch
Depends on:
Blocks: new-warning, new_warning
Reported: 2021-11-01 15:05 UTC by David Malcolm
Modified: 2022-04-13 13:18 UTC (History)
CC List: 5 users

See Also:
Known to work:
Known to fail:
Last reconfirmed: 2021-11-01 00:00:00


Description David Malcolm 2021-11-01 15:05:16 UTC
An issue was discovered in the character definitions of the Unicode Specification through 14.0. The specification allows an adversary to produce source code identifiers such as function names using homoglyphs that render visually identical to a target identifier. Adversaries can leverage this to inject code via adversarial identifier definitions in upstream software dependencies invoked deceptively in downstream software.

We ought to have a diagnostic that warns about such problematic identifiers.
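
To make the problem concrete, here is a minimal Python sketch of two identifier spellings that render alike but differ in code points. The identifier "scope" and the helper names `describe` and `differing_positions` are illustrative assumptions, not taken from any patch; the Cyrillic letters are written as escapes so the difference stays visible.

```python
import unicodedata

# Two spellings that render alike in many fonts. The second one replaces
# Latin 'c' and 'o' with Cyrillic U+0441 and U+043E (hypothetical example).
latin = "scope"
spoofed = "s\u0441\u043epe"


def describe(identifier):
    """Return (code point, Unicode name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in identifier]


def differing_positions(a, b):
    """Return the indices at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    print(latin == spoofed)                 # the two strings are distinct
    print(differing_positions(latin, spoofed))
    for entry in describe(spoofed):
        print(entry)
```

A compiler sees these as two unrelated identifiers, while a human reviewer may read them as the same name.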

More info:
Comment 1 David Malcolm 2021-11-01 15:17:01 UTC
I have a work-in-progress patch for this, though it has some issues that need discussion; I hope to post it soon.
Comment 2 David Malcolm 2021-11-01 21:15:01 UTC
Initial version of patch posted for discussion to:
Comment 3 David Malcolm 2021-11-02 14:12:46 UTC
For reference, here's a patch to clang-tidy for this (currently under review):
Comment 4 Reini Urban 2022-02-20 15:33:20 UTC
Just checking confusables.txt and ignoring the official TR39 Unicode security guidelines for identifiers won't get you very far. It's merely fighting a tiny symptom of a huge attack space.

I suggest properly implementing TR39, as I did in libu8ident and proposed to the C and C++ working groups. Latest here: https://github.com/rurban/libu8ident/blob/master/doc/P2528R1.md

confusables.txt itself is almost useless. I used it only to keep some Greek letters from being confused with their Latin counterparts. Checking mixed scripts is much more secure.

Note that the TR31 XID lists are still pretty insecure, even if C23 will restrict identifiers to the official TR31 XID lists.
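
The mixed-script idea mentioned above can be sketched roughly as follows. This is not how libu8ident or any GCC patch does it: Python's standard library does not expose the Unicode Script property, so the hypothetical helper `approx_script` guesses the script from the first word of the character name. That is an assumption made only for illustration; a real TR39 implementation would use the Script and Script_Extensions data and allow certain script combinations.

```python
import unicodedata


def approx_script(ch):
    """Rough script guess from the first word of the Unicode name.

    ASSUMPTION: a stand-in for the real Script/Script_Extensions
    property, which the Python standard library does not provide.
    """
    if not ch.isalpha():
        return "COMMON"  # digits, underscore, combining marks, etc.
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_in(identifier):
    """Return the set of (approximate) scripts used by the letters."""
    return {approx_script(ch) for ch in identifier} - {"COMMON"}


def is_mixed_script(identifier):
    """True if the identifier's letters come from more than one script."""
    return len(scripts_in(identifier)) > 1


if __name__ == "__main__":
    for name in ["scope", "s\u0441\u043epe", "x_1", "\u0441\u043e"]:
        print(ascii(name), sorted(scripts_in(name)), is_mixed_script(name))
```

Note that an identifier written entirely in one non-Latin script, such as the last sample, passes this check even when it looks like a Latin word, so a mixed-script test alone does not catch whole-script confusables.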
Comment 5 Eric Gallager 2022-04-13 13:18:24 UTC
Example bug that this warning flag could have found, if the string involved were a C string: https://twitter.com/nyt_first_said/status/1513148451210637313