An issue was discovered in the character definitions of the Unicode Specification through 14.0. The specification allows an adversary to produce source code identifiers such as function names using homoglyphs that render visually identical to a target identifier. Adversaries can leverage this to inject code via adversarial identifier definitions in upstream software dependencies invoked deceptively in downstream software.
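To make the problem concrete, here is a minimal sketch (not taken from any of the patches discussed here) of two identifiers that look alike in many fonts but are distinct strings. The names `latin_name`, `spoofed_name`, and `describe` are hypothetical and chosen only for illustration; the non-Latin characters are written as escapes so the difference is visible in the source.

```python
import unicodedata

# Two spellings that render alike in many fonts but are different strings.
latin_name = "scope"              # all Latin letters
spoofed_name = "s\u0441\u043epe"  # U+0441 and U+043E are Cyrillic letters


def describe(identifier):
    """Return (code point, Unicode character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in identifier]


# Both strings are accepted as identifiers, yet they do not compare equal.
print(latin_name.isidentifier(), spoofed_name.isidentifier())
print(latin_name == spoofed_name)

# Compatibility normalization (NFKC) does not fold the Cyrillic letters
# into their Latin look-alikes, so normalization alone does not help here.
print(unicodedata.normalize("NFKC", spoofed_name) == latin_name)

# Listing the characters is what exposes the substitution.
for code_point, name in describe(spoofed_name):
    print(code_point, name)
```

A function defined under one spelling and called under the other would refer to two unrelated symbols, which is the deceptive-invocation scenario described above.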
We ought to have a diagnostic that warns about such problematic identifiers.
I have a work-in-progress patch for this, though it has some issues that need discussion; I hope to post it soon.
Initial version of patch posted for discussion to:
For reference, here's a patch to clang-tidy for this (currently under review):
Just checking confusables.txt and ignoring the official TR39 Unicode security guidelines for identifiers won't get you very far. It's merely fighting a tiny symptom of a huge attack space.
I suggest properly implementing TR39, as I did in libu8ident and proposed to the C++/C working groups. Latest here: https://github.com/rurban/libu8ident/blob/master/doc/P2528R1.md
confusables.txt itself is almost useless. I used it only to restrict some Greek letters so they cannot be confused with their Latin counterparts. Checking mixed scripts is much more secure.
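As a rough illustration of what a mixed-script check looks like, here is a minimal sketch. It is not the libu8ident implementation and not the TR39 algorithm: the Python standard library does not expose the Unicode Script property, so this sketch assumes the first word of a character's Unicode name is a usable stand-in for its script. The function names `approximate_script`, `scripts_in`, and `is_mixed_script` are hypothetical.

```python
import unicodedata


def approximate_script(ch):
    """Guess a script label for ch, or None for script-neutral characters.

    ASSUMPTION: the first word of the Unicode character name stands in for
    the Script property. A real check would read Scripts.txt and
    ScriptExtensions.txt as TR39 describes.
    """
    # Digits, underscores, and combining marks are not letters; treat them
    # as neutral so they never make an identifier look mixed.
    if not unicodedata.category(ch).startswith("L"):
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else "UNKNOWN"


def scripts_in(identifier):
    """Return the set of approximate script labels used by the letters."""
    labels = (approximate_script(ch) for ch in identifier)
    return {label for label in labels if label is not None}


def is_mixed_script(identifier):
    """True if the letters of the identifier come from more than one script."""
    return len(scripts_in(identifier)) > 1


print(is_mixed_script("scope"))              # all Latin
print(is_mixed_script("s\u0441\u043epe"))    # Latin mixed with Cyrillic
print(scripts_in("\u03b1_1"))                # Greek letter, underscore, digit
```

Two limits of this sketch are worth stating. It flags legitimate combinations such as Japanese text that mixes Hiragana, Katakana, and Han, which TR39 handles with script sets rather than single labels. It also accepts an identifier written entirely in one non-Latin script, so a name spelled only with Cyrillic look-alikes passes.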
Note that the TR31 XID lists are still pretty insecure, even if C23 restricts identifiers to the official TR31 XID lists.
Example bug that this warning flag could have found, if the string involved were a C string: https://twitter.com/nyt_first_said/status/1513148451210637313