Created attachment 36187 [details]
One line patch to add C99 UTF-8 support in identifiers to gcc
In response to FAQ
> What is the status of adding the UTF-8 support for identifier names in GCC?
and the request
> Support for actual UTF-8 in identifiers is still pending (please contribute!)
my observation is that UTF-8 in identifiers is easy to add to gcc by changing one line in the cpp preprocessor, provided a recent version of iconv is installed on the system. The patch is attached and has been tested for about 6 months. More information about this patch as well as unrelated information about getting cilkrts to work on ARM is available at
To check whether the installed version of iconv has C99 support, type
$ iconv --list | grep "C99"
If the output lists C99, iconv is recent enough.
Related to bug 41374.
Have you tried -fextended-identifiers ?
I cannot say anything about the correctness of the patch, but I would expect such a patch to contain many testcases (at least similar to those that test for UCNs, see https://gcc.gnu.org/ml/gcc-patches/2014-11/msg00337.html). Patches also need to be bootstrapped and regression tested, and submitted to gcc-patches with a ChangeLog (https://gcc.gnu.org/wiki/GettingStarted#Basics:_Contributing_to_GCC_in_10_easy_steps). Please CC Joseph Myers when you submit.
There is no "C99" character set in glibc libiconv (after all, it's not a
character set at all). Converting extended characters to UCNs like that
would in any case be correct for C++ (provided you also convert $ ` @ and
control characters other than those in the basic source character set) but
not for C - but for C++, it would be necessary to keep track of the
conversions to revert them in raw string literals. This requirement to
revert such conversions in raw string literals (in C++14, see 2.5
[lex.pptoken] paragraph 3: "Between the initial and final double quote
characters of the raw string, any transformations performed in phases 1
and 2 (trigraphs, universal-character-names, and line splicing) are
reverted; this reversion shall apply before any d-char, r-char, or
delimiting parenthesis is identified.") renders such an approach
non-viable (it would break things that currently work); the conversions to
UCNs have to take place within cpplib, not through an external iconv.
Note that cpplib identifier spelling preservation is now implemented
<https://gcc.gnu.org/ml/gcc-patches/2014-11/msg00548.html>, which adds
other ways in which it should be visible whether an identifier was
represented with UTF-8 or UCNs.
From the webpage (current as of Aug 17, 2015)
under *Details* it is described that the library provides support for the following encodings:
UCS-2, UCS-2BE, UCS-2LE
UCS-4, UCS-4BE, UCS-4LE
UTF-16, UTF-16BE, UTF-16LE
UTF-32, UTF-32BE, UTF-32LE
Therefore, I don't understand the statement that libiconv doesn't support C99 or that it isn't, somehow, a character set.
Please look at the Raspberry Pi forum post linked in the original report for more information about testing this patch. As the text describes there, the command line options
-finput-charset=UTF-8 -fextended-identifiers
are both needed in order to compile a UTF-8 input file containing Unicode identifiers. I have included a small test program as another attachment. Searching for "UTF-8 identifiers in GCC" turns up a number of people asking for this feature, along with additional example code that uses UTF-8 identifiers. The document "Unicode for the PCC C99 Compiler" available at
also contains example UTF-8 C99 input files which can be used to test the compiler. The one-line patch submitted above has also been tested in the sense that the compiler still bootstraps and has no trouble compiling thousands of lines of standard ASCII C input.
The patch inserts "C99" in only one place as the uses of SOURCE_CHARSET are conflicted and changing them all to "C99" doesn't yield a working solution. In particular, the "C99" in _cpp_convert_input should not be considered the source character set appearing in the input files but rather an internal character set suitable for later parsing. As iconv is already a well debugged library, it would appear the risks of this patch are minor.
Note, however, the following problem: "C99" is probably not the correct choice for EBCDIC hosts. In that case it might be possible to write UCNs using trigraphs of the form ??/uXXXX and ??/UXXXXXXXX. However, as the number of people wanting to compile C source files with identifiers encoded using UTF-EBCDIC is likely zero, the easiest solution going forward is to modify the patch so it only applies to non-EBCDIC hosts. As there are already #ifdef's in the code to check for this, this does not add any new complexity to the code base.
Created attachment 36196 [details]
Test program with UTF-8 identifiers...
Compile this test program using
-finput-charset=UTF-8 -fextended-identifiers \
-o circle circle.c
to check whether gcc can handle UTF-8 identifiers.
(In reply to Eric from comment #7)
> also contains example UTF-8 C99 input files which can be used to test the
> compiler. The one-line patch submitted above has also been tested in the
> sense that the compiler still bootstraps and has no trouble compiling
> thousands of lines of standard ASCII C input.
I think what Joseph is saying is that your approach may work for the small examples that you have tested, but it would break things that are working fine right now (in particular raw string literals). Many of those things are not tested by a gcc bootstrap, but some of them should be tested by the regression testsuite. Did you run that? See point 4 here: https://gcc.gnu.org/wiki/GettingStarted#Basics:_Contributing_to_GCC_in_10_easy_steps
I hope Joseph can give you more details so you may try to implement this in the proper way.
The only reason why GCC does not have UTF-8 support in identifiers is that no one has had time to implement it yet, so your help is appreciated.
(In reply to Eric from comment #7)
> command line options
> -finput-charset=UTF-8 -fextended-identifiers
> are both needed in order to compile a UTF-8 input file containing unicode
Note also that since GCC 5.1 the option -fextended-identifiers is enabled by default for C++, and for C99 and later C versions (https://gcc.gnu.org/gcc-5/changes.html). The default C version is now C11, so the option is enabled by default.
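As a small illustration of what -fextended-identifiers accepts without any patch, here is a minimal sketch that spells an identifier with a UCN. It is not the attached circle.c; the function name circle_area is made up for this example, and it assumes a compiler where the option is on by default (GCC 5.1 or later, per the note above).

```cpp
// Sketch only: the identifier below is U+03C0 (Greek small letter pi),
// written as a universal character name so the source stays pure ASCII.
inline double circle_area(double r) {
    const double \u03C0 = 3.14159265358979;
    return \u03C0 * r * r;
}
```

Writing the same identifier as raw UTF-8 bytes instead of \u03C0 is the case this bug report is about.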
Sorry, glibc iconv (not libiconv) doesn't handle "C99". So your patch
would not work on any GNU host in normal configurations of GCC. libiconv
is a completely separate package and is only likely to be used on non-GNU
hosts such as Windows; on GNU hosts iconv from glibc is normally used,
although it's possible to use libiconv there.
You need to test cases such as that if a macro is defined twice, once with
a UCN in its expansion and once with the equivalent character written in
UTF-8, the difference in the expansion is diagnosed (whichever of all the
valid UCNs for that character is the one used). And that the original
spelling appears on the right hand side of a definition output with -dD.
And that if (in C but not, properly, C++) a string contains a backslash
followed by an extended character, this is properly diagnosed as an
invalid escape sequence rather than being treated as \u<something> or
\U<something>. See the tests in my spelling preservation patch. (Stringizing
isn't necessarily an issue here because of the special C rules about
stringizing UCNs together with the C++ rule about converting to UCNs in
phase 1 - the effect is that for C it's always OK to stringize as the
extended character, though you can't stringize as a UCN if the extended
character was originally written, while for C++ you have to stringize as a
UCN.) And then you need tests of C++ programs with extended characters
inside raw strings (like c-c++-common/raw-string-*.c, but none of those
cover extended characters at present). And the patch needs to add all
these tests to the testsuite.
I'm glad to know people like Joseph are working on UTF-8 in gcc. Last year I spent a week adding UTF-8 input support to pcc. At that time Microsoft Studio and clang already supported UTF-8 input files and I expected that gcc would do so in the next release. As this didn't happen, a few months ago I looked and developed a one-line patch to add this support to gcc.
It appears the C preprocessor falls back to libiconv when it encounters a conversion not supported internally. From what I can tell this is enabled by default, though it is surely possible to disable it.
I'm aware that C strings are often used to store 8-bit data, for example, to display various graphics characters from legacy code pages. I will run the regression tests as soon as possible to see what, if anything, has been broken by my one-line patch. UCN quoting of UTF-8 input should happen only if the -finput-charset=UTF-8 flag is set, and this is worth checking.
(In reply to Eric from comment #12)
> I'm glad to know people like Joseph are working on UTF-8 in gcc.
I think at the moment, neither Joseph nor anyone else is planning to work on this. There doesn't seem to be sufficient demand for this feature for companies to fund it or volunteers to step up and implement it (you are the first one I am aware of to make an attempt).
> I spent a week adding UTF-8 input support to pcc. At that time Microsoft
> Studio and clang already supported UTF-8 input files and I expected that gcc
> would do so in the next release.
Unfortunately, GCC has very few developers compared to Microsoft or Clang. Many things in GCC will never get done if new people do not contribute to its development. This is why if you want to see this feature, you are the best and perhaps the only person to make it happen.
The problem is that this cannot be fixed by a one-line patch, otherwise it would have been fixed a long time ago.
* GCC cannot rely on libiconv being always present. It has to work with glibc's iconv, which is what is used in GNU/Linux.
* Even if glibc's iconv supported a C99 conversion, this would break other things.
* You need to add tests explicitly for various things (see Joseph's comments). The tests will be added to the GCC testsuite to prove that your patch works as it should and to make sure future changes do not break the tests.
* At a minimum, look at all the gcc.dg/cpp/ucnid-*.c and g++.dg/cpp/ucnid-*.c tests and see what happens if you replace the \uNNNN with actual extended characters.
* Joseph thinks that the best approach is to do the conversion from UTF-8 to UCNs "manually" within cpplib, such that you can handle all the corner cases of C/C++ (quoted strings, \µ, macro names,...)
While there may not be current demand for gcc to accept UTF-8 identifiers, the fact that clang and Visual Studio support this C99 feature means source code using Greek and accented characters in variable names is likely to become more prevalent over time.
I have done a little testing to check whether, by default, string literals can contain arbitrary 8-bit data. This is used, for example, in legacy code which directly includes graphics characters from CP437. The original preprocessor specifies "UTF-8" as the default input character set and "UTF-8" as the internal character set. Then, if the input and internal character sets are identical, no translation is done and arbitrary 8-bit data is passed through cleanly. A slight modification to my patch needs to be made to retain the same behavior. In particular, the patch now specifies both the internal and default input character sets to be "C99" so no translation is done by default. The improved patch also includes consideration of EBCDIC hosts.
As iconv was installed on every GNU/Linux system I've tried, I'm not sure what is wrong with using the C99 mode present in newer releases. This achieves exactly the suggested result of converting all UTF-8 input to UCNs in the preprocessor while directly allowing other potentially useful conversions. Perhaps the configure script should be modified to check for a compatible version of iconv and, if one is not found, resort to a manual conversion.
Testing is still underway. After the standard regression tests are finished I will create new tests utf8id-.* which will be versions of the ucnid-.* tests for native UTF-8 files. I will also include a new test for arbitrary 8-bit string literals, to verify further compatibility.
Created attachment 36206 [details]
Improved UTF-8 identifier patch
Improved patch to support UTF-8 identifiers. This version by default does no translation unless -finput-charset=XXX is specified where XXX is something other than C99 and should not affect EBCDIC hosts.
With my second patch the command line must now include the options
-finput-charset=UTF-8 -fextended-identifiers -fexec-charset=UTF-8
or otherwise C99 will also be used for the default execution character set. A better approach to maintain nearly 8-bit clean string literals by default might result from leaving the default input and execution character sets as UTF-8 and setting the internal character set to C99 only when -fextended-identifiers is selected. Sorry for too many comments. I'll post a new patch when everything is ready and has been tested.
On Tue, 18 Aug 2015, ejolson at unr dot edu wrote:
> As iconv was installed on every GNU/Linux system I've tried, I'm not sure what
> is wrong with using the C99 mode present in newer releases. This achieves
The iconv that is installed is glibc iconv. It has *nothing to do with*
libiconv, a completely independent package. iconv --version will report a
glibc version and iconv --list will produce a list not mentioning C99,
$ iconv --version
iconv (Ubuntu EGLIBC 2.19-0ubuntu6.6) 2.19
$ iconv --list
The following list contains all the coded character sets known. This does
not necessarily mean that all combinations of these names can be used for
the FROM and TO command line parameters. One coded character set can be
listed with several different names (aliases).
437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865,
866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4,
8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4,
ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110,
ARABIC, ARABIC7, ARMSCII-8, ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5,
BIG-FIVE, BIG5-HKSCS, BIG5, BIG5HKSCS, BIGFIVE, BRF, BS_4730, CA, CN-BIG5,
CN-GB, CN, CP-AR, CP-GR, CP-HU, CP037, CP038, CP273, CP274, CP275, CP278,
CP280, CP281, CP282, CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424,
CP437, CP500, CP737, CP770, CP771, CP772, CP773, CP774, CP775, CP803, CP813,
CP819, CP850, CP851, CP852, CP855, CP856, CP857, CP860, CP861, CP862, CP863,
CP864, CP865, CP866, CP866NAV, CP868, CP869, CP870, CP871, CP874, CP875,
CP880, CP891, CP901, CP902, CP903, CP904, CP905, CP912, CP915, CP916, CP918,
CP920, CP921, CP922, CP930, CP932, CP933, CP935, CP936, CP937, CP939, CP949,
CP950, CP1004, CP1008, CP1025, CP1026, CP1046, CP1047, CP1070, CP1079,
CP1081, CP1084, CP1089, CP1097, CP1112, CP1122, CP1123, CP1124, CP1125,
CP1129, CP1130, CP1132, CP1133, CP1137, CP1140, CP1141, CP1142, CP1143,
CP1144, CP1145, CP1146, CP1147, CP1148, CP1149, CP1153, CP1154, CP1155,
CP1156, CP1157, CP1158, CP1160, CP1161, CP1162, CP1163, CP1164, CP1166,
CP1167, CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257,
CP1258, CP1282, CP1361, CP1364, CP1371, CP1388, CP1390, CP1399, CP4517,
CP4899, CP4909, CP4971, CP5347, CP9030, CP9066, CP9448, CP10007, CP12712,
CP16804, CPIBM861, CSA7-1, CSA7-2, CSASCII, CSA_T500-1983, CSA_T500,
CSA_Z243.4-1985-1, CSA_Z243.4-1985-2, CSA_Z243.419851, CSA_Z243.419852,
CSDECMCS, CSEBCDICATDE, CSEBCDICATDEA, CSEBCDICCAFR, CSEBCDICDKNO,
CSEBCDICDKNOA, CSEBCDICES, CSEBCDICESA, CSEBCDICESS, CSEBCDICFISE,
CSEBCDICFISEA, CSEBCDICFR, CSEBCDICIT, CSEBCDICPT, CSEBCDICUK, CSEBCDICUS,
CSEUCKR, CSEUCPKDFMTJAPANESE, CSGB2312, CSHPROMAN8, CSIBM037, CSIBM038,
CSIBM273, CSIBM274, CSIBM275, CSIBM277, CSIBM278, CSIBM280, CSIBM281,
CSIBM284, CSIBM285, CSIBM290, CSIBM297, CSIBM420, CSIBM423, CSIBM424,
CSIBM500, CSIBM803, CSIBM851, CSIBM855, CSIBM856, CSIBM857, CSIBM860,
CSIBM863, CSIBM864, CSIBM865, CSIBM866, CSIBM868, CSIBM869, CSIBM870,
CSIBM871, CSIBM880, CSIBM891, CSIBM901, CSIBM902, CSIBM903, CSIBM904,
CSIBM905, CSIBM918, CSIBM921, CSIBM922, CSIBM930, CSIBM932, CSIBM933,
CSIBM935, CSIBM937, CSIBM939, CSIBM943, CSIBM1008, CSIBM1025, CSIBM1026,
CSIBM1097, CSIBM1112, CSIBM1122, CSIBM1123, CSIBM1124, CSIBM1129, CSIBM1130,
CSIBM1132, CSIBM1133, CSIBM1137, CSIBM1140, CSIBM1141, CSIBM1142, CSIBM1143,
CSIBM1144, CSIBM1145, CSIBM1146, CSIBM1147, CSIBM1148, CSIBM1149, CSIBM1153,
CSIBM1154, CSIBM1155, CSIBM1156, CSIBM1157, CSIBM1158, CSIBM1160, CSIBM1161,
CSIBM1163, CSIBM1164, CSIBM1166, CSIBM1167, CSIBM1364, CSIBM1371, CSIBM1388,
CSIBM1390, CSIBM1399, CSIBM4517, CSIBM4899, CSIBM4909, CSIBM4971, CSIBM5347,
CSIBM9030, CSIBM9066, CSIBM9448, CSIBM12712, CSIBM16804, CSIBM11621162,
CSISO4UNITEDKINGDOM, CSISO10SWEDISH, CSISO11SWEDISHFORNAMES,
CSISO14JISC6220RO, CSISO15ITALIAN, CSISO16PORTUGESE, CSISO17SPANISH,
CSISO18GREEK7OLD, CSISO19LATINGREEK, CSISO21GERMAN, CSISO25FRENCH,
CSISO27LATINGREEK1, CSISO49INIS, CSISO50INIS8, CSISO51INISCYRILLIC,
CSISO58GB1988, CSISO60DANISHNORWEGIAN, CSISO60NORWEGIAN1, CSISO61NORWEGIAN2,
CSISO69FRENCH, CSISO84PORTUGUESE2, CSISO85SPANISH2, CSISO86HUNGARIAN,
CSISO88GREEK7, CSISO89ASMO449, CSISO90, CSISO92JISC62991984B, CSISO99NAPLPS,
CSISO103T618BIT, CSISO111ECMACYRILLIC, CSISO121CANADIAN1, CSISO122CANADIAN2,
CSISO139CSN369103, CSISO141JUSIB1002, CSISO143IECP271, CSISO150,
CSISO150GREEKCCITT, CSISO151CUBA, CSISO153GOST1976874, CSISO646DANISH,
CSISO2022CN, CSISO2022JP, CSISO2022JP2, CSISO2022KR, CSISO2033,
CSISO5427CYRILLIC, CSISO5427CYRILLIC1981, CSISO5428GREEK, CSISO10367BOX,
CSISOLATIN1, CSISOLATIN2, CSISOLATIN3, CSISOLATIN4, CSISOLATIN5, CSISOLATIN6,
CSISOLATINARABIC, CSISOLATINCYRILLIC, CSISOLATINGREEK, CSISOLATINHEBREW,
CSKOI8R, CSKSC5636, CSMACINTOSH, CSNATSDANO, CSNATSSEFI, CSN_369103,
CSPC8CODEPAGE437, CSPC775BALTIC, CSPC850MULTILINGUAL, CSPC862LATINHEBREW,
CSPCP852, CSSHIFTJIS, CSUCS4, CSUNICODE, CSWINDOWS31J, CUBA, CWI-2, CWI,
CYRILLIC, DE, DEC-MCS, DEC, DECMCS, DIN_66003, DK, DS2089, DS_2089, E13B,
EBCDIC-AT-DE-A, EBCDIC-AT-DE, EBCDIC-BE, EBCDIC-BR, EBCDIC-CA-FR,
EBCDIC-CP-AR1, EBCDIC-CP-AR2, EBCDIC-CP-BE, EBCDIC-CP-CA, EBCDIC-CP-CH,
EBCDIC-CP-DK, EBCDIC-CP-ES, EBCDIC-CP-FI, EBCDIC-CP-FR, EBCDIC-CP-GB,
EBCDIC-CP-GR, EBCDIC-CP-HE, EBCDIC-CP-IS, EBCDIC-CP-IT, EBCDIC-CP-NL,
EBCDIC-CP-NO, EBCDIC-CP-ROECE, EBCDIC-CP-SE, EBCDIC-CP-TR, EBCDIC-CP-US,
EBCDIC-CP-WT, EBCDIC-CP-YU, EBCDIC-CYRILLIC, EBCDIC-DK-NO-A, EBCDIC-DK-NO,
EBCDIC-ES-A, EBCDIC-ES-S, EBCDIC-ES, EBCDIC-FI-SE-A, EBCDIC-FI-SE, EBCDIC-FR,
EBCDIC-GREEK, EBCDIC-INT, EBCDIC-INT1, EBCDIC-IS-FRISS, EBCDIC-IT,
EBCDIC-JP-E, EBCDIC-JP-KANA, EBCDIC-PT, EBCDIC-UK, EBCDIC-US, EBCDICATDE,
EBCDICATDEA, EBCDICCAFR, EBCDICDKNO, EBCDICDKNOA, EBCDICES, EBCDICESA,
EBCDICESS, EBCDICFISE, EBCDICFISEA, EBCDICFR, EBCDICISFRISS, EBCDICIT,
EBCDICPT, EBCDICUK, EBCDICUS, ECMA-114, ECMA-118, ECMA-128, ECMA-CYRILLIC,
ECMACYRILLIC, ELOT_928, ES, ES2, EUC-CN, EUC-JISX0213, EUC-JP-MS, EUC-JP,
EUC-KR, EUC-TW, EUCCN, EUCJP-MS, EUCJP-OPEN, EUCJP-WIN, EUCJP, EUCKR, EUCTW,
FI, FR, GB, GB2312, GB13000, GB18030, GBK, GB_1988-80, GB_198880,
GEORGIAN-ACADEMY, GEORGIAN-PS, GOST_19768-74, GOST_19768, GOST_1976874,
GREEK-CCITT, GREEK, GREEK7-OLD, GREEK7, GREEK7OLD, GREEK8, GREEKCCITT,
HEBREW, HP-GREEK8, HP-ROMAN8, HP-ROMAN9, HP-THAI8, HP-TURKISH8, HPGREEK8,
HPROMAN8, HPROMAN9, HPTHAI8, HPTURKISH8, HU, IBM-803, IBM-856, IBM-901,
IBM-902, IBM-921, IBM-922, IBM-930, IBM-932, IBM-933, IBM-935, IBM-937,
IBM-939, IBM-943, IBM-1008, IBM-1025, IBM-1046, IBM-1047, IBM-1097, IBM-1112,
IBM-1122, IBM-1123, IBM-1124, IBM-1129, IBM-1130, IBM-1132, IBM-1133,
IBM-1137, IBM-1140, IBM-1141, IBM-1142, IBM-1143, IBM-1144, IBM-1145,
IBM-1146, IBM-1147, IBM-1148, IBM-1149, IBM-1153, IBM-1154, IBM-1155,
IBM-1156, IBM-1157, IBM-1158, IBM-1160, IBM-1161, IBM-1162, IBM-1163,
IBM-1164, IBM-1166, IBM-1167, IBM-1364, IBM-1371, IBM-1388, IBM-1390,
IBM-1399, IBM-4517, IBM-4899, IBM-4909, IBM-4971, IBM-5347, IBM-9030,
IBM-9066, IBM-9448, IBM-12712, IBM-16804, IBM037, IBM038, IBM256, IBM273,
IBM274, IBM275, IBM277, IBM278, IBM280, IBM281, IBM284, IBM285, IBM290,
IBM297, IBM367, IBM420, IBM423, IBM424, IBM437, IBM500, IBM775, IBM803,
IBM813, IBM819, IBM848, IBM850, IBM851, IBM852, IBM855, IBM856, IBM857,
IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM866NAV, IBM868,
IBM869, IBM870, IBM871, IBM874, IBM875, IBM880, IBM891, IBM901, IBM902,
IBM903, IBM904, IBM905, IBM912, IBM915, IBM916, IBM918, IBM920, IBM921,
IBM922, IBM930, IBM932, IBM933, IBM935, IBM937, IBM939, IBM943, IBM1004,
IBM1008, IBM1025, IBM1026, IBM1046, IBM1047, IBM1089, IBM1097, IBM1112,
IBM1122, IBM1123, IBM1124, IBM1129, IBM1130, IBM1132, IBM1133, IBM1137,
IBM1140, IBM1141, IBM1142, IBM1143, IBM1144, IBM1145, IBM1146, IBM1147,
IBM1148, IBM1149, IBM1153, IBM1154, IBM1155, IBM1156, IBM1157, IBM1158,
IBM1160, IBM1161, IBM1162, IBM1163, IBM1164, IBM1166, IBM1167, IBM1364,
IBM1371, IBM1388, IBM1390, IBM1399, IBM4517, IBM4899, IBM4909, IBM4971,
IBM5347, IBM9030, IBM9066, IBM9448, IBM12712, IBM16804, IEC_P27-1, IEC_P271,
INIS-8, INIS-CYRILLIC, INIS, INIS8, INISCYRILLIC, ISIRI-3342, ISIRI3342,
ISO-2022-CN-EXT, ISO-2022-CN, ISO-2022-JP-2, ISO-2022-JP-3, ISO-2022-JP,
ISO-2022-KR, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5,
ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-9E, ISO-8859-10,
ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16, ISO-10646,
ISO-10646/UCS2, ISO-10646/UCS4, ISO-10646/UTF-8, ISO-10646/UTF8, ISO-CELTIC,
ISO-IR-4, ISO-IR-6, ISO-IR-8-1, ISO-IR-9-1, ISO-IR-10, ISO-IR-11, ISO-IR-14,
ISO-IR-15, ISO-IR-16, ISO-IR-17, ISO-IR-18, ISO-IR-19, ISO-IR-21, ISO-IR-25,
ISO-IR-27, ISO-IR-37, ISO-IR-49, ISO-IR-50, ISO-IR-51, ISO-IR-54, ISO-IR-55,
ISO-IR-57, ISO-IR-60, ISO-IR-61, ISO-IR-69, ISO-IR-84, ISO-IR-85, ISO-IR-86,
ISO-IR-88, ISO-IR-89, ISO-IR-90, ISO-IR-92, ISO-IR-98, ISO-IR-99, ISO-IR-100,
ISO-IR-101, ISO-IR-103, ISO-IR-109, ISO-IR-110, ISO-IR-111, ISO-IR-121,
ISO-IR-122, ISO-IR-126, ISO-IR-127, ISO-IR-138, ISO-IR-139, ISO-IR-141,
ISO-IR-143, ISO-IR-144, ISO-IR-148, ISO-IR-150, ISO-IR-151, ISO-IR-153,
ISO-IR-155, ISO-IR-156, ISO-IR-157, ISO-IR-166, ISO-IR-179, ISO-IR-193,
ISO-IR-197, ISO-IR-199, ISO-IR-203, ISO-IR-209, ISO-IR-226, ISO/TR_11548-1,
ISO646-CA, ISO646-CA2, ISO646-CN, ISO646-CU, ISO646-DE, ISO646-DK, ISO646-ES,
ISO646-ES2, ISO646-FI, ISO646-FR, ISO646-FR1, ISO646-GB, ISO646-HU,
ISO646-IT, ISO646-JP-OCR-B, ISO646-JP, ISO646-KR, ISO646-NO, ISO646-NO2,
ISO646-PT, ISO646-PT2, ISO646-SE, ISO646-SE2, ISO646-US, ISO646-YU,
ISO2022CN, ISO2022CNEXT, ISO2022JP, ISO2022JP2, ISO2022KR, ISO6937,
ISO8859-1, ISO8859-2, ISO8859-3, ISO8859-4, ISO8859-5, ISO8859-6, ISO8859-7,
ISO8859-8, ISO8859-9, ISO8859-9E, ISO8859-10, ISO8859-11, ISO8859-13,
ISO8859-14, ISO8859-15, ISO8859-16, ISO11548-1, ISO88591, ISO88592, ISO88593,
ISO88594, ISO88595, ISO88596, ISO88597, ISO88598, ISO88599, ISO88599E,
ISO885910, ISO885911, ISO885913, ISO885914, ISO885915, ISO885916,
ISO_646.IRV:1991, ISO_2033-1983, ISO_2033, ISO_5427-EXT, ISO_5427,
ISO_5427:1981, ISO_5427EXT, ISO_5428, ISO_5428:1980, ISO_6937-2,
ISO_6937-2:1983, ISO_6937, ISO_6937:1992, ISO_8859-1, ISO_8859-1:1987,
ISO_8859-2, ISO_8859-2:1987, ISO_8859-3, ISO_8859-3:1988, ISO_8859-4,
ISO_8859-4:1988, ISO_8859-5, ISO_8859-5:1988, ISO_8859-6, ISO_8859-6:1987,
ISO_8859-7, ISO_8859-7:1987, ISO_8859-7:2003, ISO_8859-8, ISO_8859-8:1988,
ISO_8859-9, ISO_8859-9:1989, ISO_8859-9E, ISO_8859-10, ISO_8859-10:1992,
ISO_8859-14, ISO_8859-14:1998, ISO_8859-15, ISO_8859-15:1998, ISO_8859-16,
ISO_8859-16:2001, ISO_9036, ISO_10367-BOX, ISO_10367BOX, ISO_11548-1,
ISO_69372, IT, JIS_C6220-1969-RO, JIS_C6229-1984-B, JIS_C62201969RO,
JIS_C62291984B, JOHAB, JP-OCR-B, JP, JS, JUS_I.B1.002, KOI-7, KOI-8, KOI8-R,
KOI8-RU, KOI8-T, KOI8-U, KOI8, KOI8R, KOI8U, KSC5636, L1, L2, L3, L4, L5, L6,
L7, L8, L10, LATIN-9, LATIN-GREEK-1, LATIN-GREEK, LATIN1, LATIN2, LATIN3,
LATIN4, LATIN5, LATIN6, LATIN7, LATIN8, LATIN9, LATIN10, LATINGREEK,
LATINGREEK1, MAC-CENTRALEUROPE, MAC-CYRILLIC, MAC-IS, MAC-SAMI, MAC-UK, MAC,
MACCYRILLIC, MACINTOSH, MACIS, MACUK, MACUKRAINIAN, MIK, MS-ANSI, MS-ARAB,
MS-CYRL, MS-EE, MS-GREEK, MS-HEBR, MS-MAC-CYRILLIC, MS-TURK, MS932, MS936,
MSCP949, MSCP1361, MSMACCYRILLIC, MSZ_7795.3, MS_KANJI, NAPLPS, NATS-DANO,
NATS-SEFI, NATSDANO, NATSSEFI, NC_NC0010, NC_NC00-10, NC_NC00-10:81,
NF_Z_62-010, NF_Z_62-010_(1973), NF_Z_62-010_1973, NF_Z_62010,
NF_Z_62010_1973, NO, NO2, NS_4551-1, NS_4551-2, NS_45511, NS_45512,
OS2LATIN1, OSF00010001, OSF00010002, OSF00010003, OSF00010004, OSF00010005,
OSF00010006, OSF00010007, OSF00010008, OSF00010009, OSF0001000A, OSF00010020,
OSF00010100, OSF00010101, OSF00010102, OSF00010104, OSF00010105, OSF00010106,
OSF00030010, OSF0004000A, OSF0005000A, OSF05010001, OSF100201A4, OSF100201A8,
OSF100201B5, OSF100201F4, OSF100203B5, OSF1002011C, OSF1002011D, OSF1002035D,
OSF1002035E, OSF1002035F, OSF1002036B, OSF1002037B, OSF10010001, OSF10010004,
OSF10010006, OSF10020025, OSF10020111, OSF10020115, OSF10020116, OSF10020118,
OSF10020122, OSF10020129, OSF10020352, OSF10020354, OSF10020357, OSF10020359,
OSF10020360, OSF10020364, OSF10020365, OSF10020366, OSF10020367, OSF10020370,
OSF10020387, OSF10020388, OSF10020396, OSF10020402, OSF10020417, PT, PT2,
PT154, R8, R9, RK1048, ROMAN8, ROMAN9, RUSCII, SE, SE2, SEN_850200_B,
SEN_850200_C, SHIFT-JIS, SHIFT_JIS, SHIFT_JISX0213, SJIS-OPEN, SJIS-WIN,
SJIS, SS636127, STRK1048-2002, ST_SEV_358-88, T.61-8BIT, T.61, T.618BIT,
TCVN-5712, TCVN, TCVN5712-1, TCVN5712-1:1993, THAI8, TIS-620, TIS620-0,
TIS620.2529-1, TIS620.2533-0, TIS620, TS-5881, TSCII, TURKISH8, UCS-2,
UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UCS2, UCS4, UHC, UJIS, UK,
UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-7, UTF-8, UTF-16,
UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF7, UTF8, UTF16, UTF16BE,
UTF16LE, UTF32, UTF32BE, UTF32LE, VISCII, WCHAR_T, WIN-SAMI-2, WINBALTRIM,
WINDOWS-31J, WINDOWS-874, WINDOWS-936, WINDOWS-1250, WINDOWS-1251,
WINDOWS-1252, WINDOWS-1253, WINDOWS-1254, WINDOWS-1255, WINDOWS-1256,
WINDOWS-1257, WINDOWS-1258, WINSAMI2, WS2, YU
> exactly the suggested result of converting all UTF-8 input to UCNs in the
Which, as I have explained, is fundamentally incompatible with C++
requirements on raw string literals (namely that UTF-8 within such a
string appears as UTF-8 bytes in the resulting object, while a \u or \U
sequence appears as such in the resulting object). Proper C++ semantics
require a conversion that is aware of the lexical context and only
converts to UCNs in certain contexts.
The string literal R"(\u00C0)" must contain the six bytes \u00C0 plus the
trailing null byte. The string literal R"(À)" (given UTF-8 as the
multibyte encoding of the execution character set) must contain the two
bytes of that character's UTF-8 encoding plus the trailing null byte.
See how lex_raw_string deals with reverting trigraph and line splicing
transformations; conversions to UCNs would need similarly reverting if
done at all (it's probably better to do the conversions later, only in the
corner cases where it's actually visible whether such a conversion was
done, lexing as UTF-8 as far as possible).
Thanks Joseph for the clarification about the two different versions of iconv. I was admittedly confused about this until moments ago. Anyway, I just discovered that libiconv doesn't support conversions to and from the IBM1047 EBCDIC character set and this causes some of the regression tests to fail. Coupled with the fact that C99 isn't supported in the glibc version of iconv this creates a little problem with my patch.
You mention a bigger problem which I had not thought about: the C++ semantics of raw strings. Processing UCNs in C++ code apparently requires surprisingly deep syntactic analysis. Raw literals seem to appear in the gnu99 and gnu11 extensions to C as well.
Amusingly, if I understand the C++ specifications
trigraphs are supposed to be interpreted before any other processing takes place. However, the simple code
printf("%s or %s or %s\n",p1,p2,p3);
$ g++ -std=c++11 pp.c
ä or ??/u00E4 or \u00E4
which illustrates that g++ does not process trigraphs inside raw string literals. Admittedly I'm looking at the draft standard, but I don't think this is something which changed suddenly in the final draft. Clearly, my patch makes a further mess of raw string literals in gcc. My first reaction is that raw string literals were not well thought out, but I suppose bad standards are sometimes better than no standards. At any rate, there appears to be no easy way of supporting both UTF-8 identifiers and raw string literals.
My plan for now is to take a break and keep my UTF-8 identifier support as a one-line patch reliant on libiconv which breaks EBCDIC encodings and raw string literals.
On Tue, 18 Aug 2015, ejolson at unr dot edu wrote:
> which illustrates that g++ does not process trigraphs inside raw string
> literals. Admittedly I'm looking at the draft standard, but I don't think this
As stated in [lex.pptoken] in both C++11 and C++14: "Between the initial
and final double quote characters of the raw string, any transformations
performed in phases 1 and 2 (trigraphs, universal-character-names, and
line splicing) are reverted; this reversion shall apply before any d-char,
r-char, or delimiting parenthesis is identified.". Yes, the positioning
of this in the standard may be confusing....
That is, the effect is more or less as if trigraphs weren't processed
inside raw strings (but the implementation involves undoing trigraph
substitutions, as described in the standard).
I think the right way to implement UTF-8 in identifiers involves making
lex_identifier handle UTF-8 (when extended identifiers are enabled), and
making _cpp_lex_direct handle bytes with the high bit set as
potentially[*] starting identifiers (requiring the same handling of
normalization state as for the other cases of characters starting
identifiers, of course). If you do that, then raw strings and all the
corner cases of spelling preservation fall out naturally (though they
still need testcases added to the testsuite).
[*] I think the right rule for C is that UTF-8 for a character not allowed
in identifiers should produce a preprocessing token on its own rather than
an error for an invalid character in an identifier (and similarly, such a
character after the start of the identifier should terminate the
identifier and produce such a preprocessing token). Unless and until
someone implements the C++ phase 1 conversion to UCNs, it would seem
reasonable to follow this rule for C++ as well.
I've been looking at the code in lex_identifier as well as what goes on in forms_identifier_p and so forth. At some point each identifier needs to be stored in the symbol table using ht_lookup_with_hash. Proper functioning requires that UTF-8 and UCN representations of the same Unicode characters are treated as the same symbol. Thus, there needs to be some point at which the identifiers are regularized to be either all UTF-8 or all UCN escaped ASCII. As gcc is working with UCNs right now, the obvious implementation allocates temporary memory to hold the UCN escaped ASCII version of a UTF-8 identifier and then frees it again after calling ht_lookup. Any comments would be appreciated.
_cpp_interpret_identifier converts UCNs to UTF-8 which is the canonical
internal form for identifiers - for UTF-8 in identifiers, you just need to
pass it straight through unmodified there. (cpplib takes care to store
the original spelling of the identifier as well for purposes for which
that matters, but that's simply a matter of lex_identifier calling
cpp_lookup on the original spelling as well as using
_cpp_interpret_identifier to get the canonical version.)
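To illustrate why one canonical form makes both spellings name the same symbol, here is a small sketch that rewrites the UCNs in an identifier spelling as UTF-8. It is not cpplib code: canonical_identifier, append_utf8 and hex_value are hypothetical names, and the sketch skips the checks the real lexer performs (allowed character ranges, normalization).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of code point cp (caller ensures cp <= 0x10FFFF).
inline void append_utf8(std::string &out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Value of a hexadecimal digit, or -1 if c is not one.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: replace each \uXXXX or \UXXXXXXXX in an identifier
// spelling with the UTF-8 bytes of that code point; other bytes are copied.
inline std::string canonical_identifier(const std::string &spelling) {
    std::string out;
    std::size_t i = 0;
    while (i < spelling.size()) {
        bool is_ucn = spelling[i] == '\\' && i + 1 < spelling.size()
                      && (spelling[i + 1] == 'u' || spelling[i + 1] == 'U');
        std::size_t digits = (is_ucn && spelling[i + 1] == 'u') ? 4 : 8;
        if (is_ucn && i + 2 + digits <= spelling.size()) {
            std::uint32_t cp = 0;
            bool ok = true;
            for (std::size_t j = 0; j < digits && ok; ++j) {
                int d = hex_value(spelling[i + 2 + j]);
                ok = d >= 0;
                if (ok) cp = cp * 16 + static_cast<std::uint32_t>(d);
            }
            if (ok && cp <= 0x10FFFF) {
                append_utf8(out, cp);
                i += 2 + digits;
                continue;
            }
        }
        out += spelling[i++];  // ordinary byte, including raw UTF-8
    }
    return out;
}
```

With this, the spellings caf\u00E9 and the UTF-8 form of the same name produce the same key, while the original spelling can still be kept separately for diagnostics and -dD output.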
So you never need to convert UTF-8 to UCNs in order to handle UTF-8 in
identifiers (cpplib has logic to do so when needed for output, but you
don't need to add anything new in that regard). You do need to decode
UTF-8 into character values for the code that checks normalization, which
characters are allowed at the start of identifiers, etc., just as the
existing code decodes UCNs into such values. (But as I noted, a UCN not
allowed in identifiers is lexed as part of an identifier, which is then
considered invalid, whereas a UTF-8 character not allowed in identifiers
should be lexed as a separate pp-token. However, UTF-8 for a character
allowed in identifiers but not at the start of an identifier should, I
think, be lexed as an identifier character even at the start of an
identifier, and then give an error for an invalid identifier if it appears
at the start of an identifier. That's my reading of the syntax
productions in the C standard.)
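The decoding step mentioned above (turning UTF-8 bytes into character values before checking what is allowed in an identifier) might look roughly like the sketch below. decode_utf8 is a hypothetical helper, not an existing cpplib function, and the identifier-range and normalization checks that would consume its result are left out.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decode one UTF-8 sequence starting at s, with n bytes
// available. On success, store the code point in *cp and return the number
// of bytes consumed. Return 0 for malformed or truncated input.
inline std::size_t decode_utf8(const unsigned char *s, std::size_t n,
                               std::uint32_t *cp) {
    if (n == 0) return 0;
    unsigned char b0 = s[0];
    std::size_t len;
    std::uint32_t value, min;
    if (b0 < 0x80) {
        *cp = b0;               // plain ASCII byte
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {
        len = 2; value = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        len = 3; value = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        len = 4; value = b0 & 0x07; min = 0x10000;
    } else {
        return 0;               // stray continuation byte or invalid lead byte
    }
    if (n < len) return 0;      // sequence cut off by end of buffer
    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;  // not a continuation byte
        value = (value << 6) | (s[i] & 0x3F);
    }
    if (value < min) return 0;                          // overlong encoding
    if (value >= 0xD800 && value <= 0xDFFF) return 0;   // surrogate range
    if (value > 0x10FFFF) return 0;                     // beyond Unicode
    *cp = value;
    return len;
}
```

A lexer following the rule described above would call something like this when it sees a byte with the high bit set, then decide from the decoded value whether the character continues the identifier or becomes a separate preprocessing token.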
You can ignore anything claiming to handle UTF-EBCDIC.
Has there been any progress on this since 2015? I'm maintaining a project that uses the International Phonetic Alphabet (IPA) internally. My life would be much easier if I could use identifiers like aʊ or dʒ. Both are valid C++ identifiers supported by Clang, Xcode and Visual Studio, but not supported by GCC.
My knowledge of compilers is very limited, so I'm afraid I can't be of practical help. But I'd like to point out that there is indeed demand for this feature -- see for example this StackOverflow question: <http://stackoverflow.com/questions/12692067/and-other-unicode-characters-in-identifiers-not-allowed-by-g#>
An important patch. Is there a similar patch for versions of gcc later than 5.2.0? I'm looking for a gcc-7.2.1-2 patch for Unicode identifiers.
(In reply to email@example.com from comment #23)
> An important patch. Is there a similar patch for versions later than 5.2.0
> of gcc? I'm looking for gcc-7.2.1-2 patch for unicode identifiers.
The patch above is not recommended due to the problems mentioned above.
The recommended work-around is given here:
Guidelines for a proper implementation are given in comment #21.
Many thanks Manu. The to_UCN.sh script works well. The only trouble was that my include file names also contain unusual characters with diacritic marks, and the script changes these file names to UCNs as well, so the compiler can't find them. I had to re-edit the .cpp file manually after the conversion to change the include file names back. In spite of that, it is useful and enables coding with a much greater choice of words for identifiers. It is much easier for me to read my code. Thanks again.