Character sets - The C Preprocessor

Source code character set processing in C and related languages is rather complicated. The C standard discusses two character sets, but there are really at least four.

The files input to CPP might be in any character set at all. CPP's very first action, before it even looks for line boundaries, is to convert the file into the character set it uses for internal processing. That set is what the C standard calls the source character set. It must be isomorphic with ISO 10646, also known as Unicode. CPP uses the UTF-8 encoding of Unicode.

At present, GNU CPP does not implement conversion from arbitrary file encodings to the source character set. Use of any encoding other than plain ASCII or UTF-8, except in comments, will cause errors. Use of encodings that are not strict supersets of ASCII, such as Shift JIS, may cause errors even if non-ASCII characters appear only in comments. We plan to fix this in the near future.
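As a rough illustration of the restriction above, a build script could pre-check that a source file contains only plain ASCII bytes before handing it to CPP. The helper below is a minimal sketch; the name `is_plain_ascii` is hypothetical and not part of CPP. It is deliberately stricter than necessary, since valid UTF-8 input is also accepted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of CPP): returns true when every byte
   in buf lies in the ASCII range 0x00..0x7F.  Valid UTF-8 is also
   accepted by CPP, so a false result here is only a warning sign.  */
bool is_plain_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] > 0x7F)
            return false;   /* non-ASCII byte found */
    }
    return true;
}
```

For example, the Shift JIS encoding of the katakana character "ソ" is the byte pair 0x83 0x5C. The first byte fails this check, and the second byte is the same as an ASCII backslash, which suggests why such encodings can cause trouble even inside comments.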

All preprocessing work (the subject of the rest of this manual) is carried out in the source character set. If you request textual output from the preprocessor with the -E option, it will be in UTF-8.

After preprocessing is complete, string and character constants are converted again, into the execution character set. This character set is under the control of the user; the default is UTF-8, matching the source character set. Wide string and character constants have their own character set, which is not called out specifically in the standard. Again, it is under the control of the user. The default is UTF-16 or UTF-32, whichever fits in the target's wchar_t type, in the target machine's byte order.[1] Octal and hexadecimal escape sequences do not undergo conversion; '\x12' has the value 0x12 regardless of the currently selected execution character set. All other escapes are replaced by the character in the source character set that they represent, then converted to the execution character set, just like unescaped characters.
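The conversion rules above can be observed from a small C program. The following is a sketch only; the function names are made up for illustration, and the byte count for U+00E9 assumes the default UTF-8 execution character set described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Octal and hexadecimal escapes are not converted: each constant below
   has the numeric value written in the escape, whatever execution
   character set is selected.  */
void check_numeric_escapes(void)
{
    assert('\x12' == 0x12);
    assert('\022' == 0x12);
    assert(L'\x12' == 0x12);
}

/* Other escapes are converted to the execution character set.  Under
   the default (UTF-8), U+00E9 occupies two bytes, 0xC3 0xA9.  */
size_t e_acute_bytes(void)
{
    return strlen("\u00e9");
}

/* Number of wchar_t units used for a character outside the Basic
   Multilingual Plane: one where the wide set is UTF-32, two (a
   surrogate pair) where wchar_t is 16 bits and the wide set is UTF-16.  */
size_t astral_wide_units(void)
{
    return wcslen(L"\U0001F600");
}
```

Calling `check_numeric_escapes()` should pass on any target. The results of `e_acute_bytes()` and `astral_wide_units()` depend on the execution and wide execution character sets in effect, so they are best read as a probe of the current configuration rather than as fixed values.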

GCC does not permit the use of characters outside the ASCII range, nor `\u' and `\U' escapes, in identifiers. We hope this will change eventually, but there are problems with the standard semantics of such “extended identifiers” which must be resolved through the ISO C and C++ committees first.

Footnotes

[1] UTF-16 does not meet the requirements of the C standard for a wide character set, but the choice of a 16-bit wchar_t is enshrined in some system ABIs, so we cannot fix this.
