This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: UCNs-in-IDs patch
On Thu, 17 Mar 2005, Per Bothner wrote:
> To go further: it may be acceptable that foo.E always
> be in UTF-8, even when that doesn't match the input Locale.
> Or always use ASCII with \U-escapes. Or better: if the
> current locale uses UTF-8, emit UTF-8; otherwise emit
> ASCII with \U-escapes. Code that reads pre-processed input
> should assume the input is UTF-8 which might contain \U-escapes,
> rather than the current locale.
I'd say that for C++ the preprocessed output should contain UCNs (because
of the C++ phase 1 mapping); for C it should use the original token
spellings, both as a quality-of-implementation issue and as a correctness
issue insofar as we say that the preprocessed output is the token sequence
resulting from the preprocessing phases. I don't like the phase 1
implementation-defined mapping being any more complicated than it needs to
be (i.e., converting the input character set, specified in the documented
way, to Unicode using iconv). I think we can just barely justify two
departures from that on the grounds of user confusion: ignoring whitespace
between backslash and newline (which is unfortunately missing from the
documentation of implementation-defined behavior), and ignoring byte
sequences not in the input character set within comments (with a warning,
and only for ASCII-compatible character sets). We don't yet do the
latter, but it might allow us to start using the locale character set as
the default input character set. But a UCN conversion not needed by the
standard (i.e. any other than the standard C++ one) doesn't seem justified
that way.
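
To make the "ASCII with UCNs" spelling concrete, here is a rough sketch of
the kind of rewriting under discussion: UTF-8 input is copied through, with
every non-ASCII character respelled as \uXXXX or \UXXXXXXXX. This is only
an illustration, not code from cpplib; the names decode_utf8 and
utf8_to_ucn are made up for the example, and it assumes the input is
already UTF-8 (i.e. any iconv conversion has been done beforehand).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s into *cp.
   Returns the number of bytes consumed, or 0 if the sequence is malformed
   (truncated, overlong, a surrogate, or above U+10FFFF).  */
static size_t decode_utf8(const unsigned char *s, unsigned long *cp)
{
    size_t len, i;
    unsigned long min;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; *cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; *cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; *cp = s[0] & 0x07; min = 0x10000; }
    else return 0;

    for (i = 1; i < len; i++) {
        /* A terminating NUL is not a continuation byte, so this check
           also stops us from reading past the end of the string.  */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return 0;
    return len;
}

/* Hypothetical helper: copy the UTF-8 string IN to OUT as plain ASCII,
   spelling each non-ASCII character as a UCN.  Returns 0 on success, -1
   on malformed input or if OUT (of OUTSIZE bytes) is too small.  It does
   not check the standard's restrictions on which code points a UCN may
   name.  */
int utf8_to_ucn(const char *in, char *out, size_t outsize)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t used = 0;

    assert(in != NULL && out != NULL);
    while (*s != '\0') {
        char tmp[11];               /* "\UXXXXXXXX" plus the NUL */
        unsigned long cp;
        size_t n = decode_utf8(s, &cp);
        int w;

        if (n == 0)
            return -1;
        if (cp < 0x80)
            w = snprintf(tmp, sizeof tmp, "%c", (int)cp);
        else if (cp <= 0xFFFF)
            w = snprintf(tmp, sizeof tmp, "\\u%04lX", cp);
        else
            w = snprintf(tmp, sizeof tmp, "\\U%08lX", cp);
        if (w < 0 || used + (size_t)w >= outsize)
            return -1;              /* leave room for the final NUL */
        memcpy(out + used, tmp, (size_t)w);
        used += (size_t)w;
        s += n;
    }
    if (outsize == 0)
        return -1;
    out[used] = '\0';
    return 0;
}
```

With this sketch, an identifier spelled with U+00E9 in UTF-8 comes out with
the six ASCII characters \u00E9 in its place, while plain ASCII text is
left byte for byte as it was. Going the other way (UCNs back to UTF-8) is
the conversion I'm arguing the standard doesn't call for outside C++.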
--
Joseph S. Myers http://www.srcf.ucam.org/~jsm28/gcc/
jsm@polyomino.org.uk (personal mail)
joseph@codesourcery.com (CodeSourcery mail)
jsm28@gcc.gnu.org (Bugzilla assignments and CCs)