This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: extended characters (was Re: The integrated preprocessor)
>>>>> Zack Weinberg <weinberg@cygnus.com> writes:
I've moved your response to my personal mail onto this thread; hope that's
OK. Please do read my earlier post in this thread, in case you haven't.
> On Mon, Aug 21, 2000 at 07:32:22PM -0700, Jason Merrill wrote:
>> It might be convenient for the
>> preprocessor to parse \u sequences and multibyte characters and feed
>> the code along to the compiler such that
>>
>> 1) identifiers are recoded to UTF-8
>> 2) string and character constants are recoded to the current multibyte
>> character set. In the case of the integrated preprocessor, wide
>> string and character constants could be passed along as wide
>> characters.
> How do we decide what the current multibyte character set is? Locale?
> Command line option? How do we know what the character set of the
> file is?
The source character set is determined by the locale. The execution
character set must be specified by the user, except in special cases
(e.g. TARGET_EBCDIC). That could be done with a command-line option, an
environment variable, or a pragma, or all of the above.
Hmm...I notice that the code currently applies MAP_CHARACTER to all
escapes. It shouldn't be mapping octal or hex escapes; those should pass
through unchanged.
>> Thoughts? It seems to me that since the goal with the integrated
>> preprocessor is to avoid doing any lexing in the compiler, this work
>> will need to move into the preprocessor.
> I do think we'll want to be doing this in the preprocessor eventually.
> I doubt it'll be me implementing it - I'm starting graduate school in
> just under three weeks and I have to move. cpplib will have to scan
> the multibyte characters anyway, so that it knows where the boundaries
> are; it might as well convert them to whatever form is suitable for
> the compiler. More generally, we want to move the ascii->binary
> conversion for numeric and character constants, and the escape
> sequence conversion for string constants, into cpplib, but again it
> probably won't be me implementing it.
> There's a major stumbling block in my way when it comes to making
> cpplib multibyte-aware at all. There's no way to go from a charset
> name to something you can hand to setlocale(), and therefore no way to
> control the charset being expected by mbrtowc().
True.
> Iconv lets you give charset names, but it wants to convert characters in
> bulk.
It can, but it can also convert single characters. Just tell it you only
have room for one output wide char.
> So how do I find the end of a string constant - or even a
> comment! - when the character set of the file is different from the
> default charset of the locale, and isn't a strict superset of ASCII?
> Short of starting by iconv-ing the entire file to UTF-8, which does
> wonderful things for performance and memory consumption.
I would think you could play games with scanning normally until you hit
something not in the basic source character set. Shift-JIS may not be a
strict superset of ASCII, but its multibyte sequences do start with a
distinctive lead byte (0x81-0x9F or 0xE0-0xEF); it's stateful encodings
like ISO-2022-JP whose sequences begin with <ESC>. Unless the source
file is in EBCDIC and the compiler expects ASCII or something bizarre
like that.
> And I'm pretty confident we're going to want to let a file specify what
> character set it's in, using something like "#pragma GCC charset <name>"
> at the top of the file, or a MULE magic comment.
Yup.
> How does the Java front end do it?
I don't think it currently supports multibyte characters in source files,
but I could be wrong.
Jason