This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: extended characters (was Re: The integrated preprocessor)


>>>>> Zack Weinberg <weinberg@cygnus.com> writes:

I've moved your response to my personal mail onto this thread; I hope
that's OK.  If you haven't already, please do read my earlier post in
this thread.

 > On Mon, Aug 21, 2000 at 07:32:22PM -0700, Jason Merrill wrote:
 >> It might be convenient for the
 >> preprocessor to parse \u sequences and multibyte characters and feed
 >> the code along to the compiler such that
 >> 
 >> 1) identifiers are recoded to UTF-8
 >> 2) string and character constants are recoded to the current multibyte
 >>    character set.  In the case of the integrated preprocessor, wide
 >>    string and character constants could be passed along as wide
 >>    characters.

 > How do we decide what the current multibyte character set is?  Locale?
 > Command line option?  How do we know what the character set of the
 > file is?

The source character set is determined by the locale.  The execution
character set must be specified by the user, except in special cases
(e.g. TARGET_EBCDIC).  That could be done with a command-line option,
an environment variable, or a pragma, or any combination of the above.
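
For what it's worth, on a POSIX system picking up the locale's charset
is just nl_langinfo; a minimal sketch:

  #include <locale.h>
  #include <langinfo.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* Adopt the user's locale from the environment (LC_CTYPE, LANG).  */
    setlocale (LC_CTYPE, "");

    /* CODESET names the locale's encoding, e.g. "UTF-8" or "EUC-JP";
       this would become the default source character set.  */
    printf ("source charset: %s\n", nl_langinfo (CODESET));
    return 0;
  }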

Hmm...I notice that the code currently applies MAP_CHARACTER to all
escapes.  It shouldn't be mapping octal or hex escapes; those already
denote values in the execution character set, so they should pass
through untouched.
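
Something like this is what I'd expect (a sketch only; convert_escape
is a made-up name, MAP_CHARACTER is the existing target macro):

  /* Named escapes and plain characters get the target mapping, but
     octal \nnn and hex \xNN already denote execution character set
     values and must be passed through unchanged.  */
  static int
  convert_escape (int c, int numeric)
  {
    if (numeric)
      return c;                  /* \012, \x41: already target values */
  #ifdef MAP_CHARACTER
    return MAP_CHARACTER (c);    /* e.g. host ASCII -> target EBCDIC */
  #else
    return c;
  #endif
  }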

 >> Thoughts?  It seems to me that since the goal with the integrated
 >> preprocessor is to avoid doing any lexing in the compiler, this work
 >> will need to move into the preprocessor.

 > I do think we'll want to be doing this in the preprocessor eventually.
 > I doubt it'll be me implementing it - I'm starting graduate school in
 > just under three weeks and I have to move.  cpplib will have to scan
 > the multibyte characters anyway, so that it knows where the boundaries
 > are; it might as well convert them to whatever form is suitable for
 > the compiler.  More generally, we want to move the ascii->binary
 > conversion for numeric and character constants, and the escape
 > sequence conversion for string constants, into cpplib, but again it
 > probably won't be me implementing it.

 > There's a major stumbling block in my way when it comes to making
 > cpplib multibyte-aware at all.  There's no way to go from a charset
 > name to something you can hand to setlocale(), and therefore no way to
 > control the charset being expected by mbrtowc().

True.

 > Iconv lets you give charset names, but it wants to convert characters in
 > bulk.

It can do bulk conversion, but it can also convert a single character
at a time: just tell it you only have room for one output wide char.
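
For example (a sketch; "WCHAR_T" as an iconv target encoding is a GNU
extension, and error handling is elided):

  #include <iconv.h>
  #include <wchar.h>

  /* Convert exactly one multibyte character from IN, using a
     descriptor from e.g. iconv_open ("WCHAR_T", "SHIFT_JIS").
     Returns the number of input bytes consumed, or (size_t) -1.  */
  static size_t
  convert_one (iconv_t cd, const char *in, size_t inlen, wchar_t *out)
  {
    char *inp = (char *) in;
    char *outp = (char *) out;
    size_t outleft = sizeof (wchar_t);   /* room for one wide char */

    /* With space for a single wide char, iconv converts at most one
       character, then stops with E2BIG; the advanced input pointer
       tells us how many bytes that character occupied.  */
    iconv (cd, &inp, &inlen, &outp, &outleft);
    if (outleft != 0)
      return (size_t) -1;                /* nothing was converted */
    return (size_t) (inp - (char *) in);
  }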

 > So how do I find the end of a string constant - or even a
 > comment! - when the character set of the file is different from the
 > default charset of the locale, and isn't a strict superset of ASCII?
 > Short of starting by iconv-ing the entire file to UTF8, which does
 > wonderful things for performance and memory consumption.

I would think you could play games with scanning normally until you hit
something not in the basic source character set.  Shift-JIS isn't a
strict superset of ASCII, but the lead byte of a multibyte sequence is
always in the high range (0x81-0x9F or 0xE0-0xEF), and ISO-2022-JP
shifts into its multibyte mode with <ESC>; either way there's a cue to
hand off to a real decoder before a trail byte like 0x5C can be
mistaken for a backslash.  Unless the source file is in EBCDIC and the
compiler expects ASCII, or something bizarre like that.
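
A rough sketch of that fast path (multibyte_length is a made-up
stand-in for a call to mbrtowc or iconv):

  #include <stddef.h>

  /* Hypothetical: returns the byte length of the multibyte character
     at P, e.g. by calling mbrtowc or iconv.  */
  extern size_t multibyte_length (const unsigned char *p, size_t len);

  /* Scan to the closing quote of a string literal.  Bytes below 0x80
     are in the basic source character set and handled directly; a
     high lead byte hands the whole character to the decoder, so a
     Shift-JIS trail byte of 0x5C never gets mistaken for a backslash.  */
  static const unsigned char *
  skip_string (const unsigned char *p, const unsigned char *limit)
  {
    while (p < limit && *p != '"')
      {
        if (*p == '\\' && p + 1 < limit)
          p += 2;                /* escape: skip the escaped byte */
        else if (*p < 0x80)
          p++;                   /* basic source character set */
        else
          p += multibyte_length (p, limit - p);
      }
    return p;
  }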

 > And I'm pretty confident we're going to want to let a file specify what
 > character set it's in, using something like "#pragma GCC charset <name>"
 > at the top of the file, or a MULE magic comment.

Yup.
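
E.g., at the top of the file (hypothetical syntax, none of this is
implemented):

  /* -*- coding: shift_jis -*-   (MULE-style magic comment) */
  #pragma GCC charset Shift_JIS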

 > How does the Java front end do it?

I don't think it currently supports multibyte characters in source files,
but I could be wrong.

Jason
