This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: cpplib: locale-sensitive lexing


Zack Weinberg wrote:-

> I would prefer an approach which made no use of C9x multibyte
> character functions.  Specifically, the concept of "locale" is far too
> slippery.  We run the risk of having code compile correctly on one
> system and not on another because the library implementers have
> decided that "ru_RU" means something different.

Hmm.  I don't see the problem here - doesn't only the host's locale
matter?  If not, would you elaborate?

I don't see how using iconv eliminates the potential for confusion,
either.  If there is a danger of confusion, I would imagine the user
could specify a more descriptive locale (like in glibc where you
append e.g. UTF-8 or EUC-JP).

> The Java front end appears to ignore locale and use iconv and
> character encoding names.  This seems much more appropriate to me.
> 
> I also think it's short-sighted to implement a solution which does not
> allow for the source character set to vary per input file, or for the
> execution character set to be different from the source character
> set(s).  We know we will need these things.  We don't have to
> implement them now, but let's not implement something that will have
> to be thrown away and redone later.

If there are input files with different encodings, then change the
locale before compiling them.  I'm not persuaded of the need for
multiple encodings within a translation unit.

How does my code restrict the execution character set to be the same
as the source character set?

> A related problem we need to consider is the width of non-ASCII
> characters.  For instance, most Chinese ideograms take up two columns
> each.  In order to get column positions right, we have to check what
> the width of each character is.  C99 has wcwidth(), but again I would
> prefer we avoid anything that depends on locale.

I've thought about this a bit.  Is there such a thing as a definite
width for a non-ASCII character?  The more I thought about it, the
more the concept of "column" seemed bogus.  There is no way you can
reasonably know how such characters are being displayed to the user.
(Or is there?)  For example, the user's interface may be unable to
display the non-ASCII characters, and show \u escapes instead.

What is important is the byte offset of the diagnostic in the current
line.  Is it not up to the user's editor or IDE or whatever to convert
that byte offset into a meaningful place onscreen?
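
To illustrate the kind of conversion a consumer of the diagnostic
could do, here is a minimal sketch.  It assumes the line is valid
UTF-8, which is my assumption for the example and not something
cpplib guarantees; the function name is hypothetical.  It maps a byte
offset to a character index only, and deliberately says nothing about
display width.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of cpplib.  Assumes LINE is valid
   UTF-8.  Returns the number of characters (code points) that start
   before BYTE_OFFSET, i.e. the 0-based character index of the
   character starting at that offset.  */
size_t
byte_offset_to_char_index (const unsigned char *line, size_t byte_offset)
{
  size_t i, chars = 0;

  assert (line != NULL);
  for (i = 0; i < byte_offset; i++)
    /* UTF-8 continuation bytes have the form 10xxxxxx; every other
       byte starts a new character.  */
    if ((line[i] & 0xC0) != 0x80)
      chars++;
  return chars;
}
```

For the three-character line "a", e-acute, "z" (four bytes in UTF-8),
a diagnostic at byte offset 3 refers to the "z", which is character
index 2.  How many screen columns that corresponds to is still up to
whatever displays it.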

Suppose we did use iconv.  How would you avoid overhead for the normal
case?
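
The only cheap fast path I can think of is something like the sketch
below: scan the buffer, and skip conversion entirely if every byte is
7-bit ASCII.  This is only valid under the assumption that the source
encoding is an ASCII superset in which ASCII bytes always stand for
themselves (true of UTF-8 and the ISO-8859 family, not of stateful
encodings such as ISO-2022-JP).  The function name is hypothetical,
and note that the scan itself is still a pass over every byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of cpplib.  Returns 1 if every byte
   in BUF[0..LEN) is 7-bit ASCII, 0 otherwise.  Only meaningful as a
   "no conversion needed" test when the source encoding is an ASCII
   superset (an assumption of this sketch).  */
int
buffer_is_ascii (const unsigned char *buf, size_t len)
{
  size_t i;

  assert (buf != NULL || len == 0);
  for (i = 0; i < len; i++)
    if (buf[i] & 0x80)
      return 0;   /* High bit set: not plain ASCII.  */
  return 1;
}
```

A caller would hand the buffer to iconv only when this returns 0.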

Neil.

