This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: Merge cpplib and front end hashtables, part 1


On Thu, May 17, 2001 at 06:51:37PM +0100, Neil Booth wrote:
> Zack Weinberg wrote:-
> 
> > I don't think it's feasible for us to use the C89 or even C99
> > multibyte primitives at all.  They are too underspecified to be of
> > use.  iconv has approximately the right interface, although it's
> > geared to bulk conversion not char-by-char processing and is therefore
> > too heavyweight ... but then, so are the <wchar.h> primitives.  Bleah.
> 
> I think we should go for translating the whole file in one go, at
> least initially.  cpplib can then cache the translated buffer, rather
> than the buffer itself - obviously a big win.

I'm not sure how much of a win it is, but it definitely fits the iconv
interface better, so it's a good idea just on those grounds.  (Files
that we scan repeatedly are likely to be system headers, and therefore
less likely to need conversion.)  It also avoids having to spread
charset-translation through the rest of cpplib.

We should retain a short-circuit where we don't bother running
anything through iconv, for files which are known to be in the
encoding used by the next phase already.  UTF8 probably works best
with the existing architecture.
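
A minimal sketch of that whole-buffer conversion with the short-circuit,
in C.  This is not cpplib code: it assumes a POSIX iconv(), and the
function name convert_buffer_to_utf8 and the decision to hand back a
malloc'd copy even on the short-circuit path are hypothetical choices
made only to keep ownership simple.  A real implementation would
probably reuse the original buffer instead of copying.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* Hypothetical helper: convert IN (IN_LEN bytes, in SRC_CHARSET) to
   UTF-8 in one pass.  Returns a malloc'd, NUL-terminated buffer and
   stores its length in *OUT_LEN, or returns NULL on failure.  */
char *
convert_buffer_to_utf8 (const char *src_charset, const char *in,
                        size_t in_len, size_t *out_len)
{
  assert (src_charset != NULL && in != NULL && out_len != NULL);

  /* Short-circuit: the input is already in the internal encoding,
     so iconv is never invoked.  */
  if (strcasecmp (src_charset, "UTF-8") == 0
      || strcasecmp (src_charset, "UTF8") == 0)
    {
      char *copy = malloc (in_len + 1);
      if (copy == NULL)
        return NULL;
      memcpy (copy, in, in_len);
      copy[in_len] = '\0';
      *out_len = in_len;
      return copy;
    }

  iconv_t cd = iconv_open ("UTF-8", src_charset);
  if (cd == (iconv_t) -1)
    return NULL;

  size_t cap = in_len * 2 + 16;
  char *out = malloc (cap + 1);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }

  char *inp = (char *) in;      /* iconv does not modify the input.  */
  size_t in_left = in_len;
  char *outp = out;
  size_t out_left = cap;

  while (in_left > 0)
    {
      if (iconv (cd, &inp, &in_left, &outp, &out_left) != (size_t) -1)
        break;                  /* Everything was converted.  */

      if (errno != E2BIG)       /* EILSEQ or EINVAL: give up.  */
        {
          free (out);
          iconv_close (cd);
          return NULL;
        }

      /* Output buffer full: double it and carry on.  */
      size_t used = (size_t) (outp - out);
      cap *= 2;
      char *bigger = realloc (out, cap + 1);
      if (bigger == NULL)
        {
          free (out);
          iconv_close (cd);
          return NULL;
        }
      out = bigger;
      outp = out + used;
      out_left = cap - used;
    }

  /* UTF-8 is stateless, so there is no shift sequence to flush.  */
  iconv_close (cd);
  *outp = '\0';
  *out_len = (size_t) (outp - out);
  return out;
}
```

The caller frees the result.  The converted buffer is what would be
cached, so a header that is read twice is only converted once.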

> The only problems I see with this are users wanting strings and
> character constants to be converted back to the original charset.  But
> it's too good an optimisation to lose easily.

It's not hard to convert each string back from UTF8 to the desired
charset if necessary.  The round trip through UTF8 is not supposed to
lose information.  I have seen people claiming that it does in some
cases, but I think they were talking about e.g. the most general form
of ISO-2022-JP, which no one uses in real life.

Passing through Unicode also means that \u and \U escapes make sense
no matter what the user's preferred execution charset is.
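
To illustrate why the escapes become charset-independent: once the
internal form is UTF8, a \u or \U escape is just a code point to be
encoded.  A rough sketch in C follows; the names encode_utf8 and
ucn_to_utf8 are hypothetical, and it deliberately does not enforce the
C99 restrictions on which code points below 00A0 may be written as
universal character names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point CP as UTF-8 into OUT.  Returns the number of
   bytes written, or 0 if CP is a surrogate or beyond U+10FFFF.  */
size_t
encode_utf8 (uint32_t cp, unsigned char out[4])
{
  assert (out != NULL);

  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}

/* Parse the text of one escape, "\uXXXX" or "\UXXXXXXXX", and encode
   it as UTF-8.  Returns the byte count, or 0 on malformed input.  */
size_t
ucn_to_utf8 (const char *esc, unsigned char out[4])
{
  if (esc[0] != '\\' || (esc[1] != 'u' && esc[1] != 'U'))
    return 0;

  int digits = (esc[1] == 'u') ? 4 : 8;
  uint32_t cp = 0;

  for (int i = 0; i < digits; i++)
    {
      char c = esc[2 + i];
      uint32_t v;

      if (c >= '0' && c <= '9')
        v = (uint32_t) (c - '0');
      else if (c >= 'a' && c <= 'f')
        v = (uint32_t) (c - 'a' + 10);
      else if (c >= 'A' && c <= 'F')
        v = (uint32_t) (c - 'A' + 10);
      else
        return 0;               /* Also catches a premature NUL.  */
      cp = (cp << 4) | v;
    }
  return encode_utf8 (cp, out);
}
```

Translating the result to some other execution charset is then the
same job as translating any other UTF8 string.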

> What is our execution charset going to be?  UTF8?  We could always
> translate back, depending on, say, a command-line flag or some extra
> flag indicated in the same place as the source charset.  I'm not sure
> whether this would lose information in some cases, though.

Wide string constants are tricky.  What you really want is for L"foo"
to contain the same bit pattern that mbstowcs("foo") would give you if
executed *on the target*.  Currently I think we punt.

My idea was that we always encode identifiers in UTF8, and
string/char constants default to UTF8 for multibyte and UCS2 or UCS4
for wide chars, depending on sizeof(target wchar_t).  Later, we add a
command line switch that lets the user request strings etc. be
translated to a different execution character set.  I think there
should be just one execution character set per translation unit.
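
A sketch of the default wide-string case, in C.  Everything here is an
assumption for illustration: the names decode_utf8 and widen_utf8, the
wchar_bytes parameter standing in for sizeof(target wchar_t), and the
choice to reject code points above FFFF on a 2-byte target rather than
emit surrogate pairs (plain UCS2 cannot represent them).  The values
are held in uint32_t because the host's wchar_t need not match the
target's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from S (LEN bytes available) into *CP.
   Returns the number of bytes consumed, or 0 on malformed input.  */
static size_t
decode_utf8 (const unsigned char *s, size_t len, uint32_t *cp)
{
  size_t n;
  uint32_t c, min;

  if (len == 0)
    return 0;
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  else if ((s[0] & 0xE0) == 0xC0)
    n = 2, c = s[0] & 0x1F, min = 0x80;
  else if ((s[0] & 0xF0) == 0xE0)
    n = 3, c = s[0] & 0x0F, min = 0x800;
  else if ((s[0] & 0xF8) == 0xF0)
    n = 4, c = s[0] & 0x07, min = 0x10000;
  else
    return 0;

  if (len < n)
    return 0;
  for (size_t i = 1; i < n; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates and out-of-range values.  */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;
  *cp = c;
  return n;
}

/* Turn the UTF-8 text of a wide string constant into one value per
   target wchar_t.  WCHAR_BYTES is the target's sizeof (wchar_t),
   2 or 4.  Returns the element count, or (size_t) -1 on error.  */
size_t
widen_utf8 (const char *utf8, size_t len, unsigned wchar_bytes,
            uint32_t *out, size_t out_cap)
{
  const unsigned char *p = (const unsigned char *) utf8;
  size_t n = 0;

  assert (wchar_bytes == 2 || wchar_bytes == 4);

  while (len > 0)
    {
      uint32_t cp;
      size_t used = decode_utf8 (p, len, &cp);

      if (used == 0)
        return (size_t) -1;     /* Malformed UTF-8.  */
      if (wchar_bytes == 2 && cp > 0xFFFF)
        return (size_t) -1;     /* Not representable in UCS2.  */
      if (n == out_cap)
        return (size_t) -1;     /* Output array too small.  */
      out[n++] = cp;
      p += used;
      len -= used;
    }
  return n;
}
```

This only covers the UCS2/UCS4 default.  Matching what mbstowcs would
produce on a target with some other wide encoding needs the later
command line switch.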

How's that sound?

Oh, and you should look at the Java front end, which does something
like this already; we need to stay compatible with it.

-- 
zw  I was saving quarters, but now I'm going home tomorrow, so I guess I have
    more money than I have.
    	-- Nathaniel Smith

