This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Merge cpplib and front end hashtables, part 1
On Thu, May 17, 2001 at 07:41:03AM +0100, Neil Booth wrote:
> Zack Weinberg wrote:-
>
> > I put a brain dump on charset handling into the cpplib projects web
> > page. It remains a pretty good statement of what I think our end goal
> > should be in terms of user-visible behavior. It'd be reasonable to do
> > a subset of this stuff to begin with, then get better as things go on.
>
> I don't understand how the user is going to communicate the encoding
> of a file to us. My understanding is charset encoding is on a
> per-file basis; i.e. there is only one encoding per file.
No, it is dependent on the current locale as set by setlocale. I.e., you could
do one setlocale, open up a file and read it via the mb functions, and then
close the file, do a different setlocale, open the exact same file, and get a
different set of multibytes.
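
To make the point above concrete, here is a minimal sketch (not anything in
cpplib) of decoding the same bytes under a caller-chosen locale. The function
name count_chars_in_locale and its return codes are hypothetical; setlocale and
mbsrtowcs are the standard library calls. Note that it changes the process-wide
LC_CTYPE locale and does not restore it.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count how many wide characters BYTES decodes to
   when interpreted under LOCALE_NAME.
   Returns -1 if the locale cannot be selected, -2 if BYTES is not a
   valid multibyte string in that locale.
   Side effect: leaves LC_CTYPE set to LOCALE_NAME.  */
long count_chars_in_locale(const char *locale_name, const char *bytes)
{
    mbstate_t state;
    const char *p = bytes;
    size_t n;

    assert(locale_name != NULL && bytes != NULL);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    /* With a null destination, mbsrtowcs only counts the characters.  */
    memset(&state, 0, sizeof state);
    n = mbsrtowcs(NULL, &p, 0, &state);
    return n == (size_t)-1 ? -2 : (long)n;
}
```

For instance, the two bytes 0xC3 0xA9 would typically count as one character
under a UTF-8 locale and as two under a Latin-1 locale, though which locale
names exist varies from system to system.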
> But since some header files are system header files, clearly the whole
> translation unit cannot be in a single charset.
Ummm, I know it is currently late at night for me, but for C89, IIRC, it was
the intention of the committee that the entire translation unit be in a single
charset and that the compiler does the equivalent of setlocale (LC_ALL, "").
Certainly the way I read the first stage of translation in C99's 5.1.1.2, the
compiler does logically translate everything into the source character set.
5.1.1.2 Translation phases
The precedence among the syntax rules of translation is specified by
the following phases [5]
1. Physical source file multibyte characters are mapped to the
source character set (introducing new-line characters for
end-of-line indicators) if necessary. Trigraph sequences are
replaced by corresponding single-character internal
representations.
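
As an illustration of the trigraph half of that phase, here is a rough sketch
of an in-place replacement pass over a buffer. It is not how cpplib does it;
the name replace_trigraphs is made up for this example, and the multibyte
mapping half of phase 1 is not attempted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: replace each trigraph in BUF (LEN bytes) with its
   single-character equivalent, in place, and return the new length.
   A trigraph is two question marks followed by one of the characters in
   THIRD; the replacement is the character at the same index in REPL.  */
size_t replace_trigraphs(char *buf, size_t len)
{
    static const char third[] = "=(/)'<!>-";
    static const char repl[]  = "#[\\]^{|}~";
    size_t in = 0, out = 0;

    assert(buf != NULL);

    while (in < len) {
        if (in + 2 < len && buf[in] == '?' && buf[in + 1] == '?') {
            /* memchr with an explicit length so a NUL byte never matches.  */
            const char *hit = memchr(third, buf[in + 2], sizeof third - 1);
            if (hit != NULL) {
                buf[out++] = repl[hit - third];
                in += 3;
                continue;
            }
        }
        buf[out++] = buf[in++];
    }
    return out;
}
```

Scanning left to right means a run of three question marks followed by "="
keeps the first question mark and replaces the rest with "#".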
> So we need a way to specify it on a per-file basis, presumably in the
> file itself. But how can we grok what's in the file if we don't know
> what charset it's written in? It seems like chicken-and-egg to me.
The characters needed for the C language must be present in any encoding, and I
believe they must have the exact same encoding (I don't recall exactly
where in the standard this is set down, but it may be the section that
describes L"" strings). Thus for instance:
"X"[0] == L"X"[0]
--
Michael Meissner, Red Hat, Inc. (GCC group)
PMB 198, 174 Littleton Road #3, Westford, Massachusetts 01886, USA
Work: meissner@redhat.com phone: +1 978-486-9304
Non-work: meissner@spectacle-pond.org fax: +1 978-692-4482