This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Merge cpplib and front end hashtables, part 1


On Thu, May 17, 2001 at 07:41:03AM +0100, Neil Booth wrote:
> Zack Weinberg wrote:-
> 
> > I put a brain dump on charset handling into the cpplib projects web
> > page.  It remains a pretty good statement of what I think our end goal
> > should be in terms of user-visible behavior.  It'd be reasonable to do
> > a subset of this stuff to begin with, then get better as things go on.
> 
> I don't understand how the user is going to communicate the encoding
> of a file to us.  My understanding is charset encoding is on a
> per-file basis; i.e. there is only one encoding per file.

No, it depends on the current locale as set by setlocale.  I.e., you could
call setlocale once, open a file and read it via the mb functions, then close
the file, call setlocale with a different locale, open the exact same file,
and get a different set of multibyte characters.
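
To make the locale dependence concrete, here is a minimal sketch in C.  The
helper name decode_under_locale is hypothetical (it is not from cpplib or
GCC); it only uses the standard setlocale and mbstowcs calls, and it assumes
the caller passes a locale name that may or may not be installed on the
system.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: decode `bytes` under the named locale.
   Returns the number of wide characters produced, or -1 if the locale
   is unavailable or the bytes are not valid in that locale's encoding. */
long decode_under_locale(const char *locale_name, const char *bytes,
                         wchar_t *out, size_t out_len)
{
    assert(bytes != NULL && out != NULL);

    /* Select the locale; LC_CTYPE governs multibyte conversion. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    /* mbstowcs interprets the bytes according to the current locale. */
    size_t n = mbstowcs(out, bytes, out_len);

    /* Restore the default "C" locale so later calls start clean. */
    setlocale(LC_CTYPE, "C");

    return n == (size_t)-1 ? -1 : (long)n;
}
```

Calling this twice on the same byte string, once with "C" and once with a
UTF-8 locale name (which one is installed varies by system), can give
different character counts or a conversion failure, which is the behavior
described above.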

> But since some header files are system header files, clearly the whole
> translation unit cannot be in a single charset.

Ummm, I know it is currently late at night for me, but for C89, IIRC, it was
the intention of the committee that the entire translation unit be in a single
charset and that the compiler does the equivalent of setlocale (LC_ALL, "").
Certainly the way I read the first stage of translation in C99's 5.1.1.2, the
compiler does logically translate everything into the source character set.

	5.1.1.2 Translation phases

	The precedence among the syntax rules of translation is specified by
	the following phases [5]

	    1.	Physical source file multibyte characters are mapped to the
		source character set (introducing new-line characters for
		end-of-line indicators) if necessary.  Trigraph sequences are
		replaced by corresponding single-character internal
		representations.
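
The trigraph half of that phase is simple enough to sketch.  What follows is
an illustration only, not cpplib's implementation: the function name
replace_trigraphs is hypothetical, and it assumes the input is already a
NUL-terminated string in the source character set.  The end-of-line mapping
is not shown.

```c
#include <stddef.h>
#include <string.h>

/* Replace trigraphs in `in`, writing the result to `out`.
   `out` must have room for strlen(in) + 1 bytes, since the result
   never grows.  Returns the length of the result. */
size_t replace_trigraphs(const char *in, char *out)
{
    /* Third character of each trigraph and the character it stands for. */
    static const char third[]  = "=(/)'<!>-";
    static const char single[] = "#[\\]^{|}~";
    size_t o = 0;

    for (size_t i = 0; in[i] != '\0'; ) {
        const char *hit = NULL;

        /* A trigraph is two question marks plus one of the nine
           characters listed in `third`. */
        if (in[i] == '?' && in[i + 1] == '?' && in[i + 2] != '\0')
            hit = strchr(third, in[i + 2]);

        if (hit != NULL) {
            out[o++] = single[hit - third];
            i += 3;
        } else {
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
    return o;
}
```

Note that the scan advances one character at a time when no trigraph
matches, so three question marks followed by `=` yield `?#` rather than
being skipped.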

> So we need a way to specify it on a per-file basis, presumably in the
> file itself.  But how can we grok what's in the file if we don't know
> what charset it's written in?  It seems like chicken-and-egg to me.

The characters needed for the C language must be present in any encoding, and I
believe they must have the exact same encoding in each (I don't recall exactly
where in the standard this is set down, but it may be the section that
describes L"" strings).  Thus for instance:

	"X"[0] == L"X"[0]
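
A small check along those lines, as a sketch: the function name
count_narrow_wide_mismatches is hypothetical, and the result is a property
of the particular compiler and target, not something this code guarantees.
Later revisions of the C standard let an implementation define
__STDC_MB_MIGHT_NEQ_WC__ to say the equality need not hold, so treat a
nonzero result as possible.

```c
#include <stddef.h>
#include <wchar.h>

/* Count basic characters whose narrow and wide literal values differ. */
size_t count_narrow_wide_mismatches(void)
{
    static const char narrow[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
        "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~ ";
    static const wchar_t wide[] =
        L"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
        L"!\"#%&'()*+,-./:;<=>?[\\]^_{|}~ ";
    size_t mismatches = 0;

    /* Same comparison as the expression above, applied to every
       character in the list. */
    for (size_t i = 0; narrow[i] != '\0'; i++)
        if ((wchar_t)narrow[i] != wide[i])
            mismatches++;
    return mismatches;
}
```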

-- 
Michael Meissner, Red Hat, Inc.  (GCC group)
PMB 198, 174 Littleton Road #3, Westford, Massachusetts 01886, USA
Work:	  meissner@redhat.com		phone: +1 978-486-9304
Non-work: meissner@spectacle-pond.org	fax:   +1 978-692-4482

