This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Merge cpplib and front end hashtables, part 1


On Thu, May 17, 2001 at 03:11:57AM -0400, Michael Meissner wrote:
> On Thu, May 17, 2001 at 07:41:03AM +0100, Neil Booth wrote:
> > 
> > I don't understand how the user is going to communicate the encoding
> > of a file to us.  My understanding is charset encoding is on a
> > per-file basis; i.e. there is only one encoding per file.
> 
> No, it is dependent on the current locale as set by setlocale.  I.e., you
> could do one setlocale, open a file and read it via the mb functions, then
> close the file, do a different setlocale, open the exact same file, and get
> a different set of multibyte characters.

That's the model assumed by the standard, but it is not adequate for
our purposes, on three counts:

- There is no well-defined map between character set (encodings) and
  setlocale strings.
- The meaning of a source file must not depend on random environment
  variables set by the user who runs the compiler.
- Different source files in different encodings *will* be presented in
  the same translation unit, and we must cope with that.

I don't think it's feasible for us to use the C89 or even C99
multibyte primitives at all.  They are too underspecified to be of
use.  iconv has approximately the right interface, although it's
geared to bulk conversion, not char-by-char processing, and is
therefore too heavyweight ... but then, so are the <wchar.h>
primitives.  Bleah.

> > But since some header files are system header files, clearly the whole
> > translation unit cannot be in a single charset.
> 
> Ummm, I know it is currently late at night for me, but for C89, IIRC, it was
> the intention of the committee that the entire translation unit be in a single
> charset and that the compiler does the equivalent of setlocale (LC_ALL, "").
> Certainly the way I read the first stage of translation in C99's 5.1.1.2, the
> compiler does logically translate everything into the source character set.
> 
> 	5.1.1.2 Translation phases
> 
> 	The precedence among the syntax rules of translation is specified by
> 	the following phases [5]
> 
> 	    1.	Physical source file multibyte characters are mapped to the
> 		source character set (introducing new-line characters for
> 		end-of-line indicators) if necessary.  Trigraph sequences are
> 		replaced by corresponding single-character internal
> 		representations.

Depends how you read this... I see no requirement that all the
physical source files use the same encoding, and the "source character
set" can be just the logical union of all the character sets used in
the source files.

As a practical matter, we most certainly will be handed a primary
source file encoded (say) in KOI8-R, which includes a third-party
header encoded in EUC-JP, plus system headers which keep to ASCII.
Even if all the non-comment text is ASCII, we still have to deal with
multiple encodings in the same translation unit.

> > So we need a way to specify it on a per-file basis, presumably in the
> > file itself.  But how can we grok what's in the file if we don't know
> > what charset it's written in?  It seems like chicken-and-egg to me.
> 
> The characters needed for the C language must be present in any
> encoding, and I believe they must have the exact same encoding
> (though I don't recall exactly where in the standard this is set
> down; it may be the section that describes L"" strings).
> Thus for instance:
> 
> 	"X"[0] == L"X"[0]

That would be a property of the execution character set, though,
wouldn't it?

-- 
zw                I'm on a spaceship full of college students.
                  	-- Martin "PCHammer" Rose

