This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: thoughts on martin's proposed patch for GCC and UTF-8

To: gcc2 at gnu dot org, egcs at cygnus dot com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Per Bothner <bothner at cygnus dot com>
Date: Mon, 21 Dec 1998 19:16:53 -0800

I too am rather leery of using #pragma locale or any other in-band
indicator of the character set.

Paul mentions the problem of converting a set of text files from one
encoding to another.  Perhaps someone in Western Europe wants
to examine a program with its documentation, but both were written
in China.  It makes sense to convert it to the local character
set first.  If the original program contains #pragma locale statements,
these have to be translated also, but expecting a character-set
translation tool to understand C syntax seems a bit much.
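
A minimal sketch of that failure mode, in Python. The pragma argument
form (`#pragma locale "zh_CN.GB2312"`) and the helper name `transcode`
are assumptions made for illustration only; this message does not spell
out the pragma's syntax. A converter that knows nothing about C
re-encodes the bytes but leaves the declaration untouched:

```python
# Hypothetical pragma syntax: the argument form is an assumption, not from the patch.
original = '#pragma locale "zh_CN.GB2312"\nchar *s = "中文";\n'.encode("gb2312")


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Generic converter: maps characters between encodings, knows no C syntax."""
    return data.decode(src).encode(dst)


converted = transcode(original, "gb2312", "utf-8")
text = converted.decode("utf-8")

# The string literal is now valid UTF-8, but the first line still claims GB2312.
pragma_is_stale = "GB2312" in text.splitlines()[0]
```

After conversion the file's contents and its in-band label disagree,
unless the tool also parses and rewrites the pragma.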

If you *don't* do the translation, all your other tools (emacs,
less, grep, etc) need to understand the #pragma locale statement,
which again seems unreasonable.

Another problem is that switching character encoding
in-band may be difficult.  Many libraries do not support it.
The Java FileReader class requires you to specify the encoding
at *open* time.  Of course there are various workarounds.
For example, you can try opening the file in UTF-8 mode,
and if you see a #pragma locale statement, re-open it in the
appropriate mode.  Still, this is not something application
programmers should have to deal with.
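
The re-open workaround might look roughly like this in Python. This is
a sketch under assumptions: the function name `read_source`, the regular
expression, and a pragma whose argument is directly usable as an
encoding name are all hypothetical (a real locale name would first have
to be mapped to an encoding).

```python
import re

# Hypothetical pragma form; the real syntax is not given in this message.
PRAGMA = re.compile(r'^\s*#\s*pragma\s+locale\s+"([^"]+)"', re.MULTILINE)


def read_source(path, first_guess="utf-8"):
    """Read a file twice: once to find the pragma, once with the declared encoding."""
    # Pass 1: decode leniently so bytes that are invalid in the guessed
    # encoding become replacement characters instead of raising.
    with open(path, encoding=first_guess, errors="replace") as f:
        head = f.read(4096)

    match = PRAGMA.search(head)
    encoding = match.group(1) if match else first_guess

    # Pass 2: re-open the file with the encoding the pragma named.
    with open(path, encoding=encoding) as f:
        return encoding, f.read()
```

Every program that reads such files would need its own copy of this
two-pass logic, which is the burden described above.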

The only general solution, I think, is for the *file system*
and/or input library to do the translation.  Preferably
each file should specify its encoding out-of-band,
just like MIME does.  As a back-up, the user should be
able to specify a default encoding (based on their locale),
and perhaps override it for individual files.
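
One way an input library could order those sources, sketched in Python.
The function `resolve_encoding`, its parameters, and the use of glob
patterns for per-file overrides are assumptions for illustration, not
part of any existing tool:

```python
import fnmatch
import locale


def resolve_encoding(path, declared=None, overrides=None, default=None):
    """Choose an encoding for `path` (hypothetical policy).

    declared:  encoding recorded out-of-band for this file, if any
    overrides: mapping of glob pattern -> encoding set by the user
    default:   user's default; falls back to the locale's encoding
    """
    if declared:
        return declared
    for pattern, encoding in (overrides or {}).items():
        if fnmatch.fnmatchcase(path, pattern):
            return encoding
    return default or locale.getpreferredencoding(False)
```

For example, `resolve_encoding("legacy/a.c", overrides={"legacy/*":
"latin-1"}, default="utf-8")` picks the per-file override, while a file
that matches no pattern gets the default.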

Still, while #pragma locale does have its problems, and
we must also support other ways for getting character
encoding information, it might still be a useful
*alternative* method for specifying the encoding.

One useful data point is that the XML specification provides
an encoding declaration to specify the character encoding in use.
See: http://www.w3.org/TR/PR-xml-971208#NT-EncodingDecl
The XML spec also includes an appendix on auto-detection:
http://www.w3.org/TR/PR-xml-971208#sec-guessing
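
The auto-detection idea is to look at the first few bytes, which are
either a byte-order mark or the start of `<?xml`, before any encoding
is known. A rough Python sketch follows; the byte patterns are the ones
I recall from the published XML 1.0 appendix and are assumed rather
than checked against the draft linked above, and `sniff_encoding` is a
hypothetical name. The return values are Python codec names.

```python
def sniff_encoding(head: bytes) -> str:
    """Guess an encoding family from the first bytes of an XML document."""
    signatures = [
        # Byte-order marks; four-byte forms first so they are not
        # mistaken for the two-byte UTF-16 marks they begin with.
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xfe\xff", "utf-16-be"),
        (b"\xff\xfe", "utf-16-le"),
        # No mark: recognise how '<' and '<?' are laid out in each encoding.
        (b"\x00\x00\x00<", "utf-32-be"),
        (b"<\x00\x00\x00", "utf-32-le"),
        (b"\x00<\x00?", "utf-16-be"),
        (b"<\x00?\x00", "utf-16-le"),
    ]
    for signature, name in signatures:
        if head.startswith(signature):
            return name
    # ASCII-compatible guess; the declaration itself must then be read
    # to find the exact encoding.
    return "utf-8"
```

This only narrows the choice enough to read the declaration; it does
not replace the declaration or an out-of-band label.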

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner

Follow-Ups:
- Re: thoughts on martin's proposed patch for GCC and UTF-8
  - From: Per Bothner
- Re: thoughts on martin's proposed patch for GCC and UTF-8
  - From: Richard Stallman

References:
- Re: thoughts on martin's proposed patch for GCC and UTF-8
  - From: Paul Eggert
