This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: thoughts on martin's proposed patch for GCC and UTF-8
- To: gcc2 at gnu dot org, egcs at cygnus dot com
- Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
- From: Per Bothner <bothner at cygnus dot com>
- Date: Mon, 21 Dec 1998 19:16:53 -0800
I too am rather leery of using #pragma locale or any other in-band
indicator of the character set.
Paul mentions the problem of converting a set of text files from one
encoding to another. Perhaps someone in Western Europe wants
to examine a program with its documentation, but both were written
in China. It makes sense to convert it to the local character
set first. If the original program contains #pragma locale statements,
these have to be translated also, but expecting a character-set
translation tool to understand C syntax seems a bit much.
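
To make the conversion problem concrete, here is a minimal sketch of what a generic re-encoding tool does to such a file. The directive spelling `#pragma locale "GB2312"` and the sample source line are assumptions made for illustration; they are not taken from the proposed patch.

```python
# Hypothetical source text; the '#pragma locale "..."' spelling is assumed.
source = '#pragma locale "GB2312"\nchar *greeting = "\u4f60\u597d";\n'
original_bytes = source.encode("gb2312")

# A generic conversion tool decodes and re-encodes without parsing C,
# so the directive text passes through untouched.
converted_bytes = original_bytes.decode("gb2312").encode("utf-8")

# The first line is pure ASCII, so it reads the same in either encoding.
first_line = converted_bytes.split(b"\n", 1)[0].decode("ascii")
print(first_line)

# A reader that trusts the stale directive now misreads the string literal.
misread = converted_bytes.decode("gb2312", errors="replace")
```

After the conversion the file is valid UTF-8, yet its first line still names GB2312, so anything that obeys the directive sees different text from what the author wrote.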
If you *don't* do the translation, all your other tools (emacs,
less, grep, etc) need to understand the #pragma locale statement,
which again seems unreasonable.
Another problem is that switching character encoding
in-band may be difficult. Many libraries do not support it.
The Java FileReader class requires you to specify the encoding
at *open* time. Of course there are various work-arounds.
For example, you can try opening the file in UTF-8 mode,
and if you see a #pragma locale statement, re-open it in the
appropriate mode. Still, this is not something application
programmers should have to deal with.
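
The re-open work-around described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the directive is taken to look like `#pragma locale "<encoding-name>"`, the named encoding is taken to be ASCII-compatible (otherwise the first pass could not find the directive at all), and `read_source` is a hypothetical helper name.

```python
import os
import re
import tempfile

# Hypothetical spelling of the directive: #pragma locale "<encoding-name>"
PRAGMA_RE = re.compile(
    r'^[ \t]*#[ \t]*pragma[ \t]+locale[ \t]+"([^"\n]+)"', re.MULTILINE
)

def read_source(path, first_guess="utf-8"):
    """Return (text, encoding_used) for the file at path."""
    # First pass: decode leniently so bytes that are invalid in the
    # guessed encoding do not abort the scan for the ASCII directive.
    with open(path, "r", encoding=first_guess, errors="replace") as f:
        match = PRAGMA_RE.search(f.read())
    encoding = match.group(1) if match else first_guess
    # Second pass: re-open in the encoding the file asked for.
    with open(path, "r", encoding=encoding) as f:
        return f.read(), encoding

# Usage: a Latin-1 file that announces its encoding in-band.
original = '#pragma locale "latin-1"\nchar *s = "caf\u00e9";\n'
with tempfile.NamedTemporaryFile("wb", suffix=".c", delete=False) as tmp:
    tmp.write(original.encode("latin-1"))
text, encoding = read_source(tmp.name)
os.remove(tmp.name)
print(encoding)
```

Every program that reads source files would need its own copy of this two-pass logic, which is the burden the paragraph above objects to.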
The only general solution, I think, is for the *file system*
and/or input library to do the translation. Preferably
each file should specify its encoding out-of-band,
just like MIME does. As a back-up, the user should be
able to specify a default encoding (based on their locale),
and perhaps override it for individual files.
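
One way an input library might combine these three sources of information is sketched below. The function name `resolve_encoding`, its parameters, and the precedence order (per-file user override first, then out-of-band metadata, then the locale default) are all assumptions made for illustration; the text above does not fix an order.

```python
import locale

def resolve_encoding(path, declared=None, overrides=None, default=None):
    """Pick an encoding for path without looking inside the file.

    declared:  mapping of path -> encoding recorded out-of-band,
               e.g. by the file system or a MIME-style header.
    overrides: mapping of path -> encoding chosen by the user.
    default:   user default; falls back to the locale's encoding.
    """
    if overrides and path in overrides:
        return overrides[path]
    if declared and path in declared:
        return declared[path]
    return default or locale.getpreferredencoding(False)

# Usage with hypothetical file names.
declared = {"hello.c": "gb2312"}
overrides = {"legacy.c": "latin-1"}
print(resolve_encoding("hello.c", declared, overrides, default="utf-8"))
print(resolve_encoding("legacy.c", declared, overrides, default="utf-8"))
print(resolve_encoding("other.c", declared, overrides, default="utf-8"))
```

Because nothing here reads the file contents, a character-set conversion tool only has to update the out-of-band record and never needs to understand C syntax.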
Still, while #pragma locale does have its problems, and
we must also support other ways for getting character
encoding information, it might still be a useful
*alternative* method for specifying the encoding.
One useful data point is that the XML specification provides
a declaration for specifying the character encoding in use.
See: http://www.w3.org/TR/PR-xml-971208#NT-EncodingDecl
The XML spec also includes an appendix on auto-detection:
http://www.w3.org/TR/PR-xml-971208#sec-guessing
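
As a rough illustration of what first-bytes auto-detection looks like, the sketch below checks the start of a file for a byte-order mark or for the opening characters `<?` in a 16-bit or ASCII-compatible form. The signature table is a simplified assumption written from general knowledge of byte-order marks, not a copy of the table in the appendix cited above, and it leaves out 32-bit and EBCDIC encodings; `guess_encoding` is a hypothetical name.

```python
# (leading bytes, Python codec name), checked in order.
SIGNATURES = [
    (b"\xef\xbb\xbf", "utf-8"),      # UTF-8 byte-order mark
    (b"\xfe\xff", "utf-16-be"),      # UTF-16 BOM, big-endian
    (b"\xff\xfe", "utf-16-le"),      # UTF-16 BOM, little-endian
    (b"\x00<\x00?", "utf-16-be"),    # "<?" in 16-bit units, no BOM
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "utf-8"),              # ASCII-compatible; declaration may refine
]

def guess_encoding(head):
    """Guess an encoding from the first few bytes, or return None."""
    for signature, name in SIGNATURES:
        if head.startswith(signature):
            return name
    return None

print(guess_encoding('<?xml version="1.0"?>'.encode("utf-16-le")))
print(guess_encoding(b"int main(void) { return 0; }"))
```

The second call returns None: ordinary C source has no fixed opening sequence to key on, so this kind of guessing helps much less than it does for XML.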
--Per Bothner
Cygnus Solutions bothner@cygnus.com http://www.cygnus.com/~bothner