This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Windows Unicode and GCC


> I think that CPP should try to determine the encoding for each file
> and not use a single encoding for every file.  It should look for
> a unicode header when it opens a file (original c source or any
> include), and if it doesn't find one, use the default: -finput-charset,
> LC_CTYPE, UTF-8, until it's done processing that file.  Note that
> vim reads files saved with unicode headers without problem.

This is a desired feature, but one that no one has ever had time
to implement.  If you implement it, I can critique it until it is
ready for inclusion.  [Editors that put a BOM on files in UTF-8
are in error, but it is a common error so it should be accepted
gracefully.  And, of course, it is supposed to be there on a UTF-16
or UTF-32 file.]
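
To make the BOM check concrete, here is a minimal sketch of what the
detection could look like.  The names (bom_kind, detect_bom) are made up
for illustration and are not taken from libcpp; the byte sequences are the
standard encodings of U+FEFF.

```c
#include <stddef.h>

/* Hypothetical names for illustration; nothing here exists in libcpp.  */
enum bom_kind
{
  BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE
};

/* Classify the leading bytes of BUF (LEN bytes long).  Stores the number
   of bytes occupied by the BOM in *BOM_LEN (0 if there is none).  */
static enum bom_kind
detect_bom (const unsigned char *buf, size_t len, size_t *bom_len)
{
  *bom_len = 0;

  /* UTF-32 must be tested before UTF-16: the UTF-32LE BOM FF FE 00 00
     begins with the UTF-16LE BOM FF FE.  Treating FF FE 00 00 as UTF-32
     assumes no UTF-16 source file starts with U+0000 after its BOM.  */
  if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00
      && buf[2] == 0xFE && buf[3] == 0xFF)
    {
      *bom_len = 4;
      return BOM_UTF32_BE;
    }
  if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE
      && buf[2] == 0x00 && buf[3] == 0x00)
    {
      *bom_len = 4;
      return BOM_UTF32_LE;
    }
  if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
    {
      *bom_len = 3;
      return BOM_UTF8;
    }
  if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
    {
      *bom_len = 2;
      return BOM_UTF16_BE;
    }
  if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
    {
      *bom_len = 2;
      return BOM_UTF16_LE;
    }
  return BOM_NONE;
}
```

A caller would skip *bom_len bytes and select the converter from the
result, falling back to the configured default when it gets BOM_NONE.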

Note that GCC should not be limited to looking for the Unicode
"byte order mark".  It should recognize and handle all other
reasonable in-band annotations of the file encoding.  Examples are
Emacs' -*- marker in a comment on the first line and (rather more
complicated) "Local Variables:" marker near the end of the file;
other editors have similar, but of course incompatible, conventions
(I know Vim has one but I don't know what it looks like).  It would
also be good to take advantage of the fact that 95+% of C source
files start with "/*", "//", "#i", or "#d" to distinguish ASCII
from EBCDIC.  (This is in fact necessary in order to have any hope
of detecting and processing an editor's code page marker in an EBCDIC
source file.)

You should have read and fully understood the long comment near the top
of libcpp/charset.c, and the sections of the C standard that it refers
to, before you attempt to code this.

It may be necessary to import GNU iconv to the source tree in order to
gain reliable handling of non-Unicode encodings.  This should not be
hard but has to be run by the steering committee.
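
For reference, this is roughly what a conversion through the POSIX iconv
interface looks like.  The wrapper name to_utf8 is invented for this
example and the sketch converts a whole buffer in one call; it says
nothing about how libcpp should structure its own conversion code.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert IN (IN_LEN bytes, encoded in FROM_CHARSET)
   to UTF-8 in OUT (capacity OUT_SIZE).  Stores the number of bytes
   written in *OUT_LEN.  Returns 0 on success, -1 if the charset is not
   supported, the input is invalid or truncated, or OUT is too small.  */
static int
to_utf8 (const char *from_charset, const char *in, size_t in_len,
         char *out, size_t out_size, size_t *out_len)
{
  iconv_t cd = iconv_open ("UTF-8", from_charset);
  /* iconv takes a non-const pointer but does not modify the input.  */
  char *inp = (char *) in;
  char *outp = out;
  size_t in_left = in_len;
  size_t out_left = out_size;
  size_t rc;

  if (cd == (iconv_t) -1)
    return -1;
  rc = iconv (cd, &inp, &in_left, &outp, &out_left);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return -1;
  *out_len = out_size - out_left;
  return 0;
}
```

Which charset names iconv_open accepts depends on the system's iconv
implementation, which is the reliability concern behind importing a known
one.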

zw

