This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: Universal Character Names, v2

From: Keld Jørn Simonsen <keld at dkuug dot dk>
To: "Joseph S. Myers" <jsm28 at cam dot ac dot uk>
Cc: Zack Weinberg <zack at codesourcery dot com>, Neil Booth <neil at daikokuya dot co dot uk>, "Martin v. Löwis" <martin at v dot loewis dot de>, gcc-patches at gcc dot gnu dot org, java at gcc dot gnu dot org
Date: Thu, 5 Dec 2002 01:07:50 +0100
Subject: Re: Universal Character Names, v2
References: <87u1hxbe0z.fsf@egil.codesourcery.com> <Pine.LNX.4.33.0212012351040.31807-100000@kern.srcf.societies.cam.ac.uk>

On Mon, Dec 02, 2002 at 12:32:19AM +0000, Joseph S. Myers wrote:
> On Sun, 1 Dec 2002, Zack Weinberg wrote:
> 
> > 3. ISO 10646 (Unicode) is updated more frequently than ISO 9899 (C)
> > and ISO 14882 (C++).  It is reasonable to expect that future revisions
> > of the latter two standards will augment the lists of acceptable code
> > points as further identifier characters are added to Unicode.  As a
> > convenience to our users, we should accept all the plausible
> > identifier characters in the latest revision of Unicode, not just
> > whatever revision was current the last time C or C++ was revised.
> 
> ISO 10646 is updated frequently.  Unicode is updated frequently.  They are
> not the same standard.  If a list of code points is taken from a third
> document, that should rather be ISO/IEC TR 10176, the document used in
> C99, which is also updated from time to time (the current draft for the
> 4th edition apparently being
> <http://std.dkuug.dk/JTC1/SC22/WG20/docs/n970-tr10176-2002.pdf>), and the
> main table should be used (in the spirit of the choices made by C99 and
> C++98), not the supplementary table requiring normalization.  This draft
> points out that there are cases of different (normalized) characters that
> look the same, e.g. LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA and
> CYRILLIC CAPITAL LETTER A.  Users may be confused by such cases, but no
> reasonable normalization can solve that problem.

You could also use ISO/IEC TR 14652, which specifies the TR 10176
list of identifier characters in a POSIX-like locale format. glibc
uses this format, and the classes alpha and digit in the 14652
standard locale "i18n" are a specification of the TR 10176
recommendations. So you could use that locale for the compilation,
and call iswalpha() and iswdigit() to test for valid characters in
identifiers.
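
A minimal sketch of that idea in C, not taken from any patch in this
thread. The function names are hypothetical, and accepting '_' as an
identifier character is an assumption added here, since it is in neither
the alpha nor the digit class. Whether a locale built from "i18n" is
installed, and under which name, depends on the system, so the locale
name is left to the caller.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: select the LC_CTYPE locale whose alpha and digit
   classes should define identifier characters.  The name is supplied by
   the caller; returns 1 on success, 0 if the locale is unavailable.  */
int
select_identifier_locale (const char *name)
{
  return setlocale (LC_CTYPE, name) != NULL;
}

/* A character may start an identifier if it is '_' (assumption) or is
   in the current locale's alpha class.  */
int
is_identifier_start (wchar_t c)
{
  return c == L'_' || iswalpha ((wint_t) c);
}

/* Later characters may additionally come from the digit class.  */
int
is_identifier_part (wchar_t c)
{
  return is_identifier_start (c) || iswdigit ((wint_t) c);
}

/* Return 1 if the wide string S is a non-empty identifier under the
   current LC_CTYPE locale, 0 otherwise.  */
int
is_valid_identifier (const wchar_t *s)
{
  assert (s != NULL);
  if (!is_identifier_start (s[0]))
    return 0;
  for (size_t i = 1; s[i] != L'\0'; i++)
    if (!is_identifier_part (s[i]))
      return 0;
  return 1;
}
```

If select_identifier_locale() is never called, the program stays in the
"C" locale, where glibc's alpha class covers only the ASCII letters.
Universal character names would have to be converted to wchar_t values
before being passed to these functions.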

That code would be somewhat faster and smaller than Martin's UCN
patch.

Please note that the glibc "i18n" locale differs from the
standard TR 14652 locale in the alpha and digit classes.
This was done to conform to C99, and the repertoire was
also extended in places.

The latest version of TR 14652 is available at:
http://std.dkuug.dk/JTC1/SC22/WG20/docs/n972-14652ft.pdf

Best regards
keld

Follow-Ups:
- Re: Universal Character Names, v2
  - From: Martin v. Löwis

References:
- Re: Universal Character Names, v2
  - From: Zack Weinberg
- Re: Universal Character Names, v2
  - From: Joseph S. Myers
