This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Query on UTF-32 encodings for letters


On Sat, 15 Jan 2005, Robert Dewar wrote:

> Well I really don't understand the implementation of iswalpha. For
> example, it yields false for "FEMININE ORDINAL INDICATOR" (16#AA#)
> even though the definition in the database is:

glibc's iswalpha works for me, provided the program has called setlocale 
before iswalpha and is running under a suitable locale whose definition 
copies the i18n file's LC_CTYPE data (e.g. en_GB.UTF-8, not C / POSIX).
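The setlocale point can be checked with a small helper. This is only a sketch: is_alpha_in_locale is a hypothetical name, not part of glibc, and whether en_GB.UTF-8 is installed on a given system is an assumption, so the helper reports an unavailable locale separately instead of guessing.

```c
#include <locale.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not a glibc function): select the LC_CTYPE
   category of the named locale, then classify WC with iswalpha.
   Returns -1 if the locale cannot be selected, otherwise 1 if WC is
   alphabetic in that locale and 0 if it is not.  */
static int
is_alpha_in_locale (const char *locale_name, wint_t wc)
{
  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return -1;
  return iswalpha (wc) != 0;
}
```

Calling is_alpha_in_locale ("en_GB.UTF-8", 0x00AA) and then is_alpha_in_locale ("C", 0x00AA) on the same system shows whether the locale's LC_CTYPE data, rather than iswalpha itself, accounts for a false result.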

> At first, it looked to me like it was just testing LETTER in the
> name of the symbol, but that is disproved by:

That is one thing gen-unicode-ctype.c looks at in addition to the 
general category.  To quote from CVS glibc, localedata/gen-unicode-ctype.c:

  return (unicode_attributes[ch].name != NULL
	  && ((unicode_attributes[ch].category[0] == 'L'
	       /* Theppitak Karoonboonyanan <thep@links.nectec.or.th> says
		  <U0E2F>, <U0E46> should belong to is_punct.  */
	       && (ch != 0x0E2F) && (ch != 0x0E46))
	      /* Theppitak Karoonboonyanan <thep@links.nectec.or.th> says
		 <U0E31>, <U0E34>..<U0E3A>, <U0E47>..<U0E4E> are is_alpha.  */
	      || (ch == 0x0E31)
	      || (ch >= 0x0E34 && ch <= 0x0E3A)
	      || (ch >= 0x0E47 && ch <= 0x0E4E)
	      /* Avoid warning for <U0345>.  */
	      || (ch == 0x0345)
	      /* Avoid warnings for <U2160>..<U217F>.  */
	      || (unicode_attributes[ch].category[0] == 'N'
		  && unicode_attributes[ch].category[1] == 'l')
	      /* Avoid warnings for <U24B6>..<U24E9>.  */
	      || (unicode_attributes[ch].category[0] == 'S'
		  && unicode_attributes[ch].category[1] == 'o'
		  && strstr (unicode_attributes[ch].name, " LETTER ")
		     != NULL)
	      /* Consider all the non-ASCII digits as alphabetic.
		 ISO C 99 forbids us to have them in category "digit",
		 but we want iswalnum to return true on them.  */
	      || (unicode_attributes[ch].category[0] == 'N'
		  && unicode_attributes[ch].category[1] == 'd'
		  && !(ch >= 0x0030 && ch <= 0x0039))));
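To see how the category test and the " LETTER " name test interact, here is a simplified, self-contained sketch of the same logic. It is not the glibc code: struct ucd_entry, sample, sketch_is_alpha and sketch_lookup are hypothetical names, the table is a hand-picked handful of entries rather than the real database, the categories are those of a current Unicode Character Database, and the explicit cases for the Thai marks and <U0345> are left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of the unicode_attributes table.  */
struct ucd_entry
{
  unsigned int ch;
  const char *name;
  const char *category;
};

static const struct ucd_entry sample[] = {
  { 0x0031, "DIGIT ONE", "Nd" },
  { 0x00A9, "COPYRIGHT SIGN", "So" },
  { 0x00AA, "FEMININE ORDINAL INDICATOR", "Lo" },
  { 0x0661, "ARABIC-INDIC DIGIT ONE", "Nd" },
  { 0x0E2F, "THAI CHARACTER PAIYANNOI", "Lo" },
  { 0x2160, "ROMAN NUMERAL ONE", "Nl" },
  { 0x24B6, "CIRCLED LATIN CAPITAL LETTER A", "So" },
};

/* Simplified version of the quoted test.  */
static int
sketch_is_alpha (const struct ucd_entry *e)
{
  const char *c = e->category;

  if (e->name == NULL)
    return 0;
  /* Any letter category, minus the two Thai exceptions.  */
  if (c[0] == 'L')
    return e->ch != 0x0E2F && e->ch != 0x0E46;
  /* Letter numbers such as the Roman numerals.  */
  if (c[0] == 'N' && c[1] == 'l')
    return 1;
  /* "Other symbol" counts only if the name contains " LETTER ".  */
  if (c[0] == 'S' && c[1] == 'o')
    return strstr (e->name, " LETTER ") != NULL;
  /* Decimal digits other than ASCII 0-9.  */
  if (c[0] == 'N' && c[1] == 'd')
    return !(e->ch >= 0x0030 && e->ch <= 0x0039);
  return 0;
}

/* Returns -1 if CH is not in the sample table, else 0 or 1.  */
static int
sketch_lookup (unsigned int ch)
{
  size_t i;

  for (i = 0; i < sizeof sample / sizeof sample[0]; i++)
    if (sample[i].ch == ch)
      return sketch_is_alpha (&sample[i]);
  return -1;
}
```

In this sketch 16#AA# is alphabetic purely because of its category, with no " LETTER " in its name, while the circled letter at 16#24B6# is alphabetic only because of its name.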

If what you require is a specific definition in terms of (maybe a specific 
version of) the Unicode Character database rather than something 
locale-dependent and so system-dependent, then indeed the system library 
may be unsuitable.
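One locale-independent alternative is to carry a fixed table in the program itself. The following is only a sketch under stated assumptions: letter_ranges and is_letter_fixed are hypothetical names, and the table is a hand-written excerpt listing just the letters (general category L) in the range 0..16#FF#. A real table would be generated from whichever version of the database is required.

```c
#include <stddef.h>

/* Hypothetical excerpt: code points in 0x00..0xFF whose general
   category is a letter category.  Sorted and non-overlapping, so it
   can be searched by bisection.  */
struct cp_range
{
  unsigned long first;
  unsigned long last;
};

static const struct cp_range letter_ranges[] = {
  { 0x0041, 0x005A },
  { 0x0061, 0x007A },
  { 0x00AA, 0x00AA },
  { 0x00B5, 0x00B5 },
  { 0x00BA, 0x00BA },
  { 0x00C0, 0x00D6 },
  { 0x00D8, 0x00F6 },
  { 0x00F8, 0x00FF },
};

/* Binary search over the fixed table; no setlocale call and no
   dependence on the system's locale data.  Code points above 0xFF
   are simply absent from this excerpt and so yield 0.  */
static int
is_letter_fixed (unsigned long cp)
{
  size_t lo = 0;
  size_t hi = sizeof letter_ranges / sizeof letter_ranges[0];

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (cp < letter_ranges[mid].first)
	hi = mid;
      else if (cp > letter_ranges[mid].last)
	lo = mid + 1;
      else
	return 1;
    }
  return 0;
}
```

The result for a given code point is then the same on every system, at the cost of having to regenerate the table when a different database version is wanted.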

-- 
Joseph S. Myers               http://www.srcf.ucam.org/~jsm28/gcc/
    jsm@polyomino.org.uk (personal mail)
    joseph@codesourcery.com (CodeSourcery mail)
    jsm28@gcc.gnu.org (Bugzilla assignments and CCs)

