java.lang.Character

Eric Blake ebb9@email.byu.edu
Wed Feb 20 23:35:00 GMT 2002


Per Bothner wrote:
> No, I will not approva that either.  The only solution I will accept
> is one that is no worse than the current solution.  That probably
> means using constant C-style tables.
> 
> I apologize, I should have caught this and spoken up earlier.  I got
> the impression we were talking about updating replacing the scripts to
> generate the tables.  Perhaps it would make sense to extend/modify
> the scripts so they generate C-style native arrays like we currently do?
> Could I ask you to take a look at that?

Yes, I will look at retooling Character.java and chartables.pl to use
the same algorithms as the Classpath counterparts, but using
java-chartables.h for storing the table as C arrays.  The Classpath
implementation DOES have the algorithmic advantages: it is fully
compliant with the 1.4 specs (gcj is currently broken in several
methods, such as isJavaLetter()), and its concept of a direct 2-level
fixed-block-size lookup is better than gcj's binary search (maximum 10
comparisons for Unicode 3.0) followed by a 1-level variable-block-size
lookup.  Besides, Classpath has Javadoc, which gcj lacks.

So, my end target is to have the end algorithms and behavior match,
although the database interface will differ between the two projects. 
Merging is definately a chore ;)

Actually, I do have one more question on efficiency:
In the Classpath version, the 2nd level lookup occupies up 5800+
characters, although there are about only 350 unique character
attributes (so 9 bits are required to index into a 3rd level).  As a
result, there are four 3rd level tables for the attributes that did not
fit in the 7 otherwise unused bits of the 2nd level lookup.

Is it better to have multiple 2nd level tables, at the expense of more
memory; or to use 3rd level tables, at the cost of another level of
array lookups?

For example, in Classpath, getType(char) is a 2-level lookup
  return 0x1F & data[(char) (blocks[ch >> shift] + ch)];

while toUpperCase(char) is 3-level, but uses the same first two tables
  return (char) (ch + upper[data[(char) (blocks[ch >> shift] + ch)] >>
7]);

In my brief glance at natCharacter.cc, it looks like both of these
methods are "2-level" lookups, first a binary search to find an index in
the correct first-level table, then a single lookup in the second
level.  But with 2 distinct tables for each attribute, gcj is using
quite a bit of memory to save the cost of one more array lookup.

> 
> > Where is java-chardata.h?

Ahh, it's named java-chartables.h; that's why I couldn't find it.

-- 
This signature intentionally left boring.

Eric Blake             ebb9@email.byu.edu
  BYU student, free software programmer



More information about the Java-patches mailing list