Universal Character Names, v2

Neil Booth neil@daikokuya.co.uk
Sat Nov 30 05:00:00 GMT 2002

Martin v. Löwis wrote:-

> > I suggest this should only be a warning (it could be -S with the
> > output used on a different assembler, or for some other purpose),
> > only be emitted once per translation unit, and be moved to c_lex().
> It was an explicit request to have that kind of determination, for
> Java compatibility.
> The Java compiler requires UCN support on all platforms (and has
> mangling to do so), but also requires C++ compatibility on systems
> where C++ supports UCNs.
> So if I assume that C++ can use UTF-8 everywhere, the Java compiler
> will break on systems where no suitable assembler is available.

I don't see how my request is affected by this.

> The code isn't actually duplicated. In one case, the results are
> written to a FILE*, in the other case, they are written to a char*
> buffer. How can I unify those two?

Write the buffer to the FILE *?  It may not be an improvement.
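
As a rough illustration of that idea, the sketch below formats into a char buffer in one place and lets the FILE * path reuse it. The names format_ucn and write_ucn are hypothetical, and the output spelling (\uXXXX or \UXXXXXXXX) is an assumption made for the example, not the format the patch under review emits.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: spell VALUE as a UCN into OUT, which must hold
   at least 11 bytes (backslash, 'U', 8 digits, NUL).  The \uXXXX /
   \UXXXXXXXX spelling is an assumption for illustration only.
   Returns the number of characters written, excluding the NUL.  */
static int
format_ucn (char *out, size_t size, unsigned long value)
{
  assert (size >= 11 && value <= 0xFFFFFFFFUL);
  if (value <= 0xFFFFUL)
    return snprintf (out, size, "\\u%04lx", value);
  return snprintf (out, size, "\\U%08lx", value);
}

/* Hypothetical FILE * variant: format into a local buffer, then write
   that buffer out, so the formatting logic exists only once.
   Returns the number of characters written, or -1 on a short write.  */
static int
write_ucn (FILE *fp, unsigned long value)
{
  char buf[11];
  int n = format_ucn (buf, sizeof buf, value);
  return fwrite (buf, 1, (size_t) n, fp) == (size_t) n ? n : -1;
}
```

The cost is one extra copy through a small stack buffer per UCN, which is why this may or may not count as an improvement.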

> > Can I suggest that, instead of doing this, you have a routine that
> > reads a UCS's digits (4 or 8) into a uchar[8] buffer, and that you
> > re-use maybe_read_ucs() on this buffer?  maybe_read_ucs() might
> > need a few small tweaks.  Again, this would avoid duplication.
> I can try, but I doubt it saves much duplication. Instead of

Why not

/* ndigits is 4 for \u and 8 for \U.  */
for (len = 0; len < ndigits; len++)
  {
    c = get_effective_char (pfile);
    if (c == EOF || !ISXDIGIT (c))
      {
        BACKUP ();
        break;
      }
    buf[len] = c;
  }
/* maybe_read_ucs handles diagnostics.  */
temp = buf;
maybe_read_ucs (pfile, &temp, buf + len, &val);

where VAL contains what the routine expects.  You might be able to
find a better way by modifying maybe_read_ucs somehow; or by breaking
out most of it into a common subroutine that both use.
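
A self-contained sketch of the "common subroutine" variant follows, for illustration only. It does not use the real cpplib interfaces: struct reader, next_char, parse_ucs_digits and read_ucn are all hypothetical stand-ins, with next_char playing the role of get_effective_char and parse_ucs_digits the role of the shared validation that maybe_read_ucs would perform. It also assumes an ASCII-compatible execution character set.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the lexer's input state.  */
struct reader { const char *cur; };

/* Stand-in for get_effective_char: next character, or -1 at end.  */
static int
next_char (struct reader *r)
{
  return *r->cur ? (unsigned char) *r->cur++ : -1;
}

/* Stand-in for the shared routine: validate and convert the digits in
   [*pstr, limit).  Returns 0 on success, nonzero if the digit count is
   not WANT or a non-hex character is present.  A real implementation
   would issue the diagnostics here.  */
static int
parse_ucs_digits (const unsigned char **pstr, const unsigned char *limit,
                  size_t want, unsigned long *val)
{
  const unsigned char *p = *pstr;
  unsigned long v = 0;

  if ((size_t) (limit - p) != want)
    return 1;
  for (; p < limit; p++)
    {
      if (!isxdigit (*p))
        return 1;
      v = v * 16 + (isdigit (*p) ? *p - '0' : tolower (*p) - 'a' + 10);
    }
  *pstr = p;
  *val = v;
  return 0;
}

/* Collect up to WANT digits (4 for \u, 8 for \U) into a small buffer,
   then let the shared routine do all validation.  */
static int
read_ucn (struct reader *r, size_t want, unsigned long *val)
{
  unsigned char buf[8];
  const unsigned char *temp = buf;
  size_t len;

  assert (want == 4 || want == 8);
  for (len = 0; len < want; len++)
    {
      int c = next_char (r);
      if (c == -1 || !isxdigit (c))
        {
          if (c != -1)
            r->cur--;           /* Back up over the non-digit.  */
          break;
        }
      buf[len] = (unsigned char) c;
    }
  return parse_ucs_digits (&temp, buf + len, want, val);
}
```

With input "00e9xyz" and want 4, read_ucn stores 0xE9 and leaves the reader at 'x'; with "12g4" it reports an error and leaves the reader at 'g'.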

If we find \U in a file, we should assume it is a UCN.  There is little
use for \ as a separate token if followed by U.  If there is a syntax
error in a UCN with an invalid char, there is no obviously right thing to
do to recover; certainly I don't think backing up to the \U and making
two tokens out of it is a good idea.  It might not even be worth the
XDIGIT check above; let maybe_read_ucs handle it, and if there is an
error, don't add any UCS to the identifier and stop lexing the identifier.
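
The recovery policy described above could look roughly like the following sketch. It is a simplified, hypothetical identifier scanner over a NUL-terminated string, not the c_lex() or cpplib code: struct ident_result, hex_run_ok and lex_identifier are invented names, the rule that an identifier cannot start with a digit is ignored, and no check is made that a UCN names a character permitted in identifiers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result of scanning one identifier.  */
struct ident_result
{
  size_t consumed;   /* Bytes of input that form the identifier.  */
  size_t ucn_count;  /* Well-formed UCNs accepted into it.  */
  int had_error;     /* Nonzero if a malformed UCN ended the scan.  */
};

/* Return nonzero if P starts with WANT hex digits.  Stops at the
   first non-digit, so it never reads past the terminating NUL.  */
static int
hex_run_ok (const char *p, size_t want)
{
  size_t i;
  for (i = 0; i < want; i++)
    if (!isxdigit ((unsigned char) p[i]))
      return 0;
  return 1;
}

static struct ident_result
lex_identifier (const char *s)
{
  struct ident_result res = { 0, 0, 0 };
  const char *p = s;

  assert (s != NULL);
  for (;;)
    {
      if (isalnum ((unsigned char) *p) || *p == '_')
        p++;
      else if (p[0] == '\\' && (p[1] == 'u' || p[1] == 'U'))
        {
          /* A backslash followed by u or U is always taken to start
             a UCN; it is never re-lexed as two separate tokens.  */
          size_t want = p[1] == 'u' ? 4 : 8;
          if (!hex_run_ok (p + 2, want))
            {
              /* Malformed: diagnose, add nothing to the identifier,
                 and stop lexing it.  */
              res.had_error = 1;
              break;
            }
          p += 2 + want;
          res.ucn_count++;
        }
      else
        break;
    }
  res.consumed = (size_t) (p - s);
  return res;
}
```

How the lexer resumes after the error, for example whether it skips the rest of the bad UCN, is left open here, as it is in the text above.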

