Universal Character Names, v2

Fri Nov 29 16:16:00 GMT 2002

Neil Booth <neil@daikokuya.co.uk> writes:

> > +#ifndef HAVE_AS_UTF8
> > +	  cpp_error (pfile, DL_ERROR, 
> > +		     "Non-ASCII identifiers not supported by your assembler");
> > +#endif
> > +	}
> 
> This doesn't belong here.  Someone doing preprocessing only would be
> not too happy at this message.
> 
> I suggest this should only be a warning (it could be -S with the
> output used on a different assembler, or for some other purpose),
> only be emitted once per translation unit, and be moved to c_lex().

Having that kind of determination (whether the assembler supports
UTF-8) was an explicit request, for Java compatibility.

The Java compiler requires UCN support on all platforms (and has
mangling to do so), but also requires C++ compatibility on systems
where C++ supports UCNs.

So if I assume that C++ can use UTF-8 everywhere, the Java compiler
will break on systems where no suitable assembler is available.

> This should be in a subroutine to avoid code duplication.  (I know this
> isn't true of this code in general, but we're not in the fast path
> when doing UCS's.  One day I hope to have solved the performance issue,
> and then there will only be a single copy of the lot).

The code isn't actually duplicated. In one case, the results are
written to a FILE*, in the other case, they are written to a char*
buffer. How can I unify those two?

> Can I suggest that, instead of doing this, you have a routine that
> reads a UCS's digits (4 or 8) into a uchar[8] buffer, and that you
> re-use maybe_read_ucs() on this buffer?  maybe_read_ucs() might
> need a few small tweaks.  Again, this would avoid duplication.

I can try, but I doubt it saves much duplication. Instead of

  for (; length; length--)
    {
      c = get_effective_char (pfile);
      if (ISXDIGIT (c))
	code = (code << 4) + hex_digit_value (c);
      else
	{
	  cpp_error (pfile, DL_ERROR,
		     "non-hex digit '%c' in universal-character-name", c);
	  /* We shouldn't skip in case there are multibyte chars.  */
	  break;
	}
    }

I would get

  for (; length; length--)
    {
      buf[8-length] = get_effective_char (pfile);
    }

It is questionable whether the error should be there at all: if you
find something like \u89xy, this might not be an error; instead, it
tokenizes as "\", followed by "u89xy".

So the error might need to go, and instead we have to back up the
tokenization if one of the digits is not a hex digit. This would change
the loop to

  for (; length; length--)
    {
      buf[8-length] = c = get_effective_char (pfile);
      if (!ISXDIGIT (c))
	{
	  BACKUP ();
	  return;
	}
    }

or some such, since I cannot unify the backup code for both functions.
Then, the common function to compute the value would become

  for (; length; length--)
    {
      code = (code << 4) + hex_digit_value (buf[8-length]);
    }
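
Put together, the split might look roughly like the following
self-contained sketch. Everything in it is made up for illustration:
struct reader, read_ucn_digits, ucn_value and hex_value are not cpplib
interfaces, a plain pointer reset stands in for BACKUP(), and the
buffer is filled from index 0 rather than from 8-length.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the lexer's input state; not cpplib's.  */
struct reader
{
  const char *cur;
};

/* Hypothetical helper: value of a character already known to be a
   hex digit.  */
static unsigned int
hex_value (unsigned char c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  return (unsigned int) (tolower (c) - 'a') + 10;
}

/* Phase 1: copy LENGTH hex digits into BUF.  Returns 1 on success.
   On the first non-hex character, rewind the reader to where it
   started (the stand-in for BACKUP) and return 0, so the caller can
   tokenize the backslash on its own.  */
static int
read_ucn_digits (struct reader *r, unsigned char *buf, size_t length)
{
  const char *start = r->cur;
  size_t i;

  for (i = 0; i < length; i++)
    {
      unsigned char c = (unsigned char) *r->cur;

      if (!isxdigit (c))
	{
	  r->cur = start;
	  return 0;
	}
      buf[i] = c;
      r->cur++;
    }
  return 1;
}

/* Phase 2: the part both callers could share.  Turns the buffered
   digits into a code point; it does not care whether the result ends
   up in a FILE* or a char* buffer.  */
static unsigned long
ucn_value (const unsigned char *buf, size_t length)
{
  unsigned long code = 0;
  size_t i;

  assert (length <= 8);	/* A UCN has 4 or 8 digits.  */
  for (i = 0; i < length; i++)
    code = (code << 4) + hex_value (buf[i]);
  return code;
}
```

Only ucn_value would be common here; each caller would still need its
own version of the first loop, since the backup code differs.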

Regards,
Martin