Unicode mangling (was Re: [PATCH] Java: New C++ ABI compatibility changes.)

Mon Jan 15 04:14:00 GMT 2001

It looks like you're still using the same scheme for mangling Unicode
strings.  I'd like to reexamine that, since we're stabilizing the ABI.

First, we need to remember that C99 and C++ (will) also need this
functionality.  The frontend work to support extended characters in
identifiers remains to be done, of course.

As far as I can tell, the current scheme works on individual identifiers.
If an identifier contains extended characters, you prepend 'U' to the
length and replace each extended character with _NNNN (the hex encoding
of its 16-bit UCS-2 value).  This currently has several flaws:

1) It doesn't allow for C-like symbols, which have no length specifier.
   This could be fixed by defining some encoding starting with, say, '_U'.
2) It doesn't accommodate 32-bit extended characters in C++/C99
   (\UNNNNNNNN).  This could be fixed by escaping them with, say, '_L'.
3) _NNNN is a valid component of an identifier, complicating the
   demangler intelligence.  This could be fixed by also escaping the '_'
   character in affected names.  Hmm...it looks like you intend to do
   so in unicode_mangling_length, but don't actually do so in
   append_unicode_mangled_name.  We could also just use '__'.
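
To make (3) concrete, here is a rough sketch in Python rather than the
actual mangle.c code.  The function names are made up, and I am assuming
four lowercase hex digits and a length that counts the mangled body; the
real implementation may differ on both.  It only handles 16-bit
characters, so it says nothing about (2).

```python
def _is_ascii(name):
    return all(ord(c) < 128 for c in name)

def _hex_escape(c):
    # Assumption: four lowercase hex digits; only 16-bit values handled.
    if ord(c) > 0xFFFF:
        raise ValueError("sketch handles 16-bit characters only")
    return "_%04x" % ord(c)

def mangle_current(name):
    """Scheme as described: 'U' + length + body, '_' left unescaped."""
    if _is_ascii(name):
        return "%d%s" % (len(name), name)
    body = "".join(c if ord(c) < 128 else _hex_escape(c) for c in name)
    return "U%d%s" % (len(body), body)

def mangle_escaped(name):
    """Same, but a literal '_' in an affected name becomes '__'."""
    if _is_ascii(name):
        return "%d%s" % (len(name), name)
    parts = []
    for c in name:
        if c == "_":
            parts.append("__")          # escaped literal underscore
        elif ord(c) < 128:
            parts.append(c)
        else:
            parts.append(_hex_escape(c))
    body = "".join(parts)
    return "U%d%s" % (len(body), body)

# Two different identifiers: e-acute followed by the ASCII text "_00e9",
# and e-acute twice.
a = "\u00e9_00e9"
b = "\u00e9\u00e9"
collide = mangle_current(a) == mangle_current(b)     # same symbol
distinct = mangle_escaped(a) != mangle_escaped(b)    # different symbols
```

Without the escape both names mangle to the same string, so a demangler
cannot tell them apart.  With '__', an underscore followed by another
underscore is always a literal, and an underscore followed by a hex
digit is always an escape.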

With these fixes, I think the current scheme is OK.  But for targets with
8-bit clean binutils, I think it makes a lot of sense to just use the UTF-8
encoding of the identifier in the symbol.
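
As a rough illustration of the UTF-8 alternative (again a Python sketch
with a made-up function name, and assuming the length prefix would count
bytes, which is not something settled here):

```python
def mangle_utf8(name):
    """Length-prefixed identifier with the raw UTF-8 bytes in the symbol.

    Assumption: the length counts UTF-8 bytes, not characters.
    """
    raw = name.encode("utf-8")
    return str(len(raw)).encode("ascii") + raw

# A 16-bit character and a character beyond 16 bits go through the
# same path; no separate escape is needed for the wider one.
sym_bmp = mangle_utf8("caf\u00e9")
sym_wide = mangle_utf8("\U00010400")
```

Nothing in the identifier gets rewritten, so the '_' ambiguity does not
come up at all, but the result is only usable where the assembler and
linker pass bytes above 0x7f through untouched.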

Thoughts?

Jason