This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
more compact name assembly/mangling for GCC and non-Ascii names

To: bothner at cygnus dot com
Subject: more compact name assembly/mangling for GCC and non-Ascii names
From: Paul Eggert <eggert at twinsun dot com>
Date: Sun, 10 Jan 1999 17:20:16 -0800 (PST)
CC: egcs at cygnus dot com, gcc2 at gnu dot org
Here's a proposal for a more space-efficient way to represent
non-Ascii names on GCC platforms where assemblers require names to use
the usual A-Za-z0-9_ alphabet.  As before, I'm assuming that name
``assembly'' (i.e. making names acceptable for the assembler) is done
after name mangling, but the same basic idea can be used even if the
two procedures are combined into one.

For Chinese, Japanese, and Korean characters in UTF-8, the proposed
representation uses only 3 bytes per character, the same as UTF-8.  In
contrast, my previous proposal used 9 bytes per character, and the
gxxint.texi proposal uses 5 bytes.

The price of this space efficiency is a bit more complexity.  However,
I don't think the following proposal is hard to implement.  (And if
complexity is the principal objection to this proposal, I'll volunteer
to write the code to prove my point. :-)


* Introduction

This proposal covers how names are ``assembled'', i.e. translated from
an arbitrary byte string (representing a possibly mangled source name)
into a name suitable for assembly language.

Here are some properties of this proposal, roughly in decreasing order
of importance:

 . Ordinary names (i.e. file-scope names allowed by the 1989 C Standard)
   are not affected by name assembly, for backward compatibility.
 . All possible byte strings can be assembled.
 . The assembly procedure is independent of locale.
 . Assembled names are valid identifiers according to the 1989 C standard.
 . Names that are affected by assembly do not clash with ordinary names,
   because they begin with a reserved sequence "_0".
 . Assembly and disassembly need not keep state;
   i.e. each input prefix is translated independently.
 . There is space in the encoding for future extensions; e.g. the letter 'Z'
   is currently not used as the first byte of a character representation.
 . The procedure is tailored for UTF-8: i.e. it assembles UTF-8 efficiently.
 . Lower-case Ascii letters, digits, and '_' all assemble to themselves.
 . Ascii characters, other single bytes, and the UTF-8 representations for
   all Latin-1 characters all assemble to 2 bytes maximum.
 . Unicode (16-bit) characters all assemble to 3 bytes maximum.
 . ISO 10646 (31-bit) characters all assemble to 7 bytes maximum.
 . Unlike UTF-8, the name-assembling encoding does not have
   self-identifying character boundaries; it's strictly left-to-right.


* Space efficiency

Assembled names aren't quite as short as their UTF-8 counterparts, but
they come close.  The expansion ratios from UTF-8 characters to their
representations in assembled names are as follows:

 byte counts	expansion
UTF-8	asm	ratio	UTF-8 character set
1	1  	1	Lower-case Ascii letters, digits, and '_'
1	2	2	other Ascii characters
2	2	1	00000080 - 000000FF Latin-1
2	3	1.5	00000100 - 000007FF Extended Latin, Cyrillic, ...
3	3	1	00000800 - 0000FFFF Chinese, Devanagari, Japanese, ...
4	6	1.5	00010000 - 001FFFFF not assigned yet
5	6	1.2	00200000 - 03FFFFFF "
6	7	1.167	04000000 - 7FFFFFFF "


* Name assembly procedure

Name assembly uses the 63 bytes corresponding to the Ascii characters
'0'-'9', 'a'-'z', '_', 'A'-'Z', in that order, as the digits in a
base-63 representation.

If a string of bytes is already a valid name for this platform, do not
modify it.  To assemble any other string of bytes into a name, output
"_0" and then repeat the following mutually exclusive steps until the
input byte string is exhausted.

1.  If the leading byte is any of the 37 Ascii characters 0-9, a-z, _,
then consume and output that byte.

2.  If the leading byte has one of the following values, consume it
and output the corresponding 2-byte string.

    name byte value	assembled output 2-byte string
		  0	"AK"
		127	"AL"
		254	"AM"
		255	"AN"

This does not cause any ambiguity with the output of (3), since the
output strings are not used by (3).

3.  If the value B of the leading byte is in the range 1 <= B <= 126
and (1) does not apply, consume the leading byte and output the
2-digit base-63 representation for (37*63 - 1 + B).  The leading digit
d of this representation will be in the range 37 <= d <= 38 (i.e. the
leading output byte b will be in the range 'A' <= b <= 'B').

4.  If the value B of the leading byte is in the range 128 <= B <= 253
but the leading bytes do not form a valid UTF-8 character, consume the
leading byte and output the 2-digit base-63 representation for
(39*63 - 128 + B).  The leading digit d of this representation
will be in the range 39 <= d <= 40 (i.e. the leading output byte b
will be in the range 'C' <= b <= 'D').

5.  If the leading bytes are the UTF-8 string for the ISO 10646
character C, where 160 <= C <= 285, then consume the bytes and output
the 2-digit base-63 representation for (41*63 - 160 + C).  The leading
digit d of this representation will be in the range 41 <= d <= 42
(i.e. the leading output byte b will be in the range 'E' <= b <= 'F').

6.  If the leading bytes are the UTF-8 string for the ISO 10646
character C, where 128 <= C < 160 || 286 <= C < 2**16, then consume
the bytes and output the 3-digit base-63 representation for
(43*(63**2) + C).  The leading digit of this representation will be in
the range 43 <= d <= 59 (i.e. the leading output byte b will be in the
range 'G' <= b <= 'W').

7.  If the leading bytes are the UTF-8 string for the ISO 10646
character C, where 2**16 <= C < 2**26, then consume the bytes and
output the 6-digit base-63 representation for (60*(63**5) + C).  The
leading digit b of this representation will be 60 (i.e. the leading
output byte b will be 'X').

8.  If the leading bytes are the UTF-8 string for the ISO 10646
character C, where 2**26 <= C < 2**31, then consume the bytes and
output the 7-digit base-63 representation for (61*(63**6) + C).  The
leading digit b of this representation will be 61 (i.e. the leading
output byte b will be 'Y').


* Name disassembly

To invert this process and disassemble a name into its original byte
string, consume the name's leading "_0" bytes and then repeat the
following mutually exclusive steps until the name's tail is exhausted,
at each step examining the leading byte b of the name's tail.  It is
an error if the name's tail starts with any other byte, or if the
name's tail does not start with an N-digit base-63 representation as
required by steps (3)-(8).

If the name does not start with "_0", leave it alone; it does not
require disassembly.

1.  If b is any of the 37 Ascii characters 0-9, a-z, _, then consume
and output b.

2.  If the initial two bytes are one of the assembled output byte
strings in table (2) above, consume the two bytes and output the
corresponding name byte.

3.  If 'A' <= b <= 'B' and (2) does not apply, then consume the
initial 2-digit base-63 representation for (37*63 - 1 + B) and output
the byte with value B; it is an error unless 1 <= B <= 126.

4.  If 'C' <= b <= 'D', then consume the initial 2-digit base-63
representation for (39*63 - 128 + B) and output the byte with value B.

5.  If 'E' <= b <= 'F', then consume the initial 2-digit base-63
representation for (41*63 - 160 + C) and output the UTF-8
representation for C.  It is an error unless 160 <= C <= 285.

6.  If 'G' <= b <= 'W', then consume the initial 3-digit base-63
representation for (43*(63**2) + C) and output the UTF-8
representation for C.  It is an error unless 286 <= C < 2**16.

7.  If b == 'X', then consume the initial 6-digit base-63
representation for (60*(63**5) + C) and then output the UTF-8
representation for C.  It is an error unless 2**16 <= C < 2**26.

8.  If b == 'Y', then consume the initial 7-digit base-63
representation for (61*(63**6) + C) and then output the UTF-8
representation for C.  It is an error unless 2**26 <= C < 2**31.


* Example:

Suppose the original name is in UTF-8, and contains the following
characters in order:

	UTF-8		Unicode		character name
	bytes (hex)	value
	70		0070		LATIN SMALL LETTER P
	C2 B5		00B5		MICRO SIGN
	51		0051		LATIN CAPITAL LETTER Q
	E1 81 A0	10A0		GEORGIAN CAPITAL LETTER AN
	5F		005F		UNDERSCORE
	72		0072		LATIN SMALL LETTER R

and suppose the platform does not allow this name.  Then it is encoded
as the name `_0pElBhH4z_r', as follows:

	_0	the escape prefix
	p	for `p'
	El	for MICRO SIGN: 41*63 - 160 + 0x00B5 == 2604 == 41 21 (base 63)
	Bh	for `Q'; (37*63 - 1 + 0x51) == 2411 == 38 17 (base 63)_
	H4z	for GEORGIAN CAPITAL LETTER AN; (43*(63**2) + 0x10A0)
			== 174923 == 44 04 35 (base 63)
	_	for `_'
	r	for `r'

In this case, name assembly expands 9 UTF-8 bytes to 10 assembled bytes,
which grow to 12 bytes once we count the "_0" prefix.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]