[PATCH] PR18785: Support non-native execution charsets

Wed Dec 22 17:02:00 GMT 2004

On Wed, 22 Dec 2004, Roger Sayle wrote:

> To support "cross-environments", I propose the following changes below
> to resolve PR18785 and generally improve handling of execution character
> sets.  The first is that the middle-end needs to keep track of "charset
> cross-compilation" where the execution charset is different from the
> internal source charset.  This is represented below by the new global
> variable "nonnative_charset_p".  When this flag is true, the contents
> of TREE_STRING etc... can't be assumed to be NUL-terminated C strings
> that can be passed to the current run-time, i.e. "strcmp".  Checking
> this flag, which should be rare in practice, allows the middle-end to
> avoid optimizations that transform or precompute libc builtins at
> compile-time.

String constants are always NUL-terminated: DR#278 / C99 TC2 disallowed
encodings in which NUL is the first byte of a multibyte character.  (But
this is the target NUL, which might be wider than the host NUL, as on C4x;
however, we don't presently really allow for target strings wider than host
strings in such optimizations, and Stage 3 is no time to start doing so.)
String functions such as strcmp operate on strings independently of locale,
with characters treated as unsigned char; see DR#274 / C99 TC2.  (Whereas
strcoll indeed can't be optimized even in the C locale; see DR#235.)
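
To illustrate why strcmp folding does not depend on the character set,
here is a minimal sketch of a byte-wise comparison.  The helper name
fold_strcmp_bytes is hypothetical and is not an existing GCC function; it
only shows that the result depends on the byte values (as unsigned char),
not on which characters those bytes encode.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not a GCC function): compare two NUL-terminated
   byte strings as strcmp is specified to, with each byte interpreted
   as unsigned char.  No charset or locale knowledge is needed.  */
static int
fold_strcmp_bytes (const char *a, const char *b)
{
  const unsigned char *p = (const unsigned char *) a;
  const unsigned char *q = (const unsigned char *) b;

  assert (a != NULL && b != NULL);

  /* Advance while the bytes match and the terminator is not reached.  */
  while (*p != '\0' && *p == *q)
    {
      p++;
      q++;
    }

  /* Normalize the result to -1, 0 or 1.  */
  return (*p > *q) - (*p < *q);
}
```

For example, the byte 0xC1 ('A' in EBCDIC) compares greater than 0x41
('A' in ASCII) whatever the host charset is, because only the unsigned
byte values are examined.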

So in general the optimizations based on extracting strings are valid, and
only a few need to take account of the execution character set.  (The
standard doesn't seem to make any allowance for runtime variation of the
values of the basic execution character set, although the extended
character set may vary with the runtime locale, so I think isdigit can
indeed be optimized at compile time based on -fexec-charset.)
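
As a rough sketch of what folding isdigit against the execution charset
could look like: the C standard requires the ten decimal digits to have
consecutive values in the basic execution character set, so the only
charset-specific input is the target encoding of '0'.  The function name
and the target_zero parameter are assumptions for illustration, not
existing GCC interfaces.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper (not a GCC function): decide whether the target
   byte C is a decimal digit, given TARGET_ZERO, the encoding of '0' in
   the execution charset (0x30 for ASCII, 0xF0 for EBCDIC).  Relies on
   the standard's guarantee that '0'..'9' are contiguous.  */
static int
fold_isdigit (unsigned char c, unsigned char target_zero)
{
  /* All ten digits must be representable as single bytes.  */
  assert (target_zero <= UCHAR_MAX - 9);

  return c >= target_zero && c - target_zero <= 9;
}
```

With this, the byte 0xF5 is a digit for an EBCDIC execution charset but
not for an ASCII one, which is exactly the case where using the host's
isdigit would give the wrong answer.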

I think a conservative patch should only disable ctype/wctype/*printf
optimizations for non-native character sets (but not the format-checking
diagnostics, although they have the same issue: format strings can be
mixtures of multibyte characters and sequences of bytes interpreted as
individual bytes, in complicated ways, rather than simple multibyte
strings).  The <string.h> optimizations are safe.
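
The conservative policy described above might be sketched as a single
predicate.  The enum and the function name are hypothetical, made up for
this illustration; only the flag name nonnative_charset_p comes from the
proposed patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of library builtins, for illustration.  */
enum builtin_class
{
  BC_STRING,	/* <string.h>: strcmp, strlen, ...  */
  BC_CTYPE,	/* <ctype.h>: isdigit, toupper, ...  */
  BC_WCTYPE,	/* <wctype.h>: iswdigit, ...  */
  BC_PRINTF	/* *printf format-based transformations.  */
};

/* Hypothetical predicate: may a builtin of class BC be folded at
   compile time?  NONNATIVE_CHARSET_P is true when the execution
   charset differs from the internal source charset.  */
static bool
can_fold_builtin (enum builtin_class bc, bool nonnative_charset_p)
{
  switch (bc)
    {
    case BC_STRING:
      /* Byte-wise and charset-independent, so always safe.  */
      return true;
    case BC_CTYPE:
    case BC_WCTYPE:
    case BC_PRINTF:
      /* Depend on character values; only fold for a native charset.  */
      return !nonnative_charset_p;
    }

  assert (!"unknown builtin class");
  return false;
}
```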

-- 
Joseph S. Myers               http://www.srcf.ucam.org/~jsm28/gcc/
    jsm@polyomino.org.uk (personal mail)
    joseph@codesourcery.com (CodeSourcery mail)
    jsm28@gcc.gnu.org (Bugzilla assignments and CCs)