[PATCH] PR18785: Support non-native execution charsets

Zack Weinberg zack@codesourcery.com
Wed Dec 22 19:52:00 GMT 2004


Roger Sayle <roger@eyesopen.com> writes:

> The following patch should resolve PR middle-end/18785 which is marked
> as a 4.0 regression misoptimizing __builtin_isdigit when generating
> code for an EBCDIC target from an ASCII host.

You seem to be fundamentally confused about a few things.  I see that
Joseph has pointed out some of the consequent problems, but I don't
think he's done a very good job of explaining what's conceptually
wrong, so let me have a go at it.

> As described at the top of libcpp/charset.c, GCC worries about three
> character sets: the "input" character set (used to encode the input
> source files), the "source" character set (used by the host
> operating system) and the "execution" character set (used on the target).

There are actually no fewer than five character sets to be concerned
about.

  * The input character set: the "physical source file multibyte
    characters" referred to in the description of translation phase 1.
    Theoretically this could be different for different source
    files included in the translation unit, but we don't currently
    implement that.  Nobody outside cpplib has to worry about it.  It
    can be any encoding whatsoever.

  * The source character set: the encoding used by internal processing
    in translation phases 1b-4 (1a is the conversion from input to
    source character set).  This has several major constraints on it:

      - It has to be a proper multibyte character set as C99 defines
        that term (5.2.1.2p1).  It may NOT have a state-dependent
        encoding.

      - It has to be isomorphic to ISO 10646 (Unicode) so that \u, \U
        escapes are meaningful.  (Because of this, the source
        character set cannot be a single-byte encoding.)

      - All characters within the basic source character set must have
        the same code points that they do in ...

  * The host character set: that is, the narrow execution character
    set of the host machine.  At present this is always either ASCII
    or EBCDIC, and we assume that whichever variant of EBCDIC is in
    use does not alter the code points corresponding to the basic
    source character set.

  * The narrow execution character set: the encoding used by narrow
    string literals and character constants on the target machine.
    This is C99's execution character set, and it's what
    -fexec-charset selects.

  * The wide execution character set: the encoding used by wide
    string literals and character constants on the target machine.
    C99 neglects to discuss this, but it is obviously necessary.
    -fwide-exec-charset controls this encoding.

It's important not to confuse the source character set with the host
character set, because they are only guaranteed to be the same for
code points corresponding to the basic source character set.  It is
therefore a bug to use a character constant or string literal in GCC's
source code, in a context where it will be compared to source text,
for any character not in the basic source character set.  Note that
for GCC's purposes, $ and @ count as basic source character set, even
though they don't in C99.

We don't do much (any?) optimization on wide string functions, so the
wide execution character set isn't a big concern.

The narrow execution character set is the one of primary concern right
now.  All the optimizations done on C library functions with known
semantics are required to be done as-if the function call was
evaluated at runtime, so its string/character arguments would
definitely be in the narrow execution character set.

The bug described by PR18785, as I understand it, is that some of
those optimizations assume the narrow execution character set is the
same as the *host* character set, e.g. by calling the analogous
function from the host C library.  This is always wrong.  Also, you're
correct to be looking for a simple fix for right now.  However, I
think we should be considering each buggy optimization in isolation,
because in some cases it may be very easy to fix it properly instead
of just disabling the optimization.  For example, isdigit(c) can be
optimized to ((unsigned) (c - '0') <= 9) if and only if we know the
value of '0' in the narrow execution character set.  Well, '0' is in
the basic source character set, so we can get the right value from
cpplib:

unsigned int
target_digit0 (void)
{
  cpp_token t0 = { 0, CPP_CHAR, 0, { 0 } };
  unsigned int chars_seen;
  int unsignedp;
  cppchar_t result;

  t0.val.str.text = (const unsigned char *) "0";
  t0.val.str.len = 1;

  result = cpp_interpret_charconst (parse_in, &t0,
                                    &chars_seen, &unsignedp);
  gcc_assert (chars_seen == 1);
  return result;
}

will return a value which can be used in place of TARGET_DIGIT0 in
fold_builtin_isdigit.  (The cpp_interpret_charconst interface isn't
designed for this use - I'd be happy to add something more
straightforward.)

Joseph is correct to point out that the narrow execution set will
always have the property that a byte with all bits zero terminates the
string.  Thus it is always safe to call host strlen() on a narrow
TREE_STRING.  

Since the conversion to the execution character set has already
happened, I'm not clear on why it wouldn't be safe to call host
strcmp(), too, since that function is defined to be 

int
strcmp (const char *a, const char *b)
{
  const unsigned char *u = (const unsigned char *) a;
  const unsigned char *v = (const unsigned char *) b;

  while (*u && *u == *v)
    u++, v++;

  return (*u == *v) ? 0 : (*u < *v) ? -1 : 1;
}

which is completely charset-independent.

I'm not opposed to something like your nonnative_charset_p for the
genuinely hard cases, but I would strongly suggest you determine it in
a more robust manner, e.g. by inquiring of cpplib whether the
conversion from the source to execution character sets is a nop.  I
don't think there's an interface to that right now, but again one can
be easily added.

> Penultimately, because the execution character set is selectable at
> compile-time, the current system of target macros such as TARGET_CR,
> TARGET_LF, TARGET_BS etc... can't be compile-time constants and
> therefore can't be used as case labels of switches in the GCC source
> code.  Fortunately, now that we track nonnative_charset_p, we can
> do the right thing in the C pretty printer.

I would almost rather you always printed these characters with octal
escapes.

> Finally, the source code to tree-browser.c is the only source file
> in the gcc/gcc/ tree that refers to EBCDIC (now that i370 support
> has been dropped), and is currently using an obsolete form of the
> test for host charset (i.e. different from libiberty, libcpp and
> this patch).  This divergence is cleaned up below.

I think you should check this piece in by itself; it's obviously
correct and is independent of the rest of the patch.

zw


