[PATCH] PR18785: Support non-native execution charsets

Roger Sayle roger@eyesopen.com
Wed Dec 22 15:53:00 GMT 2004


The following patch should resolve PR middle-end/18785, which is marked
as a 4.0 regression: __builtin_isdigit is misoptimized when generating
code for an EBCDIC target from an ASCII host.

As described at the top of libcpp/charset.c, GCC worries about three
character sets: the "input" character set (used to encode the input
source files), the "source" character set (used by the host operating
system) and the "execution" character set (used on the target).  String
literals, character constants and libc calls in the GCC source code use
the "source" character set which is restricted to be either ASCII (UTF-8)
or EBCDIC, i.e. a single byte encoding.  The "execution" character set
(which for the C-family languages may be specified at compile-time using
the -fexec-charset= command line option) is used to represent the contents
of TREE_STRING tree nodes, and may potentially be a multibyte character
encoding.  The default/usual configuration is that source and execution
character sets are the same.

To support "cross-environments", I propose the changes below to resolve
PR18785 and generally improve handling of execution character sets.
The first is that the middle-end needs to keep track of "charset
cross-compilation", where the execution charset is different from the
internal source charset.  This is represented below by the new global
variable "nonnative_charset_p".  When this flag is true, the contents
of TREE_STRING etc. can't be assumed to be NUL-terminated C strings
that can be passed to the host's run-time library, e.g. "strcmp".
Checking this flag, which should rarely be set in practice, allows the
middle-end to avoid optimizations that transform or precompute libc
builtins at compile-time.
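
As a rough illustration of the guard pattern (not code from the patch),
the sketch below models a compile-time fold that refuses to interpret
string bytes when the flag is set.  The helper try_fold_strlen and its
signature are hypothetical; only the flag name comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the global flag added by the patch.  */
bool nonnative_charset_p = false;

/* Hypothetical model of a compile-time fold.  Returns true and stores
   the length in *LEN_OUT only when the literal's bytes may safely be
   handed to the host's strlen; otherwise leaves *LEN_OUT untouched.  */
bool
try_fold_strlen (const char *literal_bytes, size_t *len_out)
{
  assert (len_out != NULL);

  /* The bytes are opaque if the execution charset may differ from the
     host's: give up and let the call survive to run-time.  */
  if (nonnative_charset_p || literal_bytes == NULL)
    return false;

  *len_out = strlen (literal_bytes);
  return true;
}
```

Returning "no fold" is always safe here; the only cost is a missed
optimization when -fexec-charset= names the native charset anyway.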

In the case of isascii, for example, we can only optimize and/or
constant fold this call at compile-time if we're targeting a native
character set and the native character set is ASCII/UTF-8.  In the
case of isdigit, we can transform this into (unsigned) x - '0' <= 9 on
both ASCII/UTF-8 and EBCDIC hosts, provided we're targeting the native
character set.
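
For reference, the isdigit transformation looks like this in plain C.
This is only a sketch with hypothetical function names, not the tree
code from builtins.c.  It relies on the digits being contiguous, which
the C standard guarantees, so the host's '0' is correct whenever the
target uses the same charset.  It is wrong otherwise: '0' is 0x30 in
ASCII but 0xF0 in EBCDIC.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Range-check form of isdigit.  The subtraction is done in unsigned
   arithmetic, so values below '0' wrap around to large numbers and a
   single comparison covers both bounds.  */
int
is_digit_folded (int x)
{
  return ((unsigned) x - (unsigned) '0') <= 9u;
}

/* Compare the folded form against the host libc for every value an
   unsigned char can hold.  */
void
check_against_host_libc (void)
{
  int c;

  for (c = 0; c <= UCHAR_MAX; c++)
    assert (is_digit_folded (c) == (isdigit (c) != 0));
}
```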

Currently, we (should) never attempt to interpret the "execution"
character set if it isn't the same as the source/host charset, i.e.
native.  In theory, we could use "iconv" or encode knowledge of
ASCII/UTF-8, but this is probably neither worth it nor suitable for
stage3.

Penultimately, because the execution character set is selectable each
time the compiler is run, the current system of target macros such as
TARGET_CR, TARGET_LF, TARGET_BS etc. can't really be constants fixed
when GCC itself is built, and therefore shouldn't be used as case
labels of switches in the GCC source code.  Fortunately, now that we
track nonnative_charset_p, we can do the right thing in the C pretty
printer.
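
The non-native fallback in the pretty printer amounts to printing every
character as a three-digit octal escape.  A standalone sketch follows;
the helper name and buffer interface are hypothetical, and masking to
one byte is my own choice here (the patch passes (unsigned) c
unmasked).

```c
#include <assert.h>
#include <stdio.h>

/* Format C as an octal escape such as \012 into BUF, which must hold
   at least five bytes (backslash, three digits, NUL).  Returns the
   number of characters written, excluding the NUL.  */
int
format_octal_escape (char *buf, size_t size, int c)
{
  assert (buf != NULL && size >= 5);

  /* Mask to a single byte so a negative char still yields exactly
     three octal digits.  */
  return snprintf (buf, size, "\\%03o", (unsigned) c & 0xffu);
}
```

Octal escapes are valid C in any execution charset, so the output stays
meaningful even when the host can't tell which character a byte
represents.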

Finally, the source code to tree-browser.c is the only source file
in the gcc/gcc/ tree that refers to EBCDIC (now that i370 support
has been dropped), and is currently using an obsolete form of the
test for host charset (i.e. different from libiberty, libcpp and
this patch).  This divergence is cleaned up below.


The following patch has been tested on i686-pc-linux-gnu with a full
"make bootstrap", all default languages, and regression tested with
a top-level "make -k check" with no new failures.  Wouldn't it be
wonderful if I had my own IBM mainframe?  Fortunately, all of the
changes below are conservative (suitable for stage3), so I wouldn't
expect problems if ever someone attempted to further revive EBCDIC
support.

Ok for mainline?



2004-12-22  Roger Sayle  <roger@eyesopen.com>

	PR middle-end/18785
	* toplev.c (nonnative_charset_p): New global variable.
	* flags.h (nonnative_charset_p): Declare here.
	* c-opts.c (c_common_handle_option): Set nonnative_charset_p to
	true, if the "-fexec-charset=" command line option is given.
	* builtins.c (c_getstr): Return NULL if using a non-native charset.
	(fold_builtin_isascii): Do nothing if the host character set isn't
	ASCII, or we're using a non-native character set.
	(fold_builtin_toascii): Likewise.
	(fold_builtin_isdigit): Do nothing if using a non-native charset.
	Replace TARGET_DIGIT0 with the host's '0'.
	* c-pretty-print.c: Include "flags.h".
	(pp_c_char): If using a non-native character set, display characters
	as octal escape codes.  Replace uses of TARGET_CR, TARGET_LF et al.
	macros with the host's own escape codes when using a native charset.
	(pp_c_integer_constant): Fix an incorrect call to pp_c_char with the
	intended pp_character.
	* tree-browser.c: Remove obsolete #ifdef HOST_EBCDIC code.
	* Makefile.in (c-pretty-print.o): Update dependencies.


Index: toplev.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/toplev.c,v
retrieving revision 1.935
diff -c -3 -p -r1.935 toplev.c
*** toplev.c	18 Dec 2004 06:38:24 -0000	1.935
--- toplev.c	22 Dec 2004 05:01:05 -0000
*************** const char *flag_random_seed;
*** 241,246 ****
--- 241,254 ----
     user has specified a particular random seed.  */
  unsigned local_tick;

+ /* Nonzero if the target's execution character set may be different to
+    the source (host's) character set.  Note, this internal "source"
+    character set may be different again from the input character set
+    which is the encoding of the source file.  If this flag is true,
+    the contents of a STRING_CST node should be considered opaque.  */
+
+ bool nonnative_charset_p = false;
+
  /* -f flags.  */

  /* Nonzero means `char' should be signed.  */
Index: flags.h
===================================================================
RCS file: /cvs/gcc/gcc/gcc/flags.h,v
retrieving revision 1.150
diff -c -3 -p -r1.150 flags.h
*** flags.h	28 Sep 2004 20:34:17 -0000	1.150
--- flags.h	22 Dec 2004 05:01:05 -0000
*************** extern int in_system_header;
*** 127,132 ****
--- 127,140 ----
     pattern and alternative used.  */

  extern int flag_print_asm_name;
+
+ /* Nonzero if the target's execution character set may be different to
+    the source (host's) character set.  Note, this internal "source"
+    character set may be different again from the input character set
+    which is the encoding of the source file.  If this flag is true,
+    the contents of a STRING_CST node should be considered opaque.  */
+
+ extern bool nonnative_charset_p;

  /* Now the symbols that are set with `-f' switches.  */

Index: c-opts.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/c-opts.c,v
retrieving revision 1.133
diff -c -3 -p -r1.133 c-opts.c
*** c-opts.c	30 Nov 2004 14:10:08 -0000	1.133
--- c-opts.c	22 Dec 2004 05:01:05 -0000
*************** c_common_handle_option (size_t scode, co
*** 746,751 ****
--- 746,754 ----

      case OPT_fexec_charset_:
        cpp_opts->narrow_charset = arg;
+       /* We could attempt to compare ARG to cpplib's SOURCE_CHARSET or
+ 	 _cpp_default_encoding, but assuming the worst is also safe.  */
+       nonnative_charset_p = true;
        break;

      case OPT_fwide_exec_charset_:
Index: builtins.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/builtins.c,v
retrieving revision 1.410
diff -c -3 -p -r1.410 builtins.c
*** builtins.c	14 Dec 2004 18:04:51 -0000	1.410
--- builtins.c	22 Dec 2004 05:01:07 -0000
*************** c_getstr (tree src)
*** 358,363 ****
--- 358,368 ----
  {
    tree offset_node;

+   /* Return a NULL pointer, if we're unsure of the encoding used in the
+      string constant.  */
+   if (nonnative_charset_p)
+     return 0;
+
    src = string_constant (src, &offset_node);
    if (src == 0)
      return 0;
*************** fold_builtin_copysign (tree arglist, tre
*** 7410,7415 ****
--- 7415,7424 ----
  static tree
  fold_builtin_isascii (tree arglist)
  {
+ #if HOST_CHARSET == HOST_CHARSET_ASCII
+   if (nonnative_charset_p)
+     return NULL_TREE;
+
    if (! validate_arglist (arglist, INTEGER_TYPE, VOID_TYPE))
      return 0;
    else
*************** fold_builtin_isascii (tree arglist)
*** 7428,7433 ****
--- 7437,7445 ----
        else
          return arg;
      }
+ #else
+   return NULL_TREE;
+ #endif
  }

  /* Fold a call to builtin toascii.  */
*************** fold_builtin_isascii (tree arglist)
*** 7435,7440 ****
--- 7447,7456 ----
  static tree
  fold_builtin_toascii (tree arglist)
  {
+ #if HOST_CHARSET == HOST_CHARSET_ASCII
+   if (nonnative_charset_p)
+     return NULL_TREE;
+
    if (! validate_arglist (arglist, INTEGER_TYPE, VOID_TYPE))
      return 0;
    else
*************** fold_builtin_toascii (tree arglist)
*** 7445,7450 ****
--- 7461,7469 ----
        return fold (build2 (BIT_AND_EXPR, integer_type_node, arg,
  			   build_int_cst (NULL_TREE, 0x7f)));
      }
+ #else
+   return NULL_TREE;
+ #endif
  }

  /* Fold a call to builtin isdigit.  */
*************** fold_builtin_toascii (tree arglist)
*** 7452,7458 ****
  static tree
  fold_builtin_isdigit (tree arglist)
  {
!   if (! validate_arglist (arglist, INTEGER_TYPE, VOID_TYPE))
      return 0;
    else
      {
--- 7471,7481 ----
  static tree
  fold_builtin_isdigit (tree arglist)
  {
!   /* If the target charset is different from the host's, do nothing!  */
!   if (nonnative_charset_p)
!     return NULL_TREE;
!
!   if (!validate_arglist (arglist, INTEGER_TYPE, VOID_TYPE))
      return 0;
    else
      {
*************** fold_builtin_isdigit (tree arglist)
*** 7461,7467 ****
        tree arg = TREE_VALUE (arglist);
        arg = fold_convert (unsigned_type_node, arg);
        arg = build2 (MINUS_EXPR, unsigned_type_node, arg,
! 		    build_int_cst (unsigned_type_node, TARGET_DIGIT0));
        arg = build2 (LE_EXPR, integer_type_node, arg,
  		    build_int_cst (unsigned_type_node, 9));
        arg = fold (arg);
--- 7484,7490 ----
        tree arg = TREE_VALUE (arglist);
        arg = fold_convert (unsigned_type_node, arg);
        arg = build2 (MINUS_EXPR, unsigned_type_node, arg,
! 		    build_int_cst (unsigned_type_node, '0'));
        arg = build2 (LE_EXPR, integer_type_node, arg,
  		    build_int_cst (unsigned_type_node, 9));
        arg = fold (arg);
Index: c-pretty-print.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/c-pretty-print.c,v
retrieving revision 1.57
diff -c -3 -p -r1.57 c-pretty-print.c
*** c-pretty-print.c	7 Sep 2004 10:18:59 -0000	1.57
--- c-pretty-print.c	22 Dec 2004 05:01:07 -0000
*************** Software Foundation, 59 Temple Place - S
*** 28,33 ****
--- 28,34 ----
  #include "c-tree.h"
  #include "tree-iterator.h"
  #include "diagnostic.h"
+ #include "flags.h"

  /* The pretty-printer code is primarily designed to closely follow
     (GNU) C and C++ grammars.  That is to be contrasted with spaghetti
*************** pp_c_function_definition (c_pretty_print
*** 717,743 ****
  static void
  pp_c_char (c_pretty_printer *pp, int c)
  {
    switch (c)
      {
!     case TARGET_NEWLINE:
        pp_string (pp, "\\n");
        break;
!     case TARGET_TAB:
        pp_string (pp, "\\t");
        break;
!     case TARGET_VT:
        pp_string (pp, "\\v");
        break;
!     case TARGET_BS:
        pp_string (pp, "\\b");
        break;
!     case TARGET_CR:
        pp_string (pp, "\\r");
        break;
!     case TARGET_FF:
        pp_string (pp, "\\f");
        break;
!     case TARGET_BELL:
        pp_string (pp, "\\a");
        break;
      case '\\':
--- 718,753 ----
  static void
  pp_c_char (c_pretty_printer *pp, int c)
  {
+   /* If we're targeting a non-native character set, write out each
+      character as an escape.  In theory, we could try to use iconv,
+      but that's a lot of effort for a pretty-printer corner case.  */
+   if (nonnative_charset_p)
+     {
+       pp_scalar (pp, "\\%03o", (unsigned) c);
+       return;
+     }
+
    switch (c)
      {
!     case '\n':
        pp_string (pp, "\\n");
        break;
!     case '\t':
        pp_string (pp, "\\t");
        break;
!     case '\v':
        pp_string (pp, "\\v");
        break;
!     case '\b':
        pp_string (pp, "\\b");
        break;
!     case '\r':
        pp_string (pp, "\\r");
        break;
!     case '\f':
        pp_string (pp, "\\f");
        break;
!     case '\a':
        pp_string (pp, "\\a");
        break;
      case '\\':
*************** pp_c_integer_constant (c_pretty_printer
*** 785,791 ****
      {
        if (tree_int_cst_sgn (i) < 0)
          {
!           pp_c_char (pp, '-');
            i = build_int_cst_wide (NULL_TREE,
  				  -TREE_INT_CST_LOW (i),
  				  ~TREE_INT_CST_HIGH (i)
--- 795,801 ----
      {
        if (tree_int_cst_sgn (i) < 0)
          {
!           pp_character (pp, '-');
            i = build_int_cst_wide (NULL_TREE,
  				  -TREE_INT_CST_LOW (i),
  				  ~TREE_INT_CST_HIGH (i)
Index: tree-browser.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-browser.c,v
retrieving revision 2.4
diff -c -3 -p -r2.4 tree-browser.c
*** tree-browser.c	9 Dec 2004 10:54:36 -0000	2.4
--- tree-browser.c	22 Dec 2004 05:01:07 -0000
*************** struct tb_command {
*** 53,63 ****
  };

  #define DEFTBCODE(code, str, help) { help, str, sizeof(str) - 1, code },
- #ifdef HOST_EBCDIC
- static struct tb_command tb_commands[] =
- #else
  static const struct tb_command tb_commands[] =
- #endif
  {
  #include "tree-browser.def"
  };
--- 53,59 ----
*************** struct tb_tree_code {
*** 77,87 ****
  };

  #define DEFTREECODE(SYM, STRING, TYPE, NARGS) { SYM, STRING, sizeof (STRING) - 1 },
- #ifdef HOST_EBCDIC
- static struct tb_tree_code tb_tree_codes[] =
- #else
  static const struct tb_tree_code tb_tree_codes[] =
- #endif
  {
  #include "tree.def"
  };
--- 73,79 ----
Index: Makefile.in
===================================================================
RCS file: /cvs/gcc/gcc/gcc/Makefile.in,v
retrieving revision 1.1436
diff -c -3 -p -r1.1436 Makefile.in
*** Makefile.in	20 Dec 2004 21:10:39 -0000	1.1436
--- Makefile.in	22 Dec 2004 05:01:08 -0000
*************** c-common.o : c-common.c $(CONFIG_H) $(SY
*** 1432,1440 ****
  	$(GGC_H) $(EXPR_H) $(TM_P_H) builtin-types.def builtin-attrs.def \
  	$(DIAGNOSTIC_H) gt-c-common.h langhooks.h varray.h $(RTL_H) \
  	$(TARGET_H) $(C_TREE_H) tree-iterator.h langhooks.h tree-mudflap.h
  c-pretty-print.o : c-pretty-print.c $(C_PRETTY_PRINT_H) \
! 	$(C_COMMON_H) $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) real.h \
! 	$(DIAGNOSTIC_H)

  c-opts.o : c-opts.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H)		\
          $(TREE_H) $(C_PRAGMA_H) $(FLAGS_H) toplev.h langhooks.h		\
--- 1432,1441 ----
  	$(GGC_H) $(EXPR_H) $(TM_P_H) builtin-types.def builtin-attrs.def \
  	$(DIAGNOSTIC_H) gt-c-common.h langhooks.h varray.h $(RTL_H) \
  	$(TARGET_H) $(C_TREE_H) tree-iterator.h langhooks.h tree-mudflap.h
+
  c-pretty-print.o : c-pretty-print.c $(C_PRETTY_PRINT_H) \
! 	$(C_TREE_H) $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) real.h \
! 	$(DIAGNOSTIC_H) tree-iterator.h $(FLAGS_H)

  c-opts.o : c-opts.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H)		\
          $(TREE_H) $(C_PRAGMA_H) $(FLAGS_H) toplev.h langhooks.h		\

Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833


