[PATCH] CSE MEMs loaded in narrower modes

Thu Mar 6 14:49:00 GMT 2003

The following patch to GCC allows us to optimize the following code:

int var;

int foo(int x)
{
  var = 0;
  return x + (short)var;
}

The addition is optimized away by Microsoft's Visual C/C++ compilers,
but is preserved by mainline CVS.  The issue is that GCC converts
the cast of "var" to a HImode load very early on, and CSE/GCSE are
unable to realize that the corresponding memory holds a known constant,
due to the discrepancy in MEM modes (and possibly offset calculations).
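
To make the goal concrete, here is roughly what we'd like the optimizers
to reduce "foo" to.  This is only a source-level sketch; the name
"foo_expected" is mine, for illustration, and isn't part of the patch:

```c
int var;

/* The original function: stores zero to "var", then reads it back
   through a narrowing cast.  */
int foo(int x)
{
  var = 0;
  return x + (short)var;
}

/* Hypothetical hand-optimized equivalent: since "var" is known to be
   zero after the store, (short)var is zero and the addition goes away.  */
int foo_expected(int x)
{
  var = 0;
  return x;
}
```

Both functions store zero to "var" and return "x" unchanged.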

My first solution to this problem was an enhancement to cse.c that
made use of the fact that the mode of the MEM isn't used in its hash,
looked for an equivalent constant, then used simplify_subreg to extract
the appropriate constant.  This caught the above optimization but had
several major drawbacks.  It only worked for constants, not pseudos,
didn't handle big-endian architectures, and didn't help GCSE or the
other RTL optimizers.  Worst of all, it required additional CSE hash
table probes for all MEMs, searching for a rare optimization, which would
slow the compiler slightly (even if only with -fexpensive-optimizations).

Then, thinking outside the box, I realized the problem could also be
solved by loading "var" from memory in its original mode into a
register, and then extracting the low part of the register.  This
representation is more convenient for GCC's RTL optimizers, and the
conversion back to a HImode load would take place during combine.
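
As a source-level analogy (not RTL, and not code from the patch), the two
representations can be sketched in C as below.  The helper names are
hypothetical, and the sketch assumes a host that is either plainly
little-endian or plainly big-endian.  The narrow load reads just the two
low-order bytes from memory, at an offset that depends on endianness; the
wide load reads the whole object and then takes the low part of the
value, which needs no offset calculation at all:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero on a big-endian host.  Assumes the host
   is either plainly little-endian or plainly big-endian.  */
static int host_is_big_endian(void)
{
  const uint16_t probe = 1;
  unsigned char first;
  memcpy(&first, &probe, 1);
  return first == 0;
}

/* Narrow load: fetch only the low 16 bits directly from memory, the
   analogue of a HImode MEM.  The byte offset depends on endianness.  */
static int16_t load_low_narrow(const int32_t *p)
{
  size_t offset = host_is_big_endian() ? sizeof(int32_t) - sizeof(int16_t) : 0;
  int16_t lo;
  memcpy(&lo, (const unsigned char *)p + offset, sizeof lo);
  return lo;
}

/* Wide load: fetch the object in its original mode, then extract the
   low part of the loaded value.  */
static int16_t load_low_wide(const int32_t *p)
{
  int32_t reg = *p;
  uint16_t bits = (uint16_t)((uint32_t)reg & 0xFFFFu);
  int16_t lo;
  memcpy(&lo, &bits, sizeof lo);
  return lo;
}

/* Both forms must produce the same low part for any stored value.  */
static void check_same_low_part(int32_t value)
{
  assert(load_low_narrow(&value) == load_low_wide(&value));
}
```

Calling check_same_low_part on a few values, such as 0, -1 and
0x12348765, exercises both paths.  The wide form is the one in which an
optimizer that already knows the full value of the object can see the
low part directly.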

Hence the small patch below to gen_lowpart in emit-rtl.c.

Of course, being suitably paranoid I asked Andreas Jaeger if
he'd be kind enough to run the SPEC2000 benchmarks for me, to check
that all the loads were caught by combine and that we didn't have
any catastrophic performance degradations.  May I say that I'm eternally
indebted to Andreas for his benchmarking efforts.  It's a pity none of
the other hardware vendors or Linux distributions consider performance
important enough to offer similar services to the community :>

Size of binaries:
 Total: Base: 4092719 bytes
 Total: Peak: 4092351 bytes

                                     Estimated                     Estimated
                   Base      Base      Base      Peak      Peak      Peak
   Benchmarks    Ref Time  Run Time   Ratio    Ref Time  Run Time   Ratio
   ------------  --------  --------  --------  --------  --------  --------
   164.gzip          1400   290       484    *     1400   288       486    *
   175.vpr           1400   448       312    *     1400   448       313    *
   176.gcc           1100   297       370    *     1100   297       371    *
   181.mcf           1800   815       221    *     1800   815       221    *
   186.crafty        1000   174       575    *     1000   174       574    *
   197.parser        1800   533       337    *     1800   534       337    *
   252.eon                                   X                             X
   253.perlbmk       1800   339       530    *     1800   342       527    *
   254.gap           1100   273       403    *     1100   272       404    *
   255.vortex        1900   416       457    *     1900   415       458    *
   256.bzip2         1500   436       344    *     1500   435       345    *
   300.twolf         3000   874       343    *     3000   871       344    *
   Est. SPECint_base2000              385
   Est. SPECint2000                                                 385

So, good news: as hoped, combine narrows the loads as appropriate, so we
don't lose anything in performance, and we catch the optimization
opportunity when available.  Performance also remained the same on SPECfp2000 at -O2
and both SPECint and SPECfp at -O3.  The biggest surprise was the ~1%
improvement in compiler performance at -O2:

Compile times for benchmarks:
Total time for base compilation: 552 s
Total time for peak compilation: 543 s

GCC was configured as: configure --enable-threads=posix
--enable-languages="c,c++,f77"
GCC bootstrap times for 'make -j1 bootstrap && make install':
Base compiler: 2554 s
Peak compiler: 2530 s

The following patch has been tested on i686-pc-linux-gnu with a full
"make bootstrap", all languages except Ada and treelang, and regression
tested with a top-level "make -k check" with no new failures.  It's also
been bootstrapped on i686-pc-cygwin.  The example at the top of this
e-mail is optimized with the patch, but not without it.

Ok for mainline?

2003-03-06  Roger Sayle  <roger@eyesopen.com>

	* emit-rtl.c (gen_lowpart): When requesting the low-part of a
	MEM, try loading the MEM into a register and taking the low-part
	of that, to help CSE see the use of the MEM in its true mode.

Index: emit-rtl.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/emit-rtl.c,v
retrieving revision 1.312
diff -c -3 -p -r1.312 emit-rtl.c
*** emit-rtl.c	28 Feb 2003 07:06:33 -0000	1.312
--- emit-rtl.c	2 Mar 2003 18:24:13 -0000
*************** gen_lowpart (mode, x)
*** 1371,1376 ****
--- 1371,1382 ----
      {
        /* The only additional case we can do is MEM.  */
        int offset = 0;
+
+       /* The following exposes the use of "x" to CSE.  */
+       if (GET_MODE_SIZE (GET_MODE (x)) <= UNITS_PER_WORD
+ 	  && ! no_new_pseudos)
+ 	return gen_lowpart (mode, force_reg (GET_MODE (x), x));
+
        if (WORDS_BIG_ENDIAN)
  	offset = (MAX (GET_MODE_SIZE (GET_MODE (x)), UNITS_PER_WORD)
  		  - MAX (GET_MODE_SIZE (mode), UNITS_PER_WORD));

Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833