[PATCH] Work around PR 50031, sphinx3 slowdown in powerpc on GCC 4.6 and GCC 4.7

Tue Aug 9 20:22:00 GMT 2011

This is an initial patch to work around the slow down of sphinx3 in power7 VSX
that first shows up in GCC 4.6 and is still present in the current GCC 4.7
trunk.  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031

The key part of the slowdown is in this inner loop in the
vector_gautbl_eval_logs3  function in sphinx3 vector.c:

{
  int32 i, r;
  float64 f;
  int32 end, veclen;
  float32 *m1, *m2, *v1, *v2;
  float64 dval1, dval2, diff1, diff2;

    /* ... */

    for (i = 0; i < veclen; i++) {
      diff1 = x[i] - m1[i];
      dval1 -= diff1 * diff1 * v1[i];
      diff2 = x[i] - m2[i];
      dval2 -= diff2 * diff2 * v2[i];
    }

    /* ... */

}

In particular, the compiler 4.6 and beyond vectorizes this inner loop.  Because
it doesn't know the alignment of the float pointers, it generates code to use
unaligned vector loads unconditionally, which on the powerpc, involves using a
load of an aligned pointer, and then doing a vperm instruction to permute the
bytes.  Since the code first does the calculation in 32-bit floating point and
then converts it to 64-bit floating point, the compiler does a vector convert
of V4SF to V2DF in the loop.  On the powerpc, this involes two more permutes,
and then the vector conversion.  Thus in the inner loop, there are:

    4 vector loads
    4 vector permutes to do the unalgined load
    8 vector permutes to get things in the right registers for conversion
    4 vector conversions

This patch offers a new option (-mno-vector-convert-32bit-to-64bit) that
disables the vector float/int conversions to double.  Overall this is a win:

GCC 4.6, 32-bit:
    12% improvement, 464.h264ref
     5% improvement, 450.soplex
     3% regression,  465.tonto
     2% improvement, 481.wrf
     9% improvement, 482.sphinx3

GCC 4.6, 64-bit:
     5% improvement, 456.hmmer
     6% improvement, 464.h264ref
    14% improvement, 482.sphinx3

GCC 4.7, 32-bit:
      2% improvement, 437.leslie3d
      9% improvement, 482.sphinx3

I haven't measured GCC 4.7 64-bit mode at the present time, but I can do so if
desired.

While I don't think this is the only solution to 50031, it at least helps us.
It is encouraging that GCC 4.7 doesn't have the regression in tonto.

I have bootstraped and run make check on both 4.6 and 4.7 compilers with no
regressions.  Is it ok to install in the 4.7 tree?  At present, I have made the
default to generate the vectorized conversion, but it may make sense to flip
the default.  Is this patch ok to apply?  Given if affects 4.6, did you want to
see it in 4.6 as well?

[gcc]
2011-08-09  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR tree-optimization/50031
	* doc/invoke.texi (RS/6000 and PowerPC Options): Add
	-mnvsx-vector-32bit-to-64bit switch.

	* config/rs6000/rs6000.md (vec_unpacks_lo_v4sf): Add conditions on
	-mvector-convert-32bit-to-64bit switch.
	(vec_unpacks_float_hi_v4s): Ditto.
	(vec_unpacks_float_lo_v4s): Ditto.
	(vec_unpacku_float_hi_v4s): Ditto.
	(vec_unpacku_float_lo_v4s): Ditto.

	* config/rs6000/rs6000.opt (-mvector-convert-32bit-to-64bit): New
	switch to control whether the compiler does 32->64 bit conversions.

[gcc/testsuite]
2011-08-09  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR tree-optimization/50031
	* gcc.target/powerpc/vsx-vector-7.c: New test for
	-mvector-convert-32bit-to-64bit.
	* gcc.target/powerpc/vsx-vector-8.c: Ditto.

-- 
Michael Meissner, IBM
5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
meissner@linux.vnet.ibm.com	fax +1 (978) 399-6899
-------------- next part --------------
Index: gcc/doc/invoke.texi
===================================================================

--- gcc/doc/invoke.texi	(revision 177467)
+++ gcc/doc/invoke.texi	(working copy)
@@ -813,7 +813,8 @@ See RS/6000 and PowerPC Options.
 -mrecip -mrecip=@var{opt} -mno-recip -mrecip-precision @gol
 -mno-recip-precision @gol
 -mveclibabi=@var{type} -mfriz -mno-friz @gol
--mpointers-to-nested-functions -mno-pointers-to-nested-functions}
+-mpointers-to-nested-functions -mno-pointers-to-nested-functions @gol
+-mvector-convert-32bit-to-64bit -mno-vector-convert-32bit-to-64bit}
 
 @emph{RX Options}
 @gccoptlist{-m64bit-doubles  -m32bit-doubles  -fpu  -nofpu@gol
@@ -16426,6 +16427,13 @@ static chain value to be loaded in regis
 not be able to call through pointers to nested functions or pointers
 to functions compiled in other languages that use the static chain if
 you use the @option{-mno-pointers-to-nested-functions}.
+
+@item -mvector-convert-32bit-to-64bit
+@itemx -mno-vector-convert-32bit-to-64bit
+@opindex mvector-convert-32bit-to-64bit
+Generate (do not generate) code to use VSX vector instructions when
+converting 32-bit types to 64-bit types.  The default is
+@option{-mvector-convert-32bit-to-64bit}.
 @end table
 
 @node RX Options
Index: gcc/config/rs6000/vector.md
===================================================================
--- gcc/config/rs6000/vector.md	(revision 177467)
+++ gcc/config/rs6000/vector.md	(working copy)
@@ -797,7 +797,8 @@ (define_expand "vec_pack_ufix_trunc_v2df
 (define_expand "vec_unpacks_hi_v4sf"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SF 1 "vfloat_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SFmode);
 
@@ -809,7 +810,8 @@ (define_expand "vec_unpacks_hi_v4sf"
 (define_expand "vec_unpacks_lo_v4sf"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SF 1 "vfloat_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SFmode);
 
@@ -821,7 +823,8 @@ (define_expand "vec_unpacks_lo_v4sf"
 (define_expand "vec_unpacks_float_hi_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -833,7 +836,8 @@ (define_expand "vec_unpacks_float_hi_v4s
 (define_expand "vec_unpacks_float_lo_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -845,7 +849,8 @@ (define_expand "vec_unpacks_float_lo_v4s
 (define_expand "vec_unpacku_float_hi_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -857,7 +862,8 @@ (define_expand "vec_unpacku_float_hi_v4s
 (define_expand "vec_unpacku_float_lo_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
Index: gcc/config/rs6000/rs6000.opt
===================================================================
--- gcc/config/rs6000/rs6000.opt	(revision 177467)
+++ gcc/config/rs6000/rs6000.opt	(working copy)
@@ -203,6 +203,10 @@ mvsx-scalar-memory
 Target Undocumented Report Var(TARGET_VSX_SCALAR_MEMORY)
 ; If -mvsx, use VSX scalar memory reference instructions for scalar double (off by default)
 
+mvector-convert-32bit-to-64bit
+Target Report Var(TARGET_VECTOR_CONVERT_32BIT_TO_64BIT) Init(1)
+If -mvsx, enable conversion of 32-bit types to 64-bit types with vector instructions
+
 mvsx-align-128
 Target Undocumented Report Var(TARGET_VSX_ALIGN_128)
 ; If -mvsx, set alignment to 128 bits instead of 32/64