[PATCH] Work around PR 50031, sphinx3 slowdown in powerpc on GCC 4.6 and GCC 4.7
Michael Meissner
meissner@linux.vnet.ibm.com
Thu Aug 11 07:11:00 GMT 2011
On Wed, Aug 10, 2011 at 10:08:54AM +0200, Richard Guenther wrote:
> Are the arrays all well-aligned in practice? Thus, would versioning the loop
> for all-good-alignment help?
I suspect yes on 64-bit, but no on 32-bit, because malloc does not return 128-bit
aligned memory in 32-bit mode; it only returns memory aligned to twice the
alignment of size_t. Long doubles on powerpc are 128 bits, as are the vector
types.
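As a rough illustration, the alignment test the versioned loop needs boils down to checking the low four address bits; `is_vec_aligned` here is a hypothetical helper for illustration, not something from the patch:

```c
#include <stdint.h>

/* Hypothetical helper (not from the patch): returns nonzero if p is
   aligned to a 16-byte (128-bit) boundary, the alignment the vector
   loads want.  A 32-bit malloc that only guarantees twice the
   alignment of size_t (8 bytes) can fail this check.  */
static int is_vec_aligned(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}
```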
I did a test, eliminating the vec_realign stuff under switch control. This has
the effect of versioning the loop into a vector loop that runs when all
pointers are aligned, and a scalar loop that runs when they aren't. I ran
SPEC 2006 in 32-bit mode, and I see the following differences (omitting the
benchmarks that are close enough):
Benchmark % of baseline
========= =============
400.perlbench 96.09%
429.mcf 104.50%
456.hmmer 95.85%
458.sjeng 104.23%
464.h264ref 112.18%
483.xalancbmk 102.35%
410.bwaves 107.02%
416.gamess 96.01%
433.milc 98.90%
434.zeusmp 94.92%
435.gromacs 105.55%
450.soplex 108.58%
453.povray 103.71%
454.calculix 97.54%
459.GemsFDTD 97.35%
465.tonto 97.79%
470.lbm 98.56%
481.wrf 87.11%
482.sphinx3 110.33%
I was hoping that doing the versioning for an aligned loop and unaligned loop
would eliminate the percentages under 100%.
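The versioning described above might be sketched in scalar C like this (names are illustrative and the "vector" path is a scalar stand-in; the real version would use 16-byte vector loads and stores):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of alignment-based loop versioning (illustrative, not the
   patch's code): run the fast path only when every pointer is
   128-bit aligned, otherwise fall back to the scalar loop.  */
static void scaled_add(float *dst, const float *src, float k, size_t n)
{
    size_t i = 0;
    if ((((uintptr_t)dst | (uintptr_t)src) & 15) == 0) {
        /* All pointers 16-byte aligned: the vectorized version would
           run here, 4 floats per iteration.  */
        for (; i + 4 <= n; i += 4)
            for (size_t j = 0; j < 4; j++)
                dst[i + j] += k * src[i + j];
    }
    /* Scalar loop: unaligned case, and the vector loop's remainder.  */
    for (; i < n; i++)
        dst[i] += k * src[i];
}
```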
Note, the powerpc VSX memory instructions for V4SF/V4SI types can run even if
the pointer is not aligned to a 128-bit boundary, but there is a slowdown when
they get pointers that aren't aligned to a 64-bit boundary. I'm doing a run
right now with movmisalign enabled for V4SF/V4SI, and I am seeing some
regressions in the run.
> If we have 4 permutes and then 8 further ones - can we combine for example
> an unaligned load permute and the following permute for the sf->df conversion?
I don't think so.
The unaligned stuff loads a 128-bit value into a register from a left half, a
right half, and a permute mask. The Altivec instruction set has an instruction
(lvsl) that computes the mask from the address, and the loads and stores
ignore the bottom 4 bits of the address.
The unaligned loop looks something like:

    left = vector_load (addr & -16);
    mask = lvsl (addr);
    for (...) {
      addr += 16;
      right = vector_load (addr & -16);
      value = permute (left, right, mask);
      /* ... */
      left = right;
    }
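As a rough scalar model of that loop body: lvsl produces the byte-permute mask {s, s+1, ..., s+15} for s = addr & 15, and vperm selects those bytes out of the 32-byte concatenation of the two aligned loads. `emul_lvsl` and `emul_vperm` below are illustrative stand-ins, not the real intrinsics:

```c
#include <stdint.h>
#include <string.h>

/* Scalar emulation (for illustration) of the lvsl mask: byte i of the
   mask is (addr & 15) + i.  */
static void emul_lvsl(uint8_t mask[16], const void *addr)
{
    uint8_t s = (uint8_t)((uintptr_t)addr & 15);
    for (int i = 0; i < 16; i++)
        mask[i] = (uint8_t)(s + i);
}

/* Scalar emulation of vperm: each mask byte indexes into the 32-byte
   concatenation of the two source registers.  */
static void emul_vperm(uint8_t out[16], const uint8_t left[16],
                       const uint8_t right[16], const uint8_t mask[16])
{
    uint8_t both[32];
    memcpy(both, left, 16);
    memcpy(both + 16, right, 16);
    for (int i = 0; i < 16; i++)
        out[i] = both[mask[i] & 31];
}
```

Given the aligned loads around an unaligned address, the permuted result is the 16 bytes starting at that address, which is exactly what the loop above reconstructs each iteration.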
The two permutes for the conversion get the values into the correct place for
the conversion instruction, i.e. if you have a vector with the parts:
+====+====+====+====+
| A | B | C | D |
+====+====+====+====+
The first permute (xxmrghw) in the conversion would create a vector:
+====+====+====+====+
| A | A | B | B |
+====+====+====+====+
and the second (xxmrglw) would create:
+====+====+====+====+
| C | C | D | D |
+====+====+====+====+
Note, each value appears twice because the instruction takes 2 registers as
input, and we just pass the same register for both inputs.
The xvcvspdp instruction then takes a vector of the form (ignoring the 2nd and
4th fields):
+====+====+====+====+
| X | ?? | Y | ?? |
+====+====+====+====+
and converts it to double precision:
+=========+=========+
| X | Y |
+=========+=========+
> Does ppc have a VSX tuned cost-model and is it applied correctly in this case?
> Maybe we need more fine-grained costs?
The ppc has a cost model, but as I said in PR 50031, I think it needs to be
improved.
--
Michael Meissner, IBM
5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
meissner@linux.vnet.ibm.com fax +1 (978) 399-6899