[PATCH] Work around PR 50031, sphinx3 slowdown in powerpc on GCC 4.6 and GCC 4.7
Michael Meissner
meissner@linux.vnet.ibm.com
Thu Aug 11 07:11:00 GMT 2011
On Wed, Aug 10, 2011 at 10:08:54AM +0200, Richard Guenther wrote:
> Are the arrays all well-aligned in practice? Thus, would versioning the loop
> for all-good-alignment help?
I suspect yes on 64-bit, but no on 32-bit, because malloc does not return 128-bit
aligned memory in 32-bit mode; it only returns memory aligned to twice the
alignment of size_t. Long doubles on powerpc are 128 bits, as are the vector
types.
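As a rough illustration, the alignment test the versioned loop needs boils down to checking the low four address bits; `is_vec_aligned` here is a hypothetical helper for illustration, not something from the patch:

```c
#include <stdint.h>

/* Hypothetical helper (not from the patch): returns nonzero if p is
   aligned to a 16-byte (128-bit) boundary, the alignment the vector
   loads want.  A 32-bit malloc that only guarantees twice the
   alignment of size_t (8 bytes) can fail this check.  */
static int is_vec_aligned(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}
```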
I did a test, eliminating the vec_realign stuff under switch control. This has
the effect of versioning the loop into a vector loop that runs when all
pointers are aligned, and a scalar loop that runs when they aren't. I ran
SPEC 2006 in 32-bit mode, and I see the following differences (omitting the
benchmarks that are close enough):
Benchmark % of baseline
========= =============
400.perlbench 96.09%
429.mcf 104.50%
456.hmmer 95.85%
458.sjeng 104.23%
464.h264ref 112.18%
483.xalancbmk 102.35%
410.bwaves 107.02%
416.gamess 96.01%
433.milc 98.90%
434.zeusmp 94.92%
435.gromacs 105.55%
450.soplex 108.58%
453.povray 103.71%
454.calculix 97.54%
459.GemsFDTD 97.35%
465.tonto 97.79%
470.lbm 98.56%
481.wrf 87.11%
482.sphinx3 110.33%
I was hoping that doing the versioning for an aligned loop and unaligned loop
would eliminate the percentages under 100%.
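The versioning described above might be sketched in scalar C like this (names are illustrative and the "vector" path is a scalar stand-in; the real version would use 16-byte vector loads and stores):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of alignment-based loop versioning (illustrative, not the
   patch's code): run the fast path only when every pointer is
   128-bit aligned, otherwise fall back to the scalar loop.  */
static void scaled_add(float *dst, const float *src, float k, size_t n)
{
    size_t i = 0;
    if ((((uintptr_t)dst | (uintptr_t)src) & 15) == 0) {
        /* All pointers 16-byte aligned: the vectorized version would
           run here, 4 floats per iteration.  */
        for (; i + 4 <= n; i += 4)
            for (size_t j = 0; j < 4; j++)
                dst[i + j] += k * src[i + j];
    }
    /* Scalar loop: unaligned case, and the vector loop's remainder.  */
    for (; i < n; i++)
        dst[i] += k * src[i];
}
```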
Note, the powerpc VSX memory instructions for V4SF/V4SI types can run even if
the pointer is not aligned to a 128-bit boundary, but there is a slowdown when
they get pointers that aren't aligned to a 64-bit boundary. I'm doing a run
right now with movmisalign enabled for V4SF/V4SI, and I am seeing some
regressions in the run.
> If we have 4 permutes and then 8 further ones - can we combine for example
> an unaligned load permute and the following permute for the sf->df conversion?
I don't think so.
The unaligned stuff loads a 128-bit value into a register from a left half, a
right half, and a permute mask. The Altivec instruction set has an instruction
(lvsl) that computes the mask from the address, and the loads and stores
ignore the bottom 4 bits of the address.
The unaligned loop looks something like:

    left = vector_load (addr & -16);
    mask = lvsl (addr);
    for (...) {
      addr += 16;
      right = vector_load (addr & -16);
      value = permute (left, right, mask);
      /* ... */
      left = right;
    }
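As a rough scalar model of that loop body: lvsl produces the byte-permute mask {s, s+1, ..., s+15} for s = addr & 15, and vperm selects those bytes out of the 32-byte concatenation of the two aligned loads. `emul_lvsl` and `emul_vperm` below are illustrative stand-ins, not the real intrinsics:

```c
#include <stdint.h>
#include <string.h>

/* Scalar emulation (for illustration) of the lvsl mask: byte i of the
   mask is (addr & 15) + i.  */
static void emul_lvsl(uint8_t mask[16], const void *addr)
{
    uint8_t s = (uint8_t)((uintptr_t)addr & 15);
    for (int i = 0; i < 16; i++)
        mask[i] = (uint8_t)(s + i);
}

/* Scalar emulation of vperm: each mask byte indexes into the 32-byte
   concatenation of the two source registers.  */
static void emul_vperm(uint8_t out[16], const uint8_t left[16],
                       const uint8_t right[16], const uint8_t mask[16])
{
    uint8_t both[32];
    memcpy(both, left, 16);
    memcpy(both + 16, right, 16);
    for (int i = 0; i < 16; i++)
        out[i] = both[mask[i] & 31];
}
```

Given the aligned loads around an unaligned address, the permuted result is the 16 bytes starting at that address, which is exactly what the loop above reconstructs each iteration.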
The two permutes for the conversion get the values into the correct place for
the conversion instruction, i.e. if you have a vector with the parts:
+====+====+====+====+
| A | B | C | D |
+====+====+====+====+
The first permute (xxmrghw) in the conversion would create a vector:
+====+====+====+====+
| A | A | B | B |
+====+====+====+====+
and the second (xxmrglw) would create:
+====+====+====+====+
| C | C | D | D |
+====+====+====+====+
Note, each value appears twice because the instruction takes 2 registers as
input, and we just pass the same register for both inputs.
The xvcvspdp instruction then takes a vector of the form (ignoring the 2nd and
4th fields):
+====+====+====+====+
| X | ?? | Y | ?? |
+====+====+====+====+
and converts it to double precision:
+=========+=========+
| X | Y |
+=========+=========+
> Does ppc have a VSX tuned cost-model and is it applied correctly in this case?
> Maybe we need more fine-grained costs?
The ppc has a cost model, but as I said in PR 50031, I think it needs to be
improved.
--
Michael Meissner, IBM
5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
meissner@linux.vnet.ibm.com fax +1 (978) 399-6899