LRA vs reload on powerpc: 2 extra FAILs that are actually improvements?
Alan Modra
amodra@gmail.com
Sun Dec 1 05:55:00 GMT 2013
> On Sat, Nov 2, 2013 at 6:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> > The failure of pr53199.c is because of different instruction selection
> > for bswap. Test case is reduced to just one function:
[snip]
> > Is this an improvement or a regression? If it's an improvement then
> > these two test cases should be adjusted :-)
As David said, going through memory is bad: we get a load-hit-store
flush. This is definitely a regression on power7. Does anyone know why
the bswapdi2_64bit r,r alternative is disparaged? It seems to have been
that way since the original mainline commit.
int main (void)
{
  int i;
  long ret = 0;
  long tmp1, tmp2, tmp3;

  for (i = 0; i < 1000000000; i++)
#if MEM == 1
    /* From pr53199.c reg_reverse, -mlra -mcpu=power6 -mtune=power7. */
    __asm__ __volatile__ ("\
	addi %1,1,-16\n\
	srdi %3,%0,32\n\
	li %2,4\n\
	stwbrx %0,0,%1\n\
	stwbrx %3,%2,%1\n\
	ld %0,-16(1)" : "+r" (ret), "=&b" (tmp1), "=&r" (tmp2), "=&r" (tmp3));
#elif MEM == 2
    /* From pr53199.c reg_reverse, -mlra -mcpu=power6. */
    __asm__ __volatile__ ("\
	addi %1,1,-16\n\
	srdi %3,%0,32\n\
	addi %2,%1,4\n\
	stwbrx %0,0,%1\n\
	stwbrx %3,0,%2\n\
	ld %0,-16(1)" : "+r" (ret), "=&b" (tmp1), "=&b" (tmp2), "=&r" (tmp3));
#elif MEM == 3
    /* From pr53199.c reg_reverse, -mlra -mcpu=power7. */
    __asm__ __volatile__ ("\
	std %0,-16(1)\n\
	addi %1,1,-16\n\
	ldbrx %0,0,%1\n" : "+r" (ret), "=&b" (tmp1));
#else
    /* Register-only version, no trip through memory. */
    __asm__ __volatile__ ("\
	srdi %1,%0,32\n\
	rlwinm %2,%0,8,0xffffffff\n\
	rlwinm %3,%1,8,0xffffffff\n\
	rlwimi %2,%0,24,0,7\n\
	rlwimi %2,%0,24,16,23\n\
	rlwimi %3,%1,24,0,7\n\
	rlwimi %3,%1,24,16,23\n\
	sldi %2,%2,32\n\
	or %2,%2,%3\n\
	mr %0,%2" : "+r" (ret), "=&r" (tmp1), "=&r" (tmp2), "=&r" (tmp3));
#endif
  return ret;
}
/*
amodra@bns:~> gcc -O2 bswap_mem.c
amodra@bns:~> time ./a.out
real 0m3.096s
user 0m3.089s
sys 0m0.001s
amodra@bns:~> time ./a.out
real 0m3.096s
user 0m3.094s
sys 0m0.002s
amodra@bns:~> gcc -O2 -DMEM=1 bswap_mem.c
amodra@bns:~> time ./a.out
real 0m12.661s
user 0m12.657s
sys 0m0.003s
amodra@bns:~> time ./a.out
real 0m12.660s
user 0m12.657s
sys 0m0.003s
amodra@bns:~> gcc -O2 -DMEM=2 bswap_mem.c
amodra@bns:~> time ./a.out
real 0m12.660s
user 0m12.657s
sys 0m0.003s
amodra@bns:~> time ./a.out
real 0m12.660s
user 0m12.657s
sys 0m0.004s
amodra@bns:~> gcc -O2 -DMEM=3 bswap_mem.c
amodra@bns:~> time ./a.out
real 0m10.279s
user 0m10.276s
sys 0m0.003s
amodra@bns:~> time ./a.out
real 0m10.279s
user 0m10.276s
sys 0m0.003s
I also looked at the register version and the -DMEM=1 case with power7
simulators, finding that the register version had a delay of 12 cycles
from completion of the first instruction to completion of the last.
The -DMEM=1 case had a corresponding delay of 49 cycles, which matches
the loop timing above quite well (49/12 is about 4.08, and
12.660s/3.096s is about 4.09).
*/
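
For anyone who doesn't read rlwinm/rlwimi fluently, here is a rough
portable C sketch of what the register-only sequence computes. It is my
reading of the asm, not anything taken from rs6000.md, and the names
bswap32_rot, bswap64_regs and bswap64_selfcheck are made up for
illustration. Each 32-bit half is rotated left by 8, two bytes are then
patched in from a rotate left by 24, and the swapped halves are
recombined in the opposite order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte-reverse a 32-bit word using two rotates and
   masks, mirroring one rlwinm followed by two rlwimi inserts.  */
static uint32_t bswap32_rot (uint32_t x)
{
  uint32_t r8 = (x << 8) | (x >> 24);   /* rotate left by 8 */
  uint32_t r24 = (x << 24) | (x >> 8);  /* rotate left by 24 */
  /* Keep bytes 1 and 3 of r8, insert bytes 0 and 2 from r24
     (IBM bit ranges 0-7 and 16-23).  */
  return (r8 & 0x00ff00ffu) | (r24 & 0xff00ff00u);
}

/* Hypothetical helper: 64-bit byte reverse built from the two halves,
   as in the srdi ... sldi/or part of the sequence.  */
static uint64_t bswap64_regs (uint64_t x)
{
  uint32_t lo = (uint32_t) x;
  uint32_t hi = (uint32_t) (x >> 32);
  return ((uint64_t) bswap32_rot (lo) << 32) | bswap32_rot (hi);
}

/* Small sanity check of both helpers on fixed inputs.  */
static void bswap64_selfcheck (void)
{
  assert (bswap32_rot (0xaabbccddu) == 0xddccbbaau);
  assert (bswap64_regs (0x0102030405060708ULL) == 0x0807060504030201ULL);
}
```

The point of the sketch is only that the whole dependency chain stays in
registers, so there is no store for a later load to hit.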
--
Alan Modra
Australia Development Lab, IBM