[PATCH, i386]: Fix PR target/34682, 70% slowdown with SSE enabled

Uros Bizjak <ubizjak@gmail.com>
Tue Jan 8 16:52:00 GMT 2008


On Jan 8, 2008 2:58 PM, Jan Hubicka <hubicka@ucw.cz> wrote:

> This is because we further split r-m-w instructions into RISC-like
> sequences on many modern CPUs when an integer reg happens to be
> available.  This helps scheduling.  The point is that read-write
> pairs are faster if executed through the integer unit.  Turning your
> testcase into a benchmark:
>
> double a[256];
>
> int
> main (void)
> {
>   int i;
>   int b;
>
>   for (b = 0; b < 10000000; b++)
>   for (i = 0; i < 256; i++)
>     a[i] = -a[i];
> }
>
> > On Jan 8, 2008 1:41 PM, Jan Hubicka <jh@suse.cz> wrote:
> >
> > But this doesn't work as expected, neither for -mfpmath=sse nor for
> > -mfpmath=387. I have tried the 4.0, 4.1, 4.2 and 4.3 [patched /
> > unpatched] branches with the following testcase:
> >
> > --cut here--
> > double a[256];
> >
> > void test (void)
> > {
> >         int i;
> >
> >         for (i = 0; i < 256; i++)
> >                 a[i] = -a[i];
> > }
> > --cut here--
> >
> > There were no r-m-w instructions, always fchs and xor, no matter
> > whether the data was float or double.

<snip>

> a.out is built with GCC 3.3.5, which uses xor, while b.out is mainline, on an Athlon XP.

Thanks for these timings. I certainly agree that these instructions
are faster, but there is still a real danger of partial memory access
with these patterns, at least until PR 22332 is addressed. The
problem is that, when the value lives in memory, the RA can be tricked
into generating an integer r-m-w insn that is then followed by an FP
insn [so an int->mem->fp reload is needed], as in the snippet below:

--cut here--
  dtime();

  for (i = 1; i <= m; i++)
    {
      s = -s;
      sa = sa + s;
    }

  dtime();
--cut here--

.L4:
        xorb    $-128, -17(%ebp)
        addl    $1, %eax
        cmpl    $512000001, %eax
        addsd   -24(%ebp), %xmm0
        jne     .L4

Unfortunately, disparaging the mem alternative was not enough to avoid
this situation, so all support for mem operands has to be removed
from the patterns. And IMO 530% slower code for a quite common
construct certainly outweighs (non-working ATM) 200% slower code.
However, I do propose that we reopen this issue once live-range (LR)
splitting is properly implemented in GCC.

Thanks for comments,
Uros.


