Re: SSE conversion optimization


> On Sat, Sep 08, 2007 at 06:56:35PM -0500, Jagasia, Harsha wrote:
> > Hi Honza, H.J,
> > >> Amdfam10 prefers doing packed conversions with an SSE register
> > >> destination rather than scalar ones.
> > >> This basically means the following replacements:
> > >>
> > >> -      cvtsi2sd -> movd + cvtdq2pd
> > >> -      cvtsi2ss -> movd + cvtdq2ps
> > >
> > >Can you disable them for -mtune=generic if they add an extra
> > >memory load/store pair?
> > 
> > Instead of disabling them, would it help to do the following?
> > 
> > Replace:
> > cvtsi2sd reg32, xmm
> > with:
> > mov reg32, mem32
> > cvtsi2sd mem32, xmm
> > 
> > This could work for cvtsi2ss and could also work for reg64.
> > 
> 
> That is one kind of extra memory load/store pair I was referring to.
> It is bad for Core 2 Duo.

Hi,
I ran the attached microbenchmark comparing the three alternatives on a
simple loop (a rough sketch of the packed inner loop follows the list):
  scalar:
    cvtsi2sd reg32, xmm
  mem:
    mov reg32, mem32
    cvtsi2sd mem32, xmm
  packed:
    mov reg32, mem32
    movd mem32, xmm
    cvtdq2pd xmm, xmm
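
For reference, here is roughly what the packed variant's inner loop
looks like in AT&T syntax (registers, labels and addressing are
illustrative only; the attached micro-packed.s is the code that was
actually timed):

.Lpacked_loop:
	movl	a(,%rax,4), %edx	# load a[j]
	addl	b(,%rax,4), %edx	# edx = a[j] + b[j]
	movl	%edx, -4(%rsp)		# spill the integer sum to the stack
	movd	-4(%rsp), %xmm0		# reload it into an SSE register
	cvtdq2pd %xmm0, %xmm0		# packed int -> double conversion
	movsd	%xmm0, c(,%rax,8)	# store c[j]
	addq	$1, %rax
	cmpq	$100, %rax
	jne	.Lpacked_loop
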
The results for Core 2 are, quite surprisingly, in favour of
packed:

packed

real	0m1.961s
user	0m1.960s
sys	0m0.000s

real	0m1.962s
user	0m1.964s
sys	0m0.000s

real	0m1.963s
user	0m1.964s
sys	0m0.000s
scalar

real	0m2.152s
user	0m2.152s
sys	0m0.000s

real	0m2.153s
user	0m2.152s
sys	0m0.000s

real	0m2.153s
user	0m2.152s
sys	0m0.000s
mem

real	0m2.027s
user	0m2.020s
sys	0m0.008s

real	0m2.026s
user	0m2.024s
sys	0m0.000s

real	0m2.027s
user	0m2.028s
sys	0m0.000s

K8 is, quite unsurprisingly, in favour of the scalar variant, with
packed and mem being roughly the same.

packed

real    0m3.485s
user    0m3.460s
sys     0m0.024s

real    0m3.506s
user    0m3.504s
sys     0m0.000s

real    0m3.467s
user    0m3.460s
sys     0m0.004s
scalar

real    0m3.284s
user    0m3.276s
sys     0m0.004s

real    0m3.288s
user    0m3.276s
sys     0m0.008s

real    0m3.290s
user    0m3.280s
sys     0m0.004s
mem

real    0m3.481s
user    0m3.432s
sys     0m0.000s

real    0m3.396s
user    0m3.396s
sys     0m0.000s

real    0m3.574s
user    0m3.540s
sys     0m0.000s

So the slowdown you saw might be caused by something else?
Honza
int a[100];
int b[100];
double c[100];

int
main (void)
{
  int i, j;

  /* a[j] + b[j] is an int sum that must be converted to double before
     the store, so the int -> double conversion sequence is exercised
     on every iteration.  */
  for (i = 0; i < 10000000; i++)
    for (j = 0; j < 100; j++)
      c[j] = a[j] + b[j];
  return 0;
}
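
For comparison, rough sketches of the scalar and mem inner loops, again
with illustrative registers and labels (the attached micro-scalar.s and
micro-mem.s contain the code the timings above were taken from):

.Lscalar_loop:
	movl	a(,%rax,4), %edx	# load a[j]
	addl	b(,%rax,4), %edx	# edx = a[j] + b[j]
	cvtsi2sd %edx, %xmm0		# scalar conversion straight from the register
	movsd	%xmm0, c(,%rax,8)	# store c[j]
	addq	$1, %rax
	cmpq	$100, %rax
	jne	.Lscalar_loop

.Lmem_loop:
	movl	a(,%rax,4), %edx	# load a[j]
	addl	b(,%rax,4), %edx	# edx = a[j] + b[j]
	movl	%edx, -4(%rsp)		# the extra store...
	cvtsi2sdl -4(%rsp), %xmm0	# ...so the conversion can use a memory operand
	movsd	%xmm0, c(,%rax,8)	# store c[j]
	addq	$1, %rax
	cmpq	$100, %rax
	jne	.Lmem_loop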

Attachment: micro-packed.s
Description: Text document

Attachment: micro-scalar.s
Description: Text document

Attachment: microb
Description: Text document

Attachment: micro-mem.s
Description: Text document

