This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
> On Sat, Sep 08, 2007 at 06:56:35PM -0500, Jagasia, Harsha wrote:
> > Hi Honza, H.J,
> >
> > >> Amdfam10 prefers doing packed conversions targeting an SSE register
> > >> rather than scalar.
> > >> This means basically the following replacements:
> > >>
> > >>  - cvtsi2sd -> movd + cvtdq2pd
> > >>  - cvtsi2ss -> movd + cvtdq2ps
> > >
> > > Can you disable them for -mtune=generic if an extra pair of
> > > memory load/store is added?
> >
> > Instead of disabling them, would it help to do the below?
> >
> > Replace:
> >   cvtsi2sd reg32, xmm
> > with:
> >   mov reg32, mem32
> >   cvtsi2sd mem32, xmm
> >
> > This could work for cvtsi2ss and could also work for reg64.
>
> That is one kind of extra pair of memory load/store I was referring
> to. It is bad for Core 2 Duo.

Hi,
I ran the attached microbenchmark comparing the three alternatives on a
simple loop:

scalar:
  cvtsi2sd reg32, xmm

mem:
  mov reg32, mem32
  cvtsi2sd mem32, xmm

packed:
  mov reg32, mem32
  movd mem32, xmm
  cvtdq2pd xmm, xmm

The results for Core 2 are, quite surprisingly, in favour of packed:

packed
  real 0m1.961s  user 0m1.960s  sys 0m0.000s
  real 0m1.962s  user 0m1.964s  sys 0m0.000s
  real 0m1.963s  user 0m1.964s  sys 0m0.000s

scalar
  real 0m2.152s  user 0m2.152s  sys 0m0.000s
  real 0m2.153s  user 0m2.152s  sys 0m0.000s
  real 0m2.153s  user 0m2.152s  sys 0m0.000s

mem
  real 0m2.027s  user 0m2.020s  sys 0m0.008s
  real 0m2.026s  user 0m2.024s  sys 0m0.000s
  real 0m2.027s  user 0m2.028s  sys 0m0.000s

K8 is, quite unsurprisingly, in favour of the scalar variant, with
packed and mem being roughly the same:

packed
  real 0m3.485s  user 0m3.460s  sys 0m0.024s
  real 0m3.506s  user 0m3.504s  sys 0m0.000s
  real 0m3.467s  user 0m3.460s  sys 0m0.004s

scalar
  real 0m3.284s  user 0m3.276s  sys 0m0.004s
  real 0m3.288s  user 0m3.276s  sys 0m0.008s
  real 0m3.290s  user 0m3.280s  sys 0m0.004s

mem
  real 0m3.481s  user 0m3.432s  sys 0m0.000s
  real 0m3.396s  user 0m3.396s  sys 0m0.000s
  real 0m3.574s  user 0m3.540s  sys 0m0.000s

So the slowdown you saw might be caused by something else?

Honza
int a[100];
int b[100];
double c[100];

int main(void)
{
  int i, j;
  for (i = 0; i < 10000000; i++)
    for (j = 0; j < 100; j++)
      c[j] = a[j] + b[j];  /* int sum converted to double: the cvtsi2sd site */
  return 0;
}
Attachment: micro-packed.s (text document)
Attachment: micro-scalar.s (text document)
Attachment: microb (text document)
Attachment: micro-mem.s (text document)