Advice about using SIMD extensions

Sat Feb 26 02:44:00 GMT 2005

Hmmm, I doubt that.  It seems very important that your data be in
registers when you want to do arithmetic on it.

I can see that if your data was already in registers, maybe a
"randomized" instruction ordering would perform okay, but loading the
data properly is time consuming.  At least these are the things I've
observed.
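
To make the point about registers concrete, here is a minimal sketch
in C using the SSE intrinsics from <xmmintrin.h>.  The function name
madd4 and the restriction that n be a multiple of 4 are assumptions
made for illustration, not code from the library discussed in this
thread.  Each operand is loaded into a register once, all the
arithmetic happens on register values, and the result is stored once.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics (__m128, _mm_*_ps) */

/* Hypothetical helper: dst[i] = a[i] * b[i] + c[i] for i in [0, n).
 * Assumes n is a multiple of 4; the pointers need not be 16-byte
 * aligned because unaligned loads and stores are used. */
void madd4(float *dst, const float *a, const float *b,
           const float *c, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* Load each operand into an XMM register once. */
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);

        /* Do all the arithmetic on values already in registers. */
        __m128 r = _mm_add_ps(_mm_mul_ps(va, vb), vc);

        /* Write the result back to memory once. */
        _mm_storeu_ps(dst + i, r);
    }
}
```

Whether the compiler keeps va, vb and vc in registers once this is
inlined into a larger function depends on register pressure and on the
scheduler, which is the concern raised above.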

  Brian

On Thu, 24 Feb 2005 11:43:23 -0500, Daniel Berlin <dberlin@dberlin.org> wrote:
> On Thu, 2005-02-24 at 13:48 +0100, Brian Budge wrote:
> > Daniel -
> >
> > Yeah, that's what I meant... but wouldn't optimal scheduling be nice ;)
> >
> > I've been noticing this on a pentium4 (which it seemed was also what
> > Richard was using).
> >
> > It seems like SSE would be a pretty widely used target, and that's why
> > I was surprised to get slowdowns on even simple vector
> > additions/multiplies/etc...
> > when mixed with other code.  If I ran very contrived examples, things
> > ran very fast, but as soon as I put my library into an application, I
> > noticed that things were slower, despite some things being calculated
> > 4 times as fast.
> >
> > It seems that you must use the intrinsics the same way that you'd
> > write the assembly in order to get decent results.
> 
> You shouldn't have to.
> The whole advantage of the intrinsics is that they are scheduled :).
> 
> Anyway, looking at the scheduler descriptions, I don't see the p4
> including any sort of vector scheduling.
> 
> The athlon description looks like it does.
> Try -mcpu=k8 and see if it is any better.
> 
> I should note that AFAIK, Intel's compiler doesn't actually do
> scheduling for the pentium4 anymore, because it wasn't worth it.  Maybe
> that doesn't apply to vector instructions (or maybe the person who told
> me this was wrong).
> 
>