Advice about using SIMD extensions

Richard Beare
Mon Feb 28 20:19:00 GMT 2005

Hi everyone,

I've done a few more experiments using various pieces of advice that 
have come back from the list. They shed some light on the problems I've 
been having.

Before I summarize the results I should mention that my initial 
motivation for this was to improve arithmetic operations on images. Any 
changes I propose would need to be compatible with our imaging libraries 
in which an image is simply a 1D array.

The errors pointed out to me were:

1)  Use -march=pentium4 instead of -mcpu=pentium4

2) Use

typedef float myvec __attribute__ ((vector_size (16)));

instead of

typedef int myvec __attribute__ ((mode(V4SF)));

The latter is not compatible with newer versions of gcc (>3.4?).

These changes certainly improved the performance of the test I posted to 
the list.
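
For reference, a minimal sketch of how the vector_size typedef might be applied to element-wise addition on a flat 1D image buffer. The function name add_images and the use of memcpy for unaligned loads and stores are assumptions made for illustration; this is not the test that was posted to the list. On the Pentium 4 it would be built with the flag from point 1, for example gcc -O2 -march=pentium4.

```c
#include <stddef.h>
#include <string.h>

/* 16-byte vector holding four floats, declared with vector_size. */
typedef float myvec __attribute__ ((vector_size (16)));

/* Hypothetical helper: dst[i] = a[i] + b[i] over a flat 1D buffer.
   memcpy handles the loads and stores so the buffers need no special
   alignment; an optimizing compiler normally turns each copy into a
   single unaligned vector move. */
void add_images(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        myvec va, vb, vr;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vr = va + vb;                 /* element-wise add of four floats */
        memcpy(dst + i, &vr, sizeof vr);
    }

    for (; i < n; i++)                /* scalar tail for leftover pixels */
        dst[i] = a[i] + b[i];
}
```

The scalar tail keeps the routine correct when the pixel count is not a multiple of four.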

However when I went back to test my image arithmetic code with these 
changes I found no difference.

I then did some more tests, which are summarized in the attached graph. 
These demonstrate, I think, that I was experiencing a cache problem with 
my image code. The images I was experimenting with were 1600x1300, so 
way too large to fit in cache.

I now need to do some thinking, and more advice would be appreciated. 
I'm going to experiment with oprofile to see what it tells me, but 
haven't done so yet.

I had always thought that accessing array elements in raster order 
should be cache neutral, but that doesn't seem to be the case. I'm not 
sure what governs the size of the data being loaded into the cache.

Can anything be done about it without changing underlying data 
structures in my code?
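
One approach that leaves the 1D array layout untouched is to process the image in chunks small enough to stay in cache, and to apply a chain of operations to each chunk before moving on, so intermediate results are not streamed out to main memory and back. The sketch below assumes the image code chains whole-image operations (here an add followed by a multiply) on float pixels. The name add_then_mul and the CHUNK size are hypothetical, and the chunk size would need tuning on the actual machine.

```c
#include <stddef.h>

/* Assumed chunk length in pixels (16 KB of floats); tune per machine. */
#define CHUNK 4096

/* Hypothetical example: out = (a + b) * c, computed chunk by chunk so
   that the temporary holding a + b never leaves the cache. */
void add_then_mul(float *out, const float *a, const float *b,
                  const float *c, size_t n)
{
    float tmp[CHUNK];

    for (size_t start = 0; start < n; start += CHUNK) {
        size_t len = (n - start < CHUNK) ? (n - start) : CHUNK;

        /* First operation writes into the small temporary. */
        for (size_t i = 0; i < len; i++)
            tmp[i] = a[start + i] + b[start + i];

        /* Second operation reads the temporary while it is still cached. */
        for (size_t i = 0; i < len; i++)
            out[start + i] = tmp[i] * c[start + i];
    }
}
```

If the real code performs only a single pass per image, chunking of this kind is unlikely to help, since every pixel still has to be fetched from main memory once.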

As an aside, can anyone recommend example macros for unrolling loops?
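
A minimal sketch of one possible unrolling macro follows. The names REPEAT4 and sub_images are hypothetical and not taken from any library. GCC can also unroll loops itself with -funroll-loops, which may make a hand-written macro unnecessary.

```c
#include <stddef.h>

/* Hypothetical macro: expand a statement four times in a row. */
#define REPEAT4(stmt) do { stmt; stmt; stmt; stmt; } while (0)

/* dst[i] = a[i] - b[i], with the main loop body unrolled by four. */
void sub_images(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Unrolled part: four pixels per loop iteration. */
    while (i + 4 <= n) {
        REPEAT4(dst[i] = a[i] - b[i]; i++);
    }

    /* Remainder loop for the last n % 4 pixels. */
    while (i < n) {
        dst[i] = a[i] - b[i];
        i++;
    }
}
```

The macro argument must not contain a top-level comma, otherwise the preprocessor will treat it as several arguments.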

Thanks very much.

Brian Budge wrote:
> In the example above, it's not only register allocation, but also
> scheduling.  The data needs to be loaded from memory, and how that
> happens can affect performance quite a bit.
> And yeah, I can't understand how 8.1 could get decent performance
> without instruction scheduling... but maybe I'm stuck in my own little
> RISC processing world (the (toy) compilers I have written have been
> for SPARC and MIPS), and I just don't understand enough about how the
> pentium works.
>   Brian
> On Fri, 25 Feb 2005 14:24:27 -0500, Daniel Berlin <> wrote:
>>On Fri, 2005-02-25 at 12:18 +0100, Brian Budge wrote:
>>>Hmmm, I doubt that.  It seems very important that your data be in
>>>registers when you want to do arithmetic on it.
>>That's register allocation, not scheduling :)
>>>I can see that if your data was already in registers, maybe a
>>>"randomized" instruction ordering would perform okay, but loading the
>>>data properly is time consuming.  At least these are the things I've
>>stevenb was the source of this information for me, so maybe he can
>>confirm it (Steven, i mentioned to brian that icc 8.1 doesn't do
>>scheduling for the pentium4 anymore, and he doubts it :P)

Richard Beare, CSIRO Mathematical & Information Sciences
Locked Bag 17, North Ryde, NSW 1670, Australia
Phone: +61-2-93253221 (GMT+~10hrs)  Fax: +61-2-93253200
-------------- next part --------------
A non-text attachment was scrubbed...
Name: relative_speed.pdf
Type: application/pdf
Size: 2112 bytes
Desc: not available