# using vector extension in gcc slows down my code

Brian Budge brian.budge@gmail.com
Thu Feb 11 18:45:00 GMT 2010

```Hi Zheng -

Your code can be sped up significantly even without SSE.  Most
important is removing the modulo operators, which (unless gcc's
optimizer is REALLY good), is definitely taking a lot of your time.
(Note that doing things with SSE or SIMD in general might require a
little bit of thinking in a different way)

You started with this:

for (v1 = 0; v1 < MATRIX_Y; v1++) {
for (v2 = 0; v2 < MATRIX_X; v2++) {
double v;

v = in[(v1 + startp1 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
v += in[(v1 + startp2 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
v += in[v1 * MATRIX_X + (v2 + startp3 + MATRIX_X) % MATRIX_X];
v += in[v1 * MATRIX_X + (v2 + startp4 + MATRIX_X) % MATRIX_X];
v *= (bits[(v1 * MATRIX_X + v2) / 32] >> (31 - (v1 *
MATRIX_X + v2) % 32)) & 1;
v *= 0.25;
v += in2[v1 * MATRIX_X + v2];
out[v1 * MATRIX_X + v2] = v;
out2[v1 * MATRIX_X + v2] = fabs(in[v1 * MATRIX_X + v2] - v);
}
}

The key is to realize that the code is very simple other than at the
boundary conditions (all around the outside of the matrix).  This can
allow us to remove the modulos, and yields something like this:

inline void filterRow(double *oy, double *oy2, double *in2,
double *ym1, double *y, double *yp1, int *b) {
double y_0 = *y;
double *yLoopEnd = ym1 + M_MATRIX-1;

double l = *(y+MATRIX_X);
double r  *(y-1);
double d = *ym1++;
double u = *yp1++;

double v = l + r + d + u;
v *= 0.25 * *b++;
v += *in2++;
*oy++ = v;
*oy2++ = fabs(*y++ - v);

for(; ym1 != yLoopEnd; ++ym1) {
l = *(y-1);
r = *(y-1);
d = *ym1; //incremented in loop
u = *yp1++;

v = l + r + d + u;
v *= 0.25 * *b++;
v += *in2++;
*oy++ = v;
*oy2++ = fabs(*y++ - v);
}

l = *(y-1);
d = *ym1;
u = *yp1;

v = l + y_0 + d + u;
v *= 0.25 * *b;
v += *in2;
*oy = v;
*oy2 = fabs(*y - v);
}

double *y_0 = in;
double *ym1 = in + (MATRIX_Y-1) * MATRIX_X;
double *yp1 = in + MATRIX_X;

// note that I'm now assuming that "b" is bits precomputed and ready
// to use at each index... exercise left to reader
filterRow(out, out2, in2, ym1, in, yp1, b);

for(v1 = 0; v1 < MATRIX_Y-1; ++v1) {
out += MATRIX_X;
out2 += MATRIX_X;
in2 += MATRIX_X;
ym1 = in;
in = yp1;
yp1 += MATRIX_X;
b += MATRIX_X;
filterRow(out, out2, in2, ym1, in, yp1, b);
}

filterRow(out+MATRIX_X, out2+MATRIX_X, in2+MATRIX_X, in, yp1, yp1+MATRIX_X,
b+MATRIX_X);

This code (which is untested and hasn't even been compiled, so don't
expect it to work without some effort) should blow the socks off the
previous version.  Note with the new layout how easy it would be to
load two adjacent elements in the inner loop.  In fact, the compiler
may be able to vectorize this code without any effort from you.

This should get you started, but I don't want to go and give the whole
answer away.  Play with it for a while.  Look at _mm_loadu_pd for
unaligned loads.  If you wanted to be clever, you could unroll the
loop yourself and do half aligned loads and half unaligned.

Good luck,
Brian

On Wed, Feb 10, 2010 at 8:33 PM, Da Zheng <zhengda1936@gmail.com> wrote:
> On 10-2-11 上午12:13, Brian Budge wrote:
>> This makes a difference because the SSE unit can do two single loads,
>> an add, and a store, and it can be easily pipelined.  The ratio of
>> load/store to math is not ideal, but if you consider the amount of
>> still beneficial.  You're also using unaligned loads and stores, which
>> for some architectures is very bad, and is usually less good than
>> aligned loads and stores.  Moreover, in your case, it's not just the
> I wanted to use unaligned loads and stores, but I cannot find the corresponding
> built-in functions for them.
>> loads and stores, but all the integer math to calculate array indices,
> It is necessary to calculate array indices in my code because the first 4 loads
> in one iteration load data from the same array. In this case, I can reduce a lot
> of cache miss.
>> etc... as well as using unions, which doesn't allow the results to
>> remain in registers, which makes for a not-very-optimal result.  Note
> I tried not to use unions, but it seems the result is even a little worse. I
> don't know why.
>> that if you are running on 64-bit, you are likely using SSE in the
> I'm not sure of it. I think it's still 32-bit. How can I see it?
>> first version of your code, but its using the scalar path (only the
>> first entry of each register).
>>
>> The code is pretty confusing.  If I could understand what it's doing,
>> I'd write you a version using the intel SSE intrinsics (see
>> emmintrin.h and friends), that has a more appropriate data layout.
>> Note that I'm simply assuming that this is possible, but there may be
>> some valid reason why you cannot lay your data out in a SIMD-friendly
>> way.
> I rewrite it to simulate what I really want to do (see the code below) and hope
> it can help you understand the logic of the code. The first 4 loads are from the
> same array in order to save direct memory access (by doing so, it is more likely
> that the data needed is already in the cache).
>
> #define MATRIX_X 1000
> #define MATRIX_Y 1000
>        double *in, *in2, *out, *out2;
>        int *bits;
>        int v1, v2;
>        struct timeval start_time, end_time;
>        int startp1, startp2, startp3, startp4;
>
>        startp1 = 1;
>        startp2 = -1;
>        startp3 = 1;
>        startp4 = -1;
>        in = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
>        in2 = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
>        out = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
>        out2 = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
>        bits = malloc ((MATRIX_X * MATRIX_Y / 32 + 1) * sizeof (int));
>        for (v1 = 0; v1 < MATRIX_Y; v1++)
>        {
>                for (v2 = 0; v2 < MATRIX_X; v2++)
>                {
>                        double v;
>
>                        v = in[(v1 + startp1 + MATRIX_Y) % MATRIX_Y * MATRIX_X +
> v2];
>                        v += in[(v1 + startp2 + MATRIX_Y) % MATRIX_Y * MATRIX_X
> + v2];
>                        v += in[v1 * MATRIX_X + (v2 + startp3 + MATRIX_X) %
> MATRIX_X];
>                        v += in[v1 * MATRIX_X + (v2 + startp4 + MATRIX_X) %
> MATRIX_X];
>                        v *= (bits[(v1 * MATRIX_X + v2) / 32] >> (31 - (v1 *
> MATRIX_X + v2) % 32)) & 1;
>                        v *= 0.25;
>                        v += in2[v1 * MATRIX_X + v2];
>                        out[v1 * MATRIX_X + v2] = v;
>                        out2[v1 * MATRIX_X + v2] = fabs(in[v1 * MATRIX_X + v2] - v);
>                }
>        }
> I'll really appreciate it if you could write a version using SSE and teach me
> how to have better data layout.
>
> Best regards,
> Zheng Da
>

```