using vector extension in gcc slows down my code
Da Zheng
zhengda1936@gmail.com
Thu Feb 11 08:54:00 GMT 2010
On 10-2-11 12:13 AM, Brian Budge wrote:
> This makes a difference because the SSE unit can do two single loads,
> an add, and a store, and it can be easily pipelined. The ratio of
> load/store to math is not ideal, but if you consider the amount of
> work to do 2 doubles instead (4 loads, 2 adds, and 2 stores), it's
> still beneficial. You're also using unaligned loads and stores, which
> for some architectures is very bad, and is usually less good than
> aligned loads and stores. Moreover, in your case, it's not just the
I wanted to use unaligned loads and stores, but I cannot find the corresponding
built-in functions for them.
> loads and stores, but all the integer math to calculate array indices,
The array indices have to be calculated in my code because the first 4 loads
in each iteration read from the same array. This way I can avoid a lot of
cache misses.
> etc... as well as using unions, which doesn't allow the results to
> remain in registers, which makes for a not-very-optimal result. Note
I tried not using unions, but the result seems to be even a little worse. I
don't know why.
> that if you are running on 64-bit, you are likely using SSE in the
I'm not sure. I think it's still 32-bit. How can I check?
> first version of your code, but its using the scalar path (only the
> first entry of each register).
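One way to check this, sketched under the assumption of GCC on x86:
`gcc -dumpmachine` prints the target the compiler builds for, and
`gcc -dM -E - < /dev/null | grep SSE` lists the SSE-related macros it
predefines. The same information can be read from inside a program. The names
pointer_bits and built_with_sse2 below are hypothetical helpers, not part of
any library.

```c
#include <limits.h>

/* Width of a pointer in bits: 32 on a 32-bit build, 64 on a 64-bit build. */
int pointer_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}

/* Returns 1 if the compiler was allowed to emit SSE2 instructions for this
   translation unit (GCC predefines __SSE2__ in that case), otherwise 0. */
int built_with_sse2(void)
{
#ifdef __SSE2__
    return 1;
#else
    return 0;
#endif
}
```

On a 64-bit x86 build this typically gives 64 and 1, since SSE2 is part of the
x86-64 baseline. On a 32-bit x86 build SSE2 usually has to be requested with
-msse2.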
>
> The code is pretty confusing. If I could understand what it's doing,
> I'd write you a version using the intel SSE intrinsics (see
> emmintrin.h and friends), that has a more appropriate data layout.
> Note that I'm simply assuming that this is possible, but there may be
> some valid reason why you cannot lay your data out in a SIMD-friendly
> way.
I have rewritten it to simulate what I really want to do (see the code below),
and I hope it helps you understand the logic of the code. The first 4 loads are
from the same array in order to reduce accesses to main memory (this way, it is
more likely that the needed data is already in the cache).
#define MATRIX_X 1000
#define MATRIX_Y 1000

double *in, *in2, *out, *out2;
int *bits;
int v1, v2;
struct timeval start_time, end_time;
int startp1, startp2, startp3, startp4;

startp1 = 1;
startp2 = -1;
startp3 = 1;
startp4 = -1;

in = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
in2 = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
out = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
out2 = malloc (MATRIX_X * MATRIX_Y * sizeof (double));
bits = malloc ((MATRIX_X * MATRIX_Y / 32 + 1) * sizeof (int));

for (v1 = 0; v1 < MATRIX_Y; v1++)
  {
    for (v2 = 0; v2 < MATRIX_X; v2++)
      {
        double v;

        v = in[(v1 + startp1 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
        v += in[(v1 + startp2 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
        v += in[v1 * MATRIX_X + (v2 + startp3 + MATRIX_X) % MATRIX_X];
        v += in[v1 * MATRIX_X + (v2 + startp4 + MATRIX_X) % MATRIX_X];
        v *= (bits[(v1 * MATRIX_X + v2) / 32]
              >> (31 - (v1 * MATRIX_X + v2) % 32)) & 1;
        v *= 0.25;
        v += in2[v1 * MATRIX_X + v2];
        out[v1 * MATRIX_X + v2] = v;
        out2[v1 * MATRIX_X + v2] = fabs(in[v1 * MATRIX_X + v2] - v);
      }
  }
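A note on the mask lookup in the loop above: for the linear cell index
i = v1 * MATRIX_X + v2, it reads word i / 32 of bits and takes bit
31 - i % 32, so the cells are packed 32 per word with the most significant bit
first. The helper below isolates that step as a small sketch. The name mask_bit
is hypothetical, and it assumes a 32-bit unsigned int instead of the int used
above so that the right shift is well defined.

```c
/* Returns the mask bit (0 or 1) for linear cell index i, assuming the bits
   are packed 32 per word, most significant bit first, as in the loop above.
   Assumes unsigned int is 32 bits wide. */
int mask_bit(const unsigned int *bits, int i)
{
    return (int)((bits[i / 32] >> (31 - i % 32)) & 1u);
}
```

For example, with bits[0] = 0x80000000 only cell 0 of the first 32 cells is
enabled.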
I'll really appreciate it if you could write a version using SSE and teach me
how to have better data layout.
Best regards,
Zheng Da
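
For reference, here is a minimal sketch of how one row of the loop above might
be written with the SSE2 intrinsics from emmintrin.h. It is not the version
asked for in this thread, only an illustration under stated assumptions: the
neighbour offsets are fixed at +1 and -1 as in the post, the function names
(mask_at, cell_scalar, stencil_row_sse2) are hypothetical, bits is taken as
unsigned int, and on 32-bit x86 the file has to be compiled with -msse2.
_mm_loadu_pd and _mm_storeu_pd are the unaligned load and store intrinsics;
_mm_load_pd and _mm_store_pd are the aligned ones and require 16-byte aligned
addresses. The left and right neighbour loads sit one double away from the
centre, so they cannot both be aligned, which is why the sketch uses the
unaligned forms throughout.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <math.h>

/* Mask value (0.0 or 1.0) for linear cell index i; bits are packed 32 per
   word, most significant bit first. Assumes 32-bit unsigned int. */
static double mask_at(const unsigned int *bits, int i)
{
    return (double)((bits[i / 32] >> (31 - i % 32)) & 1u);
}

/* Scalar reference for one cell, same operation order as the original loop.
   Used for the edge columns, where the neighbours wrap around. */
void cell_scalar(const double *in, const double *in2, const unsigned int *bits,
                 double *out, double *out2, int v1, int v2, int nx, int ny)
{
    int i = v1 * nx + v2;
    double v = in[(v1 + 1) % ny * nx + v2];
    v += in[(v1 - 1 + ny) % ny * nx + v2];
    v += in[v1 * nx + (v2 + 1) % nx];
    v += in[v1 * nx + (v2 - 1 + nx) % nx];
    v *= mask_at(bits, i);
    v *= 0.25;
    v += in2[i];
    out[i] = v;
    out2[i] = fabs(in[i] - v);
}

/* Processes row v1 of an nx-by-ny grid, two doubles per iteration for the
   interior columns and scalar code for the wrapped edges and the tail. */
void stencil_row_sse2(const double *in, const double *in2,
                      const unsigned int *bits, double *out, double *out2,
                      int v1, int nx, int ny)
{
    const double *next = in + ((v1 + 1) % ny) * nx;       /* row v1 + 1 */
    const double *prev = in + ((v1 - 1 + ny) % ny) * nx;  /* row v1 - 1 */
    const double *row = in + v1 * nx;
    const __m128d quarter = _mm_set1_pd(0.25);
    const __m128d sign = _mm_set1_pd(-0.0);  /* only the sign bits set */
    int base = v1 * nx;
    int v2;

    if (nx > 0)
        cell_scalar(in, in2, bits, out, out2, v1, 0, nx, ny);

    /* Lanes v2 and v2 + 1 must both be interior columns. */
    for (v2 = 1; v2 + 2 <= nx - 1; v2 += 2) {
        __m128d v, m, d;
        v = _mm_add_pd(_mm_loadu_pd(next + v2), _mm_loadu_pd(prev + v2));
        v = _mm_add_pd(v, _mm_loadu_pd(row + v2 + 1));
        v = _mm_add_pd(v, _mm_loadu_pd(row + v2 - 1));
        /* _mm_set_pd takes the high lane first, then the low lane. */
        m = _mm_set_pd(mask_at(bits, base + v2 + 1), mask_at(bits, base + v2));
        v = _mm_mul_pd(v, m);
        v = _mm_mul_pd(v, quarter);
        v = _mm_add_pd(v, _mm_loadu_pd(in2 + base + v2));
        _mm_storeu_pd(out + base + v2, v);
        d = _mm_sub_pd(_mm_loadu_pd(row + v2), v);
        /* Clearing the sign bit gives the absolute value of each lane. */
        _mm_storeu_pd(out2 + base + v2, _mm_andnot_pd(sign, d));
    }

    for (; v2 < nx; v2++)
        cell_scalar(in, in2, bits, out, out2, v1, v2, nx, ny);
}
```

Calling stencil_row_sse2 once for each v1 from 0 to MATRIX_Y - 1 covers the
whole grid. The mask is still extracted bit by bit in scalar code here; storing
it as an array of 0.0 and 1.0 doubles would let it be loaded like the other
operands, at the cost of more memory traffic, so whether that helps would need
measuring.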