using vector extension in gcc slows down my code
Da Zheng
zhengda1936@gmail.com
Fri Feb 12 10:08:00 GMT 2010
Hi,
On 10-2-12 ä¸Šåˆ2:31, Brian Budge wrote:
> Hi Zheng -
>
> Your code can be sped up significantly even without SSE. Most
> important is removing the modulo operators, which (unless gcc's
> optimizer is REALLY good), is definitely taking a lot of your time.
> (Note that doing things with SSE or SIMD in general might require a
> little bit of thinking in a different way)
>
> You started with this:
>
> for (v1 = 0; v1 < MATRIX_Y; v1++) {
> for (v2 = 0; v2 < MATRIX_X; v2++) {
> double v;
>
> v = in[(v1 + startp1 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
> v += in[(v1 + startp2 + MATRIX_Y) % MATRIX_Y * MATRIX_X + v2];
> v += in[v1 * MATRIX_X + (v2 + startp3 + MATRIX_X) % MATRIX_X];
> v += in[v1 * MATRIX_X + (v2 + startp4 + MATRIX_X) % MATRIX_X];
> v *= (bits[(v1 * MATRIX_X + v2) / 32] >> (31 - (v1 *
> MATRIX_X + v2) % 32)) & 1;
> v *= 0.25;
> v += in2[v1 * MATRIX_X + v2];
> out[v1 * MATRIX_X + v2] = v;
> out2[v1 * MATRIX_X + v2] = fabs(in[v1 * MATRIX_X + v2] - v);
> }
> }
>
> The key is to realize that the code is very simple other than at the
> boundary conditions (all around the outside of the matrix). This can
> allow us to remove the modulos, and yields something like this:
>
> inline void filterRow(double *oy, double *oy2, double *in2,
> double *ym1, double *y, double *yp1, int *b) {
> double y_0 = *y;
> double *yLoopEnd = ym1 + M_MATRIX-1;
>
> double l = *(y+MATRIX_X);
> double r *(y-1);
> double d = *ym1++;
> double u = *yp1++;
>
> double v = l + r + d + u;
> v *= 0.25 * *b++;
> v += *in2++;
> *oy++ = v;
> *oy2++ = fabs(*y++ - v);
>
> for(; ym1 != yLoopEnd; ++ym1) {
> l = *(y-1);
> r = *(y-1);
> d = *ym1; //incremented in loop
> u = *yp1++;
>
> v = l + r + d + u;
> v *= 0.25 * *b++;
> v += *in2++;
> *oy++ = v;
> *oy2++ = fabs(*y++ - v);
> }
>
> l = *(y-1);
> d = *ym1;
> u = *yp1;
>
> v = l + y_0 + d + u;
> v *= 0.25 * *b;
> v += *in2;
> *oy = v;
> *oy2 = fabs(*y - v);
> }
>
> double *y_0 = in;
> double *ym1 = in + (MATRIX_Y-1) * MATRIX_X;
> double *yp1 = in + MATRIX_X;
>
> // note that I'm now assuming that "b" is bits precomputed and ready
> // to use at each index... exercise left to reader
> filterRow(out, out2, in2, ym1, in, yp1, b);
>
> for(v1 = 0; v1 < MATRIX_Y-1; ++v1) {
> out += MATRIX_X;
> out2 += MATRIX_X;
> in2 += MATRIX_X;
> ym1 = in;
> in = yp1;
> yp1 += MATRIX_X;
> b += MATRIX_X;
> filterRow(out, out2, in2, ym1, in, yp1, b);
> }
>
> filterRow(out+MATRIX_X, out2+MATRIX_X, in2+MATRIX_X, in, yp1, yp1+MATRIX_X,
> b+MATRIX_X);
>
> This code (which is untested and hasn't even been compiled, so don't
> expect it to work without some effort) should blow the socks off the
> previous version. Note with the new layout how easy it would be to
> load two adjacent elements in the inner loop. In fact, the compiler
> may be able to vectorize this code without any effort from you.
>
> This should get you started, but I don't want to go and give the whole
> answer away. Play with it for a while. Look at _mm_loadu_pd for
> unaligned loads. If you wanted to be clever, you could unroll the
> loop yourself and do half aligned loads and half unaligned.
I actually eliminated some % operations in the inner loop by splitting it into
three ones (the code is supposed to be generated by a compiler, so I don't
expect it to be as efficient as hand-coded), quite similar as what you did in
filterRow(), but I didn't paste it because I thought it could make the logic of
my code more difficult to understand. It's kind of strange. I don't see much
speedup in Intel dual core, but there is big speedup in AMD chips. I didn't
expect AMD and Intel chips to have so much difference.
I finally understand why there is no built-in functions for aligned loads. The code
v2df *vp1 = (v2df *) ((double *) v25.valp);
v2df v = vp1[v2 / STRIDE_LEN];
has already done the work. I tried that at the beginning, but didn't align data
and got segmentation fault. I thought I had to use some built-in function to
load data. I use aligned loads now, and do see some speedup.
Best regards,
Zheng Da
More information about the Gcc-help
mailing list