[ Forwarded from http://bugs.debian.org/512050 ]

brian m. carlson reports the following problem with gcc 4.3 and trunk:

Attached is a C file that is compiled with -O3. mul and mul2 perform the same operation; mul uses a loop, and mul2 uses SSE intrinsics. mul2 results in three instructions, whereas mul results in many, many more. Obviously, since the two functions do the exact same thing, they should be optimized to be identical. Instead, mul is pessimized.

Note that there are no alignment issues present since the arrays declared in main are 16-byte aligned (since they are allocated on the stack, which is 16-byte aligned on x86_64).

And:

I also just noted that gcc-4.1 and gcc-4.2 produce much less bad code: they each use 8 movss and 4 mulss. Nevertheless, they still do not convert the code into three SSE instructions.
Testcase:

#include <stdio.h>
#include <xmmintrin.h>

#ifndef MUL
#define MUL mul
#endif

void mul(float in1[4], float in2[4], float out[4])
{
    int i;
    for (i = 0; i < 4; i++)
        out[i] = in1[i] * in2[i];
}

void mul2(float in1[4], float in2[4], float out[4])
{
    __m128 a, b, c;
    a = _mm_load_ps(in1);
    b = _mm_load_ps(in2);
    c = _mm_mul_ps(a, b);
    _mm_store_ps(out, c);
}

int main(void)
{
    float inp1[] = { 1.2, 3.5, 1.7, 2.8 };
    float inp2[] = { -0.7, 2.6, 3.3, -4.0 };
    float outp[4];
    MUL(inp1, inp2, outp);
    printf("%f %f %f %f\n", outp[0], outp[1], outp[2], outp[3]);
    return 0;
}
The following case isn't vectorized with -O3 on x86_64 either, although the arrays are aligned:

#include <stdio.h>

float __attribute__((aligned(16))) in1[] = { 1.2, 3.5, 1.7, 2.8 };
float __attribute__((aligned(16))) in2[] = { -0.7, 2.6, 3.3, -4.0 };
float __attribute__((aligned(16))) out[4];

void __attribute__((noinline)) mul()
{
    int i;
    for (i = 0; i < 4; i++)
        out[i] = in1[i] * in2[i];
}

int main(void)
{
    mul();
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
> void mul(float in1[4], float in2[4], float out[4])

Those arrays are not known to be aligned.

The other one I have to look into.
(In reply to comment #2)

That is because early complete unrolling runs first and unrolls the loop, so the autovectorizer no longer has a loop to work on. If I increase the trip count from 4 to 16, the loop is vectorized.

So the original testcase is invalid for two reasons: aliasing and alignment. Aliasing, because out could overlap with in1/in2; restrict fixes that. Alignment then comes into play because there is no way to say that the incoming arguments are 16-byte aligned.
t.c:11: note: cost model: Adding cost of checks for loop versioning to treat misalignment.
t.c:11: note: cost model: Adding cost of checks for loop versioning aliasing.