Bug 38899 - pessimizes function without SSE intrinsics
Summary: pessimizes function without SSE intrinsics
Status: RESOLVED INVALID
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.3.2
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-17 19:39 UTC by Martin Michlmayr
Modified: 2009-01-21 03:04 UTC
CC List: 5 users

See Also:
Host:
Target: x86_64-unknown-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Martin Michlmayr 2009-01-17 19:39:56 UTC
[ Forwarded from http://bugs.debian.org/512050 ]

brian m. carlson reports the following problem with gcc 4.3 and trunk:

Attached is a C file that is compiled with -O3.  mul and mul2 perform the
same operation; mul uses a loop, and mul2 uses SSE intrinsics.  mul2
results in three instructions, whereas mul results in many, many more.

Obviously, since the two functions do the exact same thing, they should
be optimized to be identical.  Instead, mul is pessimized.

Note that there are no alignment issues present since the arrays
declared in main are 16-byte aligned (since they are allocated on the
stack, which is 16-byte aligned on x86_64).

And:

I also just noted that gcc-4.1 and gcc-4.2 produce much less bad code:
they each use 8 movss and 4 mulss.  Nevertheless, they still do not
convert the code into three SSE instructions.
Comment 1 Martin Michlmayr 2009-01-17 19:40:26 UTC
Testcase:


#include <stdio.h>
#include <xmmintrin.h>

#ifndef MUL
#define MUL mul
#endif

void mul(float in1[4], float in2[4], float out[4])
{
        int i;
        for (i = 0; i < 4; i++)
                out[i] = in1[i] * in2[i];
}

void mul2(float in1[4], float in2[4], float out[4])
{
        __m128 a, b, c;
        a = _mm_load_ps(in1);
        b = _mm_load_ps(in2);
        c = _mm_mul_ps(a, b);
        _mm_store_ps(out, c);
}

int main(void)
{
        float inp1[] = {
                1.2, 3.5, 1.7, 2.8
        };
        float inp2[] = {
                -0.7, 2.6, 3.3, -4.0
        };
        float outp[4];
        MUL(inp1, inp2, outp);
        printf("%f %f %f %f\n", outp[0], outp[1], outp[2], outp[3]);
        return 0;
}
Comment 2 Joey Ye 2009-01-21 02:40:55 UTC
The following case isn't vectorized with -O3 on x86_64 either, although the arrays are aligned:
#include <stdio.h>

float __attribute__((aligned(16))) in1[] = {
                1.2, 3.5, 1.7, 2.8
};
float __attribute__((aligned(16))) in2[] = {
                -0.7, 2.6, 3.3, -4.0
};
float __attribute__((aligned(16))) out[4]; 
void __attribute__((noinline)) mul()
{
        int i;
        for (i = 0; i < 4; i++)
                out[i] = in1[i] * in2[i];
}

int main(void)
{
        mul();
        printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
        return 0;
}
Comment 3 Andrew Pinski 2009-01-21 02:44:32 UTC
>void mul(float in1[4], float in2[4], float out[4])

Those arrays are not known to be aligned ....

The other one I have to look into.
Comment 4 Andrew Pinski 2009-01-21 03:00:25 UTC
(In reply to comment #2)
That is because early complete unrolling runs first and unrolls the loop, so the autovectorizer no longer has a loop to work on.  If I increase the trip count to 16 instead of 4, the loop is vectorized.
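
A minimal sketch of that variation, based on the testcase from comment #2 with the trip count raised to 16. The macro N and the name mul16 are hypothetical, introduced only for this illustration. Whether the loop is actually vectorized depends on the GCC version and options in use.

```c
/* Hypothetical variation of the comment #2 testcase: 16 elements instead of 4. */
#define N 16

float __attribute__((aligned(16))) in1[N];
float __attribute__((aligned(16))) in2[N];
float __attribute__((aligned(16))) out[N];

/* noinline keeps the loop in its own function, as in the original testcase. */
void __attribute__((noinline)) mul16(void)
{
        int i;
        /* With 16 iterations the loop is too long for early complete
           unrolling, so it is still a loop when the vectorizer runs. */
        for (i = 0; i < N; i++)
                out[i] = in1[i] * in2[i];
}
```

Compiling this with -O3 and comparing the assembly against the 4-element version should show whether packed multiplies are emitted.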

So the original testcase is invalid for two reasons: aliasing and alignment.  Aliasing because out could overlap with in1/in2; restrict fixes that.  Alignment because there is no way to say that the incoming arguments are 16-byte aligned.
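
A sketch of how both points could be expressed in the source, under stated assumptions. The function name mul_restrict is hypothetical. The restrict qualifiers address the aliasing point. The alignment hint uses __builtin_assume_aligned, which is an assumption about the toolchain: it was added in GCC 4.7 and is not available in the 4.3.2 release this bug was filed against.

```c
/* restrict promises the compiler that out does not overlap in1 or in2. */
void mul_restrict(const float *restrict in1, const float *restrict in2,
                  float *restrict out)
{
        /* Assumption: GCC 4.7 or later. The caller must really pass
           16-byte aligned pointers, otherwise behavior is undefined. */
        const float *a = __builtin_assume_aligned(in1, 16);
        const float *b = __builtin_assume_aligned(in2, 16);
        float *c = __builtin_assume_aligned(out, 16);
        int i;

        for (i = 0; i < 4; i++)
                c[i] = a[i] * b[i];
}
```

The restrict keyword needs C99 mode (for example -std=c99 or -std=gnu99); under the gnu89 default of GCC 4.3 it can be spelled __restrict__ instead.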
Comment 5 Andrew Pinski 2009-01-21 03:04:32 UTC
t.c:11: note: cost model: Adding cost of checks for loop versioning to treat misalignment.

t.c:11: note: cost model: Adding cost of checks for loop versioning aliasing.