Bug 54703 - _mm_sub_pd is incorrectly substituted with vandnps
Summary: _mm_sub_pd is incorrectly substituted with vandnps
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.8.0
: P3 major
Target Milestone: 4.7.3
Assignee: Jakub Jelinek
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2012-09-25 13:29 UTC by Matthias Kretz
Modified: 2013-04-12 16:29 UTC (History)
1 user (show)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work: 4.7.3, 4.8.0
Known to fail: 4.6.2, 4.7.0, 4.7.1, 4.7.2
Last reconfirmed: 2012-09-25 00:00:00


Attachments
gcc48-pr54703.patch (803 bytes, patch)
2012-09-25 14:34 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Matthias Kretz 2012-09-25 13:29:16 UTC
The following testcase:

#include <xmmintrin.h>
__attribute__((aligned(16))) static const unsigned long long mask[2] = { 0xffffffffff000000ull, 0xffffffffff000000ull };
inline __m128d foo(__m128d v1) {
    const __m128d h1 = _mm_and_pd(v1, _mm_load_pd(reinterpret_cast<const double *>(&mask)));
    const __m128d l1 = _mm_sub_pd(v1, h1);
    return _mm_mul_pd(h1, l1);
}
__m128d test() {
    __m128d a = _mm_set1_pd(2.);
    return foo(foo(a));
}

compiles to

        .cfi_startproc
        vmovaps _ZL4mask(%rip), %xmm0
        vandps  .LC0(%rip), %xmm0, %xmm2
        vandnps .LC0(%rip), %xmm0, %xmm1
        vmulpd  %xmm1, %xmm2, %xmm1
        vandps  %xmm0, %xmm1, %xmm0
        vsubpd  %xmm0, %xmm1, %xmm1
        vmulpd  %xmm1, %xmm0, %xmm0
        ret
        .cfi_endproc

The second foo call is correct: vandps and vsubpd are used. But the first call uses vandps and vandnps. This pattern would be correct for integers, but is obviously wrong for floating point numbers.
Comment 1 Matthias Kretz 2012-09-25 13:32:27 UTC
Um, sorry. Forgot to note the compiler switches:

gcc -O1 -march=bdver1

I can't reproduce the error with corei7-avx or any other non-AVX target.
Comment 2 Richard Biener 2012-09-25 13:47:48 UTC
Confirmed.
Comment 3 Jakub Jelinek 2012-09-25 13:48:10 UTC
Looking into it.
Comment 4 Jakub Jelinek 2012-09-25 14:34:32 UTC
Created attachment 28270 [details]
gcc48-pr54703.patch

Untested fix.
Comment 5 Andrew Pinski 2012-09-25 18:04:00 UTC
(In reply to comment #4)
> Created attachment 28270 [details]
> gcc48-pr54703.patch

Since I wrote the code which breaks this, this looks good.  The only question I have is why is AND allowed on non integer types anyways?
Comment 6 Jakub Jelinek 2012-09-25 21:56:53 UTC
Yeah, the generic code is not buggy, it seems sane to assume AND is integral/vector integer operation only, but I think i?86 isn't the only backend that uses AND on vector floating point modes extensively.
Comment 7 Matthias Kretz 2012-09-26 10:52:38 UTC
Thanks for the quick response! You guys are cool! :)

The pattern here is for calculation with extended precision:

xh = x & mask;
xl = x - xh;
yh = y & mask
yl = y - yh;
/* x * y + small = */ xh * yh + ((xh * yl + xl * yh) + (xl * yl + small));

Other than that, yes there's a lot of ANDPS is SSE code, just this pattern which gets miscompiled is not very common.
Comment 8 Jakub Jelinek 2012-09-27 10:48:14 UTC
Author: jakub
Date: Thu Sep 27 10:48:07 2012
New Revision: 191801

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=191801
Log:
	PR target/54703
	* simplify-rtx.c (simplify_binary_operation_1): Perform
	(x - (x & y)) -> (x & ~y) optimization only for integral
	modes.

	* gcc.target/i386/pr54703.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr54703.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/simplify-rtx.c
    trunk/gcc/testsuite/ChangeLog
Comment 9 Jakub Jelinek 2012-09-27 10:53:47 UTC
Author: jakub
Date: Thu Sep 27 10:53:42 2012
New Revision: 191802

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=191802
Log:
	PR target/54703
	* simplify-rtx.c (simplify_binary_operation_1): Perform
	(x - (x & y)) -> (x & ~y) optimization only for integral
	modes.

	* gcc.target/i386/pr54703.c: New test.

Added:
    branches/gcc-4_7-branch/gcc/testsuite/gcc.target/i386/pr54703.c
Modified:
    branches/gcc-4_7-branch/gcc/ChangeLog
    branches/gcc-4_7-branch/gcc/simplify-rtx.c
    branches/gcc-4_7-branch/gcc/testsuite/ChangeLog
Comment 10 Jakub Jelinek 2012-09-27 10:54:36 UTC
Fixed for 4.7.3+ so far.
Comment 11 Jakub Jelinek 2013-04-12 16:29:14 UTC
The 4.6 branch has been closed, fixed in GCC 4.7.3.