Bug 93418 - [9 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav
Summary: [9 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P2 normal
Target Milestone: 9.3
Assignee: Jakub Jelinek
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2020-01-24 15:13 UTC by Devin Hussey
Modified: 2020-02-13 22:49 UTC
CC: 1 user

See Also:
Host:
Target: x86_64-*-*
Build: 2020-01-24 00:00
Known to work:
Known to fail:
Last reconfirmed: 2020-01-24 00:00:00


Attachments
gcc10-pr93418.patch (855 bytes, patch)
2020-01-27 17:36 UTC, Jakub Jelinek

Description Devin Hussey 2020-01-24 15:13:16 UTC
Regression starting in GCC 9

Currently, when the shift counts are compile-time constants, GCC constant-folds the AVX2 _mm_sllv/_mm_srlv/_mm_srav family of intrinsics incorrectly: the folded result is the first lane's shifted value broadcast to every lane, instead of each lane being shifted by its own count.

#include <immintrin.h>
#include <stdio.h>

// force -O0
__attribute__((__optimize__("-O0")))
void unoptimized() {
    __m128i vals = _mm_set1_epi32(0xffffffff);
    __m128i shifts = _mm_setr_epi32(16, 31, -34, 3);
    __m128i shifted = _mm_sllv_epi32(vals, shifts);
    printf("%08x %08x %08x %08x\n", _mm_extract_epi32(shifted, 0), _mm_extract_epi32(shifted, 1), _mm_extract_epi32(shifted, 2), _mm_extract_epi32(shifted, 3));
}

// force -O3
__attribute__((__optimize__("-O3")))
void optimized() {
    __m128i vals = _mm_set1_epi32(0xffffffff);
    __m128i shifts = _mm_setr_epi32(16, 31, -34, 3);
    __m128i shifted = _mm_sllv_epi32(vals, shifts);
    printf("%08x %08x %08x %08x\n", _mm_extract_epi32(shifted, 0), _mm_extract_epi32(shifted, 1), _mm_extract_epi32(shifted, 2), _mm_extract_epi32(shifted, 3));
}

int main() {
    printf("Without optimizations (correct result):\t");
    unoptimized();
    printf("With optimizations (incorrect result):\t");
    optimized();
}


I would expect this code to emit the following:

Without optimizations (correct result):  ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):   ffff0000 80000000 00000000 fffffff8

Clang and GCC < 9 produce that output. However, on GCC 9.1 and later I get:

Without optimizations (correct result):  ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):   ffff0000 ffff0000 ffff0000 ffff0000

Godbolt link: https://gcc.godbolt.org/z/oC3Psp
Comment 1 Andrew Pinski 2020-01-24 15:36:31 UTC
From .ccp1:


Visiting statement:
_17 = VIEW_CONVERT_EXPR<vector(4) int>(shifts_13);
which is likely CONSTANT
Match-and-simplified VIEW_CONVERT_EXPR<vector(4) int>(shifts_13) to { 16, 31, -34, 3 }
Lattice value changed to CONSTANT { 16, 31, -34, 3 }.  Adding SSA edges to worklist.
marking stmt to be not simulated again


Visiting statement:
_18 = VIEW_CONVERT_EXPR<vector(4) int>(vals_11);
which is likely CONSTANT
Match-and-simplified VIEW_CONVERT_EXPR<vector(4) int>(vals_11) to { -1, -1, -1, -1 }
Lattice value changed to CONSTANT { -1, -1, -1, -1 }.  Adding SSA edges to worklist.
marking stmt to be not simulated again


Visiting statement:
_19 = __builtin_ia32_psllv4si (_18, _17);
which is likely CONSTANT
Lattice value changed to CONSTANT { -65536, -65536, -65536, -65536 }.  Adding SSA edges to worklist.
marking stmt to be not simulated again
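
For reference: -65536 is 0xffff0000, the first lane's result broadcast across all four lanes. Below is a minimal scalar model (illustrative C, not GCC code) of the per-lane semantics __builtin_ia32_psllv4si should follow; per the documented behavior of VPSLLVD, a lane whose count exceeds 31 when taken as unsigned becomes 0.

#include <stdint.h>
#include <stdio.h>

// Per-lane model of VPSLLVD: each 32-bit lane is shifted left by its own
// count; a count outside 0..31 (interpreted as unsigned) yields 0.
static uint32_t sllv_lane(uint32_t val, uint32_t count) {
    return count > 31 ? 0 : val << count;
}

int main(void) {
    const uint32_t vals[4]   = { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff };
    const uint32_t counts[4] = { 16, 31, (uint32_t)-34, 3 };
    for (int i = 0; i < 4; i++)
        printf("%08x ", sllv_lane(vals[i], counts[i]));
    printf("\n"); // prints: ffff0000 80000000 00000000 fffffff8
}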
Comment 2 Andrew Pinski 2020-01-24 15:43:49 UTC
The bug is most likely inside ix86_fold_builtin.
Comment 3 Devin Hussey 2020-01-24 20:59:27 UTC
I think I found the culprit commit.

Haven't set up a GCC build tree yet, though. 

https://github.com/gcc-mirror/gcc/commit/a51c4926712307787d133ba50af8c61393a9229b
Comment 4 Andrew Pinski 2020-01-24 21:48:42 UTC
(In reply to Devin Hussey from comment #3)
> I think I found the culprit commit.
> 
> Haven't set up a GCC build tree yet, though. 
> 
> https://github.com/gcc-mirror/gcc/commit/a51c4926712307787d133ba50af8c61393a9229b

g:6a03477e85e1b097ed6c0b86c76436de575aef04 for the new git.
Comment 5 Devin Hussey 2020-01-25 01:52:05 UTC
Finally got GCC to build after it was throwing a fit.

I can confirm that the regression is in that commit.

g:28a8a768ebef5e31f950013f1b48b14c008b4b3b works correctly, 
g:6a03477e85e1b097ed6c0b86c76436de575aef04 does not.
Comment 6 Jakub Jelinek 2020-01-27 11:06:05 UTC
I'll have a look.
Comment 7 Jakub Jelinek 2020-01-27 17:36:40 UTC
Created attachment 47717 [details]
gcc10-pr93418.patch

Untested fix.
Comment 8 Devin Hussey 2020-01-27 19:50:55 UTC
Seems to work.

~ $ ~/gcc-test/bin/x86_64-pc-cygwin-gcc.exe -mavx2 -O3 _mm_sllv_bug.c

~ $ ./a.exe
Without optimizations (correct result): ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):  ffff0000 80000000 00000000 fffffff8

~ $

And checking the assembly, the shifts are constant propagated.

The provided test file also passes.
Comment 9 GCC Commits 2020-01-28 07:47:48 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:bff948aa337807260344c83ac9079d6386410094

commit r10-6273-gbff948aa337807260344c83ac9079d6386410094
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Jan 28 08:46:23 2020 +0100

    i386: Fix ix86_fold_builtin shift folding [PR93418]
    
    The following testcase is miscompiled, because the variable shift left
    operand, { -1, -1, -1, -1 } is represented as a VECTOR_CST with
    VECTOR_CST_NPATTERNS 1 and VECTOR_CST_NELTS_PER_PATTERN 1, so when
    we call builder.new_unary_operation, builder.encoded_nelts () will be just 1
    and thus we encode the resulting vector as if all the elements were the
    same.
    For non-masked is_vshift, we could perhaps call builder.new_binary_operation
    (TREE_TYPE (args[0]), args[0], args[1], false), but then there are masked
    shifts, for non-is_vshift we could perhaps call it too but with args[2]
    instead of args[1], but there is no builder.new_ternary_operation.
    All this stuff is primarily for aarch64 anyway, on x86 we don't have any
    variable length vectors, and it is not a big deal to compute all elements
    and just let builder.finalize () find the most efficient VECTOR_CST
    representation of the vector.  So, instead of doing too much, this just
    keeps using new_unary_operation only if only one VECTOR_CST is involved
    (i.e. non-masked shift by constant) and for the rest just compute all elts.
    
    2020-01-28  Jakub Jelinek  <jakub@redhat.com>
    
    	PR target/93418
    	* config/i386/i386.c (ix86_fold_builtin) <do_shift>: If mask is not
    	-1 or is_vshift is true, use new_vector with number of elts npatterns
    	rather than new_unary_operation.
    
    	* gcc.target/i386/avx2-pr93418.c: New test.
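
For context, a hedged sketch of the shape of the fix described above (not the actual patch; nelts here is a stand-in for the constant number of vector elements, and the surrounding do_shift code in config/i386/i386.c is elided):

      tree_vector_builder builder;
      if (mask != HOST_WIDE_INT_M1U || is_vshift)
	/* Masked or per-lane shift: the result is not a unary function of
	   args[0]'s encoding, so encode every element explicitly (npatterns
	   equal to the element count, one element per pattern) and let
	   builder.finalize () pick the most compact representation.  */
	builder.new_vector (TREE_TYPE (args[0]), nelts, 1);
      else
	/* Non-masked shift of every lane by one scalar count: lane i of the
	   result depends only on lane i of args[0], so the unary encoding
	   is safe.  */
	builder.new_unary_operation (TREE_TYPE (args[0]), args[0], false);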
Comment 10 Jakub Jelinek 2020-01-28 07:48:53 UTC
Fixed on the trunk so far.
Comment 11 GCC Commits 2020-02-13 22:32:04 UTC
The releases/gcc-9 branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:764e831291a2e510978ca7be0bffb55589a5a0b6

commit r9-8214-g764e831291a2e510978ca7be0bffb55589a5a0b6
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Jan 28 08:46:23 2020 +0100

    i386: Fix ix86_fold_builtin shift folding [PR93418]
    
    The following testcase is miscompiled, because the variable shift left
    operand, { -1, -1, -1, -1 } is represented as a VECTOR_CST with
    VECTOR_CST_NPATTERNS 1 and VECTOR_CST_NELTS_PER_PATTERN 1, so when
    we call builder.new_unary_operation, builder.encoded_nelts () will be just 1
    and thus we encode the resulting vector as if all the elements were the
    same.
    For non-masked is_vshift, we could perhaps call builder.new_binary_operation
    (TREE_TYPE (args[0]), args[0], args[1], false), but then there are masked
    shifts, for non-is_vshift we could perhaps call it too but with args[2]
    instead of args[1], but there is no builder.new_ternary_operation.
    All this stuff is primarily for aarch64 anyway, on x86 we don't have any
    variable length vectors, and it is not a big deal to compute all elements
    and just let builder.finalize () find the most efficient VECTOR_CST
    representation of the vector.  So, instead of doing too much, this just
    keeps using new_unary_operation only if only one VECTOR_CST is involved
    (i.e. non-masked shift by constant) and for the rest just compute all elts.
    
    2020-01-28  Jakub Jelinek  <jakub@redhat.com>
    
    	PR target/93418
    	* config/i386/i386.c (ix86_fold_builtin) <do_shift>: If mask is not
    	-1 or is_vshift is true, use new_vector with number of elts npatterns
    	rather than new_unary_operation.
    
    	* gcc.target/i386/avx2-pr93418.c: New test.
Comment 12 Jakub Jelinek 2020-02-13 22:49:44 UTC
Fixed for 9.3+ too.