Bug 93418 - [9 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav
Summary: [9 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P2 normal
Target Milestone: 9.3
Assignee: Jakub Jelinek
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2020-01-24 15:13 UTC by Devin Hussey
Modified: 2020-02-13 22:49 UTC
CC: 1 user

See Also:
Host:
Target: x86_64-*-*
Build: 2020-01-24 00:00
Known to work:
Known to fail:
Last reconfirmed: 2020-01-24 00:00:00


Attachments
gcc10-pr93418.patch (855 bytes, patch)
2020-01-27 17:36 UTC, Jakub Jelinek

Description Devin Hussey 2020-01-24 15:13:16 UTC
Regression starting in GCC 9

Currently, when the shift counts are compile-time constants, GCC constant-folds the AVX2 _mm_sllv/_mm_srlv/_mm_srav family of intrinsics incorrectly: the folded result is the first lane's shifted value broadcast to every lane, instead of each lane being shifted by its own count.

#include <immintrin.h>
#include <stdio.h>

// force -O0
__attribute__((__optimize__("-O0")))
void unoptimized() {
    __m128i vals = _mm_set1_epi32(0xffffffff);
    __m128i shifts = _mm_setr_epi32(16, 31, -34, 3);
    __m128i shifted = _mm_sllv_epi32(vals, shifts);
    printf("%08x %08x %08x %08x\n", _mm_extract_epi32(shifted, 0), _mm_extract_epi32(shifted, 1), _mm_extract_epi32(shifted, 2), _mm_extract_epi32(shifted, 3));
}

// force -O3
__attribute__((__optimize__("-O3")))
void optimized() {
    __m128i vals = _mm_set1_epi32(0xffffffff);
    __m128i shifts = _mm_setr_epi32(16, 31, -34, 3);
    __m128i shifted = _mm_sllv_epi32(vals, shifts);
    printf("%08x %08x %08x %08x\n", _mm_extract_epi32(shifted, 0), _mm_extract_epi32(shifted, 1), _mm_extract_epi32(shifted, 2), _mm_extract_epi32(shifted, 3));
}

int main() {
    printf("Without optimizations (correct result):\t");
    unoptimized();
    printf("With optimizations (incorrect result):\t");
    optimized();
}


I would expect this code to emit the following:

Without optimizations (correct result):  ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):   ffff0000 80000000 00000000 fffffff8

Clang and GCC < 9 produce that output. However, on GCC 9.1 and later I get:

Without optimizations (correct result):  ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):   ffff0000 ffff0000 ffff0000 ffff0000

Godbolt link: https://gcc.godbolt.org/z/oC3Psp
Comment 1 Andrew Pinski 2020-01-24 15:36:31 UTC
From .ccp1:


Visiting statement:
_17 = VIEW_CONVERT_EXPR<vector(4) int>(shifts_13);
which is likely CONSTANT
Match-and-simplified VIEW_CONVERT_EXPR<vector(4) int>(shifts_13) to { 16, 31, -34, 3 }
Lattice value changed to CONSTANT { 16, 31, -34, 3 }.  Adding SSA edges to worklist.
marking stmt to be not simulated again


Visiting statement:
_18 = VIEW_CONVERT_EXPR<vector(4) int>(vals_11);
which is likely CONSTANT
Match-and-simplified VIEW_CONVERT_EXPR<vector(4) int>(vals_11) to { -1, -1, -1, -1 }
Lattice value changed to CONSTANT { -1, -1, -1, -1 }.  Adding SSA edges to worklist.
marking stmt to be not simulated again


Visiting statement:
_19 = __builtin_ia32_psllv4si (_18, _17);
which is likely CONSTANT
Lattice value changed to CONSTANT { -65536, -65536, -65536, -65536 }.  Adding SSA edges to worklist.
marking stmt to be not simulated again
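
For reference: -65536 is 0xffff0000, the first lane's result broadcast across all four lanes. Below is a minimal scalar model (illustrative C, not GCC code) of the per-lane semantics __builtin_ia32_psllv4si should follow; per the documented behavior of VPSLLVD, a lane whose count exceeds 31 when taken as unsigned becomes 0.

#include <stdint.h>
#include <stdio.h>

// Per-lane model of VPSLLVD: each 32-bit lane is shifted left by its own
// count; a count outside 0..31 (interpreted as unsigned) yields 0.
static uint32_t sllv_lane(uint32_t val, uint32_t count) {
    return count > 31 ? 0 : val << count;
}

int main(void) {
    const uint32_t vals[4]   = { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff };
    const uint32_t counts[4] = { 16, 31, (uint32_t)-34, 3 };
    for (int i = 0; i < 4; i++)
        printf("%08x ", sllv_lane(vals[i], counts[i]));
    printf("\n"); // prints: ffff0000 80000000 00000000 fffffff8
}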
Comment 2 Andrew Pinski 2020-01-24 15:43:49 UTC
The bug is most likely inside ix86_fold_builtin.
Comment 3 Devin Hussey 2020-01-24 20:59:27 UTC
I think I found the culprit commit.

Haven't set up a GCC build tree yet, though. 

https://github.com/gcc-mirror/gcc/commit/a51c4926712307787d133ba50af8c61393a9229b
Comment 4 Andrew Pinski 2020-01-24 21:48:42 UTC
(In reply to Devin Hussey from comment #3)
> I think I found the culprit commit.
> 
> Haven't set up a GCC build tree yet, though. 
> 
> https://github.com/gcc-mirror/gcc/commit/a51c4926712307787d133ba50af8c61393a9229b

g:6a03477e85e1b097ed6c0b86c76436de575aef04 for the new git.
Comment 5 Devin Hussey 2020-01-25 01:52:05 UTC
Finally got GCC to build after it was throwing a fit.

I can confirm that the regression is in that commit.

g:28a8a768ebef5e31f950013f1b48b14c008b4b3b works correctly, 
g:6a03477e85e1b097ed6c0b86c76436de575aef04 does not.
Comment 6 Jakub Jelinek 2020-01-27 11:06:05 UTC
I'll have a look.
Comment 7 Jakub Jelinek 2020-01-27 17:36:40 UTC
Created attachment 47717 [details]
gcc10-pr93418.patch

Untested fix.
Comment 8 Devin Hussey 2020-01-27 19:50:55 UTC
Seems to work.

~ $ ~/gcc-test/bin/x86_64-pc-cygwin-gcc.exe -mavx2 -O3 _mm_sllv_bug.c

~ $ ./a.exe
Without optimizations (correct result): ffff0000 80000000 00000000 fffffff8
With optimizations (incorrect result):  ffff0000 80000000 00000000 fffffff8

~ $

And checking the assembly, the shifts are constant propagated.

The provided test file also passes.
Comment 9 GCC Commits 2020-01-28 07:47:48 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:bff948aa337807260344c83ac9079d6386410094

commit r10-6273-gbff948aa337807260344c83ac9079d6386410094
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Jan 28 08:46:23 2020 +0100

    i386: Fix ix86_fold_builtin shift folding [PR93418]
    
    The following testcase is miscompiled, because the variable shift left
    operand, { -1, -1, -1, -1 } is represented as a VECTOR_CST with
    VECTOR_CST_NPATTERNS 1 and VECTOR_CST_NELTS_PER_PATTERN 1, so when
    we call builder.new_unary_operation, builder.encoded_nelts () will be just 1
    and thus we encode the resulting vector as if all the elements were the
    same.
    For non-masked is_vshift, we could perhaps call builder.new_binary_operation
    (TREE_TYPE (args[0]), args[0], args[1], false), but then there are masked
    shifts, for non-is_vshift we could perhaps call it too but with args[2]
    instead of args[1], but there is no builder.new_ternary_operation.
    All this stuff is primarily for aarch64 anyway, on x86 we don't have any
    variable length vectors, and it is not a big deal to compute all elements
    and just let builder.finalize () find the most efficient VECTOR_CST
    representation of the vector.  So, instead of doing too much, this just
    keeps using new_unary_operation only if only one VECTOR_CST is involved
    (i.e. non-masked shift by constant) and for the rest just compute all elts.
    
    2020-01-28  Jakub Jelinek  <jakub@redhat.com>
    
    	PR target/93418
    	* config/i386/i386.c (ix86_fold_builtin) <do_shift>: If mask is not
    	-1 or is_vshift is true, use new_vector with number of elts npatterns
    	rather than new_unary_operation.
    
    	* gcc.target/i386/avx2-pr93418.c: New test.
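
For context, a hedged sketch of the shape of the fix described above (not the actual patch; nelts here is a stand-in for the constant number of vector elements, and the surrounding do_shift code in config/i386/i386.c is elided):

      tree_vector_builder builder;
      if (mask != HOST_WIDE_INT_M1U || is_vshift)
	/* Masked or per-lane shift: the result is not a unary function of
	   args[0]'s encoding, so encode every element explicitly (npatterns
	   equal to the element count, one element per pattern) and let
	   builder.finalize () pick the most compact representation.  */
	builder.new_vector (TREE_TYPE (args[0]), nelts, 1);
      else
	/* Non-masked shift of every lane by one scalar count: lane i of the
	   result depends only on lane i of args[0], so the unary encoding
	   is safe.  */
	builder.new_unary_operation (TREE_TYPE (args[0]), args[0], false);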
Comment 10 Jakub Jelinek 2020-01-28 07:48:53 UTC
Fixed on the trunk so far.
Comment 11 GCC Commits 2020-02-13 22:32:04 UTC
The releases/gcc-9 branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:764e831291a2e510978ca7be0bffb55589a5a0b6

commit r9-8214-g764e831291a2e510978ca7be0bffb55589a5a0b6
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Jan 28 08:46:23 2020 +0100

    i386: Fix ix86_fold_builtin shift folding [PR93418]
    
    The following testcase is miscompiled, because the variable shift left
    operand, { -1, -1, -1, -1 } is represented as a VECTOR_CST with
    VECTOR_CST_NPATTERNS 1 and VECTOR_CST_NELTS_PER_PATTERN 1, so when
    we call builder.new_unary_operation, builder.encoded_nelts () will be just 1
    and thus we encode the resulting vector as if all the elements were the
    same.
    For non-masked is_vshift, we could perhaps call builder.new_binary_operation
    (TREE_TYPE (args[0]), args[0], args[1], false), but then there are masked
    shifts, for non-is_vshift we could perhaps call it too but with args[2]
    instead of args[1], but there is no builder.new_ternary_operation.
    All this stuff is primarily for aarch64 anyway, on x86 we don't have any
    variable length vectors, and it is not a big deal to compute all elements
    and just let builder.finalize () find the most efficient VECTOR_CST
    representation of the vector.  So, instead of doing too much, this just
    keeps using new_unary_operation only if only one VECTOR_CST is involved
    (i.e. non-masked shift by constant) and for the rest just compute all elts.
    
    2020-01-28  Jakub Jelinek  <jakub@redhat.com>
    
    	PR target/93418
    	* config/i386/i386.c (ix86_fold_builtin) <do_shift>: If mask is not
    	-1 or is_vshift is true, use new_vector with number of elts npatterns
    	rather than new_unary_operation.
    
    	* gcc.target/i386/avx2-pr93418.c: New test.
Comment 12 Jakub Jelinek 2020-02-13 22:49:44 UTC
Fixed for 9.3+ too.