Bug 91838 - [8/9 Regression] incorrect use of shr and shrx to shift by 64, missed optimization of vector shift
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 9.2.0
Importance: P2 normal
Target Milestone: 8.4
Assignee: Tamar Christina
URL:
Keywords: missed-optimization, wrong-code
Depends on:
Blocks:
 
Reported: 2019-09-20 12:42 UTC by Matthias Kretz (Vir)
Modified: 2024-03-26 15:05 UTC
3 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work: 10.0, 7.4.0
Known to fail: 8.3.0, 9.2.0
Last reconfirmed: 2020-01-23 00:00:00


Description Matthias Kretz (Vir) 2019-09-20 12:42:58 UTC
Test case:

using T = unsigned char; // or ushort, or uint
using V [[gnu::vector_size(8)]] = T;
V f(V x) { return x >> 8 * sizeof(T); }

GCC 10 compiles this to either an xor or a shift (the shift would preferably be an xor as well)

GCC 9.2 compiles to:
  vmovq rax, xmm0
  mov ecx, 64
  shr rax, cl
  sal rax, (64 - 8*sizeof(T))
  vmovq xmm0, rax

The `shr rax, cl` with cl == 64 is a no-op, because shr (and shrx, which is used when BMI2 is enabled) masks the count with 0x3f. Consequently the last element of the input vector is unchanged in the output.

In any case, shr/shrx with shift counts >= 64 (or >= 32 for the 32-bit variants) should never be emitted.
Comment 1 Matthias Kretz (Vir) 2019-09-20 12:46:11 UTC
https://godbolt.org/z/zxmCTz
Comment 2 Martin Liška 2020-01-23 13:27:30 UTC
So the issue started with r8-349-gcc5b8f3d568e95ce74e03d8d87ada71117a6c106 and disappeared in r10-2445-g5fbc8ab48a57a75e0ce064befc30dee3dc63327a.
Comment 3 Alexander Monakov 2020-01-23 14:30:42 UTC
For unsigned int this would be undefined behavior via an attempt to shift by 32 (but we fail to emit the corresponding warning).

For narrower types (char, short) this seems well-defined.
Comment 4 Matthias Kretz (Vir) 2020-01-23 16:04:01 UTC
Good point. Since gnu::vector_size(N) types are defined by you, you might be able to say that for char and short this is also UB. After all, the left operand isn't actually promoted to int; consequently the right operand is equal to the width of the "promoted" left operand in all cases.
Comment 5 Alexander Monakov 2020-01-23 17:33:42 UTC
Ah, indeed, it should be explicitly UB, and the documentation should mention that as well as that implicit integer promotion does not happen for vector shifts and other operations.
Comment 6 Matthias Kretz (Vir) 2020-01-23 19:30:18 UTC
FWIW, I'd prefer gnu::vector_size(N) to not introduce any additional UB over the scalar arithmetic types. I.e. behave like if promotion would happen, just with final assignment back to T (truncation). While I hate integer promotion, in this case I'd vote for consistency.
Comment 7 Tamar Christina 2020-01-24 09:25:25 UTC
(In reply to Martin Liška from comment #2)
> So the issue started with r8-349-gcc5b8f3d568e95ce74e03d8d87ada71117a6c106
> and disappeared in r10-2445-g5fbc8ab48a57a75e0ce064befc30dee3dc63327a.

Hi Martin, how do I build GCC around those commits? I have tried building x86_64 around 5b8f3d568e95ce74e03d8d87ada71117a6c106, and it keeps failing for the commits around there with a libgcc compile error.

In file included from /data/tamchr01/write-access/gcc-git/libgcc/unwind-dw2.c:403:0:
./md-unwind-support.h: In function 'x86_64_fallback_frame_state':
./md-unwind-support.h:65:47: error: dereferencing pointer to incomplete type 'struct ucontext'
       sc = (struct sigcontext *) (void *) &uc_->uc_mcontext;

and disabling libgcc doesn't get me very far because of the dependencies of things on it.
Comment 8 Andrew Pinski 2020-01-24 21:45:13 UTC
(In reply to Tamar Christina from comment #7)
> In file included from
> /data/tamchr01/write-access/gcc-git/libgcc/unwind-dw2.c:403:0:
> ./md-unwind-support.h: In function 'x86_64_fallback_frame_state':
> ./md-unwind-support.h:65:47: error: dereferencing pointer to incomplete type
> 'struct ucontext'
>        sc = (struct sigcontext *) (void *) &uc_->uc_mcontext;
> 
> and disabling libgcc doesn't get me very far because of the dependencies of
> things on it.

You must be using a too new glibc, g:883312dc79806f513275b72502231c751c14ff72 is what fixed libgcc to support newer glibc.
Comment 9 Tamar Christina 2020-01-27 15:35:34 UTC
Thanks Andrew, managed to build it now.

Testing a patch.
Comment 10 GCC Commits 2020-01-31 14:42:21 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:e60b1e23626701939e8a2f0cf6fc1e48abdf867b

commit r10-6371-ge60b1e23626701939e8a2f0cf6fc1e48abdf867b
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Fri Jan 31 14:39:38 2020 +0000

    middle-end: Fix logical shift truncation (PR rtl-optimization/91838)
    
    This fixes a fall-out from a patch I had submitted two years ago which started
    allowing simplify-rtx to fold logical right shifts by offsets a followed by b
    into >> (a + b).
    
    However this can generate inefficient code when the resulting shift count ends
    up being the same as the size of the shift mode.  This will create some
    undefined behavior on most platforms.
    
    This patch changes the code to truncate to 0 if the shift amount goes out of
    range.  Before my older patch this used to happen in combine when it saw the
    two shifts.  However since we combine them here combine never gets a chance to
    truncate them.
    
    The issue mostly affects GCC 8 and 9 since on 10 the back-end knows how to deal
    with this shift constant but it's better to do the right thing in simplify-rtx.
    
    Note that this doesn't take care of the Arithmetic shift where you could replace
    the constant with MODE_BITS (mode) - 1, but that's not a regression so punting it.
    
    gcc/ChangeLog:
    
    	PR rtl-optimization/91838
    	* simplify-rtx.c (simplify_binary_operation_1): Update LSHIFTRT case
    	to truncate if allowed or reject combination.
    
    gcc/testsuite/ChangeLog:
    
    	PR rtl-optimization/91838
    	* g++.dg/pr91838.C: New test.
Comment 11 GCC Commits 2020-01-31 18:38:13 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:5910b14503dd82772dfeca5336a0176f9b1d260a

commit r10-6381-g5910b14503dd82772dfeca5336a0176f9b1d260a
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Fri Jan 31 19:35:11 2020 +0100

    testsuite: Fix up pr91838.C test [PR91838]
    
    The test FAILs on i686-linux with:
    FAIL: g++.dg/pr91838.C   (test for excess errors)
    Excess errors:
    /home/jakub/src/gcc/gcc/testsuite/g++.dg/pr91838.C:7:8: warning: MMX vector return without MMX enabled changes the ABI [-Wpsabi]
    /home/jakub/src/gcc/gcc/testsuite/g++.dg/pr91838.C:7:3: warning: MMX vector argument without MMX enabled changes the ABI [-Wpsabi]
    and on x86_64-linux with -m32 testing with failure to match the
    expected pattern in there (or both with e.g. -m32/-mno-mmx/-mno-sse testing).
    The test is also in a wrong directory, has non-standard specification that
    it requires c++11 or later.
    
    2020-01-31  Jakub Jelinek  <jakub@redhat.com>
    
    	PR rtl-optimization/91838
    	* g++.dg/pr91838.C: Moved to ...
    	* g++.dg/opt/pr91838.C: ... here.  Require c++11 target instead of
    	dg-skip-if for c++98.  Pass -Wno-psabi -w to avoid psabi style
    	warnings on vector arg passing or return.  Add -masm=att on i?86/x86_64.
    	Only check for pxor %xmm0, %xmm0 on lp64 i?86/x86_64.
Comment 12 GCC Commits 2020-02-11 10:49:23 UTC
The releases/gcc-8 branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:0e35a433f8ec02ac46eb5ceb4a9bc6a25e88b05c

commit r8-9975-g0e35a433f8ec02ac46eb5ceb4a9bc6a25e88b05c
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Tue Feb 11 10:47:19 2020 +0000

    middle-end: Fix logical shift truncation (PR rtl-optimization/91838) (gcc-8 backport)
    
    This fixes a fall-out from a patch I had submitted two years ago which started
    allowing simplify-rtx to fold logical right shifts by offsets a followed by b
    into >> (a + b).
    
    However this can generate inefficient code when the resulting shift count ends
    up being the same as the size of the shift mode.  This will create some
    undefined behavior on most platforms.
    
    This patch changes the code to truncate to 0 if the shift amount goes out of
    range.  Before my older patch this used to happen in combine when it saw the
    two shifts.  However since we combine them here combine never gets a chance to
    truncate them.
    
    The issue mostly affects GCC 8 and 9 since on 10 the back-end knows how to deal
    with this shift constant but it's better to do the right thing in simplify-rtx.
    
    Note that this doesn't take care of the Arithmetic shift where you could replace
    the constant with MODE_BITS (mode) - 1, but that's not a regression so punting it.
    
    gcc/ChangeLog:
    
    	Backport from mainline
    	2020-01-31  Tamar Christina  <tamar.christina@arm.com>
    
    	PR rtl-optimization/91838
    	* simplify-rtx.c (simplify_binary_operation_1): Update LSHIFTRT case
    	to truncate if allowed or reject combination.
    
    gcc/testsuite/ChangeLog:
    
    	Backport from mainline
    	2020-01-31  Tamar Christina  <tamar.christina@arm.com>
    		    Jakub Jelinek  <jakub@redhat.com>
    
    	PR rtl-optimization/91838
    	* g++.dg/opt/pr91838.C: New test.
Comment 13 GCC Commits 2020-02-11 10:52:00 UTC
The releases/gcc-9 branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:f6e9ae4da8f040ab2ef2eb37d0fb4da6f823bf81

commit r9-8210-gf6e9ae4da8f040ab2ef2eb37d0fb4da6f823bf81
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Tue Feb 11 10:50:12 2020 +0000

    middle-end: Fix logical shift truncation (PR rtl-optimization/91838) (gcc-9 backport)
    
    This fixes a fall-out from a patch I had submitted two years ago which started
    allowing simplify-rtx to fold logical right shifts by offsets a followed by b
    into >> (a + b).
    
    However this can generate inefficient code when the resulting shift count ends
    up being the same as the size of the shift mode.  This will create some
    undefined behavior on most platforms.
    
    This patch changes the code to truncate to 0 if the shift amount goes out of
    range.  Before my older patch this used to happen in combine when it saw the
    two shifts.  However since we combine them here combine never gets a chance to
    truncate them.
    
    The issue mostly affects GCC 8 and 9 since on 10 the back-end knows how to deal
    with this shift constant but it's better to do the right thing in simplify-rtx.
    
    Note that this doesn't take care of the Arithmetic shift where you could replace
    the constant with MODE_BITS (mode) - 1, but that's not a regression so punting it.
    
    gcc/ChangeLog:
    
    	Backport from mainline
    	2020-01-31  Tamar Christina  <tamar.christina@arm.com>
    
    	PR rtl-optimization/91838
    	* simplify-rtx.c (simplify_binary_operation_1): Update LSHIFTRT case
    	to truncate if allowed or reject combination.
    
    gcc/testsuite/ChangeLog:
    
    	Backport from mainline
    	2020-01-31  Tamar Christina  <tamar.christina@arm.com>
    		    Jakub Jelinek  <jakub@redhat.com>
    
    	PR rtl-optimization/91838
    	* g++.dg/opt/pr91838.C: New test.
Comment 14 Tamar Christina 2020-02-11 10:53:30 UTC
Fixed in master and backported to GCC 9 and GCC 8
Comment 15 Uroš Bizjak 2023-05-19 08:04:04 UTC
This test returns different results depending on whether a V8QImode shift
pattern is present in the target *.md files. For x86, if the following expander is added:

+(define_expand "<insn>v8qi3"
+  [(set (match_operand:V8QI 0 "register_operand")
+       (any_shift:V8QI (match_operand:V8QI 1 "register_operand")
+                       (match_operand:DI 2 "nonmemory_operand")))]
+  "TARGET_MMX_WITH_SSE"
+{
+  ix86_expand_vecop_qihi_partial (<CODE>, operands[0],
+                                 operands[1], operands[2]);
+  DONE;
+})

then the tree optimizers produce:

V f (V x)
{
  V _2;

  <bb 2> [local count: 1073741824]:
  _2 = x_1(D) >> 8;
  return _2;

}

otherwise, without the named expander:

V f (V x)
{
  <bb 2> [local count: 1073741824]:
  return { 0, 0, 0, 0, 0, 0, 0, 0 };

}
Comment 16 Andrew Pinski 2023-06-02 02:18:12 UTC
So the testcase g++.dg/opt/pr91838.C depends on the outcome of the discussion at:
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620378.html

Which, I think, says this testcase is really undefined.
Comment 17 Richard Biener 2023-07-27 11:11:54 UTC
Interestingly, even with -mno-sse we somehow have a shift for V2QImode.  When
a scalar shift by the precision is in the IL (for example via veclower), it's
CCP's bit-value tracking that makes it zero; if that's disabled, then VRP will
compute it as zero.  But we don't have any pattern that would consistently do this.

I'm going to add one.
Comment 18 Uroš Bizjak 2023-07-27 11:38:09 UTC
(In reply to Richard Biener from comment #17)
> Interestingly even with -mno-sse we somehow have a shift for V2QImode.
This is implemented by a combination of shl rl,cl and shl rh,cl, so no XMM registers are needed.
Comment 19 GCC Commits 2023-07-27 13:56:13 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:d1c072a1c3411a6fe29900750b38210af8451eeb

commit r14-2821-gd1c072a1c3411a6fe29900750b38210af8451eeb
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jul 27 13:08:32 2023 +0200

    tree-optimization/91838 - fix FAIL of g++.dg/opt/pr91838.C
    
    The following fixes the lack of simplification of a vector shift
    by an out-of-bounds shift value.  For scalars this is done both
    by CCP and VRP but vectors are not handled there.  This results
    in PR91838 differences in outcome dependent on whether a vector
    shift ISA is available and thus vector lowering does or does not
    expose scalar shifts here.
    
    The following adds a match.pd pattern to catch uniform out-of-bound
    shifts, simplifying them to zero when not sanitizing shift amounts.
    
            PR tree-optimization/91838
            * gimple-match-head.cc: Include attribs.h and asan.h.
            * generic-match-head.cc: Likewise.
            * match.pd (([rl]shift @0 out-of-bounds) -> zero): New pattern.
Comment 20 Richard Biener 2023-07-27 13:57:20 UTC
The testcase is again fixed in GCC 14.
Comment 21 GCC Commits 2024-03-26 14:58:52 UTC
The releases/gcc-13 branch has been updated by Andre Simoes Dias Vieira <avieira@gcc.gnu.org>:

https://gcc.gnu.org/g:c6e171ff827f8ff1bd160babac0dd24933972664

commit r13-8498-gc6e171ff827f8ff1bd160babac0dd24933972664
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jul 27 13:08:32 2023 +0200

    tree-optimization/91838 - fix FAIL of g++.dg/opt/pr91838.C
    
    The following fixes the lack of simplification of a vector shift
    by an out-of-bounds shift value.  For scalars this is done both
    by CCP and VRP but vectors are not handled there.  This results
    in PR91838 differences in outcome dependent on whether a vector
    shift ISA is available and thus vector lowering does or does not
    expose scalar shifts here.
    
    The following adds a match.pd pattern to catch uniform out-of-bound
    shifts, simplifying them to zero when not sanitizing shift amounts.
    
            PR tree-optimization/91838
            * gimple-match-head.cc: Include attribs.h and asan.h.
            * generic-match-head.cc: Likewise.
            * match.pd (([rl]shift @0 out-of-bounds) -> zero): New pattern.
    
    (cherry picked from commit d1c072a1c3411a6fe29900750b38210af8451eeb)
Comment 22 GCC Commits 2024-03-26 15:05:27 UTC
The releases/gcc-12 branch has been updated by Andre Simoes Dias Vieira <avieira@gcc.gnu.org>:

https://gcc.gnu.org/g:1ddd9f9e53bd649d3d236f7941106d8cc46e7358

commit r12-10293-g1ddd9f9e53bd649d3d236f7941106d8cc46e7358
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jul 27 13:08:32 2023 +0200

    tree-optimization/91838 - fix FAIL of g++.dg/opt/pr91838.C
    
    The following fixes the lack of simplification of a vector shift
    by an out-of-bounds shift value.  For scalars this is done both
    by CCP and VRP but vectors are not handled there.  This results
    in PR91838 differences in outcome dependent on whether a vector
    shift ISA is available and thus vector lowering does or does not
    expose scalar shifts here.
    
    The following adds a match.pd pattern to catch uniform out-of-bound
    shifts, simplifying them to zero when not sanitizing shift amounts.
    
            PR tree-optimization/91838
            * gimple-match-head.cc: Include attribs.h and asan.h.
            * generic-match-head.cc: Likewise.
            * match.pd (([rl]shift @0 out-of-bounds) -> zero): New pattern.
    
    (cherry picked from commit d1c072a1c3411a6fe29900750b38210af8451eeb)