Bug 102540 - [12/13 Regression] Dead Code Elimination Regression at -O3 since r12-476-gd846f225c25c5885
Summary: [12/13 Regression] Dead Code Elimination Regression at -O3 since r12-476-gd84...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 12.0
: P2 normal
Target Milestone: 12.3
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: VRP
  Show dependency treegraph
 
Reported: 2021-09-30 09:27 UTC by Theodoros Theodoridis
Modified: 2022-10-13 17:03 UTC (History)
7 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-09-30 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Theodoros Theodoridis 2021-09-30 09:27:06 UTC
cat test.c
static long a;
static unsigned b;
void foo(void);
int main() {
    long c, e;
    c = b = a;
    e = c ? 2 / (c + 1) : 0;
    if (e && !b)
        foo();
    a = 0;
}

11.2.0 at -O3 can eliminate the call to foo but trunk at -O3 cannot:

gcc-11 -v
Target: x86_64-pc-linux-gnu
Configured with: ../configure --disable-multilib --disable-bootstrap --enable-languages=c,c++
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.2.0 (GCC)

gcc-11 test.c -S -O3
cat test.s
	.file	"test.c"
	.text
	.section	.text.startup,"ax",@progbits
	.p2align 4
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	movq	$0, a(%rip)
	xorl	%eax, %eax
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.local	a
	.comm	a,8,8
	.ident	"GCC: (GNU) 11.2.0"
	.section	.note.GNU-stack,"",@progbits


gcc-trunk -v
Target: x86_64-pc-linux-gnu
Configured with: ../configure --disable-multilib --disable-bootstrap --enable-languages=c,c++ 
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.0.0 20210930 (experimental) (GCC)

gcc-trunk test.c -S -O3
cat test.s
	.file	"test.c"
	.text
	.section	.text.startup,"ax",@progbits
	.p2align 4
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	movq	a(%rip), %rcx
	movq	%rcx, %rsi
	andl	$4294967295, %esi
	je	.L13
	movl	$2, %eax
	addq	$1, %rsi
	cqto
	idivq	%rsi
	testb	$1, %al
	je	.L13
	testl	%ecx, %ecx
	je	.L17
.L13:
	movq	$0, a(%rip)
	xorl	%eax, %eax
	ret
.L17:
	pushq	%rax
	.cfi_def_cfa_offset 16
	call	foo
	xorl	%edx, %edx
	xorl	%eax, %eax
	movq	%rdx, a(%rip)
	popq	%rcx
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.local	a
	.comm	a,8,8
	.ident	"GCC: (GNU) 12.0.0 20210930 (experimental)"
	.section	.note.GNU-stack,"",@progbits
Comment 1 Martin Liška 2021-09-30 10:05:27 UTC
Started with r12-476-gd846f225c25c5885.
Comment 2 Richard Biener 2021-09-30 12:50:00 UTC
FRE1 has the following difference, simplifying the (unsigned int) truncation.

   <bb 2> :
   a.0_1 = a;
   _2 = (unsigned int) a.0_1;
   b = _2;
-  c_10 = (long int) _2;
+  _6 = a.0_1 & 4294967295;
+  c_10 = _6;
   if (c_10 != 0)
     goto <bb 3>; [INV]
   else

where the EVRP which now uses ranger retains (diff from GCC 11 to trunk):

   <bb 2> :
   a.0_1 = a;
   _2 = (unsigned int) a.0_1;
   b = _2;
-  c_10 = (long int) _2;
+  _6 = a.0_1 & 4294967295;
+  c_10 = _6;
   if (c_10 != 0)
     goto <bb 3>; [INV]
   else
-    goto <bb 4>; [INV]
+    goto <bb 6>; [INV]
 
   <bb 3> :
   _4 = c_10 + 1;
   iftmp.2_12 = 2 / _4;
+  if (iftmp.2_12 != 0)
+    goto <bb 4>; [INV]
+  else
+    goto <bb 6>; [INV]
 
   <bb 4> :
+  if (_2 == 0)
+    goto <bb 5>; [INV]
+  else
+    goto <bb 6>; [INV]
+
+  <bb 5> :
+  foo ();
+
+  <bb 6> :
   a = 0;
   return 0;


after EVRP we have

+  # RANGE [0, 4294967295] NONZERO 4294967295
   c_10 = _6;
...
+  # RANGE [2, 4294967296] NONZERO 8589934591
   _4 = c_10 + 1;
+  # RANGE [0, 1] NONZERO 1
   iftmp.2_12 = 2 / _4;
   if (iftmp.2_12 != 0)

what we did in GCC 11 is simplified the following check

  <bb 4> :
  if (_2 == 0)
    goto <bb 5>; [INV]
  else
    goto <bb 6>; [INV]

based on iftmp.2_12 == [1, 1] via ranger and

evrp visiting BB4
Visiting controlling predicate if (iftmp.2_12 != 0)
Adding assert for iftmp.2_12 from iftmp.2_12 != 0
Intersecting
  long int ~[0, 0]  EQUIVALENCES: { iftmp.2_12 } (1 elements)
and
  long int [0, 1]
to
  long int [1, 1]  EQUIVALENCES: { iftmp.2_12 } (1 elements)
Intersecting
  long int [0, 1]
and
  long int [1, 1]
to
  long int [1, 1]
pushing new range for iftmp.2_12: long int [1, 1]  EQUIVALENCES: { iftmp.2_12 } (1 elements)
evrp visiting stmt if (_2 == 0)
Folding statement: if (_2 == 0)

Visiting conditional with predicate: if (_2 == 0)

With known ranges
        _2: unsigned int [1, +INF]  EQUIVALENCES: { _2 } (1 elements)

Predicate evaluates to: 0
Folded into: if (0 != 0)

that's now missing, somehow due to the folded IL if the bisect is correct.
Comment 3 Aldy Hernandez 2021-09-30 17:39:37 UTC
To elide the foo(), _2 must be non-zero on the 2->3 edge dominating the call.

Interestingly, a.0_1 is non-zero on the 2->3 edge, and we have:

_2 = (unsigned int) a.0_1

but somehow we have no knowledge of _2.

Andrew, wanna take a stab at this?

=========== BB 2 ============
Imports: a.0_1  
Exports: a.0_1  _6  c_10  
         _2 : a.0_1(I)  
         _6 : a.0_1(I)  
         c_10 : a.0_1(I)  _6  
Equivalence set : []
Equivalence set : [_6, c_10]
    <bb 2> :
    a.0_1 = a;
    _2 = (unsigned int) a.0_1;
    b = _2;
    _6 = a.0_1 & 4294967295;
    c_10 = _6;
    if (c_10 != 0)
      goto <bb 3>; [INV]
    else
      goto <bb 6>; [INV]

_6 : long int [0, 4294967295]
c_10 : long int [0, 4294967295]
2->3  (T) a.0_1 : 	long int [-INF, -1][1, +INF]
2->3  (T) _6 : 	long int [1, 4294967295]
2->3  (T) c_10 : 	long int [1, 4294967295]
2->6  (F) a.0_1 : 	long int [-INF, -4294967296][0, +INF]
2->6  (F) _6 : 	long int [0, 0]
2->6  (F) c_10 : 	long int [0, 0]
Comment 4 Andrew Macleod 2021-10-01 21:02:17 UTC

(In reply to Richard Biener from comment #2)
> FRE1 has the following difference, simplifying the (unsigned int) truncation.
> 
>    <bb 2> :
>    a.0_1 = a;
>    _2 = (unsigned int) a.0_1;
>    b = _2;
> -  c_10 = (long int) _2;
> +  _6 = a.0_1 & 4294967295;
> +  c_10 = _6;
>    if (c_10 != 0)
>      goto <bb 3>; [INV]
>    else
> 

Why does FRE make this transformation/simplification?  It removes a relationship between c_10 and _2. The reason ranger no longer can fold _2 == 0 is because the sequence is now:

    a.0_1 = a;
    _2 = (unsigned int) a.0_1;
    b = _2;
    _6 = a.0_1 & 4294967295;
    c_10 = _6;
    if (c_10 != 0)
      goto <bb 3>; [INV]

We do not find _2 is non-zero on the outgoing edge because _2 is not related to the calculation in the condition.  (ie c_10 no longer has a dependency on _2)

We do recalculate _2 based on the outgoing range of a.0_1, but with it being a 64 bit value and _2 being 32 bits, we only know the outgoing range of a.0_1 is non-zero.. we dont track any of the upper bits... 
 2->3  (T) a.0_1 :       long int [-INF, -1][1, +INF]
And when we recalculate _2 using that value, we still get varying because 0xFFFF0000 in not zero, but can still produce a zero in _2.

The problem is that the condition c_10 != 0 no longer related to the value of _2 in the IL... so ranger never sees it. and we cant represent the 2^16 subranges that end in [1,0xFFFF].

Before that transformation, 
  _2 = (unsigned int) a.0_1;
   b = _2;
  c_10 = (long int) _2;
The relationship is obvious, and ranger would relate the c_10 != 0 to _2 no problem.
Comment 5 rguenther@suse.de 2021-10-04 06:36:22 UTC
On Fri, 1 Oct 2021, amacleod at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102540
> 
> --- Comment #4 from Andrew Macleod <amacleod at redhat dot com> ---
> 
> 
> (In reply to Richard Biener from comment #2)
> > FRE1 has the following difference, simplifying the (unsigned int) truncation.
> > 
> >    <bb 2> :
> >    a.0_1 = a;
> >    _2 = (unsigned int) a.0_1;
> >    b = _2;
> > -  c_10 = (long int) _2;
> > +  _6 = a.0_1 & 4294967295;
> > +  c_10 = _6;
> >    if (c_10 != 0)
> >      goto <bb 3>; [INV]
> >    else
> > 
> 
> Why does FRE make this transformation/simplification?

It's a match.pd transform that transforms a zero-extend from a smaller
precision via two NOP_EXPRs to a single BIT_AND_EXPR which is better and
more canonical on GIMPLE.

>  It removes a
> relationship between c_10 and _2. The reason ranger no longer can fold _2 == 0
> is because the sequence is now:
> 
>     a.0_1 = a;
>     _2 = (unsigned int) a.0_1;
>     b = _2;
>     _6 = a.0_1 & 4294967295;
>     c_10 = _6;
>     if (c_10 != 0)
>       goto <bb 3>; [INV]
> 
> We do not find _2 is non-zero on the outgoing edge because _2 is not related to
> the calculation in the condition.  (ie c_10 no longer has a dependency on _2)
> 
> We do recalculate _2 based on the outgoing range of a.0_1, but with it being a
> 64 bit value and _2 being 32 bits, we only know the outgoing range of a.0_1 is
> non-zero.. we dont track any of the upper bits... 
>  2->3  (T) a.0_1 :       long int [-INF, -1][1, +INF]
> And when we recalculate _2 using that value, we still get varying because
> 0xFFFF0000 in not zero, but can still produce a zero in _2.
> 
> The problem is that the condition c_10 != 0 no longer related to the value of
> _2 in the IL... so ranger never sees it. and we cant represent the 2^16
> subranges that end in [1,0xFFFF].
> 
> Before that transformation, 
>   _2 = (unsigned int) a.0_1;
>    b = _2;
>   c_10 = (long int) _2;
> The relationship is obvious, and ranger would relate the c_10 != 0 to _2 no
> problem.

I see - too bad.  Note the transform made the dependence chain of _6
one instruction shorter without increasing the number of instructions
so it's a profitable transform.

Btw, the relation is still there but only indirectly via a.0_1.  The
old (E)VRP had this find_asserts(?) that produced assertions based
on the definitions - sth that now range-ops does(?), so it would
eventually have built assertions for a.0_1 for both conditions and
allow relations based on that?  I can't seem to find my way around
the VRP code now - pieces moved all over the place and so my mind
fails me on the searching task :/
Comment 6 Andrew Macleod 2021-10-04 17:15:10 UTC
> 
> >  It removes a
> > relationship between c_10 and _2. The reason ranger no longer can fold _2 == 0
> > is because the sequence is now:
> > 
> >     a.0_1 = a;
> >     _2 = (unsigned int) a.0_1;
> >     b = _2;
> >     _6 = a.0_1 & 4294967295;
> >     c_10 = _6;
> >     if (c_10 != 0)
> >       goto <bb 3>; [INV]
> > 
> > We do not find _2 is non-zero on the outgoing edge because _2 is not related to
> > the calculation in the condition.  (ie c_10 no longer has a dependency on _2)
> > 
> > We do recalculate _2 based on the outgoing range of a.0_1, but with it being a
> > 64 bit value and _2 being 32 bits, we only know the outgoing range of a.0_1 is
> > non-zero.. we dont track any of the upper bits... 
> >  2->3  (T) a.0_1 :       long int [-INF, -1][1, +INF]
> > And when we recalculate _2 using that value, we still get varying because
> > 0xFFFF0000 in not zero, but can still produce a zero in _2.
> > 
> > The problem is that the condition c_10 != 0 no longer related to the value of
> > _2 in the IL... so ranger never sees it. and we cant represent the 2^16
> > subranges that end in [1,0xFFFF].
> > 
> > Before that transformation, 
> >   _2 = (unsigned int) a.0_1;
> >    b = _2;
> >   c_10 = (long int) _2;
> > The relationship is obvious, and ranger would relate the c_10 != 0 to _2 no
> > problem.
> 
> I see - too bad.  Note the transform made the dependence chain of _6
> one instruction shorter without increasing the number of instructions
> so it's a profitable transform.
> 
> Btw, the relation is still there but only indirectly via a.0_1.  The
> old (E)VRP had this find_asserts(?) that produced assertions based
> on the definitions - sth that now range-ops does(?), so it would
> eventually have built assertions for a.0_1 for both conditions and
> allow relations based on that?  I can't seem to find my way around
> the VRP code now - pieces moved all over the place and so my mind
> fails me on the searching task :/

We do know that a.0_1 is non-zero on that edge:
2->3  (T) a.0_1 :       long int [-INF, -1][1, +INF]

the problem is that we can't currently represent that the bitmask operation causes all patterns ending in 0x00000000 to not occur.. we just leave it at ~[0,0].  which isn't sufficient for this use case. 

we don't currently track any equivalences between values of different precision.. (even though ranger once did).   Handling it as a general equivalence was fraught with issues. 

We might be able to add a new equivalence class "slice" or something.. I had considered it but hadn't seen a great need case.   This would make _6 a 32 bit slice of a.0_1 with range [1, 0xffffffff].
Then when we are querying for the cast
  _2 = (unsigned int) a.0_1;
we could also query the 32 bit equivalence slices of a.0_1, find _6, and get the outgoing range of [1,0xffffffff].. and apply that value.

It would probably resolve an entire class of things where we don't recognize an equivalence between a cast and a bitmask of equivalent precision.

This would also mean the reverse would apply.. ie if we instead branched on _2 != 0 we would also understand that _6 will be non-zero.
Comment 7 rguenther@suse.de 2021-10-05 06:19:40 UTC
On Mon, 4 Oct 2021, amacleod at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102540
> 
> --- Comment #6 from Andrew Macleod <amacleod at redhat dot com> ---
> 
> > 
> > >  It removes a
> > > relationship between c_10 and _2. The reason ranger no longer can fold _2 == 0
> > > is because the sequence is now:
> > > 
> > >     a.0_1 = a;
> > >     _2 = (unsigned int) a.0_1;
> > >     b = _2;
> > >     _6 = a.0_1 & 4294967295;
> > >     c_10 = _6;
> > >     if (c_10 != 0)
> > >       goto <bb 3>; [INV]
> > > 
> > > We do not find _2 is non-zero on the outgoing edge because _2 is not related to
> > > the calculation in the condition.  (ie c_10 no longer has a dependency on _2)
> > > 
> > > We do recalculate _2 based on the outgoing range of a.0_1, but with it being a
> > > 64 bit value and _2 being 32 bits, we only know the outgoing range of a.0_1 is
> > > non-zero.. we dont track any of the upper bits... 
> > >  2->3  (T) a.0_1 :       long int [-INF, -1][1, +INF]
> > > And when we recalculate _2 using that value, we still get varying because
> > > 0xFFFF0000 in not zero, but can still produce a zero in _2.
> > > 
> > > The problem is that the condition c_10 != 0 no longer related to the value of
> > > _2 in the IL... so ranger never sees it. and we cant represent the 2^16
> > > subranges that end in [1,0xFFFF].
> > > 
> > > Before that transformation, 
> > >   _2 = (unsigned int) a.0_1;
> > >    b = _2;
> > >   c_10 = (long int) _2;
> > > The relationship is obvious, and ranger would relate the c_10 != 0 to _2 no
> > > problem.
> > 
> > I see - too bad.  Note the transform made the dependence chain of _6
> > one instruction shorter without increasing the number of instructions
> > so it's a profitable transform.
> > 
> > Btw, the relation is still there but only indirectly via a.0_1.  The
> > old (E)VRP had this find_asserts(?) that produced assertions based
> > on the definitions - sth that now range-ops does(?), so it would
> > eventually have built assertions for a.0_1 for both conditions and
> > allow relations based on that?  I can't seem to find my way around
> > the VRP code now - pieces moved all over the place and so my mind
> > fails me on the searching task :/
> 
> We do know that a.0_1 is non-zero on that edge:
> 2->3  (T) a.0_1 :       long int [-INF, -1][1, +INF]
> 
> the problem is that we can't currently represent that the bitmask operation
> causes all patterns ending in 0x00000000 to not occur.. we just leave it at
> ~[0,0].  which isn't sufficient for this use case. 

Hmm, but we do have nonzero bits on SSA where we also store global
ranges, so there is a way to store the info and you could intersect
the sliced range produced from it with the range you got?

> we don't currently track any equivalences between values of different
> precision.. (even though ranger once did).   Handling it as a general
> equivalence was fraught with issues. 
> 
> We might be able to add a new equivalence class "slice" or something.. I had
> considered it but hadn't seen a great need case.   This would make _6 a 32 bit
> slice of a.0_1 with range [1, 0xffffffff].
> Then when we are querying for the cast
>   _2 = (unsigned int) a.0_1;
> we could also query the 32 bit equivalence slices of a.0_1, find _6, and get
> the outgoing range of [1,0xffffffff].. and apply that value.
> 
> It would probably resolve an entire class of things where we don't recognize an
> equivalence between a cast and a bitmask of equivalent precision.
> 
> This would also mean the reverse would apply.. ie if we instead branched on _2
> != 0 we would also understand that _6 will be non-zero.

I believe tracking known zero/one bits in addition to a range is more
useful - would that help in this case?
Comment 8 Andrew Macleod 2021-10-05 13:59:13 UTC
> > 
> > It would probably resolve an entire class of things where we don't recognize an
> > equivalence between a cast and a bitmask of equivalent precision.
> > 
> > This would also mean the reverse would apply.. ie if we instead branched on _2
> > != 0 we would also understand that _6 will be non-zero.
> 
> I believe tracking known zero/one bits in addition to a range is more
> useful - would that help in this case?

Thats in plan and what I first thought would fix it.  Reflecting on this problem however, it would only help on the zero side where all the bits are known zero, but the non-zero property we are looking for can't be reflected by known ones or zeros.. 

Unfortunately I don't see how we can solve this particular problem by tracking known bit values.. there wont be any known 1s or 0s in a.0_1...  just a particular bit pattern that cannot occur.

Another option would be if I can get a cheap enough low-opt pass of evrp (also on my list) we could run it really early before things get rearranged and then run the higher levels later.

Anyway, I'll keep thinking about this...
Comment 9 Richard Biener 2022-01-20 09:44:27 UTC
I think the GCC 12 IL would require tracking equivalences on parts of registers,
in this case that _2 is equal to the low part of a.0_1.  That is, one would
need to extend what CCP does with bit propagation to include equivalences.

There's also the if ((long)unsigned-var != 0) -> if (unsigned-var != 0)
transform that could save us here but we have that guarded with
single_use () and the converted value is used after the branching
but not the source so we'd have increased register pressure.

I do not see a good way to fix this issue, slightly altering the testcase to
do the undesired transform on the source level

static long a;
static unsigned b;
void foo(void);
int main() {
    long c, e;
    b = a;
    c = (unsigned)a;
    e = c ? 2 / (c + 1) : 0;
    if (e && !b)
        foo();
    a = 0;
}

causes the issue to appear also with GCC 11.
Comment 10 Andrew Macleod 2022-01-20 14:42:29 UTC
(In reply to Richard Biener from comment #9)
> I think the GCC 12 IL would require tracking equivalences on parts of
> registers,
> in this case that _2 is equal to the low part of a.0_1.  That is, one would
> need to extend what CCP does with bit propagation to include equivalences.
> 

I have in my workqueue some thoughts for GCC13 where we can track partial equivalences so we can see through different casts and masks. 

    a.0_1 = a;
    _2 = (unsigned int) a.0_1;
    b = _2;
    _6 = a.0_1 & 4294967295;
    c_10 = _6;
    if (c_10 != 0)
      goto <bb 3>; [INV]

_2 would be a 32 bit equivalence with a.0_1 generated by op_cast, and
_6 would also be a 32 bit equivalence with a.0_1 generated  op_bitwise_and.
They would all end up in the same partial equivalence set.

and when a range for anything in that set falls into the bitrange of the equivalence size,  like the [0,0] , then it can be applied (with some restrictions), to other members of the set.   Its an extension of some thought I had for trying to relates different size casts to each other, just extended to include masking.


There is a series of other PRs which would be fixed by this sort of ability. it has become prevalent enough to warrant the work I think. (Also 79191 91881 102705 102872)
Comment 11 Jakub Jelinek 2022-05-06 08:31:16 UTC
GCC 12.1 is being released, retargeting bugs to GCC 12.2.
Comment 12 Richard Biener 2022-08-19 08:24:26 UTC
GCC 12.2 is being released, retargeting bugs to GCC 12.3.
Comment 13 GCC Commits 2022-10-13 15:29:41 UTC
The master branch has been updated by Andrew Macleod <amacleod@gcc.gnu.org>:

https://gcc.gnu.org/g:6cc3394507a2303a18891d34222c53f679256c37

commit r13-3281-g6cc3394507a2303a18891d34222c53f679256c37
Author: Andrew MacLeod <amacleod@redhat.com>
Date:   Wed Oct 5 10:42:07 2022 -0400

    propagate partial equivs in the cache.
    
    Adjust on-entry cache propagation to look for and propagate both full
    and partial equivalences.
    
            gcc/
            PR tree-optimization/102540
            PR tree-optimization/102872
            * gimple-range-cache.cc (ranger_cache::fill_block_cache):
            Handle partial equivs.
            (ranger_cache::range_from_dom): Cleanup dump output.
    
            gcc/testsuite/
            * gcc.dg/pr102540.c: New.
            * gcc.dg/pr102872.c: New.
Comment 14 Andrew Macleod 2022-10-13 17:03:17 UTC
fixed.