Bug 77881 - [6 Regression] Non-optimal signed comparison on x86_64 since r146817
Summary: [6 Regression] Non-optimal signed comparison on x86_64 since r146817
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 7.0
: P2 normal
Target Milestone: 7.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2016-10-06 12:17 UTC by Jakub Jelinek
Modified: 2018-10-26 11:22 UTC (History)
3 users (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work: 7.0
Known to fail: 5.4.0, 6.2.0
Last reconfirmed: 2016-10-06 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jakub Jelinek 2016-10-06 12:17:34 UTC
int
foo (long long int a, int b)
{
  if (a >= 0 || b)
    baz ();
}

int
bar (long long int a, int b)
{
  if (a < 0 || b)
    baz ();
}

on x86_64-linux with -O2 is compiled into following starting with r146817:
        testq   %rdi, %rdi
        jns     .L4
        testl   %esi, %esi
        jne     .L4
for foo, which looks reasonable, and
        shrq    $63, %rdi
        testb   %dil, %dil
        jne     .L9
        testl   %esi, %esi
        jne     .L9
in bar, where I'd expect and we used to emit in the past
        testq   %rdi, %rdi
        js      .L9
        testl   %esi, %esi
        jne     .L9
or something similar.  I wonder if we want a combine pattern that would transform right shift by precision - 1 followed by comparison of the result against 0 if the result of the right shift is dead into just signed comparison.
Comment 1 Richard Biener 2016-10-06 12:25:32 UTC
The question is why we expand a < 0 to shift and test in the first place.
Comment 2 Michael Matz 2016-10-06 12:27:25 UTC
Exactly.  Whatever makes it currently work for >=0 should be made to work for
<0 as well.
Comment 3 Jakub Jelinek 2016-10-06 12:32:03 UTC
Well, generally if we want to store a < 0 or a >= 0 into a bool/int variable etc., then the right shift is desirable, because we avoid branching or sets/setns, but not when the result is used in a conditional jump.
Comment 4 Michael Matz 2016-10-06 15:33:37 UTC
Actually, it's merely a deficiency in current combine not simplifying intermediate expressions enough.  One of the things that need to happen is 
the following transformation:

(compare:CCZ (subreg:QI (lshiftrt:DI (reg:DI 95)
            (const_int 63 [0x3f])) 0)
    (const_int 0 [0]))

--> (remove subreg)

(compare:CCZ (lshiftrt:DI (reg:DI 95)
            (const_int 63 [0x3f]))
    (const_int 0 [0]))

--> (recognize that it's a sign bit extract)

(ge (reg:DI 95) (const_int 0))

(or 'lt', doesn't matter, depends on the outer code by which the compare:CCZ
result itself is compared).

In current combine.c this requires two steps, the removal of the irrelevant
subreg (irrelevant in this specific context), and then the recognition of
the sign bit extraction.  With the first function combine has the chance
for two attempts of simplification because we have a sequence of three 
isnstructions to start with:

  t1 = ~a
  t2 = 255 & (t1 >> 63)
  flags = t2 != 0

The intermediate expression is combine_simplify_rtx'ed twice, so both steps
above happen, and we get good code.  In the second function we only have
two instructions to start with (the not is missing):

  t2 = 255 & (a >> 63)
  flags = t2 == 0

Only the first step above happens, we're left with the (lshiftrt(subreg)) and
nothing simplifies this further before it tries to recognize the insn
(which ultimately fails).

The same can also be seen when artificially forcing the expression to
be a tad more complicated:

int bar2 (long long int a, long long int a2, int b) {
  if ((a+a2) < 0 || b)
    baz();
}

Here, we also get good code again, simply because combine can have two
attempts at the intermediate expression.

After some amount of tracing combine I've come up with the below patch.
The simplify_comparison function already contains loops that effectively
retry simplification after a change occured.  But the code that actually removes
a useless subreg is after that loop.  Putting it into the loop as well
fixes the problem.  It can't be removed from the old place because between the
loop and subreg removal it might actually change the expression further
due to make_compound_operation (though I don't know if that really creates
subregs often).

combines normal facilities of not doing combines when intermediate results
are really used outside will take care of not creating useless code.

Index: combine.c
===================================================================
--- combine.c   (revision 235171)
+++ combine.c   (working copy)
@@ -11891,6 +11891,27 @@ simplify_comparison (enum rtx_code code,
          if (subreg_lowpart_p (op0)
              && GET_MODE_PRECISION (GET_MODE (SUBREG_REG (op0))) < mode_width)
            /* Fall through */ ;
+         else if (subreg_lowpart_p (op0)
+                  && GET_MODE_CLASS (GET_MODE (op0)) == MODE_INT
+                  && GET_MODE_CLASS (GET_MODE (SUBREG_REG (op0))) == MODE_INT
+                  && (code == NE || code == EQ)
+                  && (GET_MODE_PRECISION (GET_MODE (SUBREG_REG (op0)))
+                      <= HOST_BITS_PER_WIDE_INT)
+                  && !paradoxical_subreg_p (op0)
+                  && (nonzero_bits (SUBREG_REG (op0),
+                                    GET_MODE (SUBREG_REG (op0)))
+                      & ~GET_MODE_MASK (GET_MODE (op0))) == 0)
+           {
+             tem = gen_lowpart (GET_MODE (SUBREG_REG (op0)), op1);
+
+             if ((nonzero_bits (tem, GET_MODE (SUBREG_REG (op0)))
+                  & ~GET_MODE_MASK (GET_MODE (op0))) == 0)
+               {
+                 op0 = SUBREG_REG (op0), op1 = tem;
+                 continue;
+               }
+             break;
+           }
          else
            break;
Comment 5 Segher Boessenkool 2016-10-06 16:24:31 UTC
That looks good, please submit to gcc-patches?
Comment 6 Michael Matz 2016-11-15 14:03:00 UTC
Author: matz
Date: Tue Nov 15 14:02:28 2016
New Revision: 242414

URL: https://gcc.gnu.org/viewcvs?rev=242414&root=gcc&view=rev
Log:
	PR missed-optimization/77881
	* combine.c (simplify_comparison): Remove useless subregs
	also inside the loop, not just after it.
	(make_compound_operation): Recognize some subregs as being
	masking as well.

testsuite/
	* gcc.target/i386/pr77881.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr77881.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/combine.c
    trunk/gcc/testsuite/ChangeLog
Comment 7 Michael Matz 2016-11-15 14:05:57 UTC
Fixed for gcc 7.  Not planning backports myself , it's a minor code quality
regression only.  But keeping it open in case anyone else wants to.
Comment 8 Segher Boessenkool 2016-11-23 14:33:46 UTC
Author: segher
Date: Wed Nov 23 14:33:13 2016
New Revision: 242757

URL: https://gcc.gnu.org/viewcvs?rev=242757&root=gcc&view=rev
Log:
combine: Convert subreg-of-lshiftrt to zero_extract properly (PR78390)

r242414, for PR77881, introduces some bugs (PR78390, PR78438, PR78477).
It all has the same root cause: that patch makes combine convert every
lowpart subreg of a logical shift right to a zero_extract.  This cannot
work at all if it is not a constant shift, and it has to be a bit more
careful exactly which bits it extracts.


	PR target/77881
	PR bootstrap/78390
	PR target/78438
	PR bootstrap/78477
	* combine.c (make_compound_operation_int): Do not convert a subreg of
	a non-constant logical shift right to a zero_extract.  Handle the case
	where some zero bits have been shifted into the range covered by that
	subreg.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/combine.c
Comment 9 Jakub Jelinek 2017-10-10 13:27:48 UTC
GCC 5 branch is being closed
Comment 10 Eric Gallager 2018-10-13 17:56:28 UTC
(In reply to Michael Matz from comment #7)
> Fixed for gcc 7.  Not planning backports myself , it's a minor code quality
> regression only.  But keeping it open in case anyone else wants to.

Well, if anyone else wants to, now's their last chance...
Comment 11 Jakub Jelinek 2018-10-26 11:22:15 UTC
GCC 6 branch is being closed, fixed in 7.x.