69615 – 0 to limit signed range checks don't always use unsigned compare

Bug 69615 - 0 to limit signed range checks don't always use unsigned compare

Summary: 0 to limit signed range checks don't always use unsigned compare

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	5.3.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2016-02-02 07:30 UTC by Peter Cordes
Modified:	2018-11-19 14:14 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2016-02-02 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Peter Cordes 2016-02-02 07:30:41 UTC

gcc sometimes misses the unsigned-compare trick for checking if a signed value is between 0 and limit (where limit is known to be <= INT_MAX).

It seems that gcc fails when the upper limit is a variable, even if I shift or mask it down to a small range.  clang handles this case, so I'm sure I constructed my test case in a way that could be optimized.



All the code in this bug report is on godbolt, for ease of trying with older versions of gcc (including for ARM64/ARM/PPC), and with clang / icc13.  http://goo.gl/V7PFmv.   (I used -x c to compile as C, even though it only provides c++ compilers).

This appears to be arch-independent (unless my quick skim of asm for ISAs I barely know misled me...)

--------

The simplest case is when the upper limit is a compile-time constant.  There's one case where gcc and clang fail to optimize:  x<=(INT_MAX-1), or equivalently, x<INT_MAX.


#include <limits.h>
#include <stdint.h>
extern void ext(void);

// clang and gcc both optimize range checks up to INT_MAX-2 to a single unsigned compare
void r0_to_imax_2(int x){ if (x>=0 && x<=(INT_MAX-2)) ext(); }  // good code
void r0_to_imax_1(int x){ if (x>=0 && x<=(INT_MAX-1)) ext(); }  // bad code
void r0_to_imax  (int x){ if (x>=0 && x<=(INT_MAX-0)) ext(); }  // good code (test/js.  not shown)

gcc 5.3.0 -Ofast -mtune=haswell  compiles this to:

r0_to_imax_2:
        cmpl    $2147483645, %edi   # that's 0x7ffffffd
        jbe     .L4
        ret
.L4:    jmp     ext

r0_to_imax_1:
        movl    %edi, %eax
        subl    $0, %eax       ## Without any -mtune, uses test edi,edi
        js      .L5
        cmpl    $2147483647, %edi   # that's 0x7fffffff
        je      .L5
        jmp     ext
.L5:    ret

ICC13 compiles this last one to cmp  edi, 0x7ffffffe / ja, so unless my mental logic is wrong *and* icc13 is buggy, gcc and clang should still be able to  use the same optimization as for smaller upper-limits.  They don't: both clang and gcc use two compare-and-branches for r0_to_imax_1.

BTW, the movl %edi, %eax / subl $0, %eax sequence is used instead of the test instruction with -mtune=haswell, and even worse with -march=bdver2 where it even prevents fusion into a compare-and-branch m-op.  I'll file a separate bug report for that if anyone wants me to.  Agner Fog's microarch guide doesn't mention anything that would give that sequence an advantage over test, unless I'm missing something.  It slows AMD down more than (recent) Intel, but that's not what tuning for Haswell means. :P



Now, on to the case where the limit is variable, but can easily be proven to itself be in the range [0 .. INT_MAX-1) or much smaller.  (If the limit can be negative (or unsigned greater than INT_MAX) the optimization is impossible:  INT_MIN and other negative numbers could be "below" the limit.)


// gcc always fails to optimize this to an unsigned compare, but clang succeeds
void rangecheck_var(int64_t x, int64_t lim2) {
  //lim2 >>= 60;
  lim2 &= 0xf;  // let the compiler figure out the limited range of limit
  if (x>=0 && x<lim2) ext();
}

compiles to (with -O3, and -mtune=core2 to avoid the sub $0 sillyness):

rangecheck_var:
        andl    $15, %esi
        cmpq    %rdi, %rsi
        jle     .L16
        testq   %rdi, %rdi
        js      .L16
        jmp     ext
.L16:   ret

Comment 1 Richard Biener 2016-02-02 09:01:16 UTC

GCC relies on some fold() routine to do this and we end up with

;; Function r0_to_imax_2 (null)
;; enabled by -tree-original


{
  if ((unsigned int) x <= 2147483645)
    {
      ext ();
    }
}


;; Function r0_to_imax_1 (null)
;; enabled by -tree-original


{
  if (x >= 0 && x != 2147483647)
    {
      ext ();
    }
}

;; Function r0_to_imax (null)
;; enabled by -tree-original


{
  if (x >= 0)
    {
      ext ();
    }


so it seems we are confused by the trick that triggers first, replacing
the <= INT_MAX-1 compare with a != INT_MAX compare and that not being handled
in the range construction code.

Looks like ifcombine doesn't handle it either (maybe_fold_and_comparisons).

void ext(void);
void r0_to_imax_1(int x){ if (x>=0 && x<=(__INT_MAX__-1)) ext(); }

Comment 2 Jakub Jelinek 2016-02-02 09:07:59 UTC

Shouldn't be hard to teach the range code in reassoc about this, but stage1 material.

Comment 3 Peter Cordes 2016-02-02 11:00:11 UTC

@Richard and Jakub:

That's just addressing the first part of my report, the problem with  x <= (INT_MAX-1), right?

You may have missed the second part of the problem, since I probably buried it under too much detail with the first:

In the case where the limit is variable, but can easily be proven to itself be in the range [0 .. INT_MAX-1) or much smaller:

// gcc always fails to optimize this to an unsigned compare, but clang succeeds
void rangecheck_var(int64_t x, int64_t lim2) {
  //lim2 >>= 60;
  lim2 &= 0xf;  // let the compiler figure out the limited range of limit
  if (x>=0 && x<lim2) ext();
}


---

I noticed after I submitted the report that I maybe should have tagged it tree-optimization, thanks for fixing.  I don't really know gcc internals; I just file reports when it makes less than optimal code.

Comment 4 Jakub Jelinek 2016-02-02 11:16:18 UTC

I suppose even that is doable in the reassoc framework, or it could be done in match.pd just using the recorded value ranges, like richi has handled PR69595, or both.

Comment 5 Peter Cordes 2018-06-02 08:49:30 UTC

Update: https://godbolt.org/g/ZQDY1G

gcc7/8 optimizes this to and / cmp / jb, while gcc6.3 doesn't.

void rangecheck_var(int64_t x, int64_t lim2) {
  //lim2 >>= 60;
  lim2 &= 0xf;  // let the compiler figure out the limited range of limit
  if (x>=0 && x<lim2) ext();
}


gcc7 also fixed the -mtune=haswell bug where we got `movl %edi, %eax; subl $0, %eax` instead of  `test %edi,%edi`.

gcc8.1 still has the missed optimization with a compile-time constant 0..INT_MAX-1 range:

void r0_to_imax_1(int x){ if (x>=0 && x<=(INT_MAX-1)) ext(); }  // clang and gcc use 2 branches

Comment 6 Jakub Jelinek 2018-06-02 16:41:40 UTC

For r0_to_imax_1 the following works for me:
--- gcc/fold-const.c.jj	2018-05-31 20:53:33.000000000 +0200
+++ gcc/fold-const.c	2018-06-02 18:36:23.795635887 +0200
@@ -5084,6 +5084,35 @@ merge_ranges (int *pin_p, tree *plow, tr
       tem = high0, high0 = high1, high1 = tem;
     }
 
+  /* If the second range is != high1 where high1 is the type maximum of
+     the type, try first merging with < high1 range.  */
+  if (low1
+      && high1
+      && TREE_CODE (low1) == INTEGER_CST
+      && (TREE_CODE (TREE_TYPE (low1)) == INTEGER_TYPE
+	  || (TREE_CODE (TREE_TYPE (low1)) == ENUMERAL_TYPE
+	      && known_eq (TYPE_PRECISION (TREE_TYPE (low1)),
+			   GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (low1))))))
+      && tree_int_cst_equal (low1, TYPE_MAX_VALUE (TREE_TYPE (low1)))
+      && operand_equal_p (low1, high1, 0)
+      && merge_ranges (pin_p, plow, phigh, in0_p, low0, high0,
+		       !in1_p, NULL_TREE, range_predecessor (low1)))
+    return true;
+
+  /* Similarly for first range != low0 where low0 is the type minimum of
+     the type, try first merging with > low0 range.  */
+  if (low0
+      && high0
+      && (TREE_CODE (TREE_TYPE (low0)) == INTEGER_TYPE
+	  || (TREE_CODE (TREE_TYPE (low0)) == ENUMERAL_TYPE
+	      && known_eq (TYPE_PRECISION (TREE_TYPE (low0)),
+			   GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (low0))))))
+      && tree_int_cst_equal (low0, TYPE_MIN_VALUE (TREE_TYPE (low0)))
+      && operand_equal_p (low0, high0, 0)
+      && merge_ranges (pin_p, plow, phigh, !in0_p, range_successor (low0),
+		       NULL_TREE, in1_p, low1, high1))
+    return true;
+
   /* Now flag two cases, whether the ranges are disjoint or whether the
      second range is totally subsumed in the first.  Note that the tests
      below are simplified by the ones above.  */

The thing is that we canonicalize < INT_MAX (and <= INT_MAX-1) to != INT_MAX-1 and in the range code we'd better try the first form instead.

Comment 7 Jakub Jelinek 2018-06-04 07:38:28 UTC

Author: jakub
Date: Mon Jun  4 07:37:56 2018
New Revision: 261139

URL: https://gcc.gnu.org/viewcvs?rev=261139&root=gcc&view=rev
Log:
	PR tree-optimization/69615
	* fold-const.c (merge_ranges): If range1 is - [x, x] and x is the
	maximum or minimum of the type, try to merge it also as if
	range1 is + [-, x - 1] or + [x + 1, -].

	* gcc.dg/pr69615.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/pr69615.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/testsuite/ChangeLog

Comment 8 Jakub Jelinek 2018-06-07 07:41:49 UTC

Author: jakub
Date: Thu Jun  7 07:41:18 2018
New Revision: 261264

URL: https://gcc.gnu.org/viewcvs?rev=261264&root=gcc&view=rev
Log:
	PR tree-optimization/69615
	* tree-ssa-reassoc.c (optimize_range_tests_var_bound): If rhs2 is lhs
	of a cast from a same precision integral SSA_NAME in a bb dominated
	by first_bb, retry with rhs2 set to the rhs1 of the cast.  Don't emit
	cast to utype if rhs2 has already a compatible type.

	* gcc.dg/tree-ssa/pr69615.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr69615.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-reassoc.c

Comment 9 Martin Liška 2018-11-19 14:09:42 UTC

Jakub: Can you please update Known to work?

Comment 10 Jakub Jelinek 2018-11-19 14:14:12 UTC

Fixed for GCC9+.