gcc sometimes misses the unsigned-compare trick for checking if a signed value is between 0 and limit (where limit is known to be <= INT_MAX). It seems that gcc fails when the upper limit is a variable, even if I shift or mask it down to a small range. clang handles this case, so I'm sure I constructed my test case in a way that could be optimized. All the code in this bug report is on godbolt, for ease of trying with older versions of gcc (including for ARM64/ARM/PPC), and with clang / icc13. http://goo.gl/V7PFmv. (I used -x c to compile as C, even though it only provides c++ compilers). This appears to be arch-independent (unless my quick skim of asm for ISAs I barely know misled me...) -------- The simplest case is when the upper limit is a compile-time constant. There's one case where gcc and clang fail to optimize: x<=(INT_MAX-1), or equivalently, x<INT_MAX. #include <limits.h> #include <stdint.h> extern void ext(void); // clang and gcc both optimize range checks up to INT_MAX-2 to a single unsigned compare void r0_to_imax_2(int x){ if (x>=0 && x<=(INT_MAX-2)) ext(); } // good code void r0_to_imax_1(int x){ if (x>=0 && x<=(INT_MAX-1)) ext(); } // bad code void r0_to_imax (int x){ if (x>=0 && x<=(INT_MAX-0)) ext(); } // good code (test/js. not shown) gcc 5.3.0 -Ofast -mtune=haswell compiles this to: r0_to_imax_2: cmpl $2147483645, %edi # that's 0x7ffffffd jbe .L4 ret .L4: jmp ext r0_to_imax_1: movl %edi, %eax subl $0, %eax ## Without any -mtune, uses test edi,edi js .L5 cmpl $2147483647, %edi # that's 0x7fffffff je .L5 jmp ext .L5: ret ICC13 compiles this last one to cmp edi, 0x7ffffffe / ja, so unless my mental logic is wrong *and* icc13 is buggy, gcc and clang should still be able to use the same optimization as for smaller upper-limits. They don't: both clang and gcc use two compare-and-branches for r0_to_imax_1. BTW, the movl %edi, %eax / subl $0, %eax sequence is used instead of the test instruction with -mtune=haswell, and even worse with -march=bdver2 where it even prevents fusion into a compare-and-branch m-op. I'll file a separate bug report for that if anyone wants me to. Agner Fog's microarch guide doesn't mention anything that would give that sequence an advantage over test, unless I'm missing something. It slows AMD down more than (recent) Intel, but that's not what tuning for Haswell means. :P Now, on to the case where the limit is variable, but can easily be proven to itself be in the range [0 .. INT_MAX-1) or much smaller. (If the limit can be negative (or unsigned greater than INT_MAX) the optimization is impossible: INT_MIN and other negative numbers could be "below" the limit.) // gcc always fails to optimize this to an unsigned compare, but clang succeeds void rangecheck_var(int64_t x, int64_t lim2) { //lim2 >>= 60; lim2 &= 0xf; // let the compiler figure out the limited range of limit if (x>=0 && x<lim2) ext(); } compiles to (with -O3, and -mtune=core2 to avoid the sub $0 sillyness): rangecheck_var: andl $15, %esi cmpq %rdi, %rsi jle .L16 testq %rdi, %rdi js .L16 jmp ext .L16: ret
GCC relies on some fold() routine to do this and we end up with ;; Function r0_to_imax_2 (null) ;; enabled by -tree-original { if ((unsigned int) x <= 2147483645) { ext (); } } ;; Function r0_to_imax_1 (null) ;; enabled by -tree-original { if (x >= 0 && x != 2147483647) { ext (); } } ;; Function r0_to_imax (null) ;; enabled by -tree-original { if (x >= 0) { ext (); } so it seems we are confused by the trick that triggers first, replacing the <= INT_MAX-1 compare with a != INT_MAX compare and that not being handled in the range construction code. Looks like ifcombine doesn't handle it either (maybe_fold_and_comparisons). void ext(void); void r0_to_imax_1(int x){ if (x>=0 && x<=(__INT_MAX__-1)) ext(); }
Shouldn't be hard to teach the range code in reassoc about this, but stage1 material.
@Richard and Jakub: That's just addressing the first part of my report, the problem with x <= (INT_MAX-1), right? You may have missed the second part of the problem, since I probably buried it under too much detail with the first: In the case where the limit is variable, but can easily be proven to itself be in the range [0 .. INT_MAX-1) or much smaller: // gcc always fails to optimize this to an unsigned compare, but clang succeeds void rangecheck_var(int64_t x, int64_t lim2) { //lim2 >>= 60; lim2 &= 0xf; // let the compiler figure out the limited range of limit if (x>=0 && x<lim2) ext(); } --- I noticed after I submitted the report that I maybe should have tagged it tree-optimization, thanks for fixing. I don't really know gcc internals; I just file reports when it makes less than optimal code.
I suppose even that is doable in the reassoc framework, or it could be done in match.pd just using the recorded value ranges, like richi has handled PR69595, or both.
Update: https://godbolt.org/g/ZQDY1G gcc7/8 optimizes this to and / cmp / jb, while gcc6.3 doesn't. void rangecheck_var(int64_t x, int64_t lim2) { //lim2 >>= 60; lim2 &= 0xf; // let the compiler figure out the limited range of limit if (x>=0 && x<lim2) ext(); } gcc7 also fixed the -mtune=haswell bug where we got `movl %edi, %eax; subl $0, %eax` instead of `test %edi,%edi`. gcc8.1 still has the missed optimization with a compile-time constant 0..INT_MAX-1 range: void r0_to_imax_1(int x){ if (x>=0 && x<=(INT_MAX-1)) ext(); } // clang and gcc use 2 branches
For r0_to_imax_1 the following works for me: --- gcc/fold-const.c.jj 2018-05-31 20:53:33.000000000 +0200 +++ gcc/fold-const.c 2018-06-02 18:36:23.795635887 +0200 @@ -5084,6 +5084,35 @@ merge_ranges (int *pin_p, tree *plow, tr tem = high0, high0 = high1, high1 = tem; } + /* If the second range is != high1 where high1 is the type maximum of + the type, try first merging with < high1 range. */ + if (low1 + && high1 + && TREE_CODE (low1) == INTEGER_CST + && (TREE_CODE (TREE_TYPE (low1)) == INTEGER_TYPE + || (TREE_CODE (TREE_TYPE (low1)) == ENUMERAL_TYPE + && known_eq (TYPE_PRECISION (TREE_TYPE (low1)), + GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (low1)))))) + && tree_int_cst_equal (low1, TYPE_MAX_VALUE (TREE_TYPE (low1))) + && operand_equal_p (low1, high1, 0) + && merge_ranges (pin_p, plow, phigh, in0_p, low0, high0, + !in1_p, NULL_TREE, range_predecessor (low1))) + return true; + + /* Similarly for first range != low0 where low0 is the type minimum of + the type, try first merging with > low0 range. */ + if (low0 + && high0 + && (TREE_CODE (TREE_TYPE (low0)) == INTEGER_TYPE + || (TREE_CODE (TREE_TYPE (low0)) == ENUMERAL_TYPE + && known_eq (TYPE_PRECISION (TREE_TYPE (low0)), + GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (low0)))))) + && tree_int_cst_equal (low0, TYPE_MIN_VALUE (TREE_TYPE (low0))) + && operand_equal_p (low0, high0, 0) + && merge_ranges (pin_p, plow, phigh, !in0_p, range_successor (low0), + NULL_TREE, in1_p, low1, high1)) + return true; + /* Now flag two cases, whether the ranges are disjoint or whether the second range is totally subsumed in the first. Note that the tests below are simplified by the ones above. */ The thing is that we canonicalize < INT_MAX (and <= INT_MAX-1) to != INT_MAX-1 and in the range code we'd better try the first form instead.
Author: jakub Date: Mon Jun 4 07:37:56 2018 New Revision: 261139 URL: https://gcc.gnu.org/viewcvs?rev=261139&root=gcc&view=rev Log: PR tree-optimization/69615 * fold-const.c (merge_ranges): If range1 is - [x, x] and x is the maximum or minimum of the type, try to merge it also as if range1 is + [-, x - 1] or + [x + 1, -]. * gcc.dg/pr69615.c: New test. Added: trunk/gcc/testsuite/gcc.dg/pr69615.c Modified: trunk/gcc/ChangeLog trunk/gcc/fold-const.c trunk/gcc/testsuite/ChangeLog
Author: jakub Date: Thu Jun 7 07:41:18 2018 New Revision: 261264 URL: https://gcc.gnu.org/viewcvs?rev=261264&root=gcc&view=rev Log: PR tree-optimization/69615 * tree-ssa-reassoc.c (optimize_range_tests_var_bound): If rhs2 is lhs of a cast from a same precision integral SSA_NAME in a bb dominated by first_bb, retry with rhs2 set to the rhs1 of the cast. Don't emit cast to utype if rhs2 has already a compatible type. * gcc.dg/tree-ssa/pr69615.c: New test. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/pr69615.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-ssa-reassoc.c
Jakub: Can you please update Known to work?
Fixed for GCC9+.