While inspecting loop generation code on s390, I saw ivopts choose an IV candidate which does something peculiar. On x86 the same candidate is never chosen because of different cost estimations, yet the behavior can be evoked on x86 by "forcing" GCC to use the same candidate as on s390. I did this to confirm my suspicions and I'll show the x86 version here. This source

  void v(unsigned long *in, unsigned long *out, unsigned int n)
  {
    int i;
    for (i = 0; i < n; i++)
      {
        out[i] = in[i];
      }
  }

results in the following assembly when ivopts candidate 7 is used:

  v:
          testl   %edx, %edx
          je      .L1
          leal    -1(%rdx), %eax
          leaq    8(,%rax,8), %rcx
          xorl    %eax, %eax
  .L3:
          movq    (%rdi,%rax), %rdx
          movq    %rdx, (%rsi,%rax)
          addq    $8, %rax
          cmpq    %rcx, %rax
          jne     .L3
  .L1:
          rep ret

Should the following be happening?

          leal    -1(%rdx), %eax
          leaq    8(,%rax,8), %rcx

i.e.

  %eax = n - 1
  %rcx = 8 * ((n - 1) + 1)

The pattern can already be observed in ivopts' GIMPLE:

  <bb 4>:
  _15 = n_5(D) + 4294967295;
  _2 = (sizetype) _15;
  _1 = _2 + 1;
  _24 = _1 * 8;

Why do we need the -1 and the subsequent +1 when %eax is zeroed afterwards anyway? Granted, this exact situation won't ever be observed on x86, since another ivopts candidate is chosen, but on s390 it amounts to three instructions. If I see it correctly, the n - 1 comes from estimating the number of loop iterations, while the +1 is then correctly added by cand_value_at() because the loop counter is incremented before the exit test. Perhaps this is intended behavior and there is nothing wrong with it?

Regards
 Robin

  _15 = n_5(D) + 4294967295;
  _2 = (sizetype) _15;
  _1 = _2 + 1;

cannot be simplified to (sizetype) n_5(D) because n_5 - 1 might underflow, and then adding 1 in a wider type doesn't cancel the operation. So yes, this is expected but unfortunate. I don't think we do any costing on the code we insert on the loop entry edge to break ties between equal-cost candidates. Well, if that is what happens. The real issue above is that we sometimes end up using sizetype variables for no good reason.

Ok, that's sensible, but why is the -1 necessary in the first place? n_5 - 1 can only underflow if n_5 == 0, which is checked by

  testl   %edx, %edx

beforehand. This is in a previous basic block, unfortunately, and will therefore not be regarded by ivopts (because, strictly speaking, it has no influence on the induction variable)?

(In reply to rdapp from comment #2)
> Ok, that's sensible but why is the - 1 necessary in the first place?
> 
> n_5 - 1 can only underflow if n_5 == 0 which is checked by
> 
>   testl   %edx, %edx
> 
> before. This is in a previous basic block, unfortunately, and will not be
> regarded by ivopts then (because strictly speaking, it has no influence on
> the induction variable)?

Yes, one way out is to use loop preconditions to improve range information, and then use that information to prove non-overflow. This has already been applied to loop niter analysis and the scev overflow check. I once opened a PR about this in terms of the loop-header bloat issue (will update the PR number because I don't have it now). Another point: I think Richard planned to improve range analysis with respect to control flow? It could be helpful here in terms of compilation time.

(In reply to rdapp from comment #2)
> Ok, that's sensible but why is the - 1 necessary in the first place?
> 
> n_5 - 1 can only underflow if n_5 == 0 which is checked by
> 
>   testl   %edx, %edx
> 
> before. This is in a previous basic block, unfortunately, and will not be
> regarded by ivopts then (because strictly speaking, it has no influence on
> the induction variable)?

Actually, ivopts does take preheader costs into consideration. The cost just needs to be amortized over the number of loop iterations when compiling for speed, and after that the cost gets rather inaccurate. For example, if an instruction cost is generally 4 while the average loop niter is 5, the amortized cost becomes 0 this way.

I still don't quite get why the "n - 1" is needed. Do we need it to possibly have an exit condition like

  if (i != n-1)  or
  if (i <= n-1)?

Am I missing something really obvious here?

(In reply to rdapp from comment #5)
> I still don't quite get why the "n - 1" is needed. Do we need it to possibly
> have an exit condition like
> 
>   if (i != n-1)  or
>   if (i <= n-1)?
> 
> Am I missing something really obvious here?

The dump before ivopts is as below:

  <bb 2>:
  if (n_5(D) != 0)
    goto <bb 4>;
  else
    goto <bb 3>;

  <bb 3>:
  return;

  <bb 4>:

  <bb 5>:
  # i_18 = PHI <0(4), i_14(7)>
  _6 = (long unsigned int) i_18;
  _7 = _6 * 8;
  _9 = out_8(D) + _7;
  _11 = in_10(D) + _7;
  _12 = *_11;
  *_9 = _12;
  i_14 = i_18 + 1;
  i.0_4 = (unsigned int) i_14;
  if (i.0_4 < n_5(D))
    goto <bb 7>;
  else
    goto <bb 6>;

  <bb 6>:
  goto <bb 3>;

  <bb 7>:
  goto <bb 5>;

It comes from loop niter analysis; in may_eliminate_iv we have:

  (gdb) call debug_generic_expr(desc->niter)
  n_5(D) + 4294967295

(In reply to amker from comment #6)
> It comes from loop niter analysis; in may_eliminate_iv we have:
> 
>   (gdb) call debug_generic_expr(desc->niter)
>   n_5(D) + 4294967295

And this is correct? I.e. the number of iterations is n - 1? I'd naively expect

  desc->niter = n_5(D)

(Again, it might be necessary for some reason that escapes me currently.)

(In reply to rdapp from comment #7)
> And this is correct? I.e. the number of iterations is n - 1? I'd naively
> expect
> 
>   desc->niter = n_5(D)
> 
> (Again, it might be necessary for some reason that escapes me currently.)

It confused me too, but niter in tree_niter_desc means the number of times the latch is executed, as commented in tree-ssa-loop.h:

  tree niter;   /* The expression giving the number of iterations of
                   a loop (provided that assumptions == true and
                   may_be_zero == false), more precisely the number
                   of executions of the latch of the loop.  */

Note that VRP figures out that

  _15 = n_18 + 4294967295;
  Found new range for _15: [0, 4294967294]

(so no underflow)

  _2 = (sizetype) _15;
  Found new range for _2: [0, 4294967294]
  _1 = _2 + 1;
  Found new range for _1: [1, 4294967295]

but VRP doesn't have any transform that would take advantage of this (like computing _2 + 1 in unsigned int and only then casting to sizetype).

Created attachment 38144 [details]
Tentative patch for VRP and loop-doloop

Meanwhile I found the time to implement a pattern for VRP which seems to do the job at the tree level. Although it is larger now (and not many cases are covered yet), I agree that it fits better in VRP than in ivopts and might even catch more cases than before.

The fix at the RTL level is nonetheless necessary, since doloop calculates the number of iterations independently of the preceding code and also adds a +1. I cleaned up the fix a little by introducing niterp1_expr (niter plus 1), but there's still room for improvement.

No regressions so far; improvement ideas welcome.

(In reply to rdapp from comment #10)
> Created attachment 38144 [details]
> Tentative patch for VRP and loop-doloop
> 
> Meanwhile I found the time to implement a pattern for VRP which seems to do
> the job at the tree level. Although it is larger now (and not many cases are
> covered yet), I agree that it fits better in VRP than in ivopts and might
> even catch more cases than before.
> 
> The fix at the RTL level is nonetheless necessary, since doloop calculates
> the number of iterations independently of the preceding code and also adds
> a +1. I cleaned up the fix a little by introducing niterp1_expr (niter plus
> 1), but there's still room for improvement.
> 
> No regressions so far; improvement ideas welcome.

It's a bit ad hoc (looking at the VRP part). I think what we'd need to do is:

a) in VRP, have a general helper telling you whether UNOP A or A BINOP B overflows/underflows;

b) in the existing patterns that would handle this and similar situations in match.pd:

  /* (A +- CST) +- CST -> A + CST  */
  (for outer_op (plus minus)
   (for inner_op (plus minus)
    ...

use such a generic predicate to handle cases with (A +- CST) wrapped inside a conversion (it should already be fine to handle this case if A +- CST cannot overflow because of TYPE_OVERFLOW_UNDEFINED);

c) use fold_stmt (or gimple_simplify directly) in vrp_fold_stmt to invoke the folding machinery -- eventually simply call fold_stmt unconditionally in tree-ssa-propagate.c:substitute_and_fold_dom_walker::before_dom_children rather than only when we propagated into a stmt.

Even when doing the "ad-hoc" thing initially, splitting out a) (implementing only a subset of operations) would be a useful thing to do. Note that during VRP itself you can use its own lattice, while outside of VRP you'd have to rely on SSA_NAME_RANGE_INFO.

So, we don't optimize at -O2:

  long foo (int a)
  {
    return (long)(a + 1) - 1;
  }

Note that for (T)(A +- CST1) +- CST2 -> (T)A +- CST3 the combined addition in general needs to be done in the larger type _unless_ the absolute value of CST3 is equal to or smaller than that of CST1. Handling the special case of CST1 == CST2 might be easiest here.

Created attachment 38535 [details]
VRP/match.pd patch

Found some time again and hacked together a fix for match.pd and VRP. The patch exports a VRP function no_range_overflow that is called from match.pd via tree.c. I realize that VRP didn't export any functions so far; if there's a better/more idiomatic way to do it, please tell me. Further improvements or "code smell" hints are very welcome.

Currently this should fix cases like

  long foo(int i)
  {
    return (long) (i + 1) - 2;
  }

Commutativity of a PLUS_EXPR is not yet being exploited, and I'm relying on TYPE_OVERFLOW_UNDEFINED so far. The overflow checking could also be done more elegantly, I suppose.

This diff is extracted from the full patch, which also includes the RTL-level fix; I haven't included that here since it has not changed. I hope the splitting didn't introduce new bugs, but there are no regressions on s390x for the full patch.

On Fri, 20 May 2016, rdapp at linux dot vnet.ibm.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526
> 
> --- Comment #13 from rdapp at linux dot vnet.ibm.com ---
> Created attachment 38535 [details]
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38535&action=edit
> VRP/match.pd patch
> 
> Found some time again and hacked together a fix for match.pd and VRP. The
> patch exports a VRP function no_range_overflow that is called from match.pd
> via tree.c. I realize that VRP didn't export any functions so far and if
> there's a better/idiomatic way to do it, please tell me. Further
> improvements or "code smell" hints very welcome.
> 
> Currently this should fix cases like
> 
>   long foo(int i)
>   {
>     return (long) (i + 1) - 2;
>   }
> 
> Commutativity of a PLUS_EXPR is not yet being exploited and I'm relying on
> TYPE_OVERFLOW_UNDEFINED so far. The overflow checking could also be done
> more elegantly I suppose.
> 
> This diff is extracted from the full patch which also includes the RTL level
> fix which I haven't included here since it has not changed. I hope the
> splitting didn't introduce new bugs but there are no regressions on s390x
> for the full patch.

So I think you should use get_range_info () instead of exporting stuff from VRP. Plus, change the pattern to be more explicit about the inner operation (the comment suggests you care about +- only). Thus:

  (for outer_op (plus minus)
   (for inner_op (plus minus)
    (simplify
     (outer_op (convert (inner_op SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2))

VRP only has ranges for types that would combine with INTEGER_CSTs, not other constants, so using INTEGER_CST is better here. The above has the advantage that you don't need to look up SSA def stmts in your manually written code. For the range of @0 you'd then do

  (with
   {
     wide_int rmin, rmax;
     enum value_range_type rtype = get_range_info (@0, &rmin, &rmax);

and you'd use wide-int operations to check for overflow rather than const_binop and looking for TREE_OVERFLOW. Both wi::add and wi::sub have variants with an overflow flag. As you rely on range info, the transform is indeed GIMPLE-only, but please follow other cases and wrap the whole pattern in #if GIMPLE ... #endif so it does not produce dead code in generic-match.c.

Thanks for the suggestions. The omission of the inner op was actually more or less on purpose, as I intended to capture the

  (convert @inner)

in order to access @inner's value range as a whole rather than re-calculating what VRP already figured out.

Is there a simple method to access @inner when capturing

  (outer_op (convert (inner_op SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2))
             ^---------------------------------^
                          @inner

or, even better, both @inner as well as @0 and @1 at the same time (apart from looking through use stmts)?

In my case VRP determines @0's range as ~[0,0] and @inner's as [0, INT_MAX-1]. VRP probably canonicalized the anti-range to a normal range and performed other simplifications in order to arrive at [0, INT_MAX-1]. If I cannot get @inner's VRP info with the capture above, would there be another way to obtain it?

The TREE_OVERFLOW/const_binop code is copied from the (A +- CST) +- CST -> A + CST pattern, and the result of const_binop is used in the final simplification. The overflow check could of course be done via wi::add/sub, but wouldn't I just replicate what int_const_binop_1 is doing on top of that when I need the result anyway?

(In reply to rdapp from comment #15)
> Is there a simple method to access @inner when capturing
> 
>   (outer_op (convert (inner_op SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2))
>              ^---------------------------------^
>                           @inner
> 
> or, even better, both @inner as well as @0 and @1 at the same time (apart
> from looking through use stmts)?

  (outer_op (convert (inner_op@3 SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2))

Does @3 do what you want here? (I didn't follow the discussion.)

(In reply to Marc Glisse from comment #16)
> (outer_op (convert (inner_op@3 SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2))
> 
> Does @3 do what you want here? (I didn't follow the discussion.)

Looks good, thanks :)

On Tue, 7 Jun 2016, rdapp at linux dot vnet.ibm.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526
> 
> --- Comment #15 from rdapp at linux dot vnet.ibm.com ---
> Thanks for the suggestions. The omission of the inner op was actually more
> or less on purpose, as I intended to capture the
> 
>   (convert @inner)
> 
> in order to access @inner's value range as a whole rather than
> re-calculating what VRP already figured out.
> 
> Is there a simple method to access @inner when capturing
> 
>   (outer_op (convert (inner_op SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2))
>              ^---------------------------------^
>                           @inner
> 
> or, even better, both @inner as well as @0 and @1 at the same time (apart
> from looking through use stmts)?
> 
> In my case VRP determines @0's range as ~[0,0] and @inner's as
> [0, INT_MAX-1]. VRP probably canonicalized the anti-range to a normal range
> and performed other simplifications in order to arrive at [0, INT_MAX-1].
> If I cannot get @inner's VRP info with the capture above, would there be
> another way to obtain it?
> 
> The TREE_OVERFLOW/const_binop code is copied from the (A +- CST) +- CST ->
> A + CST pattern, and the result of const_binop is used in the final
> simplification. The overflow check could of course be done via wi::add/sub,
> but wouldn't I just replicate what int_const_binop_1 is doing on top of
> that when I need the result anyway?

int_const_binop_1 is legacy; it is better to avoid building an INTEGER_CST if you throw away the result due to TREE_OVERFLOW.

(In reply to rguenther@suse.de from comment #18) > The match.pd patch is slowly assuming shape... Two things however I'm still unsure about: 1. An existing rule in match.pd checks for overflow of the combined constant in (a +- CST1) +- CST2 and does not perform the simplification when an overflow occurs. My case looks quite similar, however CST1 and CST2 are/might be unsigned ints/longs. I assume this is because non-overflow cannot be proved by the C frontend and it therefore internally uses unsigned representation? (cast) (a +- CST1) +- CST2 We can prove the non-overflow of (a +- CST1) via VRP but an overflow check of (CST1 +- CST2) would fail since ((UINT_MAX - 1) + 1) does overflow. So, in general, should we care about overflow of the combined operation at all after having established the inner operation does not overflow? If the overflow is well-defined and we overflow, the result should be valid. If the overflow is undefined, we could do anything, in particular optimize this which wouldn't even be too unexpected from a user's perspective. Wouldn't any overflow in the combined constant be caused anyway, even without combining? I think there are only two cases two discern, regardless of overflow: - abs(CST1 +- CST2) < abs(CST1), then we can simplify to (cast)(a +- (CST1 +- CST2)) - else, we can simplify to (cast)(a) +- (CST1 +- CST2) 2. Is there an idiomatic/correct way to check a VR_RANGE for overflow? Does it suffice to check if the range includes +-INF or +-INF(OVF)? I suspect other, naive methods like checking if min < max will fail, since the ranges are canonicalized.

(In reply to rdapp from comment #19)
> 1. An existing rule in match.pd checks for overflow of the combined
> constant in (a +- CST1) +- CST2 and does not perform the simplification
> when an overflow occurs. My case looks quite similar, however CST1 and
> CST2 are/might be unsigned ints/longs. I assume this is because
> non-overflow cannot be proved by the C frontend and it therefore
> internally uses an unsigned representation?
> 
>   (cast) (a +- CST1) +- CST2
> 
> We can prove the non-overflow of (a +- CST1) via VRP, but an overflow
> check of (CST1 +- CST2) would fail, since ((UINT_MAX - 1) + 1) does
> overflow.

IIUC we can simply handle signed and unsigned types differently. Given that a has int/uint type:

For the unsigned case, (unsigned long long)(a + cst1) + cst2 with a of unsigned int type can be simplified into (unsigned long long)a + cst3 iff a + cst1 doesn't overflow/wrap. Since an unsigned type can wrap, simplifying (unsigned long long)cst1 + cst2 into cst3 is always safe.

For the signed case, (long long)(a + cst1) + cst2 with a of signed int type can be simplified into (long long)a + cst3 iff (long long)cst1 + cst2 doesn't overflow. We don't need to prove that (a + cst1) doesn't overflow.

> So, in general, should we care about overflow of the combined operation at
> all after having established that the inner operation does not overflow?
> If the overflow is well-defined and we overflow, the result should be
> valid. If the overflow is undefined, we could do anything, in particular
> optimize this, which wouldn't even be too unexpected from a user's
> perspective. Wouldn't any overflow in the combined constant occur anyway,
> even without combining?
> 
> I think there are only two cases to discern, regardless of overflow:
> 
> - abs(CST1 +- CST2) < abs(CST1): we can simplify to
>   (cast)(a +- (CST1 +- CST2))
> - else: we can simplify to (cast)(a) +- (CST1 +- CST2)
> 
> 2. Is there an idiomatic/correct way to check a VR_RANGE for overflow?
> Does it suffice to check whether the range includes +-INF or +-INF(OVF)?
> I suspect other, naive methods like checking whether min < max will fail,
> since the ranges are canonicalized.

(In reply to amker from comment #20)
> IIUC we can simply handle signed and unsigned types differently. Given that
> a has int/uint type:
> For the unsigned case, (unsigned long long)(a + cst1) + cst2 with a of
> unsigned int type can be simplified into (unsigned long long)a + cst3 iff
> a + cst1 doesn't overflow/wrap. Since an unsigned type can wrap,
> simplifying (unsigned long long)cst1 + cst2 into cst3 is always safe.
> For the signed case, (long long)(a + cst1) + cst2 with a of signed int type
> can be simplified into (long long)a + cst3 iff (long long)cst1 + cst2
> doesn't overflow. We don't need to prove that (a + cst1) doesn't overflow.

Ok, the stipulation seems to be: assume no signed overflow if the operation was already present, but don't provoke a potentially new overflow by combining constants that previously would not have overflowed. Makes sense. Your cases can be improved a little, as Richard also pointed out: when abs(cst3) < abs(cst1), the combined operation can be done in the inner type (the signs match, so the overflow proof still holds).

(In reply to rdapp from comment #21)
> Ok, the stipulation seems to be: assume no signed overflow if the operation
> was already present, but don't provoke a potentially new overflow by
> combining constants that previously would not have overflowed. Makes sense.
> Your cases can be improved a little, as Richard also pointed out: when
> abs(cst3) < abs(cst1), the combined operation can be done in the inner type
> (the signs match, so the overflow proof still holds).

Hmm, combining (long long)cst1 + cst2 is compile-time work, and as long as it doesn't overflow, it doesn't matter in which type the combination is done? Oh, do you mean simplifying the original expression into (long long)(a + cst3)?

(In reply to rdapp from comment #19)
> 2. Is there an idiomatic/correct way to check a VR_RANGE for overflow?
> Does it suffice to check whether the range includes +-INF or +-INF(OVF)?
> I suspect other, naive methods like checking whether min < max will fail,
> since the ranges are canonicalized.

Hmm, not sure if I misunderstood your point. I think there is no *overflow* concept for a VR_RANGE (or a variable); overflow is always a semantic concept of an (arithmetic) operation. Take the below loop as an example:

  unsigned int i;
  for (i = 4; i != 2; i += 2)
    {
      //....
    }

Variable i has VR_RANGE [0, MAX_VALUE - 1], but it does overflow in the loop.

Created attachment 38775 [details]
Updated patch and test, match.pd/VRP

(In reply to amker from comment #23)
> Hmm, not sure if I misunderstood your point. I think there is no *overflow*
> concept for a VR_RANGE (or a variable); overflow is always a semantic
> concept of an (arithmetic) operation. Take the below loop as an example:
> 
>   unsigned int i;
>   for (i = 4; i != 2; i += 2)
>     {
>       //....
>     }
> 
> Variable i has VR_RANGE [0, MAX_VALUE - 1], but it does overflow in the
> loop.

Yes, a variable by itself cannot overflow, only an operation can. VRP, for example, has a function extract_range_from_binary_expr_1 that performs some overflow checking internally. It does so by comparing the minima and maxima of the respective ranges of both operands. I assumed an overflow there would somehow be propagated into the ranges, hence the question how to access the cast expression in match.pd. Guess I was thinking too complicated; getting the ranges of both operands and checking whether their common minimum is smaller than the maximum should suffice for now.

Attached is the current version of the patch; no regressions on s390x and x86-64. The test could be made a bit more specific, I suppose -- currently it counts the number of "gimple_simplified"s.

Created attachment 38875 [details]
Updated patch and test, match.pd/VRP

Some cleanups and fixes; the patch can now handle more cases in the inner type. Also added more test cases.