| Summary: | pointer arithmetic simplification | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Marc Glisse <glisse> |
| Component: | middle-end | Assignee: | Richard Biener <rguenth> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | paulo |
| Priority: | P3 | Keywords: | missed-optimization |
| Version: | 4.9.0 | | |
| Target Milestone: | --- | | |
| Host: | | Target: | |
| Build: | | Known to work: | 4.9.0 |
| Known to fail: | | Last reconfirmed: | 2013-10-21 00:00:00 |
Description
Marc Glisse 2013-10-15 21:47:49 UTC

Interestingly, clang-3.4 and icc-13.1.3 don't fare any better than gcc here. I'll have a look.

This case is a signed exact division followed by an unsigned multiply. We do that to avoid introducing undefined signed overflow. Signed exact division and unsigned multiply still cancel out, though; handling of these is in extract_muldiv (ugh). I'm not going to enhance that but pattern-match this case. Testing:

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 203885)
+++ gcc/fold-const.c	(working copy)
@@ -11002,6 +11002,13 @@ fold_binary_loc (location_t loc,
 			fold_build2_loc (loc, MULT_EXPR, type,
 					 build_int_cst (type, 2), arg1));
 
+	  /* ((T) (X /[ex] C)) * C cancels out if the conversion is
+	     sign-changing only.  */
+	  if (TREE_CODE (arg1) == INTEGER_CST
+	      && TREE_CODE (arg0) == EXACT_DIV_EXPR
+	      && operand_equal_p (arg1, TREE_OPERAND (arg0, 1), 0))
+	    return fold_convert_loc (loc, type, TREE_OPERAND (arg0, 0));
+
 	  strict_overflow_p = false;
 	  if (TREE_CODE (arg1) == INTEGER_CST
 	      && 0 != (tem = extract_muldiv (op0, arg1, code, NULL_TREE,

Note that this simplification worked in 4.1.2 but stopped working in 4.2.0. Likely a wrong-code fix for extract_muldiv disabled this case (the way that function works makes special-casing a pattern like the above impossible).

The patch will fix the regression part; what is left to optimize is the pointer offset association, which would best be done in GIMPLE reassoc, which doesn't associate POINTER_PLUS_EXPR yet. It could do that by "transparently" handling

  ptr p+ (offset + ...)

as

  (sizetype) ptr + (offset + ...)

and, in the final chain, if ptr prevailed, associating it first and re-instantiating the POINTER_PLUS_EXPR; otherwise keeping the sizetype result and converting it to the pointer type (sizetype arithmetic has wrapping overflow, so you cannot simply take any pointer in the addition chain as the base for a POINTER_PLUS_EXPR).
Author: rguenth
Date: Mon Oct 21 11:34:04 2013
New Revision: 203890
URL: http://gcc.gnu.org/viewcvs?rev=203890&root=gcc&view=rev

Log:
2013-10-21  Richard Biener  <rguenther@suse.de>

	PR middle-end/58742
	* fold-const.c (fold_binary_loc): Fold ((T) (X /[ex] C)) * C
	to (T) X for sign-changing conversions (or no conversion).

	* c-c++-common/fold-divmul-1.c: New testcase.

Added:
    trunk/gcc/testsuite/c-c++-common/fold-divmul-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/testsuite/ChangeLog

Thanks. For reference, as noted by Jakub and confirmed by Richard, we will also need at some point a gimple version for:

int *
fx (int *b, int *e)
{
  ptrdiff_t p = e - b;
  return b + p;
}

(or s/ptrdiff_t/long long/, etc.)

Regression fixed on trunk so far.

(In reply to Richard Biener from comment #4)
> The patch will fix the regression part, to be left to optimize is the
> pointer offset association bits which should best be done in GIMPLE
> reassoc which doesn't yet associate POINTER_PLUS_EXPR yet. It could
> do that with "transparently" handling
>
>   ptr p+ (offset + ...)
>
> as
>
>   (sizetype) ptr + (offset + ...)
>
> and in the final chain, if ptr prevailed, associate it first and
> re-instantiate the POINTER_PLUS_EXPR, otherwise keep the sizetype
> result and convert it to the pointer type (sizetype arith has
> wrapping overflow so you cannot simply take any pointer in the
> addition chain as the base for a POINTER_PLUS_EXPR).

Not that easy, as 'ptr' and '(sizetype) ptr' are not trivially computed equal by reassoc (it relies on CSE and has no value-numbering of its own).

Author: rguenth
Date: Mon Nov 18 15:13:14 2013
New Revision: 204965
URL: http://gcc.gnu.org/viewcvs?rev=204965&root=gcc&view=rev

Log:
2013-11-18  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2013-10-21  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/58794
	* fold-const.c (operand_equal_p): Compare FIELD_DECL operand of
	COMPONENT_REFs with OEP_CONSTANT_ADDRESS_OF left in place.
	* c-c++-common/torture/pr58794-1.c: New testcase.
	* c-c++-common/torture/pr58794-2.c: Likewise.

	2013-10-21  Richard Biener  <rguenther@suse.de>

	PR middle-end/58742
	* fold-const.c (fold_binary_loc): Fold ((T) (X /[ex] C)) * C
	to (T) X for sign-changing conversions (or no conversion).

	* c-c++-common/fold-divmul-1.c: New testcase.

	2013-11-06  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/58653
	* tree-predcom.c (ref_at_iteration): Rewrite to generate
	a MEM_REF.
	(prepare_initializers_chain): Adjust.

	* gcc.dg/tree-ssa/predcom-6.c: New testcase.
	* gcc.dg/tree-ssa/predcom-7.c: Likewise.

	PR tree-optimization/59047
	* tree-predcom.c (ref_at_iteration): Handle bitfield accesses
	properly.

	* gcc.dg/torture/pr59047.c: New testcase.

	2013-10-15  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/58143
	* tree-ssa-loop-im.c (arith_code_with_undefined_signed_overflow):
	New function.
	(rewrite_to_defined_overflow): Likewise.
	(move_computations_dom_walker::before_dom): Rewrite stmts with
	undefined signed overflow that are not always executed into
	unsigned arithmetic.

	* gcc.dg/torture/pr58143-1.c: New testcase.
	* gcc.dg/torture/pr58143-2.c: Likewise.
	* gcc.dg/torture/pr58143-3.c: Likewise.
Added:
    branches/gcc-4_8-branch/gcc/testsuite/c-c++-common/fold-divmul-1.c
    branches/gcc-4_8-branch/gcc/testsuite/c-c++-common/torture/pr58794-1.c
    branches/gcc-4_8-branch/gcc/testsuite/c-c++-common/torture/pr58794-2.c
    branches/gcc-4_8-branch/gcc/testsuite/gcc.dg/torture/pr58143-1.c
    branches/gcc-4_8-branch/gcc/testsuite/gcc.dg/torture/pr58143-2.c
    branches/gcc-4_8-branch/gcc/testsuite/gcc.dg/torture/pr58143-3.c
    branches/gcc-4_8-branch/gcc/testsuite/gcc.dg/torture/pr59047.c
    branches/gcc-4_8-branch/gcc/testsuite/gcc.dg/tree-ssa/predcom-6.c
    branches/gcc-4_8-branch/gcc/testsuite/gcc.dg/tree-ssa/predcom-7.c
Modified:
    branches/gcc-4_8-branch/gcc/ChangeLog
    branches/gcc-4_8-branch/gcc/fold-const.c
    branches/gcc-4_8-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_8-branch/gcc/tree-predcom.c
    branches/gcc-4_8-branch/gcc/tree-ssa-loop-im.c

Fixed for 4.8.3, not backporting to 4.7.

Er, we are still missing a gimple version of that, aren't we? Or would you prefer a different PR?

Ah yes, sorry. I consider the regression fixed; we can track the original optimization separately, as well as fixing the regression in a more "gimple" way.

*** Bug 59209 has been marked as a duplicate of this bug. ***

So we don't forget it, PR 59209 asks to simplify:

  (ptr+size)-ptr to size

not just:

  ptr1+(ptr2-ptr1) to ptr2

Another example: http://stackoverflow.com/q/21253690/1918193

where we have (-DVERSION=2):

  _128 = img$_M_impl$_M_start_130 + 4000000;
  pretmp_146 = (long intD.12) _128;
  pretmp_145 = (long intD.12) img$_M_impl$_M_start_130;
  pretmp_76 = pretmp_146 - pretmp_145;
  pretmp_121 = pretmp_76 /[ex] 4;
  pretmp_120 = (size_typeD.24047) pretmp_121;

We miss that pretmp_120 is a constant; VRP thus fails to eliminate the range checks, vectorization doesn't happen, and the code is more than 4 times slower than it should be.

If the desired reassoc version is hard, a simple forwprop pattern matching would already go a long way to alleviate this issue.

On Tue, 21 Jan 2014, glisse at gcc dot gnu.org wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58742
>
> --- Comment #15 from Marc Glisse <glisse at gcc dot gnu.org> ---
> Another example: http://stackoverflow.com/q/21253690/1918193
>
> where we have (-DVERSION=2):
> _128 = img$_M_impl$_M_start_130 + 4000000;
> pretmp_146 = (long intD.12) _128;
> pretmp_145 = (long intD.12) img$_M_impl$_M_start_130;
> pretmp_76 = pretmp_146 - pretmp_145;
> pretmp_121 = pretmp_76 /[ex] 4;
> pretmp_120 = (size_typeD.24047) pretmp_121;
>
> We miss that pretmp_120 is a constant, VRP thus fails to eliminate the range
> checks, vectorization doesn't happen, and the code is more than 4 times slower
> than it should be.
>
> If the desired reassoc version is hard, a simple forwprop pattern matching
> would already go a long way to alleviate this issue.
I agree. I have a patch.
Author: rguenth
Date: Tue Jan 28 14:53:52 2014
New Revision: 207194
URL: http://gcc.gnu.org/viewcvs?rev=207194&root=gcc&view=rev

Log:
2014-01-28  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/58742
	* tree-ssa-forwprop.c (associate_plusminus): Handle pointer
	subtraction of the form (T)(P + A) - (T)P.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-forwprop.c

I'll tackle the EXACT_DIV_EXPR / MULT_EXPR cancellation next. Seems easy enough. The original identity will remain. That is,

;; Function fx (fx, funcdef_no=0, decl_uid=1743, symbol_order=0)

fx (int * b, int * e)
{
  long int e.0_2;
  long int b.1_4;
  long int _5;
  long unsigned int _6;
  int * _7;

  <bb 2>:
  e.0_2 = (long int) e_1(D);
  b.1_4 = (long int) b_3(D);
  _5 = e.0_2 - b.1_4;
  _6 = (long unsigned int) _5;
  _7 = b_3(D) + _6;
  return _7;
}

that is to be pattern-matched separately, in associate_pointerplus. Of course that's only a piece of associating, as b+5+(e-b) will then still not be optimized.

Author: rguenth
Date: Wed Jan 29 11:08:59 2014
New Revision: 207232
URL: http://gcc.gnu.org/viewcvs?rev=207232&root=gcc&view=rev

Log:
2014-01-29  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/58742
	* tree-ssa-forwprop.c (associate_plusminus): Return true if
	we changed sth, defer EH cleanup to ...
	(ssa_forward_propagate_and_combine): ... here.  Call simplify_mult.
	(simplify_mult): New function.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-forwprop.c

Ok, I have a patch that does the remaining (including the duplicate bug).
But for the (p + sz) - p case it can only optimize the sizeof (*p) == 1 case without range information. For example, for

__SIZE_TYPE__
fx (int *a, __SIZE_TYPE__ sz)
{
  int *b = a + sz;
  return b - a;
}

we get

fx (int * a, long unsigned int sz)
{
  int * b;
  long unsigned int _2;
  long int _7;
  long int _8;
  long unsigned int _9;

  <bb 2>:
  _2 = sz_1(D) * 4;
  _7 = (long int) _2;
  _8 = _7 /[ex] 4;
  _9 = (long unsigned int) _8;
  return _9;
}

as the result, which is basically (sz * 4) /[ex] 4, as we don't know whether the multiplication by 4 overflows (well, the C language may say it doesn't, but the IL does not reflect this). If we make 'sz' a signed int then VRP could later optimize this (but it doesn't currently), or forwprop could use range information. On RTL we manage to optimize the signed int input case.

Author: rguenth
Date: Wed Jan 29 14:45:44 2014
New Revision: 207239
URL: http://gcc.gnu.org/viewcvs?rev=207239&root=gcc&view=rev

Log:
2014-01-29  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/58742
	* tree-ssa-forwprop.c (associate_pointerplus): Rename to
	associate_pointerplus_align.
	(associate_pointerplus_diff): New function.
	(associate_pointerplus): Likewise.  Call associate_pointerplus_align
	and associate_pointerplus_diff.

	* gcc.dg/pr58742-1.c: New testcase.
	* gcc.dg/pr58742-2.c: Likewise.
	* gcc.dg/pr58742-3.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/pr58742-1.c
    trunk/gcc/testsuite/gcc.dg/pr58742-2.c
    trunk/gcc/testsuite/gcc.dg/pr58742-3.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-forwprop.c

Fixed for 4.9.

Thank you. Sadly, for the example in comment #15, this is not quite enough; I need to add forwprop+ccp right before the VRP1 pass (and then the range check is eliminated, the vectorizer works, and perfs are the same as without range checking). Indeed, we learn that size is (start+4000000)-start quite late (need to inline, look through mem_refs, etc. -> FRE2), so the previous forwprop pass is too early.
VRP2 is too late if we hope to vectorize, and in any case it fails to remove the range checks, because it is confused by the new shape of the loops (possibly related to PR 25643, or not). The VRP2 failure looks funny with these consecutive lines:

  # ivtmp.80_92 = PHI <ivtmp.80_53(9), ivtmp.80_83(8)>
  # RANGE [10101, 989898] NONZERO 0x000000000000fffff
  _23 = ivtmp.80_92;
  if (ivtmp.80_92 > 999999)

Really, we don't know that the comparison returns false?

For the overflow in sizeof(*p) * sz, would it make sense to have the front-end generate, when it sees p+sz:

  if((long)sz>LONG_MAX/sizeof(*p)) __builtin_unreachable()

(or abort or a sanitizer call depending on options), and a similar check for large negative values? It feels very heavy for such a common operation, but if the FE is the only one with the information, I am not sure how else to pass it down to gimple.

I might file a low-priority enhancement PR about extending reassoc to pointers; that would still cover some cases (and it wouldn't make the forwprop transformation useless, because of single-use restrictions).

(In reply to Marc Glisse from comment #24)
> Thank you.
> Sadly, for the example in comment #15, this is not quite enough, I need to
> add forwprop+ccp right before the VRP1 pass (and then the range check is
> eliminated, the vectorizer works and perfs are the same as without range
> checking).

VERSION=0 and VERSION=1 are the same speed for me now, VERSION=2 is a lot slower still.

> Indeed, we learn that size is (start+4000000)-start quite late
> (need to inline, look through mem_refs, etc -> FRE2) so the previous
> forwprop pass is too early.

Yeah, the issue is that while FRE does some expression simplification, it doesn't wire into a common gimple pattern matcher (something I'd like to fix for 4.10). That is, the simplification forwprop performs should be done by FRE already. See tree-ssa-sccvn.c:simplify_binary_expression.
> VRP2 is too late if we hope to vectorize, and in > any case it fails to remove the range checks, because it is confused by the > new shape of the loops (possibly related to PR 25643, or not). The VRP2 > failure looks funny with these consecutive lines: > > # ivtmp.80_92 = PHI <ivtmp.80_53(9), ivtmp.80_83(8)> > # RANGE [10101, 989898] NONZERO 0x000000000000fffff > _23 = ivtmp.80_92; > if (ivtmp.80_92 > 999999) > > Really, we don't know that the comparison returns false? Well, _23 is simply dead at this point and VRP computed _92 to be varying. > > For the overflow in sizeof(*p) * sz, would it make sense to have the > front-end generate, when it sees p+sz: if((long)sz>LONG_MAX/sizeof(*p)) > __builtin_unreachable() (or abort or a sanitizer call depending on options), > and a similar check for large negative values? It feels very heavy for such > a common operation, but if the FE is the only one with the information, I am > not sure how else to pass it down to gimple. From the no-undefined-overflow branch I'd take the idea of adding op variants with known no overflow. That is, add MULTNV_EXPR, PLUSNV_EXPR, MINUSNV_EXPR that can be used on unsigned types, too (you'd of course have to define what overflow means there - if a - b does not overflow then a + (-b) will - negate of x will always overflow if x is not zero). The idea of no-undefined-overflow branch was to make all ops wrapping by default (even signed type arithmetic) and make frontends explicitely use non-overflowing ops when language semantics says they are not overflowing. > I might file a low priority enhancement PR about extending reassoc to > pointers, that would still cover some cases (and it wouldn't make the > forwprop transformation useless because of single-use restrictions). (In reply to Richard Biener from comment #25) > VERSION=0 and VERSION=1 are the same speed for me now, They aren't quite for me (2.5 vs 2.7) but > VERSION=2 is a lot slower still. that's the part I am concerned with here. 
> Yeah, the issue is that while FRE does some expression simplification it
> doesn't wire into a common gimple pattern matcher (something I'd like to
> fix for 4.10). That is, the simplification forwprop performs should be
> done by FRE already. See tree-ssa-sccvn.c:simplify_binary_expression.

Ah, ok, that makes sense. I assume it would also have basic CCP-like functionality (forwprop can create constants but doesn't always fold the resulting constant operations). Looking forward to that!

> > VRP2 is too late if we hope to vectorize, and in
> > any case it fails to remove the range checks, because it is confused by the
> > new shape of the loops (possibly related to PR 25643, or not). The VRP2
> > failure looks funny with these consecutive lines:
> >
> > # ivtmp.80_92 = PHI <ivtmp.80_53(9), ivtmp.80_83(8)>
> > # RANGE [10101, 989898] NONZERO 0x000000000000fffff
> > _23 = ivtmp.80_92;
> > if (ivtmp.80_92 > 999999)
> >
> > Really, we don't know that the comparison returns false?
>
> Well, _23 is simply dead at this point and VRP computed _92 to be
> varying.

Yes. I just meant that, as a hack, for 2 SSA_NAMEs defined in the same BB where one is a copy of the other, we could merge their range info (in both directions), and it might in this special case work around the fact that VRP2 is confused by the loop. But that would be too fragile and hackish.

> From the no-undefined-overflow branch I'd take the idea of adding op
> variants with known no overflow. That is, add MULTNV_EXPR, PLUSNV_EXPR,
> MINUSNV_EXPR that can be used on unsigned types, too (you'd of course
> have to define what overflow means there - if a - b does not overflow
> then a + (-b) will - negate of x will always overflow if x is not zero).

Ah, yes, I'd forgotten about those. I always wondered if it is better to have many different tree codes or a single one with "options".
Like MULT_EXPR with a parameter saying what happens on overflow: undefined, saturate, wrap, other (it seems hard to handle "jump to this location" in the same scheme). Or COMPARISON_EXPR with several bools telling what the return value is if a<b, a==b, a>b, or one is NaN, and whether it can raise exceptions (we don't have the corresponding 32 tree codes). Or the 5 DIV_EXPR variants (counting only integers). I guess it doesn't really matter.

On Mon, 3 Feb 2014, glisse at gcc dot gnu.org wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58742
>
> --- Comment #26 from Marc Glisse <glisse at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #25)
> > VERSION=0 and VERSION=1 are the same speed for me now,
>
> They aren't quite for me (2.5 vs 2.7) but
>
> > VERSION=2 is a lot slower still.
>
> that's the part I am concerned with here.
>
> > Yeah, the issue is that while FRE does some expression simplification it
> > doesn't wire into a common gimple pattern matcher (something I'd like to
> > fix for 4.10). That is, the simplification forwprop performs should be
> > done by FRE already. See tree-ssa-sccvn.c:simplify_binary_expression.
>
> Ah, ok, that makes sense. I assume it would also have basic CCP-like
> functionality (forwprop can create constants but doesn't always fold the
> resulting constant operations). Looking forward to that!
>
> > > VRP2 is too late if we hope to vectorize, and in
> > > any case it fails to remove the range checks, because it is confused by the
> > > new shape of the loops (possibly related to PR 25643, or not). The VRP2
> > > failure looks funny with these consecutive lines:
> > >
> > > # ivtmp.80_92 = PHI <ivtmp.80_53(9), ivtmp.80_83(8)>
> > > # RANGE [10101, 989898] NONZERO 0x000000000000fffff
> > > _23 = ivtmp.80_92;
> > > if (ivtmp.80_92 > 999999)
> > >
> > > Really, we don't know that the comparison returns false?
> >
> > Well, _23 is simply dead at this point and VRP computed _92 to be
> > varying.
>
> Yes. I just meant that, as a hack, for 2 SSA_NAME defined in the same BB where
> one is a copy of the other, we could merge their range info (in both
> directions) and it might in this special case work around the fact that VRP2 is
> confused by the loop. But that would be too fragile and hackish.
>
> > From the no-undefined-overflow branch I'd take the idea of adding op
> > variants with known no overflow. That is, add MULTNV_EXPR, PLUSNV_EXPR,
> > MINUSNV_EXPR that can be used on unsigned types, too (you'd of course
> > have to define what overflow means there - if a - b does not overflow
> > then a + (-b) will - negate of x will always overflow if x is not zero).
>
> Ah, yes, I'd forgotten about those. I always wondered if it is better to have
> many different tree codes or a single one with "options". Like MULT_EXPR with a
> parameter saying what happens on overflow: undefined, saturate, wrap, other
> (seems hard to handle "jump to this location" in the same). Or COMPARISON_EXPR
> with several bools telling what the return value is if a<b, a==b, a>b, one is
> NaN, and if it can raise exceptions (we don't have the corresponding 32 tree
> codes). Or the 5 DIV_EXPR variants (counting only integers). I guess it doesn't
> really matter.
It matters for convenience with existing code like fold-const.c
which takes decomposed expression trees. You'd need to add a bunch
of flags there and pass them through appropriately. Much easier
to encode it in enum tree_code directly.
Richard.