c-c++-common/pr19807-2.c:

  /* { dg-do link } */
  extern void link_error(void);
  int i;
  int main()
  {
    int a[4];
    if ((char*)&a[1] + 4*i + 4 != (char*)&a[i+2])
      link_error();
    return 0;
  }

Fails with all optimization levels for all compilers.

c-c++-common/pr19807-3.c:

  /* { dg-do link } */
  extern void link_error(void);
  int i;
  int main()
  {
    int a[4];
    if (&a[1] + i + 1 != &a[i+2])
      link_error();
    return 0;
  }

GCC 4.7 passes with -O[01] but fails with -O2+; GCC 4.8 and 4.9 fail with -O0 but pass with -O1+ (thanks to SLSR); GCC 5.0 fails since the fix for PR63148, which is a regression.

We don't seem to systematically use something like the tree-affine.c framework to simplify address comparisons, and the fact that tree-reassoc.c doesn't associate pointer arithmetic in any way doesn't help either.
Note that due to the inconsistency with -fstrict-overflow I don't consider this an important optimization regression. But I'll see if I can improve some generic code to handle these better (with optimization enabled), though only for 5.0.
Ok, so we can (and do) forward addresses like &a[_9] into plain dereferences, so lowering these addresses early is probably not a good idea. We could lower them as a very first thing in GIMPLE reassoc, but to be most effective that would still need to apply transforms like (i + 1) * 4 -> i*4 + 4 (though that generally applies). Otherwise I can't see an easy place to hook in simplification of

  i.0_3 = i;
  _4 = i.0_3 * 4;
  _5 = (sizetype) _4;
  _6 = _5 + 4;
  _7 = &a[1] + _6;
  _9 = i.0_3 + 2;
  _10 = &a[_9];
  if (_7 != _10)

which is what we get for the first testcase after initial scalar cleanups. SLSR helps somewhat in some cases but it's run very late.

We could detect address comparisons with the same base and use the affine combination machinery to simplify them, though. From somewhere; tree-ssa-forwprop.c, probably. Simplify them to _7 - _10 CMP 0 (if the addresses have the same base object, at least that should cancel). We canonicalize (base p+ off1) p+ off2 to base p+ (off1 + off2), which should help here as well. We could also only "fold" away the base object from the chain (not using the affine machinery). match-and-simplify doesn't help here because the replacement involves an expression created by get_inner_reference (well, a manual transform would work here, of course).
GCC 5.1 has been released.
GCC 5.2 is being released, adjusting target milestone to 5.3.
Ok, on trunk we now lower addresses somewhat, but that doesn't help as we are still faced with

  i.0_3 = i;
  _4 = (sizetype) i.0_3;
  _5 = _4 + 1;
  _6 = _5 * 4;
  _7 = &a[1] + _6;
  _8 = i.0_3 + 2;
  _12 = (sizetype) _8;
  _13 = _12 * 4;
  _9 = &a + _13;
  if (_7 != _9)

where FRE/PRE cannot see any equivalences, reassoc doesn't do anything, and SLSR does

  -  _13 = _12 * 4;
  +  _13 = _6 + 4;

but that's too late, or rather DOM doesn't figure out the equivalence of &a + _6 + 4 and &a[1] + _6 (not that I see how it could).
GCC 5.3 is being released, adjusting target milestone.
GCC 5.4 is being released, adjusting target milestone.
So while we don't optimize this on the tree level, the address lowering that was introduced for gcc-6 sets things up so the RTL optimizers can detect the equivalences and remove the tests. So ISTM we can do two things here: keep this as a regression, since it'd be nice for the SSA/GIMPLE optimizers to catch this, or mark it as fixed for gcc-6 and gcc-7 since the RTL optimizers are able to clean this stuff up.
Keep it open, it still is an issue in VN that I'd like to address.
deferred.
So could we reassociate the address arithmetic in match.pd so that we fold away the pointer computation in favor of index adjustment in the ARRAY_REF? Do we have to worry about overflow in address reassociation? Using the GIMPLE from c#2:

  i.0_3 = i;
  _4 = i.0_3 * 4;
  _5 = (sizetype) _4;
  _6 = _5 + 4;
  _7 = &a[1] + _6;
  _9 = i.0_3 + 2;
  _10 = &a[_9];
  if (_7 != _10)

Transform it into:

  i.0_3 = i;
  temp = i.0_3 + 1 + 1;  /* +1 from pointer arith, +1 from array index */
  _4 = i.0_3 * 4;        /* DEAD */
  _5 = (sizetype) _4;    /* DEAD */
  _6 = _5 + 4;           /* DEAD */
  _7 = &a[temp];
  _9 = i.0_3 + 2;
  _10 = &a[_9];
  if (_7 != _10)

That gives us a fighting chance to see that temp is equivalent to _9 and that the ultimate addresses are equal.
It happens that we optimize both cases now (with optimization only). It would still be a regression vs. GCC 4.7 at -O0, but IMHO we shouldn't care about optimizing this at -O0. Thus fixed. I'll add the testcases.
Author: rguenth
Date: Wed Dec 5 14:55:59 2018
New Revision: 266827

URL: https://gcc.gnu.org/viewcvs?rev=266827&root=gcc&view=rev
Log:
2018-12-05  Richard Biener  <rguenther@suse.de>

	PR middle-end/63184
	* c-c++-common/pr19807-2.c: New testcase.
	* c-c++-common/pr19807-3.c: Likewise.

Added:
    trunk/gcc/testsuite/c-c++-common/pr19807-2.c
    trunk/gcc/testsuite/c-c++-common/pr19807-3.c
Modified:
    trunk/gcc/testsuite/ChangeLog
The new testcases FAIL on quite a number of targets (I'm seeing it on sparc-sun-solaris2.11, 32 and 64-bit, but there are also gcc-testresults postings for aarch64-unknown-linux-gnu, armv8l-unknown-linux-gnueabihf, ia64-suse-linux-gnu, moxie-unknown-elf, powerpc64-unknown-linux-gnu, powerpc64le-unknown-linux-gnu):

+FAIL: c-c++-common/pr19807-2.c -std=gnu++14 (test for excess errors)
+FAIL: c-c++-common/pr19807-2.c -std=gnu++17 (test for excess errors)
+FAIL: c-c++-common/pr19807-2.c -std=gnu++98 (test for excess errors)
+FAIL: c-c++-common/pr19807-3.c -std=gnu++14 (test for excess errors)
+FAIL: c-c++-common/pr19807-3.c -std=gnu++17 (test for excess errors)
+FAIL: c-c++-common/pr19807-3.c -std=gnu++98 (test for excess errors)
+FAIL: c-c++-common/pr19807-2.c -Wc++-compat (test for excess errors)

Excess errors:
Undefined symbol link_error, first referenced in file /var/tmp//ccvTdBwc.o

+FAIL: c-c++-common/pr19807-3.c -Wc++-compat (test for excess errors)
Oh, we only optimize it on RTL on x86, possibly with the help of combine and complex addressing modes... I'll adjust the testcase.
Now, to answer Jeff: my original idea was to change how VN handles addresses, to use tree-affine in there. But that takes some time... Given that we lower non-invariant addresses now, another idea would be to recognize

  _4 = _3 + 4;
  _5 = &a[1] + _4;

and reassociate that to combine the + 4 with the &a[1] (which is &a + 4). Now the reassoc pass associates the constant last, which makes this unreliable to detect in a match.pd pattern (the constant may be far away), but the reassoc pass itself could, when associating a chain, look for whether the SSA name we start from is (single-)used in a POINTER_PLUS_EXPR and whether the constant element in the reassoc ops[] array can be combined with a constant offset in the address operand. I'll see if that works out.
Author: rguenth
Date: Thu Dec 6 10:20:39 2018
New Revision: 266846

URL: https://gcc.gnu.org/viewcvs?rev=266846&root=gcc&view=rev
Log:
2018-12-06  Richard Biener  <rguenther@suse.de>

	PR middle-end/63184
	* c-c++-common/pr19807-2.c: Try link only on x86, add
	xfailed optimized dump scanning.
	* c-c++-common/pr19807-3.c: Likewise.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/c-c++-common/pr19807-2.c
    trunk/gcc/testsuite/c-c++-common/pr19807-3.c
That works, but it doesn't perform the folding in the end. Still, combining &a + b + c + d + e + 2 into (&a + 2) + b + c + d + e is better since most targets can compute symbol+offset in a single instruction. We then end up with

  <bb 2> [local count: 1073741824]:
  i.0_1 = i;
  _2 = i.0_1 * 4;
  _3 = (sizetype) _2;
  _5 = &MEM[(void *)&a + 8B] + _3;
  _13 = _3 + 8;
  _7 = &a + _13;
  if (_5 != _7)
    goto <bb 3>; [53.47%]
  else
    goto <bb 4>; [46.53%]

where the issue is that the 2nd opportunity for &a + _3 + 8 only appears after SLSR, which transforms

  _6 = i.0_1 + 2;
  _12 = (sizetype) _6;
  _13 = _12 * 4;
  _7 = &a + _13;

to

  _6 = i.0_1 + 2;
  _12 = (sizetype) _6;
  _13 = _3 + 8;
  _7 = &a + _13;

and we have

  NEXT_PASS (pass_reassoc, false /* insert_powi_p */);
  NEXT_PASS (pass_strength_reduction);

and I'm not sure it's a good idea to swap those two... If I do, we fold things (albeit quite late).

diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index a9f45bfd891..fb1f8014633 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -5988,6 +5988,31 @@ reassociate_bb (basic_block bb)
 	    }
 	}
 
+	  /* If the association chain is used in a single
+	     POINTER_PLUS_EXPR with an invariant first operand
+	     then combine a constant element with the invariant
+	     address.  */
+	  use_operand_p use_p;
+	  gimple *use_stmt;
+	  if (ops.length () > 1
+	      && rhs_code == PLUS_EXPR
+	      && TREE_CODE (ops.last ()->op) == INTEGER_CST
+	      && single_imm_use (lhs, &use_p, &use_stmt)
+	      && is_gimple_assign (use_stmt)
+	      && gimple_assign_rhs_code (use_stmt) == POINTER_PLUS_EXPR
+	      && TREE_CODE (gimple_assign_rhs1 (use_stmt)) == ADDR_EXPR)
+	    {
+	      last = ops.pop ();
+	      tree addr = gimple_assign_rhs1 (use_stmt);
+	      addr = build1 (ADDR_EXPR, TREE_TYPE (addr),
+			     fold_build2 (MEM_REF,
+					  TREE_TYPE (TREE_TYPE (addr)),
+					  addr,
+					  fold_convert (ptr_type_node,
+							last->op)));
+	      gimple_assign_set_rhs1 (use_stmt, addr);
+	    }
+
 	  tree new_lhs = lhs;
 	  /* If the operand vector is now empty, all operands were
 	     consumed by the __builtin_powi optimization.  */
Btw, we already manage to do something like that on RTL.

  typedef __UINTPTR_TYPE__ uintptr_t;
  char a[1024];
  char foo (uintptr_t b, uintptr_t c, uintptr_t d, uintptr_t e, uintptr_t f)
  {
    uintptr_t tem = b + c + d + e + f + 1;
    return *(a + tem);
  }

With the patch this changes

  -  _14 = f_10(D) + 1;
  -  _15 = e_9(D) + _14;
  -  _16 = d_8(D) + _15;
  -  _17 = c_7(D) + _16;
  -  tem_11 = b_6(D) + _17;
  -  _5 = &a + tem_11;
  +  _14 = e_9(D) + f_10(D);
  +  _15 = d_8(D) + _14;
  +  _16 = c_7(D) + _15;
  +  tem_11 = b_6(D) + _16;
  +  _5 = &MEM[(void *)&a + 1B] + tem_11;
     _13 = *_5;
     return _13;

but the assembler is the same:

  addq    %rsi, %rdi
  addq    %rdx, %rdi
  addq    %rcx, %rdi
  movzbl  a+1(%r8,%rdi), %eax
  ret

Disabling TER makes it worse without the patch but not with it:

  leaq    1(%r8,%rcx), %rax
  addq    %rax, %rdx
  addq    %rdx, %rsi
  movzbl  a(%rsi,%rdi), %eax
  ret

so I guess the incremental RTL build canonicalizes the constant outermost. At least the patch looks like a step in the right direction for the goal of removing TER... Still, most of the time the issue will be that ptr + a + b + c + 1 cannot be handled because of the intermediate conversion. That could be mitigated by moving the constant part out of an outer conversion (when permitted).
Created attachment 45179 [details]
patch

This patch fixes the testcases, but it comes at a cost without much visible benefit (trying 258098 expansions in a C/C++ bootstrap with just 4 hits, all in the same place).
Created attachment 45180 [details]
reassoc patch

The reassoc patch, not solving the issue on its own.
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
What is interesting is that for aarch64, GCC is able to figure out the two values are equal, but only after register allocation (note LLVM has a similar issue too). Anyways, at forwprop4 we have:

  _5 = &a[1] + _4;
  _13 = _3 + 8;
  _7 = &a + _13;
  if (_5 != _7)

I think I have a patch (I need to format it correctly though):

(for neeq (ne eq)
 (simplify
  (neeq (pointer_plus ADDR_EXPR@0 @1)
        (pointer_plus ADDR_EXPR@2 @3))
  (with { poly_int64 diff; }
   (if (ptr_difference_const (@0, @2, &diff))
    (neeq (plus { build_int_cst_type (type, diff); }
                (convert (minus @1 @3)))
          { build_zero_cst (type); })))))

I also added this one, as I saw it was not being folded either:

(simplify
 (pointer_diff (pointer_plus ADDR_EXPR@0 @1)
               (pointer_plus ADDR_EXPR@2 @3))
 (with { poly_int64 diff; }
  (if (ptr_difference_const (@0, @2, &diff))
   (plus { build_int_cst_type (type, diff); }
         (convert (minus @1 @3))))))
Created attachment 51411 [details]
patch which I am testing
Patch posted: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578859.html
The master branch has been updated by Andrew Pinski <pinskia@gcc.gnu.org>:

https://gcc.gnu.org/g:564efbf40077c25623cdd6ce2f911c56b5b08f6c

commit r12-3364-g564efbf40077c25623cdd6ce2f911c56b5b08f6c
Author: Andrew Pinski <apinski@marvell.com>
Date:   Mon Sep 6 00:52:18 2021 +0000

    Fix PR tree-optimization/63184: add simplification of (& + A) != (& + B)

    These two testcases have been failing since GCC 5, but things have
    improved such that adding a simplification to match.pd for this case
    is easier than before.

    In the end we have the following IR:
      ....
      _5 = &a[1] + _4;
      _7 = &a + _13;
      if (_5 != _7)

    So we can fold the _5 != _7 into:
      (&a[1] - &a) + _4 != _13

    The subtraction is folded into a constant by ptr_difference_const.
    In this case, the full expression gets folded into a constant and
    we are able to remove the if statement.

    OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions.

    gcc/ChangeLog:

    	PR tree-optimization/63184
    	* match.pd: Add simplification of pointer_diff of two pointer_plus
    	with addr_expr in the first operand of each pointer_plus.
    	Add simplification of ne/eq of two pointer_plus with addr_expr
    	in the first operand of each pointer_plus.

    gcc/testsuite/ChangeLog:

    	PR tree-optimization/63184
    	* c-c++-common/pr19807-2.c: Enable for all targets and remove
    	the xfail.
    	* c-c++-common/pr19807-3.c: Likewise.
Fixed on the trunk and I don't think this should be backported as it has been a missed optimization for 7 years or more.