GCC has some problems eliminating overhead from C++ range-based for loops. Consider the program

#include <stddef.h>
#include <cstring>
#include <experimental/string_view>

using string_view = std::experimental::string_view;

class Foo {
  constexpr static size_t Length = 9;
  char ascii_[Length];
public:
  Foo();
  string_view view() const {
    return string_view(ascii_, Length);
  }
};

void testWithLoopValue(const Foo foo, size_t ptr, char *buf_) {
  for (auto c : foo.view())
    buf_[ptr++] = c;
}

compiled as

  g++ -O3 -S -std=c++1z k.cpp

ldist determines that this is a memcpy with length _14, computed as

  _18 = (unsigned long) &MEM[(void *)&foo + 9B];
  _4 = &foo.ascii_ + 1;
  _3 = (unsigned long) _4;
  _16 = _18 + 1;
  _14 = _16 - _3;

and dom3 improves this to

  _18 = (unsigned long) &MEM[(void *)&foo + 9B];
  _3 = (unsigned long) &MEM[(void *)&foo + 1B];
  _16 = _18 + 1;
  _14 = _16 - _3;

But this is not further simplified to 9 until combine, where it is too late, and a call to memcpy is generated instead of the expected inlined version.
Not sure what the best angle is here.

  _18 = (unsigned long) &MEM[(void *)&foo + 9B];
  _16 = _18 + 1;

looks like it would be better as:

  _19 = (unsigned long) &foo;
  _16 = _19 + 10;

(I guess we don't want to produce &MEM[(void *)&foo + 10B])

But that might not be true anymore if _18 had several uses.

Reassoc could also play a role, either to cancel the two +1, or to put the two converted addresses together.

I guess extra match.pd patterns might be doable as well, but I am scared of the number of variants that might appear.
(In reply to Marc Glisse from comment #1)
> Not sure what the best angle is here.
>
>   _18 = (unsigned long) &MEM[(void *)&foo + 9B];
>   _16 = _18 + 1;
>
> looks like it would be better as:
>
>   _19 = (unsigned long) &foo;
>   _16 = _19 + 10;
>
> (I guess we don't want to produce &MEM[(void *)&foo + 10B])
>
> But that might not be true anymore if _18 had several uses.

We do already have

  /* Pattern match
       tem1 = (long) ptr1;
       tem2 = (long) ptr2;
       tem3 = tem2 - tem1;
       tem4 = (unsigned long) tem3;
       tem5 = ptr1 + tem4;
     and produce
       tem5 = ptr2;  */
  (simplify
    (pointer_plus @0 (convert?@2 (minus@3 (convert @1) (convert @0))))
    /* Conditionally look through a sign-changing conversion.  */
    (if (TYPE_PRECISION (TREE_TYPE (@2)) == TYPE_PRECISION (TREE_TYPE (@3))
         && ((GIMPLE && useless_type_conversion_p (type, TREE_TYPE (@1)))
             || (GENERIC && type == TREE_TYPE (@1))))
      @1))

for example... But I guess the issue is &MEM[&foo + 9B] vs. &foo.ascii_ (I'm quite sure operand_equal_p doesn't treat those as equal).

At least propagation passes will canonicalize &foo.ascii_ + 1 to &MEM[&foo + 9B] (but that transform is not done on its own AFAIR). forwprop does that as well, but not as part of its "copy" lattice init. Fixing that results in

  <bb 2> [10.00%]:
  _18 = (unsigned long) &MEM[(void *)&foo + 9B];
  _3 = (unsigned long) &MEM[(void *)&foo + 1B];
  _16 = _18 + 1;
  _14 = _16 - _3;
  _13 = buf__9(D) + ptr_6(D);
  __builtin_memcpy (_13, &foo, _14);

after forwprop4, so we still miss some patterns here. I guess reassoc makes our job harder in this case... disabling reassoc2 yields

  <bb 2> [10.00%]:
  _13 = buf__9(D) + ptr_6(D);
  MEM[(char * {ref-all})_13] = MEM[(char * {ref-all})&foo];
  return;

at forwprop4. So eventually reassoc misses the &MEM trick as well, or we should really apply that generally (ISTR it messed up some object-size warning stuff if done generally).

> Reassoc could also play a role, either to cancel the two +1, or to put the
> two converted addresses together.
>
> I guess extra match.pd patterns might be doable as well, but I am scared of
> the number of variants that might appear.
Created attachment 40376 [details]
forwprop fix

The forwprop piece, untested (apart from on the testcase). No time to explore further before the start of the next year.
Note that the attached patch will actually end up propagating that &MEM form everywhere (and run into those object-size fallout issues).
Actually, disabling reassoc is enough to "fix" it; the patch itself is not needed. The simplification is then done by VRP, which also folds stmts and knows the &MEM trick.
And the real inefficiency is niter analysis returning the unsimplified

  (unsigned long) &MEM[(void *)&foo + 9B] - (unsigned long) ((const char *) &foo.ascii_ + 1)

from here:

  #1  0x000000000134f088 in number_of_iterations_ne (loop=0x7ffff5534550,
      type=<pointer_type 0x7ffff569f3f0>, iv=0x7fffffffd4b0,
      final=<addr_expr 0x7ffff51b4620>, niter=0x7fffffffd5e0,
      exit_must_be_taken=true, bnds=0x7fffffffd3d0)
      at /space/rguenther/src/svn/gcc-7-branch/gcc/tree-ssa-loop-niter.c:991
  991               fold_convert (niter_type, iv->base));
  (gdb) l
  986       else
  987         {
  988           s = fold_convert (niter_type, iv->step);
  989           c = fold_build2 (MINUS_EXPR, niter_type,
  990                            fold_convert (niter_type, final),
  991                            fold_convert (niter_type, iv->base));

where

  final:    &MEM[(void *)&foo + 9B]
  iv->base: (const char *) &foo.ascii_ + 1

One could either try to enhance the associate: case of fold_binary or use tree-affine above, like with

Index: gcc/tree-ssa-loop-niter.c
===================================================================
--- gcc/tree-ssa-loop-niter.c   (revision 245276)
+++ gcc/tree-ssa-loop-niter.c   (working copy)
@@ -42,6 +42,7 @@ along with GCC; see the file COPYING3.
 #include "tree-chrec.h"
 #include "tree-scalar-evolution.h"
 #include "params.h"
+#include "tree-affine.h"
 
 /* The maximum number of dominator BBs we search for conditions
@@ -986,9 +987,12 @@ number_of_iterations_ne (struct loop *lo
   else
     {
       s = fold_convert (niter_type, iv->step);
-      c = fold_build2 (MINUS_EXPR, niter_type,
-                       fold_convert (niter_type, final),
-                       fold_convert (niter_type, iv->base));
+      aff_tree aff_ivbase, aff_final;
+      tree_to_aff_combination (iv->base, niter_type, &aff_ivbase);
+      tree_to_aff_combination (final, niter_type, &aff_final);
+      aff_combination_scale (&aff_ivbase, -1);
+      aff_combination_add (&aff_ivbase, &aff_final);
+      c = aff_combination_to_tree (&aff_ivbase);
     }
 
   mpz_init (max);

though this is horribly incomplete (it does fix the testcase), and if we go this route, niter analysis should use aff_trees throughout (not re-build trees for the intermediate stuff). With that patch we immediately get '9' as the number of iterations and thus ldist produces

  <bb 2> [10.00%]:
  _18 = buf__9(D) + ptr_6(D);
  __builtin_memcpy (_18, &foo, 9);
A simple fold-const.c "fix" is the following:

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c    (revision 245276)
+++ gcc/fold-const.c    (working copy)
@@ -785,7 +785,9 @@ split_tree (location_t loc, tree in, tre
              the value is not affected.  For reals, the value might be
              affected, so we can't.  */
           && ((code == PLUS_EXPR && TREE_CODE (in) == MINUS_EXPR)
-              || (code == MINUS_EXPR && TREE_CODE (in) == PLUS_EXPR))))
+              || (code == MINUS_EXPR
+                  && (TREE_CODE (in) == PLUS_EXPR
+                      || TREE_CODE (in) == POINTER_PLUS_EXPR)))))
     {
       tree op0 = TREE_OPERAND (in, 0);
       tree op1 = TREE_OPERAND (in, 1);

I'm giving it bootstrap/regtest - that would likely be quite a beneficial change for all the middle-end machinery still using GENERIC...
Author: rguenth
Date: Fri Apr 21 12:09:20 2017
New Revision: 247061

URL: https://gcc.gnu.org/viewcvs?rev=247061&root=gcc&view=rev
Log:
2017-04-21  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/78847
	* fold-const.c (split_tree): Handle POINTER_PLUS_EXPR.

	* g++.dg/tree-ssa/pr78847.C: New testcase.

Added:
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr78847.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/testsuite/ChangeLog
Fixed for GCC 8.