Bug 78847 - pointer arithmetic from C++ range-based for loop not optimized
Summary: pointer arithmetic from C++ range-based for loop not optimized
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 7.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2016-12-17 19:32 UTC by krister.walfridsson
Modified: 2019-06-15 00:40 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Known to work: 8.0
Known to fail:
Last reconfirmed: 2016-12-19 00:00:00


Attachments
forwprop fix (486 bytes, patch)
2016-12-20 10:16 UTC, Richard Biener

Description krister.walfridsson 2016-12-17 19:32:40 UTC
GCC has some problems eliminating overhead from C++ range-based for loops. Consider the program

#include <stddef.h>
#include <cstring>
#include <experimental/string_view>
using string_view = std::experimental::string_view;

class Foo {
    constexpr static size_t Length = 9;
    char ascii_[Length];
public:
    Foo();
    string_view view() const {
        return string_view(ascii_, Length);
    }
};

void testWithLoopValue(const Foo foo, size_t ptr, char *buf_) {
  for (auto c : foo.view())
    buf_[ptr++] = c;
}

compiled as
  g++ -O3 -S -std=c++1z k.cpp


ldist determines that this loop is a memcpy whose length is expressed as _14:

  _18 = (unsigned long) &MEM[(void *)&foo + 9B];
  _4 = &foo.ascii_ + 1;
  _3 = (unsigned long) _4;
  _16 = _18 + 1;
  _14 = _16 - _3;

and dom3 improves this to

  _18 = (unsigned long) &MEM[(void *)&foo + 9B];
  _3 = (unsigned long) &MEM[(void *)&foo + 1B];
  _16 = _18 + 1;
  _14 = _16 - _3;

But this is not simplified further to 9 until combine, by which point it is too late, and a call to memcpy is generated instead of the expected inline expansion.
Comment 1 Marc Glisse 2016-12-19 12:58:15 UTC
Not sure what the best angle is here.

_18 = (unsigned long) &MEM[(void *)&foo + 9B];
_16 = _18 + 1;

looks like it would be better as:

_19 = (unsigned long) &foo;
_16 = _19 + 10;

(I guess we don't want to produce &MEM[(void *)&foo + 10B])

But that might not be true anymore if _18 had several uses.

Reassoc could also play a role, either to cancel the two +1, or to put the two converted addresses together.

I guess extra match.pd patterns might be doable as well, but I am scared of the number of variants that might appear.
Comment 2 Richard Biener 2016-12-20 10:14:33 UTC
(In reply to Marc Glisse from comment #1)
> Not sure what the best angle is here.
> 
> _18 = (unsigned long) &MEM[(void *)&foo + 9B];
> _16 = _18 + 1;
> 
> looks like it would be better as:
> 
> _19 = (unsigned long) &foo;
> _16 = _19 + 10;
> 
> (I guess we don't want to produce &MEM[(void *)&foo + 10B])
> 
> But that might not be true anymore if _18 had several uses.

We do already have

/* Pattern match
     tem1 = (long) ptr1;
     tem2 = (long) ptr2;
     tem3 = tem2 - tem1;
     tem4 = (unsigned long) tem3;
     tem5 = ptr1 + tem4;
   and produce
     tem5 = ptr2;  */
(simplify
  (pointer_plus @0 (convert?@2 (minus@3 (convert @1) (convert @0))))
  /* Conditionally look through a sign-changing conversion.  */
  (if (TYPE_PRECISION (TREE_TYPE (@2)) == TYPE_PRECISION (TREE_TYPE (@3))
       && ((GIMPLE && useless_type_conversion_p (type, TREE_TYPE (@1)))
            || (GENERIC && type == TREE_TYPE (@1))))
   @1))

for example...  But I guess the issue is &MEM[&foo + 9B] vs. &foo.ascii_
(I'm quite sure operand_equal_p doesn't treat those as equal).  At least
propagation passes will canonicalize &foo.ascii_ + 1 to &MEM[&foo + 1B]
(but that transform is not done on its own AFAIR).  forwprop does that
as well, but not as part of its "copy" lattice init.  Fixing that
results in

  <bb 2> [10.00%]:
  _18 = (unsigned long) &MEM[(void *)&foo + 9B];
  _3 = (unsigned long) &MEM[(void *)&foo + 1B];
  _16 = _18 + 1;
  _14 = _16 - _3;
  _13 = buf__9(D) + ptr_6(D);
  __builtin_memcpy (_13, &foo, _14);

after forwprop4, so we still miss some patterns here.  I guess reassoc
makes our job harder in this case...  disabling reassoc2 yields

  <bb 2> [10.00%]:
  _13 = buf__9(D) + ptr_6(D);
  MEM[(char * {ref-all})_13] = MEM[(char * {ref-all})&foo];
  return;

at forwprop4.  So either reassoc also misses the &MEM trick, or we
should really apply that trick generally (ISTR it caused some object-size
warning fallout when done generally).

> Reassoc could also play a role, either to cancel the two +1, or to put the
> two converted addresses together.
> 
> I guess extra match.pd patterns might be doable as well, but I am scared of
> the number of variants that might appear.
Comment 3 Richard Biener 2016-12-20 10:16:15 UTC
Created attachment 40376 [details]
forwprop fix

The forwprop piece, untested (apart from on the testcase).  No time to explore further before the start of the next year.
Comment 4 Richard Biener 2016-12-20 10:56:07 UTC
Note that the attached patch will actually end up propagating that &MEM form everywhere (and run into those object-size fallout issues).
Comment 5 Richard Biener 2016-12-20 12:48:22 UTC
Actually, disabling reassoc is enough to "fix" it; the patch itself is not needed.
It is then fixed by VRP (which also folds stmts and knows the &MEM trick).
Comment 6 Richard Biener 2017-02-08 11:35:23 UTC
And the real inefficiency is niter analysis returning unsimplified

(unsigned long) &MEM[(void *)&foo + 9B] - (unsigned long) ((const char *) &foo.ascii_ + 1)

from here:

#1  0x000000000134f088 in number_of_iterations_ne (loop=0x7ffff5534550, 
    type=<pointer_type 0x7ffff569f3f0>, iv=0x7fffffffd4b0, 
    final=<addr_expr 0x7ffff51b4620>, niter=0x7fffffffd5e0, 
    exit_must_be_taken=true, bnds=0x7fffffffd3d0)
    at /space/rguenther/src/svn/gcc-7-branch/gcc/tree-ssa-loop-niter.c:991
991                            fold_convert (niter_type, iv->base));
(gdb) l
986       else
987         {
988           s = fold_convert (niter_type, iv->step);
989           c = fold_build2 (MINUS_EXPR, niter_type,
990                            fold_convert (niter_type, final),
991                            fold_convert (niter_type, iv->base));

where

final: &MEM[(void *)&foo + 9B]
iv->base: (const char *) &foo.ascii_ + 1

one could either try to enhance the associate: case of fold_binary or use
tree-affine above, like with

Index: gcc/tree-ssa-loop-niter.c
===================================================================
--- gcc/tree-ssa-loop-niter.c   (revision 245276)
+++ gcc/tree-ssa-loop-niter.c   (working copy)
@@ -42,6 +42,7 @@ along with GCC; see the file COPYING3.
 #include "tree-chrec.h"
 #include "tree-scalar-evolution.h"
 #include "params.h"
+#include "tree-affine.h"
 
 
 /* The maximum number of dominator BBs we search for conditions
@@ -986,9 +987,12 @@ number_of_iterations_ne (struct loop *lo
   else
     {
       s = fold_convert (niter_type, iv->step);
-      c = fold_build2 (MINUS_EXPR, niter_type,
-                      fold_convert (niter_type, final),
-                      fold_convert (niter_type, iv->base));
+      aff_tree aff_ivbase, aff_final;
+      tree_to_aff_combination (iv->base, niter_type, &aff_ivbase);
+      tree_to_aff_combination (final, niter_type, &aff_final);
+      aff_combination_scale (&aff_ivbase, -1);
+      aff_combination_add (&aff_ivbase, &aff_final);
+      c = aff_combination_to_tree (&aff_ivbase);
     }
 
   mpz_init (max);


though this is horribly incomplete (it fixes the testcase, though), and if we
go this route, niter analysis should use aff_trees throughout (not re-build trees for the
intermediate stuff).  With that patch we immediately get '9' as the
number of iterations and thus ldist produces

  <bb 2> [10.00%]:
  _18 = buf__9(D) + ptr_6(D);
  __builtin_memcpy (_18, &foo, 9);
Comment 7 Richard Biener 2017-02-08 11:56:52 UTC
A simple fold-const.c "fix" is the following:

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c    (revision 245276)
+++ gcc/fold-const.c    (working copy)
@@ -785,7 +785,9 @@ split_tree (location_t loc, tree in, tre
                  the value is not affected.  For reals, the value might be
                  affected, so we can't.  */
               && ((code == PLUS_EXPR && TREE_CODE (in) == MINUS_EXPR)
-                  || (code == MINUS_EXPR && TREE_CODE (in) == PLUS_EXPR))))
+                  || (code == MINUS_EXPR
+                      && (TREE_CODE (in) == PLUS_EXPR
+                          || TREE_CODE (in) == POINTER_PLUS_EXPR)))))
     {
       tree op0 = TREE_OPERAND (in, 0);
       tree op1 = TREE_OPERAND (in, 1);

I'm giving it bootstrap/regtest; that would likely be quite a beneficial change for all the middle-end machinery still using GENERIC...
Comment 8 Richard Biener 2017-04-21 12:09:52 UTC
Author: rguenth
Date: Fri Apr 21 12:09:20 2017
New Revision: 247061

URL: https://gcc.gnu.org/viewcvs?rev=247061&root=gcc&view=rev
Log:
2017-04-21  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/78847
	* fold-const.c (split_tree): Handle POINTER_PLUS_EXPR.

	* g++.dg/tree-ssa/pr78847.C: New testcase.

Added:
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr78847.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/testsuite/ChangeLog
Comment 9 Richard Biener 2017-04-21 12:10:16 UTC
Fixed for GCC 8.