[Bug tree-optimization/91975] worse code for small array copy using pointer arithmetic than array indexing

rguenth at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Fri Oct 4 07:28:00 GMT 2019


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91975

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2019-10-04
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
f1 and g1 are detected as memcpy by loop-distribution.  f0 is unrolled
completely by late unrolling:

  Loop size: 10
  Estimated size after unrolling: 10

while g0 is not:

  Loop size: 8
  Estimated size after unrolling: 20

so the size estimation doesn't quite "work" here.  f0 body before unrolling:

  <bb 3> [local count: 954449108]:
  # i_14 = PHI <0(2), i_10(4)>
  # prephitmp_19 = PHI <0(2), pretmp_18(4)>
  # ivtmp_3 = PHI <8(2), ivtmp_13(4)>
  _1 = (long unsigned int) i_14;
  _2 = _1 * 4;
  _4 = &b + _2;
  *_4 = prephitmp_19;
  i_10 = i_14 + 1;
  ivtmp_13 = ivtmp_3 - 1;
  if (ivtmp_13 != 0)
    goto <bb 4>; [87.50%]
  else
    goto <bb 5>; [12.50%]

  <bb 4> [local count: 835156388]:
  _12 = (long unsigned int) i_10;
  _11 = _12 * 4;
  _16 = &a + _11;
  pretmp_18 = MEM[(const int *)_16];
  goto <bb 3>; [100.00%]

g0 body:

  <bb 3> [local count: 954449108]:
  # s_16 = PHI <&a(2), s_7(4)>
  # d_17 = PHI <&b(2), d_8(4)>
  # i_18 = PHI <0(2), i_10(4)>
  # prephitmp_4 = PHI <0(2), pretmp_5(4)>
  # ivtmp_3 = PHI <8(2), ivtmp_1(4)>
  s_7 = s_16 + 4;
  d_8 = d_17 + 4;
  *d_17 = prephitmp_4;
  i_10 = i_18 + 1;
  ivtmp_1 = ivtmp_3 - 1;
  if (ivtmp_1 != 0)
    goto <bb 4>; [87.50%]
  else
    goto <bb 5>; [12.50%]

  <bb 4> [local count: 835156388]:
  pretmp_5 = MEM[(const int *)s_16 + 4B];
  goto <bb 3>; [100.00%]

for g0 we do not think that the s_7 = s_16 + 4 increments are going to be
optimized "away", but for f0 we think that _4 = &b + _2 will be.  Those are
actually the same computation.
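
(The testcase from comment #0 is not quoted here.  As a generic illustration
of the two loop shapes being compared, not the PR's actual functions, which
evidently have an extra wrinkle given the prephitmp_19 = PHI <0(2), ...>
above, the indexed and pointer variants look roughly like:

  int a[8], b[8];

  /* indexed: the store address is recomputed from i each iteration,
     as in _1 = (long unsigned int) i; _2 = _1 * 4; _4 = &b + _2  */
  void copy_indexed (void)
  {
    for (int i = 0; i < 8; i++)
      b[i] = a[i];
  }

  /* pointer: two pointer IVs stepped by 4 bytes, matching the
     s_7 = s_16 + 4 and d_8 = d_17 + 4 increments in the dump  */
  void copy_pointer (void)
  {
    const int *s = a;
    int *d = b;
    for (int i = 0; i < 8; i++)
      *d++ = *s++;
  }

Both enumerate exactly the same addresses; only the IR shape of the address
computation differs.)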

diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index 5952cad7bba..d38959c3aa2 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -195,9 +195,8 @@ constant_after_peeling (tree op, gimple *stmt, class loop *loop)
   /* Induction variables are constants when defined in loop.  */
   if (loop_containing_stmt (stmt) != loop)
     return false;
-  tree ev = analyze_scalar_evolution (loop, op);
-  if (chrec_contains_undetermined (ev)
-      || chrec_contains_symbols (ev))
+  tree ev = instantiate_parameters (loop, analyze_scalar_evolution (loop, op));
+  if (chrec_contains_undetermined (ev))
     return false;
   return true;
 }
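
(For the pointer IVs in g0 the instantiated evolutions are fully determined;
written out by hand in SCEV chrec notation, not copied from a dump:

  s_16:  {&a, +, 4}_1
  d_17:  {&b, +, 4}_1

The old chrec_contains_symbols check rejected these because of the &a/&b
symbols even though both are loop invariant; after instantiate_parameters
only genuinely undetermined evolutions are refused.)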

The patch fixes this, but we still end up with

size: 8-6, last_iteration: 7-6
  Loop size: 8
  Estimated size after unrolling: 10
Not unrolling loop 1: size would grow.

and still not unrolling, because the not-unrolled size estimate for g0 (8) is
lower than f0's (10): f0's &a + i * 4 address computation is costed as 2
statements per iteration while g0's pointer IV costs just the single + 4
increment, so the identical post-unrolling estimate of 10 counts as growth
for g0 but not for f0.

I'm testing the above anyway.
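
(The "Loop size" / "Estimated size after unrolling" messages above come from
the complete unrolling pass; they should be reproducible with something like

  gcc -O2 -fdump-tree-cunroll-details pr91975.c

where the filename, and whether the original report used -O2 or -O3, are
assumptions.)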

