[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy

rguenther at suse dot de gcc-bugzilla@gcc.gnu.org
Fri Jan 27 10:39:00 GMT 2017


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245

--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 27 Jan 2017, jgreenhalgh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
> 
> --- Comment #4 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #3)
> > Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks like
> > 
> >   int i;
> >   for (i = 0; i < 128; ++i)
> >     {
> >       a[i] = a[i] + 1;
> >       b[i] = d[i];
> >       c[i] = a[i] / d[i];
> >     }
> > 
> > where the testcase expects b[i] = d[i] to be split out as memcpy but
> > the other two partitions to be fused.
> > 
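For reference, a sketch of the shape that testcase expects after
distribution - the copy partition becomes a memcpy while the other two
partitions stay fused in one loop (the declarations and the wrapper are
illustrative only, not the testcase's actual contents):

  #include <string.h>

  #define LEN 128
  extern int a[LEN], b[LEN], c[LEN], d[LEN];

  static void
  ldist_23_expected_shape (void)
  {
    /* The b[i] = d[i] partition, recognized as a memcpy.  */
    memcpy (b, d, sizeof (b));
    /* The two remaining partitions, kept fused.  */
    for (int i = 0; i < LEN; ++i)
      {
        a[i] = a[i] + 1;
        c[i] = a[i] / d[i];
      }
  }
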
> > Generally the cost model lacks computing the number of input/output streams
> > of a partition and a target interface to query it about limits.  Usually
> > store bandwidth is not equal to load bandwidth, and store streams that are
> > not re-used can benefit from the non-temporal stores used by libc.
> > 
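Just to illustrate the missing piece: the idea would be to count the
distinct load/store streams of a partition and ask the target for its
limits before deciding to fuse or split.  A hypothetical sketch (neither
the structure nor the query exists in GCC today):

  /* Hypothetical only: count distinct input (load) and output (store)
     streams of a partition and compare against target-provided limits.  */
  struct partition_streams
  {
    unsigned n_load_streams;   /* distinct base objects read */
    unsigned n_store_streams;  /* distinct base objects written */
  };

  static bool
  partition_within_stream_limits_p (const struct partition_streams *ps,
                                    unsigned max_load_streams,
                                    unsigned max_store_streams)
  {
    return (ps->n_load_streams <= max_load_streams
            && ps->n_store_streams <= max_store_streams);
  }
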
> > In your testcase I wonder whether distributing to
> > 
> >     for (int j = 0; j < x; j++)
> >       {
> >         for (int i = 0; i < y; i++)
> > 	  {
> > 	    c[j][i] = b[j][i] - a[j][i];
> >           }
> >       }
> >     memcpy (a, b, ...);
> > 
> > would be faster in the end (or even doing the memcpy first in this case).
> > 
> > Well, for now let's be more conservative given the cost model really is
> > lacking.
> 
> The testcase is reduced from CALC3 in 171.swim. I've been seeing a 3%
> regression for Cortex-A72 after r242038, and I can fix that with
> -fno-tree-loop-distribute-patterns.
> 
> In that benchmark you've got 3 instances of the above pattern, so you end up
> with 3 memcpy calls after:
> 
>       DO 300 J=1,N
>       DO 300 I=1,M
>       UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
>       VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
>       POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
>       U(I,J) = UNEW(I,J)
>       V(I,J) = VNEW(I,J)
>       P(I,J) = PNEW(I,J)
>   300 CONTINUE
> 
> 3 memcpy calls compared to 3 vector store instructions doesn't seem the right
> tradeoff to me. Sorry if I reduced the testcase too far to make the balance
> clear.

Itanic seems to like it though:

http://gcc.opensuse.org/SPEC/CFP/sb-terbium-head-64/171_swim_big.png

For Haswell I don't see any ups/downs; for AMD Fam15 I see a slowdown
as well around that time.  I guess it depends on whether the CPU is
already throttled by load/compute bandwidth here.  What should be
profitable is to distribute the above into three loops (without any
memcpy), which is what -ftree-loop-distribution should give after the
patch below (see the sketch following the patch).  The patch being

Index: gcc/tree-loop-distribution.c
===================================================================
--- gcc/tree-loop-distribution.c        (revision 244963)
+++ gcc/tree-loop-distribution.c        (working copy)
@@ -1548,8 +1548,7 @@ distribute_loop (struct loop *loop, vec<
       for (int j = i + 1;
           partitions.iterate (j, &partition); ++j)
        {
-         if (!partition_builtin_p (partition)
-             && similar_memory_accesses (rdg, into, partition))
+         if (similar_memory_accesses (rdg, into, partition))
            {
              if (dump_file && (dump_flags & TDF_DETAILS))
                {
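
For contrast, one plausible shape of the three-loop distribution
suggested above.  Each loop keeps one compute statement together with
the copy that feeds it, so no partition degenerates into a bare memcpy
and each loop touches fewer streams than the original nest.  Array
names, bounds and the float element type are again placeholders.

  static void
  calc3_three_loops (int m, int n, float alpha,
                     float u[n][m], float unew[n][m], float uold[n][m],
                     float v[n][m], float vnew[n][m], float vold[n][m],
                     float p[n][m], float pnew[n][m], float pold[n][m])
  {
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        {
          uold[j][i] = u[j][i] + alpha * (unew[j][i] - 2.f * u[j][i] + uold[j][i]);
          u[j][i] = unew[j][i];
        }
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        {
          vold[j][i] = v[j][i] + alpha * (vnew[j][i] - 2.f * v[j][i] + vold[j][i]);
          v[j][i] = vnew[j][i];
        }
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        {
          pold[j][i] = p[j][i] + alpha * (pnew[j][i] - 2.f * p[j][i] + pold[j][i]);
          p[j][i] = pnew[j][i];
        }
  }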

