[Bug tree-optimization/79245] [7 Regression] Inefficient loop distribution to memcpy
rguenther at suse dot de
gcc-bugzilla@gcc.gnu.org
Fri Jan 27 10:39:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 27 Jan 2017, jgreenhalgh at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
>
> --- Comment #4 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #3)
> > Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks like
> >
> >   int i;
> >   for (i = 0; i < 128; ++i)
> >     {
> >       a[i] = a[i] + 1;
> >       b[i] = d[i];
> >       c[i] = a[i] / d[i];
> >     }
> >
> > where the testcase expects b[i] = d[i] to be split out as memcpy but
> > the other two partitions to be fused.
> >
> > Generally the cost model lacks computing the number of input/output streams
> > of a partition and a target interface to query it about limits. Usually
> > store bandwidth is not equal to load bandwidth and not re-used store streams
> > can benefit from non-temporal stores being used by libc.
> >
> > In your testcase I wonder whether distributing to
> >
> >   for (int j = 0; j < x; j++)
> >     {
> >       for (int i = 0; i < y; i++)
> >         {
> >           c[j][i] = b[j][i] - a[j][i];
> >         }
> >     }
> >   memcpy (a, b, ...);
> >
> > would be faster in the end (or even doing the memcpy first in this case).
> >
> > Well, for now let's be more conservative given the cost model really is
> > lacking.
>
> The testcase is reduced from CALC3 in 171.swim. I've been seeing a 3%
> regression for Cortex-A72 after r242038, and I can fix that with
> -fno-tree-loop-distribute-patterns.
>
> In that benchmark you've got 3 instances of the above pattern, so you end up
> with 3 memcpy calls after:
>
>       DO 300 J=1,N
>         DO 300 I=1,M
>           UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
>           VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
>           POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
>           U(I,J) = UNEW(I,J)
>           V(I,J) = VNEW(I,J)
>           P(I,J) = PNEW(I,J)
>   300 CONTINUE
>
> 3 memcpy calls compared to 3 vector store instructions doesn't seem the right
> tradeoff to me. Sorry if I reduced the testcase too far to make the balance
> clear.
Itanic seems to like it though:
http://gcc.opensuse.org/SPEC/CFP/sb-terbium-head-64/171_swim_big.png
For Haswell I don't see any ups/downs; for AMD Fam15 I see a slowdown
as well around that time.  I guess it depends on whether the CPU is
already throttled by load/compute bandwidth here.  What should be
profitable is to distribute the above into three loops (w/o memcpy),
so after the patch, with -ftree-loop-distribution.  Patch being:
Index: gcc/tree-loop-distribution.c
===================================================================
--- gcc/tree-loop-distribution.c	(revision 244963)
+++ gcc/tree-loop-distribution.c	(working copy)
@@ -1548,8 +1548,7 @@ distribute_loop (struct loop *loop, vec<
       for (int j = i + 1;
 	   partitions.iterate (j, &partition); ++j)
 	{
-	  if (!partition_builtin_p (partition)
-	      && similar_memory_accesses (rdg, into, partition))
+	  if (similar_memory_accesses (rdg, into, partition))
 	    {
 	      if (dump_file && (dump_flags & TDF_DETAILS))
 		{