With patch from here:
http://gcc.gnu.org/ml/gcc-patches/2010-02/msg00668.html
IVopts begin to create IVs for expressions like &a0[i][j][0]. This may cause
regressions in stack usage and code size (also possibly speed). Test case:

/* ---8<--- */
enum { N = 123 };
int a0[N][N][N], a1[N][N][N], a2[N][N][N], a3[N][N][N],
    a4[N][N][N], a5[N][N][N], a6[N][N][N], a7[N][N][N];

int foo()
{
  int i, j, k, s = 0;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        {
          s += a0[i][j][k];
          s += a1[i][j][k];
          s += a2[i][j][k];
          s += a3[i][j][k];
          s += a4[i][j][k];
          s += a5[i][j][k];
          s += a6[i][j][k];
          s += a7[i][j][k];
        }
  return s;
}
/* ---8<--- */

Without the patch, IVopts produce one IV for the j loop and 8 IVs for the k
loop. With the patch, IVopts additionally produce 8 IVs for the j loop (with
a 123*4 increment), 4 of which live on the stack (on x86-64, -O2).

Creation of IVs that live on the stack is likely due to inexact register
pressure estimation in IVopts. However, it would be nice if IVopts could
notice that it is cheaper to take the final value of the inner loop IVs
(e.g. &a0[i][j][k]) instead of incrementing an IV holding &a0[i][j][0] by
123*4. That would decrease register pressure and allow generating perfect
code for the test case.
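To make the suggestion concrete, here is a source-level sketch of the
final-value reuse (an illustration only, not the output of any patch; it
shows just the a0 stream and reuses N and a0 from the test case above, and
the function name is hypothetical):

int foo_final_value()
{
  int i, j, k, s = 0;
  int *p = &a0[0][0][0];
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      {
        int *q = p;           /* k-loop IV starts at &a0[i][j][0]    */
        for (k = 0; k < N; k++)
          s += *q++;          /* q ends at &a0[i][j][N] ...          */
        p = q;                /* ... which is exactly &a0[i][j+1][0] */
      }
  return s;
}

The j-loop pointer p is never incremented on its own; it just picks up the
inner loop's final value, which is the register pressure reduction asked for.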
Created attachment 20001 [details]
Simplify increments in IVopts using final values of inner loop IVs

A quick & dirty attempt to implement register pressure reduction in outer
loops by using final values of inner loop IVs. Currently, given

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      s += a[i][j];

we generate something like

<bb1>
L1:
  s.0 = PHI(0, s.2)
  i.0 = PHI(0, i.1)
  ivtmp.0 = &a[i.0][0]
<bb2>
L2:
  s.1 = PHI(s.0, s.2)
  j.0 = PHI(122, j.1)
  ivtmp.1 = PHI(ivtmp.0, ivtmp.2)
  s.2 = s.1 + MEM(ivtmp.1)
  ivtmp.2 = ivtmp.1 + 4
  j.1 = j.0 - 1
  if (j.1 >= 0) goto L2
<bb3>
  i.1 = i.0 + 1
  if (i.1 <= 122) goto L1

This together with the patch mentioned in the previous comment allows
generating:

  ivtmp.0 = &a[0][0]
<bb1>
L1:
  s.0 = PHI(0, s.2)
  i.0 = PHI(122, i.1)
  ivtmp.1 = PHI(ivtmp.0, ivtmp.4)
<bb2>
L2:
  s.1 = PHI(s.0, s.2)
  j.0 = PHI(122, j.1)
  ivtmp.2 = PHI(ivtmp.1, ivtmp.3)
  s.2 = s.1 + MEM(ivtmp.2)
  ivtmp.3 = ivtmp.2 + 4
  j.1 = j.0 - 1
  if (j.1 >= 0) goto L2
<bb3>
  ivtmp.4 = ivtmp.3  // would be ivtmp.4 = ivtmp.1 + stride
  i.1 = i.0 - 1
  if (i.1 >= 0) goto L1

The improvement is that ivtmp.1 is not live across the inner loop.

The approach is to store final values of IVs in a hashtable, mapping the
SSA_NAME of the initial value in the preheader to an aff_tree with the final
value, and then to try to replace increments of new IVs with uses of IVs
from inner loops (currently I just implemented a brute-force loop over all
IV uses to find a useful entry in that hashtable).

Does this make sense and sound acceptable?
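As a concrete check of the relation the hashtable records (final value =
initial value + step * number of iterations), here is a toy, self-contained
C program; it is deliberately not the GCC API, just the arithmetic being
exploited for the two-level example above:

#include <assert.h>

enum { N = 123 };
static int a[N][N];

int main(void)
{
  int i = 5;
  int *iv = &a[i][0];          /* initial value, as in the preheader   */
  for (int j = 0; j < N; j++)
    iv++;                      /* inner-loop increment, step = 4 bytes */
  /* Final value: initial + 123*4 bytes, i.e. the start of row i+1.    */
  assert(iv == &a[i + 1][0]);
  return 0;
}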
Subject: Re: Teaching SCEV about ADDR_EXPR causes regression

> This together with the patch mentioned in the previous comment allows
> generating:
>
>   ivtmp.0 = &a[0][0]
> <bb1>
> L1:
>   s.0 = PHI(0, s.2)
>   i.0 = PHI(122, i.1)
>   ivtmp.1 = PHI(ivtmp.0, ivtmp.4)
> <bb2>
> L2:
>   s.1 = PHI(s.0, s.2)
>   j.0 = PHI(122, j.1)
>   ivtmp.2 = PHI(ivtmp.1, ivtmp.3)
>   s.2 = s.1 + MEM(ivtmp.2)
>   ivtmp.3 = ivtmp.2 + 4
>   j.1 = j.0 - 1
>   if (j.1 >= 0) goto L2
> <bb3>
>   ivtmp.4 = ivtmp.3  // would be ivtmp.4 = ivtmp.1 + stride
>   i.1 = i.0 - 1
>   if (i.1 >= 0) goto L1
>
> The improvement is that ivtmp.1 is not live across the inner loop.
>
> The approach is to store final values of IVs in a hashtable, mapping the
> SSA_NAME of the initial value in the preheader to an aff_tree with the
> final value, and then to try to replace increments of new IVs with uses
> of IVs from inner loops (currently I just implemented a brute-force loop
> over all IV uses to find a useful entry in that hashtable).
>
> Does this make sense and sound acceptable?

The approach seems OK. However, it is not immediately clear that performing
the replacement is a good idea -- it trades off register pressure for
creating new dependences, i.e., it makes register allocation easier, but
scheduling harder. So, some performance testing is necessary to check this.

Zdenek
Note that for the three-level loop example, GCC chooses one IV for both the
j and k loops, and thus generates pretty clean output on x86_64 at -O2.

For the simple example, GCC can now eliminate the comparison IV using the
address candidate, and generates the code below:

  <bb 2>:
  ivtmp.18_14 = (unsigned long) &a;
  _10 = &a + 60516;
  _28 = (unsigned long) _10;
  goto <bb 7>;

  <bb 3>:

  <bb 4>:
  # s_18 = PHI <s_9(3), s_19(7)>
  # ivtmp.9_12 = PHI <ivtmp.9_1(3), ivtmp.9_4(7)>
  _22 = (void *) ivtmp.9_12;
  _8 = MEM[base: _22, offset: 0B];
  s_9 = _8 + s_18;
  ivtmp.9_1 = ivtmp.9_12 + 4;
  if (ivtmp.9_1 != _27)
    goto <bb 3>;
  else
    goto <bb 5>;

  <bb 5>:
  # s_17 = PHI <s_9(4)>
  ivtmp.18_15 = ivtmp.18_5 + 492;
  if (ivtmp.18_15 != _28)
    goto <bb 6>;
  else
    goto <bb 8>;

  <bb 6>:

  <bb 7>:
  # s_19 = PHI <s_17(6), 0(2)>
  # ivtmp.18_5 = PHI <ivtmp.18_15(6), ivtmp.18_14(2)>
  ivtmp.9_4 = ivtmp.18_5;
  _29 = ivtmp.18_5 + 492;
  _27 = _29;
  goto <bb 4>;

  <bb 8>:
  # s_16 = PHI <s_17(5)>
  return s_16;

With this, the following GIMPLE optimizers are able to CSE
"ivtmp.18_5 + 492". As a result, optimal code is generated, as seen in the
optimized dump:

  <bb 2>:
  ivtmp.18_14 = (unsigned long) &a;
  _28 = (unsigned long) &MEM[(void *)&a + 60516B];
  goto <bb 5>;

  <bb 3>:
  # s_18 = PHI <s_9(3), s_19(5)>
  # ivtmp.9_12 = PHI <ivtmp.9_1(3), ivtmp.18_5(5)>
  _22 = (void *) ivtmp.9_12;
  _8 = MEM[base: _22, offset: 0B];
  s_9 = _8 + s_18;
  ivtmp.9_1 = ivtmp.9_12 + 4;
  if (ivtmp.9_1 != _29)
    goto <bb 3>;
  else
    goto <bb 4>;

  <bb 4>:
  if (_28 != _29)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:
  # s_19 = PHI <s_9(4), 0(2)>
  # ivtmp.18_5 = PHI <_29(4), ivtmp.18_14(2)>
  _29 = ivtmp.18_5 + 492;
  goto <bb 3>;

  <bb 6>:
  return s_9;

This is in effect the transformation you wanted in this PR, but I doubt GCC
could do it if it could not eliminate the inner loop's comparison using
ivtmp.9_1 in the first place. On the other hand, in cases where an IV's
final value can be used, GCC is more likely to be able to eliminate the
comparison IV.
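For reference, the "simple example" these dumps correspond to is presumably
the two-level loop shown in the attachment comment above; the constants
check out (492 = 123*4 is the row stride, and 60516 = 123*123*4 is the size
of the whole array):

enum { N = 123 };
int a[N][N];

int foo()
{
  int i, j, s = 0;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      s += a[i][j];
  return s;
}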
Looks like the ADDR_EXPR issue was fixed by revision 185129.
Looks indeed fixed on trunk:

        xorl    %eax, %eax
.L2:
        leaq    -60516(%rsi), %rdx
        .p2align 4,,10
        .p2align 3
.L6:
        leaq    492(%rdx), %rcx
        .p2align 4,,10
        .p2align 3
.L3:
        addl    a0(%rdx), %eax
        addq    $4, %rdx
        addl    a1-4(%rdx), %eax
        addl    a2-4(%rdx), %eax
        addl    a3-4(%rdx), %eax
        addl    a4-4(%rdx), %eax
        addl    a5-4(%rdx), %eax
        addl    a6-4(%rdx), %eax
        addl    a7-4(%rdx), %eax
        cmpq    %rdx, %rcx
        jne     .L3
        cmpq    %rcx, %rsi
        jne     .L6
        addq    $60516, %rsi
        cmpq    $7503984, %rsi
        jne     .L2