CPU2000's swim and mgrid show a ~10% slowdown after the merge of the alias improvement branch.

GCC was configured with the following:

/gcc/HEAD/configure --target=powerpc64-linux --host=powerpc64-linux --build=powerpc64-linux --with-cpu=default32 --enable-threads=posix --enable-shared --enable-__cxa_atexit --with-gmp=/gmp --with-mpfr=mpfr --with-long-double-128 --enable-decimal-float --enable-secure-plt --disable-bootstrap --disable-alsa --prefix=/install/gcc/HEAD build_alias=powerpc64-linux host_alias=powerpc64-linux target_alias=powerpc64-linux --enable-languages=c,c++,fortran --no-create --no-recursion

Compile flags used:

-m[32|64] -O3 -mcpu=power[4|5|6] -ffast-math -ftree-loop-linear -funroll-loops -fpeel-loops

Will provide more details soon.
Good asm code for a hot loop in swim's "calc1" function:

10001e10: lfd f12,-10672(r11)
10001e14: lfd f9,-10672(r9)
10001e18: addi r21,r21,16
10001e1c: lfd f7,-10680(r11)
10001e20: lfd f6,-10672(r6)
10001e24: fmul f3,f9,f9
10001e28: cmpw r21,r0
10001e2c: fadd f4,f7,f12
10001e30: lfd f22,-10680(r9)
10001e34: lfd f10,-10664(r9)
10001e38: addi r9,r9,16
10001e3c: lfd f23,-10672(r5)
10001e40: lfd f13,-10664(r5)
10001e44: addi r5,r5,16
10001e48: lfd f5,-10664(r11)
10001e4c: fsub f28,f23,f9
10001e50: fsub f25,f13,f10
10001e54: lfd f13,-10672(r4)
10001e58: addi r11,r11,16
10001e5c: fadd f5,f12,f5
10001e60: fsub f20,f13,f0
10001e64: fmul f9,f11,f9
10001e68: fmadd f27,f22,f22,f3
10001e6c: fmadd f30,f10,f10,f3
10001e70: lfd f3,-10680(r8)
10001e74: fadd f26,f4,f6
10001e78: fmul f10,f11,f10
10001e7c: fmul f24,f28,f2
10001e80: fmul f21,f25,f2
10001e84: fmul f4,f9,f4
10001e88: fmadd f22,f0,f0,f27
10001e8c: fadd f27,f8,f7
10001e90: fadd f23,f26,f8
10001e94: fmul f26,f0,f11
10001e98: lfd f8,-10664(r6)
10001e9c: lfd f0,-10664(r4)
10001ea0: addi r6,r6,16
10001ea4: fadd f29,f5,f8
10001ea8: fsub f25,f0,f13
10001eac: addi r4,r4,16
10001eb0: fmsub f28,f20,f1,f24
10001eb4: lfd f20,-10672(r8)
10001eb8: fmul f5,f10,f5
10001ebc: addi r8,r8,16
10001ec0: stfd f4,-10672(r22)
10001ec4: stfd f5,-10664(r22)
10001ec8: addi r22,r22,16
10001ecc: fmul f27,f26,f27
10001ed0: fadd f24,f6,f29
10001ed4: fmsub f29,f25,f1,f21
10001ed8: fdiv f28,f28,f23
10001edc: fmadd f25,f13,f13,f30
10001ee0: fadd f6,f6,f12
10001ee4: fmadd f30,f3,f3,f22
10001ee8: stfd f27,-10680(r3)
10001eec: fdiv f29,f29,f24
10001ef0: fmadd f3,f20,f20,f25
10001ef4: fmul f20,f13,f11
10001ef8: fmadd f7,f30,f31,f7
10001efc: stfd f7,-10680(r10)
10001f00: fmadd f12,f3,f31,f12
10001f04: fmul f13,f20,f6
10001f08: stfd f12,-10672(r10)
10001f0c: stfd f13,-10672(r3)
10001f10: addi r10,r10,16
10001f14: addi r3,r3,16
10001f18: stfd f28,-10672(r7)
10001f1c: stfd f29,-10664(r7)
10001f20: addi r7,r7,16
10001f24: bne 10001e10 <calc1_+0x1b0>

----------

Bad asm code for the same loop:

10001a60: addis r27,r9,-435
10001a64: addis r12,r11,-2176
10001a68: lfd f13,-7440(r27)
10001a6c: lfd f10,28344(r12)
10001a70: addis r8,r11,-1958
10001a74: addis r10,r11,-1740
10001a78: fsub f7,f10,f13
10001a7c: lfd f8,-704(r8)
10001a80: lfd f10,0(r9)
10001a84: addis r7,r9,-218
10001a88: addis r28,r9,1523
10001a8c: lfd f9,-29752(r10)
10001a90: fadd f6,f12,f10
10001a94: fsub f2,f8,f0
10001a98: addis r12,r11,218
10001a9c: addis r27,r9,2176
10001aa0: fadd f5,f11,f9
10001aa4: fadd f11,f11,f12
10001aa8: addi r9,r9,8
10001aac: cmpw r6,r9
10001ab0: fmul f1,f7,f30
10001ab4: fmul f7,f13,f13
10001ab8: fmul f13,f13,f3
10001abc: fadd f31,f5,f6
10001ac0: lfd f5,29040(r7)
10001ac4: fmsub f2,f2,f29,f1
10001ac8: fmadd f1,f0,f0,f7
10001acc: fmul f0,f0,f3
10001ad0: fmul f6,f13,f6
10001ad4: stfd f6,-6728(r28)
10001ad8: fdiv f2,f2,f31
10001adc: fmadd f5,f5,f5,f1
10001ae0: fmul f31,f0,f11
10001ae4: fmr f0,f8
10001ae8: stfd f31,0(r11)
10001aec: fmr f11,f9
10001af0: addi r11,r11,8
10001af4: fadd f1,f5,f4
10001af8: fmr f4,f7
10001afc: fmadd f5,f1,f28,f12
10001b00: fmr f12,f10
10001b04: stfd f5,-28344(r27)
10001b08: stfd f2,-29040(r12)
10001b0c: bne+ 10001a60 <calc1_+0xe0>

----------

Looking into the differences between the two, the good code traverses the loop differently from the bad one: it uses small displacements for each load/store and advances the base registers, while the bad case recomputes addresses with addis and uses large displacements. It also looks like the good case has a bigger unrolling factor (longer code, more loads per iteration) than the bad case; see the hand-unrolled sketch after the testcase below.
That's

      DO 100 J=1,N
      DO 100 I=1,M
      CU(I+1,J) = .5D0*(P(I+1,J)+P(I,J))*U(I+1,J)
      CV(I,J+1) = .5D0*(P(I,J+1)+P(I,J))*V(I,J+1)
      Z(I+1,J+1) = (FSDX*(V(I+1,J+1)-V(I,J+1))-FSDY*(U(I+1,J+1)
     1 -U(I+1,J)))/(P(I,J)+P(I+1,J)+P(I+1,J+1)+P(I,J+1))
      H(I,J) = P(I,J)+.25D0*(U(I+1,J)*U(I+1,J)+U(I,J)*U(I,J)
     1 +V(I,J+1)*V(I,J+1)+V(I,J)*V(I,J))
  100 CONTINUE

right? GCC 4.4 can do predictive commoning on it while trunk cannot; in 4.4 this also unrolls the loop twice. On trunk we are likely confused by PRE, which already partially performs what predictive commoning would do. Disabling PRE makes predictive commoning work, but then the loop is not unrolled (the same happens when disabling PRE in 4.4), so it is likely the full redundancies PRE discovers that cause the unrolling. That said, this looks like yet another unfortunate pass-ordering problem to me.
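For illustration, the reuse predictive commoning exploits here is that values loaded as (I+1,...) in one iteration of the inner loop are needed again as (I,...) in the next, so they can stay in registers instead of being reloaded. Below is a hand-written approximation with hypothetical scalar temporaries (the subroutine name CALC1PC and the temporaries PIJ, PIJ1, UIJ, VIJ1 are made up); the real pass works on the intermediate representation and, unlike this sketch, also unrolls the loop:

      SUBROUTINE CALC1PC
C     Hand-written approximation of the cross-iteration reuse in the
C     CALC1 loop.  Values loaded as (I+1,...) are kept in scalars and
C     reused as (I,...) in the next iteration.  Names are made up.
      IMPLICIT REAL*8 (A-H, O-Z)
      PARAMETER (N1=1335, N2=1335)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)
      COMMON /CONS/ DX,DY
      FSDX = 4.D0/DX
      FSDY = 4.D0/DY
      DO 100 J=1,N
C     prime the carried values before entering the inner loop
      PIJ  = P(1,J)
      PIJ1 = P(1,J+1)
      UIJ  = U(1,J)
      VIJ1 = V(1,J+1)
      DO 90 I=1,M
C     fresh loads for this iteration
      PI1J  = P(I+1,J)
      PI1J1 = P(I+1,J+1)
      UI1J  = U(I+1,J)
      VI1J1 = V(I+1,J+1)
      CU(I+1,J) = .5D0*(PI1J+PIJ)*UI1J
      CV(I,J+1) = .5D0*(PIJ1+PIJ)*VIJ1
      Z(I+1,J+1) = (FSDX*(VI1J1-VIJ1)-FSDY*(U(I+1,J+1)
     1 -UI1J))/(PIJ+PI1J+PI1J1+PIJ1)
      H(I,J) = PIJ+.25D0*(UI1J*UI1J+UIJ*UIJ
     1 +VIJ1*VIJ1+V(I,J)*V(I,J))
C     rotate: next iteration's (I,...) values are this one's (I+1,...)
      PIJ  = PI1J
      PIJ1 = PI1J1
      UIJ  = UI1J
      VIJ1 = VI1J1
   90 CONTINUE
  100 CONTINUE
      RETURN
      END

The four copy assignments at the end of the body are the price of not unrolling; compare the fmr register moves in the bad code above.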
Testcase:

      SUBROUTINE CALC1
      IMPLICIT REAL*8 (A-H, O-Z)
      PARAMETER (N1=1335, N2=1335)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)
      COMMON /CONS/ DX,DY
      FSDX = 4.D0/DX
      FSDY = 4.D0/DY
      DO 100 J=1,N
      DO 100 I=1,M
      CU(I+1,J) = .5D0*(P(I+1,J)+P(I,J))*U(I+1,J)
      CV(I,J+1) = .5D0*(P(I,J+1)+P(I,J))*V(I,J+1)
      Z(I+1,J+1) = (FSDX*(V(I+1,J+1)-V(I,J+1))-FSDY*(U(I+1,J+1)
     1 -U(I+1,J)))/(P(I,J)+P(I+1,J)+P(I+1,J+1)+P(I,J+1))
      H(I,J) = P(I,J)+.25D0*(U(I+1,J)*U(I+1,J)+U(I,J)*U(I,J)
     1 +V(I,J+1)*V(I,J+1)+V(I,J)*V(I,J))
  100 CONTINUE
      RETURN
      END
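For comparison with the good code above, here is roughly what an unroll-by-two version of this loop looks like at the source level. This is only a hand-written sketch of the iteration structure (the subroutine name CALC1U2 is made up, and it assumes M is even so no remainder loop is needed); the compiler does the unrolling on its intermediate representation and, in the good code, also combines it with the cross-iteration reuse sketched earlier:

      SUBROUTINE CALC1U2
C     Hand-unrolled-by-two variant of the CALC1 loop, for illustration
C     of the iteration structure only (subroutine name made up).
C     Assumes MOD(M,2) .EQ. 0, so no remainder loop is shown; N and M
C     are left undefined here exactly as in the reduced testcase above.
      IMPLICIT REAL*8 (A-H, O-Z)
      PARAMETER (N1=1335, N2=1335)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)
      COMMON /CONS/ DX,DY
      FSDX = 4.D0/DX
      FSDY = 4.D0/DY
      DO 100 J=1,N
      DO 100 I=1,M,2
C     original iteration I
      CU(I+1,J) = .5D0*(P(I+1,J)+P(I,J))*U(I+1,J)
      CV(I,J+1) = .5D0*(P(I,J+1)+P(I,J))*V(I,J+1)
      Z(I+1,J+1) = (FSDX*(V(I+1,J+1)-V(I,J+1))-FSDY*(U(I+1,J+1)
     1 -U(I+1,J)))/(P(I,J)+P(I+1,J)+P(I+1,J+1)+P(I,J+1))
      H(I,J) = P(I,J)+.25D0*(U(I+1,J)*U(I+1,J)+U(I,J)*U(I,J)
     1 +V(I,J+1)*V(I,J+1)+V(I,J)*V(I,J))
C     original iteration I+1
      CU(I+2,J) = .5D0*(P(I+2,J)+P(I+1,J))*U(I+2,J)
      CV(I+1,J+1) = .5D0*(P(I+1,J+1)+P(I+1,J))*V(I+1,J+1)
      Z(I+2,J+1) = (FSDX*(V(I+2,J+1)-V(I+1,J+1))-FSDY*(U(I+2,J+1)
     1 -U(I+2,J)))/(P(I+1,J)+P(I+2,J)+P(I+2,J+1)+P(I+1,J+1))
      H(I+1,J) = P(I+1,J)+.25D0*(U(I+2,J)*U(I+2,J)+U(I+1,J)*U(I+1,J)
     1 +V(I+1,J+1)*V(I+1,J+1)+V(I+1,J)*V(I+1,J))
  100 CONTINUE
      RETURN
      END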
Actually, PRE seems to be more powerful than predictive commoning here; we lose one opportunity but gain more in return. With predictive commoning we have 8 loads and 4 stores, 11 multiplications and one division. With PRE it is 6 loads and 4 stores, 10 multiplications and one division. The only thing we gain from predictive commoning in 4.4 is unrolling the loop once.
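One plausible reason the unrolling pays off on top of the reuse: when values are carried across iterations in registers (whether by predictive commoning or by PRE's PHI insertion), a loop that is not unrolled must shuffle them with register copies each iteration (the fmr instructions in the bad code, and the rotation assignments in the earlier sketch), while an unroll factor of two lets the two temporaries simply alternate roles. A toy example, unrelated to swim and with made-up names, sketching this:

      SUBROUTINE DEMO(A, B, N)
C     Toy example, unrelated to swim: B(I) = A(I) + A(I+1).
C     T0/T1 carry A(I) and A(I+1) across iterations; with the loop
C     unrolled by two the temporaries alternate roles, so no copy
C     (register move) is needed at the end of each iteration.
      IMPLICIT REAL*8 (A-H, O-Z)
      DIMENSION A(N+1), B(N)
      T0 = A(1)
      DO 10 I = 1, N-MOD(N,2), 2
         T1 = A(I+1)
         B(I) = T0 + T1
         T0 = A(I+2)
         B(I+1) = T1 + T0
   10 CONTINUE
C     remainder iteration for odd N
      DO 20 I = N-MOD(N,2)+1, N
         B(I) = A(I) + A(I+1)
   20 CONTINUE
      RETURN
      END

Removing those copies is presumably part of why the unrolled predictive-commoning variant in 4.4 ends up a bit faster.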
From predictive commoning we gain a bit more performance, probably due to the bigger unrolling factor.

Any chance of the unrolling taking place while still using PRE?

Thanks,
Luis
Subject: Re: [4.5 Regression] Big degradation on swim/mgrid on powerpc 32/64 after alias improvement merge (gcc r145494)

On Fri, 29 May 2009, luisgpm at linux dot vnet dot ibm dot com wrote:

> ------- Comment #5 from luisgpm at linux dot vnet dot ibm dot com 2009-05-29 19:52 -------
> From predictive commoning we gain a bit more performance, probably due to the
> bigger unrolling factor.
>
> Any chance of the unrolling taking place while still using PRE?

-funroll[-all]-loops doesn't seem to do the job. I didn't check whether enabling SMS (swing modulo scheduling) would do it. Other unrolling at the tree level is only implemented as a side effect of other optimizations (like vectorization, predictive commoning or prefetching) :/

Richard.
I believe this was fixed with

2009-07-22  Michael Matz  <matz@suse.de>

        PR tree-optimization/35229
        PR tree-optimization/39300
        * tree-ssa-pre.c (includes): Include tree-scalar-evolution.h.
        (inhibit_phi_insertion): New function.
        (insert_into_preds_of_block): Call it for REFERENCEs.
        (init_pre): Initialize and finalize scalar evolutions.
        * Makefile.in (tree-ssa-pre.o): Depend on tree-scalar-evolution.h .

which avoids the PRE and enables predictive commoning again (on x86_64 only the tail loop of the vectorized variant is predictive-commoned).