CPU2000's swim and mgrid show a ~10% slowdown after the merge of the alias improvement branch.

GCC was configured with the following:

/gcc/HEAD/configure --target=powerpc64-linux --host=powerpc64-linux --build=powerpc64-linux --with-cpu=default32 --enable-threads=posix --enable-shared --enable-__cxa_atexit --with-gmp=/gmp --with-mpfr=mpfr --with-long-double-128 --enable-decimal-float --enable-secure-plt --disable-bootstrap --disable-alsa --prefix=/install/gcc/HEAD build_alias=powerpc64-linux host_alias=powerpc64-linux target_alias=powerpc64-linux --enable-languages=c,c++,fortran --no-create --no-recursion

Compile flags used:

-m[32|64] -O3 -mcpu=power[4|5|6] -ffast-math -ftree-loop-linear -funroll-loops -fpeel-loops

Will provide more details soon.
Good asm code for a hot loop in swim's "calc1" function:

10001e10: lfd f12,-10672(r11)
10001e14: lfd f9,-10672(r9)
10001e18: addi r21,r21,16
10001e1c: lfd f7,-10680(r11)
10001e20: lfd f6,-10672(r6)
10001e24: fmul f3,f9,f9
10001e28: cmpw r21,r0
10001e2c: fadd f4,f7,f12
10001e30: lfd f22,-10680(r9)
10001e34: lfd f10,-10664(r9)
10001e38: addi r9,r9,16
10001e3c: lfd f23,-10672(r5)
10001e40: lfd f13,-10664(r5)
10001e44: addi r5,r5,16
10001e48: lfd f5,-10664(r11)
10001e4c: fsub f28,f23,f9
10001e50: fsub f25,f13,f10
10001e54: lfd f13,-10672(r4)
10001e58: addi r11,r11,16
10001e5c: fadd f5,f12,f5
10001e60: fsub f20,f13,f0
10001e64: fmul f9,f11,f9
10001e68: fmadd f27,f22,f22,f3
10001e6c: fmadd f30,f10,f10,f3
10001e70: lfd f3,-10680(r8)
10001e74: fadd f26,f4,f6
10001e78: fmul f10,f11,f10
10001e7c: fmul f24,f28,f2
10001e80: fmul f21,f25,f2
10001e84: fmul f4,f9,f4
10001e88: fmadd f22,f0,f0,f27
10001e8c: fadd f27,f8,f7
10001e90: fadd f23,f26,f8
10001e94: fmul f26,f0,f11
10001e98: lfd f8,-10664(r6)
10001e9c: lfd f0,-10664(r4)
10001ea0: addi r6,r6,16
10001ea4: fadd f29,f5,f8
10001ea8: fsub f25,f0,f13
10001eac: addi r4,r4,16
10001eb0: fmsub f28,f20,f1,f24
10001eb4: lfd f20,-10672(r8)
10001eb8: fmul f5,f10,f5
10001ebc: addi r8,r8,16
10001ec0: stfd f4,-10672(r22)
10001ec4: stfd f5,-10664(r22)
10001ec8: addi r22,r22,16
10001ecc: fmul f27,f26,f27
10001ed0: fadd f24,f6,f29
10001ed4: fmsub f29,f25,f1,f21
10001ed8: fdiv f28,f28,f23
10001edc: fmadd f25,f13,f13,f30
10001ee0: fadd f6,f6,f12
10001ee4: fmadd f30,f3,f3,f22
10001ee8: stfd f27,-10680(r3)
10001eec: fdiv f29,f29,f24
10001ef0: fmadd f3,f20,f20,f25
10001ef4: fmul f20,f13,f11
10001ef8: fmadd f7,f30,f31,f7
10001efc: stfd f7,-10680(r10)
10001f00: fmadd f12,f3,f31,f12
10001f04: fmul f13,f20,f6
10001f08: stfd f12,-10672(r10)
10001f0c: stfd f13,-10672(r3)
10001f10: addi r10,r10,16
10001f14: addi r3,r3,16
10001f18: stfd f28,-10672(r7)
10001f1c: stfd f29,-10664(r7)
10001f20: addi r7,r7,16
10001f24: bne 10001e10 <calc1_+0x1b0>

----------

Bad asm code for the same loop:

10001a60: addis r27,r9,-435
10001a64: addis r12,r11,-2176
10001a68: lfd f13,-7440(r27)
10001a6c: lfd f10,28344(r12)
10001a70: addis r8,r11,-1958
10001a74: addis r10,r11,-1740
10001a78: fsub f7,f10,f13
10001a7c: lfd f8,-704(r8)
10001a80: lfd f10,0(r9)
10001a84: addis r7,r9,-218
10001a88: addis r28,r9,1523
10001a8c: lfd f9,-29752(r10)
10001a90: fadd f6,f12,f10
10001a94: fsub f2,f8,f0
10001a98: addis r12,r11,218
10001a9c: addis r27,r9,2176
10001aa0: fadd f5,f11,f9
10001aa4: fadd f11,f11,f12
10001aa8: addi r9,r9,8
10001aac: cmpw r6,r9
10001ab0: fmul f1,f7,f30
10001ab4: fmul f7,f13,f13
10001ab8: fmul f13,f13,f3
10001abc: fadd f31,f5,f6
10001ac0: lfd f5,29040(r7)
10001ac4: fmsub f2,f2,f29,f1
10001ac8: fmadd f1,f0,f0,f7
10001acc: fmul f0,f0,f3
10001ad0: fmul f6,f13,f6
10001ad4: stfd f6,-6728(r28)
10001ad8: fdiv f2,f2,f31
10001adc: fmadd f5,f5,f5,f1
10001ae0: fmul f31,f0,f11
10001ae4: fmr f0,f8
10001ae8: stfd f31,0(r11)
10001aec: fmr f11,f9
10001af0: addi r11,r11,8
10001af4: fadd f1,f5,f4
10001af8: fmr f4,f7
10001afc: fmadd f5,f1,f28,f12
10001b00: fmr f12,f10
10001b04: stfd f5,-28344(r27)
10001b08: stfd f2,-29040(r12)
10001b0c: bne+ 10001a60 <calc1_+0xe0>

----------

Looking into the differences between the two, the good code traverses the loop differently from the bad one: it uses small displacements for each load/store and advances the base registers, while the bad case recomputes addresses with addis and uses large displacements. It also looks like the good case has a bigger unrolling factor (longer code, more loads per iteration) than the bad case; see the hand-unrolled sketch after the testcase below.
That's

      DO 100 J=1,N
      DO 100 I=1,M
      CU(I+1,J) = .5D0*(P(I+1,J)+P(I,J))*U(I+1,J)
      CV(I,J+1) = .5D0*(P(I,J+1)+P(I,J))*V(I,J+1)
      Z(I+1,J+1) = (FSDX*(V(I+1,J+1)-V(I,J+1))-FSDY*(U(I+1,J+1)
     1 -U(I+1,J)))/(P(I,J)+P(I+1,J)+P(I+1,J+1)+P(I,J+1))
      H(I,J) = P(I,J)+.25D0*(U(I+1,J)*U(I+1,J)+U(I,J)*U(I,J)
     1 +V(I,J+1)*V(I,J+1)+V(I,J)*V(I,J))
  100 CONTINUE

right? GCC 4.4 can do predictive commoning on it while trunk cannot; in 4.4 this also unrolls the loop twice. On trunk we are likely confused by PRE, which already partially performs what predictive commoning would do. Disabling PRE makes predictive commoning work, but then the loop is not unrolled (the same happens when disabling PRE in 4.4), so it is likely the full redundancies PRE discovers that cause the unrolling. That said, this looks like yet another unfortunate pass-ordering problem to me.
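For illustration, the reuse predictive commoning exploits here is that values loaded as (I+1,...) in one iteration of the inner loop are needed again as (I,...) in the next, so they can stay in registers instead of being reloaded. Below is a hand-written approximation with hypothetical scalar temporaries (the subroutine name CALC1PC and the temporaries PIJ, PIJ1, UIJ, VIJ1 are made up); the real pass works on the intermediate representation and, unlike this sketch, also unrolls the loop:

      SUBROUTINE CALC1PC
C     Hand-written approximation of the cross-iteration reuse in the
C     CALC1 loop.  Values loaded as (I+1,...) are kept in scalars and
C     reused as (I,...) in the next iteration.  Names are made up.
      IMPLICIT REAL*8 (A-H, O-Z)
      PARAMETER (N1=1335, N2=1335)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)
      COMMON /CONS/ DX,DY
      FSDX = 4.D0/DX
      FSDY = 4.D0/DY
      DO 100 J=1,N
C     prime the carried values before entering the inner loop
      PIJ  = P(1,J)
      PIJ1 = P(1,J+1)
      UIJ  = U(1,J)
      VIJ1 = V(1,J+1)
      DO 90 I=1,M
C     fresh loads for this iteration
      PI1J  = P(I+1,J)
      PI1J1 = P(I+1,J+1)
      UI1J  = U(I+1,J)
      VI1J1 = V(I+1,J+1)
      CU(I+1,J) = .5D0*(PI1J+PIJ)*UI1J
      CV(I,J+1) = .5D0*(PIJ1+PIJ)*VIJ1
      Z(I+1,J+1) = (FSDX*(VI1J1-VIJ1)-FSDY*(U(I+1,J+1)
     1 -UI1J))/(PIJ+PI1J+PI1J1+PIJ1)
      H(I,J) = PIJ+.25D0*(UI1J*UI1J+UIJ*UIJ
     1 +VIJ1*VIJ1+V(I,J)*V(I,J))
C     rotate: next iteration's (I,...) values are this one's (I+1,...)
      PIJ  = PI1J
      PIJ1 = PI1J1
      UIJ  = UI1J
      VIJ1 = VI1J1
   90 CONTINUE
  100 CONTINUE
      RETURN
      END

The four copy assignments at the end of the body are the price of not unrolling; compare the fmr register moves in the bad code above.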
Testcase:

      SUBROUTINE CALC1
      IMPLICIT REAL*8 (A-H, O-Z)
      PARAMETER (N1=1335, N2=1335)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)
      COMMON /CONS/ DX,DY
      FSDX = 4.D0/DX
      FSDY = 4.D0/DY
      DO 100 J=1,N
      DO 100 I=1,M
      CU(I+1,J) = .5D0*(P(I+1,J)+P(I,J))*U(I+1,J)
      CV(I,J+1) = .5D0*(P(I,J+1)+P(I,J))*V(I,J+1)
      Z(I+1,J+1) = (FSDX*(V(I+1,J+1)-V(I,J+1))-FSDY*(U(I+1,J+1)
     1 -U(I+1,J)))/(P(I,J)+P(I+1,J)+P(I+1,J+1)+P(I,J+1))
      H(I,J) = P(I,J)+.25D0*(U(I+1,J)*U(I+1,J)+U(I,J)*U(I,J)
     1 +V(I,J+1)*V(I,J+1)+V(I,J)*V(I,J))
  100 CONTINUE
      RETURN
      END
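For comparison with the good code above, here is roughly what an unroll-by-two version of this loop looks like at the source level. This is only a hand-written sketch of the iteration structure (the subroutine name CALC1U2 is made up, and it assumes M is even so no remainder loop is needed); the compiler does the unrolling on its intermediate representation and, in the good code, also combines it with the cross-iteration reuse sketched earlier:

      SUBROUTINE CALC1U2
C     Hand-unrolled-by-two variant of the CALC1 loop, for illustration
C     of the iteration structure only (subroutine name made up).
C     Assumes MOD(M,2) .EQ. 0, so no remainder loop is shown; N and M
C     are left undefined here exactly as in the reduced testcase above.
      IMPLICIT REAL*8 (A-H, O-Z)
      PARAMETER (N1=1335, N2=1335)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)
      COMMON /CONS/ DX,DY
      FSDX = 4.D0/DX
      FSDY = 4.D0/DY
      DO 100 J=1,N
      DO 100 I=1,M,2
C     original iteration I
      CU(I+1,J) = .5D0*(P(I+1,J)+P(I,J))*U(I+1,J)
      CV(I,J+1) = .5D0*(P(I,J+1)+P(I,J))*V(I,J+1)
      Z(I+1,J+1) = (FSDX*(V(I+1,J+1)-V(I,J+1))-FSDY*(U(I+1,J+1)
     1 -U(I+1,J)))/(P(I,J)+P(I+1,J)+P(I+1,J+1)+P(I,J+1))
      H(I,J) = P(I,J)+.25D0*(U(I+1,J)*U(I+1,J)+U(I,J)*U(I,J)
     1 +V(I,J+1)*V(I,J+1)+V(I,J)*V(I,J))
C     original iteration I+1
      CU(I+2,J) = .5D0*(P(I+2,J)+P(I+1,J))*U(I+2,J)
      CV(I+1,J+1) = .5D0*(P(I+1,J+1)+P(I+1,J))*V(I+1,J+1)
      Z(I+2,J+1) = (FSDX*(V(I+2,J+1)-V(I+1,J+1))-FSDY*(U(I+2,J+1)
     1 -U(I+2,J)))/(P(I+1,J)+P(I+2,J)+P(I+2,J+1)+P(I+1,J+1))
      H(I+1,J) = P(I+1,J)+.25D0*(U(I+2,J)*U(I+2,J)+U(I+1,J)*U(I+1,J)
     1 +V(I+1,J+1)*V(I+1,J+1)+V(I+1,J)*V(I+1,J))
  100 CONTINUE
      RETURN
      END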
Actually, PRE seems to be more powerful than predictive commoning here; we lose one opportunity but gain more in return. With predictive commoning we have 8 loads and 4 stores, 11 multiplications and one division. With PRE it is 6 loads and 4 stores, 10 multiplications and one division. The only thing we gain from predictive commoning in 4.4 is unrolling the loop once.
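One plausible reason the unrolling pays off on top of the reuse: when values are carried across iterations in registers (whether by predictive commoning or by PRE's PHI insertion), a loop that is not unrolled must shuffle them with register copies each iteration (the fmr instructions in the bad code, and the rotation assignments in the earlier sketch), while an unroll factor of two lets the two temporaries simply alternate roles. A toy example, unrelated to swim and with made-up names, sketching this:

      SUBROUTINE DEMO(A, B, N)
C     Toy example, unrelated to swim: B(I) = A(I) + A(I+1).
C     T0/T1 carry A(I) and A(I+1) across iterations; with the loop
C     unrolled by two the temporaries alternate roles, so no copy
C     (register move) is needed at the end of each iteration.
      IMPLICIT REAL*8 (A-H, O-Z)
      DIMENSION A(N+1), B(N)
      T0 = A(1)
      DO 10 I = 1, N-MOD(N,2), 2
         T1 = A(I+1)
         B(I) = T0 + T1
         T0 = A(I+2)
         B(I+1) = T1 + T0
   10 CONTINUE
C     remainder iteration for odd N
      DO 20 I = N-MOD(N,2)+1, N
         B(I) = A(I) + A(I+1)
   20 CONTINUE
      RETURN
      END

Removing those copies is presumably part of why the unrolled predictive-commoning variant in 4.4 ends up a bit faster.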
From predictive commoning we gain a bit more performance, probably due to the bigger unrolling factor.

Any chance of the unrolling taking place while still using PRE?

Thanks,
Luis
Subject: Re: [4.5 Regression] Big degradation on swim/mgrid on powerpc 32/64 after alias improvement merge (gcc r145494)

On Fri, 29 May 2009, luisgpm at linux dot vnet dot ibm dot com wrote:

> ------- Comment #5 from luisgpm at linux dot vnet dot ibm dot com 2009-05-29 19:52 -------
> From predictive commoning we gain a bit more performance, probably due to the
> bigger unrolling factor.
>
> Any chance of the unrolling taking place while still using PRE?

-funroll[-all]-loops doesn't seem to do the job. I didn't check whether enabling SMS (swing modulo scheduling) would do it. Other unrolling at the tree level is only implemented as a side effect of other optimizations (like vectorization, predictive commoning or prefetching) :/

Richard.
I believe this was fixed with

2009-07-22  Michael Matz  <matz@suse.de>

        PR tree-optimization/35229
        PR tree-optimization/39300
        * tree-ssa-pre.c (includes): Include tree-scalar-evolution.h.
        (inhibit_phi_insertion): New function.
        (insert_into_preds_of_block): Call it for REFERENCEs.
        (init_pre): Initialize and finalize scalar evolutions.
        * Makefile.in (tree-ssa-pre.o): Depend on tree-scalar-evolution.h .

which avoids the PRE and enables predictive commoning again (on x86_64 only the tail loop of the vectorized variant is predictive-commoned).