Bug 35659 - [4.3/4.4 Regression] Miscompiled code with -O2 (but not with -O2 -funroll-loops) on ia64
Summary: [4.3/4.4 Regression] Miscompiled code with -O2 (but not with -O2 -funroll-loo...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.3.0
: P2 normal
Target Milestone: 4.3.2
Assignee: Maxim Kuvyrkov
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2008-03-21 18:48 UTC by Kevin McCarty
Modified: 2008-08-06 06:38 UTC (History)
9 users (show)

See Also:
Host: ia64-linux-gnu
Target: ia64-linux-gnu
Build: ia64-linux-gnu
Known to work: 4.2.3
Known to fail: 4.3.0
Last reconfirmed: 2008-06-26 09:23:37


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Kevin McCarty 2008-03-21 18:48:58 UTC
Hi,

[This bug was initially submitted to the Debian BTS at
http://bugs.debian.org/466948 -- at the request of Debian's gcc maintainer, I
am also sending it here]

There is apparently a miscompilation of CERNLIB code in the "kernlib" library on ia64 with gfortran-4.3 at optimization level -O2 or higher.  (Oddly, compiling with -O2 -funroll-loops does not expose the problem, which is why I didn't notice it earlier.)

I've verified that this problem occurs on ia64 with gfortran-4.3 versions 4.3-20080202-1, 4.3-20080219-1 and 4.3.0.  For the most recent trial:

(sid)kmccarty@merulo:~$ gfortran-4.3 -v
Using built-in specs.
Target: ia64-linux-gnu
Configured with: ../src/configure linux gnu
Thread model: posix
gcc version 4.3.1 20080309 (prerelease) (Debian 4.3.0-1)

Please see the complete test case I've provided at http://people.debian.org/~kmccarty/tlsc-test-case.tar.gz

[Note, the files libkernlib.a and libkerngent.a in the tarball were compiled on ia64 from CERNLIB source code, which is GPL; it is available from the Debian source package of "cernlib" version 2006.dfsg.2-10]

[Note 2, I have updated this test case tarball to the gfortran version noted
above since I originally submitted this bug to the Debian BTS.]

The file tlsc.F in the test case is where the optimization bug happens.  I saved files tlsc.f, tlsc.s and tlsc.o into subdirectories of the test case for three different combinations of compile flags:

-O2 (causes miscompiled code) in obj-fail subdir

-O0 (seems fine)
-O2 -funroll-loops (seems fine) both in obj-success subdir.

Also the output of the test case, saved as output.txt in each subdir.

You can regenerate all of these files with these three combinations of compiler flags quickly with "make output".

The presence or absence of -fno-automatic on the gfortran command line (I omitted it in all three cases) appears to make no difference.

best regards,
Kevin McCarty
Comment 1 Richard Biener 2008-03-22 13:21:41 UTC
Does it work with gcc 4.2?
Comment 2 Kevin McCarty 2008-03-24 16:47:50 UTC
(In reply to comment #1)
> Does it work with gcc 4.2?

Yes, it does, at least with this version:

(sid)kmccarty@merulo:~$ gfortran-4.2 -v
Using built-in specs.
Target: ia64-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.2 --program-suffix=-4.2 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --disable-libmudflap --disable-libssp --with-system-libunwind --enable-checking=release --build=ia64-linux-gnu --host=ia64-linux-gnu --target=ia64-linux-gnu
Thread model: posix
gcc version 4.2.3 (Debian 4.2.3-2)

The test case works in all cases with gfortran 4.2, regardless of whether I use libkernlib.a and libkerngent.a compiled with gfortran 4.2 or 4.3.  (But please let me know if you need copies of these libraries built with gfortran 4.2 anyway.)

I also just noticed, surprisingly, that with gfortran 4.3, if I compile the test program with -O3 or with -O3 -funroll-loops, the test succeeds.  So oddly enough it is ONLY with -O2 (without -funroll-loops) that the test fails.  (Again, this behavior is independent of which gfortran version I use to build libkernlib.a and libkerngent.a.)  Please also let me know if you need the intermediate .f and .o files built with gfortran-4.3 -O3.
Comment 3 Richard Biener 2008-03-24 17:32:45 UTC
Possibly a scheduling issue.
Comment 4 Steve Ellcey 2008-05-20 21:01:17 UTC
I wonder if this is related to PR target/35695, the floating point division bug that Jim Wilson fixed.  Could you try it with ToT or with the latest 4.3 branch, both of which have Jim's fix in them.
Comment 5 Kevin McCarty 2008-05-21 03:50:51 UTC
(In reply to comment #4)
> I wonder if this is related to PR target/35695, the floating point division bug
> that Jim Wilson fixed.  Could you try it with ToT or with the latest 4.3
> branch, both of which have Jim's fix in them.

I tried with the gcc-4_3-branch from Subversion today, but it doesn't change things: plain -O2 fails but both -O2 -funroll-loops and -O1 succeed on the test case.

For the record:

(sid)kmccarty@merulo:~$ ~/gcc-4.3-branch/bin/gfortran -v
Using built-in specs.
Target: ia64-unknown-linux-gnu
Configured with: ./configure --enable-languages=c,fortran --prefix=/home/kmccarty/gcc-4.3-branch --enable-shared --with-mpfr=/home/kmccarty/gcc-4.3-branch
Thread model: posix
gcc version 4.3.1 20080521 (prerelease) (GCC)
Comment 6 Richard Biener 2008-05-21 10:39:35 UTC
Mark, you made this P5, but ia64-linux is a secondary platform.  P3 again to
get it on the radar.  But not higher priority because we don't have exactly what
I would call a testcase.
Comment 7 Mark Mitchell 2008-05-23 01:29:04 UTC
I agree on both points: (1) I should not have marked this P5, and (2) we do not yet have a test case.  Marked as "WAITING".
Comment 8 Steve Ellcey 2008-06-03 17:59:13 UTC
I looked at this bug and I can reproduce it using the precompiled archives from the link.  I have not tried to get the CERN sources to create a small 'real' test case.  I noticed that the bug does not happen on ToT and found that the test started working correctly with version 133081.  The patch submitted in this version is for PR 34677 which claims to be a missed optimization issue as opposed to a codegen bug fix so I am not sure if the bug is really fixed by that change or if that change just masks the problem like -funroll-loops does.

2008-03-10  Richard Guenther  <rguenther@suse.de>

        PR tree-optimization/34677
        * tree-ssa-pre.c (modify_expr_node_pool): Remove.
        (poolify_tree): Likewise.
        (modify_expr_template): Likewise.
        (poolify_modify_stmt): Likewise.
        (insert_fake_stores): Handle all component-ref style stores
        in addition to INDIRECT_REF.  Also handle complex types.
        Do not poolify the inserted load.
        (realify_fake_stores): Do not rebuild the tree but only
        make it a SSA_NAME copy.
        (init_pre): Remove initialzation of modify_expr_template.
        Do not allocate modify_expr_node_pool.
        (fini_pre): Do not free modify_expr_node_pool.
Comment 9 Richard Biener 2008-06-04 08:59:50 UTC
The only code-generation changing part of the patch is

        (insert_fake_stores): Handle all component-ref style stores
        in addition to INDIRECT_REF.  Also handle complex types.

but that only fixes a missed-optimization, so it probably makes the problem
latent on the trunk.
Comment 10 Richard Biener 2008-06-06 14:59:17 UTC
4.3.1 is being released, adjusting target milestone.
Comment 11 Jakub Jelinek 2008-06-24 14:44:52 UTC
      PROGRAM PR35659
      DIMENSION A(1000), B(1010), AUX(8), IPIV(8), X(16)
      COMMON /TLSDIM/ M1,M,N,L,IER
      COMMON /SLATE/ V1,V2,IAR(24),DUM(14)
      DATA A/0, 1, 0, 0, 1, 0, 2, 0, 1, 0, 0, 0, 1, 0.200000003,
     1  0.0399999991, 0.00800000038, 1, 0.400000006, 0.159999996,
     2  0.064000003, 1, 0.600000024, 0.360000014, 0.216000006, 1,
     3  0.800000012, 0.639999986, 0.512000024, 1, 1, 1, 1, 968*0./
      DATA B/1, 2, 1, 1.22140002, 1.49179995, 1.82210004,
     4  2.22550011, 2.7183001, 0, 0, 1000*0./
      M1 = 2
      M = 8
      N = 4
      L = 1
      IER = 0
      V1 = 0
      V2 = 1.40129846e-45
      IAR(:) = 538976288
      DUM(:) = 1.35631564e-19
      CALL TLSC(A,B,AUX,IPIV,1.,X)
      IF (ABS(X(1) - 0.99785352).GE.0.1) CALL ABORT
      IF (ABS(X(2) - 1.0).GE.0.1) CALL ABORT
      IF (ABS(X(3) - 0.50107324).GE.0.1) CALL ABORT
      IF (ABS(X(4) - 0.21670136).GE.0.1) CALL ABORT
      END
      SUBROUTINE TLSMSQ (B,L,M,F)                                                                                                
      COMMON /SLATE/ DUM(38),I,JB                                                                                                
      DIMENSION      B(*)                                                                                                        
      F  = 0.                                                                                                                    
      JB = 1                                                                                                                     
      DO           10        I=1,M                                                                                               
      F  = F + B(JB)*B(JB)                                                                                                       
   10 JB = JB + L                                                                                                                
      RETURN                                                                                                                     
      END                                   
      SUBROUTINE TLSWOP (A,AD,N,NR)                                                                                              
      COMMON /SLATE/ DUM(37),H,I,JA                                                                                              
      DIMENSION      A(*), AD(*)                                                                                                 
      JA = 1                                                                                                                     
      DO           10        I=1,NR                                                                                              
      H  = A(JA)                                                                                                                 
      A(JA)  = AD(JA)                                                                                                            
      AD(JA) = H                                                                                                                 
   10 JA = JA + N                                                                                                                
      RETURN                                                                                                                     
      END                                      
      SUBROUTINE TLUK (A,IASEP,NR,SIG,BETA)
      COMMON /SLATE/ DUM(37),I,JA,LL
      DIMENSION A(*)
      SIG= 0.
      JA = 1
      LL = 0
      DO           10        I=1,NR
      IF     (A(JA).EQ.0.)             GO TO     10
      LL = I
      SIG= SIG + A(JA)* A(JA)
   10 JA = JA + IASEP
      NR = LL
      IF     (NR.EQ.0)                 RETURN
      SIG  = SIGN (SQRT (SIG),A(1))
      BETA = A(1) + SIG
      A(1) = BETA
      BETA = 1. / (SIG * BETA)
      RETURN
      END
      SUBROUTINE TLSTEP (A,B,IASEP,IBSEP,NR,NC,BETA)
      COMMON /SLATE/ DUM(34),H,I,IB,J,JA,JB
      DIMENSION      A(*), B(*)
      IB = 0
      DO           30        J=1,NC
      IB = IB + 1
      H  = 0.
      JA = 1
      JB = IB
      DO           10        I=1,NR
      H  = H + A(JA) * B(JB)
      JA = JA +IASEP
   10 JB = JB + IBSEP
      H  = H * BETA
      JA = 1
      JB = IB
      DO           20        I=1,NR
      B(JB) = B(JB) - A(JA) * H
      JA = JA +IASEP
   20 JB = JB + IBSEP
   30 CONTINUE
      RETURN
      END

together with tlsc.f can work as a testcase, and reproduces the problem with current 4.3 branch.
Comment 12 Jakub Jelinek 2008-06-24 15:04:50 UTC
Even smaller reproducer:
      PROGRAM PR35659
      DIMENSION A(1000), B(1010), AUX(8), IPIV(8), X(16)
      COMMON /TLSDIM/ M1,M,N,L,IER
      COMMON /SLATE/ V1,V2,IAR(24),DUM(14)
      DATA A/0, 1, 0, 0, 1, 0, 2, 0, 1, 0, 0, 0, 1, 0.200000003,
     1  0.0399999991, 0.00800000038, 1, 0.400000006, 0.159999996,
     2  0.064000003, 1, 0.600000024, 0.360000014, 0.216000006, 1,
     3  0.800000012, 0.639999986, 0.512000024, 1, 1, 1, 1, 968*0./
      DATA B/1, 2, 1, 1.22140002, 1.49179995, 1.82210004,
     4  2.22550011, 2.7183001, 0, 0, 1000*0./
      M1 = 2
      M = 8
      N = 4
      L = 1
      IER = 0
      V1 = 0
      V2 = 1.40129846e-45
      IAR(:) = 538976288
      DUM(:) = 1.35631564e-19
      CALL TLSC(A,B,AUX,IPIV,1.,X)
      END
      SUBROUTINE TLSMSQ (B,L,M,F)
      DIMENSION      B(*)
      IF (M.NE.2) CALL ABORT
      STOP 0
      END
      SUBROUTINE TLSWOP (A,AD,N,NR)
      DIMENSION      A(*), AD(*)
      CALL ABORT
      END
      SUBROUTINE TLUK (A,IASEP,NR,SIG,BETA)
      DIMENSION A(*)
      CALL ABORT
      END
      SUBROUTINE TLSTEP (A,B,IASEP,IBSEP,NR,NC,BETA)
      DIMENSION      A(*), B(*)
      CALL ABORT
      END

The miscompiled TLSC calls the first TLSMSQ routine with 8.0 rather than 2.0
as the 3rd argument.
Comment 13 Jakub Jelinek 2008-06-25 10:20:56 UTC
And the miscompiled tlsc.f inline (compile with just -O2):
      SUBROUTINE TLSC (A,B,AUX,IPIV,EPS,X)
      COMMON /TLSDIM/ M1,M,N,L,IER
      COMMON /SLATE/ BETA,H,I,IB,IB1,ID,ID1,IEND,II,IST,J,JA,JB,JK
     +              ,JST,K,KPIV,KR,KST,KT,K1,LV,MR,M11,NK,NR,PIV,PIVT
     +              ,SIG,DUM(11)
      DIMENSION      A(*), AUX(*), B(*), IPIV(*), X(*)
      IF (N.GT.M.OR.M1.GT.N)  GO TO 90
      K1  = MAX (N,L)
      IER = 1
      DO           5         K=1,N
    5 IPIV(K) = K
      IST = - N
      JB  = 1 - L
      M11 = M1 + 1
      MR  = M1
      DO           50        K=1,N
      IF     (K.GT.M1)                 MR = M
      IST = IST + N + 1
      JB  = JB + L
      LV  = MR - K + 1
      PIV = 0.
      ID  = IST - N
      DO           20        J=K,N
      IF     (K.EQ.1 .OR. K.EQ.M11)    GO TO     10
      PIVT = AUX(J) - A(ID)*A(ID)
      GO TO      15
   10 I = ID + N
      IF (LV .EQ. 1) GO TO 12
      CALL TLSMSQ (A(I),N,LV,PIVT)
      GO TO 15
   12 PIVT= A(I)*A(I)
   15 AUX(J) = PIVT
      ID = ID + 1
      IF     (PIVT*EPS.LE.PIV)         GO TO     20
      PIV  = PIVT
      KPIV = J
   20 CONTINUE
      I = KPIV - K
      IF     (I.LE.0)                  GO TO     25
      H = AUX(K)
      AUX(K) = AUX(KPIV)
      AUX(KPIV) = H
      ID = IST + I
      NR = M - K + 1
      CALL TLSWOP (A(IST),A(ID),N,NR)
   25 CALL TLUK (A(IST),N,LV,SIG,BETA)
      IF     (LV.EQ.0)                 GO TO     90
      J = K1 + K
      AUX(J)=-SIG
      IF     (K.GE.N)                  GO TO     30
      NK = N - K
      IF (LV.EQ.1)                     GO TO     27
      CALL TLSTEP (A(IST),A(IST+1),N,N,LV,NK,BETA)
      GO TO        30
   27 DO           28        J=1,NK
      JST = IST + J
   28 A(JST) = A(JST)*(1.-BETA*A(IST)**2)
   30 IB = (K-1) * L + 1
      IF (LV.EQ.1)                     GO TO     32
      CALL TLSTEP (A(IST),B(IB),N,L,LV,L,BETA)
      GO TO      34
   32 DO           33        J=1,L
      JST = IB + J - 1
   33 B(JST) = B(JST)*(1.-BETA*A(IST)**2)
   34 IPIV(KPIV) = IPIV(K)
      IPIV(K) = KPIV
      IF     (K.GT.M1)                 GO TO     50
      DO           45        I=M11,M
      ID1 = IST + (I-K)*N
      IF     (A(ID1).EQ.0)             GO TO     45
      H  = - A(ID1)/SIG
      A(ID1) = H
      ID1 = ID1 + 1
      ID = IST + 1
      DO           35        J=1,NK
      A(ID1) = A(ID1) - H*A(ID)
      ID1 = ID1 + 1
   35 ID  = ID + 1
      IB1 = 1 + (I-1)*L
      IB = JB
      DO           40        J=1,L
      B(IB1) = B(IB1) - H*B(IB)
      IB1 = IB1 + 1
   40 IB  = IB + 1
   45 CONTINUE
   50 CONTINUE
      IER = N * IER
      KT  = N
      JK  = (N-1) * L
      K   = K1 + N
      PIV = 1./AUX(K)
      DO           55        K=1,L
      JK = JK + 1
   55 X(JK) = PIV * B(JK)
      KR = N - 1
      IF     (KR.LE.0)                 GO TO     70
      JST = KR * (N+1) + 2
      DO           65        J=1,KR
      JST = JST - N - 1
      IEND= (KR-J+1) * N
      K   = K1 + KR - J + 1
      PIV = 1./AUX(K)
      KST = K-K1
      ID  = IPIV(KST)-KST
      KST = (KR-J) * L
      DO           65        K=1,L
      KST = KST + 1
      H=B(KST)
      II = KST
      DO           60        I=JST,IEND
      II = II + L
   60 H  = H - A(I) * X(II)
      II = KST + ID *L
      X(KST) = X(II)
      X(II)  = PIV * H
   65 CONTINUE
   70 IST = KT*L
      DO           80        J=1,L
      IST = IST + 1
      H   = 0.
      JA  = IST
      IF     (M.LE.KT)                 GO TO     80
      NR  = M - KT
      IF     (NR.EQ.1)                 GO TO     75
      CALL TLSMSQ (B(IST),L,NR,H)
      GO TO      80
   75 H = B(IST)*B(IST)
   80 AUX(J) = H
      RETURN
   90 IER = -1001
      RETURN
      END

The problem is that slate.k (aka prephitmp.78) is read before it is stored,
so it has the 0x20202020 value instead of 1.
At tlsc.f.198r.compgotos the code still looks correct:
(insn 1857 1434 1443 8 tlsc.f:16 (set (reg:SI 14 r14 [1392])
        (const_int 1 [0x1])) 4 {*movsi_internal} (expr_list:REG_EQUIV (const_int 1 [0x1])
        (nil)))

(insn 1443 1857 242 8 tlsc.f:16 (set (mem/s/c:SI (post_modify:DI (reg/f:DI 53 r59 [1516])
                (plus:DI (reg/f:DI 53 r59 [1516])
                    (const_int -60 [0xffffffffffffffc4]))) [2 slate.k+0 S4 A32])
        (reg:SI 14 r14 [1392])) 4 {*movsi_internal} (expr_list:REG_INC (reg/f:DI 53 r59 [1516])
        (expr_list:REG_EQUAL (const_int 1 [0x1])
            (nil))))
...
(insn 197 228 247 8 tlsc.f:17 (set (reg:DI 18 r18)
        (zero_extend:DI (mem/s/c:SI (reg/f:DI 38 r44 [1517]) [2 slate.k+0 S4 A32]))) 103 {zero_extendsidi2} (expr_list:REG_EQUAL 
(mem/s/c:SI (const:DI (plus:DI (symbol_ref:DI ("slate_") <var_decl 0x2000000003c54c80 slate>)
                    (const_int 60 [0x3c]))) [2 slate.k+0 S4 A32])
        (nil)))

(insn 247 197 198 8 tlsc.f:22 (set (reg:SI 19 r19 [665])
        (minus:SI (reg:SI 20 r20 [orig:560 D.775 ] [560])
            (reg:SI 32 r38 [orig:529 prephitmp.68 ] [529]))) 165 {subsi3} (nil))

(insn 198 247 253 8 tlsc.f:17 (set (reg:BI 262 p6 [626])
        (le:BI (reg:SI 18 r18 [orig:522 prephitmp.78 ] [522])
            (reg:SI 22 r22 [orig:459 prephitmp.258 ] [459]))) 298 {*cmpsi_normal} (nil))
but tlsc.f.200r.mach is already wrong:
(insn:TI 197 2150 1443 8 tlsc.f:17 (set (reg:DI 18 r18)
        (zero_extend:DI (unspec:SI [
                    (mem/s/c:SI (reg/f:DI 38 r44 [1517]) [2 slate.k+0 S4 A32])
                ] 40))) 17 {zero_extendsidi2_advanced} (expr_list:REG_EQUAL (mem/s/c:SI (const:DI (plus:DI (symbol_ref:DI ("slate_") <var_decl 0x2000000003c54c80 slate>)
                    (const_int 60 [0x3c]))) [2 slate.k+0 S4 A32])
        (nil)))

(insn 1443 197 1427 8 tlsc.f:16 (set (mem/s/c:SI (post_modify:DI (reg/f:DI 53 r59 [1516])
                (plus:DI (reg/f:DI 53 r59 [1516])
                    (const_int -60 [0xffffffffffffffc4]))) [2 slate.k+0 S4 A32])
        (reg:SI 14 r14 [1392])) 4 {*movsi_internal} (expr_list:REG_DEAD (reg:SI 14 r14 [1392])
        (expr_list:REG_INC (reg/f:DI 53 r59 [1516])
            (expr_list:REG_EQUAL (const_int 1 [0x1])
                (nil)))))
...
(insn:TI 198 2148 223 8 tlsc.f:17 (set (reg:BI 262 p6 [626])
        (le:BI (reg:SI 18 r18 [orig:522 prephitmp.78 ] [522])
            (reg:SI 22 r22 [orig:459 prephitmp.258 ] [459]))) 298 {*cmpsi_normal} (nil))

Note that the scheduler swapped the slate_.k = 1 store with prephitmp.78 = slate_.k read.  The same can be seen in the assembly:
        ld4.a r18 = [r44]       //, slate.k
[.LBE19:]
        .loc 1 16 0
        st4 [r59] = r14, -60    // slate.k, tmp1392
where r44 == r59.
(gdb) disas $pc $pc+16
Dump of assembler code from 0x4000000000000ec0 to 0x4000000000000ed0:
0x4000000000000ec0 <tlsc+512>:  [MMI]       ld4.a r18=[r44]
0x4000000000000ec1 <tlsc+513>:              st4 [r59]=r14,-60
0x4000000000000ec2 <tlsc+514>:              adds r16=48,r41
End of assembler dump.
(gdb) p/x $r44
$3 = 0x6000000000004e4c
(gdb) p/x $r59
$4 = 0x6000000000004e4c

insn 197 is the only ld4.a in tlsc_ routine.
Comment 14 Jakub Jelinek 2008-06-25 11:31:14 UTC
I have no idea why is speculation even attempted here (it doesn't make any sense,
the pointer is surely non-NULL, it points to a global variable), and apparently nothing checks whether it is safe to move over the speculative load over
the store (at least, I've put a breakpoint on nonoverlapping_memrefs_p
and {{,canon_}true,anti,output}_dependence and none of them hit with any MEMs
with r44 or POST_MODIFY r59, -60.  Maxim, speculation is your baby, could you please have a look?
Comment 15 Jakub Jelinek 2008-06-25 11:32:50 UTC
Wrong-code bug on secondary arch.
Comment 16 Maxim Kuvyrkov 2008-06-25 21:33:24 UTC
I can't reproduce the error with today mainline.  When I put in one file 'PROGRAM PR35659' and 'SUBROUTINE TLSC (A,B,AUX,IPIV,EPS,X)' and compile it with any optimization level I get the same "STOP 0" message.  Am I doing something wrong?

I can't spot the problem in the dumps you posted at debian.org.  tlsc.s has a single example of data speculation which seems to be fine.  Scheduler speculatively moves
	ld4.a r18 = [r44]
before
	st4 [r59] = r14, -60
and then also speculates several uses of r18:
	cmp4.ge p6, p7 = r22, r18
...
	(p7) ld4 r14 = [r62]
...
	(p7) st4 [r77] = r14
then it checks the speculation:
	chk.a.clr r18, .L69
and recovers if speculation failed:
.L69:
	.mmi
	ld4 r18 = [r44]
	;;
	cmp4.ge p6, p7 = r22, r18
	nop 0
	;;
	.mmi
	nop 0
	(p7) ld4 r14 = [r62]
	nop 0
	;;
	.mib
	(p7) st4 [r77] = r14
	nop 0
	br .L70

Anyway, can you help me reproduce the issue, so I can take a closer look?
Comment 17 Matthias Klose 2008-06-25 22:16:05 UTC
Subject: Re:  [4.3/4.4 Regression] Miscompiled code with -O2 (but not with -O2 -funroll-loops) on ia64

mkuvyrkov at gcc dot gnu dot org writes:
> Anyway, can you help me reproduce the issue, so I can take a closer look?

please email me a ssh key, if access to a Debian machine would help.
Comment 18 Jakub Jelinek 2008-06-26 08:01:27 UTC
You can reproduce it with a cross compiler too.
Just use 4.3 branch (IMHO unrelated changes on the trunk made this bug latent)
and don't combine everything into the same file.
The #c12 program+routines have to be in one file, #c13 in a different one.
The latter is miscompiled, if you are on ia64, you can compile the former
with any optimization options, the latter with -O2, link, run.
If you have just a cross compiler, just compile the #c13 source and inspect the
assembly (look at ld4.a:
        addl r41 = @ltoff(slate_#), r1
...
        adds r14 = 60, r41
...
        .mmi
        mov r44 = r14
        mov r50 = r15
        mov r59 = r14
... ! Instructions that don't modify r44 nor r59, no labels
        .mmi
        ld4.a r18 = [r44]      ! *r44 aka slate_.k is uninitialized here, 0x20202020
        st4 [r59] = r14, -60   ! This stores 1 into slate_.k
        adds r16 = 48, r41
and look at the *.compgotos dump that right before scheduling that slate_.k = 1
preceeded the prephitmp.78 (== r18) = slate_.k load, while after scheduling
it is the other way around.  The alias set looks correct for both MEMs (4) and
MEM_EXPR/MEM_OFFSET etc. too, so IMHO if alias.c was asked if the two MEMs can overlap, it would certainly say so.
Comment 19 Jakub Jelinek 2008-06-26 08:41:25 UTC
To be more precise, the problem is in speculating conditional store:
        ld4.a r18 = [r44]
        st4 [r59] = r14, -60
...
        cmp4.ge p6, p7 = r22, r18
        (p7) ld4 r14 = [r62]
...
        (p7) st4 [r77] = r14
        chk.a.clr r18, .L69
.L70:
...
.L69:
        ld4 r18 = [r44]
        ;;
        cmp4.ge p6, p7 = r22, r18
        ;;
        (p7) ld4 r14 = [r62]
        ;;
        (p7) st4 [r77] = r14
        br .L70

Now, ld4.a returns the stale value in r18 (0x20202020), $r22 is 2 and so in the code before chk.a.clr p7 is 1, so [r62] is loaded into r14 and stored into [r77]
(i.e. MR = M in IF     (K.GT.M1)                 MR = M is executed).
Then chk.a.clr realizes the [r44] memory changed, so branches to .L69 to do the speculated stuff again.  This time r18 is 1, so p7 will be 0 and MR = M is not
done.  But this really doesn't and can't undo the st4 [r77] = r14 store that
already happened before chk.a.clr and wasn't supposed to happen.
Comment 20 Maxim Kuvyrkov 2008-06-26 09:21:10 UTC
Subject: Re:  [4.3/4.4 Regression] Miscompiled code with
 -O2 (but not with -O2 -funroll-loops) on ia64

jakub at gcc dot gnu dot org wrote:
> ------- Comment #19 from jakub at gcc dot gnu dot org  2008-06-26 08:41 -------
> To be more precise, the problem is in speculating conditional store:
>         ld4.a r18 = [r44]
>         st4 [r59] = r14, -60
> ...
>         cmp4.ge p6, p7 = r22, r18
>         (p7) ld4 r14 = [r62]
> ...
>         (p7) st4 [r77] = r14
>         chk.a.clr r18, .L69
> .L70:
> ...
> .L69:
>         ld4 r18 = [r44]
>         ;;
>         cmp4.ge p6, p7 = r22, r18
>         ;;
>         (p7) ld4 r14 = [r62]
>         ;;
>         (p7) st4 [r77] = r14
>         br .L70
> 
> Now, ld4.a returns the stale value in r18 (0x20202020), $r22 is 2 and so in the
> code before chk.a.clr p7 is 1, so [r62] is loaded into r14 and stored into
> [r77]
> (i.e. MR = M in IF     (K.GT.M1)                 MR = M is executed).
> Then chk.a.clr realizes the [r44] memory changed, so branches to .L69 to do the
> speculated stuff again.  This time r18 is 1, so p7 will be 0 and MR = M is not
> done.  But this really doesn't and can't undo the st4 [r77] = r14 store that
> already happened before chk.a.clr and wasn't supposed to happen.

Oh, now I see the problem.  I need some time to check what the best fix 
would be.  Probably, the fix will be to exclude predicated instructions 
from speculation set, but I'd like to consider alternatives.


Thanks,

Maxim
Comment 21 Maxim Kuvyrkov 2008-06-26 09:23:37 UTC
Assign to self
Comment 22 Steve Ellcey 2008-08-01 23:28:01 UTC
Maxim, have you had time to look at this bug?  Given that it is generating bad code and is in 4.3.0 and 4.3.1 I was wondering if it will be fixed for 4.3.2.
Comment 23 Maxim Kuvyrkov 2008-08-02 18:15:12 UTC
(In reply to comment #22)
> Maxim, have you had time to look at this bug?  Given that it is generating bad
> code and is in 4.3.0 and 4.3.1 I was wondering if it will be fixed for 4.3.2.

Sorry for the delay.  I posted the fix at http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00112.html.

I would appreciate if someone could test the cumulative patch of three fixes for ia64 speculation support or provide a single-file executable testcase for this bug.  Here are links to two other bugfixes for ia64 speculation support:
http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00110.html
http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00111.html
Comment 24 Steve Ellcey 2008-08-04 17:34:42 UTC
I bootstrapped the 3 patches on mainline and 4.3 branch and verified that they fix the problem that is reported on the 4.3 branch with the patches.
Comment 25 Maxim Kuvyrkov 2008-08-06 06:25:12 UTC
Subject: Bug 35659

Author: mkuvyrkov
Date: Wed Aug  6 06:23:47 2008
New Revision: 138759

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=138759
Log:
	PR target/35659
	* haifa-sched.c (sched_insn_is_legitimate_for_speculation_p): Move ...
	* sched-deps.c (sched_insn_is_legitimate_for_speculation_p): ... here.
	Don't allow predicated instructions for data speculation.
	* sched-int.h (sched_insn_is_legitimate_for_speculation_p): Move
	declaration.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/haifa-sched.c
    trunk/gcc/sched-deps.c
    trunk/gcc/sched-int.h

Comment 26 Maxim Kuvyrkov 2008-08-06 06:36:24 UTC
Subject: Bug 35659

Author: mkuvyrkov
Date: Wed Aug  6 06:34:18 2008
New Revision: 138762

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=138762
Log:
	Backport from mainline:
	2008-08-06  Maxim Kuvyrkov  <maxim@codesourcery.com>

	PR target/35659
	* haifa-sched.c (sched_insn_is_legitimate_for_speculation_p): Move ...
	* sched-deps.c (sched_insn_is_legitimate_for_speculation_p): ... here.
	Don't allow predicated instructions for data speculation.
	* sched-int.h (sched_insn_is_legitimate_for_speculation_p): Move
	declaration.

Modified:
    branches/gcc-4_3-branch/gcc/ChangeLog
    branches/gcc-4_3-branch/gcc/haifa-sched.c
    branches/gcc-4_3-branch/gcc/sched-deps.c
    branches/gcc-4_3-branch/gcc/sched-int.h

Comment 27 Maxim Kuvyrkov 2008-08-06 06:38:50 UTC
Should be fixed on both trunk and 4_3-branch.
Comment 28 Paolo Bonzini 2009-02-06 07:33:21 UTC
Subject: Bug 35659

Author: bonzini
Date: Fri Feb  6 07:33:05 2009
New Revision: 143980

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=143980
Log:
2009-02-06  Paolo Bonzini  <bonzini@gnu.org>

	PR tree-optimization/35659
	* tree-ssa-sccvn.c (vn_constant_eq, vn_reference_eq, vn_nary_op_eq
	vn_phi_eq): Shortcut if hashcode does not match.
	(vn_reference_op_compute_hash): Do not call iterative_hash_expr for
	NULL operands.
	* tree-ssa-pre.c (pre_expr_hash): Look at hashcode if available,
	and avoid iterative_hash_expr.
	(FOR_EACH_VALUE_ID_IN_SET): New.
	(value_id_compare): Remove.
	(sorted_array_from_bitmap_set): Use FOR_EACH_VALUE_ID_IN_SET to
	sort expressions by value id.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-pre.c
    trunk/gcc/tree-ssa-sccvn.c