Bug 64247

Summary: malloc alignment and -mavx
Product: gcc Reporter: Joost VandeVondele <Joost.VandeVondele>
Component: middle-endAssignee: Not yet assigned to anyone <unassigned>
Status: WAITING ---    
Severity: normal CC: guillaume, Joost.VandeVondele, neleai
Priority: P3    
Version: 4.9.2   
Target Milestone: ---   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2015-11-04 00:00:00

Description Joost VandeVondele 2014-12-10 05:50:45 UTC
This is a bit an odd bug report, as it so far is just a somewhat worrying observation:

If I run our simulation package, the output appears to change if we pipe (|) an output to file via tee or if we direct the output (>) to file. This can be reproduced with setting the flag GFORTRAN_UNBUFFERED_ALL as in :

> GFORTRAN_UNBUFFERED_ALL=0  ../../../exe/local/cp2k.sopt O_KG.inp 
POWELL| Number of function evaluations 3658

or

> GFORTRAN_UNBUFFERED_ALL=1  ../../../exe/local/cp2k.sopt O_KG.inp 
 POWELL| Number of function evaluations   10000

once the code converges (3658), once not.

If I run the same code under valgrind, I get no warnings/errors, but the iteration count changes again to the old value.

 GFORTRAN_UNBUFFERED_ALL=1  valgrind ../../../exe/local/cp2k.sopt O_KG.inp 
 POWELL| Number of function evaluations   3658
[..]
==453== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 5)

This is all pretty reproducible on that particular machine. Only with gcc 4.9.2, not with earlier versions. In fact, if I pick up the old fortran runtime libs (4.8.X) (but same cp2k binary) the problem seems to disappear.

However, I believe that this odd behavior with GFORTRAN_UNBUFFERED_ALL is maybe coincidental. Indeed, our nightly tester can change value 'randomly' and it looks like the result depends on the environment, in a surprising way.

Any suggestions ? 

The only wild guess I have so far is that, maybe, code flow is influenced by the alignment of buffers returned by malloc, but I won't know how to prove that ?
Comment 1 Richard Biener 2014-12-10 12:27:06 UTC
Are you sure you are not using uninitialized memory?  try using the various
-fsanitize flags (not sure if uninit memory use is in =address or =undefined).
Comment 2 Joost VandeVondele 2014-12-10 12:28:42 UTC
(In reply to Richard Biener from comment #1)
> Are you sure you are not using uninitialized memory?  try using the various
> -fsanitize flags (not sure if uninit memory use is in =address or
> =undefined).

yes, tried that, also clean (in addition to valgrind).
Comment 3 Joost VandeVondele 2014-12-10 12:30:30 UTC
(In reply to Richard Biener from comment #1)
> Are you sure you are not using uninitialized memory?  try using the various
> -fsanitize flags (not sure if uninit memory use is in =address or
> =undefined).

as a ps, uninit memory use in is neither, that would require memsan, which is not ported to gcc.
Comment 4 Joost VandeVondele 2014-12-11 16:29:10 UTC
I think we isolated the issue. With -mavx the numerical results seem to depend on the precise alignment of allocated data from malloc, i.e. 16 or 32  (..E0 or F0) byte are not equivalent. As a result, the results from our code fluctuate randomly from run to run. Does that make sense ?
Comment 5 Joost VandeVondele 2014-12-11 19:22:14 UTC
The following is a test program that illustrates the issue:

> cat test.f90 
SUBROUTINE gemm(C,A,B,N)
 REAL*8 :: A(N,N), B(N,N),C(N,N)
 C=0
 DO i=1,N
  DO j=1,N
   DO k=1,N     
    C(i,j)=C(i,j)+A(k,i)*B(k,j)
   ENDDO
  ENDDO
 ENDDO
END SUBROUTINE

!
! simple test driver, computing the same matrix multiplication 
! with the same data, but with arrays aligned differently, 
! once at x, once at x+16 bytes.
! Illustrates that results will vary,
! depending randomly on what alignment malloc returns.
!
PROGRAM TEST
REAL*8, DIMENSION(:,:), POINTER, CONTIGUOUS :: A,B,C
REAL*8, DIMENSION(:), POINTER, CONTIGUOUS :: Av1,Bv1,Cv1
REAL*8, DIMENSION(:), POINTER, CONTIGUOUS :: Av2,Bv2,Cv2
REAL*8, DIMENSION(:), POINTER, CONTIGUOUS :: Av3,Bv3,Cv3
INTEGER :: N
READ(5,*) N
ALLOCATE(Av1(N*N+2),Av2(N*N+2),Bv1(N*N+2), &
         Bv2(N*N+2),Cv1(N*N+2),Cv2(N*N+2))
CALL RANDOM_NUMBER(Av1)
CALL RANDOM_NUMBER(Bv1)

! try one: results as if allocated at boundary x
Av2(1:N*N)=Av1(1:N*N)
Bv2(1:N*N)=Bv1(1:N*N)
A(1:N,1:N)=>Av2(1:N*N)
B(1:N,1:N)=>Bv2(1:N*N)
C(1:N,1:N)=>Cv2(1:N*N)
CALL gemm(C,A,B,N)
Cv1(1:N*N)=Cv2(1:N*N)

! try two: results as if allocated at boundary x+16
Av2(1+2:N*N+2)=Av1(1:N*N)
Bv2(1+2:N*N+2)=Bv1(1:N*N)
A(1:N,1:N)=>Av2(1+2:N*N+2)
B(1:N,1:N)=>Bv2(1+2:N*N+2)
C(1:N,1:N)=>Cv2(1+2:N*N+2)
CALL gemm(C,A,B,N)
IF (ANY(Cv1(1:N*N).NE.Cv2(1+2:N*N+2))) CALL ABORT()

END PROGRAM TEST

> gfortran -g -march=sandybridge -mavx -O3 -ffast-math test.f90 && echo 10 | ./a.out

Program aborted. Backtrace:
#0  0x7F5C53950387
#1  0x7F5C53951A72
#2  0x7F5C53A23378
#3  0x4012D6 in test at test.f90:48
Aborted
Comment 6 Joost VandeVondele 2014-12-11 20:26:22 UTC
some similarity with the problem discussed PR55916, except that this case doesn't require __float128
Comment 7 Ondrej Bilka 2014-12-11 20:51:35 UTC
That looks like invalid bug. Fortran allows reassociate a+(b+c) into (a+b)+c which give different result. You will get same instability if you compile program with different versions of gcc.
Comment 8 Joost VandeVondele 2014-12-11 20:54:04 UTC
(In reply to Ondrej Bilka from comment #7)
> That looks like invalid bug. Fortran allows reassociate a+(b+c) into (a+b)+c
> which give different result. You will get same instability if you compile
> program with different versions of gcc.

sure (well actually you mean it can evaluate a+b+c in either way, if there are parens the order is set). However, the issue here is that the same binary will produce different results from run to run.. that seems not OK.
Comment 9 Joost VandeVondele 2014-12-11 21:17:54 UTC
A variation on the testcase, to indicate how this behavior leads to conflicts with the Fortran language standard. A routine declared 'PURE' and called with all intent(in) arguments having the same value, returns different results.

> gfortran -g -mavx -O3 -ffast-math test.f90 && echo 8 | ./a.out
 pure routine gives different results at iter:            1
 pure routine gives different results at iter:            3
 pure routine gives different results at iter:            5
 pure routine gives different results at iter:            7
 pure routine gives different results at iter:            9

> cat test.f90
MODULE M1
CONTAINS
  PURE SUBROUTINE gemm(C,A,B,N)
   INTEGER, INTENT(IN) :: N
   REAL*8, INTENT(IN)  :: A(N,N), B(N,N)
   REAL*8, INTENT(OUT) :: C(N,N)
   C=0
   DO i=1,N
    DO j=1,N
     DO k=1,N     
      C(i,j)=C(i,j)+A(k,i)*B(k,j)
     ENDDO
    ENDDO
   ENDDO
  END SUBROUTINE gemm
END MODULE

PROGRAM TEST
USE M1
REAL*8, DIMENSION(:,:), POINTER, CONTIGUOUS :: A1,B1,C1
REAL*8, DIMENSION(:,:), POINTER, CONTIGUOUS :: A2,B2,C2
INTEGER :: N
READ(5,*) N
ALLOCATE(A1(N,N),B1(N,N),C1(N,N))
CALL RANDOM_NUMBER(A1)
CALL RANDOM_NUMBER(B1)
CALL GEMM(C1,A1,B1,N)

DO I=1,10
  ALLOCATE(A2(N,N),B2(N,N),C2(N,N))
  A2=A1 ; B2=B1
  CALL GEMM(C2,A2,B2,N)
  IF (ANY(C1.NE.C2)) THEN
     WRITE(6,*) "pure routine gives different results at iter: ",i
  ENDIF
ENDDO

END PROGRAM TEST
Comment 10 Dominique d'Humieres 2015-11-04 14:08:53 UTC
A few comments:

(1) Why do you want to use PURE in this context?

(2) On x86_64-apple-darwin14 I have run the test in comment 9 with 4.8 up to trunk (6.0), various options, and various values of N without getting the message "pure routine gives different results at iter: ... ". IIRC the default alignment on darwin is bigger than the one in linux.

(3) It may help to have pr68101 implemented.

(4) I agree with comment 7 that this PR is invalid: although I am not an expert in vectorization, I understand that the vectorization may depend on the alignment (peeling, versioning, ... ), thus I expect that the round-off error may depend on the alignment. If I add a line

print *, maxval(abs(Cv1(1:N*N)-Cv2(1+2:N*N+2)))/epsilon(Cv1)

before 'IF (ANY(Cv1(1:N*N).NE.Cv2(1+2:N*N+2))) CALL ABORT()', I get 4 for N=10, 48 for N=100, and 768 for N=1000, roughly what I expect for round-off errors.
Comment 11 Joost VandeVondele 2015-11-04 14:49:43 UTC
(In reply to Dominique d'Humieres from comment #10)
> A few comments:
> 
> (1) Why do you want to use PURE in this context?

because this is a pure procedure ? 

Comment 7 is not too the point (indeed reassociation can happen) as explained afterwards

The bug here is that the same (serial) binary gives different results from run to run, as soon as the user enables avx, and there is no way around.

I agree that it would be fixed if there was an option as askin in pr68101. (and yes, this could help with performance as well).