Summary: | malloc alignment and -mavx | ||
---|---|---|---|
Product: | gcc | Reporter: | Joost VandeVondele <Joost.VandeVondele> |
Component: | middle-end | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | WAITING --- | ||
Severity: | normal | CC: | guillaume, Joost.VandeVondele, neleai |
Priority: | P3 | ||
Version: | 4.9.2 | ||
Target Milestone: | --- | ||
Host: | Target: | ||
Build: | Known to work: | ||
Known to fail: | Last reconfirmed: | 2015-11-04 00:00:00 |
Description
Joost VandeVondele
2014-12-10 05:50:45 UTC
Are you sure you are not using uninitialized memory? try using the various -fsanitize flags (not sure if uninit memory use is in =address or =undefined). (In reply to Richard Biener from comment #1) > Are you sure you are not using uninitialized memory? try using the various > -fsanitize flags (not sure if uninit memory use is in =address or > =undefined). yes, tried that, also clean (in addition to valgrind). (In reply to Richard Biener from comment #1) > Are you sure you are not using uninitialized memory? try using the various > -fsanitize flags (not sure if uninit memory use is in =address or > =undefined). as a ps, uninit memory use in is neither, that would require memsan, which is not ported to gcc. I think we isolated the issue. With -mavx the numerical results seem to depend on the precise alignment of allocated data from malloc, i.e. 16 or 32 (..E0 or F0) byte are not equivalent. As a result, the results from our code fluctuate randomly from run to run. Does that make sense ? The following is a test program that illustrates the issue: > cat test.f90 SUBROUTINE gemm(C,A,B,N) REAL*8 :: A(N,N), B(N,N),C(N,N) C=0 DO i=1,N DO j=1,N DO k=1,N C(i,j)=C(i,j)+A(k,i)*B(k,j) ENDDO ENDDO ENDDO END SUBROUTINE ! ! simple test driver, computing the same matrix multiplication ! with the same data, but with arrays aligned differently, ! once at x, once at x+16 bytes. ! Illustrates that results will vary, ! depending randomly on what alignment malloc returns. ! PROGRAM TEST REAL*8, DIMENSION(:,:), POINTER, CONTIGUOUS :: A,B,C REAL*8, DIMENSION(:), POINTER, CONTIGUOUS :: Av1,Bv1,Cv1 REAL*8, DIMENSION(:), POINTER, CONTIGUOUS :: Av2,Bv2,Cv2 REAL*8, DIMENSION(:), POINTER, CONTIGUOUS :: Av3,Bv3,Cv3 INTEGER :: N READ(5,*) N ALLOCATE(Av1(N*N+2),Av2(N*N+2),Bv1(N*N+2), & Bv2(N*N+2),Cv1(N*N+2),Cv2(N*N+2)) CALL RANDOM_NUMBER(Av1) CALL RANDOM_NUMBER(Bv1) ! try one: results as if allocated at boundary x Av2(1:N*N)=Av1(1:N*N) Bv2(1:N*N)=Bv1(1:N*N) A(1:N,1:N)=>Av2(1:N*N) B(1:N,1:N)=>Bv2(1:N*N) C(1:N,1:N)=>Cv2(1:N*N) CALL gemm(C,A,B,N) Cv1(1:N*N)=Cv2(1:N*N) ! try two: results as if allocated at boundary x+16 Av2(1+2:N*N+2)=Av1(1:N*N) Bv2(1+2:N*N+2)=Bv1(1:N*N) A(1:N,1:N)=>Av2(1+2:N*N+2) B(1:N,1:N)=>Bv2(1+2:N*N+2) C(1:N,1:N)=>Cv2(1+2:N*N+2) CALL gemm(C,A,B,N) IF (ANY(Cv1(1:N*N).NE.Cv2(1+2:N*N+2))) CALL ABORT() END PROGRAM TEST > gfortran -g -march=sandybridge -mavx -O3 -ffast-math test.f90 && echo 10 | ./a.out Program aborted. Backtrace: #0 0x7F5C53950387 #1 0x7F5C53951A72 #2 0x7F5C53A23378 #3 0x4012D6 in test at test.f90:48 Aborted some similarity with the problem discussed PR55916, except that this case doesn't require __float128 That looks like invalid bug. Fortran allows reassociate a+(b+c) into (a+b)+c which give different result. You will get same instability if you compile program with different versions of gcc. (In reply to Ondrej Bilka from comment #7) > That looks like invalid bug. Fortran allows reassociate a+(b+c) into (a+b)+c > which give different result. You will get same instability if you compile > program with different versions of gcc. sure (well actually you mean it can evaluate a+b+c in either way, if there are parens the order is set). However, the issue here is that the same binary will produce different results from run to run.. that seems not OK. A variation on the testcase, to indicate how this behavior leads to conflicts with the Fortran language standard. A routine declared 'PURE' and called with all intent(in) arguments having the same value, returns different results. > gfortran -g -mavx -O3 -ffast-math test.f90 && echo 8 | ./a.out pure routine gives different results at iter: 1 pure routine gives different results at iter: 3 pure routine gives different results at iter: 5 pure routine gives different results at iter: 7 pure routine gives different results at iter: 9 > cat test.f90 MODULE M1 CONTAINS PURE SUBROUTINE gemm(C,A,B,N) INTEGER, INTENT(IN) :: N REAL*8, INTENT(IN) :: A(N,N), B(N,N) REAL*8, INTENT(OUT) :: C(N,N) C=0 DO i=1,N DO j=1,N DO k=1,N C(i,j)=C(i,j)+A(k,i)*B(k,j) ENDDO ENDDO ENDDO END SUBROUTINE gemm END MODULE PROGRAM TEST USE M1 REAL*8, DIMENSION(:,:), POINTER, CONTIGUOUS :: A1,B1,C1 REAL*8, DIMENSION(:,:), POINTER, CONTIGUOUS :: A2,B2,C2 INTEGER :: N READ(5,*) N ALLOCATE(A1(N,N),B1(N,N),C1(N,N)) CALL RANDOM_NUMBER(A1) CALL RANDOM_NUMBER(B1) CALL GEMM(C1,A1,B1,N) DO I=1,10 ALLOCATE(A2(N,N),B2(N,N),C2(N,N)) A2=A1 ; B2=B1 CALL GEMM(C2,A2,B2,N) IF (ANY(C1.NE.C2)) THEN WRITE(6,*) "pure routine gives different results at iter: ",i ENDIF ENDDO END PROGRAM TEST A few comments: (1) Why do you want to use PURE in this context? (2) On x86_64-apple-darwin14 I have run the test in comment 9 with 4.8 up to trunk (6.0), various options, and various values of N without getting the message "pure routine gives different results at iter: ... ". IIRC the default alignment on darwin is bigger than the one in linux. (3) It may help to have pr68101 implemented. (4) I agree with comment 7 that this PR is invalid: although I am not an expert in vectorization, I understand that the vectorization may depend on the alignment (peeling, versioning, ... ), thus I expect that the round-off error may depend on the alignment. If I add a line print *, maxval(abs(Cv1(1:N*N)-Cv2(1+2:N*N+2)))/epsilon(Cv1) before 'IF (ANY(Cv1(1:N*N).NE.Cv2(1+2:N*N+2))) CALL ABORT()', I get 4 for N=10, 48 for N=100, and 768 for N=1000, roughly what I expect for round-off errors. (In reply to Dominique d'Humieres from comment #10) > A few comments: > > (1) Why do you want to use PURE in this context? because this is a pure procedure ? Comment 7 is not too the point (indeed reassociation can happen) as explained afterwards The bug here is that the same (serial) binary gives different results from run to run, as soon as the user enables avx, and there is no way around. I agree that it would be fixed if there was an option as askin in pr68101. (and yes, this could help with performance as well). |