The SSE vectorizer seems to miss one of the simplest cases:

#include <cstdio>
#include <cstdlib>

double loop(double a, size_t n){
  // initialise differently so the compiler doesn't simplify
  double sum1=0.1, sum2=0.2, sum3=0.3, sum4=0.4, sum5=0.5, sum6=0.6;
  for(size_t i=0; i<n; i++){
    sum1+=a; sum2+=a; sum3+=a;
    sum4+=a; sum5+=a; sum6+=a;
  }
  return sum1+sum2+sum3+sum4+sum5+sum6-2.1-6.0*a*n;
}

int main(int argc, char** argv)
{
  size_t n=1000000;
  double a=1.1;
  printf("res=%f\n", loop(a,n));
  return EXIT_SUCCESS;
}

g++-4.6.2 -Wall -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 test.cpp

test.cpp:7: note: not vectorized: unsupported use in stmt.
test.cpp:4: note: vectorized 0 loops in function.

We get six addsd operations, whereas an optimisation should have given us three addpd operations:

.L3:
        addq    $1, %rax
        addsd   %xmm0, %xmm6
        cmpq    %rdi, %rax
        addsd   %xmm0, %xmm5
        addsd   %xmm0, %xmm4
        addsd   %xmm0, %xmm3
        addsd   %xmm0, %xmm2
        addsd   %xmm0, %xmm1
        jne     .L3
You need -ffast-math to allow floating point reduction. You also need -fno-vect-cost-model, because the vectorization is not profitable in this case.
(In reply to comment #1)

g++-4.6.2 -S -Wall -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 \
    -ffast-math -fno-vect-cost-model

gives me exactly the same assembly code as above (which surprises me a bit, as -funsafe-math-optimizations might as well have eliminated the loop completely).

The optimal assembly, however, I would expect to be something like:

.L3:
        addq    $1, %rax
        addpd   %xmm0, %xmm3
        cmpq    %rdi, %rax
        addpd   %xmm0, %xmm2
        addpd   %xmm0, %xmm1
        jne     .L3

where the vector (sum1,sum2) is stored in xmm1, (sum3,sum4) in xmm2, etc., and (a,a) in xmm0. This speeds it up by a factor of 2 and is completely equivalent to the scalar case, so I don't see why -ffast-math (which implies -funsafe-math-optimizations) should be necessary in this case, either.
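To make the expected transformation concrete, here is a minimal hand-vectorized sketch of the loop using SSE2 intrinsics (my own illustration, not GCC output; it assumes an x86-64/SSE2 target). Each 128-bit accumulator carries two of the six scalar sums, so each lane performs exactly the same sequence of additions as the corresponding scalar variable:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Hand-vectorized sketch of loop(): three packed accumulators replace the
// six scalar ones; each iteration does three addpd instead of six addsd.
double loop_vec(double a, size_t n) {
    __m128d va  = _mm_set1_pd(a);          // (a, a)
    __m128d s12 = _mm_set_pd(0.2, 0.1);    // lanes: (sum1, sum2)
    __m128d s34 = _mm_set_pd(0.4, 0.3);    // (sum3, sum4)
    __m128d s56 = _mm_set_pd(0.6, 0.5);    // (sum5, sum6)
    for (size_t i = 0; i < n; i++) {
        s12 = _mm_add_pd(s12, va);
        s34 = _mm_add_pd(s34, va);
        s56 = _mm_add_pd(s56, va);
    }
    double t[6];
    _mm_storeu_pd(t + 0, s12);
    _mm_storeu_pd(t + 2, s34);
    _mm_storeu_pd(t + 4, s56);
    // Same left-to-right summation order as the scalar return statement.
    return t[0] + t[1] + t[2] + t[3] + t[4] + t[5] - 2.1 - 6.0 * a * (double)n;
}
```

Because each lane reproduces one scalar accumulator's operations in the same order, the result is bitwise identical to the scalar version, which is the argument above for why no unsafe-math flag should be needed here.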
It gets vectorized with 4.7. I guess it's due to this 4.7 patch: http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00620.html.
Looks like there has been some great progress in GCC 4.7! Still, I think it behaves slightly buggily.

(1) In this case it should work without -funsafe-math-optimizations, but it doesn't. GCC 4.7 requires -fno-signed-zeros -fno-trapping-math -fassociative-math to make it work.

(2) The prediction

7: not vectorized: vectorization not profitable.

is just wrong. Forcing it with -fno-vect-cost-model shows it speeds up by a factor of 2.

(3) If I change all doubles into floats in the code above, it seems to work without forcing it (-fno-vect-cost-model):

g++-4.7 -S -Wall -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 \
    -funsafe-math-optimizations test.cpp

Analyzing loop at test.cpp:7

Vectorizing loop at test.cpp:7

7: vectorizing stmts using SLP.
7: LOOP VECTORIZED.
test.cpp:4: note: vectorized 1 loops in function.

However, it hasn't vectorized it at all, as the assembly shows:

.L11:
        addq    $1, %rax
        addss   %xmm0, %xmm3
        cmpq    %rax, %rdi
        addss   %xmm0, %xmm4
        addss   %xmm0, %xmm7
        addss   %xmm0, %xmm6
        addss   %xmm0, %xmm5
        addss   %xmm0, %xmm1
        ja      .L11
(In reply to comment #4)
> Looks like there has been some great progress in gcc 4.7!
>
> Still I think it behaves slightly buggy.
>
> (1) In this case it should work without -funsafe-math-optimizations but
> it doesn't. gcc 4.7 requires -fno-signed-zeros -fno-trapping-math
> -fassociative-math to make it work.

It's a reduction: when we vectorize, we change the order of computation. In order to be able to do that for floating point we need flag_associative_math.

> (2) The prediction:
> 7: not vectorized: vectorization not profitable.
> is just wrong. Forcing it with -fno-vect-cost-model shows it speeds up
> by factor of 2.
>
> (3) If I change all double's into float's in the code above it seems to
> work without forcing it (-fno-vect-cost-model):
>
> g++-4.7 -S -Wall -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 \
>     -funsafe-math-optimizations test.cpp
>
> Analyzing loop at test.cpp:7
>
> Vectorizing loop at test.cpp:7
>
> 7: vectorizing stmts using SLP.
> 7: LOOP VECTORIZED.
> test.cpp:4: note: vectorized 1 loops in function.
>
> However, it hasn't vectorized it at all as the assembly shows:
>
> .L11:
>         addq    $1, %rax
>         addss   %xmm0, %xmm3
>         cmpq    %rax, %rdi
>         addss   %xmm0, %xmm4
>         addss   %xmm0, %xmm7
>         addss   %xmm0, %xmm6
>         addss   %xmm0, %xmm5
>         addss   %xmm0, %xmm1
>         ja      .L11

I think you are looking at the scalar epilogue. The number of iterations is unknown, so we need an epilogue loop for the case that the number of iterations is not a multiple of 4.
> I think you are looking at the scalar epilogue. The number of iterations is
> unknown, so we need an epilogue loop for the case that number of iterations is
> not a multiple of 4.

While investigating pr51597, I have found that vectorized loops in programs as simple as

subroutine spmmult(x,b,ad)
  implicit none
  integer, parameter :: nxyz=1008315
  real(8),dimension(nxyz):: x,b,ad
  b = ad*x
end subroutine spmmult !=========================================

always have an additional non-vectorized loop, i.e. a vectorized one

L3:
        movsd   (%r9,%rax), %xmm1
        addq    $1, %rcx
        movapd  (%r10,%rax), %xmm0
        movhpd  8(%r9,%rax), %xmm1
        mulpd   %xmm1, %xmm0
        movlpd  %xmm0, (%r8,%rax)
        movhpd  %xmm0, 8(%r8,%rax)
        addq    $16, %rax
        cmpq    $504156, %rcx
        jbe     L3

and a non-vectorized one

L5:
        movsd   -8(%rdi,%rax,8), %xmm0
        mulsd   -8(%rdx,%rax,8), %xmm0
        movsd   %xmm0, -8(%rsi,%rax,8)
        addq    $1, %rax
        cmpq    %rcx, %rax
        jne     L5

even when the above loops are unrolled. How can the loop L5 be unrolled if it is only there for a "scalar epilogue"?
(In reply to comment #5)
> > (3) If I change all double's into float's in the code above it seems to
>
> I think you are looking at the scalar epilogue. The number of iterations is
> unknown, so we need an epilogue loop for the case that number of iterations is
> not a multiple of 4.

Yes, you're right. Sorry about that, my mistake.

> > (1) In this case it should work without -funsafe-math-optimizations but
> > it doesn't. gcc 4.7 requires -fno-signed-zeros -fno-trapping-math
> > -fassociative-math to make it work.
>
> It's reduction, when we vectorize we change the order of computation. In order
> to be able to do that for floating point we need flag_associative_math.

In some cases it might be necessary, but not here:

sum1+=a;
sum2+=a;

gives exactly the same result as

(sum1, sum2) += (a, a);

Let's take a more applied example, say calculating the sum of 1/i:

double harmon(int n)
{
  double sum=0.0;
  for(int i=1; i<n; i++){
    sum += 1.0/i;
  }
  return sum;
}

This requires reordering of the sum to be vectorized, so in this case I agree we need -funsafe-math-optimizations. However, one could manually split the sum:

double harmon(int n)
{
  assert(n%2==0);
  double sum1=0.0, sum2=0.0;
  for(int i=1; i<n; i+=2){
    sum1 += 1.0/i;
    sum2 += 1.0/(i+1);
  }
  return sum1+sum2;
}

and now I'd expect the compiler to vectorize this without -funsafe-math-optimizations, as it doesn't change any computational results:

(sum1, sum2) += (1.0/i, 1.0/(i+1));

I can attach a test case with that example if that'd be useful.
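The claim in this comment can be checked directly: in the split version, sum1 and sum2 are independent accumulators, so the order in which the two updates happen within an iteration cannot change either of them. A small sketch (function names are mine) comparing the two update orders:

```cpp
// Two copies of the split harmonic sum from the comment above, identical
// except that the per-iteration updates of sum1 and sum2 are swapped --
// which is the reordering freedom a 2-lane vectorization needs.
double harmon_split_a(int n) {
    double sum1 = 0.0, sum2 = 0.0;
    for (int i = 1; i < n; i += 2) {
        sum1 += 1.0 / i;
        sum2 += 1.0 / (i + 1);
    }
    return sum1 + sum2;
}

double harmon_split_b(int n) {
    double sum1 = 0.0, sum2 = 0.0;
    for (int i = 1; i < n; i += 2) {
        sum2 += 1.0 / (i + 1);   // swapped update order
        sum1 += 1.0 / i;
    }
    return sum1 + sum2;
}
```

Since each accumulator sees exactly the same sequence of additions in both versions, the results are bitwise identical, with no associativity assumption needed.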
(In reply to comment #6)
> While investigating pr51597, I have found that vectorized loops in programs as
> simple as
>
> subroutine spmmult(x,b,ad)
>   implicit none
>   integer, parameter :: nxyz=1008315
>   real(8),dimension(nxyz):: x,b,ad
>   b = ad*x
> end subroutine spmmult !=========================================
>
> has always an additional non-vectorized loop,

This loop has a prologue loop for alignment purposes.

> L5:
>         movsd   -8(%rdi,%rax,8), %xmm0
>         mulsd   -8(%rdx,%rax,8), %xmm0
>         movsd   %xmm0, -8(%rsi,%rax,8)
>         addq    $1, %rax
>         cmpq    %rcx, %rax
>         jne     L5
>
> even when the above loops are unrolled. How can the loop L5 be unrolled if it
> is only there for a "scalar epilogue"?

It can't be unrolled, since the alignment is unknown, so we don't know the number of iterations of the prologue loop, and, therefore, we don't know the number of iterations of the epilogue.
(In reply to comment #7)
>
> In some cases it might be necessary but not here:
>
> sum1+=a;
> sum2+=a;
>
> gives exactly the same result as
>
> (sum1, sum2) += (a, a);

So, you are suggesting removing the need for flag_associative_math for floating point in cases where a reduction computation is already unrolled by the vectorization factor. Sounds reasonable to me.
Hmm. But we are vectorizing

sum += a[i]
sum += a[i+1]

the same as

sum += a[i+1]
sum += a[i]

no? Thus you have to check whether the summation occurs in "memory order"?
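The distinction matters because a single-accumulator sum is order-sensitive: floating-point addition is not associative, so accumulating the same values in a different order can change the result. A minimal demonstration (function names are mine):

```cpp
// Summing the same array into ONE accumulator in two different orders.
// Unlike the independent-accumulator case above, the results can differ,
// which is why reordering a reduction needs -fassociative-math.
double sum_in_order(const double* a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i];
    return s;
}

double sum_reversed(const double* a, int n) {
    double s = 0.0;
    for (int i = n - 1; i >= 0; --i) s += a[i];
    return s;
}
```

For a = {1e100, -1e100, 1.0}, summing in memory order gives (1e100 - 1e100) + 1.0 = 1.0, while the reversed order gives (1.0 - 1e100) + 1e100 = 0.0 because the 1.0 is absorbed by rounding.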
Right. We need to check that there is no load permutation.
> > even when the above loops are unrolled. How can the loop L5 be unrolled if it
> > is only there for a "scalar epilogue"?
>
> It can't be unrolled, since the alignment is unknown, so we don't know the
> number of iterations of the prologue loop, and, therefore, we don't know the
> number of iterations of the epilogue.

Well, it is unrolled with -funroll-loops. For instance, if I compile with '-Ofast -funroll-loops --param max-unroll-times=4', I get

L3:
        movsd   (%r8,%r11), %xmm3
        addq    $4, %r10
        movsd   16(%r8,%r11), %xmm5
        movsd   32(%r8,%r11), %xmm7
        movhpd  8(%r8,%r11), %xmm3
        movsd   48(%r8,%r11), %xmm9
        movhpd  24(%r8,%r11), %xmm5
        movapd  (%r9,%r11), %xmm4
        movhpd  40(%r8,%r11), %xmm7
        movapd  16(%r9,%r11), %xmm6
        movhpd  56(%r8,%r11), %xmm9
        movapd  32(%r9,%r11), %xmm8
        mulpd   %xmm3, %xmm4
        movapd  48(%r9,%r11), %xmm10
        mulpd   %xmm5, %xmm6
        mulpd   %xmm7, %xmm8
        mulpd   %xmm9, %xmm10
        movlpd  %xmm4, (%rcx,%r11)
        movhpd  %xmm4, 8(%rcx,%r11)
        movlpd  %xmm6, 16(%rcx,%r11)
        movhpd  %xmm6, 24(%rcx,%r11)
        movlpd  %xmm8, 32(%rcx,%r11)
        movhpd  %xmm8, 40(%rcx,%r11)
        movlpd  %xmm10, 48(%rcx,%r11)
        movhpd  %xmm10, 56(%rcx,%r11)
        addq    $64, %r11
        cmpq    $504156, %r10
        jbe     L3

and

L5:
        movsd   -8(%rdi,%r9,8), %xmm15
        leaq    1(%r9), %rbx
        leaq    2(%r9), %r8
        movsd   -8(%rdi,%rbx,8), %xmm0
        leaq    3(%r9), %rcx
        movsd   -8(%rdi,%r8,8), %xmm1
        mulsd   -8(%rdx,%r9,8), %xmm15
        movsd   -8(%rdi,%rcx,8), %xmm2
        mulsd   -8(%rdx,%rbx,8), %xmm0
        mulsd   -8(%rdx,%r8,8), %xmm1
        mulsd   -8(%rdx,%rcx,8), %xmm2
        movsd   %xmm15, -8(%rsi,%r9,8)
        addq    $4, %r9
        cmpq    %r12, %r9
        movsd   %xmm0, -8(%rsi,%rbx,8)
        movsd   %xmm1, -8(%rsi,%r8,8)
        movsd   %xmm2, -8(%rsi,%rcx,8)
        jne     L5

So both the vectorized and the unvectorized loops are unrolled four times. This does not seem logical to me if the L5 loop is there only to handle a leftover scalar (AFAIU the %xmm* registers store only one or two doubles, and there is at most one element left over if the length is odd, or if the length is even and the first element has been peeled for alignment).
I am also puzzled by the way the vectors are stored back as a pair:

        movlpd  %xmm4, (%rcx,%r11)
        movhpd  %xmm4, 8(%rcx,%r11)

Why not a 'movapd' instead?
(In reply to comment #9)
> So, you are suggesting to remove the need in flag_associative_math for fp for
> cases when a reduction computation is already unrolled by the vectorization
> factor. Sounds reasonable to me.

Yes, I think that's it: basically, only require flag_associative_math if the order of the summation or products is changed by the vectorizer. That is quite important, I think, as -ffast-math / -funsafe-math-optimizations / -fassociative-math may not be acceptable for many projects.

However, I don't fully understand Richard Guenther's example. Yes, his example requires -fassociative-math to be vectorized; however, my example would translate to something like

sum1 += a[i];
sum2 += a[i+1];

and now it doesn't matter if it's executed this way or the other way around:

sum2 += a[i+1];
sum1 += a[i];

The second issue is just to double-check the profitability calculation, as it wrongly decided:

7: not vectorized: vectorization not profitable.
(In reply to comment #13)
>
> However, I don't fully understand Richard Guenther's example. Yes his
> example requires -fassociative-math to be vectorized, however, my example
> would translate to something like
>
> sum1 += a[i];
> sum2 += a[i+1];
>
> and now it doesn't matter if it's executed this way or the other way
> around
>
> sum2 += a[i+1];
> sum1 += a[i];

The problem is probably more in the implementation. The change of order will also swap sum1 and sum2, so when you want to return sum1+sum2, the vectorized version will return sum2+sum1.
So here is the interesting part on trunk: with -O3 we can vectorize the loop, because we use the SLP vectorizer, but with -Ofast we don't, as we say the vectorization is too costly.

The innermost loop for -O3:

.L3:
        addq    $1, %rax
        addpd   %xmm1, %xmm2
        addpd   %xmm1, %xmm3
        addpd   %xmm1, %xmm4
        cmpq    %rax, %rdi
        jne     .L3

The SLP vectorizer has done this since GCC 11. Here is the inner loop for -Ofast:

.L3:
        addq    $1, %rax
        addsd   %xmm0, %xmm3
        addsd   %xmm0, %xmm6
        addsd   %xmm0, %xmm1
        addsd   %xmm0, %xmm5
        addsd   %xmm0, %xmm2
        addsd   %xmm0, %xmm4
        cmpq    %rax, %rdi
        jne     .L3

As you can see, we don't vectorize it.