Bug 51499 - -Ofast does not vectorize while -O3 does.
Summary: -Ofast does not vectorize while -O3 does.
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.6.2
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2011-12-10 18:13 UTC by fb.programming
Modified: 2021-08-07 14:50 UTC
CC: 3 users

Last reconfirmed: 2021-08-07 00:00:00


Description fb.programming 2011-12-10 18:13:35 UTC
The SSE vectorizer seems to miss one of the simplest cases:

#include <cstdio>
#include <cstdlib>

double loop(double a, size_t n){
   // initialise differently so compiler doesn't simplify
   double sum1=0.1, sum2=0.2, sum3=0.3, sum4=0.4, sum5=0.5, sum6=0.6;
   for(size_t i=0; i<n; i++){
      sum1+=a; sum2+=a; sum3+=a; sum4+=a; sum5+=a; sum6+=a;
   }
   return sum1+sum2+sum3+sum4+sum5+sum6-2.1-6.0*a*n;
}

int main(int argc, char** argv) {
   size_t n=1000000;
   double a=1.1;
   printf("res=%f\n", loop(a,n));
   return EXIT_SUCCESS;
}

g++-4.6.2 -Wall -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 test.cpp

test.cpp:7: note: not vectorized: unsupported use in stmt.
test.cpp:4: note: vectorized 0 loops in function.

We get six addsd operations, whereas the optimisation should have
given us three addpd operations.

.L3:
	addq	$1, %rax
	addsd	%xmm0, %xmm6
	cmpq	%rdi, %rax
	addsd	%xmm0, %xmm5
	addsd	%xmm0, %xmm4
	addsd	%xmm0, %xmm3
	addsd	%xmm0, %xmm2
	addsd	%xmm0, %xmm1
	jne	.L3
Comment 1 Ira Rosen 2011-12-11 07:37:36 UTC
You need -ffast-math to allow vectorization of the floating-point reduction.
You also need -fno-vect-cost-model, because the cost model decides that vectorization is not profitable in this case.
Comment 2 fb.programming 2011-12-11 08:33:40 UTC
(In reply to comment #1)

g++-4.6.2 -S -Wall -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 \
          -ffast-math  -fno-vect-cost-model

gives me exactly the same assembly code as above (which surprises me
a bit, as -funsafe-math-optimizations might as well have eliminated the
loop completely).

The optimal assembly, however, I would expect to be something like:

.L3:
    addq    $1, %rax
    addpd    %xmm0, %xmm3
    cmpq    %rdi, %rax
    addpd    %xmm0, %xmm2
    addpd    %xmm0, %xmm1
    jne    .L3

Here the vector (sum1,sum2) is stored in xmm1, (sum3,sum4) in xmm2,
etc., and (a,a) in xmm0. This speeds the loop up by a factor of 2
and is completely equivalent to the scalar case, so I don't see why
-ffast-math (which implies -funsafe-math-optimizations) should be
necessary in this case, either.
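
For illustration only, the transformation I have in mind could be written by
hand with SSE2 intrinsics (a minimal sketch, the name loop_sse2 is mine; each
lane performs exactly the same sequence of additions as the corresponding
scalar accumulator, so the result is bit-identical to the scalar loop):

   #include <emmintrin.h>   // SSE2 intrinsics
   #include <cstddef>       // size_t

   double loop_sse2(double a, size_t n){
      __m128d va  = _mm_set1_pd(a);          // (a, a)
      __m128d s12 = _mm_setr_pd(0.1, 0.2);   // (sum1, sum2)
      __m128d s34 = _mm_setr_pd(0.3, 0.4);   // (sum3, sum4)
      __m128d s56 = _mm_setr_pd(0.5, 0.6);   // (sum5, sum6)
      for(size_t i=0; i<n; i++){
         s12 = _mm_add_pd(s12, va);          // same per-lane order as the scalar code
         s34 = _mm_add_pd(s34, va);
         s56 = _mm_add_pd(s56, va);
      }
      double s[6];
      _mm_storeu_pd(&s[0], s12);
      _mm_storeu_pd(&s[2], s34);
      _mm_storeu_pd(&s[4], s56);
      return s[0]+s[1]+s[2]+s[3]+s[4]+s[5]-2.1-6.0*a*n;
   }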
Comment 3 Ira Rosen 2011-12-11 08:48:24 UTC
It gets vectorized with 4.7, I guess due to this 4.7 patch: http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00620.html.
Comment 4 fb.programming 2011-12-11 11:52:30 UTC
Looks like there has been some great progress in gcc 4.7!

Still, I think the behaviour is slightly buggy.

(1) In this case it should work without -funsafe-math-optimizations but
    it doesn't. gcc 4.7 requires -fno-signed-zeros -fno-trapping-math
    -fassociative-math to make it work.

(2) The prediction
       7: not vectorized: vectorization not profitable.
    is just wrong. Forcing it with -fno-vect-cost-model shows a speedup
    by a factor of 2.

(3) If I change all doubles into floats in the code above, it seems to
    work without forcing it (-fno-vect-cost-model):


   g++-4.7 -S -Wall -O2  -ftree-vectorize -ftree-vectorizer-verbose=2 \
           -funsafe-math-optimizations test.cpp

   Analyzing loop at test.cpp:7


   Vectorizing loop at test.cpp:7

   7: vectorizing stmts using SLP.
   7: LOOP VECTORIZED.
   test.cpp:4: note: vectorized 1 loops in function.


    However, it hasn't vectorized it at all, as the assembly shows:

.L11:
	addq	$1, %rax
	addss	%xmm0, %xmm3
	cmpq	%rax, %rdi
	addss	%xmm0, %xmm4
	addss	%xmm0, %xmm7
	addss	%xmm0, %xmm6
	addss	%xmm0, %xmm5
	addss	%xmm0, %xmm1
	ja	.L11
Comment 5 Ira Rosen 2011-12-11 13:30:41 UTC
(In reply to comment #4)
> Looks like there has been some great progress in gcc 4.7!
> 
> Still I think it behaves slightly buggy.
> 
> (1) In this case it should work without -funsafe-math-optimizations but
>     it doesn't. gcc 4.7 requires -fno-signed-zeros -fno-trapping-math
>    -fassociative-math to make it work.
> 

It's a reduction: when we vectorize, we change the order of computation. To be able to do that for floating point we need flag_associative_math.
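
For illustration (not the vectorizer's actual internal form), a plain reduction such as

   for (i = 0; i < n; i++) sum += a[i];

is, with vectorization factor 2, effectively computed as

   vsum0 = 0; vsum1 = 0;
   for (i = 0; i + 1 < n; i += 2) { vsum0 += a[i]; vsum1 += a[i+1]; }
   sum = vsum0 + vsum1;   /* plus a scalar epilogue when n is odd */

i.e. as (a[0]+a[2]+...) + (a[1]+a[3]+...), which is a different association of the same terms.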

> (2) The prediction:
>        7: not vectorized: vectorization not profitable.
>     is just wrong. Forcing it with -fno-vect-cost-model shows it speeds up
>     by factor of 2.
> 
> (3) If I change all double's into float's in the code above it seems to
>     work without forcing it (-fno-vect-cost-model):
> 
> 
>    g++-4.7 -S -Wall -O2  -ftree-vectorize -ftree-vectorizer-verbose=2 \
>            -funsafe-math-optimizations test.cpp
> 
>    Analyzing loop at test.cpp:7
> 
> 
>    Vectorizing loop at test.cpp:7
> 
>    7: vectorizing stmts using SLP.
>    7: LOOP VECTORIZED.
>    test.cpp:4: note: vectorized 1 loops in function.
> 
> 
>     However, it hasn't vectorized it at all as the assembly shows:
> 
> .L11:
>     addq    $1, %rax
>     addss    %xmm0, %xmm3
>     cmpq    %rax, %rdi
>     addss    %xmm0, %xmm4
>     addss    %xmm0, %xmm7
>     addss    %xmm0, %xmm6
>     addss    %xmm0, %xmm5
>     addss    %xmm0, %xmm1
>     ja    .L11


I think you are looking at the scalar epilogue. The number of iterations is unknown, so we need an epilogue loop for the case that the number of iterations is not a multiple of 4.
Comment 6 Dominique d'Humieres 2011-12-11 14:14:01 UTC
> I think you are looking at the scalar epilogue. The number of iterations is
> unknown, so we need an epilogue loop for the case that number of iterations is
> not a multiple of 4.

While investigating pr51597, I have found that vectorized loops in programs as simple as

subroutine spmmult(x,b,ad)
implicit none
integer, parameter :: nxyz=1008315
real(8),dimension(nxyz):: x,b,ad
b = ad*x
end subroutine spmmult               !=========================================

always have an additional non-vectorized loop, i.e. a vectorized one

L3:
        movsd   (%r9,%rax), %xmm1
        addq    $1, %rcx
        movapd  (%r10,%rax), %xmm0
        movhpd  8(%r9,%rax), %xmm1
        mulpd   %xmm1, %xmm0
        movlpd  %xmm0, (%r8,%rax)
        movhpd  %xmm0, 8(%r8,%rax)
        addq    $16, %rax
        cmpq    $504156, %rcx
        jbe     L3

and a non-vectorized one

L5:
        movsd   -8(%rdi,%rax,8), %xmm0
        mulsd   -8(%rdx,%rax,8), %xmm0
        movsd   %xmm0, -8(%rsi,%rax,8)
        addq    $1, %rax
        cmpq    %rcx, %rax
        jne     L5

even when the above loops are unrolled. How can the loop L5 be unrolled if it is only there for a "scalar epilogue"?
Comment 7 fb.programming 2011-12-11 14:55:13 UTC
(In reply to comment #5)

> > (3) If I change all double's into float's in the code above it seems to

> I think you are looking at the scalar epilogue. The number of iterations is
> unknown, so we need an epilogue loop for the case that number of iterations is
> not a multiple of 4.

Yes you're right. Sorry about that, my mistake.


> > (1) In this case it should work without -funsafe-math-optimizations but
> >     it doesn't. gcc 4.7 requires -fno-signed-zeros -fno-trapping-math
> >    -fassociative-math to make it work.
> > 
> 
> It's reduction, when we vectorize we change the order of computation. In order
> to be able to do that for floating point we need flag_associative_math.

In some cases it might be necessary but not here:

 sum1+=a;
 sum2+=a;

gives exactly the same result as

 (sum1, sum2) += (a, a);

Let's take a more applied example, say calculating the sum of 1/i:

   double harmon(int n) {
      double sum=0.0;
      for(int i=1; i<n; i++){
         sum += 1.0/i;
      }
      return sum;
   }

This requires reordering of the sum to be vectorized, so in this case
I agree we need -funsafe-math-optimizations.
However, one could manually split the sum 

   double harmon(int n) {
      assert(n%2==0);
      double sum1=0.0, sum2=0.0;
      for(int i=1; i<n; i+=2){
         sum1 += 1.0/i;
         sum2 += 1.0/(i+1);
      }
      return sum1+sum2;
   }

and now I'd expect the compiler to vectorize this without
-funsafe-math-optimizations as it doesn't change any computational
results:

         (sum1, sum2) += (1.0/i, 1.0/(i+1));
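
For illustration, the split loop maps directly onto SSE2 intrinsics with no
reassociation at all (a hand-written sketch; the name harmon_sse2 is mine):

   #include <emmintrin.h>
   #include <cassert>

   double harmon_sse2(int n) {
      assert(n%2==0);
      __m128d vsum = _mm_setzero_pd();                 // (sum1, sum2)
      __m128d one  = _mm_set1_pd(1.0);
      for(int i=1; i<n; i+=2){
         __m128d vi = _mm_setr_pd((double)i, (double)(i+1));
         vsum = _mm_add_pd(vsum, _mm_div_pd(one, vi)); // (sum1,sum2) += (1.0/i, 1.0/(i+1))
      }
      double s[2];
      _mm_storeu_pd(s, vsum);
      return s[0]+s[1];                                // same final sum1+sum2 as the scalar code
   }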

I can attach a test case with that example if that'd be useful?
Comment 8 Ira Rosen 2011-12-12 11:03:59 UTC
(In reply to comment #6)

> While investigating pr51597, I have found that vectorized loops in programs as
> simple as
> 
> subroutine spmmult(x,b,ad)
> implicit none
> integer, parameter :: nxyz=1008315
> real(8),dimension(nxyz):: x,b,ad
> b = ad*x
> end subroutine spmmult               !=========================================
> 
> has always an additional non-vectorized loop,

This loop has a prologue loop for alignment purposes.

> L5:
>         movsd   -8(%rdi,%rax,8), %xmm0
>         mulsd   -8(%rdx,%rax,8), %xmm0
>         movsd   %xmm0, -8(%rsi,%rax,8)
>         addq    $1, %rax
>         cmpq    %rcx, %rax
>         jne     L5
> 
> even when the above loops are unrolled. How can the loop L5 be unrolled if it
> is only there for a "scalar epilogue"?

It can't be unrolled, since the alignment is unknown, so we don't know the number of iterations of the prologue loop, and, therefore, we don't know the number of iterations of the epilogue.
Comment 9 Ira Rosen 2011-12-12 11:13:24 UTC
(In reply to comment #7)
> 
> In some cases it might be necessary but not here:
> 
>  sum1+=a;
>  sum2+=a;
> 
> gives exactly the same result as
> 
>  (sum1, sum2) += (a, a);
> 

So you are suggesting removing the need for flag_associative_math for floating point in cases where a reduction computation is already unrolled by the vectorization factor. Sounds reasonable to me.
Comment 10 Richard Biener 2011-12-12 11:24:22 UTC
Hmm.  But we are vectorizing

  sum += a[i]
  sum += a[i+1]

the same as

  sum += a[i+1]
  sum += a[i]

no?  Thus you have to check whether the summation occurs in "memory order"?
Comment 11 Ira Rosen 2011-12-12 11:27:26 UTC
Right. We need to check that there is no load permutation.
Comment 12 Dominique d'Humieres 2011-12-12 12:47:54 UTC
> > even when the above loops are unrolled. How can the loop L5 be unrolled if it
> > is only there for a "scalar epilogue"?
>
> It can't be unrolled, since the alignment is unknown, so we don't know the
> number of iterations of the prologue loop, and, therefore, we don't know the
> number of iterations of the epilogue.

Well, it is unrolled with -funroll-loops. For instance, if I compile with '-Ofast -funroll-loops --param max-unroll-times=4', I get

L3:
        movsd   (%r8,%r11), %xmm3
        addq    $4, %r10
        movsd   16(%r8,%r11), %xmm5
        movsd   32(%r8,%r11), %xmm7
        movhpd  8(%r8,%r11), %xmm3
        movsd   48(%r8,%r11), %xmm9
        movhpd  24(%r8,%r11), %xmm5
        movapd  (%r9,%r11), %xmm4
        movhpd  40(%r8,%r11), %xmm7
        movapd  16(%r9,%r11), %xmm6
        movhpd  56(%r8,%r11), %xmm9
        movapd  32(%r9,%r11), %xmm8
        mulpd   %xmm3, %xmm4
        movapd  48(%r9,%r11), %xmm10
        mulpd   %xmm5, %xmm6
        mulpd   %xmm7, %xmm8
        mulpd   %xmm9, %xmm10
        movlpd  %xmm4, (%rcx,%r11)
        movhpd  %xmm4, 8(%rcx,%r11)
        movlpd  %xmm6, 16(%rcx,%r11)
        movhpd  %xmm6, 24(%rcx,%r11)
        movlpd  %xmm8, 32(%rcx,%r11)
        movhpd  %xmm8, 40(%rcx,%r11)
        movlpd  %xmm10, 48(%rcx,%r11)
        movhpd  %xmm10, 56(%rcx,%r11)
        addq    $64, %r11
        cmpq    $504156, %r10
        jbe     L3

and

L5:
        movsd   -8(%rdi,%r9,8), %xmm15
        leaq    1(%r9), %rbx
        leaq    2(%r9), %r8
        movsd   -8(%rdi,%rbx,8), %xmm0
        leaq    3(%r9), %rcx
        movsd   -8(%rdi,%r8,8), %xmm1
        mulsd   -8(%rdx,%r9,8), %xmm15
        movsd   -8(%rdi,%rcx,8), %xmm2
        mulsd   -8(%rdx,%rbx,8), %xmm0
        mulsd   -8(%rdx,%r8,8), %xmm1
        mulsd   -8(%rdx,%rcx,8), %xmm2
        movsd   %xmm15, -8(%rsi,%r9,8)
        addq    $4, %r9
        cmpq    %r12, %r9
        movsd   %xmm0, -8(%rsi,%rbx,8)
        movsd   %xmm1, -8(%rsi,%r8,8)
        movsd   %xmm2, -8(%rsi,%rcx,8)
        jne     L5

So both the vectorized and the unvectorized loops are unrolled four times. This does not seem logical to me if the L5 loop is only there for a "scalar epilogue" (AFAIU the %xmm* registers store only one or two doubles, and there is at most one element left over if the length is odd, or if the length is even and the first element has been peeled for alignment).

I am also puzzled by the way the vectors are stored back as a pair

        movlpd  %xmm4, (%rcx,%r11)
        movhpd  %xmm4, 8(%rcx,%r11)

Why not a 'movapd' instead?
Comment 13 fb.programming 2011-12-12 14:20:58 UTC
(In reply to comment #9)

> So, you are suggesting to remove the need in flag_associative_math for fp for
> cases when a reduction computation is already unrolled by the vectorization
> factor. Sounds reasonable to me.

Yes, I think that's it: basically, only require flag_associative_math if
the order of summation or products is changed by the vectorizer. That is
quite important, I think, as
 -ffast-math / -funsafe-math-optimizations / -fassociative-math
will often not be acceptable for many projects.

However, I don't fully understand Richard Guenther's example. Yes, his
example requires -fassociative-math to be vectorized; my example, however,
would translate to something like

  sum1 += a[i];
  sum2 += a[i+1];

and now it doesn't matter if it's executed this way or the other way
around

  sum2 += a[i+1];
  sum1 += a[i];

The second issue is just to double-check the profitability calculation,
as it wrongly decided:

  7: not vectorized: vectorization not profitable.
Comment 14 Ira Rosen 2011-12-13 16:27:19 UTC
(In reply to comment #13)
> 
> However, I don't fully understand Richard Guenther's example. Yes his
> example requires -fassociative-math to be vectorized, however, my example
> would translate to something like
> 
>   sum1 += a[i];
>   sum2 += a[i+1];
> 
> and now it doesn't matter if it's executed this way or the other way
> around
> 
>   sum2 += a[i+1];
>   sum1 += a[i];

The problem is probably more in the implementation. The change of order will also swap sum1 and sum2, so when you want to return sum1+sum2, the vectorized version will return sum2+sum1.
Comment 15 Andrew Pinski 2021-08-07 05:19:35 UTC
So here is what is interesting for the trunk:
with -O3 we can vectorize the loop because we are using the SLP vectorizer, but
with -Ofast we don't, as we say the vectorization is too costly.

The innermost loop for -O3:
.L3:
        addq    $1, %rax
        addpd   %xmm1, %xmm2
        addpd   %xmm1, %xmm3
        addpd   %xmm1, %xmm4
        cmpq    %rax, %rdi
        jne     .L3

The SLP vectorizer has done this since GCC 11.

Here is the inner loop for -Ofast:
.L3:
        addq    $1, %rax
        addsd   %xmm0, %xmm3
        addsd   %xmm0, %xmm6
        addsd   %xmm0, %xmm1
        addsd   %xmm0, %xmm5
        addsd   %xmm0, %xmm2
        addsd   %xmm0, %xmm4
        cmpq    %rax, %rdi
        jne     .L3

As you can see, we don't vectorize it.
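
For reference, the two sequences above can be reproduced with the test.cpp from the description (standard invocations, nothing bug-specific):

        g++ -O3    -S test.cpp   # inner loop uses addpd (SLP vectorized)
        g++ -Ofast -S test.cpp   # inner loop stays addsd (judged too costly)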