Bug 51848 - GCC is not able to vectorize when a constant value is also added to the sum of array expression inside a loop.
Summary: GCC is not able to vectorize when a constant value is also added to the sum o...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.7.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2012-01-13 13:40 UTC by Venkataramanan
Modified: 2024-02-16 08:22 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work: 11.4.1, 14.0, 8.1.0, 9.1.0
Known to fail: 7.1.0
Last reconfirmed: 2012-01-13 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Venkataramanan 2012-01-13 13:40:05 UTC
This below test case is simulated from "air.f90" benchmark of polyhedren. 

What I see is vectorization makes "air" run faster with ICC than GCC by about 16%,
but I am not sure if all that comes from vectorization alone.

While analysing the assembly differences, found that GCC is not vectorizing the below case wheres ICC does vectorize.

(Snip)
      DIMENSION NPX(30) , NPY(30)
      COMMON /XD1   / MXPy, NDX
      COMMON /XD2  / MXPx
      MXPx = 0
      DO i = 1 , NDX
         MXPx = MXPx + NPX(i)+1
      ENDDO
!
      END
(Snip)


Machine: x86_64-unknown-linux-gnu
GCC revison: 183151 
ICC revision: 12.1.0.233 Build 2

gcc -Ofast -march=corei7-avx  -limf -lsvml -L /tool/intel/lib/intel64/ -mveclibabi=svml   pattern1.f90 -ftree-vectorizer-verbose=2 -S

Analyzing loop at pattern1.f90:5

5: not vectorized: unsupported use in stmt.
5: not vectorized: unsupported use in stmt.
pattern1.f90:9: note: vectorized 0 loops in function.


ifort -march=corei7-avx  -O3  -limf -lsvml -L /tool/intel/lib/intel64/  pattern1.f90  -vec-report -S -fsource-asm

pattern1.f90(5): (col. 7) remark: LOOP WAS VECTORIZED.


For the expression: 

MXPx = MXPx + NPX(i)+1

        
The constant "1" is converted to a vector packet as shown below
 
 .L_2il0floatpacket.0:
        .long   0x00000001,0x00000001,0x00000001,0x00000001

The assembly pattern for the vectorization portion in ICC looks like as shown below:

       
The total expression now becomes vectorizable. 

vmovdqu   .L_2il0floatpacket.0(%rip), %xmm0

..B1.5:                         # Preds ..B1.5 ..B1.4
        vpaddd    _unnamed_main$_$NPX.0.1(,%rax,4), %xmm0, %xmm2 #6.10
        addq      $4, %rax                                      #5.7
        vpaddd    %xmm2, %xmm1, %xmm1                           #6.22
        cmpq      %rdx, %rax                                    #5.7
        jb        ..B1.5        # Prob 96%                      #5.7

Please provide your thoughts on this and possible vectorization improvement in GCC for this pattern.
Comment 1 Dominique d'Humieres 2012-01-13 16:00:20 UTC
I confirm that the loop in the codelet is not vectorized (it is if the "+1" is removed and taken into account as "MXPx=NDX").

> but I am not sure if all that comes from vectorization alone.

Doing the above change improves the vectorization, but not the timing. One "property" of the test air.f90 is to have several nested loops with "bad" nesting (slow index first). I don't know if this was done to test compilers, but if I reverse the loops manually as in

--- air.f90	2009-08-28 14:22:26.000000000 +0200
+++ air_v1.f90	2005-11-09 17:33:12.000000000 +0100
@@ -400,8 +400,8 @@
 !
 ! COMPUTE THE FLUX TERMS
 !
-      DO i = 1 , MXPx
-         DO j = 1 , MXPy
+      DO j = 1 , MXPy
+         DO i = 1 , MXPx
 !
 ! compute vanleer fluxes
 !
@@ -657,8 +657,8 @@
       ENDDO
 !
 ! COMPUTE THE FLUX TERMS
-      DO i = 1 , MXPx
-         DO j = 1 , MXPy
+      DO j = 1 , MXPy
+         DO i = 1 , MXPx
 !
 ! compute vanleer fluxes
 !
@@ -838,8 +838,8 @@
 ! FIND THE LOCAL TIME STEPS
 !
       dt = 100
-      DO i = 1 , MXPx
-         DO j = 1 , MXPy
+      DO j = 1 , MXPy
+         DO i = 1 , MXPx
             as = DSQRT(P(i,j)/RHO(i,j)*GMA)
             rdltx = RHO(i,j)*DABS(U(i,j))*ddx(i,j)/xmu(i,j)
             rdlty = RHO(i,j)*DABS(V(i,j))*ddy(i,j)/xmu(i,j)
@@ -880,13 +880,13 @@
          DO iy = 1 , NDY
             maxy = maxy + NPY(iy) + 1
             dtd = 100.0
-            DO i = minx , maxx
-               DO j = miny , maxy
+            DO j = miny , maxy
+               DO i = minx , maxx
                   IF ( dtt(i,j).LE.dtd ) dtd = dtt(i,j)
                ENDDO
             ENDDO
-            DO i = minx , maxx
-               DO j = miny , maxy
+            DO j = miny , maxy
+               DO i = minx , maxx
                   dtt(i,j) = dtd
                ENDDO
             ENDDO
@@ -958,8 +958,8 @@
       con2 = 0.0
       con3 = 0.0
       con4 = 0.0
-      DO i = 1 , MXPx
-         DO j = 1 , MXPy
+      DO j = 1 , MXPy
+         DO i = 1 , MXPx
             con1 = con1 + DABS(u1(i,j)-u1o(i,j))/dtt(i,j)
             con2 = con2 + DABS(u2(i,j)-u2o(i,j))/dtt(i,j)
             con3 = con3 + DABS(u3(i,j)-u3o(i,j))/dtt(i,j)

the timing goes from

[macbook] lin/test% time a.out > /dev/null
7.233u 0.023s 0:07.25 100.0%	0+0k 0+8io 0pf+0w

to

[macbook] lin/test% time a.out > /dev/null
6.353u 0.021s 0:06.37 100.0%	0+0k 0+8io 0pf+0w

I have made a few attempt to obtain gfortran to do these loops interchange using graphite without success so far.
Comment 2 Ira Rosen 2012-01-15 08:06:09 UTC
We don't recognize this as reduction because we look for:

     a1 = phi < a0, a2 >
     a3 = ...
     a2 = operation (a3, a1)

and here we have

     a1 = phi < a0, a2 >
     a3 = ...
     a4 = operation (a3, a1)
     a2 = a4 + 1
Comment 3 Richard Biener 2012-01-16 10:14:08 UTC
(In reply to comment #2)
> We don't recognize this as reduction because we look for:
> 
>      a1 = phi < a0, a2 >
>      a3 = ...
>      a2 = operation (a3, a1)
> 
> and here we have
> 
>      a1 = phi < a0, a2 >
>      a3 = ...
>      a4 = operation (a3, a1)
>      a2 = a4 + 1

It seems reassociation should "fix" this.  IIRC we had some special code
in there that was supposed to handle this.
Comment 4 Richard Biener 2012-07-13 08:59:26 UTC
Link to vectorizer missed-optimization meta-bug.
Comment 5 Alan Lawrence 2015-06-16 16:54:42 UTC
(In reply to Richard Biener from comment #3)
> (In reply to comment #2)
> > We don't recognize this as reduction because we look for:
> > 
> >      a1 = phi < a0, a2 >
> >      a3 = ...
> >      a2 = operation (a3, a1)
> > 
> > and here we have
> > 
> >      a1 = phi < a0, a2 >
> >      a3 = ...
> >      a4 = operation (a3, a1)
> >      a2 = a4 + 1
> 
> It seems reassociation should "fix" this.  IIRC we had some special code
> in there that was supposed to handle this.

We do reassociate equivalent functions in C, for example

float x[30];

float bar() {
  float s = 0;
  for (int i = 0; i < 30; i++)
    {
      s += x[i];
      s += 1;
    }
  return s;
}

which vectorizes just fine.
Comment 6 Richard Biener 2024-02-16 08:22:43 UTC
long fixed.