Bug 49849 - loop optimization prevents vectorization
Summary: loop optimization prevents vectorization
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.7.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
 
Reported: 2011-07-26 07:45 UTC by vincenzo Innocente
Modified: 2021-09-12 05:21 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-07-26 09:21:16


Description vincenzo Innocente 2011-07-26 07:45:57 UTC
In the following example I suspect that some sort of loop merging at -O3 prevents the vectorization of the second inner loop in "bar".
Compare:
c++ -Wall -O2 -ftree-vectorize -ftree-vectorizer-verbose=7 -c vectHist.cpp -ffast-math
c++ -Wall -O3 -ftree-vectorize -ftree-vectorizer-verbose=7 -c vectHist.cpp -ffast-math



What I do not understand is the following. If (following the man page) I compare -O2 and -O3 with
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
>   -fgcse-after-reload         		[enabled]
>   -finline-functions          		[enabled]
>   -fipa-cp-clone              		[enabled]
>   -fpredictive-commoning      		[enabled]
>   -ftree-loop-distribute-patterns 	[enabled]
>   -ftree-vectorize            		[enabled]
>   -funswitch-loops            		[enabled]

and then add all of those options by hand to the -O2 command line, I still get
c++ -std=gnu++0x -DNDEBUG -Wall -O2 -ftree-vectorize -msse4 -fvisibility-inlines-hidden -ftree-vectorizer-verbose=2 --param vect-max-version-for-alias-checks=30 -funsafe-loop-optimizations -ftree-loop-distribution -ftree-loop-if-convert-stores -fipa-pta -Wunsafe-loop-optimizations -fgcse-sm -fgcse-las -c vectHist.cpp -ffast-math -funswitch-loops -ftree-loop-distribute-patterns -fpredictive-commoning -finline-functions -fipa-cp-clone -fgcse-after-reload

vectHist.cpp:17: note: not vectorized: data ref analysis failed x_5 = co[D.4986_4];

vectHist.cpp:16: note: vectorized 0 loops in function.

vectHist.cpp:35: note: not vectorized: data ref analysis failed D.4977_30 = hist[D.4976_29];

vectHist.cpp:33: note: LOOP VECTORIZED.
vectHist.cpp:31: note: not vectorized: data ref analysis failed D.4957_13 = co[D.4956_12];

vectHist.cpp:25: note: vectorized 1 loops in function.

while changing just -O2 to -O3 (which at this point should have no real effect, as I added all the options by hand) does not vectorize:
c++ -std=gnu++0x -DNDEBUG -Wall -O3 -mavx -ftree-vectorize -msse4 -fvisibility-inlines-hidden -ftree-vectorizer-verbose=2 --param vect-max-version-for-alias-checks=30 -funsafe-loop-optimizations -ftree-loop-distribution -ftree-loop-if-convert-stores -fipa-pta -Wunsafe-loop-optimizations -fgcse-sm -fgcse-las -c vectHist.cpp -ffast-math -funswitch-loops -ftree-loop-distribute-patterns -fpredictive-commoning -finline-functions -fipa-cp-clone -fgcse-after-reload 
vectHist.cpp:17: note: not vectorized: data ref analysis failed x_5 = co[D.5125_4];

vectHist.cpp:17: note: not vectorized: data ref analysis failed x_5 = co[D.5125_4];

vectHist.cpp:16: note: vectorized 0 loops in function.

vectHist.cpp:30: note: not vectorized: data ref analysis failed D.5096_55 = co[D.5095_54];

vectHist.cpp:30: note: not vectorized: data ref analysis failed D.5096_55 = co[D.5095_54];

vectHist.cpp:25: note: vectorized 0 loops in function.

Note how it reports nothing about the loops at lines 31, 33 and 35.

---------------------------
// a classroom example
#include<cmath>

const int N=1024;

float __attribute__ ((aligned(16))) a[N];
float __attribute__ ((aligned(16))) b[N];
float __attribute__ ((aligned(16))) c[N];
float __attribute__ ((aligned(16))) d[N];
int __attribute__ ((aligned(16)))   k[N];



float __attribute__ ((aligned(16))) co[12];
float __attribute__ ((aligned(16))) hist[100];


// do not expect GCC to vectorize (yet)
void foo() {
  for (int i=0; i!=N; ++i) {
    float x = co[k[i]];
    float y = a[i]/std::sqrt(x*b[i]);
    ++hist[int(y)];
  } 
}


// let's give it a hand: split the loop so that the "heavy duty one" vectorizes
void bar() {
  const int S=8;
  int loops = N/S;
  float x[S];
  float y[S];
  for (int j=0; j!=loops; ++j) {
    for (int i=0; i!=S; ++i)
      x[i] = co[k[j+i]];
    for (int i=0; i!=S; ++i) // this should vectorize
      y[i] = a[j+i]/std::sqrt(x[i]*b[j+i]);
    for (int i=0; i!=S; ++i)
      ++hist[int(y[i])];
  } 
}
Comment 1 vincenzo Innocente 2011-07-26 08:30:45 UTC
It may be a duplicate of my own PR49730, as

void bar2(int jj) {
  const int S=8;
  float x[S];
  float y[S];
  int j = jj*S;
  for (int i=0; i!=S; ++i)
    x[i] = co[k[j+i]];
  for (int i=0; i!=S; ++i) // this should vectorize
    y[i] = a[j+i]/std::sqrt(x[i]*b[j+i]);
  for (int i=0; i!=S; ++i)
    ++hist[int(y[i])];
} 

vectorizes at -O3.

(Of course, in the example I submitted previously the outer loop should read

  for (int jj=0; jj!=loops; ++jj) {
    int j = jj*S;

)
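
For reference, the correction above can be folded into a complete test file. The following is only a sketch: foo and bar are taken from the description, with the outer loop rewritten as suggested in this comment, while fill_inputs, count_mismatched_bins and check_equivalence are hypothetical helpers (with made-up input data) added here to check that both versions fill the same histogram. The unused arrays c and d are left out.

```cpp
// Sketch only: foo() and bar() from the report, with the outer-loop fix
// (j = jj*S) applied. The input data and the check helpers are hypothetical.
#include <cassert>
#include <cmath>

const int N = 1024;

float __attribute__ ((aligned(16))) a[N];
float __attribute__ ((aligned(16))) b[N];
int   __attribute__ ((aligned(16))) k[N];
float __attribute__ ((aligned(16))) co[12];
float __attribute__ ((aligned(16))) hist[100];

// Reference version: a single fused loop (gather, compute, histogram).
void foo() {
  for (int i = 0; i != N; ++i) {
    float x = co[k[i]];
    float y = a[i] / std::sqrt(x * b[i]);
    ++hist[int(y)];
  }
}

// Blocked version: the middle inner loop is the one expected to vectorize.
void bar() {
  const int S = 8;
  const int loops = N / S;
  float x[S];
  float y[S];
  for (int jj = 0; jj != loops; ++jj) {
    const int j = jj * S;  // start of the current block of S elements
    for (int i = 0; i != S; ++i)
      x[i] = co[k[j + i]];
    for (int i = 0; i != S; ++i)
      y[i] = a[j + i] / std::sqrt(x[i] * b[j + i]);
    for (int i = 0; i != S; ++i)
      ++hist[int(y[i])];
  }
}

// Hypothetical inputs, chosen so that every y lands in [0, 50).
void fill_inputs() {
  for (int i = 0; i != 12; ++i) co[i] = 1.0f + float(i);
  for (int i = 0; i != N; ++i) {
    k[i] = i % 12;
    a[i] = float(i % 50);
    b[i] = 1.0f + float(i % 7);
  }
}

// Runs foo() and bar() on the same inputs and counts the bins that differ.
int count_mismatched_bins() {
  float ref[100];
  fill_inputs();
  for (int i = 0; i != 100; ++i) hist[i] = 0.0f;
  foo();
  for (int i = 0; i != 100; ++i) { ref[i] = hist[i]; hist[i] = 0.0f; }
  bar();
  int bad = 0;
  for (int i = 0; i != 100; ++i)
    if (ref[i] != hist[i]) ++bad;
  return bad;
}

// Aborts if the two versions disagree (only active without -DNDEBUG).
void check_equivalence() {
  assert(count_mismatched_bins() == 0);
}
```

Built without -ffast-math, both functions perform the same operations on each element, so the histograms should match. With -ffast-math the last bits of y may differ between the two versions, and an element sitting exactly on a bin boundary could move to the neighbouring bin.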
Comment 2 Richard Biener 2011-07-26 09:21:16 UTC
The loop is likely completely unrolled; you can disable that with
--param max-completely-peel-times=1.

I think scalar-code vectorization does not handle this right now because
the temporary arrays that would help it have store-motion applied (and
should be later optimized away, but are not).
Comment 3 vincenzo Innocente 2011-07-26 09:38:13 UTC
Thanks Richard,
--param max-completely-peel-times=1
does the trick and, in my real life example, does not have any adverse effect elsewhere
while it speeds up the loop as expected.
More generally, do you think that GCC will ever be able to transform things like foo into bar by itself?
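
A possible per-loop alternative to the global --param, for newer compilers: GCC 8 and later document #pragma GCC unroll n, where a value of 1 blocks unrolling of the loop that follows. The sketch below applies it to the inner loops of bar. This is an assumption rather than something tried in this report: whether the pragma also stops the complete peeling that gets in the vectorizer's way here has not been checked, and the pragma is not available in 4.7.0. The input data and the helpers fill_inputs, filled_histogram_total and check_bar_no_unroll are hypothetical.

```cpp
// Sketch only: bar() with "#pragma GCC unroll 1" on each inner loop.
// Requires GCC 8 or later for the pragma; other compilers may ignore it.
#include <cassert>
#include <cmath>

const int N = 1024;

float __attribute__ ((aligned(16))) a[N];
float __attribute__ ((aligned(16))) b[N];
int   __attribute__ ((aligned(16))) k[N];
float __attribute__ ((aligned(16))) co[12];
float __attribute__ ((aligned(16))) hist[100];

void bar_no_unroll() {
  const int S = 8;
  const int loops = N / S;
  float x[S];
  float y[S];
  for (int jj = 0; jj != loops; ++jj) {
    const int j = jj * S;  // start of the current block of S elements
#pragma GCC unroll 1
    for (int i = 0; i != S; ++i)
      x[i] = co[k[j + i]];
#pragma GCC unroll 1
    for (int i = 0; i != S; ++i)  // the loop that should vectorize
      y[i] = a[j + i] / std::sqrt(x[i] * b[j + i]);
#pragma GCC unroll 1
    for (int i = 0; i != S; ++i)
      ++hist[int(y[i])];
  }
}

// Hypothetical inputs, chosen so that every y lands in [0, 50).
void fill_inputs() {
  for (int i = 0; i != 12; ++i) co[i] = 1.0f + float(i);
  for (int i = 0; i != N; ++i) {
    k[i] = i % 12;
    a[i] = float(i % 50);
    b[i] = 1.0f + float(i % 7);
  }
}

// Fills the histogram once and returns the total number of entries.
float filled_histogram_total() {
  fill_inputs();
  for (int i = 0; i != 100; ++i) hist[i] = 0.0f;
  bar_no_unroll();
  float total = 0.0f;
  for (int i = 0; i != 100; ++i) total += hist[i];
  return total;
}

// Every one of the N elements must end up in exactly one bin.
void check_bar_no_unroll() {
  assert(filled_histogram_total() == float(N));
}
```

The vectorizer report (-ftree-vectorizer-verbose in 4.7, -fopt-info-vec in later releases) remains the way to confirm whether the middle loop was actually vectorized.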
Comment 4 Richard Biener 2011-07-26 09:45:55 UTC
(In reply to comment #3)
> Thanks Richard,
> --param max-completely-peel-times=1
> does the trick and, in my real life example, does not have any adverse effect
> elsewhere
> while it speeds up the loop as expected.
> More generally, do you think that GCC will ever be able to transform things
> like foo into bar by itself?

I hope so ;)  The Graphite framework is supposed to provide us with
this kind of feature.