This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
FYI: gcc2.95* -O3 performance (ILP tests)
- To: gcc at gcc dot gnu dot org
- Subject: FYI: gcc2.95* -O3 performance (ILP tests)
- From: Igor Markov <imarkov at cs dot ucla dot edu>
- Date: Wed, 06 Oct 1999 15:05:28 -0700
- Organization: UCLA, Computer Science
Greetings,
Below are some curious performance results I got while
timing different implementations of a simple loop. I would
appreciate any comments on these, and I also hope that this
info will help improve the optimizing capabilities of gcc.
I am using C++, so the examples below were tried with g++,
but, really, there is not much C++ in them.
The loop is:
for(i=0;i!=size;i++) sum+=buf[i];
The testing program goes like this:
---------------------------------------------------------------------
const int size=1200; // divides by 2 and 3 -- for easier unrolling
const int repeat=1024*1000;
int i,j;
int buf[size];
for(i=0;i!=size;i++) buf[i]=2*i-1;
int sum=0;
Timer tm; // my own class based on getrusage() for precise timing
for(j=0;j!=repeat;j++) for(i=0;i!=size;i++) sum+=buf[i];
tm.stop();
cout << tm << endl;
cout << "Sum = " << sum << endl;
---------------------------------------------------------------------
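The Timer class itself is not shown above; a minimal sketch of a
getrusage()-based user-time timer along the same lines might look like
this (a simplified illustration -- the class and member names here are
mine, not the exact class I use):

```cpp
#include <sys/time.h>
#include <sys/resource.h>
#include <iostream>

// Minimal user-CPU-time timer based on getrusage() (sketch).
// Starts timing on construction; stop() records the elapsed user time.
class Timer {
    double start_;
    double elapsed_;
    static double userSeconds() {
        rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1e-6;
    }
public:
    Timer() : start_(userSeconds()), elapsed_(0.0) {}
    void stop() { elapsed_ = userSeconds() - start_; }
    double seconds() const { return elapsed_; }
};

// Allows "cout << tm" as in the test program above.
std::ostream& operator<<(std::ostream& os, const Timer& t) {
    return os << t.seconds() << " user sec";
}
```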
The compilers I used were
g++2.95 on Linux 2.2.12 (RH6.0) / 2x PII-350
g++2.95.1 on Solaris2.7 /Sun Ultra-10 @300MHz
SunPro CC4.2 (at least a year old) on the same Sun platform
Note that the two processors have relatively similar clock frequencies,
certainly not 2x apart, unlike the performance gaps I see in some tests
(well.. workstation processors are supposed to be slower on int. ops).
I used -O3 -funroll-loops with g++ and -O5 with SunPro CC4.2.
(I am not including CC5.0 comparisons, but its loop optimization
is considerably better than that in CC4.2.)
In all time measurements, fluctuations among different runs were
within a few percent.
The above code takes 4.00 user sec on Pentium with g++2.95
8.929 user sec on Ultra-10 with g++2.95.1
8.73 user sec on Ultra-10 with CC4.2
Now, I am changing the inner loop to this
for(i=0;i!=size;i+=2) { sum+=(buf[i]+buf[i+1]); }
New results 3.63 user sec on Pentium with g++2.95
6.02 user sec on Ultra-10 with g++2.95.1
7.04 user sec on Ultra-10 with CC4.2
Now, I am trying to introduce explicit instruction-level parallelism
(ILP):
------------------------------------------------------------------------
int sum1=0, sum2=0;
for(j=0;j!=repeat;j++)
for(i=0;i!=size;i+=2)
{
sum1+=buf[i];
sum2+=buf[i+1];
}
sum=sum1+sum2;
------------------------------------------------------------------------
New results 3.93 user sec on Pentium with g++2.95
5.04 user sec on Ultra-10 with g++2.95.1
4.91 user sec on Ultra-10 with CC4.2
And, finally, I introduce more ILP:
---------------------------------------------------------------------
int sum1=0, sum2=0, sum3=0;
for(j=0;j!=repeat;j++)
for(i=0;i!=size;i+=3)
{
sum1+=buf[i];
sum2+=buf[i+1];
sum3+=buf[i+2];
}
sum=sum1+sum2+sum3;
---------------------------------------------------------------------
New results 4.01 user sec on Pentium with g++2.95
5.23 user sec on Ultra-10 with g++2.95.1
4.31 user sec on Ultra-10 with CC4.2
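To be explicit about what each measurement compares, here are the four
inner-loop variants as standalone functions (a sketch for checking; the
function names are mine). All four compute the same sum -- only the
grouping of the additions differs:

```cpp
// Variant 1: the plain loop.
int sum_plain(const int* buf, int n) {
    int sum = 0;
    for (int i = 0; i != n; i++) sum += buf[i];
    return sum;
}

// Variant 2: unrolled by 2, single accumulator (n must divide by 2).
int sum_unroll2(const int* buf, int n) {
    int sum = 0;
    for (int i = 0; i != n; i += 2) sum += buf[i] + buf[i + 1];
    return sum;
}

// Variant 3: unrolled by 2, two independent accumulators (explicit ILP).
int sum_ilp2(const int* buf, int n) {
    int sum1 = 0, sum2 = 0;
    for (int i = 0; i != n; i += 2) { sum1 += buf[i]; sum2 += buf[i + 1]; }
    return sum1 + sum2;
}

// Variant 4: unrolled by 3, three independent accumulators (n must divide by 3).
int sum_ilp3(const int* buf, int n) {
    int sum1 = 0, sum2 = 0, sum3 = 0;
    for (int i = 0; i != n; i += 3) {
        sum1 += buf[i]; sum2 += buf[i + 1]; sum3 += buf[i + 2];
    }
    return sum1 + sum2 + sum3;
}
```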
Analysis and questions:
Why do the compilers not perform the simple loop unrolling
I did at first? It seems to be doable automatically (?)
I assume that performance on the Ultra-10 is helped by explicit
ILP because the pipelines are deeper on SPARC processors (maybe
there are more integer op. units). However, g++ clearly misses
out on this improvement.
Finally, I was disappointed that the best optimization for SPARC
actually made things worse for the Pentium. While things like
this are probably bound to happen, this particular case (a very
simple loop) may indicate room for improvement in gcc.
Comments?
Igor
--
Igor Markov office: (310) 206-0179
http://vlsicad.cs.ucla.edu/~imarkov