
FYI: gcc2.95* -O3 performance (ILP tests)



  Greetings,

    Below are some curious performance results I got while
  timing different implementations of a simple loop. I would
  appreciate any comments on them, and I also hope that this
  info will help improve the optimizing capabilities of gcc.
  I am using C++, so the examples below were tried with g++,
  but, really, there is not much C++ in them.
  
    The loop is:
         for(i=0;i!=size;i++) sum+=buf[i]; 

    The testing program goes like this:

---------------------------------------------------------------------  
   const int size=1200;  // divides by 2 and 3 -- for easier unrolling
   const int repeat=1024*1000;
   int i,j;
   int buf[size];
   for(i=0;i!=size;i++) buf[i]=2*i-1;
   int sum=0;

   Timer tm;  // my own class based on getrusage() for precise timing
     for(j=0;j!=repeat;j++) for(i=0;i!=size;i++) sum+=buf[i];
   tm.stop();
   cout << tm << endl;
   cout << "Sum = " << sum << endl;
---------------------------------------------------------------------
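
   For reference, here is a self-contained version of the test,
   with a minimal getrusage()-based timer standing in for my
   Timer class (not shown here):

---------------------------------------------------------------------
#include <iostream>
#include <sys/time.h>
#include <sys/resource.h>

// Minimal stand-in for my Timer class: user CPU time via getrusage().
static double user_seconds()
{
   struct rusage ru;
   getrusage(RUSAGE_SELF, &ru);
   return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec/1e6;
}

int main()
{
   const int size=1200;  // divides by 2 and 3 -- for easier unrolling
   const int repeat=1024*1000;
   int i,j;
   int buf[size];
   for(i=0;i!=size;i++) buf[i]=2*i-1;
   int sum=0;

   double start=user_seconds();
   for(j=0;j!=repeat;j++) for(i=0;i!=size;i++) sum+=buf[i];
   double elapsed=user_seconds()-start;

   std::cout << elapsed << " user sec" << std::endl;
   std::cout << "Sum = " << sum << std::endl;  // printing sum keeps the loop live
   return 0;
}
---------------------------------------------------------------------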

   The compilers I used were

    g++ 2.95       on Linux 2.2.12 (RH 6.0) / 2 x PII-350
    g++ 2.95.1     on Solaris 2.7           / Sun Ultra-10 @ 300MHz
    SunPro CC 4.2  (at least a year old) on the same Sun platform

    Note that the two processors run at fairly similar clock
    frequencies, certainly not 2x apart the way the performance
    is in some of these tests (well, workstation processors are
    supposed to be slower on integer ops).

    I used -O3 -funroll-loops with g++ and -O5 with SunPro CC 4.2.
    (I am not including CC 5.0 comparisons here, but its loop
     optimization is considerably better than that of CC 4.2.)

    In all time measurements, fluctuations between different runs
    were within a few percent.
 
    The above code takes  4.00  user sec on Pentium with g++2.95
                          8.929 user sec on Ultra-10 with g++2.95.1
                          8.73  user sec on Ultra-10 with CC4.2
  
    Now I change the inner loop to sum adjacent pairs (the
    parentheses let buf[i]+buf[i+1] be computed independently
    of the running sum):

     for(i=0;i!=size;i+=2) { sum+=(buf[i]+buf[i+1]); }
     
    New results           3.63 user sec on Pentium with  g++2.95
                          6.02  user sec on Ultra-10 with g++2.95.1
                          7.04  user sec on Ultra-10 with CC4.2

    Now I try to introduce explicit instruction-level parallelism
    (ILP) by splitting the sum into two independent accumulators,
    so that consecutive adds no longer depend on each other:

------------------------------------------------------------------------
     int sum1=0, sum2=0;
     for(j=0;j!=repeat;j++) 
     for(i=0;i!=size;i+=2)
       {
         sum1+=buf[i];
         sum2+=buf[i+1];
       }
     sum=sum1+sum2;  
------------------------------------------------------------------------

    New results           3.93 user sec on Pentium  with g++2.95
                          5.04  user sec on Ultra-10 with g++2.95.1
                          4.91  user sec on Ultra-10 with CC4.2

    And, finally, I introduce more ILP:

---------------------------------------------------------------------
     int sum1=0, sum2=0, sum3=0;
     for(j=0;j!=repeat;j++)
       for(i=0;i!=size;i+=3)
       {
         sum1+=buf[i];
         sum2+=buf[i+1];
         sum3+=buf[i+2];
       }
     sum=sum1+sum2+sum3;
---------------------------------------------------------------------

    New results           4.01 user sec on Pentium  with g++2.95
                          5.23  user sec on Ultra-10 with g++2.95.1
                          4.31  user sec on Ultra-10 with CC4.2

    Analysis and questions:

        Why do the compilers not perform the simple loop unrolling
        I did at first? It seems to be doable automatically.
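
        For comparison, here is a sketch of (roughly) what plain
        unrolling gives: the body is replicated, but every add
        still goes through the single accumulator, so each add
        waits for the previous one. The variants above instead
        reassociate the sum into independent chains, which the
        compilers apparently do not do on their own.

---------------------------------------------------------------------
     // Sketch: unrolling without reassociation. The four adds form
     // one serial dependence chain through sum, so unrolling alone
     // removes loop overhead but exposes no ILP.
     for(i=0;i!=size;i+=4)
     {
       sum+=buf[i];    // each add needs the sum from the line above
       sum+=buf[i+1];
       sum+=buf[i+2];
       sum+=buf[i+3];
     }
---------------------------------------------------------------------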
 
        I assume that performance on the Ultra-10 is helped by
        explicit ILP because the pipelines are deeper on SPARC
        processors (maybe there are also more integer units).
        However, g++ clearly misses out on this improvement.

        Finally, I was disappointed that the best optimization for
        SPARC actually makes things worse on the Pentium. While
        things like this are probably bound to happen, this
        particular case (a very simple loop) may indicate room for
        improvement in gcc.
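
        For completeness, a minimal sketch of keeping one source
        while picking the faster variant per platform (assuming
        the compilers predefine __sparc__ or __sparc on SPARC, as
        g++ and SunPro CC do):

---------------------------------------------------------------------
#if defined(__sparc__) || defined(__sparc)
     // three independent accumulators: fastest variant on the Ultra-10
     int sum1=0, sum2=0, sum3=0;
     for(i=0;i!=size;i+=3)
     {
       sum1+=buf[i];
       sum2+=buf[i+1];
       sum3+=buf[i+2];
     }
     sum=sum1+sum2+sum3;
#else
     // pairwise summation: fastest variant on the PII
     for(i=0;i!=size;i+=2) sum+=(buf[i]+buf[i+1]);
#endif
---------------------------------------------------------------------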

   Comments?

								Igor          
-- 
  Igor Markov  office: (310) 206-0179   
  http://vlsicad.cs.ucla.edu/~imarkov

