This is the mail archive of the
mailing list for the GCC project.
Re: food for optimizer developers
- From: Vladimir Makarov <vmakarov at redhat dot com>
- To: "Ralf W. Grosse-Kunstleve" <rwgk at yahoo dot com>
- Cc: gcc at gcc dot gnu dot org
- Date: Wed, 11 Aug 2010 17:04:50 -0400
- Subject: Re: food for optimizer developers
- References: <email@example.com>
On 08/10/2010 09:51 PM, Ralf W. Grosse-Kunstleve wrote:
To get a full picture, it would be nice to see icc times too.
I wrote a Fortran to C++ conversion program that I used to convert selected
LAPACK sources. Comparing runtimes with different compilers I get:
compiler          time     relative to ifort
ifort 11.1.072    1.790s   1.00
gfortran 4.4.4    2.470s   1.38
g++ 4.4.4         2.922s   1.63
This is under Fedora 13, 64-bit, 12-core Opteron 2.2GHz
I think it is more important (and harder) to make gfortran closer to ifort.
All files needed to easily reproduce the results are here:
See the README file or the example commands below.
- Is there a way to make the g++ version as fast as ifort?
I cannot say anything about your fragment of LAPACK. But about 15 years
ago I worked on manual LAPACK optimization for an Alpha processor. As I
remember, LAPACK is quite a memory-bound benchmark. The hottest spot was
matrix multiplication, which is used in many places in LAPACK. The
matrix multiplication in LAPACK is already moderately optimized by using
a temporary variable, which makes it 1.5 times faster than the normal
algorithm (if the cache is not large enough to hold the matrices). But
proper loop optimizations (mostly tiling) could improve it by more than
4 times.
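
To make the difference between the two approaches concrete, here is a
minimal C++ sketch. It is an illustration only, not LAPACK or BLAS code:
the names matmul_temp and matmul_tiled, the row-major layout, and the
tile parameter are all assumptions made for this example, and the
accumulator shown is just one common form of the temporary-variable
trick. The tiled version works on tile x tile blocks so that the data
touched by the inner loops can stay in cache.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Row-major n x n matrix in a flat vector (assumption for this sketch;
// Fortran LAPACK uses column-major storage).
using Matrix = std::vector<double>;

// Plain triple loop that accumulates each result element in a temporary,
// so the running sum does not have to be reloaded from c on every step.
void matmul_temp(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;  // temporary accumulator
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Tiled (blocked) version: the three outer loops walk over blocks, the
// three inner loops do the multiplication inside one block combination.
// tile must be greater than zero.
void matmul_tiled(const Matrix& a, const Matrix& b, Matrix& c,
                  std::size_t n, std::size_t tile) {
    std::fill(c.begin(), c.end(), 0.0);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double aik = a[i * n + k];  // reused across j
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product; only the order of memory
accesses differs. A good tile size depends on the cache sizes of the
target machine and has to be found by measurement.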
So I guess and hope graphite project finally will improve LAPACK by
After solving the memory-bound problem, loop vectorization is another
important optimization which could improve LAPACK. Unfortunately, GCC
vectorizes fewer loops than ifort (about two times fewer the last time
I checked). I did not analyze the reason for this.
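
As a small illustration of what makes a loop a vectorization candidate,
here is a hedged C++ sketch with two hypothetical functions. The first
has independent iterations and is the kind of loop a vectorizer can
usually handle; the second has a loop-carried dependence and cannot be
vectorized directly. The names are made up for this example,
__restrict is a compiler extension (accepted by g++) rather than
standard C++, and whether a particular compiler version really
vectorizes either loop has to be checked on its output.

```cpp
#include <cstddef>

// y[i] += alpha * x[i]. Every iteration is independent of the others,
// and __restrict promises the compiler that x and y do not overlap.
void daxpy(std::size_t n, double alpha,
           const double* __restrict x, double* __restrict y) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}

// In-place running sum. Iteration i needs the result of iteration i - 1,
// so the iterations cannot simply be executed several at a time.
void prefix_sum(std::size_t n, double* y) {
    for (std::size_t i = 1; i < n; ++i) {
        y[i] += y[i - 1];
    }
}
```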
After solving the vectorization problem, another important lower-level
loop optimization is modulo scheduling (even though modern x86/x86_64
processors are out of order), because OOO processors can look only
through a few branches. And as I remember, the Intel compiler does
modulo scheduling frequently. GCC's modulo scheduling is quite
constrained.
Those are my thoughts, but I might be wrong because I have no time to
confirm my speculations. If you really want to help GCC developers, you
could make a comparative analysis of the code generated by ifort and
gfortran and find which optimizations GCC misses. GCC has few resources,
and the developers who could solve these problems are very busy. Intel's
optimizing compiler team (besides researchers) is much bigger than the
whole GCC community. Taking this into account, and that they have much
more information about their processors, I don't think gfortran will
generate better or equal code for floating-point benchmarks in the near
future.