This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: [OT] GCC vs Intel C++ compiler benchmark
On Sunday 27 January 2002 23:50, Andreas Jaeger wrote:
> Does anybody have a list of optimizations that are in icc and inferior
> or missing in GCC - and would help to improve performance of GCC on
> common platforms?
>
> Andreas

When generating sse code, gcc does not optimize the situation where a
"reverse" subtract or divide is called for. icc chooses to generate x87 code
sequences which involve fdivr or fsubr, rather than generate sse[2] code with
an additional move. A better solution, where loop unrolling might be used to
resolve the situation, would be to alternate register assignments over pairs
of loop iterations, when that permits use of efficient sse code. No doubt,
other architectures could benefit from such an optimization.
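
For illustration, a minimal sketch of the kind of loop I have in mind. The function names and signatures are made up for this example and do not come from any benchmark. The value loaded from memory is the divisor (or subtrahend), and since the two-operand sse divide and subtract overwrite their first operand, the loop-invariant value has to be copied into a scratch register on each iteration. x87 fdivr and fsubr avoid that copy.

```c
#include <stddef.h>

/* Hypothetical example: replace each element by c / x[i].
 * The element loaded from memory is the divisor, so this is the
 * "reverse" divide case: c must survive across iterations, yet the
 * two-operand sse divide destroys its dividend. */
void reverse_divide(float *x, float c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = c / x[i];
}

/* Same situation for subtraction: the loaded element is the subtrahend. */
void reverse_subtract(float *x, float c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = c - x[i];
}
```

Whether a given compiler version actually emits the extra move for these loops would need to be checked in the generated assembly.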

gcc has no ability to generate parallel sse instructions. There may be ways
to do this beyond the one which icc uses, which is to recognize repeated
operations in unrolled loops which are eligible for auto-vectorization. icc
has an option to force unrolling even when the size of the loop exceeds the
normal threshold, in order to facilitate auto-vectorization.
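
As a rough sketch of what "repeated operations in unrolled loops" looks like, here is a loop unrolled by four by hand. The name add_unrolled4 is hypothetical, and the arrays are assumed not to overlap. The four adds in the main loop body act on adjacent elements and are independent of each other, which is the pattern that could in principle be replaced by a single parallel sse add.

```c
#include <stddef.h>

/* Hypothetical example: c[i] = a[i] + b[i], unrolled by four.
 * Assumes c does not overlap a or b. */
void add_unrolled4(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Four independent adds on adjacent elements per iteration:
     * a candidate for one packed (parallel) sse add. */
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```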

Several standard optimizations are lacking in both icc and gcc. Loop fission
or fusion should be considered in order to approach (not exceed) optimum use
of the register set and the associativity of the write buffering system. On
pentiumpro architectures, associativity for write buffering is 4
(documentation available on developer.intel.com); the pentium4 equivalent
limit is 6. Loops should be arranged so that the number of array sections
written approaches this value without exceeding it, where that is feasible.
For nested
loops, outer loop unrolling may be needed, keeping inner loop unrolling to
the minimum which is used effectively for parallelization. Outer loop
unrolling may reduce the need for consideration of loop nest inversions.
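
A minimal sketch of fission applied to the write-buffer limit, assuming the limit of 4 quoted above for pentiumpro. The function names and the arithmetic are invented for illustration, and the arrays are assumed not to overlap. The first version writes six arrays in one loop; the second splits it so that no loop writes more than four.

```c
#include <stddef.h>

/* Hypothetical original: six arrays are written in a single loop,
 * which exceeds a write-buffer limit of 4. */
void update_six_fused(float *p, float *q, float *r, float *s,
                      float *t, float *u, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        p[i] = x[i] + 1.0f;
        q[i] = x[i] * 2.0f;
        r[i] = x[i] - 1.0f;
        s[i] = x[i] * x[i];
        t[i] = x[i] * 0.5f;
        u[i] = -x[i];
    }
}

/* After fission: the first loop writes four arrays (reaching the
 * assumed limit of 4 without exceeding it), the second writes two.
 * The price is that x is read twice. */
void update_six_split(float *p, float *q, float *r, float *s,
                      float *t, float *u, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        p[i] = x[i] + 1.0f;
        q[i] = x[i] * 2.0f;
        r[i] = x[i] - 1.0f;
        s[i] = x[i] * x[i];
    }
    for (size_t i = 0; i < n; i++) {
        t[i] = x[i] * 0.5f;
        u[i] = -x[i];
    }
}
```

Whether the split version is actually faster depends on the machine and on how expensive the second pass over x is; that would have to be measured.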

Loop fusion by the compiler has proven difficult to use effectively in C, but
is essential in Fortran. Examples which come to mind are the MipsPro and the
IA64 compilers. I don't know that fission has been explored adequately; the
MipsPro compilers seemed always to split loops as much as possible and then
recombine them, possibly in a different way from the original source, which
may not be a good strategy for C.
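
For reference, a small sketch of what fusion means here. The function names are hypothetical. The two versions give the same result only if y, z and x do not overlap. Having to prove that for C pointers is, I would guess, one reason fusion is harder to apply in C than in Fortran, though that is my assumption and not something established above.

```c
#include <stddef.h>

/* Hypothetical original: two separate passes. y is written in full
 * by the first loop and read back in full by the second. */
void scale_offset_separate(float *y, float *z, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = 2.0f * x[i];
    for (size_t i = 0; i < n; i++)
        z[i] = y[i] + 1.0f;
}

/* After fusion: one pass, and y[i] can stay in a register between
 * its definition and its use. Equivalent to the version above only
 * when the three arrays do not overlap. */
void scale_offset_fused(float *y, float *z, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = 2.0f * x[i];
        z[i] = y[i] + 1.0f;
    }
}
```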