This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: [OT] GCC vs Intel C++ compiler benchmark

From: Tim Prince <tprinceusa at mindspring dot com>
To: Andreas Jaeger <aj at suse dot de>,tprince at computer dot org
Cc: Claus Fischer <claus dot fischer at clausfischer dot com>,gcc at gcc dot gnu dot org
Date: Mon, 28 Jan 2002 05:58:57 -0800
Subject: Re: [OT] GCC vs Intel C++ compiler benchmark
References: <20020127124821.A25764@clausfischer.com> <E16UzeT-0002ev-00@mclean.mail.mindspring.net> <ho8zaisylf.fsf@gee.suse.de>
Reply-to: tprince at computer dot org

On Sunday 27 January 2002 23:50, Andreas Jaeger wrote:

> Does anybody have a list of optimizations that are in icc and inferior
> or missing in GCC - and would help to improve performance of GCC on
> common platforms?
>
> Andreas

When generating SSE code, gcc does not optimize the case where a "reverse" 
subtract or divide is called for, that is, where the value already in the 
destination register is the subtrahend or divisor.  icc chooses to generate 
x87 code sequences which use fdivr or fsubr, rather than generate SSE[2] 
code with an additional move.  A better solution, where loop unrolling can 
be used to resolve the situation, would be to alternate register assignments 
over pairs of loop iterations, when that permits use of efficient SSE code.  
No doubt other architectures could benefit from such an optimization.
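
A rough sketch of the alternating-assignment idea, written at the source 
level.  The function names are made up for illustration, and C source cannot 
dictate register allocation, so this only shows the shape a compiler might 
aim for.  In the plain loop the carried value x is the divisor and is 
replaced by the quotient, which is the "reverse" case for a two-operand 
divide.  In the unrolled form the carried value alternates between two names, 
so each quotient can be formed in the register that just received the load:

```c
#include <stddef.h>

/* Plain form: the carried value x is the divisor and is then
 * overwritten by the quotient (a "reverse" divide). */
double chain_div(const double *a, size_t n, double x)
{
    for (size_t i = 0; i < n; i++)
        x = a[i] / x;
    return x;
}

/* Unrolled by two: the carried value lives alternately in x0 and x1.
 * Each quotient is written over the freshly loaded dividend, so no
 * copy back into a fixed "x" is needed between iterations. */
double chain_div_unrolled(const double *a, size_t n, double x0)
{
    double x1;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        x1 = a[i] / x0;      /* dividend a[i] becomes the new carried value */
        x0 = a[i + 1] / x1;  /* roles swap back on the second iteration */
    }
    if (i < n)               /* odd trip count: one leftover iteration */
        x0 = a[i] / x0;
    return x0;
}
```

Both functions perform the same divisions in the same order, so they return 
identical results; only the naming of the carried value differs.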

gcc has no ability to generate parallel (packed) SSE instructions.  There 
may be ways to do this beyond the one icc uses, which is to recognize 
repeated operations in unrolled loops that are eligible for 
auto-vectorization.  icc has an option to force unrolling even when the size 
of the loop exceeds the normal threshold, in order to facilitate 
auto-vectorization.
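
To make "parallel SSE instructions" concrete, here is a hand-written sketch 
of the packed form that a vectorizer would be aiming for when it sees four 
identical operations in an unrolled loop.  It uses the _mm_loadu_ps, 
_mm_add_ps and _mm_storeu_ps intrinsics from <xmmintrin.h>; the function 
names are hypothetical, and the __SSE__ guard is only there so the sketch 
still builds on targets without SSE:

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Scalar reference: one add per element. */
void add_scalar(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Packed form: four adds per instruction where SSE is available,
 * followed by a scalar loop for the leftover elements. */
void add_packed(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; i++)                     /* remainder, or all of it without SSE */
        c[i] = a[i] + b[i];
}
```

The unaligned load and store variants are used so that the sketch makes no 
assumption about how the arrays are aligned.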

Several standard optimizations are lacking in both icc and gcc.  Loop fission 
or fusion should be considered in order to approach (not exceed) optimum use 
of the register set and the associativity of the write buffering system.  On 
Pentium Pro architectures, the associativity for write buffering is 4 
(documentation is available on developer.intel.com); the equivalent 
Pentium 4 limit is 6.  Loops should be arranged so that the number of array 
sections written comes close to this value without exceeding it, where that 
is feasible.  For nested loops, outer loop unrolling may be needed, keeping 
inner loop unrolling to the minimum that is used effectively for 
parallelization.  Outer loop unrolling may also reduce the need to consider 
loop nest inversions.
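
As an illustration of fission driven by the write-buffer limit, the sketch 
below splits a loop that writes six arrays into one loop that writes four 
and one that writes two, which would suit the limit of 4 quoted above.  The 
function names and the arithmetic are invented for the example; whether the 
split pays off on a given machine would have to be measured:

```c
#include <stddef.h>

/* Original shape: six output streams written in a single loop. */
void six_streams(size_t n, const double *x,
                 double *a, double *b, double *c,
                 double *d, double *e, double *f)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = x[i] + 1.0;
        b[i] = x[i] * 2.0;
        c[i] = x[i] - 3.0;
        d[i] = x[i] * x[i];
        e[i] = -x[i];
        f[i] = x[i] * 0.5;
    }
}

/* After fission: no loop writes more than four arrays.  This assumes
 * the output arrays do not overlap x or each other, so the order of
 * the stores does not matter. */
void four_plus_two(size_t n, const double *x,
                   double *a, double *b, double *c,
                   double *d, double *e, double *f)
{
    for (size_t i = 0; i < n; i++) {   /* four write streams */
        a[i] = x[i] + 1.0;
        b[i] = x[i] * 2.0;
        c[i] = x[i] - 3.0;
        d[i] = x[i] * x[i];
    }
    for (size_t i = 0; i < n; i++) {   /* two write streams */
        e[i] = -x[i];
        f[i] = x[i] * 0.5;
    }
}
```

The cost of the split is that x is read twice, so it trades extra loads for 
fewer simultaneous write streams.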

Loop fusion by the compiler has proven difficult to use effectively in C, but 
is essential in Fortran.  Examples which come to mind are the MIPSpro and the 
IA64 compilers.  I don't know that fission has been explored adequately; the 
MIPSpro compilers always seemed to split loops as much as possible and then 
recombine them, possibly in a different way from the original source, and 
that may not be a good strategy for C.
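
For reference, a minimal sketch of what fusion does to a pair of adjacent 
loops.  The function names are hypothetical.  The restrict qualifiers are an 
assumption added for the sketch: without some such guarantee that the arrays 
do not overlap, a C compiler cannot in general prove that the two forms 
compute the same thing:

```c
#include <stddef.h>

/* Two separate passes: t is written in full, then read back. */
void scale_add_separate(size_t n,
                        const double *restrict a,
                        const double *restrict b,
                        double *restrict t,
                        double *restrict c)
{
    for (size_t i = 0; i < n; i++)
        t[i] = 2.0 * a[i];
    for (size_t i = 0; i < n; i++)
        c[i] = t[i] + b[i];
}

/* Fused: one pass, and t[i] can be reused while it is still in a
 * register.  Valid only because the arrays are declared not to overlap. */
void scale_add_fused(size_t n,
                     const double *restrict a,
                     const double *restrict b,
                     double *restrict t,
                     double *restrict c)
{
    for (size_t i = 0; i < n; i++) {
        t[i] = 2.0 * a[i];
        c[i] = t[i] + b[i];
    }
}
```

Note that the fused loop writes two arrays at once where each original loop 
wrote one, so fusion and the write-stream limit pull in opposite directions.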
