Bug 21628 - GCC much slower than ICL. Lack of inlining?
Summary: GCC much slower than ICL. Lack of inlining?
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 3.4.1
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-05-17 15:43 UTC by laurent
Modified: 2008-09-28 07:59 UTC
CC List: 3 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description laurent 2005-05-17 15:43:12 UTC
I first posted this problem at gcc-help@gcc.gnu.org. I was advised to post my 
problem here.

I have a program with a great many inline template functions.
It is essential for execution speed that every (or almost every) function 
marked as inline is actually inlined by the compiler.

I have already compiled the program with the Intel Compiler (ICL) under Visual C++, and it 
runs correctly and fast. I verified that the functions are then actually inlined.

But with GCC 3.4.X (Linux & Cygwin) the same program is much slower (5-20 times)
than the version compiled with ICL. The '-Winline' option of GCC shows me that 
many functions are not inlined as they should be.

The compiler treats the 'inline' keyword as a hint and does not follow it. 
I tried setting various GCC options, but nothing has been satisfactory so far: 
-finline-limit 100000000 --param large-function-growth=1000000 
--param max-inline-insns-single=1000000 ...

I am convinced that the poor performance is due to the lack of inlining, because 
I also get slow execution with ICL when the functions are not marked 
as 'inline'. With the '-Winline' option of GCC, I can see every function that 
is not inlined.

Also the SSE mode of the following test program should be much quicker than 
without SIMD, but requires much more inlining. ICL manages it, GCC not at all.

Do you know a way to force GCC to obey the inline keyword, or to increase 
the limits that the compiler has internally? Or do you have an alternative?


It is not possible to give a small test program. If you want to test this 
yourself, I suggest you download my library from the address below and compile the 
following test. (No need to compile the library, it is header-only like the STL.) 
http://www.ient.rwth-aachen.de/team/laurent/genial/genial.html

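#define FFT_LEVEL 32
#include "signal/fft.h"
int main()
{
  // Input and output vectors of 32 complex samples, initialized to 0
  DenseVector<complex<float> >::self X(32,0);
  DenseVector<complex<float> >::self Y(X.size(),0);
  double t0=get_time();
  // Run the FFT one million times and print the elapsed time
  for (int i=0; i<1000000; ++i)
    fft(X,Y);
  cout << get_time()-t0 << endl;
}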

Execution time on a Pentium 4, 3.2 GHz:
With ICL on Windows:
- No SIMD: 0.368s
- SSE: 0.126s
- SSE3: 0.112s
With GCC 3.4 on Cygwin/Linux (-O3 -msse3 -UWIN32 -ftemplate-depth-36 -lstlport):
- No SIMD: 0.969s
- SSE: 2.069s

For more information, contact me by email (see home page).

Thanks
Comment 1 Manuel López-Ibáñez 2007-11-16 15:49:33 UTC
What does -Winline say?

Have you tried with always_inline? Example:

     /* Prototype.  */
     inline void foo (const char) __attribute__((always_inline));

See 

  http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html

and

  http://gcc.gnu.org/onlinedocs/gcc/Inline.html
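
As a minimal sketch of the suggestion above, the attribute can also be attached to an inline template function, which is the kind of code the report is about. This assumes GCC or a compiler that accepts the same attribute syntax; the names `scale_add` and `axpy4` are hypothetical and not taken from the reporter's library.

```cpp
// Hypothetical helper: "inline" is kept and always_inline is added,
// asking GCC to inline this function regardless of its usual size limits.
template <typename T>
__attribute__((always_inline)) inline T scale_add(T a, T x, T y)
{
    return a * x + y;
}

// Hypothetical caller: each call to scale_add in the loop body is
// expected to be expanded in place rather than emitted as a call.
float axpy4(float a, const float* x, const float* y)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        sum += scale_add(a, x[i], y[i]);
    return sum;
}
```

Compiling with -Winline alongside this should still report any plain `inline` function that GCC declines to inline, so the two can be used together to find the remaining cases.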

Comment 2 laurent 2007-11-16 17:46:43 UTC
(In reply to comment #1)
> What does -Winline say?
> 
> Have you tried with always_inline? Example:
> 
>      /* Prototype.  */
>      inline void foo (const char) __attribute__((always_inline));
> 
Wow, I posted this report quite a while ago!

When I posted it, GCC was at version 3.x.
"-Winline" said that many functions were not inlined despite the presence of the 'inline' keyword.
Yes, I did try "__attribute__((__always_inline__))". 

But since version 4.2, GCC at least seems to respect this attribute! 
This was a great improvement for me; I had really been waiting for this feature.

I once found a page where a very important person in the Linux world (I cannot remember who now, Linux Toward probably) complained about the lack of inlining in the Linux kernel, that there was no way to force GCC, etc.
I am glad that this person was heard by the GCC developers.

It greatly improved the performance of my library compiled with GCC.
But honestly, ICL (Intel Compiler for Windows) is still much better at optimisation.
Comment 3 Richard Biener 2007-11-16 18:00:31 UTC
Note that for completely inlining kernels you can use the __attribute__((flatten))
on the *calling* function.  Usually with expression templates that is the function
containing the loops, like

void __attribute__((flatten)) doit()
{
  for (;;)
    lots_of_calls_to_inline ();
}

and it will make sure to inline all calls done in doit (recursively, so no calls
will be left in the final version).  Also starting with GCC 4.2 (and much
improved on trunk which will become 4.3) using profile-feedback will
improve inline performance a lot (use -fprofile-generate, run, -fprofile-use).
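
A slightly fuller sketch of the flatten approach described above, assuming GCC 4.1 or later (or another compiler that accepts the attribute). The helper names are hypothetical; note that the helpers themselves carry no inline keyword or attribute, only the calling function is marked.

```cpp
// Hypothetical small helpers with no inlining hints of their own.
static float square(float v)
{
    return v * v;
}

static float accumulate_square(float acc, float v)
{
    return acc + square(v);
}

// flatten is placed on the *calling* function: GCC is asked to inline
// every call in its body, including calls exposed by that inlining,
// where this is possible.
__attribute__((flatten))
float sum_of_squares(const float* data, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc = accumulate_square(acc, data[i]);
    return acc;
}
```

The design trade-off is that one attribute on the loop-carrying function replaces annotations on every helper, at the cost of larger code for that one function.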

I'll close this bug as worksforme as it doesn't have a useful testcase and
from my experience with tramp3d-v4 performance of ICC sucks compared to
GCC because ICC inlines too little ;)
Comment 4 Manuel López-Ibáñez 2007-11-16 18:01:55 UTC
(In reply to comment #2)

> Whaow, I have posted this report for a while...!!!

I guess this report fell through the cracks of bugzilla. Reporting the status on new versions of GCC would have probably helped.

> But Since version 4.2, GCC seems to respect this attribute, at least!!! 
> This was a great improvement for me, I have really waited for this feature.

OK. So this is fixed. Thanks for the report nonetheless. And sorry for the delay.

> But honestly ICL (Intel Compiler for Windows) is still much better in
> optimisations.

But we are better in freedom. ;-)
Comment 5 Paolo Carlini 2007-11-16 18:03:40 UTC
(In reply to comment #2)
> I once found a page, where a very important person in the Linux world (cannot
> remember who now, Linux Toward probably) complained about the lack of inlining
> in linux-Kernel, that there were no way to force GCC, etc...
> I am glad that this person was heard by GCC developers...

I don't think all the inlining improvements (many) can be traced back to any specific individual complaining (not even Linus Torvalds ;)

> It improved a lot the performance of my library compiled with GCC.
> But honestly ICL (Intel Compiler for Windows) is still much better in
> optimisations.

Details would certainly be welcome. Ideally, a reduced snippet, to persuade the optimization people to take action reasonably quickly...
Comment 6 laurent 2007-11-16 20:42:49 UTC
> Note that for completely inlining kernels you can use the
> __attribute__((flatten))
> on the *calling* function.  Usually with expression templates that is the
> function
> containing the loops, like
> void __attribute__((flatten)) doit()
> {
>   for (;;)
>     lots_of_calls_to_inline ();
> }
> and it will make sure to inline all calls done in doit (recursively, so no
> calls
> will be left in the final version).  Also starting with GCC 4.2 (and much
> improved on trunk which will become 4.3) using profile-feedback will
> improve inline performance a lot (use -fprofile-generate, run, -fprofile-use).
Good to know! Thanks for the advice!

> But we are better in freedom. ;-)
Much better!

> OK. So this is fixed. Thanks for the report nonetheless. And sorry for the
> delay.
No problem. Thanks to all of you.

> I don't think all the inlining improvements (many) can be traced back to any
> specific individual complaining (not even Linus Torvalds ;)
(Oops! Sorry for having misspelled the name of Linus Torvalds!)
You are most probably right. I was nevertheless happy to notice that I was not the only one complaining about the problem. 

> Details would be certainly welcome. Ideally, a reduced snippet, to pursue the
> optimization people to take action reasonably quickly...
Hmm, difficult. 
I just sometimes compare the execution speed of numerical calculations built with different compilers (ICL, VC2005, GCC), and ICL is often quicker by maybe 10%.
If I find more specific and simpler examples, I'll post them.

I especially appreciate the way GCC reports compilation errors from deeply nested templates. I could not have programmed deeply nested template expressions with the complicated error messages from ICL or VC2005.

I have to say that ICL has apparently stopped respecting the __forceinline directive since versions 9 and 10; for me this is a clear regression.
I do not know exactly what changed in these latest versions, but I do not want to give up my good old version 8.1.

Thanks
Comment 7 laurent 2008-09-27 11:40:11 UTC
Hello

I am reopening the discussion because I noticed a problem related to "__attribute__((__always_inline__))" when I tried to compile my library as a DLL.

GCC now forces inlining well, and is now as quick as ICL for my generic implementation of the FFT (Fast Fourier Transform).
So I would like to progressively make GCC my favorite compiler.

GCC works fine if I use "inline" as usual (but my program is slow).
But I get hundreds of error messages if I use the macro "#define inline __attribute__((__always_inline__))". For example:

obj\Release\src\copy.o:copy.cpp:(.text+0x0): first defined here
obj\Release\src\dct.o:dct.cpp:(.text+0xe): multiple definition of `_ferror'
obj\Release\src\copy.o:copy.cpp:(.text+0xe): first defined here
obj\Release\src\dct.o:dct.cpp:(.text+0x1c): multiple definition of `operator new(unsigned int, void*)'
obj\Release\src\copy.o:copy.cpp:(.text+0x1c): first defined here

I use the Code::Blocks environment and MinGW GCC 4.3.1 (downloaded from http://www.tdragon.net/recentgcc/).
My problem might be related to the following bug report:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37121

Any clue?

--------------------------------
Paolo asked me above whether I could give a test case where ICL is quicker than GCC. I will publish the new version 2.2 of my library in a couple of days (http://www.ient.rwth-aachen.de/~laurent/genial/genial.html).
If you are still interested, I could then give a small test case using my library where ICL is much quicker than GCC and VC2008 (but I don't care much about VC2008).



Comment 8 Richard Biener 2008-09-28 02:11:26 UTC
Try
  #define inline inline __attribute__((always_inline))
instead.  The inline keyword changes linkage, so you have to keep it.
If you keep having problems open a new bugreport please, the performance
issue seems to be still solved.
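
A hedged sketch of the same idea without redefining the `inline` keyword itself. With the original macro, every function in an included header that was written with `inline` lost the keyword, so each object file presumably received an ordinary external definition, which would explain the "multiple definition" linker errors. The macro name `FORCE_INLINE` and the functions below are hypothetical; the `_MSC_VER` branch assumes a compiler that provides `__forceinline`.

```cpp
// Hypothetical macro: the GCC branch keeps "inline" next to the
// attribute, so a definition in a header may appear in several
// translation units without multiple-definition errors at link time.
#if defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#else
#  define FORCE_INLINE inline
#endif

// Would normally live in a header shared by several .cpp files.
FORCE_INLINE int twice(int v)
{
    return 2 * v;
}

// Ordinary function using the helper; both calls to twice are
// expected to be inlined here on GCC.
int quadruple(int v)
{
    return twice(twice(v));
}
```

Using a separate macro name leaves the `inline` keyword in system and STL headers untouched, which avoids changing code that was never meant to be force-inlined.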
Comment 9 laurent 2008-09-28 07:59:59 UTC
(In reply to comment #8)
> Try
>   #define inline inline __attribute__((always_inline))
> instead.  The inline keyword changes linkage, so you have to keep it.
> If you keep having problems open a new bugreport please, the performance
> issue seems to be still solved.
> 

Thank you! It works. 
Sorry for my question.

I had previously noticed another problem with "__attribute__((always_inline))".
I will write a bug report in a few days, as soon as I can reproduce it with a small test case.
The bug actually occurs at a few places in the STL, with an error message somewhat like "sorry unimplemented, could not inline". I could temporarily work around it with a few changes in the STL. 

Are you interested if I post a new report for a performance issue in comparison to ICL? This performance issue is the only one I know of that still prevents me from preferring GCC to ICL.