This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.

Re: template classes faster than derived classes?

From: "John S. Fine" <johnsfine at verizon dot net>
To: Nava Whiteford <new at sgenomics dot org>
Cc: Brian Budge <brian dot budge at gmail dot com>, gcc-help at gcc dot gnu dot org
Date: Tue, 24 Nov 2009 16:22:51 -0500
Subject: Re: template classes faster than derived classes?
References: <20091124194416.GB7575@sgenomics.org> <5b7094580911241155s7d36b729v92c2046170da897d@mail.gmail.com> <20091124201919.GA7719@sgenomics.org> <4B0C4132.90306@verizon.net> <20091124205311.GB7719@sgenomics.org>

Nava Whiteford wrote:

> In this case the templated version doesn't seem to have the same huge
> advantage: 20.73s templated against 21.1s for the classed version. I would
> guess the difference is real, but not huge.
>
> Do these numbers seem reasonable?

I don't know exactly what the compiler optimized out, in particular whether it changed the divide by 2.2 into a multiply by the reciprocal.
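A minimal sketch of why that substitution is not automatic: x / 2.2 is rounded once, while x * (1.0 / 2.2) is rounded twice, so the two can differ in the last bit. For that reason GCC normally keeps the divide unless a flag such as -freciprocal-math (implied by -ffast-math) permits the change. The helper names max_relative_gap and first_n are hypothetical and are not from the original test.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest relative gap between x / 2.2 and
// x * (1.0 / 2.2) over the given inputs (every input must be nonzero).
double max_relative_gap(const std::vector<double>& xs) {
    const double reciprocal = 1.0 / 2.2;  // the reciprocal is rounded once, here
    double worst = 0.0;
    for (double x : xs) {
        assert(x != 0.0);  // a zero quotient would make the relative gap undefined
        const double by_divide = x / 2.2;           // a single rounding
        const double by_multiply = x * reciprocal;  // a second rounding on top
        const double gap =
            std::fabs(by_divide - by_multiply) / std::fabs(by_divide);
        if (gap > worst) worst = gap;
    }
    return worst;
}

// Hypothetical helper: the values 1.0, 2.0, ..., n as doubles.
std::vector<double> first_n(std::size_t n) {
    std::vector<double> xs;
    xs.reserve(n);
    for (std::size_t i = 1; i <= n; ++i) xs.push_back(static_cast<double>(i));
    return xs;
}
```

Calling max_relative_gap(first_n(1000)) gives a value on the order of one unit in the last place at most, which is negligible for most uses but is still a change in results that the compiler is not allowed to make by default.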

I assume it optimized away the call for the template version. Apparently it no longer optimizes away the whole loop for the templated version, and it optimizes away neither the vtable lookup nor the call for the non-templated version.
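For reference, a sketch of the two shapes being compared. The original test program is not shown in this message, so everything here except the name get_i() and the divide by 2.2 is an assumption: the class names, the returned value 7, and the summing loop are all hypothetical.

```cpp
#include <cassert>

// Hypothetical base class: get_i() is resolved through the vtable.
struct Base {
    virtual ~Base() {}
    virtual int get_i() const = 0;
};

struct Derived : Base {
    int get_i() const override { return 7; }
};

// Hypothetical plain class for the template version: no virtual functions,
// so the compiler sees the body of get_i() and can inline it.
struct Plain {
    int get_i() const { return 7; }
};

// "Classed" version: each iteration makes an indirect call unless the
// compiler can prove the dynamic type of *c.
double run_virtual(const Base* c, long iterations) {
    assert(c != nullptr);
    double sum = 0.0;
    for (long n = 0; n < iterations; ++n)
        sum += c->get_i() / 2.2;
    return sum;
}

// Templated version: T is known at compile time, so the call can be inlined.
template <class T>
double run_template(const T& c, long iterations) {
    double sum = 0.0;
    for (long n = 0; n < iterations; ++n)
        sum += c.get_i() / 2.2;
    return sum;
}
```

Both functions compute the same sum; only the way get_i() is reached differs, which is the part the timings are trying to isolate.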

Because the call goes to the same target every time through the loop, there is no branch misprediction on the call. Similarly, there are no cache misses on the push and pop of the return address, etc. That makes the difference between an inlined call and a virtual call a lot smaller in this test than it would be in average use. But not as small as you measured. There is a bigger factor.

CPUs overlap a lot of operations. They especially overlap things like a floating-point divide with all the flow-of-control work involved in that virtual call.

I'm not certain, but I think the optimized code, combined with the ability of the CPU to execute ahead, may mean the floating-point divide (or maybe even the reciprocal multiply) is still pending as the CPU goes ahead into the next iteration's c->get_i().

So if c->get_i() is super fast (inlined), it may finish and then the CPU must wait for the divide before going further. If c->get_i() is much slower, it may still be only a trivial amount slower than the divide, so the overlapped time is about the same.

If there were no such overlap, I'd expect a bigger difference between inline and virtual. If a divide were overlapped, I'd expect a virtual call with no branch misprediction to be entirely covered by the overlap, so no difference in total execution time. So your result seems to fit a multiply overlapped with the virtual call. But I'm far from sure. I'd need to see the generated asm code to make even a better guess.
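One way to gather more evidence, sketched under assumptions: time each variant several times and keep the fastest run, then compare against the assembly that g++ -O2 -S writes out for the same source. The helper name best_time_seconds is hypothetical, and taking the minimum of several runs is a common convention rather than anything used in the original test.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: call work() `repeats` times and return the fastest
// wall-clock time in seconds. The minimum is less sensitive to interference
// from other processes than a single run or an average.
template <class Work>
double best_time_seconds(Work work, int repeats) {
    assert(repeats > 0);
    double best = -1.0;  // sentinel meaning "no measurement yet"
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double elapsed =
            std::chrono::duration<double>(stop - start).count();
        if (best < 0.0 || elapsed < best) best = elapsed;
    }
    return best;
}
```

The work passed in should store its result somewhere the compiler cannot discard (for example a volatile variable), otherwise the loop being measured may be optimized away entirely, as apparently happened with the earlier templated version.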

As for the main question: as with most performance questions, simple tests lead to consistently distorted answers. Performance is a complex question.

References:
- template classes faster than derived classes?
  - From: Nava Whiteford
- Re: template classes faster than derived classes?
  - From: Brian Budge
- Re: template classes faster than derived classes?
  - From: Nava Whiteford
- Re: template classes faster than derived classes?
  - From: John S. Fine
- Re: template classes faster than derived classes?
  - From: Nava Whiteford
