This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.



Re: template classes faster than derived classes?


Nava Whiteford wrote:

In this case the templated version doesn't seem to have the same huge advantage. Templated 20.73s against 21.1s for the classed version. I would guess real, but not huge.

Do these numbers seem reasonable?


I don't know exactly what the compiler optimized out, especially whether it changed the divide by 2.2 into a multiply by the reciprocal.
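To make the divide-versus-reciprocal question concrete, here is a minimal sketch of the two forms. The function names are made up for illustration; only the constant 2.2 comes from the thread. Because 1/2.2 is not exactly representable, the two forms can differ in the last bit, which is why, as far as I know, GCC does not make this substitution on its own unless something like -freciprocal-math (implied by -ffast-math) is in effect.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: the division as written in the source code.
double by_divide(double x) { return x / 2.2; }

// Hypothetical helper: the same value via a precomputed reciprocal.
// 1.0 / 2.2 is rounded once, so the product may differ from x / 2.2
// in the last bit.
double by_reciprocal(double x) { return x * (1.0 / 2.2); }

// Largest relative difference between the two forms over x = 1..n.
double max_relative_difference(int n) {
    assert(n > 0);
    double worst = 0.0;
    for (int x = 1; x <= n; ++x) {
        double d = by_divide(x);
        double r = by_reciprocal(x);
        double rel = std::fabs(d - r) / std::fabs(d);
        if (rel > worst) worst = rel;
    }
    return worst;
}
```

Calling max_relative_difference(1000) shows how small the disagreement is. A multiply typically has a much shorter latency than a divide, so which form the compiler emitted matters for the overlap argument further down.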

I assume it optimized away the call for the template version. Apparently it no longer optimizes away the whole loop for the templated version, and it optimizes away neither the vtable lookup nor the call for the non-templated version.
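The benchmark itself is not quoted in this message, so the following is only a guess at its shape: a loop calling get_i() and dividing by 2.2, once through a base-class pointer and once through a template parameter. Every name except get_i() is hypothetical.

```cpp
#include <cassert>

// Hypothetical reconstruction; only get_i() and the divide by 2.2
// are taken from the thread.
struct Base {
    virtual ~Base() = default;
    virtual int get_i() const = 0;
};

struct Derived : Base {
    int i;
    explicit Derived(int v) : i(v) {}
    int get_i() const override { return i; }
};

struct Plain {
    int i;
    int get_i() const { return i; }  // non-virtual, easy to inline
};

// "Classed" version: unless the compiler can prove the dynamic type,
// each iteration loads the vtable pointer and makes an indirect call.
double run_virtual(const Base* c, long n) {
    assert(c != nullptr && n >= 0);
    double sum = 0.0;
    for (long k = 0; k < n; ++k) sum += c->get_i() / 2.2;
    return sum;
}

// Templated version: the callee is known at compile time, so the
// call can be inlined down to a plain member load.
template <typename T>
double run_template(const T* c, long n) {
    assert(c != nullptr && n >= 0);
    double sum = 0.0;
    for (long k = 0; k < n; ++k) sum += c->get_i() / 2.2;
    return sum;
}
```

Both functions do the same arithmetic in the same order, so any timing difference comes from the call mechanism rather than from the floating point work.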

Because the branch is the same every time through the loop, there is no branch misprediction on the call. Similarly, there are no cache misses on the push and pop of the return address, etc. That makes the difference between an inlined call and a virtual call a lot smaller in this test than it would be in average use. But not as small as you measured. There is a bigger factor.
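One way to get closer to "average use" is to vary the dynamic type from call to call, so the indirect branch target is not the same every time. The sketch below is an assumption about how such a test could look, not something from the original benchmark; the class names and the small linear congruential generator are purely illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical hierarchy with two different targets for get_i().
struct Base {
    virtual ~Base() = default;
    virtual int get_i() const = 0;
};
struct DerivedA : Base { int get_i() const override { return 1; } };
struct DerivedB : Base { int get_i() const override { return 2; } };

// Build n objects. With mixed == false every object is a DerivedA,
// so the call target never changes. With mixed == true the type is
// picked by a simple pseudo-random sequence, which makes the
// indirect call target harder to predict.
std::vector<std::unique_ptr<Base>> make_objects(std::size_t n, bool mixed) {
    std::vector<std::unique_ptr<Base>> v;
    v.reserve(n);
    unsigned state = 12345u;
    for (std::size_t k = 0; k < n; ++k) {
        state = state * 1664525u + 1013904223u;  // LCG step
        bool pick_b = mixed && ((state >> 16) & 1u) != 0;
        if (pick_b) v.push_back(std::make_unique<DerivedB>());
        else        v.push_back(std::make_unique<DerivedA>());
    }
    return v;
}

// Same loop body as in the thread: call get_i() and divide by 2.2.
double sum_all(const std::vector<std::unique_ptr<Base>>& v) {
    double sum = 0.0;
    for (const auto& p : v) {
        assert(p);
        sum += p->get_i() / 2.2;
    }
    return sum;
}
```

Timing sum_all over the uniform and the mixed vectors would show how much of the measured cost depends on the call target being predictable. Note that this layout also adds pointer chasing through the vector, so it measures more than the branch predictor alone.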

CPUs overlap a lot of operations. They especially overlap things like floating point divide with all the flow of control things involved in that virtual call.

I'm not certain, but I think the optimized code combined with the ability of the CPU to execute ahead may mean the floating point divide (or maybe even the reciprocal multiply) is still pending as the CPU goes ahead into the next iteration of c->get_i().

So if c->get_i() is super fast (inlined), it may finish and then the CPU must wait for the divide before going further. If c->get_i() is much slower, it may still be only a trivial amount slower than the divide, so the overlapped time is about the same.

If there were no such overlap, I'd expect a bigger difference between inline and virtual. If a divide were overlapped, I'd expect a virtual call with no branch misprediction to be entirely covered by the overlap, so no difference in total execution time. So your result seems to fit a multiply overlapped with the virtual call. But I'm far from sure. I'd need to see the generated asm code to make even a better guess.

As for the main question: Like most performance questions, simple tests lead to consistently distorted answers. Performance is a complex question.

