Performance problem

Łukasz Lew
Wed Sep 24 22:35:00 GMT 2008

2008/9/24 John Fine <>:
> Łukasz Lew wrote:
>> I fixed the problem (I think) with rdtsc on 64bit architectures.
> Seems to work.  Why was it previously correct for 32 bit?  Did the 32 bit
> compiler already combine the correct two registers?

I have no idea.
But it seems not to compile on 32-bit.
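
One common cause of this kind of 32/64-bit difference, offered as a guess
rather than a diagnosis: with gcc inline asm, the "=A" constraint means the
edx:eax pair for a 64-bit value on 32-bit x86, but on x86_64 it means
either rax or rdx, so half of the counter is lost. A minimal sketch that
should behave the same on both, assuming an x86 target (read_tsc and
combine_halves are made-up names, not the ones in my code):

```cpp
#include <cstdint>

// Join the two 32-bit halves that rdtsc leaves in edx (high) and eax (low).
static inline std::uint64_t combine_halves(std::uint32_t lo, std::uint32_t hi) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Hypothetical helper: read the time-stamp counter without relying on "=A".
static inline std::uint64_t read_tsc() {
#if defined(__i386__) || defined(__x86_64__)
    std::uint32_t lo, hi;
    // Ask for eax and edx explicitly; this works for 32-bit and 64-bit builds.
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return combine_halves(lo, hi);
#else
    return 0;  // assumption: non-x86 targets are out of scope for this sketch
#endif
}
```

The explicit shift and or is what a 32-bit compiler does implicitly when it
is handed the edx:eax pair.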

>>> You may be very right about the register allocation.
>>> I tuned my code on 4.2 and small "irrelevant" changes changed the
>>> performance badly
>>> and asm output revealed among other things different registers.
> That doesn't really prove much.  Without some very good output from
> Opannotate, I don't know how to tell the real reason for the performance
> difference.
Indeed, but opannotate on the assembler output doesn't give much here.
The 10% difference is spread irregularly: some parts are slower, some are faster.
But the asm of both versions corresponds very well, except for
different registers and offsets.

>>> I use Oprofile a lot, and tried to pinpoint the difference but asm
>>> output is too different
>>> while C++ annotation is too weak because of heavy inlining.
> I'm trying to understand and/or fix the use of Opannotate for some much
> harder problems, so I was curious enough to try it on your program.  I
> compiled your program x86_64 with gcc 4.4.  Even if I got good results, that
> wouldn't tell you anything about 32 bit gcc 4.3.

Can you send me the log from my benchmark?
And your processor model?

If you can do the same for g++4.3, that would be very useful for me.

> But I got surprisingly bad results.  I haven't previously seen such bad
> results from opannotate without using heavily templated code.  But I also
> haven't used a gcc 4.4 compiled program with opannotate before.
> In --source mode nearly all the total time was missing (not associated with
> any source line).

I have the same problem with g++-4.3.
My guess is that this is due to heavy inlining.
BTW, you would be surprised how much slower it gets if you turn off
the always_inline gcc attribute.
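
As a rough illustration of what switching the attribute on and off can look
like (FORCE_INLINE, PROFILE_NO_INLINE and add_clamped are made-up names for
this sketch, not taken from my code): a macro that forces inlining in a
normal build and forbids it in a profiling build, so that each function
shows up separately in the profile, at the cost of the slowdown mentioned
above.

```cpp
// Hypothetical switch: build with -DPROFILE_NO_INLINE to keep functions
// out of line so a profiler can attribute samples to them.
#if defined(PROFILE_NO_INLINE)
#  define FORCE_INLINE static __attribute__((noinline))
#else
#  define FORCE_INLINE static inline __attribute__((always_inline))
#endif

// Small example function standing in for a hot routine.
FORCE_INLINE int add_clamped(int a, int b, int limit) {
    int sum = a + b;
    return sum > limit ? limit : sum;  // never return more than limit
}
```

Both builds compute the same results; only the code layout, and therefore
the timing and the profile, differ.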

> In mixed source and assembly view, I think all the time

Is it possible to get a mixed view?

> was shown, but I don't think the assembly code corresponded very accurately
> with the source code and the time was in some very surprising lumps.  I
> usually can interpret such lumps (usually the instruction after an L2 cache
> miss or the instruction after a mispredicted branch).  But that didn't seem
> to fit the execution time lumps in your code.

L1 misses hurt my code's performance as well.

> The few points in your source code that had most of the total execution time
> were inlined multiple times with different register usage each time.  No one
> inline copy of any such routine had as much as 4% of the total execution
> time.  That tends to wreck the theory that a minor change somewhere has
> caused a big difference by changing register allocation.

Can you be more specific?
How do you know which part was inlined where?

>  There wouldn't be
> that sort of correlation in the way it changes register allocation across a
> bunch of different inlinings of the same function that already differ from
> each other in register allocation.

But do you observe the 10% difference in performance that I have on my machine?

This is getting promising, thanks for your help.

Is there any alternative to OProfile?
If not, then why is it so undeveloped?
