This is the mail archive of the
gcc-help@gcc.gnu.org
mailing list for the GCC project.
Re: Performance problem
2008/9/24 John Fine <johnsfine@verizon.net>:
> Łukasz Lew wrote:
>>
>> I fixed the problem (I think) with rdtsc on 64bit architectures.
>> http://www.mimuw.edu.pl/~lew/libego_benchmark.tgz
>>
>
> Seems to work. Why was it previously correct for 32 bit? Did the 32 bit
> compiler already combine the correct two registers?
I have no idea.
But it does not seem to compile on 32-bit.
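One plausible explanation, offered as a sketch rather than a diagnosis: on
32-bit x86, GCC's "A" asm constraint names the edx:eax register pair for a
64-bit value, which is exactly where rdtsc leaves its result. On x86_64 a
64-bit value fits in a single register, so "=A" selects only rax or rdx and
the other half is lost. Whether the earlier benchmark code used "=A" is an
assumption; I have not checked the tarball. Reading the two halves explicitly
should behave the same on both targets. The names read_tsc and combine_halves
below are hypothetical and do not come from the benchmark.

```cpp
#include <chrono>
#include <cstdint>

// Combine the two 32-bit halves that rdtsc returns in EDX (high) and EAX (low).
constexpr std::uint64_t combine_halves(std::uint32_t hi, std::uint32_t lo) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
static_assert(combine_halves(1u, 2u) == 0x100000002ULL,
              "high half must land in bits 32..63");

// Hypothetical helper: read the time stamp counter on 32-bit and 64-bit x86.
inline std::uint64_t read_tsc() {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    std::uint32_t lo, hi;
    // Name EAX and EDX explicitly instead of relying on the "A" constraint.
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return combine_halves(hi, lo);
#else
    // Assumption for other targets: fall back to a monotonic clock tick count.
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}
```

Timing a region is then the difference of two read_tsc() calls taken around it.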
>>>
>>> You may be very right about the register allocation.
>>> I tuned my code on 4.2 and small "irrelevant" changes changed the
>>> performance badly
>>> and asm output revealed among other things different registers.
>>>
>
> That doesn't really prove much. Without some very good output from
> Opannotate, I don't know how to tell the real reason for the performance
> difference.
Indeed, but opannotate on the assembly doesn't tell me much here.
The 10% difference is spread irregularly: some parts are slower, some are faster.
But the asm of both versions corresponds very well, except for
different registers and offsets.
>
>
>>> I use Oprofile a lot, and tried to pinpoint the difference but asm
>>> output is too different
>>> while c++ annotation is too weak because of heavy inlining.
>>>
>
> I'm trying to understand and/or fix the use of Opannotate for some much
> harder problems, so I was curious enough to try it on your program. I
> compiled your program x86_64 with gcc 4.4. Even if I got good results, that
> wouldn't tell you anything about 32 bit gcc 4.3.
Can you send me the log from my benchmark?
And your processor model?
If you can do the same for g++4.3, that would be very useful for me.
>
> But I got surprisingly bad results. I haven't previously seen such bad
> results from opannotate without using heavily templated code. But I also
> haven't used a gcc 4.4 compiled program with opannotate before.
>
> In --source mode nearly all the total time was missing (not associated with
> any source line).
I have the same problem with g++-4.3.
My guess is that this is due to heavy inlining.
By the way, you would be surprised how much slower it gets if you turn off
the always_inline GCC attribute.
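For reference, a minimal sketch of how that attribute is typically applied.
The FORCE_INLINE macro and the two functions are hypothetical examples, not
code from libego. With the attribute, GCC inlines the function at every call
site even where its own heuristics would decline to, so each caller gets a
separate copy of the body.

```cpp
// Hypothetical macro: use GCC's always_inline where available,
// plain inline elsewhere.
#if defined(__GNUC__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

// Small hot helper; with always_inline each call site gets its own copy.
FORCE_INLINE int square(int x) { return x * x; }

// Two call sites, hence two inlined copies of square's body.
int sum_of_squares(int a, int b) { return square(a) + square(b); }
```

Defining FORCE_INLINE as plain inline gives the "attribute turned off" build
for comparison.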
> In mixed source and assembly view, I think all the time
Is it possible to get a mixed view?
> was shown, but I don't think the assembly code corresponded very accurately
> with the source code and the time was in some very surprising lumps. I
> usually can interpret such lumps (usually the instruction after an L2 cache
> miss or the instruction after a mispredicted branch). But that didn't seem
> to fit the execution time lumps in your code.
L1 misses hurt my code's performance as well.
>
> The few points in your source code that had most of the total execution time
> were inlined multiple times with different register usage each time. No one
> inline copy of any such routine had as much as 4% of the total execution
> time. That tends to wreck the theory that a minor change somewhere has
> caused a big difference by changing register allocation.
Can you be more specific?
How do you know which part was inlined where?
> There wouldn't be
> that sort of correlation in the way it changes register allocation across a
> bunch of different inlinings of the same function that already differ from
> each other in register allocation.
But do you observe the 10% difference in performance that I see on my machine?
This is getting promising, thanks for your help.
Lukasz
PS
Is there any alternative to OProfile?
If not, then why is it so underdeveloped?