This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Performance gain through dereferencing?


Hi David,

Sorry, I had included more information in an earlier draft which I edited out for brevity.

> You cannot learn useful timing
> information from a single run of a short
> test like this - there are far too many
> other factors that come into play.

I didn't mention that I have run it dozens of times. I know that blunt runtime measurements on a non-realtime system tend to be non-reproducible, and that they are inadequate for exact measurements. But the difference here is so large that the result is highly significant, in spite of the "amateurish" setup. The run I am showing here is typical. One of my four cores is surely idle at any given moment, and there is no I/O, so the variations are small.
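For reference, a minimal sketch of how such repeated runs might be collected and summarised in C. The names `workload` and `measure` are hypothetical, and `workload` is only a stand-in for the loop under test, not the original program. It uses the standard `clock()` call, which reports CPU time rather than wall-clock time.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the code under test. The volatile
   qualifier keeps the stores from being removed by the optimizer. */
void workload(void)
{
    volatile unsigned long sink = 0;
    for (unsigned long n = 0; n < 1000000UL; n++)
        sink = n;
}

/* Ascending comparison of two doubles for qsort. */
int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Runs fn `runs` times and leaves the CPU times (in seconds) in
   out[0..runs-1], sorted ascending: out[0] is the fastest run and
   out[runs / 2] a median. */
void measure(void (*fn)(void), int runs, double *out)
{
    assert(runs > 0 && out != NULL);
    for (int r = 0; r < runs; r++) {
        clock_t t0 = clock();
        fn();
        out[r] = (double)(clock() - t0) / CLOCKS_PER_SEC;
    }
    qsort(out, (size_t)runs, sizeof out[0], cmp_double);
}
```

Calling `measure(workload, 30, t)` for each variant and comparing the minimum and the median of `t` gives a rough idea of whether a difference exceeds the run-to-run noise.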

> You cannot learn useful timing information from unoptimised code.

I beg to disagree. While in this case the problem (and indeed eventually the whole program ;-) ) goes away with optimization, that may not be the case in less trivial scenarios. And optimization or not -- I would always contend that i = n is **not slower** than *p = n. But it is. Something is wrong ;-).

So I'd like to direct our attention to the generated code and its performance (because such code could conceivably appear as the result of an optimizing compiler run as well, in less trivial scenarios). What puzzles me is: how can two instructions be slower than a very similar pair of instructions plus another one? (And that question is entirely unrelated to optimization.)
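To make the comparison concrete, here is a minimal sketch of the two kinds of loop being discussed. It is a hypothetical reconstruction rather than the original test program, and the function names `store_direct` and `store_via_pointer` are invented for illustration.

```c
#include <assert.h>

/* Variant A: store directly into the local variable. */
unsigned long store_direct(unsigned long count)
{
    unsigned long i = 0;
    assert(count > 0);
    for (unsigned long n = 0; n < count; n++)
        i = n;
    return i; /* last value stored, i.e. count - 1 */
}

/* Variant B: the same store, but through a pointer to the variable. */
unsigned long store_via_pointer(unsigned long count)
{
    unsigned long i = 0;
    unsigned long *p = &i;
    assert(count > 0);
    for (unsigned long n = 0; n < count; n++)
        *p = n;
    return i; /* last value stored, i.e. count - 1 */
}
```

Compiling this with `gcc -O0 -S` writes the assembly for both functions to a `.s` file, so the loop bodies can be compared instruction by instruction. With optimization enabled, the compiler will most likely remove both loops, so the comparison only makes sense at `-O0`.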

> Otherwise the result could be nothing more than a quirk of the way caching worked out.

Could you explain how caching could play a role here if all variables and addresses are on the stack and are likely to be in the same memory page? (I'm not being sarcastic -- I may be missing something obvious.)

I can imagine that somehow the processor architecture is better utilized by the faster version (e.g. because short inner loops pipeline worse or whatever). For what it's worth, the programs were running on an i7-3632QM.

