This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Performance gain through dereferencing?

On 16/04/14 17:57, Peter Schneider wrote:
Hi David,

Sorry, I had included more information in an earlier draft which I
edited out for brevity.

(Sorry for the late reply - Easter is a /serious/ holiday in Norway.)

 > You cannot learn useful timing
 > information from a single run of a short
 > test like this - there are far too many
 > other factors that come into play.

I didn't mention that I have run it dozens of times. I know that blunt
runtime measurements on a non-realtime system tend to be
non-reproducible, and that they are inadequate for exact measurements.
But the difference here is so large that the result is highly
significant, in spite of the "amateurish" setup. The run I am showing
here is typical. One of my four cores is surely idle at any given
moment, and there is no I/O, so the variations are small.

The differences you have noted are not large - you need to be much clearer about what you are actually measuring here. Typically, you will want to look at the time taken for an empty loop, and the time for the working loops - and you will want to compare for loops of different sizes, and with large enough loops that things such as startup time are negligible.

You cannot learn useful timing information from unoptimised code.

I beg to disagree. While in this case the problem (and indeed eventually
the whole program ;-) ) goes away with optimization that may not be the
case in less trivial scenarios. And optimization or not -- I would
always contend that *p = n is **not slower** than i = n. But it is.
Something is wrong ;-).

Of preference, use optimisation to ensure the compiler and code is not doing anything extra, and "volatile" to control how the test code is generated.

So I'd like to direct our attention to the generated code and its
performance (because such code conceivably could appear as the result of
an optimized compiler run as well, in less trivial scenarios). What
puzzles me is: How can it be that two instructions are slower than a
very similar pair of instructions plus another one? (And that question
is totally unrelated to optimization.)

It is far from easy to understand what /actually/ happens inside a modern x86 processor. The incoming instructions get translated into internal RISC instructions, which are scheduled, pipelined, and executed in different ways at different points in the pipeline. It is certainly conceivable that by some quirk the three instruction version ends up faster than the two instruction version.

I am not trying to tell you that your measurements or discovery are wrong here - I am just trying to be sure that measurement errors are eliminated before jumping to any conclusions.

I would also like to see a series of code snippets that produce different code showing similar effects. If these start to show a clear pattern, then it will be worth looking at and testing on different processors - it could be that there is scope for a peephole optimisation here.

Otherwise the
result could be nothing more than a quirk of the way caching worked out.

Could you explain how caching could play a role here if all variables
and adresses are on the stack and are likely to be in the same memory
page? (I'm not being sarcastic -- I may miss something obvious).

It may be /likely/ that they are on the same memory page, but it is not guaranteed. It is also quite likely that they are on different cache lines - sometimes the split by cache line can have an effect. I am not trying to give absolute answers here - merely suggesting other possible semi-random effects that might make this case a one-off rather than common behaviour.

I can imagine that somehow the processor architecture is better utilized
by the faster version (e.g. because short inner loops pipleline worse or
whatever). For what it's worth, the programs were running on a i7-3632QM.

Usually that's the case - shorter loops run faster. But it is hard to be sure, and sometimes a loop that is too short will lead to bigger delays from stalls in the pipeline. Getting the fastest possible speed out of a modern x86 processor is more of an art than a science, and it is a long time since simple logic gave all the right answers.

But by all means, investigate the effect further - it is certainly possible that you have come across a quirk in the processor that can be used to generate faster code.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]