This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.



Re: Optimization of array access


How much faster? I looked for a bug in the code, in case one version was doing more or less work than you expected, but I didn't find one. The size of the speedup might give a better hint at the nature of the bug, assuming there is one.

Assuming no bug, there is one subtle effect that might make the pointer version a little faster:

First, you should be aware that by accessing 8 million integers sequentially you are flushing the cache, and the execution time will be dominated by the memory accesses needed to load the cache. Every int in cache will be updated just once before being written back. That happens by cache line rather than by individual ints, but the ints are still read from main memory and written back to main memory once per update, never reused while in cache.

The actual instructions used to do the access and update don't matter much, if at all. Nor does the overhead of the pointers. Anything happening within the CPU or cache is essentially free compared to the time needed to read and write main memory.

A real cache is only set-associative (not fully associative) and its replacement policy is only approximately LRU. With the super-regular access pattern of your contiguous array, that won't matter: it will still discard the oldest contents and keep the newest contents. Those newest contents will never be reused before they become the oldest, so the cache's LRU decisions will be 100% wrong for predicting the actual access pattern.

But when you allocate many chunks with new, the addresses won't be as consistent, so the partial associativity will cause the cache to make discard decisions that are far different from LRU. It is possible that some very old cache contents will be kept while newer contents are discarded. So a few cache lines might still be valid from the last pass when they come around again in the next pass. Then memory accesses would be saved and the whole loop would run a little faster.

If we assume a generously large cache holding half a million ints, and an absurdly high disruption of LRU that preserves half the cache lines from one pass to the next, then a quarter million of the 8 million accesses could be saved on nine out of ten times through the loop: 0.9/32 = 2.8% of the accesses, and thus nearly 2.8% of total time.

So for a speedup of up to 2.8%, this might be the reason (though that would be a much luckier disruption of LRU than I'd expect). If the speedup is more than 2.8%, it must be something else entirely.


Christoph Bartoschek wrote:


Hi,

compiling the following code with g++ -O2 gives me unexpected results. The test_pointer() function seems to be faster although it uses more memory and has to do lots of pointer dereferences. Replacing line 19 by lines 15-18 has no effect on the runtime.

Why is simulating a 3D array with pointers faster than using a contiguous memory array?



