This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



streams is slow under gcc


Streams (http://www.cs.virginia.edu/stream/) is a benchmark that
measures the effective memory bandwidth of a platform.  Effective in
this case means how much sustained memory bandwidth the compiler can
get out of the system.  The goal is to give some indication of how
well memory-bandwidth-limited applications will run.
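
To make "effective bandwidth" concrete, here is a rough sketch of a
copy kernel timed in the same spirit.  It is not the actual benchmark
source; the function name copy_bandwidth_mb_s and the use of clock()
(CPU time, which is close enough to wall time for a single thread)
are my own assumptions for illustration.

```c
#include <stddef.h>
#include <time.h>

/* Time `reps` passes of a simple copy loop over n doubles and return the
   best observed rate in MB/s (1 MB = 10^6 bytes), or -1.0 if no pass
   took long enough to be measured.  The caller supplies both arrays. */
double copy_bandwidth_mb_s(double *c, const double *a, size_t n, int reps)
{
    double best = -1.0;

    /* A copy reads n doubles and writes n doubles per pass. */
    double mbytes = 2.0 * (double)n * (double)sizeof(double) / 1e6;

    for (int r = 0; r < reps; r++) {
        clock_t t0 = clock();
        for (size_t i = 0; i < n; i++)
            c[i] = a[i];
        clock_t t1 = clock();

        double secs = (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
        if (secs > 0.0 && mbytes / secs > best)
            best = mbytes / secs;
    }
    return best;
}
```

The arrays need to be much larger than the caches for the result to
reflect main memory rather than cache bandwidth.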

I have done some investigation of the results for gcc on x86.  On
recent platforms (Athlon, P4 Xeon, and x86_64) gcc generally achieves
only 40-50% of the achievable memory bandwidth.  This includes gcc-3.2
with SSE support.

To come to the above conclusion I wrote a hand-optimized memory copy,
as well as a hand-optimized memory read and a hand-optimized memory
write, to see what the achievable, as opposed to theoretical, memory
bandwidth numbers were.

Rough numbers:
Memory               CPU     Theoretical  Achievable  gcc       intel-7beta
PC2100               Athlon  2100MB/s     2000MB/s    800MB/s   -
PC2700               x86_64  2700MB/s     2670MB/s    1200MB/s  1500MB/s
Dual Channel PC1600  P4Xeon  3200MB/s     2800MB/s    1400MB/s  1700MB/s

For a hand-optimized memory read or a hand-optimized memory write I
only get about 2/3 of the theoretical bandwidth, while I come very
close to the theoretical bandwidth for copy operations.

The hand-optimized assembly does the following things:
1) Processes data in chunks small enough to fit in the L1 cache.
2) For each chunk, first walks backwards through the chunk reading
   one 32-bit word per cache line, forcing the data into the cache.
3) Fills all 8 SSE registers with data from the chunk, going forward.
4) Stores all 8 SSE registers using a non-temporal store.
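
For readers who prefer C to assembly, the four steps can be sketched
with SSE2 intrinsics roughly as follows.  This is an illustration, not
the hand-written assembly itself: the name stream_copy, the 4 KB chunk
size, and the 64-byte cache line are assumptions, and the touch loop
reads a single byte per line instead of a 32-bit word to stay clear of
C aliasing rules.  On targets without SSE2 it simply falls back to
memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; also pulls in the SSE ones */
#endif

#define CACHE_LINE  64           /* assumed cache-line size in bytes */
#define CHUNK_BYTES (4 * 1024)   /* assumed chunk size, below a typical L1 */

/* Copy n bytes from src to dst.  Assumes both pointers are 16-byte
   aligned and n is a multiple of 128 (8 registers x 16 bytes). */
void stream_copy(void *dst, const void *src, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n % 128 == 0);
#if defined(__SSE2__)
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;

    while (n > 0) {
        /* Step 1: work on one chunk that fits in the L1 cache. */
        size_t chunk = n < CHUNK_BYTES ? n : CHUNK_BYTES;

        /* Step 2: walk backwards, touching one location per cache line
           so the whole chunk is pulled into the cache. */
        for (size_t off = chunk; off > 0; ) {
            off = off > CACHE_LINE ? off - CACHE_LINE : 0;
            (void)*(const volatile unsigned char *)(s + off);
        }

        /* Steps 3 and 4: going forward, fill 8 SSE registers, then
           write them out with non-temporal stores. */
        for (size_t off = 0; off < chunk; off += 128) {
            const __m128i *in = (const __m128i *)(s + off);
            __m128i *out = (__m128i *)(d + off);
            __m128i r0 = _mm_load_si128(in + 0);
            __m128i r1 = _mm_load_si128(in + 1);
            __m128i r2 = _mm_load_si128(in + 2);
            __m128i r3 = _mm_load_si128(in + 3);
            __m128i r4 = _mm_load_si128(in + 4);
            __m128i r5 = _mm_load_si128(in + 5);
            __m128i r6 = _mm_load_si128(in + 6);
            __m128i r7 = _mm_load_si128(in + 7);
            _mm_stream_si128(out + 0, r0);
            _mm_stream_si128(out + 1, r1);
            _mm_stream_si128(out + 2, r2);
            _mm_stream_si128(out + 3, r3);
            _mm_stream_si128(out + 4, r4);
            _mm_stream_si128(out + 5, r5);
            _mm_stream_si128(out + 6, r6);
            _mm_stream_si128(out + 7, r7);
        }

        s += chunk;
        d += chunk;
        n -= chunk;
    }
    _mm_sfence();   /* order the non-temporal stores before later stores */
#else
    memcpy(dst, src, n);   /* portable fallback when SSE2 is unavailable */
#endif
}
```

Whether the compiler actually keeps all eight values in registers is
up to its register allocator, so the generated code may differ from
the assembly described above.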

Using a non-temporal store for this kind of application raises
my store speed by 3x.  A non-intuitive but very interesting result.

Is there a gcc option other than -msse2 that I could use to tell it
that a loop is memory bound, so that it applies the appropriate
optimizations?

If not, what would be the recommended course to fix gcc so that it
performs well in memory-bandwidth-limited code?

Eric

