This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
streams is slow under gcc
- From: ebiederm at xmission dot com (Eric W. Biederman)
- To: gcc at gcc dot gnu dot org
- Date: 12 Oct 2002 11:41:51 -0600
- Subject: streams is slow under gcc
STREAM (http://www.cs.virginia.edu/stream/) is a benchmark that
measures the effective memory bandwidth of a platform. "Effective"
here means how much sustained memory bandwidth compiled code can
actually get out of the system. The goal is to give some indication
of how well memory-bandwidth-limited applications will run.
I have done some investigation of the results for gcc on x86. On
recent platforms (Athlon, P4 Xeon, and x86_64) gcc generally achieves
only about 40-50% of the achievable memory bandwidth. This includes
gcc-3.2 with SSE support.
To reach that conclusion I wrote a hand-optimized memory copy, a
hand-optimized memory read, and a hand-optimized memory write, to see
what the achievable, as opposed to theoretical, memory bandwidth
numbers are.
Rough numbers:
Memory               CPU     Theoretical  Achievable  gcc       intel-7beta
PC2100               Athlon  2100MB/s     2000MB/s    800MB/s
PC2700               x86_64  2700MB/s     2670MB/s    1200MB/s  1500MB/s
Dual Channel PC1600  P4Xeon  3200MB/s     2800MB/s    1400MB/s  1700MB/s
For a hand-optimized memory read or a hand-optimized memory write I
only get about 2/3 of the theoretical bandwidth, while I come very
close to the theoretical bandwidth for copy operations.
The hand-optimized assembly does the following things:
1) Processes data in chunks small enough to fit in the L1 cache.
2) For each chunk, first walks backwards through the chunk reading
one 32-bit word per cache line, forcing the data into the cache.
3) Fills all 8 SSE registers with data from the chunk, going forward.
4) Stores all 8 SSE registers using a non-temporal store.
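The four steps above can be sketched in C with SSE intrinsics. This is
an illustration of the technique, not the assembly I actually ran: the
function name stream_copy, the 64-byte cache line, and the 16 KiB chunk
size are assumptions, and the compiler is free to schedule the loads
and stores differently from hand-written code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence */

#define LINE_FLOATS   16    /* assumed: 64-byte cache line / 4-byte float */
#define CHUNK_FLOATS  4096  /* assumed: 16 KiB chunk, fits in the L1 cache */

/* Copy n floats. Both pointers must be 16-byte aligned and n must be a
   multiple of 32 (8 SSE registers * 4 floats each). */
static void stream_copy(float *dst, const float *src, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n % 32 == 0);

    while (n > 0) {
        /* Step 1: work on one L1-sized chunk at a time. */
        size_t chunk = n < CHUNK_FLOATS ? n : CHUNK_FLOATS;

        /* Step 2: walk backwards, reading one 32-bit word per cache
           line so the whole chunk is pulled into the cache. */
        for (size_t i = chunk; i >= LINE_FLOATS; i -= LINE_FLOATS)
            (void)*(volatile const float *)(src + i - LINE_FLOATS);

        for (size_t i = 0; i < chunk; i += 32) {
            /* Step 3: fill 8 SSE registers going forward. */
            __m128 r0 = _mm_load_ps(src + i);
            __m128 r1 = _mm_load_ps(src + i + 4);
            __m128 r2 = _mm_load_ps(src + i + 8);
            __m128 r3 = _mm_load_ps(src + i + 12);
            __m128 r4 = _mm_load_ps(src + i + 16);
            __m128 r5 = _mm_load_ps(src + i + 20);
            __m128 r6 = _mm_load_ps(src + i + 24);
            __m128 r7 = _mm_load_ps(src + i + 28);

            /* Step 4: write them out with non-temporal stores. */
            _mm_stream_ps(dst + i,      r0);
            _mm_stream_ps(dst + i + 4,  r1);
            _mm_stream_ps(dst + i + 8,  r2);
            _mm_stream_ps(dst + i + 12, r3);
            _mm_stream_ps(dst + i + 16, r4);
            _mm_stream_ps(dst + i + 20, r5);
            _mm_stream_ps(dst + i + 24, r6);
            _mm_stream_ps(dst + i + 28, r7);
        }

        src += chunk;
        dst += chunk;
        n -= chunk;
    }

    /* Make the non-temporal stores globally visible before returning. */
    _mm_sfence();
}
```

On 32-bit x86 this needs -msse (or -msse2) on the gcc command line; on
x86_64 SSE is enabled by default.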
Using a non-temporal store for this kind of application raises
my store speed by 3x. A non-intuitive but very interesting result.
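For reference, a write-only loop using non-temporal stores might look
like the sketch below. The name stream_fill is hypothetical and this is
not my measured code. The usual explanation for the speedup is that a
non-temporal store goes out through write-combining buffers and does
not first read the destination cache line into the cache, but I have
not verified that this accounts for the full 3x here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Fill n floats with value using non-temporal stores. dst must be
   16-byte aligned and n must be a multiple of 4. */
static void stream_fill(float *dst, float value, size_t n)
{
    assert((uintptr_t)dst % 16 == 0);
    assert(n % 4 == 0);

    __m128 v = _mm_set1_ps(value);  /* broadcast value to all 4 lanes */

    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);  /* store that bypasses the cache */

    _mm_sfence();  /* order the streaming stores before later accesses */
}
```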
Is there a gcc option other than -msse2 that I could use to tell it a
loop is memory bound, so that it applies appropriate optimizations?
If not, what would be the recommended course to fix gcc so that it
performs well in memory-bandwidth-limited code?
Eric