This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: benchmarking (or almabench)

From: Daniel Berlin <dberlin at dberlin dot org>
To: "S. Bosscher" <S dot Bosscher at student dot tudelft dot nl>
Cc: 'Jeremy Sanders ' <jss at ast dot cam dot ac dot uk>,"'gcc at gcc dot gnu dot org '" <gcc at gcc dot gnu dot org>
Date: Tue, 22 Apr 2003 11:11:09 -0400
Subject: Re: benchmarking (or almabench)

On Tuesday, April 22, 2003, at 11:08 AM, S. Bosscher wrote:

-march=pentium4 is known to pessimise code compared to -march=i686 for some benchmarks, see PR 8474. Maybe you're seeing the same problem?

Actually, if I had to guess, I'd put my money on the vectorization. Notice that ICC vectorized two loops in his example, and obviously we vectorized none. :)

If those were compute intensive loops, ....
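To make the vectorization point concrete, here is a minimal sketch of the kind of loop a vectorizing compiler can handle, next to one it generally cannot. Both functions are hypothetical illustrations; they are not the loops at almabench.cpp lines 219 and 230, which are not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel, NOT taken from almabench.cpp: every iteration is
// independent of the others, so a vectorizing compiler can compute several
// elements per SIMD instruction (two doubles per SSE2 register).
std::vector<double> scaled_sum(double a, const std::vector<double>& x,
                               const std::vector<double>& y) {
    assert(x.size() == y.size());
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = a * x[i] + y[i];
    return out;
}

// Hypothetical counter-example: iteration i needs the result of iteration
// i - 1 (a loop-carried dependency), so the iterations cannot simply be
// executed side by side in vector lanes.
std::vector<double> recurrence(double a, const std::vector<double>& x) {
    std::vector<double> out(x.size());
    double prev = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        prev = a * prev + x[i];
        out[i] = prev;
    }
    return out;
}
```

If the two loops ICC reported are of the first kind and dominate the run time, that alone could plausibly account for a large part of the gap between the vectorized and non-vectorized builds.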


Greetz
Steven


-----Original Message-----
From: Jeremy Sanders
To: gcc at gcc dot gnu dot org
Sent: 22-4-03 16:43
Subject: benchmarking (or almabench)

I've been looking at compiling the almabench benchmark again with gcc.
See:

http://gcc.gnu.org/ml/gcc/2003-01/msg00037.html

With a pentium4 processor I'm getting drastically different times for
running the code output from icc and gcc. icc produces code which is up
to 2.7 times faster than gcc code for this program.

(with gcc mainline)

/data/jss/gcc-3.3/bin/g++ -o almabench.o -O2 -mfpmath=sse -msse -msse2 -march=pentium4 -finline-limit=10000 -c almabench.cpp
/data/jss/gcc-3.3/bin/g++ -o almabench -O2 -mfpmath=sse -msse -msse2 -march=pentium4 -finline-limit=10000 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
31.121u 0.060s 0:33.31 93.6%	0+0k 0+0io 212pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
31.148u 0.052s 0:33.61 92.7%	0+0k 0+0io 212pf+0w

(I've also tried without the SSE and -march flags, and there's little
difference. I've also tried -fprofile-arcs, which doesn't do anything.
-finline-limit has no real effect.)

With icc 7.1:

xpc5:/<3>almabench-1.0.1/cpp> make
icc -o almabench.o -O2 -c almabench.cpp
icc -o almabench -O2 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
16.494u 0.013s 0:17.71 93.1%	0+0k 0+0io 116pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
16.445u 0.029s 0:17.53 93.8%	0+0k 0+0io 116pf+0w

That's about 89% faster than gcc's code (31.1s vs 16.5s user time).

Enabling P4 optimisation (okay gcc can't do vectorization):

xpc5:/<3>almabench-1.0.1/cpp> make
icc -o almabench.o -O2 -tpp7 -xW -march=pentium4 -c almabench.cpp
almabench.cpp(219) : (col. 5) remark: LOOP WAS VECTORIZED.
almabench.cpp(230) : (col. 5) remark: LOOP WAS VECTORIZED.
icc -o almabench -O2 -tpp7 -xW -march=pentium4 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
11.318u 0.005s 0:12.09 93.5%	0+0k 0+0io 116pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
11.277u 0.007s 0:12.08 93.2%	0+0k 0+0io 116pf+0w

That's 2.75 times faster than gcc's code.
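The ratios quoted above can be rechecked from the user times in the transcripts. The following is a small sketch of that arithmetic; the function names are made up for illustration, and the inputs are the first-run user times reported above (31.121s for gcc, 16.494s for icc -O2, 11.318s for icc with -xW).

```cpp
#include <cassert>

// How many times faster the candidate is than the baseline,
// expressed as a ratio of run times.
double speedup(double baseline_seconds, double candidate_seconds) {
    assert(baseline_seconds > 0.0 && candidate_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}

// The same comparison expressed as "N% faster".
double percent_faster(double baseline_seconds, double candidate_seconds) {
    return (speedup(baseline_seconds, candidate_seconds) - 1.0) * 100.0;
}
```

With these definitions, speedup(31.121, 16.494) comes out at roughly 1.89 and speedup(31.121, 11.318) at roughly 2.75, which matches the figures given for the plain -O2 build and the vectorized build respectively.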

Obviously this benchmark is synthetic, but it suggests gcc isn't optimising something in this code very well. We've also seen similar effects with other floating-point intensive code. Any suggestions? I can supply assembler output for both if anyone would like a look!

Jeremy

-- 
Jeremy Sanders <jss at ast dot cam dot ac dot uk>
http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
