This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: gomp slowness


On Wed, 2007-10-17 at 10:09 -0700, Joe Buck wrote:
> On Thu, Oct 18, 2007 at 03:00:02AM +1000, skaller wrote:
> > Hi, I have just run and timed a couple of tutorial examples for
> > openMP using gcc (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4) on a dual core
> > Athlon amd64, with OMP_NUM_THREADS set to 1 and 2, and occasionally
> > 8. I found that 1 thread outperforms 2 by almost 2:1 on all the examples,
> > and 8 is only fractionally slower than 2. The code was compiled
> > with just -fopenmp, no optimisation switches. OS: Linux, Ubuntu
> > gutsy (7.10) with Linux 2.6.22-14-rt (with real time patches).
> 
> Try again with optimization switches.

OK, tried with -O2.

combined_mp.c: real time is 12 seconds with 1 thread, 8 seconds with
2 threads. CPU time: approx. 12.2 vs 12.9 seconds. SPEEDUP. Very coarse
parallelism here (i.e. almost no overhead synchronising threads).

LU_mp.c: 1.23 seconds with 1 thread, 1.96 seconds with 2 threads. SLOWDOWN.
This is an LU decomposition of an 800x800 matrix, with omp parallel
on the final internal loop of the 800x800 nest, i.e. fairly fine-grained
parallelism.
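
For readers without the tutorial source to hand, here is a minimal sketch
of what "omp parallel on the innermost loop" can look like. It is not the
actual LU_mp.c: the function name lu_inplace, the row-major layout and the
lack of pivoting are all assumptions made for illustration. The point is
only that each parallel region covers at most n multiply-subtracts, so the
fork/join cost is paid once per (k, i) pair.

```c
/* Hypothetical sketch, NOT the tutorial's LU_mp.c.
   In-place LU decomposition without pivoting of an n x n row-major matrix.
   Afterwards the strict lower triangle holds the L multipliers and the
   upper triangle (including the diagonal) holds U. */
void lu_inplace(double *a, int n)
{
    for (int k = 0; k < n; k++) {
        for (int i = k + 1; i < n; i++) {
            double factor = a[i * n + k] / a[k * n + k];
            a[i * n + k] = factor;  /* store the multiplier below the diagonal */

            /* Fine-grained parallelism: one parallel region per (k, i) pair,
               each doing at most n multiply-subtracts. */
            #pragma omp parallel for
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= factor * a[k * n + j];
        }
    }
}
```

Built without -fopenmp the pragma is ignored and the loop runs serially,
which gives a baseline to compare against. Moving the pragma out to the
i loop would make the parallelism coarser (one region per k).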

1 thread:  1.23 user, 0.009 system.
2 threads: 3.75 user, 0.22 system.

That's 3x the amount of CPU time, which explains why the
real (wall-clock) time is poor.
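
The arithmetic behind that, as a small sketch. The helper names speedup and
cpu_ratio are made up for this illustration; the numbers are the LU_mp.c
timings quoted above.

```c
#include <stdio.h>

/* Wall-clock speedup: a value above 1 means the parallel run finished sooner. */
double speedup(double real_serial, double real_parallel)
{
    return real_serial / real_parallel;
}

/* Total CPU time (user + system) of the parallel run relative to the serial run. */
double cpu_ratio(double user_s, double sys_s, double user_p, double sys_p)
{
    return (user_p + sys_p) / (user_s + sys_s);
}

/* Prints both ratios for the LU_mp.c timings reported in this message. */
void print_lu_summary(void)
{
    printf("speedup   %.2f\n", speedup(1.23, 1.96));
    printf("cpu ratio %.2f\n", cpu_ratio(1.23, 0.009, 3.75, 0.22));
}
```

A speedup below 1 is a slowdown. For combined_mp.c the same calculation
gives 12 / 8 = 1.5.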

If that's typical, then it seems to indicate that O(1K) floating point
operations per parallel section isn't enough work on an Athlon X2 to
warrant parallelism.

This is the Ubuntu/Debian-built system,
so if they're conservative the build machine may turn off
the special mutex-free increments etc. in case the user's
machine doesn't support those instructions (although
I did think they were standard on all AMD devices,
since they're ordinary instructions with a LOCK prefix).
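
By "mutex-free increment" I mean something like the sketch below. It uses
GCC's __sync_fetch_and_add builtin, for which GCC on x86 normally emits a
LOCK-prefixed instruction. The counter and the function name bump are
hypothetical and are not taken from libgomp; whether the distribution's
libgomp build uses this path is exactly what I don't know.

```c
/* Hypothetical shared counter, not taken from libgomp. */
static long counter = 0;

/* Atomically adds 1 to the counter without taking a mutex and
   returns the value the counter held before the increment. */
long bump(void)
{
    return __sync_fetch_and_add(&counter, 1);
}
```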

Q: why is optimisation required here? I'd have thought even
more benefit would be obtained for non-optimised code
(because it uses more CPU etc).

My only hint there is non-optimal RAM (or L2?) accesses.
(All the code in these micro tests should be easily cached;
AMD has separate caches for each core.)

I tried the LU decomposition with 8 threads. The real time is
1.961, i.e. more or less the same as for 2 threads, 
as is the total CPU time. The system time went up from 0.22s to
0.128s, since each core would be context switching 4 threads.
Note: Linux kernel with RT patches.


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

