This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: gomp slowness


On Thu, Oct 18, 2007 at 02:47:44PM +1000, skaller wrote:
> 
> On Thu, 2007-10-18 at 12:02 +0800, Biplab Kumar Modak wrote:
> > skaller wrote:
> > > On Wed, 2007-10-17 at 18:14 +0100, Biagio Lucini wrote:
> > >> skaller wrote:
> > > 
> > >> It would be interesting to try with another compiler. Do you have access 
> > >> to another OpenMP-enabled compiler?
> > > 
> > > Unfortunately no, unless MSVC++ in VS2005 has OpenMP.
> > > I have an Intel licence, but they're too tied up with commercial
> > > vendors and it doesn't work on Ubuntu (it's built for Fedora and SUSE).
> > > 
> > If possible, you can post the source code. I have an MSVC 2005 license (I 
> > bought it to get OpenMP working with it).
> > 
> > I can then give it a try. I have a dual core PC. :)
> 
> OK, attached.

On LU_mp.c, according to oprofile, more than 95% of the time is spent in the
inner loop, rather than in any kind of waiting.  On a quad core with
OMP_NUM_THREADS=4, all 4 threads eat 99.9% of CPU, and the generated inner
loop is identical between OMP_NUM_THREADS=1 and OMP_NUM_THREADS=4.  I believe
this benchmark is highly memory bound rather than CPU intensive, so the
relative difference between OMP_NUM_THREADS={1,2,4} very likely lies not in
what GCC or any other OpenMP implementation does, but in the cache access
patterns it generates.

OMP_NUM_THREADS=1 /tmp/LU_mp; OMP_NUM_THREADS=2 GOMP_CPU_AFFINITY=0,1 /tmp/LU_mp; \
OMP_NUM_THREADS=2 GOMP_CPU_AFFINITY=0,2 /tmp/LU_mp; OMP_NUM_THREADS=4 /tmp/LU_mp
Completed decomposition in 4.830 seconds
Completed decomposition in 5.970 seconds
Completed decomposition in 9.140 seconds
Completed decomposition in 11.480 seconds

shows this quite clearly.  This Intel quad-core CPU shares a 4MB L2 cache
between cores 0 and 1 and another between cores 2 and 3.  So, if you run the
two threads on cores sharing the same L2 cache, it is only slightly slower
than one thread, while running them on cores with different L2 caches shows
a huge slowdown.
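If you want to check which cores share a cache on your own box before
picking GOMP_CPU_AFFINITY values, the Linux sysfs cache topology files show
it directly (the index numbers and levels vary by CPU; on the quad core
above, the L2 entries would list 0-1 and 2-3 as shared sets):

```shell
# Print, for each logical CPU, which CPUs share each of its cache levels.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  for cache in "$cpu"/cache/index*; do
    [ -r "$cache/shared_cpu_list" ] || continue
    echo "$(basename "$cpu") L$(cat "$cache/level") \
shared_cpu_list=$(cat "$cache/shared_cpu_list")"
  done
done
```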

So, I very much doubt you'd get much better results with other OpenMP
implementations.  I believe how the 3 arrays are laid out on the stack
is what really matters most in this case; the synchronization overhead is
in the noise.

	Jakub
