This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

GSoC libgomp task project: What should I do next?

From: Sho Nakatani <dev dot laysakura at gmail dot com>
To: gcc at gcc dot gnu dot org
Cc: Jakub Jelinek <jakub at redhat dot com>
Date: Wed, 29 Jun 2011 04:06:11 +0900
Subject: GSoC libgomp task project: What should I do next?

Hello,

I'm improving OpenMP Task implementation in libgomp as a Google Summer
of Code project.
Since the mid-term evaluation is approaching, I would like to tell you
what I did so far.

1. I read papers on task parallelism systems such as Cilk, Intel TBB,
and OpenMP Task.

2. I read source codes of two OpenMP Task implementations. Nanos4 and OpenUH.

3. I read libgomp source codes surrounding task implementation.

4. I thought that one of the biggest problem in libgomp task
implementation is that
   multiple processors lock only one task queue so many times.
   So I conducted a simple experiment to prove it.
   Attached "libgomp_lock._fib.eps" shows how long each CPU waits
   for task queue lock to be released.
   It is written by executing a program that use libgomp task.
   It calculates 22nd fibonacci number using libgomp task.
   The red line means the time span in which each CPU is waiting for mutex lock
   of task queue to be released.
   You can see each CPU is waiting for most of the execution time.
   I was convinced that each CPU (worker thread) has to have a task queue
   as many other task parallel implementation do.

5. I implemented and tested task queue [0]. It is held by each worker thread.
   Tasks in a task queue is usually popped by its own worker thread.
   When a worker thread has empty task queue, it steal a task from
other worker thread.

6. I decided to use Portable Coroutine Library (PCL) [1] as OpenUH does to
   realise user context switches.
   It's true that user-level context switches are not necessary to
implement tied task.
   But it is neccessary to implement efficient untied task. In fact,
many task parallel
   systems have user-level context switches like Cilk, OpenUH, and Nanos.
   Although my first aim was to implement better tied task in libgomp,
   tied task is slower than untied task. Also, tied task can be
realised by just adding
   some limitations to untied tasks.
   So I'm using PCL and firstly implementing untied task.

7. Before writing codes in libgomp, I wrote codes [2] to emulate the behavior of
    OpenMP Task both to deeply understand the implementation of it and
    evaluate the system I would implement in libgomp.
    It emulates a fibonacci program, in which each fib(N) is bound to a task.
    The codes contains:
    - Worker threads
    - Task queues
    - Task scheduler
    - OpenMP Task ABIs (like GOMP_task, GOMP_taskwait, and so on)
    - fibonacci function in which ABIs are explicitly called

8. I realised that task creation cost is not negligible when too many tasks
   are spawned. So I added cut-off depth.

9. I conducted a simple evaluation to compare my current implementation (7.),
    libgomp, and Nanos4.
    It shows thatn my current implementation is much faster than libgomp.
    And it seems faster than Nanos4... But it is probably because I don't know
    how to configure the task scheduling strategy in Nanos4.
    It is not perfectly scalable but it may be because tasks in
fibonacci application
    are fine-grain to ignore the overhead of task scheduler.


This is what I have done.
Then I would like to ask you what I should do next.

In my opinion, I have two reasonable choices.

(A) Try to change implementation of libgomp following my current implementation.
     Then test and evaluate it using some applications.
     (I think Barcelona OpenMP Task Suite [3] contains good applications
     to test OpenMP Task implementations)

(B) Ensure my current implementation is worth merging in libgomp
     before actually writing codes in libgomp.
     I wrote in 7. that I only emulated and evaluated a fibonacci program.
     But it is possible to emulate other applications like N-Queen or FFT.

(A) is better personally because I don't have enough time before the
GSoC term ends.
Give me your advice.


[0] I learned the implementation of the task queue from that of Cilk.
     You can see the basic idea of it in
http://www.few.vu.nl/~kielmann/cgc/2011/slides/mihalcea.pdf

[1] http://www.xmailserver.org/libpcl.html
     The codes are available with GPL2.

[2] https://github.com/laysakura/GSoC_libgomp_task/tree/master/my_data_structures
     You can compile and test the emulation codes by:
       $ make test_fib_by_hand
       $ export OMP_NUM_THREADS=(int number)
       $ ./test_fib_by_hand (N for fib(N))

[3] http://nanos.ac.upc.edu/content/barcelona-openmp-task-suite


Sho Nakatani

Attachment: libgomp_lock_fib.eps
Description: PostScript document

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]