This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
Hello, I'm improving OpenMP Task implementation in libgomp as a Google Summer of Code project. Since the mid-term evaluation is approaching, I would like to tell you what I did so far. 1. I read papers on task parallelism systems such as Cilk, Intel TBB, and OpenMP Task. 2. I read source codes of two OpenMP Task implementations. Nanos4 and OpenUH. 3. I read libgomp source codes surrounding task implementation. 4. I thought that one of the biggest problem in libgomp task implementation is that multiple processors lock only one task queue so many times. So I conducted a simple experiment to prove it. Attached "libgomp_lock._fib.eps" shows how long each CPU waits for task queue lock to be released. It is written by executing a program that use libgomp task. It calculates 22nd fibonacci number using libgomp task. The red line means the time span in which each CPU is waiting for mutex lock of task queue to be released. You can see each CPU is waiting for most of the execution time. I was convinced that each CPU (worker thread) has to have a task queue as many other task parallel implementation do. 5. I implemented and tested task queue [0]. It is held by each worker thread. Tasks in a task queue is usually popped by its own worker thread. When a worker thread has empty task queue, it steal a task from other worker thread. 6. I decided to use Portable Coroutine Library (PCL) [1] as OpenUH does to realise user context switches. It's true that user-level context switches are not necessary to implement tied task. But it is neccessary to implement efficient untied task. In fact, many task parallel systems have user-level context switches like Cilk, OpenUH, and Nanos. Although my first aim was to implement better tied task in libgomp, tied task is slower than untied task. Also, tied task can be realised by just adding some limitations to untied tasks. So I'm using PCL and firstly implementing untied task. 7. Before writing codes in libgomp, I wrote codes [2] to emulate the behavior of OpenMP Task both to deeply understand the implementation of it and evaluate the system I would implement in libgomp. It emulates a fibonacci program, in which each fib(N) is bound to a task. The codes contains: - Worker threads - Task queues - Task scheduler - OpenMP Task ABIs (like GOMP_task, GOMP_taskwait, and so on) - fibonacci function in which ABIs are explicitly called 8. I realised that task creation cost is not negligible when too many tasks are spawned. So I added cut-off depth. 9. I conducted a simple evaluation to compare my current implementation (7.), libgomp, and Nanos4. It shows thatn my current implementation is much faster than libgomp. And it seems faster than Nanos4... But it is probably because I don't know how to configure the task scheduling strategy in Nanos4. It is not perfectly scalable but it may be because tasks in fibonacci application are fine-grain to ignore the overhead of task scheduler. This is what I have done. Then I would like to ask you what I should do next. In my opinion, I have two reasonable choices. (A) Try to change implementation of libgomp following my current implementation. Then test and evaluate it using some applications. (I think Barcelona OpenMP Task Suite [3] contains good applications to test OpenMP Task implementations) (B) Ensure my current implementation is worth merging in libgomp before actually writing codes in libgomp. I wrote in 7. that I only emulated and evaluated a fibonacci program. But it is possible to emulate other applications like N-Queen or FFT. (A) is better personally because I don't have enough time before the GSoC term ends. Give me your advice. [0] I learned the implementation of the task queue from that of Cilk. You can see the basic idea of it in http://www.few.vu.nl/~kielmann/cgc/2011/slides/mihalcea.pdf [1] http://www.xmailserver.org/libpcl.html The codes are available with GPL2. [2] https://github.com/laysakura/GSoC_libgomp_task/tree/master/my_data_structures You can compile and test the emulation codes by: $ make test_fib_by_hand $ export OMP_NUM_THREADS=(int number) $ ./test_fib_by_hand (N for fib(N)) [3] http://nanos.ac.upc.edu/content/barcelona-openmp-task-suite Sho Nakatani
Attachment:
libgomp_lock_fib.eps
Description: PostScript document
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |