This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [libgomp, GSoC'19] Work-stealing task scheduling
- From: Ray Kim <msca8h at naver dot com>
- To: Jakub Jelinek <jakub at redhat dot com>
- Cc: gcc-patches at gcc dot gnu dot org
- Date: Mon, 02 Sep 2019 21:48:18 +0900
- Subject: Re: [libgomp, GSoC'19] Work-stealing task scheduling
- References: <firstname.lastname@example.org> <20190902122127.GO2120@tucnak>
> Your mailer sadly made the patch not applicable; I had to spend quite a long
> time just to undo the undesirable line breaking. Please fix your mailer not
> to break lines automatically, or at least send patches as attachments, on
> which it will hopefully not do that.
Sorry for wasting your time again.
I just changed my mailer but forgot to change the wrapping option.
This will not happen again.
> > This patch implemented work-stealing task scheduling for GSoC'19 final
> > evaluations.
> > Currently there are some issues that need to be further addressed,
> > however I think it is functional.
> > The things that could be improved are as follows:
> > 1. Currently the threads busy-wait, even when the task queues are empty,
> > until the stopping criterion is met.
> > The way task waits were previously implemented does not work well with
> > this implementation, so I initially removed the task-wait feature altogether.
> For the GSoC submission I guess I can live with that, but not long term,
> many apps oversubscribe the machine, run more threads than the hw has
> available and unbounded busy waiting in that case is unacceptable.
> For latency reasons some busy waiting is performed, but the user should have
> control over how long that busy waiting is done and then have a fallback to
> sleeping. See e.g. OMP_WAIT_POLICY or GOMP_SPINCOUNT env variables that
> allow to control that behavior.
I totally agree with you.
This was a quick hack to match the GSoC submission.
I think, though, that implementing the waiting properly will take some additional effort.
> E.g. for the taskwait case, when you run gomp_execute_task and it returns
> false, hasn't executed any task, can't you perhaps take the depend_lock
> on the parent or just atomically set some state bit that you are about to
> sleep and go to sleep, and then on the other side when doing
> __atomic_sub_fetch (&parent->num_children, 1, MEMMODEL_ACQ_REL);
> check the return value and if num_children is 0, verify if the parent isn't
> sleeping and if it could be sleeping, wake it up.
> Guess similarly for taskgroup wait or waiting for dependencies.
This is close to how taskwait was originally implemented; however, a deadlock
kept the threads from ever waking up.
I suppose waking up the parent from a task has some issue in the work-stealing context?
Also, I think the threads should be woken whenever a new task is added to the team.
Since we're doing work-stealing, any thread can run the newly added task.
This, however, wasn't possible in the previous implementation,
since the parent of the newly added task cannot access the task queues of the sleeping threads.
I'll probably have to figure out a cleaner way to implement this.
> > 2. The barrier interfaces use a big bold lock.
> > Because the barrier related functions such as gomp_team_barrier_done
> > are not atomic (as far as I believe), they are protected using
> > team->barrier_lock.
> Can't those operations that need protection just be turned into __atomic_*
Yes, I'll work on this.
> Have you managed to run some benchmarks? E.g. the EPCC microbenchmarks,
> LCALS/CLOMP, etc.?
I haven't run any serious benchmarks yet, but I plan to run the Barcelona OpenMP Tasks Suite.
Benchmarks on simple tasks showed that the work-stealing version is slightly
slower than the GSoC 2nd-evaluation version.
This is probably caused by the busy waiting.
I'll post benchmark results once the other issues are resolved.
> I must say I don't understand why do you use atomics on
> gomp_task_state state as kind of critical section, instead of say
> depend_lock mutex? I'd note that e.g. the data structures layout needs to
> be carefully considered so that there is no cache line ping-pong on often
> atomically modified cache lines from other threads.
This was actually the part I struggled with the most.
Using depend_lock is not possible, since either the last child
or the thread executing the parent will free the task;
once the task is freed, accessing its depend_lock would segfault.
I really couldn't come up with a safer, clearer solution.
If you have a better way to implement this, please let me know.
I'd be really grateful.
I'll try to improve the issues you raised.
Thanks for the thorough review.