Re: PR libgomp/56073: benchmark regression due to PR libgomp/51376 fix


On Tue, 2013-01-22 at 21:33 +1030, Alan Modra wrote:
> SPEComp2012 376.kdtree shows a huge regression after my PR51376 fix,
> with the benchmark, which finishes normally in a few minutes, failing
> to complete after hours of running on a power7 machine.  Using a
> reduced data set showed typically over a 200x slowdown.  So, why is
> this happening?  Well, it appears that the benchmark hits exactly the
> libgomp code paths that previously accessed task->children without
> locking and now we need to obtain task_lock.  The slowdown is due to
> massive task_lock contention.
> 
> I tried the solution in
> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00235.html and found it
> gave a small improvement, but there was still far too much
> contention due to hitting task_lock in GOMP_task.  The question then
> was whether locking was really necessary there, and on analysing it
> properly I believe I was far too conservative in forcing the
> "children" access to be inside a task_lock held region.  That
> particular task.children will only be set if the current thread
> creates new tasks in its task work function!  I.e., some other thread
> won't set it, so there's no need to worry that the current thread
> might see a stale NULL value and miss calling gomp_clear_parent.
> (It's true that the current thread might see a stale non-NULL value,
> but that doesn't matter.  Gaining the lock will ensure
> gomp_clear_parent sees the real value of task.children.)
> 
> With this patch, PR51376 stays fixed and we're back to a reasonable
> time for kdtree.  I'm seeing a 20% slowdown in my quick and dirty
> testing, but some of that will be due to different optimisation and
> tuning in the libgomp builds.
> 
> I did consider (and test) another small refinement.  The release
> barrier in gomp_sem_post is sufficient to ensure correct memory
> ordering, so you can write:
> 
> 		      if (parent->in_taskwait)
> 			{
> 			  gomp_sem_post (&parent->taskwait_sem);
> 			  parent->children = NULL;
> 			}
> 		      else
> 			__atomic_store_n (&parent->children, NULL,
> 					  MEMMODEL_RELEASE);
> 
> However, this reorders posting the semaphore and writing
> parent->children.  I think doing so is OK but am wary of trying to be
> too clever where multiple threads are involved.
> 
> Bootstrapped and regression tested powerpc64-linux.  OK to apply?

Could you please document the synchronization bits in more detail?  For
example, if you have release stores, document which acquire loads they
synchronize with, and vice versa.  This makes it much easier for others
to read and review the code, and to check the intent of the
synchronization bits.  Also, it is often useful to document which cases
cannot happen (and would thus allow for using weaker memory orders, for
example); this absence of problems is usually difficult to prove, so if
you've already checked it, documenting it makes things easier for
others.  Thanks.
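
For instance, the kind of cross-referencing comments I have in mind
would look like this in a self-contained toy (generic names, nothing
libgomp-specific; compile with -pthread):

  #include <pthread.h>
  #include <stdio.h>

  static int data;
  static int done;

  /* Writer: publish DATA, then set DONE.  This release store
     synchronizes with the acquire load in reader () below; a reader
     that observes DONE == 1 is guaranteed to also observe
     DATA == 42.  */
  static void *writer (void *arg)
  {
    data = 42;
    __atomic_store_n (&done, 1, __ATOMIC_RELEASE);
    return NULL;
  }

  /* Reader: this acquire load synchronizes with the release store in
     writer () above.  A stale 0 only means we spin once more; we can
     never observe DONE == 1 together with a stale DATA.  */
  static void *reader (void *arg)
  {
    while (!__atomic_load_n (&done, __ATOMIC_ACQUIRE))
      ;
    printf ("data = %d\n", data);
    return NULL;
  }

  int main (void)
  {
    pthread_t w, r;
    pthread_create (&r, NULL, reader, NULL);
    pthread_create (&w, NULL, writer, NULL);
    pthread_join (w, NULL);
    pthread_join (r, NULL);
    return 0;
  }

With comments like these, a reviewer can check each side of the
pairing locally instead of having to reconstruct it from the whole
file.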

More comments below.

> 	PR libgomp/51376
> 	PR libgomp/56073
> 	* task.c (GOMP_task): Revert 2011-12-09 change.
> 	(GOMP_taskwait): Likewise.  Instead use atomic load with acquire
> 	barrier to read task->children..
> 	(gomp_barrier_handle_tasks): ..and matching atomic store with
> 	release barrier here when setting parent->children to NULL.
> 
> Index: libgomp/task.c
> ===================================================================
> --- libgomp/task.c	(revision 195354)
> +++ libgomp/task.c	(working copy)
> @@ -116,11 +116,10 @@ GOMP_task (void (*fn) (void *), void *data, void (
>  	}
>        else
>  	fn (data);
> -      if (team != NULL)
> +      if (task.children != NULL)

Why does this not need an acquire barrier?

>  	{
>  	  gomp_mutex_lock (&team->task_lock);
> -	  if (task.children != NULL)

Are you sure that you can remove this check of task.children here,
especially if you don't use an acquire barrier beforehand?

Can there be several threads trying to call gomp_clear_parent?
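
If I understand the argument in the submission correctly, the intended
pattern is roughly the following sketch (simplified, hypothetical
names; clear_parent stands in for gomp_clear_parent, and I've written
the test as an explicit relaxed load for clarity, where the patch uses
a plain access):

  #include <pthread.h>

  struct child { struct child *next_child; struct child *parent; };

  static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
  static struct child *children;  /* normally guarded by task_lock */

  /* Must tolerate a NULL argument: the list may have been emptied by
     another thread before we obtained the lock.  */
  static void clear_parent (struct child *c)
  {
    for (; c != NULL; c = c->next_child)
      c->parent = NULL;
  }

  /* Owner thread only.  CHILDREN is set to non-NULL only by this
     thread, so a relaxed read cannot see a stale NULL: if children
     were created, this thread performed that store itself.  A stale
     non-NULL is possible (another thread may have cleared the list
     since), but harmless: we just take the lock, and acquiring it
     ensures we then see the current value.  */
  void owner_finish (void)
  {
    if (__atomic_load_n (&children, __ATOMIC_RELAXED) != NULL)
      {
        pthread_mutex_lock (&task_lock);
        clear_parent (children);
        pthread_mutex_unlock (&task_lock);
      }
  }

If that is indeed the reasoning, please spell it out in a comment next
to the plain read, in particular why a stale NULL is impossible there.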

> -	    gomp_clear_parent (task.children);
> +	  gomp_clear_parent (task.children);
>  	  gomp_mutex_unlock (&team->task_lock);

Please add comments or reference the comments in the other pieces of
synchronizing code related to this.

>  	}
>        gomp_end_task ();
> @@ -258,7 +257,13 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t st
>  		    parent->children = child_task->next_child;
>  		  else
>  		    {
> -		      parent->children = NULL;
> +		      /* We access task->children in GOMP_taskwait
> +			 outside of the task lock mutex region, so
> +			 need a release barrier here to ensure memory
> +			 written by child_task->fn above is flushed
> +			 before the NULL is written.  */
> +		      __atomic_store_n (&parent->children, NULL,
> +					MEMMODEL_RELEASE);
>  		      if (parent->in_taskwait)
>  			gomp_sem_post (&parent->taskwait_sem);
>  		    }
> @@ -291,7 +296,8 @@ GOMP_taskwait (void)
>    struct gomp_task *child_task = NULL;
>    struct gomp_task *to_free = NULL;
>  
> -  if (task == NULL || team == NULL)
> +  if (task == NULL
> +      || __atomic_load_n (&task->children, MEMMODEL_ACQUIRE) == NULL)

Please mention with which stores this acquire load synchronizes; a
sketch of the kind of comment I mean follows below.

>      return;
>  
>    gomp_mutex_lock (&team->task_lock);
> 
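
Concretely, based on the pairing your ChangeLog describes (the release
store of parent->children in gomp_barrier_handle_tasks), a comment
along these lines would do (hypothetical wording):

  /* This acquire load synchronizes with the release store of
     parent->children in gomp_barrier_handle_tasks.  If we read NULL
     here, everything written by the completed child tasks is
     guaranteed to be visible to us, so we can return without taking
     task_lock.  */
  if (task == NULL
      || __atomic_load_n (&task->children, MEMMODEL_ACQUIRE) == NULL)
    return;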

Torvald

