[Bug target/105873] [amdgcn][OpenMP] task reductions fail with "team master not responding; slave thread aborting"

jakub at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue Jun 7 17:15:52 GMT 2022


--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
  int retry = 100;
  do
    {
      if (retry-- == 0)
        {
          /* It really shouldn't happen that barriers get out of sync, but
             if they do then this will loop until they realign, so we need
             to avoid an infinite loop where the thread just isn't there.  */
          const char msg[] = ("Barrier sync failed (another thread died?);"
                              " aborting.");
          write (2, msg, sizeof (msg)-1);
          abort();
        }

      asm ("s_barrier" ::: "memory");
      gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
      if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
        {
          gomp_barrier_handle_tasks (state);
          gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
        }
      generation |= gen & BAR_WAITING_FOR_TASK;
    }
  while (gen != state + BAR_INCR);

I wonder if this (and the similar loop later on) shouldn't reset the retry count
back to 100 whenever gomp_barrier_handle_tasks is run, because there is really
no limit on how many times that can occur.
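
To make the idea concrete, here is a rough host-side model of a single waiting
thread.  It is only a sketch, not a patch against libgomp: model_wait_loop,
RETRY_LIMIT, the wake_had_task array and the reset_on_task flag are all made up
for illustration, and s_barrier plus the reload of bar->generation are replaced
by reading the next entry of a recorded wake-up sequence.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical retry budget; 100 mirrors the initial value in the quoted loop.  */
enum { RETRY_LIMIT = 100 };

/* Model of one waiting thread.  wake_had_task[i] says whether the i-th
   wake-up found a pending task (the BAR_TASK_PENDING case).  The barrier
   counts as complete once all n_wakeups entries have been consumed;
   n_wakeups must be at least 1.  Returns true if the thread gets that far,
   false if the retry budget runs out first (where the real loop aborts).  */
static bool
model_wait_loop (const bool *wake_had_task, size_t n_wakeups,
                 bool reset_on_task)
{
  int retry = RETRY_LIMIT;
  size_t i = 0;

  do
    {
      if (retry-- == 0)
        return false;

      /* The real code does s_barrier here and reloads bar->generation;
         the model just looks at the next recorded wake-up.  */
      if (wake_had_task[i])
        {
          /* Stands in for gomp_barrier_handle_tasks (state).  The reset
             is the assumed change, not existing behaviour.  */
          if (reset_on_task)
            retry = RETRY_LIMIT;
        }
      i++;
    }
  while (i < n_wakeups);

  return true;
}
```

In this model, a sequence of 200 task wake-ups followed by the real barrier
release exhausts the budget when reset_on_task is false and completes when it
is true, while a run of more than 100 wake-ups with no task at all still
fails in both modes, so the protection against a missing thread is kept.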

A model used quite often in OpenMP programs is that of parallel master
(or parallel single), where just one thread creates tasks and the other threads
are waiting in gomp_team_barrier_wait_end.  Then, when the single thread, which
hasn't reached the barrier yet, creates a task, gomp_team_barrier_wake is
called.  Now, with the gcn implementation, initially that means all threads but
one do s_barrier in gomp_team_barrier_wait_end and one thread will do
gomp_team_barrier_wake, which also does s_barrier.  At that point I guess all
the threads are woken up, the single thread continues its work, one of the other
threads will likely pick up that task and the rest of them will go back to
sleep (s_barrier).
Now, we can have 2 quite different scenarios.  In one (not a very good OpenMP
program), the single (or master) thread does some compute-expensive work and
every once in a while creates a short-lived cheap task; let's say it does
that 200 times in a loop.
That will surely trigger the above "Barrier sync failed (another thread died?);
aborting." case, as in such a scenario it will only allow 100 such iterations.
A more usual case is when the single/master thread creates lots of tasks and
then reaches the barrier itself.  Initially it will be the same as the
above-mentioned scenario, but soon all or most of the threads will be busy and
not waiting on s_barrier until the work is done; then everybody sleeps in
