
[libgomp,nvptx] Fix hang in gomp_team_barrier_wait_end

Message ID 20210420112344.GA7277@delia

Commit Message

Tom de Vries April 20, 2021, 11:23 a.m. UTC
Hi,

Consider the following omp fragment.
...
  #pragma omp target
  #pragma omp parallel num_threads (2)
  #pragma omp task
    ;
...

This hangs at -O0 for nvptx.

Investigating the behaviour gives us the following trace of events:
- both threads execute GOMP_task, where they:
  - deposit a task, and
  - execute gomp_team_barrier_wake
- thread 1 executes gomp_team_barrier_wait_end and, not being the last thread,
  proceeds to wait at the team barrier
- thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it
  calls gomp_barrier_handle_tasks, where it:
  - executes both tasks and marks the team barrier done
  - executes a gomp_team_barrier_wake which wakes up thread 1
- thread 1 exits the team barrier
- thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
  the team barrier.
- thread 0 hangs.

To understand why there is a hang here, it's good to understand how things
are set up for nvptx.  The libgomp/config/nvptx/bar.c implementation is
a copy of the libgomp/config/linux/bar.c implementation, with uses of both
futex_wake and do_wait replaced with uses of nvptx insn bar.sync:
...
  if (bar->total > 1)
    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
...

The point where thread 0 goes to wait at the team barrier corresponds, in
the linux implementation, to a do_wait.  In the linux case, the call to
do_wait doesn't hang, because it's waiting for bar->generation to become
a certain value, and if bar->generation already has that value, it just
proceeds, without any need for coordination with other threads.

In the nvptx case, the bar.sync waits until thread 1 joins it in the same
logical barrier, which never happens: thread 1 is lingering in the
thread pool at the thread pool barrier (using a different logical barrier),
waiting to join a new team.

The easiest way to fix this is to revert to the posix implementation for
bar.{c,h}.

Another way would be to revert to the linux implementation for bar.{c,h},
and implement the primitives futex_wait and do_wait using nvptx insns.

This patch instead implements a minimal fix (which makes the implementation
deviate further from the linux one).

The hang was only observed in gomp_team_barrier_wait_end, but we propagate the
fix to its twin gomp_team_barrier_wait_cancel_end as well.

The fix is based on the assumptions that at the point of the fix, after the
call to gomp_barrier_handle_tasks:
- all tasks are done
  (an assert is added to check this), and consequently:
- the executing thread is the only thread left in the team barrier
  (so it's accurate to set nthreads to 1).

Tested libgomp on x86_64 with nvptx accelerator.

Any comments?

Thanks,
- Tom

[libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

libgomp/ChangeLog:

2021-04-20  Tom de Vries  <tdevries@suse.de>

	PR target/99555
	* config/nvptx/bar.c (gomp_team_barrier_wait_end)
	(gomp_team_barrier_wait_cancel_end): Don't try to sync with team threads
	that have left the team barrier.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Remove nvptx-specific
	workarounds.
	* testsuite/libgomp.c/pr99555-1.c: Same.
	* testsuite/libgomp.fortran/task-detach-6.f90: Same.

---
 libgomp/config/nvptx/bar.c                         | 32 ++++++++++++++++------
 .../testsuite/libgomp.c-c++-common/task-detach-6.c |  8 ------
 libgomp/testsuite/libgomp.c/pr99555-1.c            |  8 ------
 .../testsuite/libgomp.fortran/task-detach-6.f90    | 12 --------
 4 files changed, 24 insertions(+), 36 deletions(-)

Comments

Alexander Monakov April 20, 2021, 4:11 p.m. UTC | #1
Hello Tom,

Thank you for the investigation and the detailed writeup. It was difficult for
me to infer the internal API contracts here (and still is), sorry about the
mistake.

Most importantly: does GCN handle this, and if yes, how? I think the solution
should be the same for config/gcn and config/nvptx (I guess this is a question
for Andrew).

Some comments inline below:

On Tue, 20 Apr 2021, Tom de Vries wrote:

> Hi,
> 
> Consider the following omp fragment.
> ...
>   #pragma omp target
>   #pragma omp parallel num_threads (2)
>   #pragma omp task
>     ;
> ...
> 
> This hangs at -O0 for nvptx.
> 
> Investigating the behaviour gives us the following trace of events:
> - both threads execute GOMP_task, where they:
>   - deposit a task, and
>   - execute gomp_team_barrier_wake
> - thread 1 executes gomp_team_barrier_wait_end and, not being the last thread,
>   proceeds to wait at the team barrier

Shouldn't it try to handle deposited tasks before suspending on the barrier?

I guess you are describing what the code does, I'm just commenting that I'm
confused why it behaves so.

> - thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it
>   calls gomp_barrier_handle_tasks, where it:
>   - executes both tasks and marks the team barrier done
>   - executes a gomp_team_barrier_wake which wakes up thread 1
> - thread 1 exits the team barrier

Up to this point it looks reasonable.

> - thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
>   the team barrier.

At this point the code should realize that the team barrier was already released
and not attempt to wait on it again. Maybe by inspecting the generation counter?

I may be wrong though, I don't understand the overall flow well enough yet.

> - thread 0 hangs.
> 
> To understand why there is a hang here, it's good to understand how things
> are set up for nvptx.  The libgomp/config/nvptx/bar.c implementation is
> a copy of the libgomp/config/linux/bar.c implementation, with uses of both
> futex_wake and do_wait replaced with uses of nvptx insn bar.sync:
> ...
>   if (bar->total > 1)
>     asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> ...
> 
> The point where thread 0 goes to wait at the team barrier, corresponds in
> the linux implementation with a do_wait.  In the linux case, the call to
> do_wait doesn't hang, because it's waiting for bar->generation to become
> a certain value, and if bar->generation already has that value, it just
> proceeds, without any need for coordination with other threads.
> 
> In the nvptx case, the bar.sync waits until thread 1 joins it in the same
> logical barrier, which never happens: thread 1 is lingering in the
> thread pool at the thread pool barrier (using a different logical barrier),
> waiting to join a new team.
> 
> The easiest way to fix this is to revert to the posix implementation for
> bar.{c,h}.
> 
> Another way would be to revert to the linux implementation for bar.{c,h},
> and implement the primitives futex_wait and do_wait using nvptx insns.

I don't think implementing futex_wait is possible on nvptx.

Alexander

> This patch instead implements a minimal fix (which makes the implementation
> deviate further from the linux one).
> 
> The hang was only observed in gomp_team_barrier_wait_end, but we propagate the
> fix to its twin gomp_team_barrier_wait_cancel_end as well.
> 
> The fix is based on the assumptions that at the point of the fix, after the
> call to gomp_barrier_handle_tasks:
> - all tasks are done
>   (an assert is added to check this), and consequently:
> - the executing thread is the only thread left in the team barrier
>   (so it's accurate to set nthreads to 1).
> 
> Tested libgomp on x86_64 with nvptx accelerator.
> 
> Any comments?
> 
> Thanks,
> - Tom
> 
> [libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
> 
> libgomp/ChangeLog:
> 
> 2021-04-20  Tom de Vries  <tdevries@suse.de>
> 
> 	PR target/99555
> 	* config/nvptx/bar.c (gomp_team_barrier_wait_end)
> 	(gomp_team_barrier_wait_cancel_end): Don't try to sync with team threads
> 	that have left the team barrier.
> 	* testsuite/libgomp.c-c++-common/task-detach-6.c: Remove nvptx-specific
> 	workarounds.
> 	* testsuite/libgomp.c/pr99555-1.c: Same.
> 	* testsuite/libgomp.fortran/task-detach-6.f90: Same.
> 
> ---
>  libgomp/config/nvptx/bar.c                         | 32 ++++++++++++++++------
>  .../testsuite/libgomp.c-c++-common/task-detach-6.c |  8 ------
>  libgomp/testsuite/libgomp.c/pr99555-1.c            |  8 ------
>  .../testsuite/libgomp.fortran/task-detach-6.f90    | 12 --------
>  4 files changed, 24 insertions(+), 36 deletions(-)
> 
> diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
> index c5c2fa8829b..058a8d4d5ca 100644
> --- a/libgomp/config/nvptx/bar.c
> +++ b/libgomp/config/nvptx/bar.c
> @@ -78,6 +78,7 @@ void
>  gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
>  {
>    unsigned int generation, gen;
> +  unsigned int nthreads = bar->total;
>  
>    if (__builtin_expect (state & BAR_WAS_LAST, 0))
>      {
> @@ -90,6 +91,15 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
>        if (__builtin_expect (team->task_count, 0))
>  	{
>  	  gomp_barrier_handle_tasks (state);
> +	  /* Assert that all tasks have been handled.  */
> +	  if (team->task_count != 0)
> +	    __builtin_abort ();
> +	  /* In gomp_barrier_handle_tasks, the team barrier has been marked
> +	     as done, and all pending threads woken up.  So this is now the
> +	     last and only thread in the barrier.  Adjust nthreads to
> +	     reflect the new situation, to make sure we don't hang
> +	     indefinitely at the bar.sync below.  */
> +	  nthreads = 1;
>  	  state &= ~BAR_WAS_LAST;
>  	}
>        else
> @@ -97,8 +107,8 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
>  	  state &= ~BAR_CANCELLED;
>  	  state += BAR_INCR - BAR_WAS_LAST;
>  	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
> -	  if (bar->total > 1)
> -	    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> +	  if (nthreads > 1)
> +	    asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
>  	  return;
>  	}
>      }
> @@ -107,8 +117,8 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
>    state &= ~BAR_CANCELLED;
>    do
>      {
> -      if (bar->total > 1)
> -	asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> +      if (nthreads > 1)
> +	asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
>        gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
>        if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
>  	{
> @@ -140,6 +150,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
>  				   gomp_barrier_state_t state)
>  {
>    unsigned int generation, gen;
> +  unsigned int nthreads = bar->total;
>  
>    if (__builtin_expect (state & BAR_WAS_LAST, 0))
>      {
> @@ -156,14 +167,19 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
>        if (__builtin_expect (team->task_count, 0))
>  	{
>  	  gomp_barrier_handle_tasks (state);
> +	  /* Assert that all tasks have been handled.  */
> +	  if (team->task_count != 0)
> +	    __builtin_abort ();
> +	  /* See comment in gomp_team_barrier_wait_end.  */
> +	  nthreads = 1;
>  	  state &= ~BAR_WAS_LAST;
>  	}
>        else
>  	{
>  	  state += BAR_INCR - BAR_WAS_LAST;
>  	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
> -	  if (bar->total > 1)
> -	    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> +	  if (nthreads > 1)
> +	    asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
>  	  return false;
>  	}
>      }
> @@ -174,8 +190,8 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
>    generation = state;
>    do
>      {
> -      if (bar->total > 1)
> -	asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> +      if (nthreads > 1)
> +	asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
>        gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
>        if (__builtin_expect (gen & BAR_CANCELLED, 0))
>  	return true;
> diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
> index f18b57bf047..e5c2291e6ff 100644
> --- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
> +++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
> @@ -2,9 +2,6 @@
>  
>  #include <omp.h>
>  #include <assert.h>
> -#include <unistd.h> // For 'alarm'.
> -
> -#include "on_device_arch.h"
>  
>  /* Test tasks with detach clause on an offload device.  Each device
>     thread spawns off a chain of tasks, that can then be executed by
> @@ -12,11 +9,6 @@
>  
>  int main (void)
>  {
> -  //TODO See '../libgomp.c/pr99555-1.c'.
> -  if (on_device_arch_nvptx ())
> -    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
> -		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
> -
>    int x = 0, y = 0, z = 0;
>    int thread_count;
>    omp_event_handle_t detach_event1, detach_event2;
> diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c
> index bd33b93716b..7386e016fd2 100644
> --- a/libgomp/testsuite/libgomp.c/pr99555-1.c
> +++ b/libgomp/testsuite/libgomp.c/pr99555-1.c
> @@ -2,16 +2,8 @@
>  
>  // { dg-additional-options "-O0" }
>  
> -#include <unistd.h> // For 'alarm'.
> -
> -#include "../libgomp.c-c++-common/on_device_arch.h"
> -
>  int main (void)
>  {
> -  if (on_device_arch_nvptx ())
> -    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
> -		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
> -
>  #pragma omp target
>  #pragma omp parallel // num_threads(1)
>  #pragma omp task
> diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> index e4373b4c6f1..03a3b61540d 100644
> --- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> +++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> @@ -1,6 +1,5 @@
>  ! { dg-do run }
>  
> -! { dg-additional-sources on_device_arch.c }
>    ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is valid for Fortran but not for C" }
>  
>  ! Test tasks with detach clause on an offload device.  Each device
> @@ -14,17 +13,6 @@ program task_detach_6
>    integer :: x = 0, y = 0, z = 0
>    integer :: thread_count
>  
> -  interface
> -    integer function on_device_arch_nvptx() bind(C)
> -    end function on_device_arch_nvptx
> -  end interface
> -
> -  !TODO See '../libgomp.c/pr99555-1.c'.
> -  if (on_device_arch_nvptx () /= 0) then
> -     call alarm (4, 0); !TODO Until resolved, make sure that we exit quickly, with error status.
> -     ! { dg-xfail-run-if "PR99555" { offload_device_nvptx } }
> -  end if
> -
>    !$omp target map (tofrom: x, y, z) map (from: thread_count)
>      !$omp parallel private (detach_event1, detach_event2)
>        !$omp single
>
Tom de Vries April 21, 2021, 4:10 p.m. UTC | #2
On 4/20/21 6:11 PM, Alexander Monakov wrote:
> Hello Tom,
> 
> Thank you for the investigation and the detailed writeup. It was difficult for
> me to infer the internal API contracts here (and still is),

Hi Alexander,

thanks for the review.

Yep, same here.

> sorry about the
> mistake.
> 
> Most importantly: does GCN handle this, and if yes, how? I think the solution
> should be the same for config/gcn and config/nvptx (I guess this is a question
> for Andrew).
> 

I looked into gcn/bar.c at gomp_team_barrier_wait_end and found:
...
  int retry = 100;
  do
    {
      if (retry-- == 0)
        {
          /* It really shouldn't happen that barriers get out of sync, but
             if they do then this will loop until they realign, so we need
             to avoid an infinite loop where the thread just isn't there.  */
          const char msg[]
             = ("Barrier sync failed (another thread died?);"
                " aborting.");
          write (2, msg, sizeof (msg)-1);
          abort();
...
which doesn't look promising.

> Some comments inline below:
> 
> On Tue, 20 Apr 2021, Tom de Vries wrote:
> 
>> Hi,
>>
>> Consider the following omp fragment.
>> ...
>>   #pragma omp target
>>   #pragma omp parallel num_threads (2)
>>   #pragma omp task
>>     ;
>> ...
>>
>> This hangs at -O0 for nvptx.
>>
>> Investigating the behaviour gives us the following trace of events:
>> - both threads execute GOMP_task, where they:
>>   - deposit a task, and
>>   - execute gomp_team_barrier_wake
>> - thread 1 executes gomp_team_barrier_wait_end and, not being the last thread,
>>   proceeds to wait at the team barrier
> 
> Shouldn't it try to handle deposited tasks before suspending on the barrier?
> 
> I guess you are describing what the code does, I'm just commenting that I'm
> confused why it behaves so.
> 

Ack.  Yeah, sorry, I've got no idea how the OpenMP internals are
supposed to function.

>> - thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it
>>   calls gomp_barrier_handle_tasks, where it:
>>   - executes both tasks and marks the team barrier done
>>   - executes a gomp_team_barrier_wake which wakes up thread 1
>> - thread 1 exits the team barrier
> 
> Up to this point it looks reasonable.
> 
>> - thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
>>   the team barrier.
> 
> At this point the code should realize that the team barrier was already released
> and not attempt to wait on it again. Maybe by inspecting the generation counter?
> 

Perhaps we can indeed piece together a fix like that.

The problem for me is that writing this sort of fix requires a good
understanding of the semantics of the various fields of gomp_barrier_t,
and I don't have that.

> I may be wrong though, I don't understand the overall flow well enough yet.
> 
>> - thread 0 hangs.
>>
>> To understand why there is a hang here, it's good to understand how things
>> are set up for nvptx.  The libgomp/config/nvptx/bar.c implementation is
>> a copy of the libgomp/config/linux/bar.c implementation, with uses of both
>> futex_wake and do_wait replaced with uses of nvptx insn bar.sync:
>> ...
>>   if (bar->total > 1)
>>     asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
>> ...
>>
>> The point where thread 0 goes to wait at the team barrier, corresponds in
>> the linux implementation with a do_wait.  In the linux case, the call to
>> do_wait doesn't hang, because it's waiting for bar->generation to become
>> a certain value, and if bar->generation already has that value, it just
>> proceeds, without any need for coordination with other threads.
>>
>> In the nvptx case, the bar.sync waits until thread 1 joins it in the same
>> logical barrier, which never happens: thread 1 is lingering in the
>> thread pool at the thread pool barrier (using a different logical barrier),
>> waiting to join a new team.
>>
>> The easiest way to fix this is to revert to the posix implementation for
>> bar.{c,h}.
>>
>> Another way would be to revert to the linux implementation for bar.{c,h},
>> and implement the primitives futex_wait and do_wait using nvptx insns.
> 
> I don't think implementing futex_wait is possible on nvptx.
> 

Well, I gave it a try, attached below.  Can you explain why you think
it's not possible, or pinpoint a problem in the implementation?

[ The benefit of this specific approach for me is separation of
concerns: we copy a working solution as fully as possible, and isolate
the nvptx-specific code to two functions.  This requires us to
understand and provide the semantics of these two functions, and nothing
more. ]

Thanks,
- Tom
Alexander Monakov April 21, 2021, 5:02 p.m. UTC | #3
On Wed, 21 Apr 2021, Tom de Vries wrote:

> > I don't think implementing futex_wait is possible on nvptx.
> > 
> 
> Well, I gave it a try, attached below.  Can you explain why you think
> it's not possible, or pinpoint a problem in the implementation?

Responding only to this for now. When I said futex_wait I really meant
Linux futex wait, where the API is tied to a 32-bit futex control word
and nothing else. Your implementation works with a gomp_barrier_t that
includes more than one field. It would be confusing to call it a
"futex wait", it is not a 1:1 replacement.

(i.e. unlike a proper futex, it can work only for gomp_barrier_t objects)

Alexander
Tom de Vries April 22, 2021, 11:11 a.m. UTC | #4
On 4/21/21 7:02 PM, Alexander Monakov wrote:
> On Wed, 21 Apr 2021, Tom de Vries wrote:
> 
>>> I don't think implementing futex_wait is possible on nvptx.
>>>
>>
>> Well, I gave it a try, attached below.  Can you explain why you think
>> it's not possible, or pinpoint a problem in the implementation?
> 
> Responding only to this for now. When I said futex_wait I really meant
> Linux futex wait, where the API is tied to a 32-bit futex control word
> and nothing else. Your implementation works with a gomp_barrier_t that
> includes more than one field. It would be confusing to call it a
> "futex wait", it is not a 1:1 replacement.
> 
> (i.e. unlike a proper futex, it can work only for gomp_barrier_t objects)

Ah, I see, agreed, that makes sense.  I was afraid there was some
fundamental problem that I overlooked.

Here's an updated version.  I've tried to make it clear that the
futex_wait/wake are locally used versions, not generic functionality.

The main change in structure is that I'm now using the
generation_to_barrier trick from the rtems port, allowing linux/bar.c to
be included rather than copied (because the barrier argument is now
implicit).

Furthermore, I've reviewed the MEMMODELs used for the atomic accesses,
and updated a few.

Also now the cpu_relax from doacross.h is used.

Thanks,
- Tom
Alexander Monakov April 23, 2021, 3:45 p.m. UTC | #5
On Thu, 22 Apr 2021, Tom de Vries wrote:

> Ah, I see, agreed, that makes sense.  I was afraid there was some
> fundamental problem that I overlooked.
> 
> Here's an updated version.  I've tried to make it clear that the
> futex_wait/wake are locally used versions, not generic functionality.

Could you please regenerate the patch passing appropriate flags to
'git format-patch' so it presents a rewrite properly (see documentation
for --patience and --break-rewrites options). The attached patch was mostly
unreadable, I'm afraid.

Alexander
Tom de Vries April 23, 2021, 4:48 p.m. UTC | #6
On 4/23/21 5:45 PM, Alexander Monakov wrote:
> On Thu, 22 Apr 2021, Tom de Vries wrote:
> 
>> Ah, I see, agreed, that makes sense.  I was afraid there was some
>> fundamental problem that I overlooked.
>>
>> Here's an updated version.  I've tried to make it clear that the
>> futex_wait/wake are locally used versions, not generic functionality.
> 
> Could you please regenerate the patch passing appropriate flags to
> 'git format-patch' so it presents a rewrite properly (see documentation
> for --patience and --break-rewrites options). The attached patch was mostly
> unreadable, I'm afraid.

Sure.  I did notice that the patch was not readable, but I didn't know
there were options to improve that, so thanks for pointing that out.

Thanks,
- Tom
Tom de Vries May 19, 2021, 2:52 p.m. UTC | #7
On 4/23/21 6:48 PM, Tom de Vries wrote:
> On 4/23/21 5:45 PM, Alexander Monakov wrote:
>> On Thu, 22 Apr 2021, Tom de Vries wrote:
>>
>>> Ah, I see, agreed, that makes sense.  I was afraid there was some
>>> fundamental problem that I overlooked.
>>>
>>> Here's an updated version.  I've tried to make it clear that the
>>> futex_wait/wake are locally used versions, not generic functionality.
>> Could you please regenerate the patch passing appropriate flags to
>> 'git format-patch' so it presents a rewrite properly (see documentation
>> for --patience and --break-rewrites options). The attached patch was mostly
>> unreadable, I'm afraid.
> Sure.  I did notice that the patch was not readable, but I didn't know
> there were options to improve that, so thanks for pointing that out.
> 

Ping.  Any comments?

Thanks,
- Tom

> 0001-libgomp-nvptx-Fix-hang-in-gomp_team_barrier_wait_end.patch
> 
> From d3053a7ec7444b371ee29097a673e637b0d369d9 Mon Sep 17 00:00:00 2001
> From: Tom de Vries <tdevries@suse.de>
> Date: Tue, 20 Apr 2021 08:47:03 +0200
> Subject: [PATCH 1/4] [libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
> 
> Consider the following omp fragment.
> ...
>   #pragma omp target
>   #pragma omp parallel num_threads (2)
>   #pragma omp task
>     ;
> ...
> 
> This hangs at -O0 for nvptx.
> 
> Investigating the behaviour gives us the following trace of events:
> - both threads execute GOMP_task, where they:
>   - deposit a task, and
>   - execute gomp_team_barrier_wake
> - thread 1 executes gomp_team_barrier_wait_end and, not being the last thread,
>   proceeds to wait at the team barrier
> - thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it
>   calls gomp_barrier_handle_tasks, where it:
>   - executes both tasks and marks the team barrier done
>   - executes a gomp_team_barrier_wake which wakes up thread 1
> - thread 1 exits the team barrier
> - thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
>   the team barrier.
> - thread 0 hangs.
> 
> To understand why there is a hang here, it's good to understand how things
> are set up for nvptx.  The libgomp/config/nvptx/bar.c implementation is
> a copy of the libgomp/config/linux/bar.c implementation, with uses of both
> futex_wake and do_wait replaced with uses of ptx insn bar.sync:
> ...
>   if (bar->total > 1)
>     asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> ...
> 
> The point where thread 0 goes to wait at the team barrier, corresponds in
> the linux implementation with a do_wait.  In the linux case, the call to
> do_wait doesn't hang, because it's waiting for bar->generation to become
> a certain value, and if bar->generation already has that value, it just
> proceeds, without any need for coordination with other threads.
> 
> In the nvptx case, the bar.sync waits until thread 1 joins it in the same
> logical barrier, which never happens: thread 1 is lingering in the
> thread pool at the thread pool barrier (using a different logical barrier),
> waiting to join a new team.
> 
> The easiest way to fix this is to revert to the posix implementation for
> bar.{c,h}.  That however falls back on a busy-waiting approach, and
> does not take advantage of the ptx bar.sync insn.
> 
> Instead, we revert to the linux implementation for bar.c,
> and implement bar.c local functions futex_wait and futex_wake using the
> bar.sync insn.
> 
> This is a WIP version that does not yet take performance into consideration,
> but instead focuses on copying a working version as completely as possible,
> and isolating the machine-specific changes to as few functions as
> possible.
> 
> The bar.sync insn takes an argument specifying how many threads are
> participating, and that doesn't play well with the futex syntax where it's
> not clear in advance how many threads will be woken up.
> 
> This is solved by waking up all waiting threads each time a futex_wait or
> futex_wake happens, and possibly going back to sleep with an updated thread
> count.
> 
> Tested libgomp on x86_64 with nvptx accelerator, both as-is and with
> do_spin hardcoded to 1.
> 
> libgomp/ChangeLog:
> 
> 2021-04-20  Tom de Vries  <tdevries@suse.de>
> 
> 	PR target/99555
> 	* config/nvptx/bar.c (generation_to_barrier): New function, copied
> 	from config/rtems/bar.c.
> 	(futex_wait, futex_wake): New function.
> 	(do_spin, do_wait): New function, copied from config/linux/wait.h.
> 	(gomp_barrier_wait_end, gomp_barrier_wait_last)
> 	(gomp_team_barrier_wake, gomp_team_barrier_wait_end):
> 	(gomp_team_barrier_wait_cancel_end, gomp_team_barrier_cancel): Remove
> 	and replace with include of config/linux/bar.c.
> 	* config/nvptx/bar.h (gomp_barrier_t): Add fields waiters and lock.
> 	(gomp_barrier_init): Init new fields.
> 	* testsuite/libgomp.c-c++-common/task-detach-6.c: Remove nvptx-specific
> 	workarounds.
> 	* testsuite/libgomp.c/pr99555-1.c: Same.
> 	* testsuite/libgomp.fortran/task-detach-6.f90: Same.
> ---
>  libgomp/config/nvptx/bar.c                    | 388 ++++++++----------
>  libgomp/config/nvptx/bar.h                    |   4 +
>  .../libgomp.c-c++-common/task-detach-6.c      |   8 -
>  libgomp/testsuite/libgomp.c/pr99555-1.c       |   8 -
>  .../libgomp.fortran/task-detach-6.f90         |  12 -
>  5 files changed, 180 insertions(+), 240 deletions(-)
>  rewrite libgomp/config/nvptx/bar.c (76%)
> 
> diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
> dissimilarity index 76%
> index c5c2fa8829b..e0e6e5ed839 100644
> --- a/libgomp/config/nvptx/bar.c
> +++ b/libgomp/config/nvptx/bar.c
> @@ -1,212 +1,176 @@
> -/* Copyright (C) 2015-2021 Free Software Foundation, Inc.
> -   Contributed by Alexander Monakov <amonakov@ispras.ru>
> -
> -   This file is part of the GNU Offloading and Multi Processing Library
> -   (libgomp).
> -
> -   Libgomp is free software; you can redistribute it and/or modify it
> -   under the terms of the GNU General Public License as published by
> -   the Free Software Foundation; either version 3, or (at your option)
> -   any later version.
> -
> -   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
> -   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
> -   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> -   more details.
> -
> -   Under Section 7 of GPL version 3, you are granted additional
> -   permissions described in the GCC Runtime Library Exception, version
> -   3.1, as published by the Free Software Foundation.
> -
> -   You should have received a copy of the GNU General Public License and
> -   a copy of the GCC Runtime Library Exception along with this program;
> -   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> -   <http://www.gnu.org/licenses/>.  */
> -
> -/* This is an NVPTX specific implementation of a barrier synchronization
> -   mechanism for libgomp.  This type is private to the library.  This
> -   implementation uses atomic instructions and bar.sync instruction.  */
> -
> -#include <limits.h>
> -#include "libgomp.h"
> -
> -
> -void
> -gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
> -{
> -  if (__builtin_expect (state & BAR_WAS_LAST, 0))
> -    {
> -      /* Next time we'll be awaiting TOTAL threads again.  */
> -      bar->awaited = bar->total;
> -      __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
> -			MEMMODEL_RELEASE);
> -    }
> -  if (bar->total > 1)
> -    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> -}
> -
> -void
> -gomp_barrier_wait (gomp_barrier_t *bar)
> -{
> -  gomp_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
> -}
> -
> -/* Like gomp_barrier_wait, except that if the encountering thread
> -   is not the last one to hit the barrier, it returns immediately.
> -   The intended usage is that a thread which intends to gomp_barrier_destroy
> -   this barrier calls gomp_barrier_wait, while all other threads
> -   call gomp_barrier_wait_last.  When gomp_barrier_wait returns,
> -   the barrier can be safely destroyed.  */
> -
> -void
> -gomp_barrier_wait_last (gomp_barrier_t *bar)
> -{
> -  /* Deferring to gomp_barrier_wait does not use the optimization opportunity
> -     allowed by the interface contract for all-but-last participants.  The
> -     original implementation in config/linux/bar.c handles this better.  */
> -  gomp_barrier_wait (bar);
> -}
> -
> -void
> -gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
> -{
> -  if (bar->total > 1)
> -    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> -}
> -
> -void
> -gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
> -{
> -  unsigned int generation, gen;
> -
> -  if (__builtin_expect (state & BAR_WAS_LAST, 0))
> -    {
> -      /* Next time we'll be awaiting TOTAL threads again.  */
> -      struct gomp_thread *thr = gomp_thread ();
> -      struct gomp_team *team = thr->ts.team;
> -
> -      bar->awaited = bar->total;
> -      team->work_share_cancelled = 0;
> -      if (__builtin_expect (team->task_count, 0))
> -	{
> -	  gomp_barrier_handle_tasks (state);
> -	  state &= ~BAR_WAS_LAST;
> -	}
> -      else
> -	{
> -	  state &= ~BAR_CANCELLED;
> -	  state += BAR_INCR - BAR_WAS_LAST;
> -	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
> -	  if (bar->total > 1)
> -	    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> -	  return;
> -	}
> -    }
> -
> -  generation = state;
> -  state &= ~BAR_CANCELLED;
> -  do
> -    {
> -      if (bar->total > 1)
> -	asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> -      gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
> -      if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
> -	{
> -	  gomp_barrier_handle_tasks (state);
> -	  gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
> -	}
> -      generation |= gen & BAR_WAITING_FOR_TASK;
> -    }
> -  while (gen != state + BAR_INCR);
> -}
> -
> -void
> -gomp_team_barrier_wait (gomp_barrier_t *bar)
> -{
> -  gomp_team_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
> -}
> -
> -void
> -gomp_team_barrier_wait_final (gomp_barrier_t *bar)
> -{
> -  gomp_barrier_state_t state = gomp_barrier_wait_final_start (bar);
> -  if (__builtin_expect (state & BAR_WAS_LAST, 0))
> -    bar->awaited_final = bar->total;
> -  gomp_team_barrier_wait_end (bar, state);
> -}
> -
> -bool
> -gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
> -				   gomp_barrier_state_t state)
> -{
> -  unsigned int generation, gen;
> -
> -  if (__builtin_expect (state & BAR_WAS_LAST, 0))
> -    {
> -      /* Next time we'll be awaiting TOTAL threads again.  */
> -      /* BAR_CANCELLED should never be set in state here, because
> -	 cancellation means that at least one of the threads has been
> -	 cancelled, thus on a cancellable barrier we should never see
> -	 all threads to arrive.  */
> -      struct gomp_thread *thr = gomp_thread ();
> -      struct gomp_team *team = thr->ts.team;
> -
> -      bar->awaited = bar->total;
> -      team->work_share_cancelled = 0;
> -      if (__builtin_expect (team->task_count, 0))
> -	{
> -	  gomp_barrier_handle_tasks (state);
> -	  state &= ~BAR_WAS_LAST;
> -	}
> -      else
> -	{
> -	  state += BAR_INCR - BAR_WAS_LAST;
> -	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
> -	  if (bar->total > 1)
> -	    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> -	  return false;
> -	}
> -    }
> -
> -  if (__builtin_expect (state & BAR_CANCELLED, 0))
> -    return true;
> -
> -  generation = state;
> -  do
> -    {
> -      if (bar->total > 1)
> -	asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> -      gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
> -      if (__builtin_expect (gen & BAR_CANCELLED, 0))
> -	return true;
> -      if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
> -	{
> -	  gomp_barrier_handle_tasks (state);
> -	  gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
> -	}
> -      generation |= gen & BAR_WAITING_FOR_TASK;
> -    }
> -  while (gen != state + BAR_INCR);
> -
> -  return false;
> -}
> -
> -bool
> -gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
> -{
> -  return gomp_team_barrier_wait_cancel_end (bar, gomp_barrier_wait_start (bar));
> -}
> -
> -void
> -gomp_team_barrier_cancel (struct gomp_team *team)
> -{
> -  gomp_mutex_lock (&team->task_lock);
> -  if (team->barrier.generation & BAR_CANCELLED)
> -    {
> -      gomp_mutex_unlock (&team->task_lock);
> -      return;
> -    }
> -  team->barrier.generation |= BAR_CANCELLED;
> -  gomp_mutex_unlock (&team->task_lock);
> -  gomp_team_barrier_wake (&team->barrier, INT_MAX);
> -}
> +/* Copyright (C) 2015-2021 Free Software Foundation, Inc.
> +   Contributed by Alexander Monakov <amonakov@ispras.ru>
> +
> +   This file is part of the GNU Offloading and Multi Processing Library
> +   (libgomp).
> +
> +   Libgomp is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
> +   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
> +   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> +   more details.
> +
> +   Under Section 7 of GPL version 3, you are granted additional
> +   permissions described in the GCC Runtime Library Exception, version
> +   3.1, as published by the Free Software Foundation.
> +
> +   You should have received a copy of the GNU General Public License and
> +   a copy of the GCC Runtime Library Exception along with this program;
> +   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* This is an NVPTX specific implementation of a barrier synchronization
> +   mechanism for libgomp.  This type is private to the library.  This
> +   implementation uses atomic instructions and bar.sync instruction.  */
> +
> +#include <limits.h>
> +#include "libgomp.h"
> +
> +/* For cpu_relax.  */
> +#include "doacross.h"
> +
> +/* Assuming ADDR is &bar->generation, return bar.  Copied from
> +   rtems/bar.c.  */
> +
> +static gomp_barrier_t *
> +generation_to_barrier (int *addr)
> +{
> +  char *bar
> +    = (char *) addr - __builtin_offsetof (gomp_barrier_t, generation);
> +  return (gomp_barrier_t *)bar;
> +}
> +
> +/* Implement futex_wait-like behaviour to plug into the linux/bar.c
> +   implementation.  Assumes ADDR is &bar->generation.   */
> +
> +static inline void
> +futex_wait (int *addr, int val)
> +{
> +  gomp_barrier_t *bar = generation_to_barrier (addr);
> +
> +  if (bar->total < 2)
> +    /* A barrier with less than two threads, nop.  */
> +    return;
> +
> +  gomp_mutex_lock (&bar->lock);
> +
> +  /* Futex semantics: only go to sleep if *addr == val.  */
> +  if (__builtin_expect (__atomic_load_n (addr, MEMMODEL_ACQUIRE) != val, 0))
> +    {
> +      gomp_mutex_unlock (&bar->lock);
> +      return;
> +    }
> +
> +  /* Register as waiter.  */
> +  unsigned int waiters
> +    = __atomic_add_fetch (&bar->waiters, 1, MEMMODEL_ACQ_REL);
> +  if (waiters == 0)
> +    __builtin_abort ();
> +  unsigned int waiter_id = waiters;
> +
> +  if (waiters > 1)
> +    {
> +      /* Wake other threads in bar.sync.  */
> +      asm volatile ("bar.sync 1, %0;" : : "r" (32 * waiters));
> +
> +      /* Ensure that they have updated waiters.  */
> +      asm volatile ("bar.sync 1, %0;" : : "r" (32 * waiters));
> +    }
> +
> +  gomp_mutex_unlock (&bar->lock);
> +
> +  while (1)
> +    {
> +      /* Wait for next thread in barrier.  */
> +      asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
> +
> +      /* Get updated waiters.  */
> +      unsigned int updated_waiters
> +	= __atomic_load_n (&bar->waiters, MEMMODEL_ACQUIRE);
> +
> +      /* Notify that we have updated waiters.  */
> +      asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
> +
> +      waiters = updated_waiters;
> +
> +      if (waiter_id > waiters)
> +	/* A wake happened, and we're in the group of woken threads.  */
> +	break;
> +
> +      /* Continue waiting.  */
> +    }
> +}
> +
> +/* Implement futex_wake-like behaviour to plug into the linux/bar.c
> +   implementation.  Assumes ADDR is &bar->generation.  */
> +
> +static inline void
> +futex_wake (int *addr, int count)
> +{
> +  gomp_barrier_t *bar = generation_to_barrier (addr);
> +
> +  if (bar->total < 2)
> +    /* A barrier with less than two threads, nop.  */
> +    return;
> +
> +  gomp_mutex_lock (&bar->lock);
> +  unsigned int waiters = __atomic_load_n (&bar->waiters, MEMMODEL_ACQUIRE);
> +  if (waiters == 0)
> +    {
> +      /* No threads to wake.  */
> +      gomp_mutex_unlock (&bar->lock);
> +      return;
> +    }
> +
> +  if (count == INT_MAX)
> +    /* Release all threads.  */
> +    __atomic_store_n (&bar->waiters, 0, MEMMODEL_RELEASE);
> +  else if (count < bar->total)
> +    /* Release count threads.  */
> +    __atomic_add_fetch (&bar->waiters, -count, MEMMODEL_ACQ_REL);
> +  else
> +    /* Count has an illegal value.  */
> +    __builtin_abort ();
> +
> +  /* Wake other threads in bar.sync.  */
> +  asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
> +
> +  /* Let them get the updated waiters.  */
> +  asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
> +
> +  gomp_mutex_unlock (&bar->lock);
> +}
> +
> +/* Copied from linux/wait.h.  */
> +
> +static inline int do_spin (int *addr, int val)
> +{
> +  unsigned long long i, count = gomp_spin_count_var;
> +
> +  if (__builtin_expect (__atomic_load_n (&gomp_managed_threads,
> +					 MEMMODEL_RELAXED)
> +			> gomp_available_cpus, 0))
> +    count = gomp_throttled_spin_count_var;
> +  for (i = 0; i < count; i++)
> +    if (__builtin_expect (__atomic_load_n (addr, MEMMODEL_RELAXED) != val, 0))
> +      return 0;
> +    else
> +      cpu_relax ();
> +  return 1;
> +}
> +
> +/* Copied from linux/wait.h.  */
> +
> +static inline void do_wait (int *addr, int val)
> +{
> +  if (do_spin (addr, val))
> +    futex_wait (addr, val);
> +}
> +
> +/* Reuse the linux implementation.  */
> +#define GOMP_WAIT_H 1
> +#include "../linux/bar.c"
> diff --git a/libgomp/config/nvptx/bar.h b/libgomp/config/nvptx/bar.h
> index 9bf3d914a02..c69426e1629 100644
> --- a/libgomp/config/nvptx/bar.h
> +++ b/libgomp/config/nvptx/bar.h
> @@ -38,6 +38,8 @@ typedef struct
>    unsigned generation;
>    unsigned awaited;
>    unsigned awaited_final;
> +  unsigned waiters;
> +  gomp_mutex_t lock;
>  } gomp_barrier_t;
>  
>  typedef unsigned int gomp_barrier_state_t;
> @@ -57,6 +59,8 @@ static inline void gomp_barrier_init (gomp_barrier_t *bar, unsigned count)
>    bar->awaited = count;
>    bar->awaited_final = count;
>    bar->generation = 0;
> +  bar->waiters = 0;
> +  gomp_mutex_init (&bar->lock);
>  }
>  
>  static inline void gomp_barrier_reinit (gomp_barrier_t *bar, unsigned count)
> diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
> index f18b57bf047..e5c2291e6ff 100644
> --- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
> +++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
> @@ -2,9 +2,6 @@
>  
>  #include <omp.h>
>  #include <assert.h>
> -#include <unistd.h> // For 'alarm'.
> -
> -#include "on_device_arch.h"
>  
>  /* Test tasks with detach clause on an offload device.  Each device
>     thread spawns off a chain of tasks, that can then be executed by
> @@ -12,11 +9,6 @@
>  
>  int main (void)
>  {
> -  //TODO See '../libgomp.c/pr99555-1.c'.
> -  if (on_device_arch_nvptx ())
> -    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
> -		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
> -
>    int x = 0, y = 0, z = 0;
>    int thread_count;
>    omp_event_handle_t detach_event1, detach_event2;
> diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c
> index bd33b93716b..7386e016fd2 100644
> --- a/libgomp/testsuite/libgomp.c/pr99555-1.c
> +++ b/libgomp/testsuite/libgomp.c/pr99555-1.c
> @@ -2,16 +2,8 @@
>  
>  // { dg-additional-options "-O0" }
>  
> -#include <unistd.h> // For 'alarm'.
> -
> -#include "../libgomp.c-c++-common/on_device_arch.h"
> -
>  int main (void)
>  {
> -  if (on_device_arch_nvptx ())
> -    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
> -		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
> -
>  #pragma omp target
>  #pragma omp parallel // num_threads(1)
>  #pragma omp task
> diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> index e4373b4c6f1..03a3b61540d 100644
> --- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> +++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> @@ -1,6 +1,5 @@
>  ! { dg-do run }
>  
> -! { dg-additional-sources on_device_arch.c }
>    ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is valid for Fortran but not for C" }
>  
>  ! Test tasks with detach clause on an offload device.  Each device
> @@ -14,17 +13,6 @@ program task_detach_6
>    integer :: x = 0, y = 0, z = 0
>    integer :: thread_count
>  
> -  interface
> -    integer function on_device_arch_nvptx() bind(C)
> -    end function on_device_arch_nvptx
> -  end interface
> -
> -  !TODO See '../libgomp.c/pr99555-1.c'.
> -  if (on_device_arch_nvptx () /= 0) then
> -     call alarm (4, 0); !TODO Until resolved, make sure that we exit quickly, with error status.
> -     ! { dg-xfail-run-if "PR99555" { offload_device_nvptx } }
> -  end if
> -
>    !$omp target map (tofrom: x, y, z) map (from: thread_count)
>      !$omp parallel private (detach_event1, detach_event2)
>        !$omp single
> -- 2.28.0
>
Thomas Schwinge May 20, 2021, 9:52 a.m. UTC | #8
Hi Tom!

First, thanks for looking into this PR99555!


I can't comment on the OpenMP/nvptx changes, so just the following:

On 2021-04-23T18:48:01+0200, Tom de Vries <tdevries@suse.de> wrote:
> --- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> +++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> @@ -1,6 +1,5 @@
>  ! { dg-do run }
>
> -! { dg-additional-sources on_device_arch.c }
>    ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is valid for Fortran but not for C" }

Please remove the 'dg-prune-output', too.  ;-)


Your changes leave
'libgomp/testsuite/lib/libgomp.exp:check_effective_target_offload_device_nvptx',
'libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h',
'libgomp/testsuite/libgomp.fortran/on_device_arch.c' unused.  Should we
keep those for potential future use (given that they've been tested to
work) or remove them (as now unused, with the danger of bit-rot)?


Grüße
 Thomas
Tom de Vries May 20, 2021, 11:41 a.m. UTC | #9
On 5/20/21 11:52 AM, Thomas Schwinge wrote:
> Hi Tom!
> 
> First, thanks for looking into this PR99555!
> 
> 
> I can't comment on the OpenMP/nvptx changes, so just the following:
> 
> On 2021-04-23T18:48:01+0200, Tom de Vries <tdevries@suse.de> wrote:
>> --- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
>> +++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
>> @@ -1,6 +1,5 @@
>>  ! { dg-do run }
>>
>> -! { dg-additional-sources on_device_arch.c }
>>    ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is valid for Fortran but not for C" }
> 
> Please remove the 'dg-prune-output', too.  ;-)
> 

Ack, updated patch.

> Your changes leave
> 'libgomp/testsuite/lib/libgomp.exp:check_effective_target_offload_device_nvptx',
> 'libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h',
> 'libgomp/testsuite/libgomp.fortran/on_device_arch.c' unused.  Should we
> keep those for a potential future use (given that they've been tested to
> work) or remove (as now unused, danger of bit-rot)?

I vote to leave them in, they look useful, and I think the danger of
bit-rot is less than the danger of not knowing/remembering that they
once were there and having to start from scratch.

Thanks,
- Tom
Tobias Burnus Nov. 26, 2021, 12:10 p.m. UTC | #10
*PING* - While I am not the patch author, I think it should be fixed.  Thus:

Alexander, since you asked for the updated diff and commented, can you
have a look?

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/568616.html [PR
target/99555]

Thanks,

Tobias

On 23.04.21 18:48, Tom de Vries wrote:
> On 4/23/21 5:45 PM, Alexander Monakov wrote:
>> On Thu, 22 Apr 2021, Tom de Vries wrote:
>>
>>> Ah, I see, agreed, that makes sense.  I was afraid there was some
>>> fundamental problem that I overlooked.
>>>
>>> Here's an updated version.  I've tried to make it clear that the
>>> futex_wait/wake are locally used versions, not generic functionality.
>> Could you please regenerate the patch passing appropriate flags to
>> 'git format-patch' so it presents a rewrite properly (see documentation
>> for --patience and --break-rewrites options). The attached patch was mostly
>> unreadable, I'm afraid.
> Sure.  I did notice that the patch was not readable, but I didn't know
> there were options to improve that, so thanks for pointing that out.
>
> Thanks,
> - Tom
Tom de Vries Feb. 22, 2022, 2:52 p.m. UTC | #11
On 5/19/21 16:52, Tom de Vries wrote:
> On 4/23/21 6:48 PM, Tom de Vries wrote:
>> On 4/23/21 5:45 PM, Alexander Monakov wrote:
>>> On Thu, 22 Apr 2021, Tom de Vries wrote:
>>>
>>>> Ah, I see, agreed, that makes sense.  I was afraid there was some
>>>> fundamental problem that I overlooked.
>>>>
>>>> Here's an updated version.  I've tried to make it clear that the
>>>> futex_wait/wake are locally used versions, not generic functionality.
>>> Could you please regenerate the patch passing appropriate flags to
>>> 'git format-patch' so it presents a rewrite properly (see documentation
>>> for --patience and --break-rewrites options). The attached patch was mostly
>>> unreadable, I'm afraid.
>> Sure.  I did notice that the patch was not readable, but I didn't know
>> there were options to improve that, so thanks for pointing that out.
>>
> 
> Ping.  Any comments?

I've hardcoded do_spin to 1, and tested on:
- turing, pascal, maxwell (510.x driver)
- kepler (470.x driver)

Committed.

Thanks,
- Tom
Patch

diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
index c5c2fa8829b..058a8d4d5ca 100644
--- a/libgomp/config/nvptx/bar.c
+++ b/libgomp/config/nvptx/bar.c
@@ -78,6 +78,7 @@  void
 gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
 {
   unsigned int generation, gen;
+  unsigned int nthreads = bar->total;
 
   if (__builtin_expect (state & BAR_WAS_LAST, 0))
     {
@@ -90,6 +91,15 @@  gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
       if (__builtin_expect (team->task_count, 0))
 	{
 	  gomp_barrier_handle_tasks (state);
+	  /* Assert that all tasks have been handled.  */
+	  if (team->task_count != 0)
+	    __builtin_abort ();
+	  /* In gomp_barrier_handle_tasks, the team barrier has been marked
+	     as done, and all pending threads woken up.  So this is now the
+	     last and only thread in the barrier.  Adjust nthreads to
+	     reflect the new situation, to make sure we don't hang
+	     indefinitely at the bar.sync below.  */
+	  nthreads = 1;
 	  state &= ~BAR_WAS_LAST;
 	}
       else
@@ -97,8 +107,8 @@  gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
 	  state &= ~BAR_CANCELLED;
 	  state += BAR_INCR - BAR_WAS_LAST;
 	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
-	  if (bar->total > 1)
-	    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+	  if (nthreads > 1)
+	    asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
 	  return;
 	}
     }
@@ -107,8 +117,8 @@  gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
   state &= ~BAR_CANCELLED;
   do
     {
-      if (bar->total > 1)
-	asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+      if (nthreads > 1)
+	asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
       gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
       if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
 	{
@@ -140,6 +150,7 @@  gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
 				   gomp_barrier_state_t state)
 {
   unsigned int generation, gen;
+  unsigned int nthreads = bar->total;
 
   if (__builtin_expect (state & BAR_WAS_LAST, 0))
     {
@@ -156,14 +167,19 @@  gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
       if (__builtin_expect (team->task_count, 0))
 	{
 	  gomp_barrier_handle_tasks (state);
+	  /* Assert that all tasks have been handled.  */
+	  if (team->task_count != 0)
+	    __builtin_abort ();
+	  /* See comment in gomp_team_barrier_wait_end.  */
+	  nthreads = 1;
 	  state &= ~BAR_WAS_LAST;
 	}
       else
 	{
 	  state += BAR_INCR - BAR_WAS_LAST;
 	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
-	  if (bar->total > 1)
-	    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+	  if (nthreads > 1)
+	    asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
 	  return false;
 	}
     }
@@ -174,8 +190,8 @@  gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
   generation = state;
   do
     {
-      if (bar->total > 1)
-	asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+      if (nthreads > 1)
+	asm ("bar.sync 1, %0;" : : "r" (32 * nthreads));
       gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
       if (__builtin_expect (gen & BAR_CANCELLED, 0))
 	return true;
diff --git a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
index f18b57bf047..e5c2291e6ff 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/task-detach-6.c
@@ -2,9 +2,6 @@ 
 
 #include <omp.h>
 #include <assert.h>
-#include <unistd.h> // For 'alarm'.
-
-#include "on_device_arch.h"
 
 /* Test tasks with detach clause on an offload device.  Each device
    thread spawns off a chain of tasks, that can then be executed by
@@ -12,11 +9,6 @@ 
 
 int main (void)
 {
-  //TODO See '../libgomp.c/pr99555-1.c'.
-  if (on_device_arch_nvptx ())
-    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
-		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
-
   int x = 0, y = 0, z = 0;
   int thread_count;
   omp_event_handle_t detach_event1, detach_event2;
diff --git a/libgomp/testsuite/libgomp.c/pr99555-1.c b/libgomp/testsuite/libgomp.c/pr99555-1.c
index bd33b93716b..7386e016fd2 100644
--- a/libgomp/testsuite/libgomp.c/pr99555-1.c
+++ b/libgomp/testsuite/libgomp.c/pr99555-1.c
@@ -2,16 +2,8 @@ 
 
 // { dg-additional-options "-O0" }
 
-#include <unistd.h> // For 'alarm'.
-
-#include "../libgomp.c-c++-common/on_device_arch.h"
-
 int main (void)
 {
-  if (on_device_arch_nvptx ())
-    alarm (4); /*TODO Until resolved, make sure that we exit quickly, with error status.
-		 { dg-xfail-run-if "PR99555" { offload_device_nvptx } } */
-
 #pragma omp target
 #pragma omp parallel // num_threads(1)
 #pragma omp task
diff --git a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90 b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
index e4373b4c6f1..03a3b61540d 100644
--- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
+++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
@@ -1,6 +1,5 @@ 
 ! { dg-do run }
 
-! { dg-additional-sources on_device_arch.c }
   ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is valid for Fortran but not for C" }
 
 ! Test tasks with detach clause on an offload device.  Each device
@@ -14,17 +13,6 @@  program task_detach_6
   integer :: x = 0, y = 0, z = 0
   integer :: thread_count
 
-  interface
-    integer function on_device_arch_nvptx() bind(C)
-    end function on_device_arch_nvptx
-  end interface
-
-  !TODO See '../libgomp.c/pr99555-1.c'.
-  if (on_device_arch_nvptx () /= 0) then
-     call alarm (4, 0); !TODO Until resolved, make sure that we exit quickly, with error status.
-     ! { dg-xfail-run-if "PR99555" { offload_device_nvptx } }
-  end if
-
   !$omp target map (tofrom: x, y, z) map (from: thread_count)
     !$omp parallel private (detach_event1, detach_event2)
       !$omp single