
[omp, ftracer] Don't duplicate blocks in SIMT region

Message ID: de220c39-f7cf-0cb4-13e8-b863e9643d0f@suse.de
State: New
Series: [omp, ftracer] Don't duplicate blocks in SIMT region

Commit Message

Tom de Vries Sept. 22, 2020, 4:38 p.m. UTC
[ was: Re: [Patch] [middle-end & nvptx] gcc/tracer.c: Don't split BB
with SIMT LANE [PR95654] ]

On 9/16/20 8:20 PM, Alexander Monakov wrote:
> 
> 
> On Wed, 16 Sep 2020, Tom de Vries wrote:
> 
>> [ cc-ing author omp support for nvptx. ]
> 
> The issue looks familiar. I recognized it back in 2017 (and LLVM people
> recognized it too for their GPU targets). In an attempt to get agreement
> to fix the issue "properly" for GCC I found a similar issue that affects
> all targets, not just offloading, and filed it as PR 80053.
> 
> (yes, there are no addressable labels involved in offloading, but nevertheless
> the nature of the middle-end issue is related)

Hi Alexander,

thanks for looking into this.

Seeing that the attempt to fix things properly is stalled, for now I'm
proposing a point-fix, similar to the original patch proposed by Tobias.

Richi, Jakub, OK for trunk?

Thanks,
- Tom

Comments

Richard Biener Sept. 23, 2020, 7:28 a.m. UTC | #1
On Tue, 22 Sep 2020, Tom de Vries wrote:

> [ was: Re: [Patch] [middle-end & nvptx] gcc/tracer.c: Don't split BB
> with SIMT LANE [PR95654] ]
> 
> On 9/16/20 8:20 PM, Alexander Monakov wrote:
> > 
> > 
> > On Wed, 16 Sep 2020, Tom de Vries wrote:
> > 
> >> [ cc-ing author omp support for nvptx. ]
> > 
> > The issue looks familiar. I recognized it back in 2017 (and LLVM people
> > recognized it too for their GPU targets). In an attempt to get agreement
> > to fix the issue "properly" for GCC I found a similar issue that affects
> > all targets, not just offloading, and filed it as PR 80053.
> > 
> > (yes, there are no addressable labels involved in offloading, but nevertheless
> > the nature of the middle-end issue is related)
> 
> Hi Alexander,
> 
> thanks for looking into this.
> 
> Seeing that the attempt to fix things properly is stalled, for now I'm
> proposing a point-fix, similar to the original patch proposed by Tobias.
> 
> Richi, Jakub, OK for trunk?

I notice that we call ignore_bb_p many times in tracer.c but one call
is conveniently early in tail_duplicate (void):

      int n = count_insns (bb);
      if (!ignore_bb_p (bb))
        blocks[bb->index] = heap.insert (-bb->count.to_frequency (cfun), 
bb);

where count_insns already walks all stmts in the block.  It would be
nice to avoid repeatedly walking all stmts, maybe adjusting the above
call is enough and/or count_insns can compute this and/or the ignore_bb_p
result can be cached (optimize_bb_for_size_p might change though,
but maybe all other ignore_bb_p calls effectively just are that,
checks for blocks that became optimize_bb_for_size_p).

Richard.

> Thanks,
> - Tom
> 
>
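
One of the directions Richard sketches above, "count_insns can compute
this", could look roughly like the following.  This is an untested
illustration against tracer.c, not a posted patch; the bool
out-parameter is hypothetical:

/* Hypothetical variant of tracer.c's count_insns: compute the size
   estimate and the "contains a stmt that must not be duplicated"
   property in one statement walk, so ignore_bb_p need not rescan
   the block.  */

static int
count_insns (basic_block bb, bool *can_duplicate)
{
  int n = 0;

  *can_duplicate = true;
  for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
       !gsi_end_p (gsi); gsi_next (&gsi))
    {
      gimple *stmt = gsi_stmt (gsi);
      n += estimate_num_insns (stmt, &eni_size_weights);

      /* SIMT region markers must not be duplicated; see the patch
	 at the bottom of this page.  */
      if (is_gimple_call (stmt)
	  && (gimple_call_internal_p (stmt, IFN_GOMP_SIMT_ENTER_ALLOC)
	      || gimple_call_internal_p (stmt, IFN_GOMP_SIMT_EXIT)))
	*can_duplicate = false;
    }
  return n;
}
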
Tom de Vries Sept. 23, 2020, 4:53 p.m. UTC | #2
On 9/23/20 9:28 AM, Richard Biener wrote:
> On Tue, 22 Sep 2020, Tom de Vries wrote:
> 
>> [ was: Re: [Patch] [middle-end & nvptx] gcc/tracer.c: Don't split BB
>> with SIMT LANE [PR95654] ]
>>
>> On 9/16/20 8:20 PM, Alexander Monakov wrote:
>>>
>>>
>>> On Wed, 16 Sep 2020, Tom de Vries wrote:
>>>
>>>> [ cc-ing author omp support for nvptx. ]
>>>
>>> The issue looks familiar. I recognized it back in 2017 (and LLVM people
>>> recognized it too for their GPU targets). In an attempt to get agreement
>>> to fix the issue "properly" for GCC I found a similar issue that affects
>>> all targets, not just offloading, and filed it as PR 80053.
>>>
>>> (yes, there are no addressable labels involved in offloading, but nevertheless
>>> the nature of the middle-end issue is related)
>>
>> Hi Alexander,
>>
>> thanks for looking into this.
>>
>> Seeing that the attempt to fix things properly is stalled, for now I'm
>> proposing a point-fix, similar to the original patch proposed by Tobias.
>>
>> Richi, Jakub, OK for trunk?
> 
> I notice that we call ignore_bb_p many times in tracer.c but one call
> is conveniently early in tail_duplicate (void):
> 
>       int n = count_insns (bb);
>       if (!ignore_bb_p (bb))
>         blocks[bb->index] = heap.insert (-bb->count.to_frequency (cfun), 
> bb);
> 
> where count_insns already walks all stmts in the block.  It would be
> nice to avoid repeatedly walking all stmts, maybe adjusting the above
> call is enough and/or count_insns can compute this and/or the ignore_bb_p
> result can be cached (optimize_bb_for_size_p might change though,
> but maybe all other ignore_bb_p calls effectively just are that,
> checks for blocks that became optimize_bb_for_size_p).
> 

This untested follow-up patch tries something in that direction.

Is this what you meant?

Thanks,
- Tom
Richard Biener Sept. 24, 2020, 11:42 a.m. UTC | #3
On Wed, 23 Sep 2020, Tom de Vries wrote:

> On 9/23/20 9:28 AM, Richard Biener wrote:
> > On Tue, 22 Sep 2020, Tom de Vries wrote:
> > 
> >> [ was: Re: [Patch] [middle-end & nvptx] gcc/tracer.c: Don't split BB
> >> with SIMT LANE [PR95654] ]
> >>
> >> On 9/16/20 8:20 PM, Alexander Monakov wrote:
> >>>
> >>>
> >>> On Wed, 16 Sep 2020, Tom de Vries wrote:
> >>>
> >>>> [ cc-ing author omp support for nvptx. ]
> >>>
> >>> The issue looks familiar. I recognized it back in 2017 (and LLVM people
> >>> recognized it too for their GPU targets). In an attempt to get agreement
> >>> to fix the issue "properly" for GCC I found a similar issue that affects
> >>> all targets, not just offloading, and filed it as PR 80053.
> >>>
> >>> (yes, there are no addressable labels involved in offloading, but nevertheless
> >>> the nature of the middle-end issue is related)
> >>
> >> Hi Alexander,
> >>
> >> thanks for looking into this.
> >>
> >> Seeing that the attempt to fix things properly is stalled, for now I'm
> >> proposing a point-fix, similar to the original patch proposed by Tobias.
> >>
> >> Richi, Jakub, OK for trunk?
> > 
> > I notice that we call ignore_bb_p many times in tracer.c but one call
> > is conveniently early in tail_duplicate (void):
> > 
> >       int n = count_insns (bb);
> >       if (!ignore_bb_p (bb))
> >         blocks[bb->index] = heap.insert (-bb->count.to_frequency (cfun), 
> > bb);
> > 
> > where count_insns already walks all stmts in the block.  It would be
> > nice to avoid repeatedly walking all stmts, maybe adjusting the above
> > call is enough and/or count_insns can compute this and/or the ignore_bb_p
> > result can be cached (optimize_bb_for_size_p might change though,
> > but maybe all other ignore_bb_p calls effectively just are that,
> > checks for blocks that became optimize_bb_for_size_p).
> > 
> 
> This untested follow-up patch tries something in that direction.
> 
> Is this what you meant?

Yeah, sort of.

+static bool
+cached_can_duplicate_bb_p (const_basic_block bb)
+{
+  if (can_duplicate_bb)

is there any path where can_duplicate_bb would be NULL?

+    {
+      unsigned int size = SBITMAP_SIZE (can_duplicate_bb);
+      /* Assume added bb's should be ignored.  */
+      if ((unsigned int)bb->index < size
+         && bitmap_bit_p (can_duplicate_bb_computed, bb->index))
+       return !bitmap_bit_p (can_duplicate_bb, bb->index);

yes, newly added bbs should be ignored so,

     }
 
-  return false;
+  bool val = compute_can_duplicate_bb_p (bb);
+  if (can_duplicate_bb)
+    cache_can_duplicate_bb_p (bb, val);

no need to compute & cache for them, just return true (because
we did duplicate them)?

Thanks,
Richard.


> Thanks,
> - Tom
>
Tom de Vries Sept. 24, 2020, 11:59 a.m. UTC | #4
On 9/24/20 1:42 PM, Richard Biener wrote:
> On Wed, 23 Sep 2020, Tom de Vries wrote:
> 
>> On 9/23/20 9:28 AM, Richard Biener wrote:
>>> On Tue, 22 Sep 2020, Tom de Vries wrote:
>>>
>>>> [ was: Re: [Patch] [middle-end & nvptx] gcc/tracer.c: Don't split BB
>>>> with SIMT LANE [PR95654] ]
>>>>
>>>> On 9/16/20 8:20 PM, Alexander Monakov wrote:
>>>>>
>>>>>
>>>>> On Wed, 16 Sep 2020, Tom de Vries wrote:
>>>>>
>>>>>> [ cc-ing author omp support for nvptx. ]
>>>>>
>>>>> The issue looks familiar. I recognized it back in 2017 (and LLVM people
>>>>> recognized it too for their GPU targets). In an attempt to get agreement
>>>>> to fix the issue "properly" for GCC I found a similar issue that affects
>>>>> all targets, not just offloading, and filed it as PR 80053.
>>>>>
>>>>> (yes, there are no addressable labels involved in offloading, but nevertheless
>>>>> the nature of the middle-end issue is related)
>>>>
>>>> Hi Alexander,
>>>>
>>>> thanks for looking into this.
>>>>
>>>> Seeing that the attempt to fix things properly is stalled, for now I'm
>>>> proposing a point-fix, similar to the original patch proposed by Tobias.
>>>>
>>>> Richi, Jakub, OK for trunk?
>>>
>>> I notice that we call ignore_bb_p many times in tracer.c but one call
>>> is conveniently early in tail_duplicate (void):
>>>
>>>       int n = count_insns (bb);
>>>       if (!ignore_bb_p (bb))
>>>         blocks[bb->index] = heap.insert (-bb->count.to_frequency (cfun), 
>>> bb);
>>>
>>> where count_insns already walks all stmts in the block.  It would be
>>> nice to avoid repeatedly walking all stmts, maybe adjusting the above
>>> call is enough and/or count_insns can compute this and/or the ignore_bb_p
>>> result can be cached (optimize_bb_for_size_p might change though,
>>> but maybe all other ignore_bb_p calls effectively just are that,
>>> checks for blocks that became optimize_bb_for_size_p).
>>>
>>
>> This untested follow-up patch tries something in that direction.
>>
>> Is this what you meant?
> 
> Yeah, sort of.
> 
> +static bool
> +cached_can_duplicate_bb_p (const_basic_block bb)
> +{
> +  if (can_duplicate_bb)
> 
> is there any path where can_duplicate_bb would be NULL?
> 

Yes, ignore_bb_p is called from gimple-ssa-split-paths.c.

> +    {
> +      unsigned int size = SBITMAP_SIZE (can_duplicate_bb);
> +      /* Assume added bb's should be ignored.  */
> +      if ((unsigned int)bb->index < size
> +         && bitmap_bit_p (can_duplicate_bb_computed, bb->index))
> +       return !bitmap_bit_p (can_duplicate_bb, bb->index);
> 
> yes, newly added bbs should be ignored so,
> 
>      }
>  
> -  return false;
> +  bool val = compute_can_duplicate_bb_p (bb);
> +  if (can_duplicate_bb)
> +    cache_can_duplicate_bb_p (bb, val);
> 
> no need to compute & cache for them, just return true (because
> we did duplicate them)?
> 

Also the case for gimple-ssa-split-paths.c?

Thanks,
- Tom
Richard Biener Sept. 24, 2020, 12:44 p.m. UTC | #5
On Thu, 24 Sep 2020, Tom de Vries wrote:

> On 9/24/20 1:42 PM, Richard Biener wrote:
> > On Wed, 23 Sep 2020, Tom de Vries wrote:
> > 
> >> On 9/23/20 9:28 AM, Richard Biener wrote:
> >>> On Tue, 22 Sep 2020, Tom de Vries wrote:
> >>>
> >>>> [ was: Re: [Patch] [middle-end & nvptx] gcc/tracer.c: Don't split BB
> >>>> with SIMT LANE [PR95654] ]
> >>>>
> >>>> On 9/16/20 8:20 PM, Alexander Monakov wrote:
> >>>>>
> >>>>>
> >>>>> On Wed, 16 Sep 2020, Tom de Vries wrote:
> >>>>>
> >>>>>> [ cc-ing author omp support for nvptx. ]
> >>>>>
> >>>>> The issue looks familiar. I recognized it back in 2017 (and LLVM people
> >>>>> recognized it too for their GPU targets). In an attempt to get agreement
> >>>>> to fix the issue "properly" for GCC I found a similar issue that affects
> >>>>> all targets, not just offloading, and filed it as PR 80053.
> >>>>>
> >>>>> (yes, there are no addressable labels involved in offloading, but nevertheless
> >>>>> the nature of the middle-end issue is related)
> >>>>
> >>>> Hi Alexander,
> >>>>
> >>>> thanks for looking into this.
> >>>>
> >>>> Seeing that the attempt to fix things properly is stalled, for now I'm
> >>>> proposing a point-fix, similar to the original patch proposed by Tobias.
> >>>>
> >>>> Richi, Jakub, OK for trunk?
> >>>
> >>> I notice that we call ignore_bb_p many times in tracer.c but one call
> >>> is conveniently early in tail_duplicate (void):
> >>>
> >>>       int n = count_insns (bb);
> >>>       if (!ignore_bb_p (bb))
> >>>         blocks[bb->index] = heap.insert (-bb->count.to_frequency (cfun), 
> >>> bb);
> >>>
> >>> where count_insns already walks all stmts in the block.  It would be
> >>> nice to avoid repeatedly walking all stmts, maybe adjusting the above
> >>> call is enough and/or count_insns can compute this and/or the ignore_bb_p
> >>> result can be cached (optimize_bb_for_size_p might change though,
> >>> but maybe all other ignore_bb_p calls effectively just are that,
> >>> checks for blocks that became optimize_bb_for_size_p).
> >>>
> >>
> >> This untested follow-up patch tries something in that direction.
> >>
> >> Is this what you meant?
> > 
> > Yeah, sort of.
> > 
> > +static bool
> > +cached_can_duplicate_bb_p (const_basic_block bb)
> > +{
> > +  if (can_duplicate_bb)
> > 
> > is there any path where can_duplicate_bb would be NULL?
> > 
> 
> Yes, ignore_bb_p is called from gimple-ssa-split-paths.c.

Oh, that was probably done because of the very same OMP issue ...

> > +    {
> > +      unsigned int size = SBITMAP_SIZE (can_duplicate_bb);
> > +      /* Assume added bb's should be ignored.  */
> > +      if ((unsigned int)bb->index < size
> > +         && bitmap_bit_p (can_duplicate_bb_computed, bb->index))
> > +       return !bitmap_bit_p (can_duplicate_bb, bb->index);
> > 
> > yes, newly added bbs should be ignored so,
> > 
> >      }
> >  
> > -  return false;
> > +  bool val = compute_can_duplicate_bb_p (bb);
> > +  if (can_duplicate_bb)
> > +    cache_can_duplicate_bb_p (bb, val);
> > 
> > no need to compute & cache for them, just return true (because
> > we did duplicate them)?
> > 
> 
> Also the case for gimple-ssa-split-paths.c?

If it had the bitmap then yes ... since it doesn't, the early
out should be in the conditional above only.

Richard.

> Thanks,
> - Tom
>
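
Spelled out, the cached query this review converges on could look
roughly like the following.  This is a sketch assembled from the draft
fragments quoted above, not the committed patch:
compute_can_duplicate_bb_p and the two sbitmaps are Tom's draft names,
and the exact control flow is inferred from Richard's remarks:

/* Cache of can-duplicate results; allocated while tracer runs,
   NULL for other callers such as gimple-ssa-split-paths.c.  */
static sbitmap can_duplicate_bb;
static sbitmap can_duplicate_bb_computed;

static bool
cached_can_duplicate_bb_p (const_basic_block bb)
{
  if (can_duplicate_bb)
    {
      unsigned int size = SBITMAP_SIZE (can_duplicate_bb);

      /* A bb with an index past the bitmap size was added after the
	 bitmaps were allocated, i.e. it was created by duplication,
	 so duplicating it again is fine; this early out lives in
	 this conditional only.  */
      if ((unsigned int) bb->index >= size)
	return true;

      if (bitmap_bit_p (can_duplicate_bb_computed, bb->index))
	return bitmap_bit_p (can_duplicate_bb, bb->index);
    }

  bool val = compute_can_duplicate_bb_p (bb);
  if (can_duplicate_bb)
    {
      bitmap_set_bit (can_duplicate_bb_computed, bb->index);
      if (val)
	bitmap_set_bit (can_duplicate_bb, bb->index);
    }
  return val;
}
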
Tom de Vries Oct. 5, 2020, 7:01 a.m. UTC | #6
On 9/22/20 6:38 PM, Tom de Vries wrote:
> [ was: Re: [Patch] [middle-end & nvptx] gcc/tracer.c: Don't split BB
> with SIMT LANE [PR95654] ]
> 
> On 9/16/20 8:20 PM, Alexander Monakov wrote:
>>
>>
>> On Wed, 16 Sep 2020, Tom de Vries wrote:
>>
>>> [ cc-ing author omp support for nvptx. ]
>>
>> The issue looks familiar. I recognized it back in 2017 (and LLVM people
>> recognized it too for their GPU targets). In an attempt to get agreement
>> to fix the issue "properly" for GCC I found a similar issue that affects
>> all targets, not just offloading, and filed it as PR 80053.
>>
>> (yes, there are no addressable labels involved in offloading, but nevertheless
>> the nature of the middle-end issue is related)
> 
> Hi Alexander,
> 
> thanks for looking into this.
> 
> Seeing that the attempt to fix things properly is stalled, for now I'm
> proposing a point-fix, similar to the original patch proposed by Tobias.
> 
> Richi, Jakub, OK for trunk?
> 

I've had to modify this patch in two ways:
- the original test-case stopped failing, though the minimized one
  did not, so I added the minimized one as a test-case
- testing only for ENTER_ALLOC and EXIT in ignore_bb_p, without
  explicitly testing for VOTE_ANY, also stopped working, so I've now
  added an explicit VOTE_ANY check.

Re-tested and committed.

Thanks,
- Tom
Tom de Vries Oct. 5, 2020, 7:05 a.m. UTC | #7
On 9/24/20 2:44 PM, Richard Biener wrote:
> On Thu, 24 Sep 2020, Tom de Vries wrote:
> 
>> On 9/24/20 1:42 PM, Richard Biener wrote:
>>> On Wed, 23 Sep 2020, Tom de Vries wrote:
>>>
>>>> On 9/23/20 9:28 AM, Richard Biener wrote:
>>>>> On Tue, 22 Sep 2020, Tom de Vries wrote:
>>>>>
>>>>>> [ was: Re: [Patch] [middle-end & nvptx] gcc/tracer.c: Don't split BB
>>>>>> with SIMT LANE [PR95654] ]
>>>>>>
>>>>>> On 9/16/20 8:20 PM, Alexander Monakov wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 16 Sep 2020, Tom de Vries wrote:
>>>>>>>
>>>>>>>> [ cc-ing author omp support for nvptx. ]
>>>>>>>
>>>>>>> The issue looks familiar. I recognized it back in 2017 (and LLVM people
>>>>>>> recognized it too for their GPU targets). In an attempt to get agreement
>>>>>>> to fix the issue "properly" for GCC I found a similar issue that affects
>>>>>>> all targets, not just offloading, and filed it as PR 80053.
>>>>>>>
>>>>>>> (yes, there are no addressable labels involved in offloading, but nevertheless
>>>>>>> the nature of the middle-end issue is related)
>>>>>>
>>>>>> Hi Alexander,
>>>>>>
>>>>>> thanks for looking into this.
>>>>>>
>>>>>> Seeing that the attempt to fix things properly is stalled, for now I'm
>>>>>> proposing a point-fix, similar to the original patch proposed by Tobias.
>>>>>>
>>>>>> Richi, Jakub, OK for trunk?
>>>>>
>>>>> I notice that we call ignore_bb_p many times in tracer.c but one call
>>>>> is conveniently early in tail_duplicate (void):
>>>>>
>>>>>       int n = count_insns (bb);
>>>>>       if (!ignore_bb_p (bb))
>>>>>         blocks[bb->index] = heap.insert (-bb->count.to_frequency (cfun), 
>>>>> bb);
>>>>>
>>>>> where count_insns already walks all stmts in the block.  It would be
>>>>> nice to avoid repeatedly walking all stmts, maybe adjusting the above
>>>>> call is enough and/or count_insns can compute this and/or the ignore_bb_p
>>>>> result can be cached (optimize_bb_for_size_p might change though,
>>>>> but maybe all other ignore_bb_p calls effectively just are that,
>>>>> checks for blocks that became optimize_bb_for_size_p).
>>>>>
>>>>
>>>> This untested follow-up patch tries something in that direction.
>>>>
>>>> Is this what you meant?
>>>
>>> Yeah, sort of.
>>>
>>> +static bool
>>> +cached_can_duplicate_bb_p (const_basic_block bb)
>>> +{
>>> +  if (can_duplicate_bb)
>>>
>>> is there any path where can_duplicate_bb would be NULL?
>>>
>>
>> Yes, ignore_bb_p is called from gimple-ssa-split-paths.c.
> 
> Oh, that was probably done because of the very same OMP issue ...
> 
>>> +    {
>>> +      unsigned int size = SBITMAP_SIZE (can_duplicate_bb);
>>> +      /* Assume added bb's should be ignored.  */
>>> +      if ((unsigned int)bb->index < size
>>> +         && bitmap_bit_p (can_duplicate_bb_computed, bb->index))
>>> +       return !bitmap_bit_p (can_duplicate_bb, bb->index);
>>>
>>> yes, newly added bbs should be ignored so,
>>>
>>>      }
>>>  
>>> -  return false;
>>> +  bool val = compute_can_duplicate_bb_p (bb);
>>> +  if (can_duplicate_bb)
>>> +    cache_can_duplicate_bb_p (bb, val);
>>>
>>> no need to compute & cache for them, just return true (because
>>> we did duplicate them)?
>>>
>>
> >> Also the case for gimple-ssa-split-paths.c?
> 
> If it had the bitmap then yes ... since it doesn't, the early
> out should be in the conditional above only.
> 

Ack, updated the patch accordingly, and split it up in two bits, one
that does refactoring, and one that adds the actual caching:
- [ftracer] Factor out can_duplicate_bb_p
- [ftracer] Add caching of can_duplicate_bb_p

I'll post these in reply to this email.

Thanks,
- Tom
Alexander Monakov Oct. 5, 2020, 8:51 a.m. UTC | #8
On Mon, 5 Oct 2020, Tom de Vries wrote:

> I've had to modify this patch in two ways:
> - the original test-case stopped failing, though the minimized one
>   did not, so I added the minimized one as a test-case
> - testing only for ENTER_ALLOC and EXIT in ignore_bb_p, without
>   explicitly testing for VOTE_ANY, also stopped working, so I've now
>   added an explicit VOTE_ANY check.
> 
> Re-tested and committed.

I don't understand; was the patch already approved somewhere? It has
some issues.


> --- a/gcc/tracer.c
> +++ b/gcc/tracer.c
> @@ -108,6 +108,24 @@ ignore_bb_p (const_basic_block bb)
>  	return true;
>      }
>  
> +  for (gimple_stmt_iterator gsi = gsi_start_bb (CONST_CAST_BB (bb));
> +       !gsi_end_p (gsi); gsi_next (&gsi))
> +    {
> +      gimple *g = gsi_stmt (gsi);
> +
> +      /* An IFN_GOMP_SIMT_ENTER_ALLOC/IFN_GOMP_SIMT_EXIT call must be
> +	 duplicated as part of its group, or not at all.

What does "its group" stand for? It seems obviously copy-pasted from the
description of IFN_UNIQUE treatment, where it is even less clear what the
"group" is.

(I know what it means, but the comment is not explaining things well at all)

> +	 The IFN_GOMP_SIMT_VOTE_ANY is currently part of such a group,
> +	 so the same holds there, but it could be argued that the
> +	 IFN_GOMP_SIMT_VOTE_ANY could be generated after that group,
> +	 in which case it could be duplicated.  */

No, something like that cannot be argued, as VOTE_ANY may have data
dependencies on storage that is deallocated by SIMT_EXIT. You seem to be
claiming something that is simply not possible with the current design.

> +      if (is_gimple_call (g)
> +	  && (gimple_call_internal_p (g, IFN_GOMP_SIMT_ENTER_ALLOC)
> +	      || gimple_call_internal_p (g, IFN_GOMP_SIMT_EXIT)
> +	      || gimple_call_internal_p (g, IFN_GOMP_SIMT_VOTE_ANY)))


Hm? So you are leaving SIMT_XCHG_* be until the next testcase breaks?

> +	return true;
> +    }
> +
>    return false;
>  }

Alexander
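
The SIMT_XCHG_* remark suggests checking for the whole family of SIMT
internal functions rather than enumerating a few of them.  A sketch of
such a predicate follows; simt_sensitive_call_p is a hypothetical name,
and which functions belong in the set is exactly what is being debated
here:

/* Return true if G is a call to one of the SIMT internal functions,
   i.e. a stmt that is only safe inside an intact SIMT region.  */

static bool
simt_sensitive_call_p (const gimple *g)
{
  if (!is_gimple_call (g) || !gimple_call_internal_p (g))
    return false;

  switch (gimple_call_internal_fn (g))
    {
    case IFN_GOMP_SIMT_ENTER:
    case IFN_GOMP_SIMT_ENTER_ALLOC:
    case IFN_GOMP_SIMT_EXIT:
    case IFN_GOMP_SIMT_VOTE_ANY:
    case IFN_GOMP_SIMT_XCHG_BFLY:
    case IFN_GOMP_SIMT_XCHG_IDX:
    case IFN_GOMP_SIMT_LAST_LANE:
    case IFN_GOMP_SIMT_ORDERED_PRED:
      return true;
    default:
      return false;
    }
}
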

Patch

[omp, ftracer] Don't duplicate blocks in SIMT region

When running the libgomp testsuite on x86_64-linux with nvptx accelerator,
we run into:
...
FAIL: libgomp.fortran/pr66199-5.f90   -O3 -fomit-frame-pointer -funroll-loops \
  -fpeel-loops -ftracer -finline-functions  execution test
...

The problem is that ftracer duplicates a block containing GOMP_SIMT_VOTE_ANY.

That is, before ftracer we have (dropping the GOMP_SIMT_ prefix):
...
bb4(ENTER_ALLOC)
*----------+
|           \
|            \
|             v
|             *
v             bb8
*<------------*
bb5(VOTE_ANY)
*-------------+
|             |
|             |
|             |
|             |
|             v
|             *
v             bb7(XCHG_IDX)
*<------------*
bb6(EXIT)
...

The XCHG_IDX internal-fn does inter-SIMT-lane communication, which for nvptx
maps onto shfl, an operator which requires that the warp executing it is
convergent.  The warp diverges at bb4 and reconverges at bb5; it does not
diverge again by going to bb7, so the shfl is indeed executed by a convergent
warp.

After ftracer, we have:
...
bb4(ENTER_ALLOC)
*----------+
|           \
|            \
|             \
|              \
v               v
*               *
bb5(VOTE_ANY)   bb8(VOTE_ANY)
*               *
|\             /|
| \  +--------+ |
|  \/           |
|  /\           |
| /  +----------v
|/              *
v               bb7(XCHG_IDX)
*<--------------*
bb6(EXIT)
...

The warp diverges again at bb5, but does not reconverge before bb6, so the
shfl is executed by a divergent warp, which causes the FAIL.

Fix this by making ftracer ignore blocks containing ENTER_ALLOC and EXIT,
effectively treating the SIMT region conservatively.

One could argue that the EXIT and VOTE_ANY can be generated by omp-low in
reverse order, in which case the VOTE_ANY could be duplicated.  This is the
reason VOTE_ANY is not explicitly listed as ignored in this patch.

An argument can also be made that the test needs to be added in a more
generic place, like gimple_can_duplicate_bb_p or some such, and that ftracer
then needs to use the generic test.  But that's a discussion with a much
broader scope, so I'm leaving that for another patch.
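
As a rough illustration of that alternative (a sketch only, not part of
this patch; whether the hook is the right place is precisely the open
question), the check could sit in tree-cfg.c's duplication hook, so that
every pass going through can_duplicate_block_p inherits it:

/* CFG hook: return true if BB can be duplicated.  Refusing here
   would cover tracer, gimple-ssa-split-paths and any other user
   of can_duplicate_block_p at once.  */

static bool
gimple_can_duplicate_bb_p (basic_block bb)
{
  for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
       !gsi_end_p (gsi); gsi_next (&gsi))
    {
      gimple *g = gsi_stmt (gsi);
      if (is_gimple_call (g)
	  && (gimple_call_internal_p (g, IFN_GOMP_SIMT_ENTER_ALLOC)
	      || gimple_call_internal_p (g, IFN_GOMP_SIMT_EXIT)))
	return false;
    }
  return true;
}
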

Built on x86_64-linux with nvptx accelerator, tested with libgomp.

gcc/ChangeLog:

	PR fortran/95654
	* tracer.c (ignore_bb_p): Ignore GOMP_SIMT_ENTER_ALLOC
	and GOMP_SIMT_EXIT.

---
 gcc/tracer.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/gcc/tracer.c b/gcc/tracer.c
index 82ede722534..de80416f163 100644
--- a/gcc/tracer.c
+++ b/gcc/tracer.c
@@ -108,6 +108,16 @@  ignore_bb_p (const_basic_block bb)
 	return true;
     }
 
+  for (gimple_stmt_iterator gsi = gsi_start_bb (CONST_CAST_BB (bb));
+       !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *g = gsi_stmt (gsi);
+      if (is_gimple_call (g)
+	  && (gimple_call_internal_p (g, IFN_GOMP_SIMT_ENTER_ALLOC)
+	      || gimple_call_internal_p (g, IFN_GOMP_SIMT_EXIT)))
+	return true;
+    }
+
   return false;
 }