nvptx: Cache stacks block for OpenMP kernel launch

Message ID 20201026141448.109041-1-julian@codesourcery.com
State New
Series nvptx: Cache stacks block for OpenMP kernel launch

Commit Message

Julian Brown Oct. 26, 2020, 2:14 p.m. UTC
Hi,

This patch adds caching for the stack block allocated for offloaded
OpenMP kernel launches on NVPTX. This is a performance optimisation --
we observed an average 11% or so performance improvement with this patch
across a set of accelerated GPU benchmarks on one machine (results vary
according to individual benchmark and with hardware used).

A given kernel launch will reuse the stack block from the previous launch
if it is large enough, else it is freed and reallocated. A slight caveat
is that memory will not be freed until the device is closed, so e.g. if
code is using highly variable launch geometries and large amounts of
GPU RAM, you might run out of resources slightly quicker with this patch.

Another way this patch gains performance is by omitting the
synchronisation at the end of an OpenMP offload kernel launch -- it's
safe for the GPU and CPU to continue executing in parallel at that point,
because e.g. copies-back from the device will be synchronised properly
with kernel completion anyway.

In turn, the last part necessitates a change to the way "(perhaps abort
was called)" errors are detected and reported.

Tested with offloading to NVPTX. OK for mainline?

Thanks,

Julian

2020-10-26  Julian Brown  <julian@codesourcery.com>

libgomp/
	* plugin/plugin-nvptx.c (maybe_abort_message): Add function.
	(CUDA_CALL_ERET, CUDA_CALL_ASSERT): Use above function.
	(struct ptx_device): Add omp_stacks struct.
	(nvptx_open_device): Initialise cached-stacks housekeeping info.
	(nvptx_close_device): Free cached stacks block and mutex.
	(nvptx_stacks_alloc): Rename to...
	(nvptx_stacks_acquire): This.  Cache stacks block between runs if same
	size or smaller is required.
	(nvptx_stacks_free): Rename to...
	(nvptx_stacks_release): This.  Do not free stacks block, but release
	mutex.
	(GOMP_OFFLOAD_run): Adjust for changes to above functions, and remove
	special-case "abort" error handling and synchronisation after kernel
	launch.
---
 libgomp/plugin/plugin-nvptx.c | 91 ++++++++++++++++++++++++++---------
 1 file changed, 68 insertions(+), 23 deletions(-)

Comments

Jakub Jelinek Oct. 26, 2020, 2:26 p.m. UTC | #1
On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> This patch adds caching for the stack block allocated for offloaded
> OpenMP kernel launches on NVPTX. This is a performance optimisation --
> we observed an average 11% or so performance improvement with this patch
> across a set of accelerated GPU benchmarks on one machine (results vary
> according to individual benchmark and with hardware used).
> 
> A given kernel launch will reuse the stack block from the previous launch
> if it is large enough, else it is freed and reallocated. A slight caveat
> is that memory will not be freed until the device is closed, so e.g. if
> code is using highly variable launch geometries and large amounts of
> GPU RAM, you might run out of resources slightly quicker with this patch.
> 
> Another way this patch gains performance is by omitting the
> synchronisation at the end of an OpenMP offload kernel launch -- it's
> safe for the GPU and CPU to continue executing in parallel at that point,
> because e.g. copies-back from the device will be synchronised properly
> with kernel completion anyway.
> 
> In turn, the last part necessitates a change to the way "(perhaps abort
> was called)" errors are detected and reported.
> 
> Tested with offloading to NVPTX. OK for mainline?

I'm afraid I don't know the plugin or CUDA well enough to review this
properly (therefore I'd like to hear from Thomas, Tom and/or Alexander).
Anyway, just two questions: wouldn't it make sense to add some upper bound
limit over which it wouldn't cache the stacks, so that stacks would be
cached most of the time for normal programs, but one really excessive
kernel followed by many normal ones wouldn't result in memory allocation
failures?

And, in which context are cuStreamAddCallback-registered callbacks run?
E.g. if it is inside an asynchronous interrupt, using locking in there might
not be the best thing to do.

> -  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
> -  if (r == CUDA_ERROR_LAUNCH_FAILED)
> -    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
> -		       maybe_abort_msg);
> -  else if (r != CUDA_SUCCESS)
> -    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> -  nvptx_stacks_free (stacks, teams * threads);
> +  CUDA_CALL_ASSERT (cuStreamAddCallback, NULL, nvptx_stacks_release,
> +		    (void *) ptx_dev, 0);
>  }
>  
>  /* TODO: Implement GOMP_OFFLOAD_async_run. */
> -- 
> 2.28.0

	Jakub
Julian Brown Oct. 27, 2020, 1:17 p.m. UTC | #2
(Apologies if threading is broken; for some reason I didn't receive
this reply directly!)

On Mon Oct 26 14:26:34 GMT 2020, Jakub Jelinek wrote:
> On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > This patch adds caching for the stack block allocated for offloaded
> > OpenMP kernel launches on NVPTX. This is a performance optimisation
> > -- we observed an average 11% or so performance improvement with
> > this patch across a set of accelerated GPU benchmarks on one
> > machine (results vary according to individual benchmark and with
> > hardware used).
> > 
> > A given kernel launch will reuse the stack block from the previous
> > launch if it is large enough, else it is freed and reallocated. A
> > slight caveat is that memory will not be freed until the device is
> > closed, so e.g. if code is using highly variable launch geometries
> > and large amounts of GPU RAM, you might run out of resources
> > slightly quicker with this patch.
> > 
> > Another way this patch gains performance is by omitting the
> > synchronisation at the end of an OpenMP offload kernel launch --
> > it's safe for the GPU and CPU to continue executing in parallel at
> > that point, because e.g. copies-back from the device will be
> > synchronised properly with kernel completion anyway.
> > 
> > In turn, the last part necessitates a change to the way "(perhaps
> > abort was called)" errors are detected and reported.
> > 
> > Tested with offloading to NVPTX. OK for mainline?
> 
> I'm afraid I don't know the plugin or CUDA well enough to review this
> properly (therefore I'd like to hear from Thomas, Tom and/or
> Alexander). Anyway, just two questions: wouldn't it make sense to add
> some upper bound limit over which it wouldn't cache the stacks, so
> that stacks would be cached most of the time for normal programs, but
> one really excessive kernel followed by many normal ones wouldn't
> result in memory allocation failures?

Yes, that might work -- another idea is to free the stacks and retry
if a memory allocation fails, though that might perhaps lead to worse
fragmentation. For the upper-bound idea we'd need to pick a
sensible maximum limit. Something like 16MB, maybe? Or
user-controllable, or some fraction of the GPU's total memory?

> And, in which context are cuStreamAddCallback-registered callbacks
> run? E.g. if it is inside an asynchronous interrupt, using locking in
> there might not be the best thing to do.

The cuStreamAddCallback API is documented here:

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483

We're quite limited in what we can do in the callback function since
"Callbacks must not make any CUDA API calls". So what *can* a callback
function do? It is mentioned that the callback function's execution will
"pause" the stream it is logically running on. So can we get deadlock,
e.g. if multiple host threads are launching offload kernels
simultaneously? I don't think so, but I don't know how to prove it!
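
For reference, a minimal sketch of the callback shape involved -- this is
my illustration, not the patch code itself, and it assumes the default
(NULL) stream and a pthread mutex like the patch uses:

#include <cuda.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t stacks_lock = PTHREAD_MUTEX_INITIALIZER;

/* Runs once all preceding work on the stream has completed.  Per the
   CUDA documentation it must not make any CUDA API calls itself, but
   host-side operations such as unlocking a mutex are allowed.  */
static void CUDA_CB
release_cb (CUstream stream, CUresult status, void *user_data)
{
  if (status != CUDA_SUCCESS)
    fprintf (stderr, "stream error: %d\n", (int) status);
  pthread_mutex_unlock (&stacks_lock);
}

/* After cuLaunchKernel on the default (NULL) stream:
     cuStreamAddCallback (NULL, release_cb, NULL, 0);  */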

Thanks,

Julian
Chung-Lin Tang Oct. 28, 2020, 7:25 a.m. UTC | #3
On 2020/10/27 9:17 PM, Julian Brown wrote:
>> And, in which context are cuStreamAddCallback-registered callbacks
>> run? E.g. if it is inside an asynchronous interrupt, using locking in
>> there might not be the best thing to do.
> The cuStreamAddCallback API is documented here:
> 
> https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
> 
> We're quite limited in what we can do in the callback function since
> "Callbacks must not make any CUDA API calls". So what*can*  a callback
> function do? It is mentioned that the callback function's execution will
> "pause" the stream it is logically running on. So can we get deadlock,
> e.g. if multiple host threads are launching offload kernels
> simultaneously? I don't think so, but I don't know how to prove it!

I think it's not deadlock that's the problem here, but that the lock
acquisition in nvptx_stacks_acquire will effectively serialize GPU kernel
execution to just one host thread (since you're holding it until kernel
completion).
Also, in that case, why do you need to use a CUDA callback? You can just
call the unlock directly afterwards.

I think a better way is to use a list of stack blocks in ptx_dev, and quickly
retrieve/unlock it in nvptx_stacks_acquire, as we did in GOMP_OFFLOAD_alloc
for general device memory allocation.

Chung-Lin
Julian Brown Oct. 28, 2020, 11:32 a.m. UTC | #4
On Wed, 28 Oct 2020 15:25:56 +0800
Chung-Lin Tang <cltang@codesourcery.com> wrote:

> On 2020/10/27 9:17 PM, Julian Brown wrote:
> >> And, in which context are cuStreamAddCallback-registered callbacks
> >> run? E.g. if it is inside an asynchronous interrupt, using locking
> >> in there might not be the best thing to do.
> > The cuStreamAddCallback API is documented here:
> > 
> > https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
> > 
> > We're quite limited in what we can do in the callback function since
> > "Callbacks must not make any CUDA API calls". So what*can*  a
> > callback function do? It is mentioned that the callback function's
> > execution will "pause" the stream it is logically running on. So
> > can we get deadlock, e.g. if multiple host threads are launching
> > offload kernels simultaneously? I don't think so, but I don't know
> > how to prove it!  
> 
> I think it's not deadlock that's the problem here, but that the lock
> acquisition in nvptx_stacks_acquire will effectively serialize GPU
> kernel execution to just one host thread (since you're holding it
> until kernel completion). Also, in that case, why do you need to use
> a CUDA callback? You can just call the unlock directly afterwards.

IIUC, there's a single GPU queue used for synchronous launches no
matter which host thread initiates the operation, and kernel execution
is serialised anyway, so that shouldn't be a problem. The only way to
get different kernels executing simultaneously is to use different CUDA
streams -- but I think that's still TBD for OpenMP ("TODO: Implement
GOMP_OFFLOAD_async_run").

> I think a better way is to use a list of stack blocks in ptx_dev, and
> quickly retrieve/unlock it in nvptx_stacks_acquire, as we did in
> GOMP_OFFLOAD_alloc for general device memory allocation.

If it weren't for the serialisation, we could also keep a stack cache
per-host-thread in nvptx_thread. But as it is, I don't think we need the
extra complication. When we do OpenMP async support, maybe a stack
cache can be put per-stream in goacc_asyncqueue or the OpenMP
equivalent.

Thanks,

Julian
Alexander Monakov Nov. 9, 2020, 9:32 p.m. UTC | #5
On Mon, 26 Oct 2020, Jakub Jelinek wrote:

> On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > This patch adds caching for the stack block allocated for offloaded
> > OpenMP kernel launches on NVPTX. This is a performance optimisation --
> > we observed an average 11% or so performance improvement with this patch
> > across a set of accelerated GPU benchmarks on one machine (results vary
> > according to individual benchmark and with hardware used).

In this patch you're folding two changes together: reuse of allocated stacks
and removing one host-device synchronization.  Why is that?  Can you report
performance change separately for each change (and split out the patches)?

> > A given kernel launch will reuse the stack block from the previous launch
> > if it is large enough, else it is freed and reallocated. A slight caveat
> > is that memory will not be freed until the device is closed, so e.g. if
> > code is using highly variable launch geometries and large amounts of
> > GPU RAM, you might run out of resources slightly quicker with this patch.
> > 
> > Another way this patch gains performance is by omitting the
> > synchronisation at the end of an OpenMP offload kernel launch -- it's
> > safe for the GPU and CPU to continue executing in parallel at that point,
> > because e.g. copies-back from the device will be synchronised properly
> > with kernel completion anyway.

I don't think this explanation is sufficient. My understanding is that OpenMP
forbids the host to proceed asynchronously after the target construct unless
it is a 'target nowait' construct. This may be observable if there's a printf
in the target region for example (or if it accesses memory via host pointers).

So this really needs to be a separate patch with more explanation why this is
okay (if it is okay).

> > In turn, the last part necessitates a change to the way "(perhaps abort
> > was called)" errors are detected and reported.

As already mentioned, using callbacks is problematic. Plus, I'm sure the way
you lock out other threads is a performance loss when multiple threads have
target regions: even though they will not run concurrently on the GPU, you
still want to allow host threads to submit GPU jobs while the GPU is occupied.

I would suggest having a small pool (up to 3 entries, perhaps) of stacks. Then
you can arrange reuse without totally serializing host threads on target
regions.
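
For illustration, such a pool might be shaped roughly like this -- the pool
size, names and eviction details here are mine, just a sketch of the
suggestion rather than proposed code:

#include <cuda.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define STACKS_POOL_SIZE 3

struct stacks_entry
{
  CUdeviceptr ptr;   /* Device address of the block (0 if none).  */
  size_t size;       /* Size of the block in bytes.  */
  bool in_use;       /* Currently lent out to a host thread?  */
};

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static struct stacks_entry pool[STACKS_POOL_SIZE];

/* Grab a free pool entry, growing it if it is too small.  The lock is
   only held for the bookkeeping, not across kernel execution.  */
static CUdeviceptr
pool_acquire (size_t size, int *slot)
{
  pthread_mutex_lock (&pool_lock);
  for (int i = 0; i < STACKS_POOL_SIZE; i++)
    if (!pool[i].in_use)
      {
        if (pool[i].size < size)
          {
            if (pool[i].ptr)
              cuMemFree (pool[i].ptr);
            pool[i].ptr = 0;
            pool[i].size = 0;
            if (cuMemAlloc (&pool[i].ptr, size) != CUDA_SUCCESS)
              break;
            pool[i].size = size;
          }
        pool[i].in_use = true;
        *slot = i;
        pthread_mutex_unlock (&pool_lock);
        return pool[i].ptr;
      }
  pthread_mutex_unlock (&pool_lock);
  *slot = -1;  /* Pool exhausted: caller allocates a one-off block.  */
  return 0;
}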

Alexander
Julian Brown Nov. 13, 2020, 8:54 p.m. UTC | #6
Hi Alexander,

Thanks for the review! Comments below.

On Tue, 10 Nov 2020 00:32:36 +0300
Alexander Monakov <amonakov@ispras.ru> wrote:

> On Mon, 26 Oct 2020, Jakub Jelinek wrote:
> 
> > On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:  
> > > This patch adds caching for the stack block allocated for
> > > offloaded OpenMP kernel launches on NVPTX. This is a performance
> > > optimisation -- we observed an average 11% or so performance
> > > improvement with this patch across a set of accelerated GPU
> > > benchmarks on one machine (results vary according to individual
> > > benchmark and with hardware used).  
> 
> In this patch you're folding two changes together: reuse of allocated
> stacks and removing one host-device synchronization.  Why is that?
> Can you report performance change separately for each change (and
> split out the patches)?

An accident of the development process of the patch, really -- the idea
for removing the post-kernel-launch synchronisation came from the
OpenACC side, and adapting it to OpenMP meant the stacks had to remain
allocated after the return of the GOMP_OFFLOAD_run function.

> > > A given kernel launch will reuse the stack block from the
> > > previous launch if it is large enough, else it is freed and
> > > reallocated. A slight caveat is that memory will not be freed
> > > until the device is closed, so e.g. if code is using highly
> > > variable launch geometries and large amounts of GPU RAM, you
> > > might run out of resources slightly quicker with this patch.
> > > 
> > > Another way this patch gains performance is by omitting the
> > > synchronisation at the end of an OpenMP offload kernel launch --
> > > it's safe for the GPU and CPU to continue executing in parallel
> > > at that point, because e.g. copies-back from the device will be
> > > synchronised properly with kernel completion anyway.  
> 
> I don't think this explanation is sufficient. My understanding is
> that OpenMP forbids the host to proceed asynchronously after the
> target construct unless it is a 'target nowait' construct. This may
> be observable if there's a printf in the target region for example
> (or if it accesses memory via host pointers).
> 
> So this really needs to be a separate patch with more explanation why
> this is okay (if it is okay).

As long as the offload kernel only touches GPU memory and does not have
any CPU-visible side effects (like the printf you mentioned -- I hadn't
really considered that, oops!), it's probably OK.

But anyway, the benefit obtained on OpenMP code (the same set of
benchmarks as before) from omitting the synchronisation at the end of
GOMP_OFFLOAD_run seems minimal. So it's good enough to just do the
stacks caching, and miss out the synchronisation removal for now. (It
might still be something worth considering later, perhaps, as long as
we can show some given kernel doesn't use printf or access memory via
host pointers -- I guess the former might be easier than the latter. I
have observed the equivalent OpenACC patch provide a significant boost
on some benchmarks, so there's probably something that could be gained
on the OpenMP side too.)

The benefit with the attached patch -- just stacks caching, no
synchronisation removal -- is about 12% on the same set of benchmarks
as before. Results are a little noisy on the machine I'm benchmarking
on, so this isn't necessarily proof that the synchronisation removal is
harmful for performance!

> > > In turn, the last part necessitates a change to the way "(perhaps
> > > abort was called)" errors are detected and reported.  
> 
> As already mentioned, using callbacks is problematic. Plus, I'm sure
> the way you lock out other threads is a performance loss when
> multiple threads have target regions: even though they will not run
> concurrently on the GPU, you still want to allow host threads to
> submit GPU jobs while the GPU is occupied.
> 
> I would suggest having a small pool (up to 3 entries, perhaps) of
> stacks. Then you can arrange reuse without totally serializing host
> threads on target regions.

I'm really wary of the additional complexity of adding a stack pool,
and the memory allocation/freeing code paths in CUDA appear to be so
slow that we get a benefit with this patch even when the GPU stream has
to wait for the CPU to unlock the stacks block. Also, for large GPU
launches, the size of the soft-stacks block isn't really trivial (I've
seen something like 50MB on the hardware I'm using, with default
options), and multiplying that by 3 could start to eat into the GPU
heap memory for "useful data" quite significantly.

Consider the attached (probably not amazingly-written) microbenchmark;
a sketch of its general shape follows the timing numbers below. It
spawns 8 threads which each launch lots of OpenMP kernels performing
some trivial work, then joins the threads and checks the results. As a
baseline, with the "FEWER_KERNELS" parameters set (256 kernel launches
over 8 threads), this gives us over 5 runs:

real    3m55.375s
user    7m14.192s
sys     0m30.148s

real    3m54.487s
user    7m6.775s
sys     0m34.678s

real    3m54.633s
user    7m20.381s
sys     0m30.620s

real    3m54.992s
user    7m12.464s
sys     0m29.610s

real    3m55.471s
user    7m14.342s
sys     0m29.815s

With a version of the attached patch, we instead get:

real    3m53.404s
user    3m39.869s
sys     0m16.149s

real    3m54.713s
user    3m41.018s
sys     0m16.129s

real    3m55.242s
user    3m55.148s
sys     0m17.130s

real    3m55.374s
user    3m40.411s
sys     0m15.818s

real    3m55.189s
user    3m40.144s
sys     0m15.846s

That is: real time is about the same, but user/sys time are reduced.

Without FEWER_KERNELS (1048576 kernel launches over 8 threads), the
baseline is:

real    12m29.975s
user    24m2.244s
sys     8m8.153s

real    12m15.391s
user    23m51.018s
sys     8m0.809s

real    12m5.424s
user    23m38.585s
sys     7m47.714s

real    12m10.456s
user    23m51.691s
sys     7m54.324s

real    12m37.735s
user    24m19.671s
sys     8m15.752s

And with the patch, we get:

real    4m42.600s
user    16m14.593s
sys     0m40.444s

real    4m43.579s
user    15m33.805s
sys     0m38.537s

real    4m42.211s
user    16m32.926s
sys     0m40.271s

real    4m44.256s
user    15m49.290s
sys     0m39.116s

real    4m42.013s
user    15m39.447s
sys     0m38.517s

Real, user and sys time are all dramatically less. So I'd suggest that
the attached patch is an improvement over the status quo, even if we
could experiment with the stacks pool idea as a further improvement
later on.
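
For reference, a rough sketch of the general shape of the microbenchmark
described above -- the actual attachment isn't reproduced here, so the
names, sizes and launch counts are illustrative (compile with e.g.
gcc -fopenmp -foffload=nvptx-none):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N_THREADS 8
#define LAUNCHES_PER_THREAD 32  /* The real benchmark used far more.  */
#define N 1024

static void *
thread_fn (void *arg)
{
  int tid = (int) (intptr_t) arg;
  int *data = calloc (N, sizeof (int));

  for (int launch = 0; launch < LAUNCHES_PER_THREAD; launch++)
    {
      /* One synchronous kernel launch per iteration, exercising the
         soft-stacks allocation path in the nvptx plugin.  */
      #pragma omp target map(tofrom: data[0:N])
      for (int i = 0; i < N; i++)
        data[i]++;
    }

  /* Check the results on the host.  */
  for (int i = 0; i < N; i++)
    if (data[i] != LAUNCHES_PER_THREAD)
      {
        fprintf (stderr, "thread %d: wrong result\n", tid);
        exit (1);
      }
  free (data);
  return NULL;
}

int
main (void)
{
  pthread_t threads[N_THREADS];
  for (int i = 0; i < N_THREADS; i++)
    pthread_create (&threads[i], NULL, thread_fn, (void *) (intptr_t) i);
  for (int i = 0; i < N_THREADS; i++)
    pthread_join (threads[i], NULL);
  printf ("all OK\n");
  return 0;
}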

The attached patch also implements a size limit for retention of the
soft-stack block -- freeing it before allocating more memory, rather
than at the start of a kernel launch, so bigger blocks can still be
shared between kernel launches if there's no memory allocation between
them. It also tries freeing smaller cached soft-stack blocks and
retrying memory allocation in out-of-memory situations.
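
In outline, the retention policy described here behaves like the following
toy model, with malloc/free standing in for cuMemAlloc/cuMemFree; the macro
name matches the ChangeLog below, but the 16MB value (from the earlier
discussion) and the other names are illustrative:

#include <stdlib.h>

#define SOFTSTACK_CACHE_LIMIT (16 * 1024 * 1024)  /* Illustrative.  */

static void *cached_ptr;
static size_t cached_size;

/* At kernel launch: reuse the cached block if it is big enough,
   else replace it with a bigger one.  */
static void *
stacks_acquire (size_t size)
{
  if (cached_ptr && cached_size >= size)
    return cached_ptr;
  free (cached_ptr);
  cached_ptr = malloc (size);
  cached_size = cached_ptr ? size : 0;
  return cached_ptr;
}

/* At an ordinary device allocation: first drop an over-limit cached
   block, and on failure free any cached block and retry once.  */
static void *
device_alloc (size_t size)
{
  if (cached_size > SOFTSTACK_CACHE_LIMIT)
    {
      free (cached_ptr);
      cached_ptr = NULL;
      cached_size = 0;
    }
  void *p = malloc (size);
  if (!p && cached_ptr)
    {
      free (cached_ptr);
      cached_ptr = NULL;
      cached_size = 0;
      p = malloc (size);
    }
  return p;
}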

Re-tested with offloading to NVPTX. OK for trunk?

Thanks,

Julian

ChangeLog

2020-11-13  Julian Brown  <julian@codesourcery.com>

libgomp/
    * plugin/plugin-nvptx.c (SOFTSTACK_CACHE_LIMIT): New define.
    (struct ptx_device): Add omp_stacks struct.
    (nvptx_open_device): Initialise cached-stacks housekeeping info.
    (nvptx_close_device): Free cached stacks block and mutex.
    (nvptx_stacks_free): New function.
    (nvptx_alloc): Add SUPPRESS_ERRORS parameter.
    (GOMP_OFFLOAD_alloc): Add strategies for freeing soft-stacks block.
    (nvptx_stacks_alloc): Rename to...
    (nvptx_stacks_acquire): This.  Cache stacks block between runs if same
    size or smaller is required.
    (nvptx_stacks_free): Remove.
    (GOMP_OFFLOAD_run): Call nvptx_stacks_acquire and lock stacks block
    during kernel execution.
Julian Brown Dec. 8, 2020, 1:13 a.m. UTC | #7
Ping?

Thanks,

Julian

Alexander Monakov Dec. 8, 2020, 5:11 p.m. UTC | #8
On Tue, 8 Dec 2020, Julian Brown wrote:

> Ping?

This has addressed my concerns, thanks.

Alexander

Julian Brown Dec. 15, 2020, 1:39 p.m. UTC | #9
On Tue, 8 Dec 2020 20:11:38 +0300
Alexander Monakov <amonakov@ispras.ru> wrote:

> On Tue, 8 Dec 2020, Julian Brown wrote:
> 
> > Ping?  
> 
> This has addressed my concerns, thanks.

Jakub, Tom -- just to confirm, is this OK for trunk now?

I noticed a slight bug myself in the no-stacks/out-of-memory case --
i.e. for OpenACC, in nvptx_stacks_free. The attached version of the
patch includes a fix for that.

Thanks,

Julian
Jakub Jelinek Dec. 15, 2020, 1:49 p.m. UTC | #10
On Tue, Dec 15, 2020 at 01:39:13PM +0000, Julian Brown wrote:
> @@ -1922,7 +1997,9 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>    nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
>  
>    size_t stack_size = nvptx_stacks_size ();
> -  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
> +
> +  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
> +  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
>    void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
>    size_t fn_args_size = sizeof fn_args;
>    void *config[] = {
> @@ -1944,7 +2021,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>  		       maybe_abort_msg);
>    else if (r != CUDA_SUCCESS)
>      GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> -  nvptx_stacks_free (stacks, teams * threads);
> +
> +  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>  }

Do you need to hold the omp_stacks.lock across the entire offloading?
Doesn't that serialize all offloading kernels to the same device?
I mean, can't the lock be taken just shortly at the start to either acquire
the cached stacks or allocate a fresh stack, and then at the end to put the
stack back into the cache?

Also, how will this caching interact with malloc etc. performed in target
regions?  Shall we do the caching only if there is no other concurrent
offloading to the device, because the newlib malloc will not be able to
figure out that it could free this block and let the host know it has freed
it?

	Jakub
Julian Brown Dec. 15, 2020, 4:49 p.m. UTC | #11
On Tue, 15 Dec 2020 14:49:40 +0100
Jakub Jelinek <jakub@redhat.com> wrote:

> On Tue, Dec 15, 2020 at 01:39:13PM +0000, Julian Brown wrote:
> > @@ -1922,7 +1997,9 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void
> > *tgt_vars, void **args) nvptx_adjust_launch_bounds (tgt_fn,
> > ptx_dev, &teams, &threads); 
> >    size_t stack_size = nvptx_stacks_size ();
> > -  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
> > +
> > +  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
> > +  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams
> > * threads); void *fn_args[] = {tgt_vars, stacks, (void *)
> > stack_size}; size_t fn_args_size = sizeof fn_args;
> >    void *config[] = {
> > @@ -1944,7 +2021,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void
> > *tgt_vars, void **args) maybe_abort_msg);
> >    else if (r != CUDA_SUCCESS)
> >      GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error
> > (r));
> > -  nvptx_stacks_free (stacks, teams * threads);
> > +
> > +  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
> >  }  
> 
> Do you need to hold the omp_stacks.lock across the entire offloading?
> Doesn't that serialize all offloading kernels to the same device?
> I mean, can't the lock be taken just shortly at the start to either
> acquire the cached stacks or allocate a fresh stack, and then at the
> end to put the stack back into the cache?

I think you're suggesting something like what Alexander mentioned -- a
pool of cached stacks blocks in case the single, locked block is
contested. Obviously at present kernel launches are serialised on the
target anyway, so it's a question of whether having the device wait for
the host to unlock the stacks block (i.e. a context switch, FSVO context
switch), or allocating a new stacks block, is quicker. I think the
numbers posted in the parent email show that memory allocation is so
slow that just waiting for the lock wins. I'm wary of adding
unnecessary complication, especially if it'll only be exercised in
already hard-to-debug cases (i.e. lots of threads)!

Just ignoring the cache if it's "in use" (and doing an allocation/free
of another stacks block, as at present) is something I'd not quite
considered. Indeed that might work, but I'm not sure if it'll be
any faster in practice.

> Also, how will this caching interact with malloc etc. performed in
> target regions?  Shall we do the caching only if there is no other
> concurrent offloading to the device because the newlib malloc will
> not be able to figure out it could free this and let the host know it
> has freed it.

Does target-side memory allocation call back into the plugin's
GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, target-side
memory allocation shouldn't be affected, I don't think?

Thanks,

Julian
Jakub Jelinek Dec. 15, 2020, 5 p.m. UTC | #12
On Tue, Dec 15, 2020 at 04:49:38PM +0000, Julian Brown wrote:
> > Do you need to hold the omp_stacks.lock across the entire offloading?
> > Doesn't that serialize all offloading kernels to the same device?
> > I mean, can't the lock be taken just shortly at the start to either
> > acquire the cached stacks or allocate a fresh stack, and then at the
> > end to put the stack back into the cache?
> 
> I think you're suggesting something like what Alexander mentioned -- a
> pool of cached stacks blocks in case the single, locked block is
> contested. Obviously at present kernel launches are serialised on the
> target anyway, so it's a question of whether having the device wait for
> the host to unlock the stacks block (i.e. a context switch, FSVO context
> switch), or allocating a new stacks block, is quicker. I think the
> numbers posted in the parent email show that memory allocation is so
> slow that just waiting for the lock wins. I'm wary of adding
> unnecessary complication, especially if it'll only be exercised in
> already hard-to-debug cases (i.e. lots of threads)!

I'm not suggesting having multiple stacks; on the contrary, I've suggested
doing the caching only if at most one host thread is offloading to the
device.

If one uses
#pragma omp parallel num_threads(3)
{
  #pragma omp target
  ...
}
then I don't see what would previously have prevented the concurrent
offloading; yes, we take the device lock during gomp_map_vars and again
during gomp_unmap_vars, but we don't hold it across the offloading in
between.

> Does target-side memory allocation call back into the plugin's
> GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, target-side
> memory allocation shouldn't be affected, I don't think?

Again, I'm not suggesting that it should, but what I'm saying is that
if a target region ends while some other host tasks are doing target regions
to the same device concurrently with that, or if there are async target
regions in flight, we shouldn't try to cache the stack, but free it right
away, because the other target regions might need to malloc larger amounts
of memory and fail because of the caching.

	Jakub
Julian Brown Dec. 15, 2020, 11:16 p.m. UTC | #13
On Tue, 15 Dec 2020 18:00:36 +0100
Jakub Jelinek <jakub@redhat.com> wrote:

> On Tue, Dec 15, 2020 at 04:49:38PM +0000, Julian Brown wrote:
> > > Do you need to hold the omp_stacks.lock across the entire
> > > offloading? Doesn't that serialize all offloading kernels to the
> > > same device? I mean, can't the lock be taken just shortly at the
> > > start to either acquire the cached stacks or allocate a fresh
> > > stack, and then at the end to put the stack back into the cache?  
> > 
> > I think you're suggesting something like what Alexander mentioned
> > -- a pool of cached stacks blocks in case the single, locked block
> > is contested. Obviously at present kernel launches are serialised
> > on the target anyway, so it's a question of whether having the
> > device wait for the host to unlock the stacks block (i.e. a context
> > switch, FSVO context switch), or allocating a new stacks block, is
> > quicker. I think the numbers posted in the parent email show that
> > memory allocation is so slow that just waiting for the lock wins.
> > I'm wary of adding unnecessary complication, especially if it'll
> > only be exercised in already hard-to-debug cases (i.e. lots of
> > threads)!  
> 
> I'm not suggesting having multiple stacks; on the contrary, I've
> suggested doing the caching only if at most one host thread is
> offloading to the device.
> 
> If one uses
> #pragma omp parallel num_threads(3)
> {
>   #pragma omp target
>   ...
> }
> then I don't see what would previously have prevented the concurrent
> offloading; yes, we take the device lock during gomp_map_vars and
> again during gomp_unmap_vars, but we don't hold it across the
> offloading in between.

I still don't think I quite understand what you're getting at.

We only implement synchronous launches for OpenMP on NVPTX at present,
and those all use the default CUDA runtime driver stream. Only one
kernel executes on the hardware at once, even if launched from
different host threads. The serialisation isn't due to the device lock
being held, but by the queueing semantics of the underlying API.

> > Does target-side memory allocation call back into the plugin's
> > GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, target-side
> > memory allocation shouldn't be affected, I don't think?  
> 
> Again, I'm not suggesting that it should, but what I'm saying is
> that if a target region ends while some other host tasks are doing
> target regions to the same device concurrently with that, or if
> there are async target regions in flight, we shouldn't try to cache
> the stack, but free it right away, because the other target regions
> might need to malloc larger amounts of memory and fail because of
> the caching.
I'm assuming you're not suggesting fundamentally changing APIs or
anything to determine if we're launching target regions from multiple
threads at once, but instead that we try to detect the condition
dynamically in the plugin?

So, would kernel launch look something like this? (Excuse
pseudo-code-isms!)

void GOMP_OFFLOAD_run (...)
{
  bool used_cache;

  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
  if (&ptx_dev->omp_stacks.usage_count > 0)
  {
    cuCtxSynchronize ();
    nvptx_stacks_free (&ptx_dev);
    ...allocate fresh stack, no caching...
    used_cache = false;
  }
  else
  {
    /* Allocate or re-use cached stacks, and then... */
    ptx_dev->omp_stacks.usage_count++;
    used_cache = true;
  }
  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);

  /* Launch kernel */

  if (used_cache) {
    cuStreamAddCallback (
      pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
      ptx_dev->omp_stacks.usage_count--;
      pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
    );
  } else {
    pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
    /* Free uncached stack */
    pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
  }
}

This seems like it'd be rather fragile to me, and would offer some
benefit perhaps only if a previous cached stacks block was much larger
than the one required for some given later launch. It wouldn't allow
any additional parallelism on the target I don't think.

Is that sort-of what you meant?

Oh, or perhaps something more like checking cuStreamQuery at the end of
the kernel launch to see if more work (...from other threads) is
outstanding on the same queue? I think that only usefully returns
CUDA_SUCCESS/CUDA_ERROR_NOT_READY, so I'm not sure if that'd help.

Thanks for clarification (& apologies for being slow!),

Julian
Julian Brown Jan. 5, 2021, 12:13 p.m. UTC | #14
Hi Jakub,

Just to check, does my previous reply address your concerns --
particularly with regards to the current usage of CUDA streams
serializing kernel executions from different host threads? Given that
situation, and the observed speed improvement with OpenMP offloading to
NVPTX with the patch, I'm not sure how much sense it makes to do
anything more sophisticated than this -- especially without a test case
that demonstrates a performance regression (or an exacerbated
out-of-memory condition) with the patch.

Thanks,

Julian

Jakub Jelinek Jan. 5, 2021, 3:32 p.m. UTC | #15
On Tue, Jan 05, 2021 at 12:13:59PM +0000, Julian Brown wrote:
> Just to check, does my reply below address your concerns --
> particularly with regards to the current usage of CUDA streams
> serializing kernel executions from different host threads? Given that
> situation, and the observed speed improvement with OpenMP offloading to
> NVPTX with the patch, I'm not sure how much sense it makes to do
> anything more sophisticated than this -- especially without a test case
> that demonstrates a performance regression (or an exacerbated
> out-of-memory condition) with the patch.

I guess I can live with it for GCC 11, but would like this to be
reconsidered for GCC 12; people do run OpenMP offloading code from multiple,
often concurrent, threads and we shouldn't serialize it unnecessarily.

	Jakub

Patch

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 11d4ceeae62e..e7ff5d5213e0 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -137,6 +137,15 @@  init_cuda_lib (void)
 #define MIN(X,Y) ((X) < (Y) ? (X) : (Y))
 #define MAX(X,Y) ((X) > (Y) ? (X) : (Y))
 
+static const char *
+maybe_abort_message (unsigned errmsg)
+{
+  if (errmsg == CUDA_ERROR_LAUNCH_FAILED)
+    return " (perhaps abort was called)";
+  else
+    return "";
+}
+
 /* Convenience macros for the frequently used CUDA library call and
    error handling sequence as well as CUDA library calls that
    do the error checking themselves or don't do it at all.  */
@@ -147,8 +156,9 @@  init_cuda_lib (void)
       = CUDA_CALL_PREFIX FN (__VA_ARGS__);	\
     if (__r != CUDA_SUCCESS)			\
       {						\
-	GOMP_PLUGIN_error (#FN " error: %s",	\
-			   cuda_error (__r));	\
+	GOMP_PLUGIN_error (#FN " error: %s%s",	\
+			   cuda_error (__r),	\
+			   maybe_abort_message (__r));	\
 	return ERET;				\
       }						\
   } while (0)
@@ -162,8 +172,9 @@  init_cuda_lib (void)
       = CUDA_CALL_PREFIX FN (__VA_ARGS__);	\
     if (__r != CUDA_SUCCESS)			\
       {						\
-	GOMP_PLUGIN_fatal (#FN " error: %s",	\
-			   cuda_error (__r));	\
+	GOMP_PLUGIN_fatal (#FN " error: %s%s",	\
+			   cuda_error (__r),	\
+			   maybe_abort_message (__r));	\
       }						\
   } while (0)
 
@@ -307,6 +318,14 @@  struct ptx_device
   struct ptx_free_block *free_blocks;
   pthread_mutex_t free_blocks_lock;
 
+  /* OpenMP stacks, cached between kernel invocations.  */
+  struct
+    {
+      CUdeviceptr ptr;
+      size_t size;
+      pthread_mutex_t lock;
+    } omp_stacks;
+
   struct ptx_device *next;
 };
 
@@ -514,6 +533,10 @@  nvptx_open_device (int n)
   ptx_dev->free_blocks = NULL;
   pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
 
+  ptx_dev->omp_stacks.ptr = 0;
+  ptx_dev->omp_stacks.size = 0;
+  pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
+
   return ptx_dev;
 }
 
@@ -534,6 +557,11 @@  nvptx_close_device (struct ptx_device *ptx_dev)
   pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
   pthread_mutex_destroy (&ptx_dev->image_lock);
 
+  pthread_mutex_destroy (&ptx_dev->omp_stacks.lock);
+
+  if (ptx_dev->omp_stacks.ptr)
+    CUDA_CALL (cuMemFree, ptx_dev->omp_stacks.ptr);
+
   if (!ptx_dev->ctx_shared)
     CUDA_CALL (cuCtxDestroy, ptx_dev->ctx);
 
@@ -1866,26 +1894,49 @@  nvptx_stacks_size ()
   return 128 * 1024;
 }
 
-/* Return contiguous storage for NUM stacks, each SIZE bytes.  */
+/* Return contiguous storage for NUM stacks, each SIZE bytes, and obtain the
+   lock for that storage.  */
 
 static void *
-nvptx_stacks_alloc (size_t size, int num)
+nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
 {
-  CUdeviceptr stacks;
-  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &stacks, size * num);
+  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
+
+  if (ptx_dev->omp_stacks.ptr && ptx_dev->omp_stacks.size >= size * num)
+    return (void *) ptx_dev->omp_stacks.ptr;
+
+  /* Free the old, too-small stacks.  */
+  if (ptx_dev->omp_stacks.ptr)
+    {
+      CUresult r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s\n", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemFree, ptx_dev->omp_stacks.ptr);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+    }
+
+  /* Make new and bigger stacks, and remember where we put them and how big
+     they are.  */
+  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &ptx_dev->omp_stacks.ptr,
+				  size * num);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuda_error (r));
-  return (void *) stacks;
+
+  ptx_dev->omp_stacks.size = size * num;
+
+  return (void *) ptx_dev->omp_stacks.ptr;
 }
 
-/* Release storage previously allocated by nvptx_stacks_alloc.  */
+/* Release the lock associated with a ptx_device's OpenMP stacks block.  */
 
 static void
-nvptx_stacks_free (void *p, int num)
+nvptx_stacks_release (CUstream stream, CUresult res, void *ptr)
 {
-  CUresult r = CUDA_CALL_NOCHECK (cuMemFree, (CUdeviceptr) p);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+  if (res != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
+  struct ptx_device *ptx_dev = (struct ptx_device *) ptr;
+  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
 }
 
 void
@@ -1898,7 +1949,6 @@  GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   const char *fn_name = launch->fn;
   CUresult r;
   struct ptx_device *ptx_dev = ptx_devices[ord];
-  const char *maybe_abort_msg = "(perhaps abort was called)";
   int teams = 0, threads = 0;
 
   if (!args)
@@ -1922,7 +1972,7 @@  GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
-  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
+  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
   void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
   size_t fn_args_size = sizeof fn_args;
   void *config[] = {
@@ -1938,13 +1988,8 @@  GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
-  if (r == CUDA_ERROR_LAUNCH_FAILED)
-    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
-		       maybe_abort_msg);
-  else if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
-  nvptx_stacks_free (stacks, teams * threads);
+  CUDA_CALL_ASSERT (cuStreamAddCallback, NULL, nvptx_stacks_release,
+		    (void *) ptx_dev, 0);
 }
 
 /* TODO: Implement GOMP_OFFLOAD_async_run. */