[RFC,nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

Message ID 87oacheqlz.fsf@hertz.schwinge.homeip.net
State New

Commit Message

Thomas Schwinge Jan. 19, 2016, 11:49 a.m. UTC
Hi!

With nvptx offloading, in one OpenACC test case, we're running into the
following fatal error (GOMP_DEBUG=1 output):

    [...]
    info    : Function properties for 'LBM_performStreamCollide$_omp_fn$0':
    info    : used 87 registers, 0 stack, 8 bytes smem, 328 bytes cmem[0], 80 bytes cmem[2], 0 bytes lmem
    [...]
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=32, vectors=32
    
    libgomp: cuLaunchKernel error: too many resources requested for launch

Very likely this means that the number of registers used in this function
("used 87 registers"), multiplied by the thread block size (workers *
vectors, "workers=32, vectors=32"), exceeds the hardware maximum.

(One problem certainly might be that we're currently not doing any
register allocation for nvptx, as far as I remember based on the idea
that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
this up" for us -- which I'm not sure it actually is doing?)

Below I'm posting a prototype patch which makes the execution run
successfully:

    [...]
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=32, vectors=32
        cuLaunchKernel: CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES; retrying with reduced number of workers
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=16, vectors=32
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: finished
    [...]

As -- I think -- the maximum number of registers in a thread block is
fixed, it would be good to remember the modified dims[GOMP_DIM_WORKER]
(which my patch doesn't do).
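
To illustrate, just a sketch of what "remembering" could look like in
nvptx_exec -- the max_workers field is hypothetical and does not exist in
the plugin's per-function descriptor today:

    /* Sketch only: "max_workers" is a hypothetical new field on the
       per-function descriptor, caching a worker count that is known to
       launch successfully.  */
    if (targ_fn->max_workers != 0
        && dims[GOMP_DIM_WORKER] > targ_fn->max_workers)
      dims[GOMP_DIM_WORKER] = targ_fn->max_workers;

    /* ...and once cuLaunchKernel has succeeded (possibly after the
       back-off retries):  */
    targ_fn->max_workers = dims[GOMP_DIM_WORKER];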

Alternatively/additionally, we could try experimenting with using the
following of enum CUjit_option "Online compiler and linker options":

    CU_JIT_MAX_REGISTERS = 0
        Max number of registers that a thread may use. Option type: unsigned int Applies to: compiler only 
    CU_JIT_THREADS_PER_BLOCK
        IN: Specifies minimum number of threads per block to target compilation for; OUT: Returns the number of threads the compiler actually targeted. This restricts the resource utilization of the compiler (e.g. max registers) such that a block with the given number of threads should be able to launch based on register limitations. Note, this option does not currently take into account any other resource limitations, such as shared memory utilization. Cannot be combined with CU_JIT_TARGET. Option type: unsigned int Applies to: compiler only 
    [...]

..., to have the PTX JIT reduce the number of live registers (if
possible; I don't know), and/or could try experimenting with querying the
active device, enum CUdevice_attribute "Device properties":

    [...]
    CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12
        Maximum number of 32-bit registers available per block 
    [...]

..., and use that in combination with each function's enum
CUfunction_attribute "Function properties":

    CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 0
        The maximum number of threads per block, beyond which a launch of the function would fail. This number depends on both the function and the device on which the function is currently loaded.
    [...]
    CU_FUNC_ATTRIBUTE_NUM_REGS = 4
        The number of registers used by each thread of this function. 
    [...]

... to determine an optimal number of threads per block given the number
of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
would do that already?).  All these options, however, are more complicated
than the following simple "back-off" approach:

commit bb0bf9e50026feabe877c9d8174e78c021b002a4
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Tue Jan 19 12:31:27 2016 +0100

    [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
---
 gcc/gimple-fold.c             |    7 +++++++
 gcc/tree-vrp.c                |    1 +
 libgomp/plugin/plugin-nvptx.c |   28 ++++++++++++++++++++--------
 3 files changed, 28 insertions(+), 8 deletions(-)



Grüße
 Thomas

Comments

Alexander Monakov Jan. 19, 2016, 1:35 p.m. UTC | #1
On Tue, 19 Jan 2016, Thomas Schwinge wrote:

> Hi!
> 
> With nvptx offloading, in one OpenACC test case, we're running into the
> following fatal error (GOMP_DEBUG=1 output):
> 
>     [...]
>     info    : Function properties for 'LBM_performStreamCollide$_omp_fn$0':
>     info    : used 87 registers, 0 stack, 8 bytes smem, 328 bytes cmem[0], 80 bytes cmem[2], 0 bytes lmem
>     [...]
>       nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=32, vectors=32
>     
>     libgomp: cuLaunchKernel error: too many resources requested for launch
> 
> Very likely this means that the number of registers used in this function
> ("used 87 registers"), multiplied by the thread block size (workers *
> vectors, "workers=32, vectors=32"), exceeds the hardware maximum.

Yes, today most CUDA GPUs allow 64K regs per block, some allow 32K, so
87*32*32 definitely overflows that limit.  A reference is available in the
CUDA C Programming Guide, appendix G, table 13:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
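
Spelled out for this case:

    87 regs/thread * 32 workers * 32 vectors = 89,088 regs per block > 65,536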
 
> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

(well, if you want I can point out that
 1) GCC never emits launch bounds so PTX JIT has to guess limits -- that's
 something I'd like to play with in the future, time permitting
 2) OpenACC register copying at forks increases (pseudo-)register pressure
 3) I think if you inspect PTX code you'll see it used way more than 87 regs)

As for the proposed patch, does the OpenACC spec leave the implementation
freedom to spawn a different number of workers than requested?  (honest
question -- I didn't look at the spec that closely)

> Alternatively/additionally, we could try experimenting with using the
> following of enum CUjit_option "Online compiler and linker options":
[snip]
> ..., to have the PTX JIT reduce the number of live registers (if
> possible; I don't know), and/or could try experimenting with querying the
> active device, enum CUdevice_attribute "Device properties":
> 
>     [...]
>     CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12
>         Maximum number of 32-bit registers available per block 
>     [...]
> 
> ..., and use that in combination with each function's enum
> CUfunction_attribute "Function properties":
[snip]
> ... to determine an optimal number of threads per block given the number
> of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> would do that already?).

I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
cuOcc* (occupancy query) interface that allows to simply ask the driver about
the per-function launch limit.

Thanks.
Alexander
Nathan Sidwell Jan. 19, 2016, 1:47 p.m. UTC | #2
On 01/19/16 06:49, Thomas Schwinge wrote:

> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

My understanding is that the JIT compiler does register allocation.

>     int axis = get_oacc_ifn_dim_arg (call);
> +  if (axis == GOMP_DIM_WORKER)
> +    {
> +      /* libgomp's nvptx plugin might potentially modify
> +	 dims[GOMP_DIM_WORKER].  */
> +      return NULL_TREE;
> +    }

this is almost certainly wrong.   You're preventing constant folding in the 
compiler.

nathan
Alexander Monakov Jan. 19, 2016, 2:07 p.m. UTC | #3
On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > ... to determine an optimal number of threads per block given the number
> > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > would do that already?).
> 
> I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
> cuOcc* (occupancy query) interface that allows to simply ask the driver about
> the per-function launch limit.

Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK is
indeed sufficient for limiting threads per block, which is trivially
translatable into workers per gang in OpenACC.  IMO it's also a cleaner
approach in this case, compared to iterative backoff (if, again, the
implementation is free to do that).
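
For reference, a minimal sketch of such a clamp as it might sit in
nvptx_exec before the launch (the exact placement, and bailing out with
GOMP_PLUGIN_fatal on a query failure, are assumptions on my side):

    CUresult r;
    int max_threads_per_block;

    /* Ask the JIT-compiled function for its per-launch thread limit, which
       already accounts for its register usage on this device.  */
    r = cuFuncGetAttribute (&max_threads_per_block,
                            CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
                            function);
    if (r != CUDA_SUCCESS)
      GOMP_PLUGIN_fatal ("cuFuncGetAttribute error: %s", cuda_error (r));

    /* Threads per block = vectors (ntid.x) * workers (ntid.y); clamp the
       number of workers accordingly.  */
    if (dims[GOMP_DIM_VECTOR] * dims[GOMP_DIM_WORKER] > max_threads_per_block)
      dims[GOMP_DIM_WORKER] = max_threads_per_block / dims[GOMP_DIM_VECTOR];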

When mentioning cuOcc* I was thinking about finding an optimal number of
blocks per device, which is a different story.
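
(If it helps as a reference, a rough sketch of that second use -- picking a
total number of blocks, i.e. gangs, once a block size is fixed; "dev"
standing for the active CUdevice handle is an assumption here:)

    int block_size = dims[GOMP_DIM_VECTOR] * dims[GOMP_DIM_WORKER];
    int blocks_per_sm, num_sms;

    /* How many blocks of this kernel can be resident per multiprocessor at
       the chosen block size (registers, shared memory, ... permitting)?  */
    cuOccupancyMaxActiveBlocksPerMultiprocessor (&blocks_per_sm, function,
                                                 block_size, 0);
    cuDeviceGetAttribute (&num_sms,
                          CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);

    /* A reasonable default for the total number of gangs to launch.  */
    int num_gangs = blocks_per_sm * num_sms;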

Alexander
Thomas Schwinge Jan. 19, 2016, 2:40 p.m. UTC | #4
Hi!

On Tue, 19 Jan 2016 08:47:02 -0500, Nathan Sidwell <nathan@acm.org> wrote:
> On 01/19/16 06:49, Thomas Schwinge wrote:
> >     int axis = get_oacc_ifn_dim_arg (call);
> > +  if (axis == GOMP_DIM_WORKER)
> > +    {
> > +      /* libgomp's nvptx plugin might potentially modify
> > +	 dims[GOMP_DIM_WORKER].  */
> > +      return NULL_TREE;
> > +    }
> 
> this is almost certainly wrong.   You're preventing constant folding in the 
> compiler.

Yes, because if libgomp can modify dims[GOMP_DIM_WORKER], in the compiler
we cannot assume it to be constant?  (Assuming it constant did result in a
run-time test verification failure.)  Of course, my hammer might be too big
(which is why this is an RFC).


Grüße
 Thomas
Thomas Schwinge Jan. 19, 2016, 2:55 p.m. UTC | #5
Hi!

On Tue, 19 Jan 2016 17:07:17 +0300, Alexander Monakov <amonakov@ispras.ru> wrote:
> On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > > ... to determine an optimal number of threads per block given the number
> > > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > > would do that already?).
> > 
> > I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
> > cuOcc* (occupancy query) interface that allows to simply ask the driver about
> > the per-function launch limit.

You mean you already have implemented something along the lines I
proposed?

> Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK is
> indeed sufficient for limiting threads per block, which is trivially
> translatable into workers per gang in OpenACC.

That's good to know, thanks!

> IMO it's also a cleaner
> approach in this case, compared to iterative backoff (if, again, the
> implementation is free to do that).

It is not explicitly spelled out in OpenACC 2.0a, but it got clarified in
OpenACC 2.5.  See "2.5.7. num_workers clause": "[...]  The implementation
may use a different value than specified based on limitations imposed by
the target architecture".

> When mentioning cuOcc* I was thinking about finding an optimal number of
> blocks per device, which is a different story.

:-)


Grüße
 Thomas
Alexander Monakov Jan. 19, 2016, 3:10 p.m. UTC | #6
On Tue, 19 Jan 2016, Thomas Schwinge wrote:

> Hi!
> 
> On Tue, 19 Jan 2016 17:07:17 +0300, Alexander Monakov <amonakov@ispras.ru> wrote:
> > On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > > > ... to determine an optimal number of threads per block given the number
> > > > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > > > would do that already?).
> > > 
> > > I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
> > > cuOcc* (occupancy query) interface that allows to simply ask the driver about
> > > the per-function launch limit.
> 
> You mean you already have implemented something along the lines I
> proposed?

Yes, I was implementing OpenMP teams, and it made sense to add warps per block
limiting at the same time (i.e. query CU_FUNC_ATTRIBUTE_... and limit if
default or requested number of threads per team is too high).  I intend to
post that patch as part of a larger series shortly (but the patch itself is
simple enough, although a small tweak will be needed to make it apply to
OpenACC too).

Alexander
Alexander Monakov Jan. 20, 2016, 5:35 p.m. UTC | #7
On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > You mean you already have implemented something along the lines I
> > proposed?
> 
> Yes, I was implementing OpenMP teams, and it made sense to add warps per block
> limiting at the same time (i.e. query CU_FUNC_ATTRIBUTE_... and limit if
> default or requested number of threads per team is too high).  I intend to
> post that patch as part of a larger series shortly (but the patch itself is
> simple enough, although a small tweak will be needed to make it apply to
> OpenACC too).

Here's the patch I was talking about:
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=04e68c22081c36caf5da9d9f4ca5e895e1088c78;hp=735c8a7d88a7e14cb707f22286678982174175a6

Alexander

Patch

diff --git gcc/gimple-fold.c gcc/gimple-fold.c
index a0e7b7e..e75c58e 100644
--- gcc/gimple-fold.c
+++ gcc/gimple-fold.c
@@ -2935,6 +2935,13 @@  fold_internal_goacc_dim (const gimple *call)
     return NULL_TREE;
 
   int axis = get_oacc_ifn_dim_arg (call);
+  if (axis == GOMP_DIM_WORKER)
+    {
+      /* libgomp's nvptx plugin might potentially modify
+	 dims[GOMP_DIM_WORKER].  */
+      return NULL_TREE;
+    }
+
   int size = get_oacc_fn_dim_size (current_function_decl, axis);
   bool is_pos = gimple_call_internal_fn (call) == IFN_GOACC_DIM_POS;
   tree result = NULL_TREE;
diff --git gcc/tree-vrp.c gcc/tree-vrp.c
index e6c11e0..a0a78d2 100644
--- gcc/tree-vrp.c
+++ gcc/tree-vrp.c
@@ -3980,6 +3980,7 @@  extract_range_basic (value_range *vr, gimple *stmt)
 	  break;
 	case CFN_GOACC_DIM_SIZE:
 	case CFN_GOACC_DIM_POS:
+	  //TODO: is this kosher regarding libgomp's nvptx plugin potentially modifying dims[GOMP_DIM_WORKER]?
 	  /* Optimizing these two internal functions helps the loop
 	     optimizer eliminate outer comparisons.  Size is [1,N]
 	     and pos is [0,N-1].  */
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index eea74d4..54fd5cb 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -974,24 +974,36 @@  nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   r = cuMemcpy ((CUdeviceptr)dp, (CUdeviceptr)hp, mapnum * sizeof (void *));
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemcpy failed: %s", cuda_error (r));
+  kargs[0] = &dp;
 
-  GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
-		     " gangs=%u, workers=%u, vectors=%u\n",
-		     __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
-		     dims[GOMP_DIM_WORKER], dims[GOMP_DIM_VECTOR]);
-
+ launch:
   // OpenACC		CUDA
   //
   // num_gangs		nctaid.x
   // num_workers	ntid.y
   // vector length	ntid.x
-
-  kargs[0] = &dp;
+  GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
+		     " gangs=%u, workers=%u, vectors=%u\n",
+		     __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
+		     dims[GOMP_DIM_WORKER], dims[GOMP_DIM_VECTOR]);
   r = cuLaunchKernel (function,
 		      dims[GOMP_DIM_GANG], 1, 1,
 		      dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
 		      0, dev_str->stream, kargs, 0);
-  if (r != CUDA_SUCCESS)
+  if (r == CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES)
+    {
+      /* Don't give up just yet; possibly too many threads for the kernel's
+	 register count.  */
+      if (dims[GOMP_DIM_WORKER] > 1)
+	{
+	  dims[GOMP_DIM_WORKER] /= 2;
+	  GOMP_PLUGIN_debug (0, "    cuLaunchKernel: "
+			     "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES; retrying "
+			     "with reduced number of workers\n");
+	  goto launch;
+	}
+    }
+  if (r != CUDA_SUCCESS) //CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 
 #ifndef DISABLE_ASYNC