[RFC,nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

Message ID 87oacheqlz.fsf@hertz.schwinge.homeip.net
State New

Commit Message

Thomas Schwinge Jan. 19, 2016, 11:49 a.m. UTC
Hi!

With nvptx offloading, in one OpenACC test case, we're running into the
following fatal error (GOMP_DEBUG=1 output):

    [...]
    info    : Function properties for 'LBM_performStreamCollide$_omp_fn$0':
    info    : used 87 registers, 0 stack, 8 bytes smem, 328 bytes cmem[0], 80 bytes cmem[2], 0 bytes lmem
    [...]
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=32, vectors=32
    
    libgomp: cuLaunchKernel error: too many resources requested for launch

Very likely this means that the number of registers used in this function
("used 87 registers"), multiplied by the thread block size (workers *
vectors, "workers=32, vectors=32"), exceeds the hardware maximum.

(One problem certainly might be that we're currently not doing any
register allocation for nvptx, as far as I remember based on the idea
that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
this up" for us -- which I'm not sure it actually is doing?)

Below I'm posting a prototype patch which makes the execution run
successfully:

    [...]
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=32, vectors=32
        cuLaunchKernel: CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES; retrying with reduced number of workers
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=16, vectors=32
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: finished
    [...]

As -- I think -- the maximum number of registers in a thread block is
fixed, it would be good to remember the modified dims[GOMP_DIM_WORKER]
(which my patch doesn't do).
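
To illustrate, just a sketch of what "remembering" could look like in
nvptx_exec -- the max_workers field is hypothetical and does not exist in
the plugin's per-function descriptor today:

    /* Sketch only: "max_workers" is a hypothetical new field on the
       per-function descriptor, caching a worker count that is known to
       launch successfully.  */
    if (targ_fn->max_workers != 0
        && dims[GOMP_DIM_WORKER] > targ_fn->max_workers)
      dims[GOMP_DIM_WORKER] = targ_fn->max_workers;

    /* ...and once cuLaunchKernel has succeeded (possibly after the
       back-off retries):  */
    targ_fn->max_workers = dims[GOMP_DIM_WORKER];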

Alternatively/additionally, we could try experimenting with using the
following of enum CUjit_option "Online compiler and linker options":

    CU_JIT_MAX_REGISTERS = 0
        Max number of registers that a thread may use. Option type: unsigned int Applies to: compiler only 
    CU_JIT_THREADS_PER_BLOCK
        IN: Specifies minimum number of threads per block to target compilation for; OUT: Returns the number of threads the compiler actually targeted. This restricts the resource utilization of the compiler (e.g. max registers) such that a block with the given number of threads should be able to launch based on register limitations. Note, this option does not currently take into account any other resource limitations, such as shared memory utilization. Cannot be combined with CU_JIT_TARGET. Option type: unsigned int Applies to: compiler only 
    [...]

..., to have the PTX JIT reduce the number of live registers (if
possible; I don't know), and/or could try experimenting with querying the
active device, enum CUdevice_attribute "Device properties":

    [...]
    CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12
        Maximum number of 32-bit registers available per block 
    [...]

..., and use that in combination with each function's enum
CUfunction_attribute "Function properties":

    CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 0
        The maximum number of threads per block, beyond which a launch of the function would fail. This number depends on both the function and the device on which the function is currently loaded.
    [...]
    CU_FUNC_ATTRIBUTE_NUM_REGS = 4
        The number of registers used by each thread of this function. 
    [...]

... to determine an optimal number of threads per block given the number
of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
would do that already?).  All these options, however, are more complicated
than the following simple "back-off" approach:

commit bb0bf9e50026feabe877c9d8174e78c021b002a4
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Tue Jan 19 12:31:27 2016 +0100

    [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
---
 gcc/gimple-fold.c             |    7 +++++++
 gcc/tree-vrp.c                |    1 +
 libgomp/plugin/plugin-nvptx.c |   28 ++++++++++++++++++++--------
 3 files changed, 28 insertions(+), 8 deletions(-)



Grüße
 Thomas

Comments

Alexander Monakov Jan. 19, 2016, 1:35 p.m. UTC | #1
On Tue, 19 Jan 2016, Thomas Schwinge wrote:

> Hi!
> 
> With nvptx offloading, in one OpenACC test case, we're running into the
> following fatal error (GOMP_DEBUG=1 output):
> 
>     [...]
>     info    : Function properties for 'LBM_performStreamCollide$_omp_fn$0':
>     info    : used 87 registers, 0 stack, 8 bytes smem, 328 bytes cmem[0], 80 bytes cmem[2], 0 bytes lmem
>     [...]
>       nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=32, vectors=32
>     
>     libgomp: cuLaunchKernel error: too many resources requested for launch
> 
> Very likely this means that the number of registers used in this function
> ("used 87 registers"), multiplied by the thread block size (workers *
> vectors, "workers=32, vectors=32"), exceeds the hardware maximum.

Yes, today most CUDA GPUs allow 64K regs per block, some allow 32K, so
87*32*32 definitely overflows that limit.  A reference is available in the
CUDA C Programming Guide, appendix G, table 13:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
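
Spelled out for this case:

    87 regs/thread * 32 workers * 32 vectors = 89,088 regs per block > 65,536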
 
> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

(well, if you want I can point out that
 1) GCC never emits launch bounds so PTX JIT has to guess limits -- that's
 something I'd like to play with in the future, time permitting
 2) OpenACC register copying at forks increases (pseudo-)register pressure
 3) I think if you inspect PTX code you'll see it used way more than 87 regs)

As for the proposed patch, does the OpenACC spec leave the implementation
freedom to spawn a different number of workers than requested?  (honest
question -- I didn't look at the spec that closely)

> Alternatively/additionally, we could try experimenting with using the
> following of enum CUjit_option "Online compiler and linker options":
[snip]
> ..., to have the PTX JIT reduce the number of live registers (if
> possible; I don't know), and/or could try experimenting with querying the
> active device, enum CUdevice_attribute "Device properties":
> 
>     [...]
>     CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12
>         Maximum number of 32-bit registers available per block 
>     [...]
> 
> ..., and use that in combination with each function's enum
> CUfunction_attribute "Function properties":
[snip]
> ... to determine an optimal number of threads per block given the number
> of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> would do that already?).

I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
cuOcc* (occupancy query) interface that allows to simply ask the driver about
the per-function launch limit.

Thanks.
Alexander
Nathan Sidwell Jan. 19, 2016, 1:47 p.m. UTC | #2
On 01/19/16 06:49, Thomas Schwinge wrote:

> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

My understanding is that the JIT compiler does register allocation.

>     int axis = get_oacc_ifn_dim_arg (call);
> +  if (axis == GOMP_DIM_WORKER)
> +    {
> +      /* libgomp's nvptx plugin might potentially modify
> +	 dims[GOMP_DIM_WORKER].  */
> +      return NULL_TREE;
> +    }

this is almost certainly wrong.   You're preventing constant folding in the 
compiler.

nathan
Alexander Monakov Jan. 19, 2016, 2:07 p.m. UTC | #3
On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > ... to determine an optimal number of threads per block given the number
> > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > would do that already?).
> 
> I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
> cuOcc* (occupancy query) interface that allows to simply ask the driver about
> the per-function launch limit.

Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK is
indeed sufficient for limiting threads per block, which is trivially
translatable into workers per gang in OpenACC.  IMO it's also a cleaner
approach in this case, compared to iterative backoff (if, again, the
implementation is free to do that).
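
For reference, a minimal sketch of such a clamp as it might sit in
nvptx_exec before the launch (the exact placement, and bailing out with
GOMP_PLUGIN_fatal on a query failure, are assumptions on my side):

    CUresult r;
    int max_threads_per_block;

    /* Ask the JIT-compiled function for its per-launch thread limit, which
       already accounts for its register usage on this device.  */
    r = cuFuncGetAttribute (&max_threads_per_block,
                            CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
                            function);
    if (r != CUDA_SUCCESS)
      GOMP_PLUGIN_fatal ("cuFuncGetAttribute error: %s", cuda_error (r));

    /* Threads per block = vectors (ntid.x) * workers (ntid.y); clamp the
       number of workers accordingly.  */
    if (dims[GOMP_DIM_VECTOR] * dims[GOMP_DIM_WORKER] > max_threads_per_block)
      dims[GOMP_DIM_WORKER] = max_threads_per_block / dims[GOMP_DIM_VECTOR];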

When mentioning cuOcc* I was thinking about finding an optimal number of
blocks per device, which is a different story.
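
(If it helps as a reference, a rough sketch of that second use -- picking a
total number of blocks, i.e. gangs, once a block size is fixed; "dev"
standing for the active CUdevice handle is an assumption here:)

    int block_size = dims[GOMP_DIM_VECTOR] * dims[GOMP_DIM_WORKER];
    int blocks_per_sm, num_sms;

    /* How many blocks of this kernel can be resident per multiprocessor at
       the chosen block size (registers, shared memory, ... permitting)?  */
    cuOccupancyMaxActiveBlocksPerMultiprocessor (&blocks_per_sm, function,
                                                 block_size, 0);
    cuDeviceGetAttribute (&num_sms,
                          CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);

    /* A reasonable default for the total number of gangs to launch.  */
    int num_gangs = blocks_per_sm * num_sms;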

Alexander
Thomas Schwinge Jan. 19, 2016, 2:40 p.m. UTC | #4
Hi!

On Tue, 19 Jan 2016 08:47:02 -0500, Nathan Sidwell <nathan@acm.org> wrote:
> On 01/19/16 06:49, Thomas Schwinge wrote:
> >     int axis = get_oacc_ifn_dim_arg (call);
> > +  if (axis == GOMP_DIM_WORKER)
> > +    {
> > +      /* libgomp's nvptx plugin might potentially modify
> > +	 dims[GOMP_DIM_WORKER].  */
> > +      return NULL_TREE;
> > +    }
> 
> this is almost certainly wrong.   You're preventing constant folding in the 
> compiler.

Yes, because if libgomp can modify dims[GOMP_DIM_WORKER], in the compiler
we cannot assume it to be constant?  (Assuming it constant did result in a
run-time test verification failure.)  Of course, my hammer might be too big
(which is why this is an RFC).


Grüße
 Thomas
Thomas Schwinge Jan. 19, 2016, 2:55 p.m. UTC | #5
Hi!

On Tue, 19 Jan 2016 17:07:17 +0300, Alexander Monakov <amonakov@ispras.ru> wrote:
> On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > > ... to determine an optimal number of threads per block given the number
> > > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > > would do that already?).
> > 
> > I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
> > cuOcc* (occupancy query) interface that allows to simply ask the driver about
> > the per-function launch limit.

You mean you already have implemented something along the lines I
proposed?

> Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK is
> indeed sufficient for limiting threads per block, which is trivially
> translatable into workers per gang in OpenACC.

That's good to know, thanks!

> IMO it's also a cleaner
> approach in this case, compared to iterative backoff (if, again, the
> implementation is free to do that).

It is not explicitly spelled out in OpenACC 2.0a, but it got clarified in
OpenACC 2.5.  See "2.5.7. num_workers clause": "[...]  The implementation
may use a different value than specified based on limitations imposed by
the target architecture".

> When mentioning cuOcc* I was thinking about finding an optimal number of
> blocks per device, which is a different story.

:-)


Grüße
 Thomas
Alexander Monakov Jan. 19, 2016, 3:10 p.m. UTC | #6
On Tue, 19 Jan 2016, Thomas Schwinge wrote:

> Hi!
> 
> On Tue, 19 Jan 2016 17:07:17 +0300, Alexander Monakov <amonakov@ispras.ru> wrote:
> > On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > > > ... to determine an optimal number of threads per block given the number
> > > > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > > > would do that already?).
> > > 
> > > I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
> > > cuOcc* (occupancy query) interface that allows to simply ask the driver about
> > > the per-function launch limit.
> 
> You mean you already have implemented something along the lines I
> proposed?

Yes, I was implementing OpenMP teams, and it made sense to add warps per block
limiting at the same time (i.e. query CU_FUNC_ATTRIBUTE_... and limit if
default or requested number of threads per team is too high).  I intend to
post that patch as part of a larger series shortly (but the patch itself is
simple enough, although a small tweak will be needed to make it apply to
OpenACC too).

Alexander
Alexander Monakov Jan. 20, 2016, 5:35 p.m. UTC | #7
On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > You mean you already have implemented something along the lines I
> > proposed?
> 
> Yes, I was implementing OpenMP teams, and it made sense to add warps per block
> limiting at the same time (i.e. query CU_FUNC_ATTRIBUTE_... and limit if
> default or requested number of threads per team is too high).  I intend to
> post that patch as part of a larger series shortly (but the patch itself is
> simple enough, although a small tweak will be needed to make it apply to
> OpenACC too).

Here's the patch I was talking about:
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=04e68c22081c36caf5da9d9f4ca5e895e1088c78;hp=735c8a7d88a7e14cb707f22286678982174175a6

Alexander

Patch

diff --git gcc/gimple-fold.c gcc/gimple-fold.c
index a0e7b7e..e75c58e 100644
--- gcc/gimple-fold.c
+++ gcc/gimple-fold.c
@@ -2935,6 +2935,13 @@  fold_internal_goacc_dim (const gimple *call)
     return NULL_TREE;
 
   int axis = get_oacc_ifn_dim_arg (call);
+  if (axis == GOMP_DIM_WORKER)
+    {
+      /* libgomp's nvptx plugin might potentially modify
+	 dims[GOMP_DIM_WORKER].  */
+      return NULL_TREE;
+    }
+
   int size = get_oacc_fn_dim_size (current_function_decl, axis);
   bool is_pos = gimple_call_internal_fn (call) == IFN_GOACC_DIM_POS;
   tree result = NULL_TREE;
diff --git gcc/tree-vrp.c gcc/tree-vrp.c
index e6c11e0..a0a78d2 100644
--- gcc/tree-vrp.c
+++ gcc/tree-vrp.c
@@ -3980,6 +3980,7 @@  extract_range_basic (value_range *vr, gimple *stmt)
 	  break;
 	case CFN_GOACC_DIM_SIZE:
 	case CFN_GOACC_DIM_POS:
+	  //TODO: is this kosher regarding libgomp's nvptx plugin potentially modifying dims[GOMP_DIM_WORKER]?
 	  /* Optimizing these two internal functions helps the loop
 	     optimizer eliminate outer comparisons.  Size is [1,N]
 	     and pos is [0,N-1].  */
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index eea74d4..54fd5cb 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -974,24 +974,36 @@  nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   r = cuMemcpy ((CUdeviceptr)dp, (CUdeviceptr)hp, mapnum * sizeof (void *));
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemcpy failed: %s", cuda_error (r));
+  kargs[0] = &dp;
 
-  GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
-		     " gangs=%u, workers=%u, vectors=%u\n",
-		     __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
-		     dims[GOMP_DIM_WORKER], dims[GOMP_DIM_VECTOR]);
-
+ launch:
   // OpenACC		CUDA
   //
   // num_gangs		nctaid.x
   // num_workers	ntid.y
   // vector length	ntid.x
-
-  kargs[0] = &dp;
+  GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
+		     " gangs=%u, workers=%u, vectors=%u\n",
+		     __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
+		     dims[GOMP_DIM_WORKER], dims[GOMP_DIM_VECTOR]);
   r = cuLaunchKernel (function,
 		      dims[GOMP_DIM_GANG], 1, 1,
 		      dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
 		      0, dev_str->stream, kargs, 0);
-  if (r != CUDA_SUCCESS)
+  if (r == CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES)
+    {
+      /* Don't give up just yet; possibly too many threads for the kernel's
+	 register count.  */
+      if (dims[GOMP_DIM_WORKER] > 1)
+	{
+	  dims[GOMP_DIM_WORKER] /= 2;
+	  GOMP_PLUGIN_debug (0, "    cuLaunchKernel: "
+			     "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES; retrying "
+			     "with reduced number of workers\n");
+	  goto launch;
+	}
+    }
+  if (r != CUDA_SUCCESS) //CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 
 #ifndef DISABLE_ASYNC