
[gomp-nvptx,2/9] nvptx backend: new "uniform SIMT" codegen variant

Message ID 1448983707-18854-3-git-send-email-amonakov@ispras.ru
State New

Commit Message

Alexander Monakov Dec. 1, 2015, 3:28 p.m. UTC
This patch introduces a code generation variant for NVPTX that I'm using for
SIMD work in OpenMP offloading.  Let me try to explain the idea behind it...

In place of SIMD vectorization, NVPTX uses SIMT (single
instruction/multiple threads) execution: groups of 32 threads execute the same
instruction, with some threads possibly masked off when under a divergent
branch.  So we map OpenMP threads to such thread groups ("warps"), and
hardware threads are then mapped to OpenMP SIMD lanes.

We need to reach the heads of SIMD regions with all hardware threads active,
because there's no way to "resurrect" them once masked off: they need to
follow the same control flow and reach the SIMD region entry with the same
local state (registers, and the stack too for OpenACC).

The approach in OpenACC is, outside of "vector" loops, to 1) make threads 1-31
"slaves" that just follow branches without doing any computation -- which
requires extra jumps and broadcasting of branch predicates -- and 2) broadcast
register and stack state from master to slaves when entering "vector" regions.
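
As a rough C model of that scheme -- hand-written for this description, with
compute_condition and warp_broadcast as hypothetical stand-ins for what the
neutering pass actually emits, not a real API:

extern int compute_condition (void);
extern int warp_broadcast (int value, int src_lane);

void
neutered_branch_sketch (int lane)
{
  int cond = 0;
  if (lane == 0)
    cond = compute_condition ();    /* only the master computes */
  cond = warp_broadcast (cond, 0);  /* slaves receive the predicate */
  if (cond)
    {
      /* Every lane takes the same direction, keeping control flow
         uniform across the warp.  */
    }
}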

I'm taking a different approach.  I want to execute all insns in all warp
members, while ensuring that the effect (on global and local state) is the
same as if a single thread was executing that instruction.  Most instructions
automatically satisfy that: if threads have the same state, then executing an
arithmetic instruction, normal memory load/store, etc. keeps local state the
same in all threads.

The two exception insn categories are atomics and calls.  For calls, we can
demand recursively that they uphold this execution model, until we reach
runtime-provided "syscalls": malloc/free/vprintf.  Those we can handle like
atomics.

To handle atomics, we
  1) execute the atomic conditionally only in one warp member -- so its side
  effect happens once;
  2) copy the register that was set from that warp member to others -- so
  local state is kept synchronized:

    atom.op dest, ...

becomes

    /* pred = (current_lane == 0);  */
    @pred atom.op dest, ...
    shuffle.idx dest, dest, /*srclane=*/0

So the overhead is one shuffle insn following each atomic, plus predicate
setup in the prologue.

OK, so the above handles execution outside of SIMD regions nicely, but we
also need to run code inside SIMD regions, where this synchronizing effect
must be turned off.  It turns out we can keep atomics decorated almost as
before:

    @pred atom.op dest, ...
    shuffle.idx dest, dest, master_lane

and compute 'pred' and 'master_lane' accordingly: outside of SIMD regions we
need (master_lane == 0 && pred == (current_lane == 0)), and inside we need
(master_lane == current_lane && pred == true) (so that the shuffle is a no-op,
and the predicate is 'true' for all lanes).  Then pred = (current_lane ==
master_lane) works in both cases, and we just need to set up master_lane
accordingly: master_lane = current_lane & mask, where mask is all-0 outside of
SIMD regions and all-1 inside.  To store these per-warp masks, I've
introduced another shared memory array, __nvptx_uni.
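
Here's a minimal self-contained C model of that computation (it just
simulates the per-lane arithmetic described above, in both modes):

#include <stdio.h>

int
main (void)
{
  /* Per-warp mask as stored in __nvptx_uni: all-0 outside SIMD regions,
     all-1 inside.  */
  unsigned mask[2] = { 0x00000000u, 0xffffffffu };
  for (int in_simd = 0; in_simd < 2; in_simd++)
    for (unsigned lane = 0; lane < 32; lane++)
      {
        unsigned master_lane = lane & mask[in_simd];
        int pred = (lane == master_lane);
        /* Outside SIMD regions pred holds for lane 0 only and the shuffle
           reads from lane 0; inside, pred holds for every lane and the
           shuffle from master_lane == lane is a no-op.  */
        if (pred)
          printf ("in_simd=%d: lane %2u runs the atomic, master %2u\n",
                  in_simd, lane, master_lane);
      }
  return 0;
}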

	* config/nvptx/nvptx.c (need_unisimt_decl): New variable.  Set it...
	(nvptx_init_unisimt_predicate): ...here (new function) and use it...
	(nvptx_file_end): ...here to emit declaration of __nvptx_uni array.
	(nvptx_declare_function_name): Call nvptx_init_unisimt_predicate.
	(nvptx_get_unisimt_master): New helper function.
	(nvptx_get_unisimt_predicate): Ditto.
	(nvptx_call_insn_is_syscall_p): Ditto.
	(nvptx_unisimt_handle_set): Ditto.
	(nvptx_reorg_uniform_simt): New.  Transform code for -muniform-simt.
	(nvptx_get_axis_predicate): New helper function, factored out from...
	(nvptx_single): ...here.
	(nvptx_reorg): Call nvptx_reorg_uniform_simt.
	* config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_unisimt__ when -muniform-simt option is active.
	(struct machine_function): Add unisimt_master, unisimt_predicate
	rtx fields.
	* config/nvptx/nvptx.md (divergent): New attribute.
	(atomic_compare_and_swap<mode>_1): Mark as divergent.
	(atomic_exchange<mode>): Ditto.
	(atomic_fetch_add<mode>): Ditto.
	(atomic_fetch_addsf): Ditto.
	(atomic_fetch_<logic><mode>): Ditto.
	* config/nvptx/nvptx.opt (muniform-simt): New option.
	* doc/invoke.texi (-muniform-simt): Document.
---
 gcc/config/nvptx/nvptx.c   | 138 ++++++++++++++++++++++++++++++++++++++++++---
 gcc/config/nvptx/nvptx.h   |   4 ++
 gcc/config/nvptx/nvptx.md  |  18 ++++--
 gcc/config/nvptx/nvptx.opt |   4 ++
 gcc/doc/invoke.texi        |  14 +++++
 5 files changed, 165 insertions(+), 13 deletions(-)

Comments

Bernd Schmidt Dec. 1, 2015, 4:01 p.m. UTC | #1
On 12/01/2015 04:28 PM, Alexander Monakov wrote:
> I'm taking a different approach.  I want to execute all insns in all warp
> members, while ensuring that the effect (on global and local state) is the
> same as if a single thread was executing that instruction.  Most instructions
> automatically satisfy that: if threads have the same state, then executing an
> arithmetic instruction, normal memory load/store, etc. keeps local state the
> same in all threads.
>
> The two exception insn categories are atomics and calls.  For calls, we can
> demand recursively that they uphold this execution model, until we reach
> runtime-provided "syscalls": malloc/free/vprintf.  Those we can handle like
> atomics.

Didn't we also conclude that address-taking (let's say for stack 
addresses) is also an operation that does not result in the same state?

Have you tried to use the mechanism used for OpenACC? IMO that would be 
a good first step - get things working with fewer changes, and then look 
into optimizing them (ideally for OpenMP and OpenACC both).


Bernd
Alexander Monakov Dec. 1, 2015, 4:20 p.m. UTC | #2
On Tue, 1 Dec 2015, Bernd Schmidt wrote:
> 
> Didn't we also conclude that address-taking (let's say for stack addresses) is
> also an operation that does not result in the same state?

This is intended to be used with soft-stacks in OpenMP offloading, and
soft-stacks are per-warp outside of SIMD regions, not private to hwthread.  So
no such problem arises.

(also, I wouldn't phrase it that way -- I wouldn't say that taking the address
of a classic .local stack slot desyncs state)

> Have you tried to use the mechanism used for OpenACC? IMO that would be a good
> first step - get things working with fewer changes, and then look into
> optimizing them (ideally for OpenMP and OpenACC both).

I don't think I would have as much success trying to apply the OpenACC
mechanism with the overall direction I'm taking, that is, running with a
slightly modified libgomp port.  The way parallel regions are activated in the
guts of libgomp via GOMP_parallel/gomp_team_start makes things different, for
example.

Alexander
Jakub Jelinek Dec. 2, 2015, 10:40 a.m. UTC | #3
On Tue, Dec 01, 2015 at 06:28:20PM +0300, Alexander Monakov wrote:
> The approach in OpenACC is, outside of "vector" loops, to 1) make threads 1-31
> "slaves" that just follow branches without doing any computation -- which
> requires extra jumps and broadcasting of branch predicates -- and 2) broadcast
> register and stack state from master to slaves when entering "vector" regions.
> 
> I'm taking a different approach.  I want to execute all insns in all warp
> members, while ensuring that the effect (on global and local state) is the
> same as if a single thread was executing that instruction.  Most instructions
> automatically satisfy that: if threads have the same state, then executing an
> arithmetic instruction, normal memory load/store, etc. keeps local state the
> same in all threads.

Don't know the HW good enough, is there any power consumption, heat etc.
difference between the two approaches?  I mean does the HW consume different
amount of power if only one thread in a warp executes code and the other
threads in the same warp just jump around it, vs. having all threads busy?

If it is the same, then I think your approach is reasonable, but my
understanding of PTX is limited.

How exactly does OpenACC copy the stack?  At least for OpenMP, one could
have automatic vars whose addresses are passed to simd regions in different
functions, say like:

void
baz (int x, int *arr)
{
  int i;
  #pragma omp simd
  for (i = 0; i < 128; i++)
    arr[i] *= arr[i] + i + x; // Replace with something useful and expensive
}

void
bar (int x)
{
  int arr[128], i;
  for (i = 0; i < 128; i++)
    arr[i] = i + x;
  baz (x, arr);
}
#pragma omp declare target to (bar, baz)

void
foo ()
{
  int i;
  #pragma omp target teams distribute parallel for
  for (i = 0; i < 131072; i++)
    bar (i);
}
and without inlining you don't know if the arr in bar above will be shared
by all SIMD lanes (SIMT in PTX case) or not.

	Jakub
Nathan Sidwell Dec. 2, 2015, 1:02 p.m. UTC | #4
On 12/02/15 05:40, Jakub Jelinek wrote:
>  Don't know the HW good enough, is there any power consumption, heat etc.
> difference between the two approaches?  I mean does the HW consume different
> amount of power if only one thread in a warp executes code and the other
> threads in the same warp just jump around it, vs. having all threads busy?

Having all threads busy will increase power consumption.  It's also bad if the 
other vectors are executing memory access instructions.  However, for small 
blocks, it is probably a win over the jump around approach.  One of the 
optimizations for the future of the neutering algorithm is to add such 
predication for small blocks and keep branching for the larger blocks.

> How exactly does OpenACC copy the stack?  At least for OpenMP, one could
> have automatic vars whose addresses are passed to simd regions in different
> functions, say like:

The stack frame of the current function is copied when entering a partitioned 
region.  (There is no visibility of caller's frame and such.) Again, 
optimization would be trying to only copy the stack that's used in the 
partitioned region.

nathan
Jakub Jelinek Dec. 2, 2015, 1:10 p.m. UTC | #5
On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> On 12/02/15 05:40, Jakub Jelinek wrote:
> > Don't know the HW good enough, is there any power consumption, heat etc.
> >difference between the two approaches?  I mean does the HW consume different
> >amount of power if only one thread in a warp executes code and the other
> >threads in the same warp just jump around it, vs. having all threads busy?
> 
> Having all threads busy will increase power consumption.  It's also bad if
> the other vectors are executing memory access instructions.  However, for

Then the uniform SIMT approach might not be such a good idea.

> small blocks, it is probably a win over the jump around approach.  One of
> the optimizations for the future of the neutering algorithm is to add such
> predication for small blocks and keep branching for the larger blocks.
> 
> >How exactly does OpenACC copy the stack?  At least for OpenMP, one could
> >have automatic vars whose addresses are passed to simd regions in different
> >functions, say like:
> 
> The stack frame of the current function is copied when entering a
> partitioned region.  (There is no visibility of caller's frame and such.)
> Again, optimization would be trying to only copy the stack that's used in
> the partitioned region.

Always the whole stack, from the current stack pointer up to top of the
stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?

	Jakub
Nathan Sidwell Dec. 2, 2015, 1:38 p.m. UTC | #6
On 12/02/15 08:10, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:

> Always the whole stack, from the current stack pointer up to top of the
> stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?

The frame of the current function.  Not the whole stack.  As I said, there's no 
visibility of the stack beyond the current function.  (one could implement some 
kind of chaining, I guess)

PTX does not expose the concept of a stack at all.  No stack pointer, no link 
register, no argument pushing.

It does expose 'local' memory, which is private to a thread and only live during 
a function (not like function-scope 'static').  From that we construct stack frames.

The rules of PTX are such that one can (almost) determine the call graph 
statically.  I don't know whether the JIT implements .local as a stack or 
statically allocates it (and perhaps uses a liveness algorithm to determine 
which pieces may overlap).  Perhaps it depends on the physical device capabilities.

The 'almost' fails with indirect calls, except that
1) at an indirect call, you may specify the static set of fns you know it'll 
resolve to;
2) if you don't know that, you have to specify the function prototype anyway. 
So the static set would be 'all functions of that type'.

I don't know if the JIT makes use of that information.

nathan
Jakub Jelinek Dec. 2, 2015, 1:46 p.m. UTC | #7
On Wed, Dec 02, 2015 at 08:38:56AM -0500, Nathan Sidwell wrote:
> On 12/02/15 08:10, Jakub Jelinek wrote:
> >On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> 
> >Always the whole stack, from the current stack pointer up to top of the
> >stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?
> 
> The frame of the current function.  Not the whole stack.  As I said, there's
> no visibility of the stack beyond the current function.  (one could
> implement some kind of chaining, I guess)

So, how does OpenACC cope with this?

Or does the OpenACC execution model not allow anything like that, i.e.
have some function with an automatic variable pass the address of that
variable to some other function and that other function use #acc loop kind
that expects the caller to be at the worker level and splits the work among
the threads in the warp, on the array section pointed by that passed in
pointer?  See the OpenMP testcase I've posted in this thread.

	Jakub
Bernd Schmidt Dec. 2, 2015, 2 p.m. UTC | #8
On 12/02/2015 02:46 PM, Jakub Jelinek wrote:
> Or does the OpenACC execution model not allow anything like that, i.e.
> have some function with an automatic variable pass the address of that
> variable to some other function and that other function use #acc loop kind
> that expects the caller to be at the worker level and splits the work among
> the threads in the warp, on the array section pointed by that passed in
> pointer?  See the OpenMP testcase I've posted in this thread.

I believe you're making a mistake if you think that the OpenACC 
"specification" considers such cases.


Bernd
Nathan Sidwell Dec. 2, 2015, 2:14 p.m. UTC | #9
On 12/02/15 08:46, Jakub Jelinek wrote:

> Or does the OpenACC execution model not allow anything like that, i.e.
> have some function with an automatic variable pass the address of that
> variable to some other function and that other function use #acc loop kind
> that expects the caller to be at the worker level and splits the work among
> the threads in the warp, on the array section pointed by that passed in
> pointer?  See the OpenMP testcase I've posted in this thread.

There are two cases to consider

1) the caller (& address taker) is already partitioned.  Thus the caller's
frames are already copied.  The caller takes the address of the object in its 
own frame.

An example would be calling say __mulcd3 where the return value location is 
passed by pointer.

2) the caller is not partitioned and calls a function containing a partitioned 
loop.  The caller takes the address of its instance of the variable.  As part of 
the RTL expansion we have to convert addresses (to be stored in registers) to 
the generic address space.  That conversion creates a pointer that may be used 
by any thread (on the same CTA)[*].  The function call is  executed by all 
threads (they're partially un-neutered before the call).  In the partitioned 
loop, each thread ends up accessing the location in the frame of the original 
calling active thread.

[*]  although .local is private to each thread, it's placed in memory that is 
reachable from anywhere, provided a generic address is used.  Essentially it's 
like TLS and genericization is simply adding the thread pointer to the local 
memory offset to create a generic address.
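
(A tiny C model of that footnote, with invented names, just to mirror the
"thread pointer plus local offset" idea -- note downthread it turns out the
JIT may do this addition later, at the load/store, rather than at cvta:)

#include <stdint.h>

/* Hypothetical per-thread base of the .local window; invented for
   illustration, not a real PTX or driver interface.  */
extern uintptr_t nvptx_local_base (unsigned tid);

/* Like TLS: a generic pointer is the thread's local-memory base plus the
   in-frame offset, usable by any thread on the same CTA.  */
uintptr_t
genericize_local (unsigned tid, uintptr_t frame_offset)
{
  return nvptx_local_base (tid) + frame_offset;
}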

nathan
Jakub Jelinek Dec. 2, 2015, 2:22 p.m. UTC | #10
On Wed, Dec 02, 2015 at 09:14:03AM -0500, Nathan Sidwell wrote:
> On 12/02/15 08:46, Jakub Jelinek wrote:
> 
> >Or does the OpenACC execution model not allow anything like that, i.e.
> >have some function with an automatic variable pass the address of that
> >variable to some other function and that other function use #acc loop kind
> >that expects the caller to be at the worker level and splits the work among
> >the threads in the warp, on the array section pointed by that passed in
> >pointer?  See the OpenMP testcase I've posted in this thread.
> 
> There are two cases to consider
> 
> 1) the caller (& address taker) is already partitioned.  Thus the caller's
> frames are already copied.  The caller takes the address of the object in
> its own frame.
> 
> An example would be calling say __mulcd3 where the return value location is
> passed by pointer.
> 
> 2) the caller is not partitioned and calls a function containing a
> partitioned loop.  The caller takes the address of its instance of the
> variable.  As part of the RTL expansion we have to convert addresses (to be
> stored in registers) to the generic address space.  That conversion creates
> a pointer that may be used by any thread (on the same CTA)[*].  The function
> call is  executed by all threads (they're partially un-neutered before the
> call).  In the partitioned loop, each thread ends up accessing the location
> in the frame of the original calling active thread.
> 
> [*]  although .local is private to each thread, it's placed in memory that
> is reachable from anywhere, provided a generic address is used.  Essentially
> it's like TLS and genericization is simply adding the thread pointer to the
> local memory offset to create a generic address.

I believe Alex' testing revealed that if you take address of the same .local
objects in several threads, the addresses are the same, and therefore you
refer to your own .local space rather than the other thread's.  Which is why
the -msoft-stack stuff has been added.
Perhaps we need to use it everywhere, at least for OpenMP, and do it
selectively, non-addressable vars can stay .local, addressable vars proven
not to escape to other threads (or other functions that could access them
from other threads) would go to soft stack.

	Jakub
Nathan Sidwell Dec. 2, 2015, 2:23 p.m. UTC | #11
On 12/02/15 09:22, Jakub Jelinek wrote:

> I believe Alex' testing revealed that if you take address of the same .local
> objects in several threads, the addresses are the same, and therefore you
> refer to your own .local space rather than the other thread's.

Before or after applying cvta?

nathan
Jakub Jelinek Dec. 2, 2015, 2:24 p.m. UTC | #12
On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
> On 12/02/15 09:22, Jakub Jelinek wrote:
> 
> >I believe Alex' testing revealed that if you take address of the same .local
> >objects in several threads, the addresses are the same, and therefore you
> >refer to your own .local space rather than the other thread's.
> 
> Before or after applying cvta?

I'll let Alex answer that.

	Jakub
Alexander Monakov Dec. 2, 2015, 2:34 p.m. UTC | #13
On Wed, 2 Dec 2015, Jakub Jelinek wrote:

> On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
> > On 12/02/15 09:22, Jakub Jelinek wrote:
> > 
> > >I believe Alex' testing revealed that if you take address of the same .local
> > >objects in several threads, the addresses are the same, and therefore you
> > >refer to your own .local space rather than the other thread's.
> > 
> > Before or after applying cvta?
> 
> I'll let Alex answer that.

Both before and after, see this email:
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg02081.html

Alexander
Nathan Sidwell Dec. 2, 2015, 2:39 p.m. UTC | #14
On 12/02/15 09:24, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
>> On 12/02/15 09:22, Jakub Jelinek wrote:
>>
>>> I believe Alex' testing revealed that if you take address of the same .local
>>> objects in several threads, the addresses are the same, and therefore you
>>> refer to your own .local space rather than the other thread's.
>>
>> Before or after applying cvta?
>
> I'll let Alex answer that.

Nevermind, I've run an experiment, and it appears that local addresses converted 
to generic do give the same value regardless of executing thread.  I guess that 
means that genericization of local addresses to physical memory is done late at 
the load/store insn, rather than in the cvta insn.

When I added routine support, I did wonder whether the calling routine would
need to clone its stack frame, but decided against it using the logic I wrote
earlier.

nathan
Alexander Monakov Dec. 2, 2015, 2:41 p.m. UTC | #15
On Wed, 2 Dec 2015, Nathan Sidwell wrote:

> On 12/02/15 05:40, Jakub Jelinek wrote:
> > Don't know the HW good enough, is there any power consumption, heat etc.
> > difference between the two approaches?  I mean does the HW consume different
> > amount of power if only one thread in a warp executes code and the other
> > threads in the same warp just jump around it, vs. having all threads busy?
> 
> Having all threads busy will increase power consumption.

Is that from general principles (i.e. "if it doesn't increase power
consumption, the GPU is poorly optimized"), or is that based on specific
knowledge of how existing GPUs operate (presumably reverse-engineered or
privately communicated -- I've never seen any public statements on this
point)?

The only certain case I imagine is instructions that go to SFU rather than
normal SPs -- but those are relatively rare.

> It's also bad if the other vectors are executing memory access instructions.

How so?  The memory accesses are the same independent of whether you're reading
the same data from 1 thread or 32 synchronous threads.

Alexander
Nathan Sidwell Dec. 2, 2015, 2:43 p.m. UTC | #16
On 12/02/15 09:41, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Nathan Sidwell wrote:
>
>> On 12/02/15 05:40, Jakub Jelinek wrote:
>>> Don't know the HW good enough, is there any power consumption, heat etc.
>>> difference between the two approaches?  I mean does the HW consume different
>>> amount of power if only one thread in a warp executes code and the other
>>> threads in the same warp just jump around it, vs. having all threads busy?
>>
>> Having all threads busy will increase power consumption.
>
> Is that from general principles (i.e. "if it doesn't increase power
> consumption, the GPU is poorly optimized"), or is that based on specific
> knowledge of how existing GPUs operate (presumably reverse-engineered or
> privately communicated -- I've never seen any public statements on this
> point)?

Nvidia told me.

> The only certain case I imagine is instructions that go to SFU rather than
> normal SPs -- but those are relatively rare.
>
>> It's also bad if the other vectors are executing memory access instructions.
>
> How so?  The memory accesses are the same independent of whether you're reading
> the same data from 1 thread or 32 synchronous threads.

Nvidia told me.
Alexander Monakov Dec. 2, 2015, 2:54 p.m. UTC | #17
On Wed, 2 Dec 2015, Jakub Jelinek wrote:

> On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > Don't know the HW good enough, is there any power consumption, heat etc.
> > >difference between the two approaches?  I mean does the HW consume different
> > >amount of power if only one thread in a warp executes code and the other
> > >threads in the same warp just jump around it, vs. having all threads busy?
> > 
> > Having all threads busy will increase power consumption.  It's also bad if
> > the other vectors are executing memory access instructions.  However, for
> 
> > Then the uniform SIMT approach might not be such a good idea.

Why?  Remember that the tradeoff is copying registers (and in OpenACC, stacks
too).  We don't know how the costs balance.  My intuition is that copying is
worse compared to what I'm doing.

Anyhow, for good performance the offloaded code needs to be running in vector
regions most of the time, where the concern doesn't apply.

Alexander
Jakub Jelinek Dec. 2, 2015, 3:12 p.m. UTC | #18
On Wed, Dec 02, 2015 at 05:54:51PM +0300, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> 
> > On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > > Don't know the HW good enough, is there any power consumption, heat etc.
> > > >difference between the two approaches?  I mean does the HW consume different
> > > >amount of power if only one thread in a warp executes code and the other
> > > >threads in the same warp just jump around it, vs. having all threads busy?
> > > 
> > > Having all threads busy will increase power consumption.  It's also bad if
> > > the other vectors are executing memory access instructions.  However, for
> > 
> > > Then the uniform SIMT approach might not be such a good idea.
> 
> Why?  Remember that the tradeoff is copying registers (and in OpenACC, stacks
> too).  We don't know how the costs balance.  My intuition is that copying is
> worse compared to what I'm doing.
> 
> Anyhow, for good performance the offloaded code needs to be running in vector
> regions most of the time, where the concern doesn't apply.

But you never know if people actually use #pragma omp simd regions or not;
sometimes they will, sometimes they won't, and if the uniform SIMT increases
power consumption, it might not be desirable.

If we have a reasonable IPA pass to discover which addressable variables can
be shared by multiple threads and which can't, then we could use soft-stack
for those that can be shared by multiple PTX threads (different warps, or
same warp, different threads in it), then we shouldn't need to copy any
stack, just broadcast the scalar vars.
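
(A small C illustration of the split such an IPA pass would make; the
escape function and the per-variable verdicts are invented for this sketch:)

extern void use_from_other_threads (int *);

void
classify_sketch (void)
{
  int a = 1;  /* never addressable: can stay in a register / .local */
  int b = 2;  /* addressable, but the address never leaves this
                 function: .local remains safe */
  int c = 3;  /* address escapes to code other threads may execute:
                 would need to live on the soft stack */
  int *p = &b;
  a += *p;
  use_from_other_threads (&c);
  (void) a;
}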

	Jakub
Nathan Sidwell Dec. 2, 2015, 3:18 p.m. UTC | #19
On 12/02/15 10:12, Jakub Jelinek wrote:

> If we have a reasonable IPA pass to discover which addressable variables can
> be shared by multiple threads and which can't, then we could use soft-stack
> for those that can be shared by multiple PTX threads (different warps, or
> same warp, different threads in it), then we shouldn't need to copy any
> stack, just broadcast the scalar vars.

Note the current scalar (.reg) broadcasting uses the live register set, not
the subset of it that is actually read within the partitioned region.  That'd
be a relatively straightforward optimization, I think.

nathan
Jakub Jelinek Dec. 2, 2015, 4:35 p.m. UTC | #20
On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote:
> > But you never know if people actually use #pragma omp simd regions or not;
> > sometimes they will, sometimes they won't, and if the uniform SIMT increases
> > power consumption, it might not be desirable.
> 
> It's easy to address: just terminate threads 1-31 if the linked image has
> no SIMD regions, like my pre-simd libgomp was doing.

Well, can't the linked image in one shared library call a function
in another linked image in another shared library?  Or is that just not
supported for PTX?  I believe XeonPhi supports that.

If each linked image is self-contained, then that is probably a good idea,
but still you could have a single simd region somewhere and lots of other
target regions that don't use simd, or cases where only a small amount of time
is spent in a simd region and this wouldn't help in that case.

If the addressables are handled through soft stack, then the rest is mostly
just SSA_NAMEs you can see on the edges of the SIMT region, which really
shouldn't be that expensive to broadcast or reduce back.

	Jakub
Nathan Sidwell Dec. 2, 2015, 5:09 p.m. UTC | #21
On 12/02/15 11:35, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote:
>>> But you never know if people actually use #pragma omp simd regions or not;
>>> sometimes they will, sometimes they won't, and if the uniform SIMT increases
>>> power consumption, it might not be desirable.
>>
>> It's easy to address: just terminate threads 1-31 if the linked image has
>> no SIMD regions, like my pre-simd libgomp was doing.
>
> Well, can't the linked image in one shared library call a function
> in another linked image in another shared library?  Or is that just not
> supported for PTX?  I believe XeonPhi supports that.

I don't believe PTX supports such dynamic loading within the PTX program 
currently being executed.  The JIT compiler can have several PTX 'objects' 
loaded into it before you tell it to go link everything.  At that point all 
symbols must be resolved.  I've no idea as to how passing a pointer to a 
function in some other 'executable' and calling it might behave -- my suspicion 
is 'badly'.
Alexander Monakov Dec. 2, 2015, 5:09 p.m. UTC | #22
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> > It's easy to address: just terminate threads 1-31 if the linked image has
> > no SIMD regions, like my pre-simd libgomp was doing.
> 
> Well, can't say the linked image in one shared library call a function
> in another linked image in another shared library?  Or is that just not
> supported for PTX?  I believe XeonPhi supports that.

I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
it's not an issue.  E.g. check early in gomp_nvptx_main if .weak
__nvptx_has_simd != 0.  It would only break if there was dlopen on PTX.
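
(A C sketch of that check, assuming __nvptx_has_simd is a weak data symbol
defaulting to zero, with images that contain SIMD regions providing a strong
non-zero definition -- the actual shape of gomp_nvptx_main may differ:)

/* Weak definition: overridden by a strong one in images with SIMD regions.  */
int __nvptx_has_simd __attribute__ ((weak));

void
gomp_nvptx_main_entry_sketch (unsigned tid_x)
{
  if (__nvptx_has_simd == 0 && tid_x != 0)
    return;  /* no SIMD regions linked in: threads 1-31 exit early */
  /* ... otherwise continue with all warp members active ... */
}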

> If each linked image is self-contained, then that is probably a good idea,
> but still you could have a single simd region somewhere and lots of other
> target regions that don't use simd, or cases where only a small amount of time
> is spent in a simd region and this wouldn't help in that case.

Should we actually be much concerned about optimizing this case, which
is unlikely to run faster than the host CPU in the first place?

> If the addressables are handled through soft stack, then the rest is mostly
> just SSA_NAMEs you can see on the edges of the SIMT region, which really
> shouldn't be that expensive to broadcast or reduce back.

That's not enough: you have to reach the SIMD region entry in threads 1-31,
which means they need to execute all preceding control flow like thread 0,
which means they need to compute controlling predicates like thread 0.
(OpenACC broadcasts controlling predicates at branches)
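
(An invented C example of why that matters:)

extern int simd_region (int);

int
reach_simd_sketch (int n)
{
  int r = 0;
  /* Threads 1-31 must run this loop and evaluate the test exactly like
     thread 0; broadcasting values only at the SIMD region edge would not
     get them here with the same i and r.  */
  for (int i = 0; i < n; i++)
    if (i % 2 == 0)
      r += simd_region (i);  /* all warp members must enter together */
  return r;
}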

Alexander
Nathan Sidwell Dec. 2, 2015, 5:20 p.m. UTC | #23
On 12/02/15 12:09, Alexander Monakov wrote:

> I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
> it's not an issue.  E.g. check early in gomp_nvptx_main if .weak
> __nvptx_has_simd != 0.  It would only break if there was dlopen on PTX.

Note I found a bug in .weak support.  See the comment in  gcc.dg/special/weak-2.c

/* NVPTX's implementation of weak is broken when a strong symbol is in
    a later object file than the weak definition.   */

> That's not enough: you have to reach the SIMD region entry in threads 1-31,
> which means they need to execute all preceding control flow like thread 0,
> which means they need to compute controlling predicates like thread 0.
> (OpenACC broadcasts controlling predicates at branches)

indeed.  Hence the partial 'forking' before a function call of a function with 
internal partitioned execution.

nathan
Alexander Monakov Dec. 3, 2015, 1:57 p.m. UTC | #24
On Wed, 2 Dec 2015, Nathan Sidwell wrote:
> On 12/02/15 12:09, Alexander Monakov wrote:
> 
> > I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
> > it's not an issue.  E.g. check early in gomp_nvptx_main if .weak
> > __nvptx_has_simd != 0.  It would only break if there was dlopen on PTX.
> 
> Note I found a bug in .weak support.  See the comment in
> gcc.dg/special/weak-2.c
> 
> /* NVPTX's implementation of weak is broken when a strong symbol is in
>    a later object file than the weak definition.   */

Thanks for the warning.  However, the issue seems limited to function symbols:
I've made a test for data symbols, and they appear to work fine -- which
suffices in this context.

Alexander
Nathan Sidwell Dec. 7, 2015, 3:09 p.m. UTC | #25
On 12/01/15 11:01, Bernd Schmidt wrote:
> On 12/01/2015 04:28 PM, Alexander Monakov wrote:
>> I'm taking a different approach.  I want to execute all insns in all warp
>> members, while ensuring that the effect (on global and local state) is the
>> same as if a single thread was executing that instruction.  Most instructions
>> automatically satisfy that: if threads have the same state, then executing an
>> arithmetic instruction, normal memory load/store, etc. keeps local state the
>> same in all threads.
>>
>> The two exception insn categories are atomics and calls.  For calls, we can
>> demand recursively that they uphold this execution model, until we reach
>> runtime-provided "syscalls": malloc/free/vprintf.  Those we can handle like
>> atomics.
>
> Didn't we also conclude that address-taking (let's say for stack addresses) is
> also an operation that does not result in the same state?
>
> Have you tried to use the mechanism used for OpenACC? IMO that would be a good
> first step - get things working with fewer changes, and then look into
> optimizing them (ideally for OpenMP and OpenACC both).

I would have thought the right approach would be to augment the existing 
neutering code to insert predication (instead of branch-around) using a 
heuristic as to which is the better choice.

nathan

Patch

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 2dad3e2..9209b47 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -117,6 +117,9 @@  static GTY(()) rtx worker_red_sym;
 /* True if any function references __nvptx_stacks.  */
 static bool need_softstack_decl;
 
+/* True if any function references __nvptx_uni.  */
+static bool need_unisimt_decl;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -599,6 +602,33 @@  nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }
 
+/* Emit code to initialize predicate and master lane index registers for
+   -muniform-simt code generation variant.  */
+
+static void
+nvptx_init_unisimt_predicate (FILE *file)
+{
+  int bits = BITS_PER_WORD;
+  int master = REGNO (cfun->machine->unisimt_master);
+  int pred = REGNO (cfun->machine->unisimt_predicate);
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32 %%ustmp0;\n");
+  fprintf (file, "\t\t.reg.u%d %%ustmp1;\n", bits);
+  fprintf (file, "\t\t.reg.u%d %%ustmp2;\n", bits);
+  fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.y;\n");
+  fprintf (file, "\t\tmul%s.u32 %%ustmp1, %%ustmp0, 4;\n",
+	   bits == 64 ? ".wide" : "");
+  fprintf (file, "\t\tmov.u%d %%ustmp2, __nvptx_uni;\n", bits);
+  fprintf (file, "\t\tadd.u%d %%ustmp2, %%ustmp2, %%ustmp1;\n", bits);
+  fprintf (file, "\t\tld.shared.u32 %%r%d, [%%ustmp2];\n", master);
+  fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.x;\n");
+  /* rNN = tid.x & __nvptx_uni[tid.y];  */
+  fprintf (file, "\t\tand.b32 %%r%d, %%r%d, %%ustmp0;\n", master, master);
+  fprintf (file, "\t\tsetp.eq.u32 %%r%d, %%r%d, %%ustmp0;\n", pred, master);
+  fprintf (file, "\t}\n");
+  need_unisimt_decl = true;
+}
+
 /* Emit kernel NAME for function ORIG outlined for an OpenMP 'target' region:
 
    extern void gomp_nvptx_main (void (*fn)(void*), void *fnarg);
@@ -811,6 +841,8 @@  nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (cfun->machine->axis_predicate[1])
     nvptx_init_axis_predicate (file,
 			       REGNO (cfun->machine->axis_predicate[1]), "x");
+  if (cfun->machine->unisimt_predicate)
+    nvptx_init_unisimt_predicate (file);
 }
 
 /* Output a return instruction.  Also copy the return value to its outgoing
@@ -2394,6 +2426,86 @@  nvptx_reorg_subreg (void)
     }
 }
 
+/* Return a SImode "master lane index" register for uniform-simt, allocating on
+   first use.  */
+
+static rtx
+nvptx_get_unisimt_master ()
+{
+  rtx &master = cfun->machine->unisimt_master;
+  return master ? master : master = gen_reg_rtx (SImode);
+}
+
+/* Return a BImode "predicate" register for uniform-simt, similar to above.  */
+
+static rtx
+nvptx_get_unisimt_predicate ()
+{
+  rtx &pred = cfun->machine->unisimt_predicate;
+  return pred ? pred : pred = gen_reg_rtx (BImode);
+}
+
+/* Return true if given call insn references one of the functions provided by
+   the CUDA runtime: malloc, free, vprintf.  */
+
+static bool
+nvptx_call_insn_is_syscall_p (rtx_insn *insn)
+{
+  rtx pat = PATTERN (insn);
+  gcc_checking_assert (GET_CODE (pat) == PARALLEL);
+  pat = XVECEXP (pat, 0, 0);
+  if (GET_CODE (pat) == SET)
+    pat = SET_SRC (pat);
+  gcc_checking_assert (GET_CODE (pat) == CALL
+		       && GET_CODE (XEXP (pat, 0)) == MEM);
+  rtx addr = XEXP (XEXP (pat, 0), 0);
+  if (GET_CODE (addr) != SYMBOL_REF)
+    return false;
+  const char *name = XSTR (addr, 0);
+  return (!strcmp (name, "vprintf")
+	  || !strcmp (name, "__nvptx_real_malloc")
+	  || !strcmp (name, "__nvptx_real_free"));
+}
+
+/* If SET subexpression of INSN sets a register, emit a shuffle instruction to
+   propagate its value from lane MASTER to current lane.  */
+
+static void
+nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master)
+{
+  rtx reg;
+  if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set)))
+    emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn);
+}
+
+/* Adjust code for uniform-simt code generation variant by making atomics and
+   "syscalls" conditionally executed, and inserting shuffle-based propagation
+   for registers being set.  */
+
+static void
+nvptx_reorg_uniform_simt ()
+{
+  rtx_insn *insn, *next;
+
+  for (insn = get_insns (); insn; insn = next)
+    {
+      next = NEXT_INSN (insn);
+      if (!(CALL_P (insn) && nvptx_call_insn_is_syscall_p (insn))
+	  && !(NONJUMP_INSN_P (insn)
+	       && GET_CODE (PATTERN (insn)) == PARALLEL
+	       && get_attr_divergent (insn)))
+	continue;
+      rtx pat = PATTERN (insn);
+      rtx master = nvptx_get_unisimt_master ();
+      for (int i = 0; i < XVECLEN (pat, 0); i++)
+	nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
+      rtx pred = nvptx_get_unisimt_predicate ();
+      pred = gen_rtx_NE (BImode, pred, const0_rtx);
+      pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat);
+      validate_change (insn, &PATTERN (insn), pat, false);
+    }
+}
+
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
 
@@ -2872,6 +2984,15 @@  nvptx_wsync (bool after)
   return gen_nvptx_barsync (GEN_INT (after));
 }
 
+/* Return a BImode "axis predicate" register, allocating on first use.  */
+
+static rtx
+nvptx_get_axis_predicate (int axis)
+{
+  rtx &pred = cfun->machine->axis_predicate[axis];
+  return pred ? pred : pred = gen_reg_rtx (BImode);
+}
+
 /* Single neutering according to MASK.  FROM is the incoming block and
    TO is the outgoing block.  These may be the same block. Insert at
    start of FROM:
@@ -2956,14 +3077,7 @@  nvptx_single (unsigned mask, basic_block from, basic_block to)
     if (GOMP_DIM_MASK (mode) & skip_mask)
       {
 	rtx_code_label *label = gen_label_rtx ();
-	rtx pred = cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER];
-
-	if (!pred)
-	  {
-	    pred = gen_reg_rtx (BImode);
-	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
-	  }
-	
+	rtx pred = nvptx_get_axis_predicate (mode - GOMP_DIM_WORKER);
 	rtx br;
 	if (mode == GOMP_DIM_VECTOR)
 	  br = gen_br_true (pred, label);
@@ -3202,6 +3316,9 @@  nvptx_reorg (void)
   /* Replace subregs.  */
   nvptx_reorg_subreg ();
 
+  if (TARGET_UNIFORM_SIMT)
+    nvptx_reorg_uniform_simt ();
+
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -3379,6 +3496,11 @@  nvptx_file_end (void)
       fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n;",
 	       BITS_PER_WORD);
     }
+  if (need_unisimt_decl)
+    {
+      fprintf (asm_out_file, "// BEGIN GLOBAL VAR DECL: __nvptx_uni\n");
+      fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n;");
+    }
 }
 
 /* Expander for the shuffle builtins.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index db8e201..1c605df 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -33,6 +33,8 @@ 
       builtin_define ("__nvptx__");		\
       if (TARGET_SOFT_STACK)			\
         builtin_define ("__nvptx_softstack__");	\
+      if (TARGET_UNIFORM_SIMT)			\
+        builtin_define ("__nvptx_unisimt__");	\
     } while (0)
 
 /* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@@ -234,6 +236,8 @@  struct GTY(()) machine_function
   int ret_reg_mode; /* machine_mode not defined yet. */
   int punning_buffer_size;
   rtx axis_predicate[2];
+  rtx unisimt_master;
+  rtx unisimt_predicate;
 };
 #endif
 
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 5ce7a89..f0fc02c 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -75,6 +75,9 @@  (define_c_enum "unspecv" [
 (define_attr "subregs_ok" "false,true"
   (const_string "false"))
 
+(define_attr "divergent" "false,true"
+  (const_string "false"))
+
 (define_predicate "nvptx_register_operand"
   (match_code "reg,subreg")
 {
@@ -1519,7 +1522,8 @@  (define_insn "atomic_compare_and_swap<mode>_1"
    (set (match_dup 1)
 	(unspec_volatile:SDIM [(const_int 0)] UNSPECV_CAS))]
   ""
-  "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;")
+  "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_exchange<mode>"
   [(set (match_operand:SDIM 0 "nvptx_register_operand" "=R")	;; output
@@ -1530,7 +1534,8 @@  (define_insn "atomic_exchange<mode>"
    (set (match_dup 1)
 	(match_operand:SDIM 2 "nvptx_register_operand" "R"))]	;; input
   ""
-  "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;")
+  "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_fetch_add<mode>"
   [(set (match_operand:SDIM 1 "memory_operand" "+m")
@@ -1542,7 +1547,8 @@  (define_insn "atomic_fetch_add<mode>"
    (set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   ""
-  "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+  "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_fetch_addsf"
   [(set (match_operand:SF 1 "memory_operand" "+m")
@@ -1554,7 +1560,8 @@  (define_insn "atomic_fetch_addsf"
    (set (match_operand:SF 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   ""
-  "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+  "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_code_iterator any_logic [and ior xor])
 (define_code_attr logic [(and "and") (ior "or") (xor "xor")])
@@ -1570,7 +1577,8 @@  (define_insn "atomic_fetch_<logic><mode>"
    (set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   "0"
-  "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
+  "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "nvptx_barsync"
   [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 7ab09b9..47e811e 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -32,3 +32,7 @@  Link in code for a __main kernel.
 msoft-stack
 Target Report Mask(SOFT_STACK)
 Use custom stacks instead of local memory for automatic storage.
+
+muniform-simt
+Target Report Mask(UNIFORM_SIMT)
+Generate code that executes all threads in a warp as if one was active.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 6e45fb6..46cd2e9 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -18942,6 +18942,20 @@  in shared memory array @code{char *__nvptx_stacks[]} at position @code{tid.y}
 as the stack pointer.  This is for placing automatic variables into storage
 that can be accessed from other threads, or modified with atomic instructions.
 
+@item -muniform-simt
+@opindex muniform-simt
+Switch to code generation variant that allows to execute all threads in each
+warp, while maintaining memory state and side effects as if only one thread
+in each warp was active outside of OpenMP SIMD regions.  All atomic operations
+and calls to runtime (malloc, free, vprintf) are conditionally executed (iff
+current lane index equals the master lane index), and the register being
+assigned is copied via a shuffle instruction from the master lane.  Outside of
+SIMD regions lane 0 is the master; inside, each thread sees itself as the
+master.  Shared memory array @code{int __nvptx_uni[]} stores all-zeros or
+all-ones bitmasks for each warp, indicating current mode (0 outside of SIMD
+regions).  Each thread can bitwise-and the bitmask at position @code{tid.y}
+with current lane index to compute the master lane index.
+
 @end table
 
 @node PDP-11 Options