
[gomp-nvptx,2/9] nvptx backend: new "uniform SIMT" codegen variant

Message ID 1448983707-18854-3-git-send-email-amonakov@ispras.ru
State New

Commit Message

Alexander Monakov Dec. 1, 2015, 3:28 p.m. UTC
This patch introduces a code generation variant for NVPTX that I'm using for
SIMD work in OpenMP offloading.  Let me try to explain the idea behind it...

In place of SIMD vectorization, NVPTX uses SIMT (single
instruction/multiple threads) execution: groups of 32 threads execute the same
instruction, with some threads possibly masked off when under a divergent
branch.  So we map OpenMP threads to such thread groups ("warps"), and
hardware threads are then mapped to OpenMP SIMD lanes.

We need to reach the heads of SIMD regions with all hardware threads active,
because there's no way to "resurrect" them once masked off: they need to
follow the same control flow and reach the SIMD region entry with the same
local state (registers, and the stack too for OpenACC).

The approach in OpenACC is, outside of "vector" loops, to 1) make threads 1-31
"slaves" that just follow branches without doing any computation -- which
requires extra jumps and broadcasting of branch predicates -- and 2) broadcast
register and stack state from master to slaves when entering "vector" regions.
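
As a rough C model of that scheme -- hand-written for this description, with
compute_condition and warp_broadcast as hypothetical stand-ins for what the
neutering pass actually emits, not a real API:

extern int compute_condition (void);
extern int warp_broadcast (int value, int src_lane);

void
neutered_branch_sketch (int lane)
{
  int cond = 0;
  if (lane == 0)
    cond = compute_condition ();    /* only the master computes */
  cond = warp_broadcast (cond, 0);  /* slaves receive the predicate */
  if (cond)
    {
      /* Every lane takes the same direction, keeping control flow
         uniform across the warp.  */
    }
}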

I'm taking a different approach.  I want to execute all insns in all warp
members, while ensuring that the effect (on global and local state) is the
same as if a single thread was executing that instruction.  Most instructions
automatically satisfy that: if threads have the same state, then executing an
arithmetic instruction, normal memory load/store, etc. keeps local state the
same in all threads.

The two exception insn categories are atomics and calls.  For calls, we can
demand recursively that they uphold this execution model, until we reach
runtime-provided "syscalls": malloc/free/vprintf.  Those we can handle like
atomics.

To handle atomics, we
  1) execute the atomic conditionally only in one warp member -- so its side
  effect happens once;
  2) copy the register that was set from that warp member to others -- so
  local state is kept synchronized:

    atom.op dest, ...

becomes

    /* pred = (current_lane == 0);  */
    @pred atom.op dest, ...
    shuffle.idx dest, dest, /*srclane=*/0

So the overhead is one shuffle insn following each atomic, plus predicate
setup in the prologue.

OK, so the above handles execution outside of SIMD regions nicely, but we
also need to run code inside SIMD regions, where this synchronizing effect
must be turned off.  It turns out we can keep atomics decorated almost as
before:

    @pred atom.op dest, ...
    shuffle.idx dest, dest, master_lane

and compute 'pred' and 'master_lane' accordingly: outside of SIMD regions we
need (master_lane == 0 && pred == (current_lane == 0)), and inside we need
(master_lane == current_lane && pred == true) (so that the shuffle is a no-op,
and the predicate is 'true' for all lanes).  Then pred = (current_lane ==
master_lane) works in both cases, and we just need to set up master_lane
accordingly: master_lane = current_lane & mask, where mask is all-0 outside of
SIMD regions and all-1 inside.  To store these per-warp masks, I've
introduced another shared memory array, __nvptx_uni.
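
Here's a minimal self-contained C model of that computation (it just
simulates the per-lane arithmetic described above, in both modes):

#include <stdio.h>

int
main (void)
{
  /* Per-warp mask as stored in __nvptx_uni: all-0 outside SIMD regions,
     all-1 inside.  */
  unsigned mask[2] = { 0x00000000u, 0xffffffffu };
  for (int in_simd = 0; in_simd < 2; in_simd++)
    for (unsigned lane = 0; lane < 32; lane++)
      {
        unsigned master_lane = lane & mask[in_simd];
        int pred = (lane == master_lane);
        /* Outside SIMD regions pred holds for lane 0 only and the shuffle
           reads from lane 0; inside, pred holds for every lane and the
           shuffle from master_lane == lane is a no-op.  */
        if (pred)
          printf ("in_simd=%d: lane %2u runs the atomic, master %2u\n",
                  in_simd, lane, master_lane);
      }
  return 0;
}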

	* config/nvptx/nvptx.c (need_unisimt_decl): New variable.  Set it...
	(nvptx_init_unisimt_predicate): ...here (new function) and use it...
	(nvptx_file_end): ...here to emit declaration of __nvptx_uni array.
	(nvptx_declare_function_name): Call nvptx_init_unisimt_predicate.
	(nvptx_get_unisimt_master): New helper function.
	(nvptx_get_unisimt_predicate): Ditto.
	(nvptx_call_insn_is_syscall_p): Ditto.
	(nvptx_unisimt_handle_set): Ditto.
	(nvptx_reorg_uniform_simt): New.  Transform code for -muniform-simt.
	(nvptx_get_axis_predicate): New helper function, factored out from...
	(nvptx_single): ...here.
	(nvptx_reorg): Call nvptx_reorg_uniform_simt.
	* config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_unisimt__ when -muniform-simt option is active.
	(struct machine_function): Add unisimt_master, unisimt_predicate
	rtx fields.
	* config/nvptx/nvptx.md (divergent): New attribute.
	(atomic_compare_and_swap<mode>_1): Mark as divergent.
	(atomic_exchange<mode>): Ditto.
	(atomic_fetch_add<mode>): Ditto.
	(atomic_fetch_addsf): Ditto.
	(atomic_fetch_<logic><mode>): Ditto.
	* config/nvptx/nvptx.opt (muniform-simt): New option.
	* doc/invoke.texi (-muniform-simt): Document.
---
 gcc/config/nvptx/nvptx.c   | 138 ++++++++++++++++++++++++++++++++++++++++++---
 gcc/config/nvptx/nvptx.h   |   4 ++
 gcc/config/nvptx/nvptx.md  |  18 ++++--
 gcc/config/nvptx/nvptx.opt |   4 ++
 gcc/doc/invoke.texi        |  14 +++++
 5 files changed, 165 insertions(+), 13 deletions(-)

Comments

Bernd Schmidt Dec. 1, 2015, 4:01 p.m. UTC | #1
On 12/01/2015 04:28 PM, Alexander Monakov wrote:
> I'm taking a different approach.  I want to execute all insns in all warp
> members, while ensuring that the effect (on global and local state) is the
> same as if a single thread was executing that instruction.  Most instructions
> automatically satisfy that: if threads have the same state, then executing an
> arithmetic instruction, normal memory load/store, etc. keeps local state the
> same in all threads.
>
> The two exception insn categories are atomics and calls.  For calls, we can
> demand recursively that they uphold this execution model, until we reach
> runtime-provided "syscalls": malloc/free/vprintf.  Those we can handle like
> atomics.

Didn't we also conclude that address-taking (let's say for stack 
addresses) is also an operation that does not result in the same state?

Have you tried to use the mechanism used for OpenACC? IMO that would be 
a good first step - get things working with fewer changes, and then look 
into optimizing them (ideally for OpenMP and OpenACC both).


Bernd
Alexander Monakov Dec. 1, 2015, 4:20 p.m. UTC | #2
On Tue, 1 Dec 2015, Bernd Schmidt wrote:
> 
> Didn't we also conclude that address-taking (let's say for stack addresses) is
> also an operation that does not result in the same state?

This is intended to be used with soft-stacks in OpenMP offloading, and
soft-stacks are per-warp outside of SIMD regions, not private to hwthread.  So
no such problem arises.

(also, I wouldn't phrase it that way -- I wouldn't say that taking the address
of a classic .local stack slot desyncs state)

> Have you tried to use the mechanism used for OpenACC? IMO that would be a good
> first step - get things working with fewer changes, and then look into
> optimizing them (ideally for OpenMP and OpenACC both).

I don't think I would have as much success trying to apply the OpenACC
mechanism with the overall direction I'm taking, that is, running with a
slightly modified libgomp port.  The way parallel regions are activated in the
guts of libgomp via GOMP_parallel/gomp_team_start makes things different, for
example.

Alexander
Jakub Jelinek Dec. 2, 2015, 10:40 a.m. UTC | #3
On Tue, Dec 01, 2015 at 06:28:20PM +0300, Alexander Monakov wrote:
> The approach in OpenACC is, outside of "vector" loops, to 1) make threads 1-31
> "slaves" that just follow branches without doing any computation -- which
> requires extra jumps and broadcasting of branch predicates -- and 2) broadcast
> register and stack state from master to slaves when entering "vector" regions.
> 
> I'm taking a different approach.  I want to execute all insns in all warp
> members, while ensuring that the effect (on global and local state) is the
> same as if a single thread was executing that instruction.  Most instructions
> automatically satisfy that: if threads have the same state, then executing an
> arithmetic instruction, normal memory load/store, etc. keeps local state the
> same in all threads.

Don't know the HW good enough, is there any power consumption, heat etc.
difference between the two approaches?  I mean does the HW consume different
amount of power if only one thread in a warp executes code and the other
threads in the same warp just jump around it, vs. having all threads busy?

If it is the same, then I think your approach is reasonable, but my
understanding of PTX is limited.

How exactly does OpenACC copy the stack?  At least for OpenMP, one could
have automatic vars whose addresses are passed to simd regions in different
functions, say like:

void
baz (int x, int *arr)
{
  int i;
  #pragma omp simd
  for (i = 0; i < 128; i++)
    arr[i] *= arr[i] + i + x; // Replace with something useful and expensive
}

void
bar (int x)
{
  int arr[128], i;
  for (i = 0; i < 128; i++)
    arr[i] = i + x;
  baz (x, arr);
}
#pragma omp declare target to (bar, baz)

void
foo ()
{
  int i;
  #pragma omp target teams distribute parallel for
  for (i = 0; i < 131072; i++)
    bar (i);
}
and without inlining you don't know if the arr in bar above will be shared
by all SIMD lanes (SIMT in PTX case) or not.

	Jakub
Nathan Sidwell Dec. 2, 2015, 1:02 p.m. UTC | #4
On 12/02/15 05:40, Jakub Jelinek wrote:
>  Don't know the HW good enough, is there any power consumption, heat etc.
> difference between the two approaches?  I mean does the HW consume different
> amount of power if only one thread in a warp executes code and the other
> threads in the same warp just jump around it, vs. having all threads busy?

Having all threads busy will increase power consumption.  It's also bad if the 
other vectors are executing memory access instructions.  However, for small 
blocks, it is probably a win over the jump around approach.  One of the 
optimizations for the future of the neutering algorithm is to add such 
predication for small blocks and keep branching for the larger blocks.

> How exactly does OpenACC copy the stack?  At least for OpenMP, one could
> have automatic vars whose addresses are passed to simd regions in different
> functions, say like:

The stack frame of the current function is copied when entering a partitioned 
region.  (There is no visibility of caller's frame and such.) Again, 
optimization would be trying to only copy the stack that's used in the 
partitioned region.

nathan
Jakub Jelinek Dec. 2, 2015, 1:10 p.m. UTC | #5
On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> On 12/02/15 05:40, Jakub Jelinek wrote:
> > Don't know the HW good enough, is there any power consumption, heat etc.
> >difference between the two approaches?  I mean does the HW consume different
> >amount of power if only one thread in a warp executes code and the other
> >threads in the same warp just jump around it, vs. having all threads busy?
> 
> Having all threads busy will increase power consumption.  It's also bad if
> the other vectors are executing memory access instructions.  However, for

Then the uniform SIMT approach might not be such a good idea.

> small blocks, it is probably a win over the jump around approach.  One of
> the optimizations for the future of the neutering algorithm is to add such
> predication for small blocks and keep branching for the larger blocks.
> 
> >How exactly does OpenACC copy the stack?  At least for OpenMP, one could
> >have automatic vars whose addresses are passed to simd regions in different
> >functions, say like:
> 
> The stack frame of the current function is copied when entering a
> partitioned region.  (There is no visibility of caller's frame and such.)
> Again, optimization would be trying to only copy the stack that's used in
> the partitioned region.

Always the whole stack, from the current stack pointer up to top of the
stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?

	Jakub
Nathan Sidwell Dec. 2, 2015, 1:38 p.m. UTC | #6
On 12/02/15 08:10, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:

> Always the whole stack, from the current stack pointer up to top of the
> stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?

The frame of the current function.  Not the whole stack.  As I said, there's no 
visibility of the stack beyond the current function.  (one could implement some 
kind of chaining, I guess)

PTX does not expose the concept of a stack at all.  No stack pointer, no link 
register, no argument pushing.

It does expose 'local' memory, which is private to a thread and only live during 
a function (not like function-scope 'static').  From that we construct stack frames.

The rules of PTX are such that one can (almost) determine the call graph 
statically.  I don't know whether the JIT implements .local as a stack or 
statically allocates it (and perhaps uses a liveness algorithm to determine 
which pieces may overlap).  Perhaps it depends on the physical device capabilities.

The 'almost' fails with indirect calls, except that
1) at an indirect call, you may specify the static set of fns you know it'll 
resolve to;
2) if you don't know that, you have to specify the function prototype anyway. 
So the static set would be 'all functions of that type'.

I don't know if the JIT makes use of that information.

nathan
Jakub Jelinek Dec. 2, 2015, 1:46 p.m. UTC | #7
On Wed, Dec 02, 2015 at 08:38:56AM -0500, Nathan Sidwell wrote:
> On 12/02/15 08:10, Jakub Jelinek wrote:
> >On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> 
> >Always the whole stack, from the current stack pointer up to top of the
> >stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?
> 
> The frame of the current function.  Not the whole stack.  As I said, there's
> no visibility of the stack beyond the current function.  (one could
> implement some kind of chaining, I guess)

So, how does OpenACC cope with this?

Or does the OpenACC execution model not allow anything like that, i.e.
have some function with an automatic variable pass the address of that
variable to some other function and that other function use #acc loop kind
that expects the caller to be at the worker level and splits the work among
the threads in the warp, on the array section pointed by that passed in
pointer?  See the OpenMP testcase I've posted in this thread.

	Jakub
Bernd Schmidt Dec. 2, 2015, 2 p.m. UTC | #8
On 12/02/2015 02:46 PM, Jakub Jelinek wrote:
> Or does the OpenACC execution model not allow anything like that, i.e.
> have some function with an automatic variable pass the address of that
> variable to some other function and that other function use #acc loop kind
> that expects the caller to be at the worker level and splits the work among
> the threads in the warp, on the array section pointed by that passed in
> pointer?  See the OpenMP testcase I've posted in this thread.

I believe you're making a mistake if you think that the OpenACC 
"specification" considers such cases.


Bernd
Nathan Sidwell Dec. 2, 2015, 2:14 p.m. UTC | #9
On 12/02/15 08:46, Jakub Jelinek wrote:

> Or does the OpenACC execution model not allow anything like that, i.e.
> have some function with an automatic variable pass the address of that
> variable to some other function and that other function use #acc loop kind
> that expects the caller to be at the worker level and splits the work among
> the threads in the warp, on the array section pointed by that passed in
> pointer?  See the OpenMP testcase I've posted in this thread.

There are two cases to consider

1) the caller (& address taker) is already partitioned.  Thus the caller's
frames are already copied.  The caller takes the address of the object in its 
own frame.

An example would be calling say __mulcd3 where the return value location is 
passed by pointer.

2) the caller is not partitioned and calls a function containing a partitioned 
loop.  The caller takes the address of its instance of the variable.  As part of 
the RTL expansion we have to convert addresses (to be stored in registers) to 
the generic address space.  That conversion creates a pointer that may be used 
by any thread (on the same CTA)[*].  The function call is  executed by all 
threads (they're partially un-neutered before the call).  In the partitioned 
loop, each thread ends up accessing the location in the frame of the original 
calling active thread.

[*]  although .local is private to each thread, it's placed in memory that is 
reachable from anywhere, provided a generic address is used.  Essentially it's 
like TLS and genericization is simply adding the thread pointer to the local 
memory offset to create a generic address.
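
(A tiny C model of that footnote, with invented names, just to mirror the
"thread pointer plus local offset" idea -- note downthread it turns out the
JIT may do this addition later, at the load/store, rather than at cvta:)

#include <stdint.h>

/* Hypothetical per-thread base of the .local window; invented for
   illustration, not a real PTX or driver interface.  */
extern uintptr_t nvptx_local_base (unsigned tid);

/* Like TLS: a generic pointer is the thread's local-memory base plus the
   in-frame offset, usable by any thread on the same CTA.  */
uintptr_t
genericize_local (unsigned tid, uintptr_t frame_offset)
{
  return nvptx_local_base (tid) + frame_offset;
}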

nathan
Jakub Jelinek Dec. 2, 2015, 2:22 p.m. UTC | #10
On Wed, Dec 02, 2015 at 09:14:03AM -0500, Nathan Sidwell wrote:
> On 12/02/15 08:46, Jakub Jelinek wrote:
> 
> >Or does the OpenACC execution model not allow anything like that, i.e.
> >have some function with an automatic variable pass the address of that
> >variable to some other function and that other function use #acc loop kind
> >that expects the caller to be at the worker level and splits the work among
> >the threads in the warp, on the array section pointed by that passed in
> >pointer?  See the OpenMP testcase I've posted in this thread.
> 
> There are two cases to consider
> 
> 1) the caller (& address taker) is already partitioned.  Thus the caller's
> frames are already copied.  The caller takes the address of the object in
> its own frame.
> 
> An example would be calling say __mulcd3 where the return value location is
> passed by pointer.
> 
> 2) the caller is not partitioned and calls a function containing a
> partitioned loop.  The caller takes the address of its instance of the
> variable.  As part of the RTL expansion we have to convert addresses (to be
> stored in registers) to the generic address space.  That conversion creates
> a pointer that may be used by any thread (on the same CTA)[*].  The function
> call is  executed by all threads (they're partially un-neutered before the
> call).  In the partitioned loop, each thread ends up accessing the location
> in the frame of the original calling active thread.
> 
> [*]  although .local is private to each thread, it's placed in memory that
> is reachable from anywhere, provided a generic address is used.  Essentially
> it's like TLS and genericization is simply adding the thread pointer to the
> local memory offset to create a generic address.

I believe Alex' testing revealed that if you take address of the same .local
objects in several threads, the addresses are the same, and therefore you
refer to your own .local space rather than the other thread's.  Which is why
the -msoft-stack stuff has been added.
Perhaps we need to use it everywhere, at least for OpenMP, and do it
selectively, non-addressable vars can stay .local, addressable vars proven
not to escape to other threads (or other functions that could access them
from other threads) would go to soft stack.

	Jakub
Nathan Sidwell Dec. 2, 2015, 2:23 p.m. UTC | #11
On 12/02/15 09:22, Jakub Jelinek wrote:

> I believe Alex' testing revealed that if you take address of the same .local
> objects in several threads, the addresses are the same, and therefore you
> refer to your own .local space rather than the other thread's.

Before or after applying cvta?

nathan
Jakub Jelinek Dec. 2, 2015, 2:24 p.m. UTC | #12
On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
> On 12/02/15 09:22, Jakub Jelinek wrote:
> 
> >I believe Alex' testing revealed that if you take address of the same .local
> >objects in several threads, the addresses are the same, and therefore you
> >refer to your own .local space rather than the other thread's.
> 
> Before or after applying cvta?

I'll let Alex answer that.

	Jakub
Alexander Monakov Dec. 2, 2015, 2:34 p.m. UTC | #13
On Wed, 2 Dec 2015, Jakub Jelinek wrote:

> On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
> > On 12/02/15 09:22, Jakub Jelinek wrote:
> > 
> > >I believe Alex' testing revealed that if you take address of the same .local
> > >objects in several threads, the addresses are the same, and therefore you
> > >refer to your own .local space rather than the other thread's.
> > 
> > Before or after applying cvta?
> 
> I'll let Alex answer that.

Both before and after, see this email:
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg02081.html

Alexander
Nathan Sidwell Dec. 2, 2015, 2:39 p.m. UTC | #14
On 12/02/15 09:24, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
>> On 12/02/15 09:22, Jakub Jelinek wrote:
>>
>>> I believe Alex' testing revealed that if you take address of the same .local
>>> objects in several threads, the addresses are the same, and therefore you
>>> refer to your own .local space rather than the other thread's.
>>
>> Before or after applying cvta?
>
> I'll let Alex answer that.

Nevermind, I've run an experiment, and it appears that local addresses converted 
to generic do give the same value regardless of executing thread.  I guess that 
means that genericization of local addresses to physical memory is done late at 
the load/store insn, rather than in the cvta insn.

When I added routine support, I did wonder whether the calling routine would
need to clone its stack frame, but decided against it using the logic I wrote
earlier.

nathan
Alexander Monakov Dec. 2, 2015, 2:41 p.m. UTC | #15
On Wed, 2 Dec 2015, Nathan Sidwell wrote:

> On 12/02/15 05:40, Jakub Jelinek wrote:
> > Don't know the HW good enough, is there any power consumption, heat etc.
> > difference between the two approaches?  I mean does the HW consume different
> > amount of power if only one thread in a warp executes code and the other
> > threads in the same warp just jump around it, vs. having all threads busy?
> 
> Having all threads busy will increase power consumption.

Is that from general principles (i.e. "if it doesn't increase power
consumption, the GPU is poorly optimized"), or is that based on specific
knowledge of how existing GPUs operate (presumably reverse-engineered or
privately communicated -- I've never seen any public statements on this
point)?

The only certain case I imagine is instructions that go to SFU rather than
normal SPs -- but those are relatively rare.

> It's also bad if the other vectors are executing memory access instructions.

How so?  The memory accesses are the same independent of whether you're reading
the same data from 1 thread or 32 synchronous threads.

Alexander
Nathan Sidwell Dec. 2, 2015, 2:43 p.m. UTC | #16
On 12/02/15 09:41, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Nathan Sidwell wrote:
>
>> On 12/02/15 05:40, Jakub Jelinek wrote:
>>> Don't know the HW good enough, is there any power consumption, heat etc.
>>> difference between the two approaches?  I mean does the HW consume different
>>> amount of power if only one thread in a warp executes code and the other
>>> threads in the same warp just jump around it, vs. having all threads busy?
>>
>> Having all threads busy will increase power consumption.
>
> Is that from general principles (i.e. "if it doesn't increase power
> consumption, the GPU is poorly optimized"), or is that based on specific
> knowledge of how existing GPUs operate (presumably reverse-engineered or
> privately communicated -- I've never seen any public statements on this
> point)?

Nvidia told me.

> The only certain case I imagine is instructions that go to SFU rather than
> normal SPs -- but those are relatively rare.
>
>> It's also bad if the other vectors are executing memory access instructions.
>
> How so?  The memory accesses are the same independent of whether you're reading
> the same data from 1 thread or 32 synchronous threads.

Nvidia told me.
Alexander Monakov Dec. 2, 2015, 2:54 p.m. UTC | #17
On Wed, 2 Dec 2015, Jakub Jelinek wrote:

> On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > Don't know the HW good enough, is there any power consumption, heat etc.
> > >difference between the two approaches?  I mean does the HW consume different
> > >amount of power if only one thread in a warp executes code and the other
> > >threads in the same warp just jump around it, vs. having all threads busy?
> > 
> > Having all threads busy will increase power consumption.  It's also bad if
> > the other vectors are executing memory access instructions.  However, for
> 
> > Then the uniform SIMT approach might not be such a good idea.

Why?  Remember that the tradeoff is copying registers (and in OpenACC, stacks
too).  We don't know how the costs balance.  My intuition is that copying is
worse compared to what I'm doing.

Anyhow, for good performance the offloaded code needs to be running in vector
regions most of the time, where the concern doesn't apply.

Alexander
Jakub Jelinek Dec. 2, 2015, 3:12 p.m. UTC | #18
On Wed, Dec 02, 2015 at 05:54:51PM +0300, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> 
> > On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > > Don't know the HW good enough, is there any power consumption, heat etc.
> > > >difference between the two approaches?  I mean does the HW consume different
> > > >amount of power if only one thread in a warp executes code and the other
> > > >threads in the same warp just jump around it, vs. having all threads busy?
> > > 
> > > Having all threads busy will increase power consumption.  It's also bad if
> > > the other vectors are executing memory access instructions.  However, for
> > 
> > > Then the uniform SIMT approach might not be such a good idea.
> 
> Why?  Remember that the tradeoff is copying registers (and in OpenACC, stacks
> too).  We don't know how the costs balance.  My intuition is that copying is
> worse compared to what I'm doing.
> 
> Anyhow, for good performance the offloaded code needs to be running in vector
> regions most of the time, where the concern doesn't apply.

But you never know if people actually use #pragma omp simd regions or not;
sometimes they will, sometimes they won't, and if the uniform SIMT increases
power consumption, it might not be desirable.

If we have a reasonable IPA pass to discover which addressable variables can
be shared by multiple threads and which can't, then we could use soft-stack
for those that can be shared by multiple PTX threads (different warps, or
same warp, different threads in it), then we shouldn't need to copy any
stack, just broadcast the scalar vars.
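
(A small C illustration of the split such an IPA pass would make; the
escape function and the per-variable verdicts are invented for this sketch:)

extern void use_from_other_threads (int *);

void
classify_sketch (void)
{
  int a = 1;  /* never addressable: can stay in a register / .local */
  int b = 2;  /* addressable, but the address never leaves this
                 function: .local remains safe */
  int c = 3;  /* address escapes to code other threads may execute:
                 would need to live on the soft stack */
  int *p = &b;
  a += *p;
  use_from_other_threads (&c);
  (void) a;
}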

	Jakub
Nathan Sidwell Dec. 2, 2015, 3:18 p.m. UTC | #19
On 12/02/15 10:12, Jakub Jelinek wrote:

> If we have a reasonable IPA pass to discover which addressable variables can
> be shared by multiple threads and which can't, then we could use soft-stack
> for those that can be shared by multiple PTX threads (different warps, or
> same warp, different threads in it), then we shouldn't need to copy any
> stack, just broadcast the scalar vars.

Note the current scalar (.reg) broadcasting uses the live register set, not
the subset of it that is actually read within the partitioned region.  That'd
be a relatively straightforward optimization, I think.

nathan
Jakub Jelinek Dec. 2, 2015, 4:35 p.m. UTC | #20
On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote:
> > But you never know if people actually use #pragma omp simd regions or not;
> > sometimes they will, sometimes they won't, and if the uniform SIMT increases
> > power consumption, it might not be desirable.
> 
> It's easy to address: just terminate threads 1-31 if the linked image has
> no SIMD regions, like my pre-simd libgomp was doing.

Well, can't the linked image in one shared library call a function
in another linked image in another shared library?  Or is that just not
supported for PTX?  I believe XeonPhi supports that.

If each linked image is self-contained, then that is probably a good idea,
but still you could have a single simd region somewhere and lots of other
target regions that don't use simd, or cases where only a small amount of time
is spent in a simd region and this wouldn't help in that case.

If the addressables are handled through soft stack, then the rest is mostly
just SSA_NAMEs you can see on the edges of the SIMT region, which really
shouldn't be that expensive to broadcast or reduce back.

	Jakub
Nathan Sidwell Dec. 2, 2015, 5:09 p.m. UTC | #21
On 12/02/15 11:35, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote:
>>> But you never know if people actually use #pragma omp simd regions or not;
>>> sometimes they will, sometimes they won't, and if the uniform SIMT increases
>>> power consumption, it might not be desirable.
>>
>> It's easy to address: just terminate threads 1-31 if the linked image has
>> no SIMD regions, like my pre-simd libgomp was doing.
>
> Well, can't the linked image in one shared library call a function
> in another linked image in another shared library?  Or is that just not
> supported for PTX?  I believe XeonPhi supports that.

I don't believe PTX supports such dynamic loading within the PTX program 
currently being executed.  The JIT compiler can have several PTX 'objects' 
loaded into it before you tell it to go link everything.  At that point all 
symbols must be resolved.  I've no idea as to how passing a pointer to a 
function in some other 'executable' and calling it might behave -- my suspicion 
is 'badly'.
Alexander Monakov Dec. 2, 2015, 5:09 p.m. UTC | #22
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> > It's easy to address: just terminate threads 1-31 if the linked image has
> > no SIMD regions, like my pre-simd libgomp was doing.
> 
> Well, can't say the linked image in one shared library call a function
> in another linked image in another shared library?  Or is that just not
> supported for PTX?  I believe XeonPhi supports that.

I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
it's not an issue.  E.g. check early in gomp_nvptx_main if .weak
__nvptx_has_simd != 0.  It would only break if there was dlopen on PTX.
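
(A C sketch of that check, assuming __nvptx_has_simd is a weak data symbol
defaulting to zero, with images that contain SIMD regions providing a strong
non-zero definition -- the actual shape of gomp_nvptx_main may differ:)

/* Weak definition: overridden by a strong one in images with SIMD regions.  */
int __nvptx_has_simd __attribute__ ((weak));

void
gomp_nvptx_main_entry_sketch (unsigned tid_x)
{
  if (__nvptx_has_simd == 0 && tid_x != 0)
    return;  /* no SIMD regions linked in: threads 1-31 exit early */
  /* ... otherwise continue with all warp members active ... */
}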

> If each linked image is self-contained, then that is probably a good idea,
> but still you could have a single simd region somewhere and lots of other
> target regions that don't use simd, or cases where only a small amount of time
> is spent in a simd region and this wouldn't help in that case.

Should we actually be much concerned about optimizing this case, which
is unlikely to run faster than the host CPU in the first place?

> If the addressables are handled through soft stack, then the rest is mostly
> just SSA_NAMEs you can see on the edges of the SIMT region, which really
> shouldn't be that expensive to broadcast or reduce back.

That's not enough: you have to reach the SIMD region entry in threads 1-31,
which means they need to execute all preceding control flow like thread 0,
which means they need to compute controlling predicates like thread 0.
(OpenACC broadcasts controlling predicates at branches)
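
(An invented C example of why that matters:)

extern int simd_region (int);

int
reach_simd_sketch (int n)
{
  int r = 0;
  /* Threads 1-31 must run this loop and evaluate the test exactly like
     thread 0; broadcasting values only at the SIMD region edge would not
     get them here with the same i and r.  */
  for (int i = 0; i < n; i++)
    if (i % 2 == 0)
      r += simd_region (i);  /* all warp members must enter together */
  return r;
}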

Alexander
Nathan Sidwell Dec. 2, 2015, 5:20 p.m. UTC | #23
On 12/02/15 12:09, Alexander Monakov wrote:

> I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
> it's not an issue.  E.g. check early in gomp_nvptx_main if .weak
> __nvptx_has_simd != 0.  It would only break if there was dlopen on PTX.

Note I found a bug in .weak support.  See the comment in  gcc.dg/special/weak-2.c

/* NVPTX's implementation of weak is broken when a strong symbol is in
    a later object file than the weak definition.   */

> That's not enough: you have to reach the SIMD region entry in threads 1-31,
> which means they need to execute all preceding control flow like thread 0,
> which means they need to compute controlling predicates like thread 0.
> (OpenACC broadcasts controlling predicates at branches)

indeed.  Hence the partial 'forking' before a function call of a function with 
internal partitioned execution.

nathan
Alexander Monakov Dec. 3, 2015, 1:57 p.m. UTC | #24
On Wed, 2 Dec 2015, Nathan Sidwell wrote:
> On 12/02/15 12:09, Alexander Monakov wrote:
> 
> > I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
> > it's not an issue.  E.g. check early in gomp_nvptx_main if .weak
> > __nvptx_has_simd != 0.  It would only break if there was dlopen on PTX.
> 
> Note I found a bug in .weak support.  See the comment in
> gcc.dg/special/weak-2.c
> 
> /* NVPTX's implementation of weak is broken when a strong symbol is in
>    a later object file than the weak definition.   */

Thanks for the warning.  However, the issue seems limited to function symbols:
I've made a test for data symbols, and they appear to work fine -- which
suffices in this context.

Alexander
Nathan Sidwell Dec. 7, 2015, 3:09 p.m. UTC | #25
On 12/01/15 11:01, Bernd Schmidt wrote:
> On 12/01/2015 04:28 PM, Alexander Monakov wrote:
>> I'm taking a different approach.  I want to execute all insns in all warp
>> members, while ensuring that the effect (on global and local state) is the
>> same as if a single thread was executing that instruction.  Most instructions
>> automatically satisfy that: if threads have the same state, then executing an
>> arithmetic instruction, normal memory load/store, etc. keeps local state the
>> same in all threads.
>>
>> The two exception insn categories are atomics and calls.  For calls, we can
>> demand recursively that they uphold this execution model, until we reach
>> runtime-provided "syscalls": malloc/free/vprintf.  Those we can handle like
>> atomics.
>
> Didn't we also conclude that address-taking (let's say for stack addresses) is
> also an operation that does not result in the same state?
>
> Have you tried to use the mechanism used for OpenACC? IMO that would be a good
> first step - get things working with fewer changes, and then look into
> optimizing them (ideally for OpenMP and OpenACC both).

I would have thought the right approach would be to augment the existing 
neutering code to insert predication (instead of branch-around) using a 
heuristic as to which is the better choice.

nathan

Patch

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 2dad3e2..9209b47 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -117,6 +117,9 @@  static GTY(()) rtx worker_red_sym;
 /* True if any function references __nvptx_stacks.  */
 static bool need_softstack_decl;
 
+/* True if any function references __nvptx_uni.  */
+static bool need_unisimt_decl;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -599,6 +602,33 @@  nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }
 
+/* Emit code to initialize predicate and master lane index registers for
+   -muniform-simt code generation variant.  */
+
+static void
+nvptx_init_unisimt_predicate (FILE *file)
+{
+  int bits = BITS_PER_WORD;
+  int master = REGNO (cfun->machine->unisimt_master);
+  int pred = REGNO (cfun->machine->unisimt_predicate);
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32 %%ustmp0;\n");
+  fprintf (file, "\t\t.reg.u%d %%ustmp1;\n", bits);
+  fprintf (file, "\t\t.reg.u%d %%ustmp2;\n", bits);
+  fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.y;\n");
+  fprintf (file, "\t\tmul%s.u32 %%ustmp1, %%ustmp0, 4;\n",
+	   bits == 64 ? ".wide" : "");
+  fprintf (file, "\t\tmov.u%d %%ustmp2, __nvptx_uni;\n", bits);
+  fprintf (file, "\t\tadd.u%d %%ustmp2, %%ustmp2, %%ustmp1;\n", bits);
+  fprintf (file, "\t\tld.shared.u32 %%r%d, [%%ustmp2];\n", master);
+  fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.x;\n");
+  /* rNN = tid.x & __nvptx_uni[tid.y];  */
+  fprintf (file, "\t\tand.b32 %%r%d, %%r%d, %%ustmp0;\n", master, master);
+  fprintf (file, "\t\tsetp.eq.u32 %%r%d, %%r%d, %%ustmp0;\n", pred, master);
+  fprintf (file, "\t}\n");
+  need_unisimt_decl = true;
+}
+
 /* Emit kernel NAME for function ORIG outlined for an OpenMP 'target' region:
 
    extern void gomp_nvptx_main (void (*fn)(void*), void *fnarg);
@@ -811,6 +841,8 @@  nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (cfun->machine->axis_predicate[1])
     nvptx_init_axis_predicate (file,
 			       REGNO (cfun->machine->axis_predicate[1]), "x");
+  if (cfun->machine->unisimt_predicate)
+    nvptx_init_unisimt_predicate (file);
 }
 
 /* Output a return instruction.  Also copy the return value to its outgoing
@@ -2394,6 +2426,86 @@  nvptx_reorg_subreg (void)
     }
 }
 
+/* Return a SImode "master lane index" register for uniform-simt, allocating on
+   first use.  */
+
+static rtx
+nvptx_get_unisimt_master ()
+{
+  rtx &master = cfun->machine->unisimt_master;
+  return master ? master : master = gen_reg_rtx (SImode);
+}
+
+/* Return a BImode "predicate" register for uniform-simt, similar to above.  */
+
+static rtx
+nvptx_get_unisimt_predicate ()
+{
+  rtx &pred = cfun->machine->unisimt_predicate;
+  return pred ? pred : pred = gen_reg_rtx (BImode);
+}
+
+/* Return true if given call insn references one of the functions provided by
+   the CUDA runtime: malloc, free, vprintf.  */
+
+static bool
+nvptx_call_insn_is_syscall_p (rtx_insn *insn)
+{
+  rtx pat = PATTERN (insn);
+  gcc_checking_assert (GET_CODE (pat) == PARALLEL);
+  pat = XVECEXP (pat, 0, 0);
+  if (GET_CODE (pat) == SET)
+    pat = SET_SRC (pat);
+  gcc_checking_assert (GET_CODE (pat) == CALL
+		       && GET_CODE (XEXP (pat, 0)) == MEM);
+  rtx addr = XEXP (XEXP (pat, 0), 0);
+  if (GET_CODE (addr) != SYMBOL_REF)
+    return false;
+  const char *name = XSTR (addr, 0);
+  return (!strcmp (name, "vprintf")
+	  || !strcmp (name, "__nvptx_real_malloc")
+	  || !strcmp (name, "__nvptx_real_free"));
+}
+
+/* If SET subexpression of INSN sets a register, emit a shuffle instruction to
+   propagate its value from lane MASTER to current lane.  */
+
+static void
+nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master)
+{
+  rtx reg;
+  if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set)))
+    emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn);
+}
+
+/* Adjust code for uniform-simt code generation variant by making atomics and
+   "syscalls" conditionally executed, and inserting shuffle-based propagation
+   for registers being set.  */
+
+static void
+nvptx_reorg_uniform_simt ()
+{
+  rtx_insn *insn, *next;
+
+  for (insn = get_insns (); insn; insn = next)
+    {
+      next = NEXT_INSN (insn);
+      if (!(CALL_P (insn) && nvptx_call_insn_is_syscall_p (insn))
+	  && !(NONJUMP_INSN_P (insn)
+	       && GET_CODE (PATTERN (insn)) == PARALLEL
+	       && get_attr_divergent (insn)))
+	continue;
+      rtx pat = PATTERN (insn);
+      rtx master = nvptx_get_unisimt_master ();
+      for (int i = 0; i < XVECLEN (pat, 0); i++)
+	nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
+      rtx pred = nvptx_get_unisimt_predicate ();
+      pred = gen_rtx_NE (BImode, pred, const0_rtx);
+      pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat);
+      validate_change (insn, &PATTERN (insn), pat, false);
+    }
+}
+
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
 
@@ -2872,6 +2984,15 @@  nvptx_wsync (bool after)
   return gen_nvptx_barsync (GEN_INT (after));
 }
 
+/* Return a BImode "axis predicate" register, allocating on first use.  */
+
+static rtx
+nvptx_get_axis_predicate (int axis)
+{
+  rtx &pred = cfun->machine->axis_predicate[axis];
+  return pred ? pred : pred = gen_reg_rtx (BImode);
+}
+
 /* Single neutering according to MASK.  FROM is the incoming block and
    TO is the outgoing block.  These may be the same block. Insert at
    start of FROM:
@@ -2956,14 +3077,7 @@  nvptx_single (unsigned mask, basic_block from, basic_block to)
     if (GOMP_DIM_MASK (mode) & skip_mask)
       {
 	rtx_code_label *label = gen_label_rtx ();
-	rtx pred = cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER];
-
-	if (!pred)
-	  {
-	    pred = gen_reg_rtx (BImode);
-	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
-	  }
-	
+	rtx pred = nvptx_get_axis_predicate (mode - GOMP_DIM_WORKER);
 	rtx br;
 	if (mode == GOMP_DIM_VECTOR)
 	  br = gen_br_true (pred, label);
@@ -3202,6 +3316,9 @@  nvptx_reorg (void)
   /* Replace subregs.  */
   nvptx_reorg_subreg ();
 
+  if (TARGET_UNIFORM_SIMT)
+    nvptx_reorg_uniform_simt ();
+
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -3379,6 +3496,11 @@  nvptx_file_end (void)
       fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n;",
 	       BITS_PER_WORD);
     }
+  if (need_unisimt_decl)
+    {
+      fprintf (asm_out_file, "// BEGIN GLOBAL VAR DECL: __nvptx_uni\n");
+      fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n;");
+    }
 }
 
 /* Expander for the shuffle builtins.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index db8e201..1c605df 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -33,6 +33,8 @@ 
       builtin_define ("__nvptx__");		\
       if (TARGET_SOFT_STACK)			\
         builtin_define ("__nvptx_softstack__");	\
+      if (TARGET_UNIFORM_SIMT)			\
+        builtin_define ("__nvptx_unisimt__");	\
     } while (0)
 
 /* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@@ -234,6 +236,8 @@  struct GTY(()) machine_function
   int ret_reg_mode; /* machine_mode not defined yet. */
   int punning_buffer_size;
   rtx axis_predicate[2];
+  rtx unisimt_master;
+  rtx unisimt_predicate;
 };
 #endif
 
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 5ce7a89..f0fc02c 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -75,6 +75,9 @@  (define_c_enum "unspecv" [
 (define_attr "subregs_ok" "false,true"
   (const_string "false"))
 
+(define_attr "divergent" "false,true"
+  (const_string "false"))
+
 (define_predicate "nvptx_register_operand"
   (match_code "reg,subreg")
 {
@@ -1519,7 +1522,8 @@  (define_insn "atomic_compare_and_swap<mode>_1"
    (set (match_dup 1)
 	(unspec_volatile:SDIM [(const_int 0)] UNSPECV_CAS))]
   ""
-  "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;")
+  "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_exchange<mode>"
   [(set (match_operand:SDIM 0 "nvptx_register_operand" "=R")	;; output
@@ -1530,7 +1534,8 @@  (define_insn "atomic_exchange<mode>"
    (set (match_dup 1)
 	(match_operand:SDIM 2 "nvptx_register_operand" "R"))]	;; input
   ""
-  "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;")
+  "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_fetch_add<mode>"
   [(set (match_operand:SDIM 1 "memory_operand" "+m")
@@ -1542,7 +1547,8 @@  (define_insn "atomic_fetch_add<mode>"
    (set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   ""
-  "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+  "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_fetch_addsf"
   [(set (match_operand:SF 1 "memory_operand" "+m")
@@ -1554,7 +1560,8 @@  (define_insn "atomic_fetch_addsf"
    (set (match_operand:SF 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   ""
-  "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+  "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_code_iterator any_logic [and ior xor])
 (define_code_attr logic [(and "and") (ior "or") (xor "xor")])
@@ -1570,7 +1577,8 @@  (define_insn "atomic_fetch_<logic><mode>"
    (set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
 	(match_dup 1))]
   "0"
-  "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
+  "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "nvptx_barsync"
   [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 7ab09b9..47e811e 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -32,3 +32,7 @@  Link in code for a __main kernel.
 msoft-stack
 Target Report Mask(SOFT_STACK)
 Use custom stacks instead of local memory for automatic storage.
+
+muniform-simt
+Target Report Mask(UNIFORM_SIMT)
+Generate code that executes all threads in a warp as if one was active.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 6e45fb6..46cd2e9 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -18942,6 +18942,20 @@  in shared memory array @code{char *__nvptx_stacks[]} at position @code{tid.y}
 as the stack pointer.  This is for placing automatic variables into storage
 that can be accessed from other threads, or modified with atomic instructions.
 
+@item -muniform-simt
+@opindex muniform-simt
+Switch to code generation variant that allows to execute all threads in each
+warp, while maintaining memory state and side effects as if only one thread
+in each warp was active outside of OpenMP SIMD regions.  All atomic operations
+and calls to runtime (malloc, free, vprintf) are conditionally executed (iff
+current lane index equals the master lane index), and the register being
+assigned is copied via a shuffle instruction from the master lane.  Outside of
+SIMD regions lane 0 is the master; inside, each thread sees itself as the
+master.  Shared memory array @code{int __nvptx_uni[]} stores all-zeros or
+all-ones bitmasks for each warp, indicating current mode (0 outside of SIMD
+regions).  Each thread can bitwise-and the bitmask at position @code{tid.y}
+with current lane index to compute the master lane index.
+
 @end table
 
 @node PDP-11 Options