diff mbox

nvptx: implement automatic storage in custom stacks

Message ID alpine.LNX.2.20.1511121609140.3050@monopod.intra.ispras.ru
State New
Headers show

Commit Message

Alexander Monakov Nov. 12, 2015, 1:58 p.m. UTC
Hello,

I'm proposing the following patch as a step towards resolving the issue with
inaccessibility of stack storage (.local memory) in PTX to other threads than
the one using that stack.  The idea is to have preallocated stacks, and have
__nvptx_stacks[] array in shared memory hold current stack pointers.  Each
thread is maintaining __nvptx_stacks[tid.y] as its stack pointer, thus for
OpenMP the intent is to preallocate on a per-warp basis (not per-thread).
For OpenMP SIMD regions we'll have to ensure that conflicting accesses are not
introduced.

I've exposed a new command-line option -msoft-stack to ease testing, but for
OpenMP we'll have to automatically flip it based on function attributes.
Right now it's not easy because OpenMP and OpenACC both use "omp declare
target".  Jakub, I seem to recall a discussion about OpenACC changing to use a
separate attribute, but I cannot find it now.  Any advice here?

This approach also allows to implement alloca.  However, to drop
alloca-avoiding changes in libgomp we'd have to selectively enable
-msoft-stack there, only for functions that OpenACC wouldn't use.

I've run it through make -k check-c regtesting.  These are new fails, all
mysterious:

+FAIL: gcc.c-torture/execute/20090113-2.c   -O[123s]  execution test
Execution failure with invalid memory access.

+FAIL: gcc.c-torture/execute/20090113-3.c   -O[123s]  execution test
Times out (looping infinitely).

The above two I had difficulties investigating due to cuda-gdb 7.0 not showing
dissassembly for the misbehaving function.

+FAIL: gcc.c-torture/execute/loop-15.c   -O2  execution test
Rather surprising and unclear failure due to branch stack overflow.

There are also tests that now pass:
+PASS: gcc.c-torture/execute/20020529-1.c   -O0  execution test
Used to fail with invalid memory access.

+PASS: gcc.dg/sibcall-9.c execution test
(not meaningful on NVPTX)

+PASS: gcc.dg/torture/pr54261-1.c   -O[0123s]  execution test
Atomic modification to stack variables now works.

gcc/
	* config/nvptx/nvptx.c (need_softstack_decl): Declare.
	(nvptx_declare_function_name): Handle TARGET_SOFT_STACK.
	(nvptx_output_return): Restore stack pointer if needed.
	(nvptx_file_end): Emit declaration of __nvptx_stacks.
	* config/nvptx/nvptx.opt (msoft-stack): New option.
	* doc/invoke.texi (-msoft-stack): Document.

libgcc/
	* config/nvptx/crt0.s (__nvptx_stacks): Define.
	(%__softstack): Define 128 KiB stack for -msoft-stack.
	(__main): Setup __nvptx_stacks.

Comments

Bernd Schmidt Nov. 12, 2015, 2:40 p.m. UTC | #1
> I'm proposing the following patch as a step towards resolving the issue with
> inaccessibility of stack storage (.local memory) in PTX to other threads than
> the one using that stack.  The idea is to have preallocated stacks, and have
> __nvptx_stacks[] array in shared memory hold current stack pointers.  Each
> thread is maintaining __nvptx_stacks[tid.y] as its stack pointer, thus for
> OpenMP the intent is to preallocate on a per-warp basis (not per-thread).
> For OpenMP SIMD regions we'll have to ensure that conflicting accesses are not
> introduced.

This is of course really ugly; I'd propose we keep it on an nvptx-OpenMP 
specific branch for now until we know that this is really going somewhere.

> I've run it through make -k check-c regtesting.  These are new fails, all
> mysterious:

These would have to be investigated first.

> +	  sz = (sz + keep_align - 1) & ~(keep_align - 1);

Use the ROUND_UP macro.

> +	  fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n",
> +	           bits == 64 ? ".wide" : "", bits);

Use a shift.

> +
> +  if (need_softstack_decl)
> +    {
> +      fprintf (asm_out_file, ".extern .shared .u64 __nvptx_stacks[];\n;");
> +    }

Lose excess braces.

> +.global .u64 %__softstack[16384];

Maybe declarea as .u8 so you don't have two different constants for the 
stack size?

> +        .reg .u64 %stackptr;
> +        mov.u64	%stackptr, %__softstack;
> +        cvta.global.u64	%stackptr, %stackptr;
> +        add.u64	%stackptr, %stackptr, 131072;
> +        st.shared.u64	[__nvptx_stacks], %stackptr;
> +

I'm guessing you have other missing pieces for setting this up for 
multiple threads.


Bernd
Alexander Monakov Nov. 12, 2015, 2:59 p.m. UTC | #2
On Thu, 12 Nov 2015, Bernd Schmidt wrote:
> > I've run it through make -k check-c regtesting.  These are new fails, all
> > mysterious:
> 
> These would have to be investigated first.

Any specific suggestions?  The PTX code emitted from GCC differs only in
prologue/epilogue, so whatever's broken... I think is unlikely due to this
change.  I can give it another try after upgrading CUDA driver and cuda-gdb
from 7.0 to latest.

> > +	  sz = (sz + keep_align - 1) & ~(keep_align - 1);
> 
> Use the ROUND_UP macro.

OK, thanks.
 
> > +	  fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n",
> > +	           bits == 64 ? ".wide" : "", bits);
> 
> Use a shift.

I think mul is acceptable here: PTX JIT is handling it properly, according to
what I saw while investigating in cuda-gdb.  If I used a shift, I'd also have
to introduce another instruction for a widening integer conversion in the
64-bit case.  Do you insist?

> > +
> > +  if (need_softstack_decl)
> > +    {
> > +      fprintf (asm_out_file, ".extern .shared .u64 __nvptx_stacks[];\n;");
> > +    }
> 
> Lose excess braces.

OK.
 
> > +.global .u64 %__softstack[16384];
> 
> Maybe declarea as .u8 so you don't have two different constants for the stack
> size?

OK, with ".align 8" to ensure 64-bit alignment.
 
> > +        .reg .u64 %stackptr;
> > +        mov.u64	%stackptr, %__softstack;
> > +        cvta.global.u64	%stackptr, %stackptr;
> > +        add.u64	%stackptr, %stackptr, 131072;
> > +        st.shared.u64	[__nvptx_stacks], %stackptr;
> > +
> 
> I'm guessing you have other missing pieces for setting this up for multiple
> threads.

This is crt0.s, which is linked in only for single-threaded testing with
-mmainkernel; for OpenMP, the intention is to handle it in the file that
implements libgomp_nvptx_main.

Thanks.
Alexander
Bernd Schmidt Nov. 12, 2015, 3:03 p.m. UTC | #3
On 11/12/2015 03:59 PM, Alexander Monakov wrote:
> On Thu, 12 Nov 2015, Bernd Schmidt wrote:
>>> I've run it through make -k check-c regtesting.  These are new fails, all
>>> mysterious:
>>
>> These would have to be investigated first.
>
> Any specific suggestions?  The PTX code emitted from GCC differs only in
> prologue/epilogue, so whatever's broken... I think is unlikely due to this
> change.  I can give it another try after upgrading CUDA driver and cuda-gdb
> from 7.0 to latest.

Yeah, load it into cuda-gdb, that may help show what's happening.

>>> +	  fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n",
>>> +	           bits == 64 ? ".wide" : "", bits);
>>
>> Use a shift.
>
> I think mul is acceptable here: PTX JIT is handling it properly, according to
> what I saw while investigating in cuda-gdb.  If I used a shift, I'd also have
> to introduce another instruction for a widening integer conversion in the
> 64-bit case.  Do you insist?

Nah, it's fine.
>
> This is crt0.s, which is linked in only for single-threaded testing with
> -mmainkernel; for OpenMP, the intention is to handle it in the file that
> implements libgomp_nvptx_main.

Yeah, that's what I meant. It might be nice to see that too if it 
already exists.


Bernd
Jakub Jelinek Nov. 30, 2015, 10:32 a.m. UTC | #4
On Thu, Nov 12, 2015 at 04:58:21PM +0300, Alexander Monakov wrote:
> I'm proposing the following patch as a step towards resolving the issue with
> inaccessibility of stack storage (.local memory) in PTX to other threads than
> the one using that stack.  The idea is to have preallocated stacks, and have
> __nvptx_stacks[] array in shared memory hold current stack pointers.  Each
> thread is maintaining __nvptx_stacks[tid.y] as its stack pointer, thus for
> OpenMP the intent is to preallocate on a per-warp basis (not per-thread).
> For OpenMP SIMD regions we'll have to ensure that conflicting accesses are not
> introduced.
> 
> I've exposed a new command-line option -msoft-stack to ease testing, but for
> OpenMP we'll have to automatically flip it based on function attributes.
> Right now it's not easy because OpenMP and OpenACC both use "omp declare
> target".  Jakub, I seem to recall a discussion about OpenACC changing to use a
> separate attribute, but I cannot find it now.  Any advice here?

I believe OpenACC has acc routine {gang,worker,seq} that would roughly match
whether certain OpenMP declare target function (or ompfn region) is/can be
called within the target/teams/distribute context, or parallel context, or
simd context.  For OpenMP we have no such pragmas, so we need some analysis
to help the PTX (and, as Martin said on IRC, HSA apparently too) and add
attributes accordingly.
For the .ompfn* outlined region it is easy, there we know from which
construct it is, for other functions bet we want to do some IPA analysis for
this, start with the .ompfn* functions marked and walk the cgraph and for
declare target functions not callable from outside try to determine if they
are only called from parallel contexts, or not.

Does your patch affect all the stack allocations within certain function
(i.e. no way to select on a per-variable bases what stack to allocate it
to)?  Without any detailed analysis at least e.g. spilled (non-addressable)
vars could at least go to the local stack.  But PTX doesn't have any spills,
right?  Not sure about HSA.  If it is a per-function thing only, then it
isn't worth to do more detailed analysis at the ompexp time.

BTW, surely it will be an advantage if PTX can support alloca through this,
it could e.g. turn on -msoft-stack for all functions that use alloca/VLAs
automatically.

	Jakub
Alexander Monakov Nov. 30, 2015, 11:14 a.m. UTC | #5
On Mon, 30 Nov 2015, Jakub Jelinek wrote:
> Does your patch affect all the stack allocations within certain function
> (i.e. no way to select on a per-variable bases what stack to allocate it
> to)?  Without any detailed analysis at least e.g. spilled (non-addressable)
> vars could at least go to the local stack.  But PTX doesn't have any spills,
> right?  Not sure about HSA.  If it is a per-function thing only, then it
> isn't worth to do more detailed analysis at the ompexp time.

Yes, there's no register allocation and thus no spills on PTX.
 
> BTW, surely it will be an advantage if PTX can support alloca through this,
> it could e.g. turn on -msoft-stack for all functions that use alloca/VLAs
> automatically.

Yes, I'm going to support alloca on soft stacks, but, -msoft-stack has a
prerequisite of soft stacks being initially set up.  Therefore I'm treating it
as an ABI variant (together with another option to handle atomics and
"syscalls" outside of simd regions), and building a separate multilib for
that.  So I see it the other way around: it's not safe for the compiler to
always use soft-stacks for alloca (because OpenACC wouldn't set up soft
stacks), but if soft stacks are enabled, alloca can use them.

In the multilib variant that I'm introducing, all addressable vars go to soft
stacks, and classic .local stacks are used rarely, e.g. for stdarg passing,
and implicitely for calls/returns (and after JIT, they'll service register
spills too).

Alexander
Jakub Jelinek Nov. 30, 2015, 11:23 a.m. UTC | #6
On Mon, Nov 30, 2015 at 02:14:41PM +0300, Alexander Monakov wrote:
> On Mon, 30 Nov 2015, Jakub Jelinek wrote:
> > Does your patch affect all the stack allocations within certain function
> > (i.e. no way to select on a per-variable bases what stack to allocate it
> > to)?  Without any detailed analysis at least e.g. spilled (non-addressable)
> > vars could at least go to the local stack.  But PTX doesn't have any spills,
> > right?  Not sure about HSA.  If it is a per-function thing only, then it
> > isn't worth to do more detailed analysis at the ompexp time.
> 
> Yes, there's no register allocation and thus no spills on PTX.
>  
> > BTW, surely it will be an advantage if PTX can support alloca through this,
> > it could e.g. turn on -msoft-stack for all functions that use alloca/VLAs
> > automatically.
> 
> Yes, I'm going to support alloca on soft stacks, but, -msoft-stack has a
> prerequisite of soft stacks being initially set up.  Therefore I'm treating it
> as an ABI variant (together with another option to handle atomics and
> "syscalls" outside of simd regions), and building a separate multilib for
> that.  So I see it the other way around: it's not safe for the compiler to
> always use soft-stacks for alloca (because OpenACC wouldn't set up soft
> stacks), but if soft stacks are enabled, alloca can use them.
> 
> In the multilib variant that I'm introducing, all addressable vars go to soft
> stacks, and classic .local stacks are used rarely, e.g. for stdarg passing,
> and implicitely for calls/returns (and after JIT, they'll service register
> spills too).

Does it really have to be a full multilib?  I mean, the only precondition is
that something sets up the var, right?  Would the OpenACC folks be willing
to set it up too?  For stuff like libc, functions that are ECF_LEAF builtins
IMHO really don't care whether they are built as -msoft-stack or not, they
shouldn't be passing addresses of local vars to code that could use OpenMP.
The only question is if say qsort or other functions that call user callbacks
could be passing addresses of local vars to those callbacks, or whether they
only pass addresses passed from callers, or addresses of heap objects.

	Jakub
Alexander Monakov Nov. 30, 2015, 11:37 a.m. UTC | #7
On Mon, 30 Nov 2015, Jakub Jelinek wrote:
> Does it really have to be a full multilib?  I mean, the only precondition is
> that something sets up the var, right?  Would the OpenACC folks be willing
> to set it up too?  For stuff like libc, functions that are ECF_LEAF builtins
> IMHO really don't care whether they are built as -msoft-stack or not, they
> shouldn't be passing addresses of local vars to code that could use OpenMP.
> The only question is if say qsort or other functions that call user callbacks
> could be passing addresses of local vars to those callbacks, or whether they
> only pass addresses passed from callers, or addresses of heap objects.

Well, in that full multilib there's also a second option enabled, to handle
atomics/syscalls outside of simd regions, where the cost is additional code in
the prologue, and one "shuffle" instruction after each atomic/syscall.

It doesn't have to be a multilib, but doing it as a multilib is a safe choice
w.r.t OpenACC work.

Alexander
diff mbox

Patch

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 0204ad3..df915b9 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -114,6 +114,9 @@  static unsigned worker_red_align;
 #define worker_red_name "__worker_red"
 static GTY(()) rtx worker_red_sym;
 
+/* True if any function references __nvptx_stacks.  */
+static bool need_softstack_decl;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -689,15 +692,46 @@  nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 
   /* Declare a local variable for the frame.  */
   sz = get_frame_size ();
-  if (sz > 0 || cfun->machine->has_call_with_sc)
+  if (sz == 0 && cfun->machine->has_call_with_sc)
+    sz = 1;
+  if (sz > 0)
     {
       int alignment = crtl->stack_alignment_needed / BITS_PER_UNIT;
 
-      fprintf (file, "\t.reg.u%d %%frame;\n"
-	       "\t.local.align %d .b8 %%farray[" HOST_WIDE_INT_PRINT_DEC"];\n",
-	       BITS_PER_WORD, alignment, sz == 0 ? 1 : sz);
-      fprintf (file, "\tcvta.local.u%d %%frame, %%farray;\n",
-	       BITS_PER_WORD);
+      fprintf (file, "\t.reg.u%d %%frame;\n", BITS_PER_WORD);
+      if (TARGET_SOFT_STACK)
+	{
+	  /* Maintain 64-bit stack alignment.  */
+	  int keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT;
+	  sz = (sz + keep_align - 1) & ~(keep_align - 1);
+	  int bits = BITS_PER_WORD;
+	  fprintf (file, "\t.reg.u32 %%fstmp0;\n");
+	  fprintf (file, "\t.reg.u%d %%fstmp1;\n", bits);
+	  fprintf (file, "\t.reg.u%d %%fstmp2;\n", bits);
+	  fprintf (file, "\tmov.u32 %%fstmp0, %%tid.y;\n");
+	  fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n",
+	           bits == 64 ? ".wide" : "", bits);
+	  fprintf (file, "\tmov.u%d %%fstmp2, __nvptx_stacks;\n", bits);
+	  /* fstmp2 = &__nvptx_stacks[tid.y];  */
+	  fprintf (file, "\tadd.u%d %%fstmp2, %%fstmp2, %%fstmp1;\n", bits);
+	  fprintf (file, "\tld.shared.u%d %%fstmp1, [%%fstmp2];\n", bits);
+	  fprintf (file, "\tsub.u%d %%frame, %%fstmp1, "
+	           HOST_WIDE_INT_PRINT_DEC ";\n", bits, sz);
+	  if (alignment > keep_align)
+	    fprintf (file, "\tand.b%d %%frame, %%frame, %d;\n",
+		     bits, -alignment);
+	  if (!crtl->is_leaf)
+	    fprintf (file, "\tst.shared.u%d [%%fstmp2], %%frame;\n", bits);
+	  need_softstack_decl = true;
+	}
+      else
+	{
+	  fprintf (file, "\t.local.align %d "
+		   ".b8 %%farray[" HOST_WIDE_INT_PRINT_DEC"];\n",
+		   alignment, sz);
+	  fprintf (file, "\tcvta.local.u%d %%frame, %%farray;\n",
+		   BITS_PER_WORD);
+	}
     }
 
   if (cfun->machine->has_call_with_varargs)
@@ -734,6 +768,13 @@  nvptx_output_return (void)
 {
   machine_mode mode = (machine_mode)cfun->machine->ret_reg_mode;
 
+  if (TARGET_SOFT_STACK
+      && !crtl->is_leaf
+      && (get_frame_size () > 0 || cfun->machine->has_call_with_sc))
+    {
+      int bits = BITS_PER_WORD;
+      fprintf (asm_out_file, "\tst.shared.u%d [%%fstmp2], %%fstmp1;\n", bits);
+    }
   if (mode != VOIDmode)
     {
       mode = arg_promotion (mode);
@@ -3278,6 +3319,11 @@  nvptx_file_end (void)
 	       worker_red_align,
 	       worker_red_name, worker_red_size);
     }
+
+  if (need_softstack_decl)
+    {
+      fprintf (asm_out_file, ".extern .shared .u64 __nvptx_stacks[];\n;");
+    }
 }
 
 /* Expander for the shuffle builtins.  */
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 8017046..7ab09b9 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -28,3 +28,7 @@  Generate code for a 64-bit ABI.
 mmainkernel
 Target Report RejectNegative
 Link in code for a __main kernel.
+
+msoft-stack
+Target Report Mask(SOFT_STACK)
+Use custom stacks instead of local memory for automatic storage.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 587e30e..6e45fb6 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -18935,6 +18935,13 @@  Generate code for 32-bit or 64-bit ABI.
 Link in code for a __main kernel.  This is for stand-alone instead of
 offloading execution.
 
+@item -msoft-stack
+@opindex msoft-stack
+Do not use @code{.local} memory for automatic storage.  Instead, use pointer
+in shared memory array @code{char *__nvptx_stacks[]} at position @code{tid.y}
+as the stack pointer.  This is for placing automatic variables into storage
+that can be accessed from other threads, or modified with atomic instructions.
+
 @end table
 
 @node PDP-11 Options
diff --git a/libgcc/config/nvptx/crt0.s b/libgcc/config/nvptx/crt0.s
index 38327ed..7a42e87 100644
--- a/libgcc/config/nvptx/crt0.s
+++ b/libgcc/config/nvptx/crt0.s
@@ -22,6 +22,9 @@ 
         exit;
 }
 
+.visible .shared .u64 __nvptx_stacks[1];
+.global .u64 %__softstack[16384];
+
 .extern .func (.param.u32 retval) main (.param.u32 argc, .param.u64 argv);
 
 .visible .entry __main (.param .u64 __retval, .param.u32 __argc, .param.u64 __argv)
@@ -34,6 +37,12 @@ 
         ld.param.u64    %rd0, [__retval];
         st.global.u64   [%__exitval], %rd0;
 
+        .reg .u64 %stackptr;
+        mov.u64	%stackptr, %__softstack;
+        cvta.global.u64	%stackptr, %stackptr;
+        add.u64	%stackptr, %stackptr, 131072;
+        st.shared.u64	[__nvptx_stacks], %stackptr;
+
 	ld.param.u32	%r1, [__argc];
 	ld.param.u64	%rd1, [__argv];
 	st.param.u32	[%argc], %r1;