diff mbox

[gomp-nvptx,4/9] nvptx backend: add -mgomp option and multilib

Message ID 1448983707-18854-5-git-send-email-amonakov@ispras.ru
State New
Headers show

Commit Message

Alexander Monakov Dec. 1, 2015, 3:28 p.m. UTC
Since OpenMP offloading requires both soft-stacks and "uniform SIMT", both
non-traditional codegen variants, I'm building a multilib variant with those
enabled.  This patch adds option -mgomp which enables -msoft-stack plus
-muniform-simt, and builds a multilib with it.

	* config/nvptx/nvptx.c (nvptx_option_override): Handle TARGET_GOMP.
	* config/nvptx/nvptx.opt (mgomp): New option.
	* config/nvptx/t-nvptx (MULTILIB_OPTIONS): New.
	* doc/invoke.texi (mgomp): Document.
---
 gcc/config/nvptx/nvptx.c   | 3 +++
 gcc/config/nvptx/nvptx.opt | 4 ++++
 gcc/config/nvptx/t-nvptx   | 2 ++
 gcc/doc/invoke.texi        | 5 +++++
 4 files changed, 14 insertions(+)

Comments

Jakub Jelinek Dec. 2, 2015, 10:56 a.m. UTC | #1
On Tue, Dec 01, 2015 at 06:28:22PM +0300, Alexander Monakov wrote:
> Since OpenMP offloading requires both soft-stacks and "uniform SIMT", both
> non-traditional codegen variants, I'm building a multilib variant with those
> enabled.  This patch adds option -mgomp which enables -msoft-stack plus
> -muniform-simt, and builds a multilib with it.
> 
> 	* config/nvptx/nvptx.c (nvptx_option_override): Handle TARGET_GOMP.
> 	* config/nvptx/nvptx.opt (mgomp): New option.
> 	* config/nvptx/t-nvptx (MULTILIB_OPTIONS): New.
> 	* doc/invoke.texi (mgomp): Document.

I thought the MULTILIB* vars allow you to multilib on none of
-msoft-stack/-muniform-simt and both -msoft-stack/-muniform-simt, without
building other variants, so you wouldn't need this.
Furthermore, as I said, I believe for e.g. most of newlib libc / libm
I think it is enough if they are built as -muniform-simt -mno-soft-stack,
if those functions are leaf or don't call user routines that could have
#pragma omp parallel.  -msoft-stack would unnecessarily slow the routines
down.
So perhaps just multilib on -muniform-simt, and document that -muniform-simt
built code requires also that the soft-stack var is set up and thus
-msoft-stack can be used when needed?

Can you post sample code with assembly for -msoft-stack and -muniform-simt
showing how are short interesting cases expanded?
Is there really no way even in direct PTX assembly to have .local file scope
vars (rather than the global arrays indexed by %tid)?

	Jakub
Alexander Monakov Dec. 2, 2015, 2:18 p.m. UTC | #2
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> I thought the MULTILIB* vars allow you to multilib on none of
> -msoft-stack/-muniform-simt and both -msoft-stack/-muniform-simt, without
> building other variants, so you wouldn't need this.

The nice effect of having -mgomp is better factorization: if I need to change
what OpenMP needs, e.g. for going with your suggestion below and dropping
-msoft-stack, I need to only change one line.  Otherwise I'd have to change
mkoffload too.

> Furthermore, as I said, I believe for e.g. most of newlib libc / libm
> I think it is enough if they are built as -muniform-simt -mno-soft-stack,
> if those functions are leaf or don't call user routines that could have
> #pragma omp parallel.  -msoft-stack would unnecessarily slow the routines
> down.

Not obviously so.  Outside of SIMD regions, running on hard stacks pointlessly
amplifies cache/memory traffic for stack references, so there would have to be
some evaluation before deciding.

> So perhaps just multilib on -muniform-simt, and document that -muniform-simt
> built code requires also that the soft-stack var is set up and thus
> -msoft-stack can be used when needed?

It's an interesting point, but I have doubts.  Is that something you'd want me
to address short-term?

> Can you post sample code with assembly for -msoft-stack and -muniform-simt
> showing how are short interesting cases expanded?
> Is there really no way even in direct PTX assembly to have .local file scope
> vars (rather than the global arrays indexed by %tid)?

Allow me to post samples a bit later; as for .local, the PTX documentation
explicitely states it must not be done:

    5.1.5. Local State Space
    [...]
    When compiling to use the Application Binary Interface (ABI), .local
    state-space variables must be declared within function scope and are
    allocated on the stack. In implementations that do not support a stack,
    all local memory variables are stored at fixed addresses, recursive
    function calls are not supported, and .local variables may be declared at
    module scope. When compiling legacy PTX code (ISA versions prior to 3.0)
    containing module-scoped .local variables, the compiler silently disables
    use of the ABI.

(while I'm unsure as to what exactly "compiling to use the ABI" is defined,
I'm assuming that's what we want in GCC, and otherwise linking may not work)

Thanks.
Alexander
Alexander Monakov Dec. 3, 2015, 10:42 a.m. UTC | #3
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> Can you post sample code with assembly for -msoft-stack and -muniform-simt
> showing how are short interesting cases expanded?

Here's short examples;  please let me know if I'm misunderstanding and you
wanted something else.

First, -muniform-simt effect on this input:

int f (int *p, int v)
{
  return __atomic_exchange_n (p, v, __ATOMIC_SEQ_CST);
}

leads to this assembly (showing diff -without/+with option):

 .visible .func (.param.u32 %out_retval)f(.param.u64 %in_ar1, .param.u32 %in_ar2)
 {
        .reg.u64 %ar1;
        .reg.u32 %ar2;
        .reg.u32 %retval;
        .reg.u64 %hr10;
        .reg.u32 %r23;
        .reg.u64 %r25;
        .reg.u32 %r26;
+       .reg.u32 %r28;
+       .reg.pred %r29;
        ld.param.u64 %ar1, [%in_ar1];
        ld.param.u32 %ar2, [%in_ar2];
+       {
+               .reg.u32 %ustmp0;
+               .reg.u64 %ustmp1;
+               .reg.u64 %ustmp2;
+               mov.u32 %ustmp0, %tid.y;
+               mul.wide.u32 %ustmp1, %ustmp0, 4;
+               mov.u64 %ustmp2, __nvptx_uni;
+               add.u64 %ustmp2, %ustmp2, %ustmp1;
+               ld.shared.u32 %r28, [%ustmp2];
+               mov.u32 %ustmp0, %tid.x;
+               and.b32 %r28, %r28, %ustmp0;
+               setp.eq.u32 %r29, %r28, %ustmp0;
+       }
                mov.u64 %r25, %ar1;
                mov.u32 %r26, %ar2;
-               atom.exch.b32   %r23, [%r25], %r26;
+       @%r29   atom.exch.b32   %r23, [%r25], %r26;
+               shfl.idx.b32    %r23, %r23, %r28, 31;
                mov.u32 %retval, %r23;
        st.param.u32    [%out_retval], %retval;
        ret;
        }
+// BEGIN GLOBAL VAR DECL: __nvptx_uni
+.extern .shared .u32 __nvptx_uni[32];

And, -msoft-stack for this input:

void g(void *);
void f()
{
  char a[42] __attribute__((aligned(64)));
  g(a);
}

leads to:

 .visible .func f
 {
        .reg.u64 %hr10;
        .reg.u64 %r22;
        .reg.u64 %frame;
-       .local.align 64 .b8 %farray[48];
-       cvta.local.u64 %frame, %farray;
+       .reg.u32 %fstmp0;
+       .reg.u64 %fstmp1;
+       .reg.u64 %fstmp2;
+       mov.u32 %fstmp0, %tid.y;
+       mul.wide.u32 %fstmp1, %fstmp0, 8;
+       mov.u64 %fstmp2, __nvptx_stacks;
+       add.u64 %fstmp2, %fstmp2, %fstmp1;
+       ld.shared.u64 %fstmp1, [%fstmp2];
+       sub.u64 %frame, %fstmp1, 48;
+       and.b64 %frame, %frame, -64;
+       st.shared.u64 [%fstmp2], %frame;
                mov.u64 %r22, %frame;
        {
                .param.u64 %out_arg0;
                st.param.u64 [%out_arg0], %r22;
                call g, (%out_arg0);
        }
+       st.shared.u64 [%fstmp2], %fstmp1;
        ret;
        }
 // BEGIN GLOBAL FUNCTION DECL: g
 .extern .func g(.param.u64 %in_ar1);
+// BEGIN GLOBAL VAR DECL: __nvptx_stacks
+.extern .shared .u64 __nvptx_stacks[32];


Alexander
diff mbox

Patch

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 3bd3cf7..48ee96e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -153,6 +153,9 @@  nvptx_option_override (void)
 
   worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, worker_red_name);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
+  if (TARGET_GOMP)
+    target_flags |= MASK_SOFT_STACK | MASK_UNIFORM_SIMT;
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 47e811e..8826659 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -36,3 +36,7 @@  Use custom stacks instead of local memory for automatic storage.
 muniform-simt
 Target Report Mask(UNIFORM_SIMT)
 Generate code that executes all threads in a warp as if one was active.
+
+mgomp
+Target Report Mask(GOMP)
+Generate code for OpenMP offloading: enables -msoft-stack and -muniform-simt.
diff --git a/gcc/config/nvptx/t-nvptx b/gcc/config/nvptx/t-nvptx
index e2580c9..6c1010d 100644
--- a/gcc/config/nvptx/t-nvptx
+++ b/gcc/config/nvptx/t-nvptx
@@ -8,3 +8,5 @@  ALL_HOST_OBJS += mkoffload.o
 mkoffload$(exeext): mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBDEPS)
 	+$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ \
 	  mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBS)
+
+MULTILIB_OPTIONS = mgomp
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 46cd2e9..7e7f3b4 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -18956,6 +18956,11 @@  all-ones bitmasks for each warp, indicating current mode (0 outside of SIMD
 regions).  Each thread can bitwise-and the bitmask at position @code{tid.y}
 with current lane index to compute the master lane index.
 
+@item -mgomp
+@opindex mgomp
+Generate code for use in OpenMP offloading: enables @option{-msoft-stack} and
+@option{-muniform-simt} options, and selects corresponding multilib variant.
+
 @end table
 
 @node PDP-11 Options