Message ID | 1448983707-18854-5-git-send-email-amonakov@ispras.ru |
---|---|
State | New |
Headers | show |
On Tue, Dec 01, 2015 at 06:28:22PM +0300, Alexander Monakov wrote: > Since OpenMP offloading requires both soft-stacks and "uniform SIMT", both > non-traditional codegen variants, I'm building a multilib variant with those > enabled. This patch adds option -mgomp which enables -msoft-stack plus > -muniform-simt, and builds a multilib with it. > > * config/nvptx/nvptx.c (nvptx_option_override): Handle TARGET_GOMP. > * config/nvptx/nvptx.opt (mgomp): New option. > * config/nvptx/t-nvptx (MULTILIB_OPTIONS): New. > * doc/invoke.texi (mgomp): Document. I thought the MULTILIB* vars allow you to multilib on none of -msoft-stack/-muniform-simt and both -msoft-stack/-muniform-simt, without building other variants, so you wouldn't need this. Furthermore, as I said, I believe for e.g. most of newlib libc / libm I think it is enough if they are built as -muniform-simt -mno-soft-stack, if those functions are leaf or don't call user routines that could have #pragma omp parallel. -msoft-stack would unnecessarily slow the routines down. So perhaps just multilib on -muniform-simt, and document that -muniform-simt built code requires also that the soft-stack var is set up and thus -msoft-stack can be used when needed? Can you post sample code with assembly for -msoft-stack and -muniform-simt showing how are short interesting cases expanded? Is there really no way even in direct PTX assembly to have .local file scope vars (rather than the global arrays indexed by %tid)? Jakub
On Wed, 2 Dec 2015, Jakub Jelinek wrote: > I thought the MULTILIB* vars allow you to multilib on none of > -msoft-stack/-muniform-simt and both -msoft-stack/-muniform-simt, without > building other variants, so you wouldn't need this. The nice effect of having -mgomp is better factorization: if I need to change what OpenMP needs, e.g. for going with your suggestion below and dropping -msoft-stack, I need to only change one line. Otherwise I'd have to change mkoffload too. > Furthermore, as I said, I believe for e.g. most of newlib libc / libm > I think it is enough if they are built as -muniform-simt -mno-soft-stack, > if those functions are leaf or don't call user routines that could have > #pragma omp parallel. -msoft-stack would unnecessarily slow the routines > down. Not obviously so. Outside of SIMD regions, running on hard stacks pointlessly amplifies cache/memory traffic for stack references, so there would have to be some evaluation before deciding. > So perhaps just multilib on -muniform-simt, and document that -muniform-simt > built code requires also that the soft-stack var is set up and thus > -msoft-stack can be used when needed? It's an interesting point, but I have doubts. Is that something you'd want me to address short-term? > Can you post sample code with assembly for -msoft-stack and -muniform-simt > showing how are short interesting cases expanded? > Is there really no way even in direct PTX assembly to have .local file scope > vars (rather than the global arrays indexed by %tid)? Allow me to post samples a bit later; as for .local, the PTX documentation explicitely states it must not be done: 5.1.5. Local State Space [...] When compiling to use the Application Binary Interface (ABI), .local state-space variables must be declared within function scope and are allocated on the stack. In implementations that do not support a stack, all local memory variables are stored at fixed addresses, recursive function calls are not supported, and .local variables may be declared at module scope. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .local variables, the compiler silently disables use of the ABI. (while I'm unsure as to what exactly "compiling to use the ABI" is defined, I'm assuming that's what we want in GCC, and otherwise linking may not work) Thanks. Alexander
On Wed, 2 Dec 2015, Jakub Jelinek wrote: > Can you post sample code with assembly for -msoft-stack and -muniform-simt > showing how are short interesting cases expanded? Here's short examples; please let me know if I'm misunderstanding and you wanted something else. First, -muniform-simt effect on this input: int f (int *p, int v) { return __atomic_exchange_n (p, v, __ATOMIC_SEQ_CST); } leads to this assembly (showing diff -without/+with option): .visible .func (.param.u32 %out_retval)f(.param.u64 %in_ar1, .param.u32 %in_ar2) { .reg.u64 %ar1; .reg.u32 %ar2; .reg.u32 %retval; .reg.u64 %hr10; .reg.u32 %r23; .reg.u64 %r25; .reg.u32 %r26; + .reg.u32 %r28; + .reg.pred %r29; ld.param.u64 %ar1, [%in_ar1]; ld.param.u32 %ar2, [%in_ar2]; + { + .reg.u32 %ustmp0; + .reg.u64 %ustmp1; + .reg.u64 %ustmp2; + mov.u32 %ustmp0, %tid.y; + mul.wide.u32 %ustmp1, %ustmp0, 4; + mov.u64 %ustmp2, __nvptx_uni; + add.u64 %ustmp2, %ustmp2, %ustmp1; + ld.shared.u32 %r28, [%ustmp2]; + mov.u32 %ustmp0, %tid.x; + and.b32 %r28, %r28, %ustmp0; + setp.eq.u32 %r29, %r28, %ustmp0; + } mov.u64 %r25, %ar1; mov.u32 %r26, %ar2; - atom.exch.b32 %r23, [%r25], %r26; + @%r29 atom.exch.b32 %r23, [%r25], %r26; + shfl.idx.b32 %r23, %r23, %r28, 31; mov.u32 %retval, %r23; st.param.u32 [%out_retval], %retval; ret; } +// BEGIN GLOBAL VAR DECL: __nvptx_uni +.extern .shared .u32 __nvptx_uni[32]; And, -msoft-stack for this input: void g(void *); void f() { char a[42] __attribute__((aligned(64))); g(a); } leads to: .visible .func f { .reg.u64 %hr10; .reg.u64 %r22; .reg.u64 %frame; - .local.align 64 .b8 %farray[48]; - cvta.local.u64 %frame, %farray; + .reg.u32 %fstmp0; + .reg.u64 %fstmp1; + .reg.u64 %fstmp2; + mov.u32 %fstmp0, %tid.y; + mul.wide.u32 %fstmp1, %fstmp0, 8; + mov.u64 %fstmp2, __nvptx_stacks; + add.u64 %fstmp2, %fstmp2, %fstmp1; + ld.shared.u64 %fstmp1, [%fstmp2]; + sub.u64 %frame, %fstmp1, 48; + and.b64 %frame, %frame, -64; + st.shared.u64 [%fstmp2], %frame; mov.u64 %r22, %frame; { .param.u64 %out_arg0; st.param.u64 [%out_arg0], %r22; call g, (%out_arg0); } + st.shared.u64 [%fstmp2], %fstmp1; ret; } // BEGIN GLOBAL FUNCTION DECL: g .extern .func g(.param.u64 %in_ar1); +// BEGIN GLOBAL VAR DECL: __nvptx_stacks +.extern .shared .u64 __nvptx_stacks[32]; Alexander
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 3bd3cf7..48ee96e 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -153,6 +153,9 @@ nvptx_option_override (void) worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, worker_red_name); worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + + if (TARGET_GOMP) + target_flags |= MASK_SOFT_STACK | MASK_UNIFORM_SIMT; } /* Return the mode to be used when declaring a ptx object for OBJ. diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt index 47e811e..8826659 100644 --- a/gcc/config/nvptx/nvptx.opt +++ b/gcc/config/nvptx/nvptx.opt @@ -36,3 +36,7 @@ Use custom stacks instead of local memory for automatic storage. muniform-simt Target Report Mask(UNIFORM_SIMT) Generate code that executes all threads in a warp as if one was active. + +mgomp +Target Report Mask(GOMP) +Generate code for OpenMP offloading: enables -msoft-stack and -muniform-simt. diff --git a/gcc/config/nvptx/t-nvptx b/gcc/config/nvptx/t-nvptx index e2580c9..6c1010d 100644 --- a/gcc/config/nvptx/t-nvptx +++ b/gcc/config/nvptx/t-nvptx @@ -8,3 +8,5 @@ ALL_HOST_OBJS += mkoffload.o mkoffload$(exeext): mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBDEPS) +$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ \ mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBS) + +MULTILIB_OPTIONS = mgomp diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 46cd2e9..7e7f3b4 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -18956,6 +18956,11 @@ all-ones bitmasks for each warp, indicating current mode (0 outside of SIMD regions). Each thread can bitwise-and the bitmask at position @code{tid.y} with current lane index to compute the master lane index. +@item -mgomp +@opindex mgomp +Generate code for use in OpenMP offloading: enables @option{-msoft-stack} and +@option{-muniform-simt} options, and selects corresponding multilib variant. + @end table @node PDP-11 Options