Message ID | 823cc381-8752-14df-d6e2-0203de5da2fb@codesourcery.com |
---|---|
State | New |
Headers | show |
Series | [og7] vector_length extension part 2: Generalize state propagation and synchronization | expand |
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn, > have been extended to take a barrier ID and a thread count. The idea > here is to assign one barrier for each logical vector. Worker-single > synchronization is controlled by barrier 0. Therefore, the vector > barrier ID is set to tid.y+1 (because there's one vector unit per > worker) in nvptx_init_oacc_workers and placed into a register stored in > cfun->machine->sync_bar. If no workers are present, then the barrier ID > falls back to 0. I compiled a worker loop before and after the patch series, and observed this change: ... @@ -70,7 +71,7 @@ $L2: // joining 2; $L5: - bar.sync 1; + bar.sync 0; // join 2; ret; } ... AFAICT from your explanation above, that change is intentional. Changing the code generation scheme for workers is fine, but obviously that should be a minimal, separate patch that we can bisect back to. Thanks, - Tom
On 03/21/2018 10:10 AM, Tom de Vries wrote: > On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >> In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn, >> have been extended to take a barrier ID and a thread count. The idea >> here is to assign one barrier for each logical vector. Worker-single >> synchronization is controlled by barrier 0. Therefore, the vector >> barrier ID is set to tid.y+1 (because there's one vector unit per >> worker) in nvptx_init_oacc_workers and placed into a register stored in >> cfun->machine->sync_bar. If no workers are present, then the barrier ID >> falls back to 0. > > I compiled a worker loop before and after the patch series, and observed > this change: > ... > @@ -70,7 +71,7 @@ > $L2: > // joining 2; > $L5: > - bar.sync 1; > + bar.sync 0; > // join 2; > ret; > } > ... > > AFAICT from your explanation above, that change is intentional. > > Changing the code generation scheme for workers is fine, but obviously > that should be a minimal, separate patch that we can bisect back to. That sounds reasonable. I'll apply this patch to og7 once testing has completed. While all of the functionality it introduces is unnecessary without the vector length changes, at least it can be applied independently. Cesar Update bar.sync usage 2018-03-21 Cesar Philippidis <cesar@codesourcery.com> gcc/ * config/nvptx/nvptx.c (nvptx_cta_sync): Change arguments to take in a lock and thread count. Update call to gen_nvptx_barsync. (nvptx_single): Update call to nvptx_cta_sync. (nvptx_process_pars): Likewise. * config/nvptx/nvptx.md (nvptx_barsync): Adjust operands. diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index b7e3f59fed7..029628f8a0e 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -3936,13 +3936,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block, return empty; } -/* Emit a CTA-level synchronization barrier. We use different - markers for before and after synchronizations. */ +/* Emit a CTA-level synchronization barrier (bar.sync). LOCK is the + barrier number, which is an integer or a register. THREADS is the + number of threads controlled by the barrier. */ static rtx -nvptx_cta_sync (bool after) +nvptx_cta_sync (rtx lock, int threads) { - return gen_nvptx_barsync (GEN_INT (after)); + return gen_nvptx_barsync (lock, GEN_INT (threads)); } #if WORKAROUND_PTXJIT_BUG @@ -4192,6 +4193,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) /* Includes worker mode, do spill & fill. By construction we should never have worker mode only. */ broadcast_data_t data; + rtx barrier = GEN_INT (0); + int threads = 0; data.base = oacc_bcast_sym; data.ptr = 0; @@ -4204,14 +4207,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) false), before); /* Barrier so other workers can see the write. */ - emit_insn_before (nvptx_cta_sync (false), tail); + emit_insn_before (nvptx_cta_sync (barrier, threads), tail); data.offset = 0; emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data, false), tail); /* This barrier is needed to avoid worker zero clobbering the broadcast buffer before all the other workers have had a chance to read this instance of it. */ - emit_insn_before (nvptx_cta_sync (true), tail); + emit_insn_before (nvptx_cta_sync (barrier, threads), tail); } extract_insn (tail); @@ -4328,12 +4331,14 @@ nvptx_process_pars (parallel *par) bool empty = nvptx_shared_propagate (true, is_call, par->forked_block, par->fork_insn, false); + rtx barrier = GEN_INT (0); + int threads = 0; if (!empty || !is_call) { /* Insert begin and end synchronizations. */ - emit_insn_before (nvptx_cta_sync (false), par->forked_insn); - emit_insn_before (nvptx_cta_sync (true), par->join_insn); + emit_insn_before (nvptx_cta_sync (barrier, threads), par->forked_insn); + emit_insn_before (nvptx_cta_sync (barrier, threads), par->join_insn); } } else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md index 2b4bcb3a45b..e638a13c366 100644 --- a/gcc/config/nvptx/nvptx.md +++ b/gcc/config/nvptx/nvptx.md @@ -1421,10 +1421,16 @@ [(set_attr "atomic" "true")]) (define_insn "nvptx_barsync" - [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] + [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri") + (match_operand:SI 1 "const_int_operand")] UNSPECV_BARSYNC)] "" - "\\tbar.sync\\t%0;" + { + if (!REG_P (operands[0])) + return "\\tbar.sync\\t%0;"; + else + return "\\tbar.sync\\t%0, %1;"; + } [(set_attr "predicable" "false")]) (define_insn "nvptx_nounroll"
On 03/22/2018 04:59 AM, Cesar Philippidis wrote: > On 03/21/2018 10:10 AM, Tom de Vries wrote: >> On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >>> In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn, >>> have been extended to take a barrier ID and a thread count. The idea >>> here is to assign one barrier for each logical vector. Worker-single >>> synchronization is controlled by barrier 0. Therefore, the vector >>> barrier ID is set to tid.y+1 (because there's one vector unit per >>> worker) in nvptx_init_oacc_workers and placed into a register stored in >>> cfun->machine->sync_bar. If no workers are present, then the barrier ID >>> falls back to 0. >> >> I compiled a worker loop before and after the patch series, and observed >> this change: >> ... >> @@ -70,7 +71,7 @@ >> $L2: >> // joining 2; >> $L5: >> - bar.sync 1; >> + bar.sync 0; >> // join 2; >> ret; >> } >> ... >> >> AFAICT from your explanation above, that change is intentional. >> >> Changing the code generation scheme for workers is fine, but obviously >> that should be a minimal, separate patch that we can bisect back to. > > That sounds reasonable. I'll apply this patch to og7 once testing has > completed. While all of the functionality it introduces is unnecessary In other words, the patch is not minimal. Thanks, - Tom > without the vector length changes, at least it can be applied independently. > diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index b7e3f59..16d846e 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -3936,13 +3936,13 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block, return empty; } -/* Emit a CTA-level synchronization barrier. We use different - markers for before and after synchronizations. */ +/* Emit a CTA-level synchronization barrier (bar.sync). LOCK is the + barrier number, which is an integer or a register. */ static rtx -nvptx_cta_sync (bool after) +nvptx_cta_sync (rtx lock) { - return gen_nvptx_barsync (GEN_INT (after)); + return gen_nvptx_barsync (lock); } #if WORKAROUND_PTXJIT_BUG @@ -4192,6 +4192,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) /* Includes worker mode, do spill & fill. By construction we should never have worker mode only. */ broadcast_data_t data; + rtx barrier = GEN_INT (0); data.base = oacc_bcast_sym; data.ptr = 0; @@ -4204,14 +4205,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) false), before); /* Barrier so other workers can see the write. */ - emit_insn_before (nvptx_cta_sync (false), tail); + emit_insn_before (nvptx_cta_sync (barrier), tail); data.offset = 0; emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data, false), tail); /* This barrier is needed to avoid worker zero clobbering the broadcast buffer before all the other workers have had a chance to read this instance of it. */ - emit_insn_before (nvptx_cta_sync (true), tail); + emit_insn_before (nvptx_cta_sync (barrier), tail); } extract_insn (tail); @@ -4328,12 +4329,13 @@ nvptx_process_pars (parallel *par) bool empty = nvptx_shared_propagate (true, is_call, par->forked_block, par->fork_insn, false); + rtx barrier = GEN_INT (0); if (!empty || !is_call) { /* Insert begin and end synchronizations. */ - emit_insn_before (nvptx_cta_sync (false), par->forked_insn); - emit_insn_before (nvptx_cta_sync (true), par->join_insn); + emit_insn_before (nvptx_cta_sync (barrier), par->forked_insn); + emit_insn_before (nvptx_cta_sync (barrier), par->join_insn); } } else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > (nvptx_declare_function_name): Emit a .maxntid directive hint and > call nvptx_init_oacc_workers. > + > + /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches. */ > + if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl)) > + && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl))) > + s << ".maxntid " << cfun->machine->axis_dim[0] << ", " > + << cfun->machine->axis_dim[1] << ", 1\n"; > + This change: ... // BEGIN FUNCTION DEF: main$_omp_fn$0 .entry main$_omp_fn$0 (.param .u64 %in_ar0) + .maxntid 32, 32, 1 ... needs to be an individual patch. > + /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches. */ 'Help' is too strongly formulated, given that there's no clear link between the semantics of the directive, and the observed effect. Use "seems to have the effect" or some such formulation. Also, list in the comment a JIT driver version, and sm_ version and a testcase for which this is required. Also, guard it with WORKAROUND_PTXJIT_BUG_3 (_2 is already taken in trunk.) Thanks, - Tom
On 03/22/2018 06:43 AM, Tom de Vries wrote: > On 03/22/2018 04:59 AM, Cesar Philippidis wrote: >> On 03/21/2018 10:10 AM, Tom de Vries wrote: >>> Changing the code generation scheme for workers is fine, but obviously >>> that should be a minimal, separate patch that we can bisect back to. >> >> That sounds reasonable. I'll apply this patch to og7 once testing has >> completed. While all of the functionality it introduces is unnecessary > > In other words, the patch is not minimal. My intention was to reduce the size of the final vector length patch. But I can commit this patch after testing as it's equivalent at this point. Cesar diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index b7e3f59fed7..eff87732c4b 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -4211,7 +4211,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) /* This barrier is needed to avoid worker zero clobbering the broadcast buffer before all the other workers have had a chance to read this instance of it. */ - emit_insn_before (nvptx_cta_sync (true), tail); + emit_insn_before (nvptx_cta_sync (false), tail); } extract_insn (tail); @@ -4333,7 +4333,7 @@ nvptx_process_pars (parallel *par) { /* Insert begin and end synchronizations. */ emit_insn_before (nvptx_cta_sync (false), par->forked_insn); - emit_insn_before (nvptx_cta_sync (true), par->join_insn); + emit_insn_before (nvptx_cta_sync (false), par->join_insn); } } else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > The attached patch generalizes the worker state propagation and > synchronization code to handle large vectors. When the vector_length is > larger than a CUDA warp, the nvptx BE will now use shared-memory to > spill-and-fill vector state when transitioning from vector-single mode > to vector partitioned. I've compiled this test-case: ... int main (void) { int a[10]; #pragma acc parallel loop worker for (int i = 0; i < 10; i++) a[i] = i; return 0; } ... without and with the patch series, and observed the following difference in generated ptx: ... -.shared .align 8 .u8 __oacc_bcast[8]; +.shared .align 8 .u8 __oacc_bcast[264]; ... Why is the example using 33 times more shared memory space with the patch series applied? Thanks, - Tom
On 03/22/2018 07:23 AM, Tom de Vries wrote: > On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > >> (nvptx_declare_function_name): Emit a .maxntid directive hint and >> call nvptx_init_oacc_workers. > >> + >> + /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches. */ >> + if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl)) >> + && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl))) >> + s << ".maxntid " << cfun->machine->axis_dim[0] << ", " >> + << cfun->machine->axis_dim[1] << ", 1\n"; >> + > > This change: > ... > // BEGIN FUNCTION DEF: main$_omp_fn$0 > .entry main$_omp_fn$0 (.param .u64 %in_ar0) > + .maxntid 32, 32, 1 > ... > needs to be an individual patch. cfun->machine->axis_dims is something new to the vector length changes, so I hard-coded .maxntid to size '32, 32, 1' for og7 as an interim solution. >> + /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches. */ > > 'Help' is too strongly formulated, given that there's no clear link > between the semantics of the directive, and the observed effect. > > Use "seems to have the effect" or some such formulation. > > Also, list in the comment a JIT driver version, and sm_ version and a > testcase for which this is required. > > Also, guard it with WORKAROUND_PTXJIT_BUG_3 (_2 is already taken in trunk.) Sounds reasonable. I'll commit the patch to og7 once the regression testing has completed. Thanks, Cesar From b89ec8060de3affb94b580be3260381028d4c183 Mon Sep 17 00:00:00 2001 From: Cesar Philippidis <cesar@codesourcery.com> Date: Thu, 22 Mar 2018 08:05:53 -0700 Subject: [PATCH] add .maxntid hint --- gcc/config/nvptx/nvptx.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index eff87732c4b..9fb2bcd6852 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -76,6 +76,7 @@ #include "target-def.h" #define WORKAROUND_PTXJIT_BUG 1 +#define WORKAROUND_PTXJIT_BUG_3 1 /* Define dimension sizes for known hardware. */ #define PTX_VECTOR_LENGTH 32 @@ -1219,6 +1220,15 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl) stream, in order to share the prototype writing code. */ std::stringstream s; write_fn_proto (s, true, name, decl); + +#if WORKAROUND_PTXJIT_BUG_3 + /* Emitting a .maxntid seems to have the effect of encouraging the + PTX JIT emit SYNC branches. */ + if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl)) + && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl))) + s << ".maxntid 32, 32, 1\n"; +#endif + s << "{\n"; bool return_in_mem = write_return_type (s, false, result_type);
On 03/22/2018 04:11 PM, Cesar Philippidis wrote: > On 03/22/2018 07:23 AM, Tom de Vries wrote: >> On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >> >>> (nvptx_declare_function_name): Emit a .maxntid directive hint and >>> call nvptx_init_oacc_workers. >>> + >>> + /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches. */ >>> + if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl)) >>> + && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl))) >>> + s << ".maxntid " << cfun->machine->axis_dim[0] << ", " >>> + << cfun->machine->axis_dim[1] << ", 1\n"; >>> + >> This change: >> ... >> // BEGIN FUNCTION DEF: main$_omp_fn$0 >> .entry main$_omp_fn$0 (.param .u64 %in_ar0) >> + .maxntid 32, 32, 1 >> ... >> needs to be an individual patch. > cfun->machine->axis_dims is something new to the vector length changes, > so I hard-coded .maxntid to size '32, 32, 1' for og7 as an interim solution. > That's obviously not good enough. When I compile this test-case: ... int main (void) { int a[10]; #pragma acc parallel num_workers (16) #pragma acc loop worker for (int i = 0; i < 10; i++) a[i] = i; return 0; } ... I get: ... .maxntid 32, 16, 1 ... That's the change you need to isolate. Thanks, - Tom
On 03/22/2018 07:44 AM, Tom de Vries wrote: > On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >> The attached patch generalizes the worker state propagation and >> synchronization code to handle large vectors. When the vector_length is >> larger than a CUDA warp, the nvptx BE will now use shared-memory to >> spill-and-fill vector state when transitioning from vector-single mode >> to vector partitioned. > > I've compiled this test-case: > ... > int > main (void) > { > int a[10]; > #pragma acc parallel loop worker > for (int i = 0; i < 10; i++) > a[i] = i; > > return 0; > } > ... > > without and with the patch series, and observed the following difference > in generated ptx: > ... > -.shared .align 8 .u8 __oacc_bcast[8]; > +.shared .align 8 .u8 __oacc_bcast[264]; > ... > > Why is the example using 33 times more shared memory space with the > patch series applied? Because the nvptx BE wasn't taking into account that vector_length = 32 doesn't need to use shared-memory to broadcast variables. That magic value of 33 was derived from nvptx_mach_max_workers () + 1. When vector_length > 32, there needs to be nvptx_mach_max_workers () partitions for vector state propagation. There also needs to be a shared-memory buffer for worker-state propagation, because I found situations where some threads where still spilling and filling workers before vector 0 transitioned vector-partitioned mode. The attached, untested, patch should resolve that issue. Cesar diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 3102c79bf96..f81fb0113d5 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -4061,9 +4061,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block, if (oacc_bcast_partition < data.offset) { int psize = data.offset; + int pnum = 1; + + if (nvptx_mach_vector_length () > PTX_WARP_SIZE) + pnum = nvptx_mach_max_workers () + 1; + psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1); oacc_bcast_partition = psize; - oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1); + oacc_bcast_size = psize * pnum; } } return empty; @@ -4348,9 +4353,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) if (oacc_bcast_partition < size) { int psize = size; + int pnum = 1; + + if (nvptx_mach_vector_length () > PTX_WARP_SIZE) + pnum = nvptx_mach_max_workers () + 1; + psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1); oacc_bcast_partition = psize; - oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1); + oacc_bcast_size = psize * pnum; } data.offset = 0;
On 03/22/2018 09:18 AM, Tom de Vries wrote: > That's obviously not good enough. > > When I compile this test-case: > ... > int > main (void) > { > int a[10]; > #pragma acc parallel num_workers (16) > #pragma acc loop worker > for (int i = 0; i < 10; i++) > a[i] = i; > > return 0; > } > ... > > I get: > ... > .maxntid 32, 16, 1 > ... > > That's the change you need to isolate. I attached an updated patch which incorporates the cfun->machine->axis_dim changes. It now generates more precise arguments for maxntid. Cesar From 11035dc92884146dc4d974156adcb260568db785 Mon Sep 17 00:00:00 2001 From: Cesar Philippidis <cesar@codesourcery.com> Date: Thu, 22 Mar 2018 08:05:53 -0700 Subject: [PATCH] emit .maxntid hint --- gcc/config/nvptx/nvptx.c | 19 +++++++++++++++++++ gcc/config/nvptx/nvptx.h | 2 ++ 2 files changed, 21 insertions(+) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index eff87732c4b..3958f71e995 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -76,6 +76,7 @@ #include "target-def.h" #define WORKAROUND_PTXJIT_BUG 1 +#define WORKAROUND_PTXJIT_BUG_3 1 /* Define dimension sizes for known hardware. */ #define PTX_VECTOR_LENGTH 32 @@ -1219,6 +1220,16 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl) stream, in order to share the prototype writing code. */ std::stringstream s; write_fn_proto (s, true, name, decl); + +#if WORKAROUND_PTXJIT_BUG_3 + /* Emitting a .maxntid seems to have the effect of encouraging the + PTX JIT emit SYNC branches. */ + if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl)) + && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl))) + s << ".maxntid " << cfun->machine->axis_dim[0] << ", " + << cfun->machine->axis_dim[1] << ", 1\n"; +#endif + s << "{\n"; bool return_in_mem = write_return_type (s, false, result_type); @@ -2831,6 +2842,11 @@ struct offload_attrs int max_workers; }; +/* Define entries for cfun->machine->axis_dim. */ + +#define MACH_VECTOR_LENGTH 0 +#define MACH_MAX_WORKERS 1 + struct parallel { /* Parent parallel. */ @@ -4525,6 +4541,9 @@ nvptx_reorg (void) populate_offload_attrs (&oa); + cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length; + cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers; + /* If there is worker neutering, there must be vector neutering. Otherwise the hardware will fail. */ gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h index 8a14507c88a..958516da604 100644 --- a/gcc/config/nvptx/nvptx.h +++ b/gcc/config/nvptx/nvptx.h @@ -226,6 +226,8 @@ struct GTY(()) machine_function int return_mode; /* Return mode of current fn. (machine_mode not defined yet.) */ rtx axis_predicate[2]; /* Neutering predicates. */ + int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is + vector_length, dim[1] is num_workers. */ rtx unisimt_master; /* 'Master lane index' for -muniform-simt. */ rtx unisimt_predicate; /* Predicate for -muniform-simt. */ rtx unisimt_location; /* Mask location for -muniform-simt. */
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > + rtx red_partition; /* Similar to bcast_partition, except for vector > + reductions. */ Shouldn't this be in "[og7] vector_length extension part 3: reductions"? Thanks, - Tom
On 03/22/2018 10:39 AM, Tom de Vries wrote: > On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >> + rtx red_partition; /* Similar to bcast_partition, except for vector >> + reductions. */ > > Shouldn't this be in "[og7] vector_length extension part 3: reductions"? Maybe. But keep in mind, with the exception of the bar.sync and maxntid changes you requested, I don't think the vector length patch makes sense to go in as individual hunks. Maybe I could split out the new TARGET_GOACC_ADJUST_PARALLELISM hook in part 4 into a separate patch. But, at the same time, if something isn't being used, what's the point of going through that extra work? Cesar
On 03/22/2018 06:24 PM, Cesar Philippidis wrote: > On 03/22/2018 09:18 AM, Tom de Vries wrote: > >> That's obviously not good enough. >> >> When I compile this test-case: >> ... >> int >> main (void) >> { >> int a[10]; >> #pragma acc parallel num_workers (16) >> #pragma acc loop worker >> for (int i = 0; i < 10; i++) >> a[i] = i; >> >> return 0; >> } >> ... >> >> I get: >> ... >> .maxntid 32, 16, 1 >> ... >> >> That's the change you need to isolate. > > I attached an updated patch which incorporates the > cfun->machine->axis_dim changes. It now generates more precise arguments > for maxntid. I'll try this out. Still, this doesn't address my request: "Also, list in the comment a JIT driver version, and sm_ version and a testcase for which this is required" Thanks, - Tom > > Cesar > > > 0001-emit-.maxntid-hint.patch > > > From 11035dc92884146dc4d974156adcb260568db785 Mon Sep 17 00:00:00 2001 > From: Cesar Philippidis <cesar@codesourcery.com> > Date: Thu, 22 Mar 2018 08:05:53 -0700 > Subject: [PATCH] emit .maxntid hint > > --- > gcc/config/nvptx/nvptx.c | 19 +++++++++++++++++++ > gcc/config/nvptx/nvptx.h | 2 ++ > 2 files changed, 21 insertions(+) > > diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c > index eff87732c4b..3958f71e995 100644 > --- a/gcc/config/nvptx/nvptx.c > +++ b/gcc/config/nvptx/nvptx.c > @@ -76,6 +76,7 @@ > #include "target-def.h" > > #define WORKAROUND_PTXJIT_BUG 1 > +#define WORKAROUND_PTXJIT_BUG_3 1 > > /* Define dimension sizes for known hardware. */ > #define PTX_VECTOR_LENGTH 32 > @@ -1219,6 +1220,16 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl) > stream, in order to share the prototype writing code. */ > std::stringstream s; > write_fn_proto (s, true, name, decl); > + > +#if WORKAROUND_PTXJIT_BUG_3 > + /* Emitting a .maxntid seems to have the effect of encouraging the > + PTX JIT emit SYNC branches. */ > + if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl)) > + && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl))) > + s << ".maxntid " << cfun->machine->axis_dim[0] << ", " > + << cfun->machine->axis_dim[1] << ", 1\n"; > +#endif > + > s << "{\n"; > > bool return_in_mem = write_return_type (s, false, result_type); > @@ -2831,6 +2842,11 @@ struct offload_attrs > int max_workers; > }; > > +/* Define entries for cfun->machine->axis_dim. */ > + > +#define MACH_VECTOR_LENGTH 0 > +#define MACH_MAX_WORKERS 1 > + > struct parallel > { > /* Parent parallel. */ > @@ -4525,6 +4541,9 @@ nvptx_reorg (void) > > populate_offload_attrs (&oa); > > + cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length; > + cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers; > + > /* If there is worker neutering, there must be vector > neutering. Otherwise the hardware will fail. */ > gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) > diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h > index 8a14507c88a..958516da604 100644 > --- a/gcc/config/nvptx/nvptx.h > +++ b/gcc/config/nvptx/nvptx.h > @@ -226,6 +226,8 @@ struct GTY(()) machine_function > int return_mode; /* Return mode of current fn. > (machine_mode not defined yet.) */ > rtx axis_predicate[2]; /* Neutering predicates. */ > + int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is > + vector_length, dim[1] is num_workers. */ > rtx unisimt_master; /* 'Master lane index' for -muniform-simt. */ > rtx unisimt_predicate; /* Predicate for -muniform-simt. */ > rtx unisimt_location; /* Mask location for -muniform-simt. */ >
On 03/22/2018 06:47 PM, Cesar Philippidis wrote: > On 03/22/2018 10:39 AM, Tom de Vries wrote: >> On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >>> + rtx red_partition; /* Similar to bcast_partition, except for vector >>> + reductions. */ >> >> Shouldn't this be in "[og7] vector_length extension part 3: reductions"? > > Maybe. But keep in mind, with the exception of the bar.sync and maxntid > changes you requested, I don't think the vector length patch makes sense > to go in as individual hunks. Maybe I could split out the new > TARGET_GOACC_ADJUST_PARALLELISM hook in part 4 into a separate patch. > But, at the same time, if something isn't being used, what's the point > of going through that extra work? Because patches that are split into logically consistent parts are easy to review, and easy to analyze and fix or undo when bisected back to. And yes, that's extra work. Thanks, - Tom
On 03/22/2018 10:51 AM, Tom de Vries wrote: > On 03/22/2018 06:24 PM, Cesar Philippidis wrote: >> On 03/22/2018 09:18 AM, Tom de Vries wrote: >> >>> That's obviously not good enough. >>> >>> When I compile this test-case: >>> ... >>> int >>> main (void) >>> { >>> int a[10]; >>> #pragma acc parallel num_workers (16) >>> #pragma acc loop worker >>> for (int i = 0; i < 10; i++) >>> a[i] = i; >>> >>> return 0; >>> } >>> ... >>> >>> I get: >>> ... >>> .maxntid 32, 16, 1 >>> ... >>> >>> That's the change you need to isolate. >> >> I attached an updated patch which incorporates the >> cfun->machine->axis_dim changes. It now generates more precise arguments >> for maxntid. > > I'll try this out. > > Still, this doesn't address my request: "Also, list in the comment a JIT > driver version, and sm_ version and a testcase for which this is required" I attached the test case where it used to fail without maxntid. But after looking at again, the maxntid directive was probably masking that other PTX JIT bug involving abort and exiting threads that you fixed. And in fact, the test case works without the maxntid patch on my sm_60 GPU. I'm going to retest the variable vector length changes without it and see if it's still necessary. On one hand, maxntid should be fairly innocuous, but I don't like how it can mask other PTX JIT bugs. At this point, I'm leaning towards dropping it if does not impact the libgomp regression test suite anymore. What do you want to do? Cesar /* This test was failing with nvptx offloading without the .maxntid PTX directive. */ int i; int main(void) { int j, v; i = -1; j = -2; v = 0; j = -2; v = 0; #pragma acc parallel present_or_copyout (v) copyout (i, j) vector_length(128) { i = 2; j = 1; if (i != 2 || j != 1) __builtin_abort (); v = 1; } if (v != 1 || i != 2 || j != 1) __builtin_abort (); i = -1; j = -2; v = 0; #pragma acc parallel present_or_copyout (v) copy (i, j) vector_length(128) { if (i != -1 || j != -2) __builtin_abort (); i = 2; j = 1; if (i != 2 || j != 1) __builtin_abort (); v = 1; } if (v != 1 || i != 2 || j != 1) __builtin_abort (); return 0; }
On 03/22/2018 08:04 PM, Cesar Philippidis wrote: > I'm going to retest the variable vector length changes without it and > see if it's still necessary. On one hand, maxntid should be fairly > innocuous, but I don't like how it can mask other PTX JIT bugs. At this > point, I'm leaning towards dropping it if does not impact the libgomp > regression test suite anymore. What do you want to do? If there is no observable difference in tests passing/failing, then we should drop it. Thanks, - Tom
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > @@ -4115,13 +4225,23 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) > pred = gen_reg_rtx (BImode); > cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred; > } > - > + It's fine to clean up whitespace, but please do that in separate patches. Committed. Thanks, - Tom [nvptx] Fix whitespace in nvptx_single 2018-03-23 Tom de Vries <tom@codesourcery.com> * config/nvptx/nvptx.c (nvptx_single): Fix whitespace. --- gcc/config/nvptx/nvptx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index b7e3f59..50d7319 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -4100,7 +4100,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) pred = gen_reg_rtx (BImode); cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred; } - + rtx br; if (mode == GOMP_DIM_VECTOR) br = gen_br_true (pred, label);
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > +/* Loop structure of the function. The entire function is described as > + a NULL loop. */ > + > struct parallel > { > /* Parent parallel. */ You dropped this comment in "vector_length extension part 1: generalize function and variable names". It's good to add it back, but that needs to be a separate patch. Committed. Thanks, - Tom [nvptx] Re-add removed struct parallel comment 2018-03-23 Tom de Vries <tom@codesourcery.com> * config/nvptx/nvptx.c (struct parallel): Re-add comment. --- gcc/config/nvptx/nvptx.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 50d7319..9873449 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -2831,6 +2831,9 @@ struct offload_attrs int max_workers; }; +/* Loop structure of the function. The entire function is described as + a NULL loop. */ + struct parallel { /* Parent parallel. */
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md > index 28ae263c867..ac2731233dd 100644 > --- a/gcc/config/nvptx/nvptx.md > +++ b/gcc/config/nvptx/nvptx.md > @@ -1418,10 +1418,16 @@ > [(set_attr "atomic" "true")]) > > (define_insn "nvptx_barsync" > - [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] > + [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri") > + (match_operand:SI 1 "const_int_operand")] > UNSPECV_BARSYNC)] > "" > - "\\tbar.sync\\t%0;" > + { > + if (!REG_P (operands[0])) > + return "\\tbar.sync\\t%0;"; > + else > + return "\\tbar.sync\\t%0, %1;"; > + } > [(set_attr "predicable" "false")]) This is wrong. The first operand can be a register or a constant, and the second operand is independent. Whether or not we print the second operand is independent of whether the first is a register. In this patch I've reserved INTVAL (operands[1]) == 0 for the "no second operand" case. Committed. Thanks, - Tom [nvptx] Add thread count parm to bar.sync 2018-03-23 Tom de Vries <tom@codesourcery.com> * config/nvptx/nvptx.md (nvptx_barsync): Add and handle operand. * config/nvptx/nvptx.c (nvptx_cta_sync): Change arguments to take in a lock and thread count. Update call to gen_nvptx_barsync. (nvptx_single, nvptx_process_pars): Update calls to nvptx_cta_sync. --- gcc/config/nvptx/nvptx.c | 22 ++++++++++++++-------- gcc/config/nvptx/nvptx.md | 10 ++++++++-- 3 files changed, 29 insertions(+), 10 deletions(-) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 12441cb..32f2efb 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -3939,13 +3939,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block, return empty; } -/* Emit a CTA-level synchronization barrier. We use different - markers for before and after synchronizations. */ +/* Emit a CTA-level synchronization barrier (bar.sync). LOCK is the + barrier number, which is an integer or a register. THREADS is the + number of threads controlled by the barrier. */ static rtx -nvptx_cta_sync (bool after) +nvptx_cta_sync (rtx lock, int threads) { - return gen_nvptx_barsync (GEN_INT (after)); + return gen_nvptx_barsync (lock, GEN_INT (threads)); } #if WORKAROUND_PTXJIT_BUG @@ -4195,6 +4196,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) /* Includes worker mode, do spill & fill. By construction we should never have worker mode only. */ broadcast_data_t data; + rtx barrier = GEN_INT (0); + int threads = 0; data.base = oacc_bcast_sym; data.ptr = 0; @@ -4207,14 +4210,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) false), before); /* Barrier so other workers can see the write. */ - emit_insn_before (nvptx_cta_sync (false), tail); + emit_insn_before (nvptx_cta_sync (barrier, threads), tail); data.offset = 0; emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data, false), tail); /* This barrier is needed to avoid worker zero clobbering the broadcast buffer before all the other workers have had a chance to read this instance of it. */ - emit_insn_before (nvptx_cta_sync (false), tail); + emit_insn_before (nvptx_cta_sync (barrier, threads), tail); } extract_insn (tail); @@ -4331,12 +4334,15 @@ nvptx_process_pars (parallel *par) bool empty = nvptx_shared_propagate (true, is_call, par->forked_block, par->fork_insn, false); + rtx barrier = GEN_INT (0); + int threads = 0; if (!empty || !is_call) { /* Insert begin and end synchronizations. */ - emit_insn_before (nvptx_cta_sync (false), par->forked_insn); - emit_insn_before (nvptx_cta_sync (false), par->join_insn); + emit_insn_before (nvptx_cta_sync (barrier, threads), + par->forked_insn); + emit_insn_before (nvptx_cta_sync (barrier, threads), par->join_insn); } } else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md index 2b4bcb3a..2609222 100644 --- a/gcc/config/nvptx/nvptx.md +++ b/gcc/config/nvptx/nvptx.md @@ -1421,10 +1421,16 @@ [(set_attr "atomic" "true")]) (define_insn "nvptx_barsync" - [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] + [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri") + (match_operand:SI 1 "const_int_operand")] UNSPECV_BARSYNC)] "" - "\\tbar.sync\\t%0;" + { + if (INTVAL (operands[1]) == 0) + return "\\tbar.sync\\t%0;"; + else + return "\\tbar.sync\\t%0, %1;"; + } [(set_attr "predicable" "false")]) (define_insn "nvptx_nounroll"
On 03/22/2018 06:24 PM, Cesar Philippidis wrote: > On 03/22/2018 09:18 AM, Tom de Vries wrote: > >> That's obviously not good enough. >> >> When I compile this test-case: >> ... >> int >> main (void) >> { >> int a[10]; >> #pragma acc parallel num_workers (16) >> #pragma acc loop worker >> for (int i = 0; i < 10; i++) >> a[i] = i; >> >> return 0; >> } >> ... >> >> I get: >> ... >> .maxntid 32, 16, 1 >> ... >> >> That's the change you need to isolate. > > I attached an updated patch which incorporates the > cfun->machine->axis_dim changes. It now generates more precise arguments > for maxntid. > Even with maxntid dropped, axis_dim is still used elsewhere in the patch series, so we can split off the introduction of axis_dim and helper functions in a separate patch. Committed. Thanks, - Tom > Cesar > > > 0001-emit-.maxntid-hint.patch > > > From 11035dc92884146dc4d974156adcb260568db785 Mon Sep 17 00:00:00 2001 > From: Cesar Philippidis <cesar@codesourcery.com> > Date: Thu, 22 Mar 2018 08:05:53 -0700 > Subject: [PATCH] emit .maxntid hint > > --- > gcc/config/nvptx/nvptx.c | 19 +++++++++++++++++++ > gcc/config/nvptx/nvptx.h | 2 ++ > 2 files changed, 21 insertions(+) > > diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c > index eff87732c4b..3958f71e995 100644 > --- a/gcc/config/nvptx/nvptx.c > +++ b/gcc/config/nvptx/nvptx.c > @@ -76,6 +76,7 @@ > #include "target-def.h" > > #define WORKAROUND_PTXJIT_BUG 1 > +#define WORKAROUND_PTXJIT_BUG_3 1 > > /* Define dimension sizes for known hardware. */ > #define PTX_VECTOR_LENGTH 32 > @@ -1219,6 +1220,16 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl) > stream, in order to share the prototype writing code. */ > std::stringstream s; > write_fn_proto (s, true, name, decl); > + > +#if WORKAROUND_PTXJIT_BUG_3 > + /* Emitting a .maxntid seems to have the effect of encouraging the > + PTX JIT emit SYNC branches. */ > + if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl)) > + && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl))) > + s << ".maxntid " << cfun->machine->axis_dim[0] << ", " > + << cfun->machine->axis_dim[1] << ", 1\n"; > +#endif > + > s << "{\n"; > > bool return_in_mem = write_return_type (s, false, result_type); > @@ -2831,6 +2842,11 @@ struct offload_attrs > int max_workers; > }; > > +/* Define entries for cfun->machine->axis_dim. */ > + > +#define MACH_VECTOR_LENGTH 0 > +#define MACH_MAX_WORKERS 1 > + > struct parallel > { > /* Parent parallel. */ > @@ -4525,6 +4541,9 @@ nvptx_reorg (void) > > populate_offload_attrs (&oa); > > + cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length; > + cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers; > + > /* If there is worker neutering, there must be vector > neutering. Otherwise the hardware will fail. */ > gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) > diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h > index 8a14507c88a..958516da604 100644 > --- a/gcc/config/nvptx/nvptx.h > +++ b/gcc/config/nvptx/nvptx.h > @@ -226,6 +226,8 @@ struct GTY(()) machine_function > int return_mode; /* Return mode of current fn. > (machine_mode not defined yet.) */ > rtx axis_predicate[2]; /* Neutering predicates. */ > + int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is > + vector_length, dim[1] is num_workers. */ > rtx unisimt_master; /* 'Master lane index' for -muniform-simt. */ > rtx unisimt_predicate; /* Predicate for -muniform-simt. */ > rtx unisimt_location; /* Mask location for -muniform-simt. */ > [nvptx] Add axis_dim 2018-03-23 Tom de Vries <tom@codesourcery.com> * config/nvptx/nvptx.c (MACH_VECTOR_LENGTH, MACH_MAX_WORKERS): Define. (nvptx_mach_max_workers, nvptx_mach_vector_length): New function. (nvptx_reorg): Set function-specific axis_dim's. * config/nvptx/nvptx.h (struct machine_function): Add axis_dims. --- gcc/config/nvptx/nvptx.c | 20 ++++++++++++++++++++ gcc/config/nvptx/nvptx.h | 2 ++ 3 files changed, 29 insertions(+) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 32f2efb..3cb33ae 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -2831,6 +2831,23 @@ struct offload_attrs int max_workers; }; +/* Define entries for cfun->machine->axis_dim. */ + +#define MACH_VECTOR_LENGTH 0 +#define MACH_MAX_WORKERS 1 + +static int ATTRIBUTE_UNUSED +nvptx_mach_max_workers () +{ + return cfun->machine->axis_dim[MACH_MAX_WORKERS]; +} + +static int ATTRIBUTE_UNUSED +nvptx_mach_vector_length () +{ + return cfun->machine->axis_dim[MACH_VECTOR_LENGTH]; +} + /* Loop structure of the function. The entire function is described as a NULL loop. */ @@ -4534,6 +4551,9 @@ nvptx_reorg (void) populate_offload_attrs (&oa); + cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length; + cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers; + /* If there is worker neutering, there must be vector neutering. Otherwise the hardware will fail. */ gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h index 8a14507..784628e 100644 --- a/gcc/config/nvptx/nvptx.h +++ b/gcc/config/nvptx/nvptx.h @@ -226,6 +226,8 @@ struct GTY(()) machine_function int return_mode; /* Return mode of current fn. (machine_mode not defined yet.) */ rtx axis_predicate[2]; /* Neutering predicates. */ + int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is + vector_length, dim[1] is num_workers. */ rtx unisimt_master; /* 'Master lane index' for -muniform-simt. */ rtx unisimt_predicate; /* Predicate for -muniform-simt. */ rtx unisimt_location; /* Mask location for -muniform-simt. */
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > + if (cfun->machine->sync_bar) > + fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; " > + "// vector synchronization barrier\n", > + REGNO (cfun->machine->sync_bar)); I realize that atm we don't support large vector length when nesting a vector loop inside a worker loop, but ... if we did support that, and used a vector_length of 64, then with the "Maximum number of threads per block" of 1024 we have a possible 16 workers. And when using the maximum number of workers, we'll end up using logical barrier 16 (while we only have 0..15). It would be good to have at least an assert detecting this situation. Thanks, - Tom
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > As a follow up patch will show, the nvptx BE falls back to using > vector_length = 32 when a vector loop is nested inside a worker loop. I disabled the fallback, and analyzed the vred2d-128.c illegal memory access execution failure. I minimized that down to this ptx: ... .shared .align 8 .u8 __oacc_bcast[176]; { { .reg .u32 %x; mov.u32 %x,%tid.x; setp.ne.u32 %r86,%x,0; } { .reg .u32 %tidy; .reg .u64 %t_bcast; .reg .u64 %y64; mov.u32 %tidy,%tid.y; cvt.u64.u32 %y64,%tidy; add.u64 %y64,%y64,1; cvta.shared.u64 %t_bcast,__oacc_bcast; mad.lo.u64 %r66,%y64,88,%t_bcast; } @ %r86 bra $L28; st.u32 [%r66+80],0; $L28: ret; } ... The ptx is called with 2 workers and 128 vector_length. So, 2 workers mean %tid.y has values 0 and 1. Then %y64 has values 1 and 2. Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88). Then the st.u32 accesss __oacc_bcast + (1 * 88) + 80 and __oacc_bcast + (2 * 88) + 80. So we're accessing memory at location 256, while the __oacc_bcast is only 176 bytes big. I formulated this assert that AFAIU detects this situation in the compiler: ... @@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name) fprintf (file, "\t}\n"); } +static int nvptx_mach_max_workers (); + /* Emit code to initialize OpenACC worker broadcast and synchronization registers. */ @@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file) "// vector broadcast offset\n", REGNO (cfun->machine->bcast_partition), oacc_bcast_partition); + gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 1) <= oacc_bcast_size); } if (cfun->machine->sync_bar) fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; " ... The assert is not triggered when the fallback is used. Thanks, - Tom
On 03/30/2018 03:07 AM, Tom de Vries wrote: > On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >> As a follow up patch will show, the nvptx BE falls back to using >> vector_length = 32 when a vector loop is nested inside a worker loop. > > I disabled the fallback, and analyzed the vred2d-128.c illegal memory > access execution failure. > > I minimized that down to this ptx: > ... > .shared .align 8 .u8 __oacc_bcast[176]; > > { > { > .reg .u32 %x; > mov.u32 %x,%tid.x; > setp.ne.u32 %r86,%x,0; > } > > { > .reg .u32 %tidy; > .reg .u64 %t_bcast; > .reg .u64 %y64; > mov.u32 %tidy,%tid.y; > cvt.u64.u32 %y64,%tidy; > add.u64 %y64,%y64,1; > cvta.shared.u64 %t_bcast,__oacc_bcast; > mad.lo.u64 %r66,%y64,88,%t_bcast; > } > > @ %r86 bra $L28; > st.u32 [%r66+80],0; > $L28: > ret; > } > ... > > The ptx is called with 2 workers and 128 vector_length. > > So, 2 workers mean %tid.y has values 0 and 1. > Then %y64 has values 1 and 2. > Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88). > Then the st.u32 accesss __oacc_bcast + (1 * 88) + 80 and __oacc_bcast + > (2 * 88) + 80. > > So we're accessing memory at location 256, while the __oacc_bcast is > only 176 bytes big. > > I formulated this assert that AFAIU detects this situation in the compiler: > ... > @@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int regno, > const char *name) > fprintf (file, "\t}\n"); > } > > +static int nvptx_mach_max_workers (); > + > /* Emit code to initialize OpenACC worker broadcast and synchronization > registers. */ > > @@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file) > "// vector broadcast offset\n", > REGNO (cfun->machine->bcast_partition), > oacc_bcast_partition); > + gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + > 1) <= oacc_bcast_size); > } > if (cfun->machine->sync_bar) > fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; " > ... > > The assert is not triggered when the fallback is used. I've tracked the problem down to: ... > - if (oacc_bcast_size < data.offset) > - oacc_bcast_size = data.offset; > + if (oacc_bcast_partition < data.offset) > + { > + int psize = data.offset; > + psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1); > + oacc_bcast_partition = psize; > + oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1); > + } ... We hit this if clause for a first compiled function, with num_workers(1). This sets oacc_bcast_partition and oacc_bcast_size as required for that functions. Then we hit this if clause for a second compiled function, with num_workers (2). We need oacc_bcast_size updated, but the 'oacc_bcast_partition < data.offset' is false, so the update doesn't happen. I managed to fix this by making the code unconditional, and using MAX to update oacc_bcast_partition and oacc_bcast_size. Thanks, - Tom
On 03/30/2018 07:45 AM, Tom de Vries wrote: > On 03/30/2018 03:07 AM, Tom de Vries wrote: >> On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >>> As a follow up patch will show, the nvptx BE falls back to using >>> vector_length = 32 when a vector loop is nested inside a worker loop. >> >> I disabled the fallback, and analyzed the vred2d-128.c illegal memory >> access execution failure. >> >> I minimized that down to this ptx: >> ... >> .shared .align 8 .u8 __oacc_bcast[176]; >> >> { >> { >> .reg .u32 %x; >> mov.u32 %x,%tid.x; >> setp.ne.u32 %r86,%x,0; >> } >> >> { >> .reg .u32 %tidy; >> .reg .u64 %t_bcast; >> .reg .u64 %y64; >> mov.u32 %tidy,%tid.y; >> cvt.u64.u32 %y64,%tidy; >> add.u64 %y64,%y64,1; >> cvta.shared.u64 %t_bcast,__oacc_bcast; >> mad.lo.u64 %r66,%y64,88,%t_bcast; >> } >> >> @ %r86 bra $L28; >> st.u32 [%r66+80],0; >> $L28: >> ret; >> } >> ... >> >> The ptx is called with 2 workers and 128 vector_length. >> >> So, 2 workers mean %tid.y has values 0 and 1. >> Then %y64 has values 1 and 2. >> Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88). >> Then the st.u32 accesss __oacc_bcast + (1 * 88) + 80 and __oacc_bcast >> + (2 * 88) + 80. >> >> So we're accessing memory at location 256, while the __oacc_bcast is >> only 176 bytes big. >> >> I formulated this assert that AFAIU detects this situation in the >> compiler: >> ... >> @@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int >> regno, const char *name) >> fprintf (file, "\t}\n"); >> } >> >> +static int nvptx_mach_max_workers (); >> + >> /* Emit code to initialize OpenACC worker broadcast and synchronization >> registers. */ >> >> @@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file) >> "// vector broadcast offset\n", >> REGNO (cfun->machine->bcast_partition), >> oacc_bcast_partition); >> + gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + >> 1) <= oacc_bcast_size); >> } >> if (cfun->machine->sync_bar) >> fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; " >> ... >> >> The assert is not triggered when the fallback is used. > > I've tracked the problem down to: > ... >> - if (oacc_bcast_size < >> data.offset) >> - oacc_bcast_size = >> data.offset; >> + if (oacc_bcast_partition < >> data.offset) >> + >> { >> + int psize = >> data.offset; >> + psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align >> - 1); + >> oacc_bcast_partition = >> psize; >> + oacc_bcast_size = psize * (nvptx_mach_max_workers () + >> 1); + >> } > > ... > > We hit this if clause for a first compiled function, with num_workers(1). > > This sets oacc_bcast_partition and oacc_bcast_size as required for that > functions. > > Then we hit this if clause for a second compiled function, with > num_workers (2). > > We need oacc_bcast_size updated, but the 'oacc_bcast_partition < > data.offset' is false, so the update doesn't happen. > > I managed to fix this by making the code unconditional, and using MAX to > update oacc_bcast_partition and oacc_bcast_size. It looks like that's fallout from this patch <https://gcc.gnu.org/ml/gcc-patches/2018-03/msg01212.html>. I should have checked that patch with the vector length fallback disabled. Cesar
On 03/30/2018 05:00 PM, Cesar Philippidis wrote: > I should > have checked that patch with the vector length fallback disabled. Right. The patch series introduces a lot of code that is not exercised. I've added an -mlong-vector-in-workers option in my local branch and added 3 test-cases to exercise the code with fallback disabled everytime I run the libgomp tests. Thanks, - Tom
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > * config/nvptx/nvptx.c (oacc_bcast_partition): Declare. One last thing: this variable needs to be reset to zero for every function. Without this reset, we can generated different code for a function depending on whether there's another function in front or not. > (populate_offload_attrs): Handle the situation where the default > runtime geometry has not been initialized yet for reductions. I've moved this bit to "vector_length extension part 4: target hooks and automatic parallelism". Build on x86_64 with nvptx accelerator and tested libgomp. Committed. Thanks, - Tom [nvptx] Generalize state propagation and synchronization 2018-04-03 Cesar Philippidis <cesar@codesourcery.com> Tom de Vries <tom@codesourcery.com> * config/nvptx/nvptx.c (oacc_bcast_partition): Declare. (nvptx_option_override): Init oacc_bcast_partition. (nvptx_init_oacc_workers): New function. (nvptx_declare_function_name): Call nvptx_init_oacc_workers. (nvptx_needs_shared_bcast): New function. (nvptx_find_par): Generalize to enable vectors to use shared-memory to propagate state. (nvptx_shared_propagate): Initialize vector bcast partition and synchronization state. (nvptx_single): Generalize to enable vectors to use shared-memory to propagate state. (nvptx_process_pars): Likewise. * config/nvptx/nvptx.h (struct machine_function): Add bcast_partition and sync_bar members. --- gcc/config/nvptx/nvptx.c | 137 ++++++++++++++++++++++++++++++++++++++++++----- gcc/config/nvptx/nvptx.h | 4 ++ 2 files changed, 129 insertions(+), 12 deletions(-) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index d4ff730..0b46e13 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -133,6 +133,7 @@ static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab; memory. It'd be nice if PTX supported common blocks, because then this could be shared across TUs (taking the largest size). */ static unsigned oacc_bcast_size; +static unsigned oacc_bcast_partition; static unsigned oacc_bcast_align; static GTY(()) rtx oacc_bcast_sym; @@ -157,6 +158,8 @@ static bool need_softstack_decl; /* True if any function references __nvptx_uni. */ static bool need_unisimt_decl; +static int nvptx_mach_max_workers (); + /* Allocate a new, cleared machine_function structure. */ static struct machine_function * @@ -210,6 +213,7 @@ nvptx_option_override (void) oacc_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, "__oacc_bcast"); SET_SYMBOL_DATA_AREA (oacc_bcast_sym, DATA_AREA_SHARED); oacc_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + oacc_bcast_partition = 0; worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_red"); SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED); @@ -1097,6 +1101,40 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name) fprintf (file, "\t}\n"); } +/* Emit code to initialize OpenACC worker broadcast and synchronization + registers. */ + +static void +nvptx_init_oacc_workers (FILE *file) +{ + fprintf (file, "\t{\n"); + fprintf (file, "\t\t.reg.u32\t%%tidy;\n"); + if (cfun->machine->bcast_partition) + { + fprintf (file, "\t\t.reg.u64\t%%t_bcast;\n"); + fprintf (file, "\t\t.reg.u64\t%%y64;\n"); + } + fprintf (file, "\t\tmov.u32\t\t%%tidy, %%tid.y;\n"); + if (cfun->machine->bcast_partition) + { + fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tidy;\n"); + fprintf (file, "\t\tadd.u64\t\t%%y64, %%y64, 1; // vector ID\n"); + fprintf (file, "\t\tcvta.shared.u64\t%%t_bcast, __oacc_bcast;\n"); + fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_bcast; " + "// vector broadcast offset\n", + REGNO (cfun->machine->bcast_partition), + oacc_bcast_partition); + } + /* Verify oacc_bcast_size. */ + gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 1) + <= oacc_bcast_size); + if (cfun->machine->sync_bar) + fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; " + "// vector synchronization barrier\n", + REGNO (cfun->machine->sync_bar)); + fprintf (file, "\t}\n"); +} + /* Emit code to initialize predicate and master lane index registers for -muniform-simt code generation variant. */ @@ -1323,6 +1361,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl) if (cfun->machine->unisimt_predicate || (cfun->machine->has_simtreg && !crtl->is_leaf)) nvptx_init_unisimt_predicate (file); + if (cfun->machine->bcast_partition || cfun->machine->sync_bar) + nvptx_init_oacc_workers (file); } /* Output code for switching uniform-simt state. ENTERING indicates whether @@ -3000,6 +3040,19 @@ nvptx_split_blocks (bb_insn_map_t *map) } } +/* Return true if MASK contains parallelism that requires shared + memory to broadcast. */ + +static bool +nvptx_needs_shared_bcast (unsigned mask) +{ + bool worker = mask & GOMP_DIM_MASK (GOMP_DIM_WORKER); + bool large_vector = (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) + && nvptx_mach_vector_length () != PTX_WARP_SIZE; + + return worker || large_vector; +} + /* BLOCK is a basic block containing a head or tail instruction. Locate the associated prehead or pretail instruction, which must be in the single predecessor block. */ @@ -3075,7 +3128,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block) par = new parallel (par, mask); par->forked_block = block; par->forked_insn = end; - if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) + if (nvptx_needs_shared_bcast (mask)) par->fork_insn = nvptx_discover_pre (block, CODE_FOR_nvptx_fork); } @@ -3090,7 +3143,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block) gcc_assert (par->mask == mask); par->join_block = block; par->join_insn = end; - if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) + if (nvptx_needs_shared_bcast (mask)) par->joining_insn = nvptx_discover_pre (block, CODE_FOR_nvptx_joining); par = par->parent; @@ -3947,11 +4000,33 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block, gcc_assert (empty == !data.offset); if (data.offset) { + rtx bcast_sym = oacc_bcast_sym; + /* Stuff was emitted, initialize the base pointer now. */ - rtx init = gen_rtx_SET (data.base, oacc_bcast_sym); + if (vector && nvptx_mach_max_workers () > 1) + { + if (!cfun->machine->bcast_partition) + { + /* It would be nice to place this register in + DATA_AREA_SHARED. */ + cfun->machine->bcast_partition = gen_reg_rtx (DImode); + } + if (!cfun->machine->sync_bar) + cfun->machine->sync_bar = gen_reg_rtx (SImode); + + bcast_sym = cfun->machine->bcast_partition; + } + + rtx init = gen_rtx_SET (data.base, bcast_sym); emit_insn_after (init, insn); - oacc_bcast_size = MAX (oacc_bcast_size, data.offset); + unsigned int psize = ROUND_UP (data.offset, oacc_bcast_align); + unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE + ? nvptx_mach_max_workers () + 1 + : 1); + + oacc_bcast_partition = MAX (oacc_bcast_partition, psize); + oacc_bcast_size = MAX (oacc_bcast_size, psize * pnum); } return empty; } @@ -4146,7 +4221,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) { rtx pvar = XEXP (XEXP (cond_branch, 0), 0); - if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask) + if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask + && nvptx_mach_vector_length () == PTX_WARP_SIZE) { /* Vector mode only, do a shuffle. */ #if WORKAROUND_PTXJIT_BUG @@ -4213,23 +4289,51 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) /* Includes worker mode, do spill & fill. By construction we should never have worker mode only. */ broadcast_data_t data; + unsigned size = GET_MODE_SIZE (SImode); + bool vector = true; rtx barrier = GEN_INT (0); int threads = 0; + if (GOMP_DIM_MASK (GOMP_DIM_WORKER) == mask) + vector = false; + data.base = oacc_bcast_sym; data.ptr = 0; - oacc_bcast_size = MAX (oacc_bcast_size, GET_MODE_SIZE (SImode)); + if (vector + && nvptx_mach_max_workers () > 1 + && cfun->machine->bcast_partition) + data.base = cfun->machine->bcast_partition; + + gcc_assert (data.base != NULL); + + unsigned int psize = ROUND_UP (size, oacc_bcast_align); + unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE + ? nvptx_mach_max_workers () + 1 + : 1); + + oacc_bcast_partition = MAX (oacc_bcast_partition, psize); + oacc_bcast_size = MAX (oacc_bcast_size, psize * pnum); data.offset = 0; emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data, - false), + vector), before); + + if (vector + && nvptx_mach_max_workers () > 1 + && cfun->machine->sync_bar) + { + barrier = cfun->machine->sync_bar; + threads = nvptx_mach_vector_length (); + } + /* Barrier so other workers can see the write. */ emit_insn_before (nvptx_cta_sync (barrier, threads), tail); data.offset = 0; emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data, - false), tail); + vector), + tail); /* This barrier is needed to avoid worker zero clobbering the broadcast buffer before all the other workers have had a chance to read this instance of it. */ @@ -4342,17 +4446,26 @@ nvptx_process_pars (parallel *par) } bool is_call = (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX)) != 0; - - if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) + bool worker = (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)); + bool large_vector = ((par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) + && nvptx_mach_vector_length () > PTX_WARP_SIZE); + + if (worker || large_vector) { nvptx_shared_propagate (false, is_call, par->forked_block, - par->forked_insn, false); + par->forked_insn, !worker); bool empty = nvptx_shared_propagate (true, is_call, par->forked_block, par->fork_insn, - false); + !worker); rtx barrier = GEN_INT (0); int threads = 0; + if (!worker && cfun->machine->sync_bar) + { + barrier = cfun->machine->sync_bar; + threads = nvptx_mach_vector_length (); + } + if (!empty || !is_call) { /* Insert begin and end synchronizations. */ diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h index 784628e..fb9f04b 100644 --- a/gcc/config/nvptx/nvptx.h +++ b/gcc/config/nvptx/nvptx.h @@ -228,6 +228,10 @@ struct GTY(()) machine_function rtx axis_predicate[2]; /* Neutering predicates. */ int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is vector_length, dim[1] is num_workers. */ + rtx bcast_partition; /* Register containing the size of each + vector's partition of share-memory used to + broadcast state. */ + rtx sync_bar; /* Synchronization barrier ID for vectors. */ rtx unisimt_master; /* 'Master lane index' for -muniform-simt. */ rtx unisimt_predicate; /* Predicate for -muniform-simt. */ rtx unisimt_location; /* Mask location for -muniform-simt. */
On 04/03/2018 05:00 PM, Tom de Vries wrote: > On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >> * config/nvptx/nvptx.c (oacc_bcast_partition): Declare. > > One last thing: this variable needs to be reset to zero for every function. > > Without this reset, we can generated different code for a function > depending on whether there's another function in front or not. In the previous commit, I set that variable in nvptx_option_override, but as I've found out that's not enough. This patch does the init in nvptx_set_current_function. Build x86_64 with nvptx accelerator and reg-tested libgomp. Committed. Thanks, - Tom [nvptx] Add per-function initialization of oacc_broadcast_partition 2018-04-05 Tom de Vries <tom@codesourcery.com> * config/nvptx/nvptx.c (nvptx_set_current_function): Initialize oacc_broadcast_partition. --- gcc/config/nvptx/nvptx.c | 1 + 1 file changed, 1 insertion(+) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 0b46e13..009ca59 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -5962,6 +5962,7 @@ nvptx_set_current_function (tree fndecl) gangprivate_shared_hmap.empty (); nvptx_previous_fndecl = fndecl; + oacc_bcast_partition = 0; } #undef TARGET_OPTION_OVERRIDE
On 04/03/2018 05:00 PM, Tom de Vries wrote: > + unsigned int psize = ROUND_UP (data.offset, oacc_bcast_align); > + unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE > + ? nvptx_mach_max_workers () + 1 > + : 1); This claims too much space for a simple long vector loop. Filed as PR85231 - "[og7, openacc, nvptx] Too much shared memory claimed for long vector length". Thanks, - Tom
On 03/30/2018 05:14 PM, Tom de Vries wrote: > On 03/30/2018 05:00 PM, Cesar Philippidis wrote: >> I should >> have checked that patch with the vector length fallback disabled. > > Right. The patch series introduces a lot of code that is not exercised. > > I've added an -mlong-vector-in-workers option in my local branch and > added 3 test-cases to exercise the code with fallback disabled everytime > I run the libgomp tests. > This patch adds that option. Build x86_64 with nvptx accelerator and tested libgomp. Committed. Thanks, - Tom [nvptx] Add -mlong-vector-in-workers 2018-04-05 Tom de Vries <tom@codesourcery.com> * config/nvptx/nvptx.c (nvptx_adjust_parallelism): Handle nvptx_long_vectors_in_workers. * config/nvptx/nvptx.opt (mlong-vector-in-workers): Add option. * testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c: New test. * testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c: New test. * testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c: New test. * testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c: New test. * testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c: New test. --- gcc/config/nvptx/nvptx.c | 3 +- gcc/config/nvptx/nvptx.opt | 3 ++ .../vector-length-128-4.c | 41 ++++++++++++++++++++ .../vector-length-128-5.c | 42 +++++++++++++++++++++ .../vector-length-128-6.c | 42 +++++++++++++++++++++ .../vector-length-128-8.c | 44 ++++++++++++++++++++++ .../vector-length-128-9.c | 44 ++++++++++++++++++++++ 7 files changed, 218 insertions(+), 1 deletion(-) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 595413a..b5e6dce 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -5397,7 +5397,8 @@ nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask) worker loop. Therefore, fallback to setting vector_length to PTX_WARP_SIZE. Hopefully this condition may be relaxed for sm_70+ targets. */ - if ((inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) + if (nvptx_long_vectors_in_workers == 0 + && (inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) && (outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))) { tree attr = tree_cons (get_identifier (NVPTX_GOACC_VL_WARP), NULL_TREE, diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt index e2d64bd..f7f37ec 100644 --- a/gcc/config/nvptx/nvptx.opt +++ b/gcc/config/nvptx/nvptx.opt @@ -62,3 +62,6 @@ Enum(ptx_isa) String(sm_35) Value(PTX_ISA_SM35) misa= Target RejectNegative ToLower Joined Enum(ptx_isa) Var(ptx_isa_option) Init(PTX_ISA_SM30) Specify the version of the ptx ISA to use. + +mlong-vector-in-workers +Target Var(nvptx_long_vectors_in_workers) Undocumented Init(0) diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c new file mode 100644 index 0000000..6d43f82 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c @@ -0,0 +1,41 @@ +/* { dg-do run { target openacc_nvidia_accel_selected } } */ +/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */ +/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */ +/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */ + +#include <stdlib.h> + +#define N 1024 + +unsigned int a[N]; +unsigned int b[N]; +unsigned int c[N]; +unsigned int n = N; + +int +main (void) +{ + for (unsigned int i = 0; i < n; ++i) + { + a[i] = i % 3; + b[i] = i % 5; + } + +#pragma acc parallel num_workers (2) vector_length (128) copyin (a,b) copyout (c) + { +#pragma acc loop worker + for (unsigned int i = 0; i < 4; i++) +#pragma acc loop vector + for (unsigned int j = 0; j < n / 4; j++) + c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j]; + } + + for (unsigned int i = 0; i < n; ++i) + if (c[i] != (i % 3) + (i % 5)) + abort (); + + return 0; +} + +/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 2, 128\\)" "oaccdevlow" } } */ +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c new file mode 100644 index 0000000..661fdc7 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c @@ -0,0 +1,42 @@ +/* { dg-do run { target openacc_nvidia_accel_selected } } */ +/* { dg-additional-options "-fopenacc-dim=-:2:128" } */ +/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */ +/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */ +/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */ + +#include <stdlib.h> + +#define N 1024 + +unsigned int a[N]; +unsigned int b[N]; +unsigned int c[N]; +unsigned int n = N; + +int +main (void) +{ + for (unsigned int i = 0; i < n; ++i) + { + a[i] = i % 3; + b[i] = i % 5; + } + +#pragma acc parallel copyin (a,b) copyout (c) + { +#pragma acc loop worker + for (unsigned int i = 0; i < 4; i++) +#pragma acc loop vector + for (unsigned int j = 0; j < n / 4; j++) + c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j]; + } + + for (unsigned int i = 0; i < n; ++i) + if (c[i] != (i % 3) + (i % 5)) + abort (); + + return 0; +} + +/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 2, 128\\)" "oaccdevlow" } } */ +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c new file mode 100644 index 0000000..91f611e --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c @@ -0,0 +1,42 @@ +/* { dg-do run { target openacc_nvidia_accel_selected } } */ +/* { dg-set-target-env-var "GOMP_OPENACC_DIM" ":2:" } */ +/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */ +/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */ +/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */ + +#include <stdlib.h> + +#define N 1024 + +unsigned int a[N]; +unsigned int b[N]; +unsigned int c[N]; +unsigned int n = N; + +int +main (void) +{ + for (unsigned int i = 0; i < n; ++i) + { + a[i] = i % 3; + b[i] = i % 5; + } + +#pragma acc parallel vector_length (128) copyin (a,b) copyout (c) + { +#pragma acc loop worker + for (unsigned int i = 0; i < 4; i++) +#pragma acc loop vector + for (unsigned int j = 0; j < n / 4; j++) + c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j]; + } + + for (unsigned int i = 0; i < n; ++i) + if (c[i] != (i % 3) + (i % 5)) + abort (); + + return 0; +} + +/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccdevlow" } } */ +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c new file mode 100644 index 0000000..6246067 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-8.c @@ -0,0 +1,44 @@ +/* { dg-do run { target openacc_nvidia_accel_selected } } */ +/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */ +/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */ +/* { dg-additional-options "-fopenacc-dim=-:-:-" } */ +/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */ + +#include <stdlib.h> + +#define N 1024 + +unsigned int a[N]; +unsigned int b[N]; +unsigned int c[N]; +unsigned int n = N; + +int +main (void) +{ + for (unsigned int i = 0; i < n; ++i) + { + a[i] = i % 3; + b[i] = i % 5; + } + +#pragma acc parallel copyin (a,b) copyout (c) + { +#pragma acc loop worker + for (unsigned int i = 0; i < 4; i++) +#pragma acc loop vector + for (unsigned int j = 0; j < n / 4; j++) + c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j]; + } + + for (unsigned int i = 0; i < n; ++i) + if (c[i] != (i % 3) + (i % 5)) + abort (); + + return 0; +} + +/* { dg-prune-output "using vector_length \\(32\\), ignoring runtime setting" } */ + +/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 32\\)" "oaccdevlow" } } */ +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=32, vectors=32" } */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c new file mode 100644 index 0000000..2f8b4b7 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-9.c @@ -0,0 +1,44 @@ +/* { dg-do run { target openacc_nvidia_accel_selected } } */ +/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */ +/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */ +/* { dg-additional-options "-fopenacc-dim=-:8:-" } */ +/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */ + +#include <stdlib.h> + +#define N 1024 + +unsigned int a[N]; +unsigned int b[N]; +unsigned int c[N]; +unsigned int n = N; + +int +main (void) +{ + for (unsigned int i = 0; i < n; ++i) + { + a[i] = i % 3; + b[i] = i % 5; + } + +#pragma acc parallel copyin (a,b) copyout (c) + { +#pragma acc loop worker + for (unsigned int i = 0; i < 4; i++) +#pragma acc loop vector + for (unsigned int j = 0; j < n / 4; j++) + c[(i * N / 4) + j] = a[(i * N / 4) + j] + b[(i * N / 4) + j]; + } + + for (unsigned int i = 0; i < n; ++i) + if (c[i] != (i % 3) + (i % 5)) + abort (); + + return 0; +} + +/* { dg-prune-output "using vector_length \\(32\\), ignoring runtime setting" } */ + +/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 8, 32\\)" "oaccdevlow" } } */ +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=32" } */
2018-03-02 Cesar Philippidis <cesar@codesourcery.com> gcc/ * config/nvptx/nvptx.c (oacc_bcast_partition): Declare. (nvptx_init_axis_predicate): Initialize vector_red_partition. (nvptx_init_oacc_workers): New function. (nvptx_declare_function_name): Emit a .maxntid directive hint and call nvptx_init_oacc_workers. (MACH_VECTOR_LENGTH, MACH_MAX_WORKERS): Define. (nvptx_mach_max_workers): New function. (nvptx_mach_vector_length): New function. (nvptx_needs_shared_bcast): New function. (nvptx_find_par): Generalize to enable vectors to use shared-memory to propagate state. (nvptx_shared_propagate): Iniitalize vector bcast partition and synchronization state. (nvptx_cta_sync): Change arguments to take in a lock and thread count. Update call to gen_nvptx_barsync. (nvptx_single): Generalize to enable vectors to use shared-memory to propagate state. (nvptx_process_pars): Likewise. (populate_offload_attrs): Handle the situation where the default runtime geometry has not been initialized yet for reductions. (nvptx_reorg): Set function-specific axis_dim's. * config/nvptx/nvptx.h (struct machine_function): Add axis_dims, bcast_partition, red_partition and sync_bar members. * config/nvptx/nvptx.md (nvptx_barsync): Adjust operands. From 0a1dd1d85e47feeaa6f7a2e070baba69dadea444 Mon Sep 17 00:00:00 2001 From: Cesar Philippidis <cesar@codesourcery.com> Date: Fri, 2 Mar 2018 07:39:25 -0800 Subject: [PATCH] bar and sync --- gcc/config/nvptx/nvptx.c | 226 ++++++++++++++++++++++++++++++++++++++++------ gcc/config/nvptx/nvptx.h | 8 ++ gcc/config/nvptx/nvptx.md | 10 +- 3 files changed, 214 insertions(+), 30 deletions(-) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 9d77176c638..507c8671704 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -133,6 +133,7 @@ static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab; memory. It'd be nice if PTX supported common blocks, because then this could be shared across TUs (taking the largest size). */ static unsigned oacc_bcast_size; +static unsigned oacc_bcast_partition; static unsigned oacc_bcast_align; static GTY(()) rtx oacc_bcast_sym; @@ -1104,8 +1105,53 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name) { fprintf (file, "\t{\n"); fprintf (file, "\t\t.reg.u32\t%%%s;\n", name); - fprintf (file, "\t\tmov.u32\t%%%s, %%tid.%s;\n", name, name); + if (strcmp (name, "x") == 0 && cfun->machine->red_partition) + { + fprintf (file, "\t\t.reg.u64\t%%t_red;\n"); + fprintf (file, "\t\t.reg.u64\t%%y64;\n"); + } + fprintf (file, "\t\tmov.u32\t\t%%%s, %%tid.%s;\n", name, name); fprintf (file, "\t\tsetp.ne.u32\t%%r%d, %%%s, 0;\n", regno, name); + if (strcmp (name, "x") == 0 && cfun->machine->red_partition) + { + fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tid.y;\n"); + fprintf (file, "\t\tcvta.shared.u64\t%%t_red, __vector_red;\n"); + fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_red; " + "// vector reduction buffer\n", + REGNO (cfun->machine->red_partition), + vector_red_partition); + } + fprintf (file, "\t}\n"); +} + +/* Emit code to initialize OpenACC worker broadcast and synchronization + registers. */ + +static void +nvptx_init_oacc_workers (FILE *file) +{ + fprintf (file, "\t{\n"); + fprintf (file, "\t\t.reg.u32\t%%tidy;\n"); + if (cfun->machine->bcast_partition) + { + fprintf (file, "\t\t.reg.u64\t%%t_bcast;\n"); + fprintf (file, "\t\t.reg.u64\t%%y64;\n"); + } + fprintf (file, "\t\tmov.u32\t\t%%tidy, %%tid.y;\n"); + if (cfun->machine->bcast_partition) + { + fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tidy;\n"); + fprintf (file, "\t\tadd.u64\t\t%%y64, %%y64, 1; // vector ID\n"); + fprintf (file, "\t\tcvta.shared.u64\t%%t_bcast, __oacc_bcast;\n"); + fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_bcast; " + "// vector broadcast offset\n", + REGNO (cfun->machine->bcast_partition), + oacc_bcast_partition); + } + if (cfun->machine->sync_bar) + fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; " + "// vector synchronization barrier\n", + REGNO (cfun->machine->sync_bar)); fprintf (file, "\t}\n"); } @@ -1231,6 +1277,13 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl) stream, in order to share the prototype writing code. */ std::stringstream s; write_fn_proto (s, true, name, decl); + + /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches. */ + if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl)) + && lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl))) + s << ".maxntid " << cfun->machine->axis_dim[0] << ", " + << cfun->machine->axis_dim[1] << ", 1\n"; + s << "{\n"; bool return_in_mem = write_return_type (s, false, result_type); @@ -1341,6 +1394,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl) if (cfun->machine->unisimt_predicate || (cfun->machine->has_simtreg && !crtl->is_leaf)) nvptx_init_unisimt_predicate (file); + if (cfun->machine->bcast_partition || cfun->machine->sync_bar) + nvptx_init_oacc_workers (file); } /* Output code for switching uniform-simt state. ENTERING indicates whether @@ -2849,6 +2904,26 @@ struct offload_attrs int max_workers; }; +/* Define entries for cfun->machine->axis_dim. */ + +#define MACH_VECTOR_LENGTH 0 +#define MACH_MAX_WORKERS 1 + +static int +nvptx_mach_max_workers () +{ + return cfun->machine->axis_dim[MACH_MAX_WORKERS]; +} + +static int +nvptx_mach_vector_length () +{ + return cfun->machine->axis_dim[MACH_VECTOR_LENGTH]; +} + +/* Loop structure of the function. The entire function is described as + a NULL loop. */ + struct parallel { /* Parent parallel. */ @@ -2996,6 +3071,19 @@ nvptx_split_blocks (bb_insn_map_t *map) } } +/* Return true if MASK contains parallelism that requires shared + memory to broadcast. */ + +static bool +nvptx_needs_shared_bcast (unsigned mask) +{ + bool worker = mask & GOMP_DIM_MASK (GOMP_DIM_WORKER); + bool large_vector = (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) + && nvptx_mach_vector_length () != PTX_WARP_SIZE; + + return worker || large_vector; +} + /* BLOCK is a basic block containing a head or tail instruction. Locate the associated prehead or pretail instruction, which must be in the single predecessor block. */ @@ -3071,7 +3159,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block) par = new parallel (par, mask); par->forked_block = block; par->forked_insn = end; - if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) + if (nvptx_needs_shared_bcast (mask)) par->fork_insn = nvptx_discover_pre (block, CODE_FOR_nvptx_fork); } @@ -3086,7 +3174,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block) gcc_assert (par->mask == mask); par->join_block = block; par->join_insn = end; - if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) + if (nvptx_needs_shared_bcast (mask)) par->joining_insn = nvptx_discover_pre (block, CODE_FOR_nvptx_joining); par = par->parent; @@ -3944,23 +4032,45 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block, gcc_assert (empty == !data.offset); if (data.offset) { + rtx bcast_sym = oacc_bcast_sym; + /* Stuff was emitted, initialize the base pointer now. */ - rtx init = gen_rtx_SET (data.base, oacc_bcast_sym); + if (vector && nvptx_mach_max_workers () > 1) + { + if (!cfun->machine->bcast_partition) + { + /* It would be nice to place this register in + DATA_AREA_SHARED. */ + cfun->machine->bcast_partition = gen_reg_rtx (DImode); + } + if (!cfun->machine->sync_bar) + cfun->machine->sync_bar = gen_reg_rtx (SImode); + + bcast_sym = cfun->machine->bcast_partition; + } + + rtx init = gen_rtx_SET (data.base, bcast_sym); emit_insn_after (init, insn); - if (oacc_bcast_size < data.offset) - oacc_bcast_size = data.offset; + if (oacc_bcast_partition < data.offset) + { + int psize = data.offset; + psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1); + oacc_bcast_partition = psize; + oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1); + } } return empty; } -/* Emit a CTA-level synchronization barrier. We use different - markers for before and after synchronizations. */ +/* Emit a CTA-level synchronization barrier (bar.sync). LOCK is the + barrier number, which is an integer or a register. THREADS is the + number of threads controlled by the barrier. */ static rtx -nvptx_cta_sync (bool after) +nvptx_cta_sync (rtx lock, int threads) { - return gen_nvptx_barsync (GEN_INT (after)); + return gen_nvptx_barsync (lock, GEN_INT (threads)); } #if WORKAROUND_PTXJIT_BUG @@ -4115,13 +4225,23 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) pred = gen_reg_rtx (BImode); cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred; } - + rtx br; if (mode == GOMP_DIM_VECTOR) br = gen_br_true (pred, label); else br = gen_br_true_uni (pred, label); - emit_insn_before (br, head); + + if (recog_memoized (head) == CODE_FOR_nvptx_forked + && recog_memoized (NEXT_INSN (head)) == CODE_FOR_nvptx_barsync) + { + head = NEXT_INSN (head); + emit_insn_after (br, head); + } + else if (recog_memoized (head) == CODE_FOR_nvptx_barsync) + emit_insn_after (br, head); + else + emit_insn_before (br, head); LABEL_NUSES (label)++; if (tail_branch) @@ -4135,7 +4255,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) { rtx pvar = XEXP (XEXP (cond_branch, 0), 0); - if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask) + if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask + && nvptx_mach_vector_length () == PTX_WARP_SIZE) { /* Vector mode only, do a shuffle. */ #if WORKAROUND_PTXJIT_BUG @@ -4202,26 +4323,55 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) /* Includes worker mode, do spill & fill. By construction we should never have worker mode only. */ broadcast_data_t data; + unsigned size = GET_MODE_SIZE (SImode); + bool vector = true; + rtx barrier = GEN_INT (0); + int threads = 0; + + if (GOMP_DIM_MASK (GOMP_DIM_WORKER) == mask) + vector = false; data.base = oacc_bcast_sym; data.ptr = 0; - if (oacc_bcast_size < GET_MODE_SIZE (SImode)) - oacc_bcast_size = GET_MODE_SIZE (SImode); + if (vector + && nvptx_mach_max_workers () > 1 + && cfun->machine->bcast_partition) + data.base = cfun->machine->bcast_partition; + + gcc_assert (data.base != NULL); + + if (oacc_bcast_partition < size) + { + int psize = size; + psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1); + oacc_bcast_partition = psize; + oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1); + } data.offset = 0; emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data, - false), + vector), before); + + if (vector + && nvptx_mach_max_workers () > 1 + && cfun->machine->sync_bar) + { + barrier = cfun->machine->sync_bar; + threads = nvptx_mach_vector_length (); + } + /* Barrier so other workers can see the write. */ - emit_insn_before (nvptx_cta_sync (false), tail); + emit_insn_before (nvptx_cta_sync (barrier, threads), tail); data.offset = 0; emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data, - false), tail); + vector), + tail); /* This barrier is needed to avoid worker zero clobbering the broadcast buffer before all the other workers have had a chance to read this instance of it. */ - emit_insn_before (nvptx_cta_sync (true), tail); + emit_insn_before (nvptx_cta_sync (barrier, threads), tail); } extract_insn (tail); @@ -4330,20 +4480,32 @@ nvptx_process_pars (parallel *par) } bool is_call = (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX)) != 0; - - if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) + bool worker = (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)); + bool large_vector = ((par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) + && nvptx_mach_vector_length () > PTX_WARP_SIZE); + + if (worker || large_vector) { nvptx_shared_propagate (false, is_call, par->forked_block, - par->forked_insn, false); + par->forked_insn, !worker); bool empty = nvptx_shared_propagate (true, is_call, par->forked_block, par->fork_insn, - false); + !worker); + rtx barrier = GEN_INT (0); + int threads = 0; + + if (!worker && cfun->machine->sync_bar) + { + barrier = cfun->machine->sync_bar; + threads = nvptx_mach_vector_length (); + } if (!empty || !is_call) { /* Insert begin and end synchronizations. */ - emit_insn_after (nvptx_cta_sync (false), par->forked_insn); - emit_insn_before (nvptx_cta_sync (true), par->joining_insn); + emit_insn_after (nvptx_cta_sync (barrier, threads), par->forked_insn); + emit_insn_before (nvptx_cta_sync (barrier, threads), + par->joining_insn); } } else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) @@ -4469,15 +4631,20 @@ populate_offload_attrs (offload_attrs *oa) if (oa->vector_length == 0) { /* FIXME: Need a more graceful way to handle large vector - lengths in OpenACC routines. */ + lengths in OpenACC routines and also -fopenacc-dims. */ if (!lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (current_function_decl))) oa->vector_length = PTX_WARP_SIZE; - else + else if (PTX_VECTOR_LENGTH != PTX_WARP_SIZE) oa->vector_length = PTX_VECTOR_LENGTH; } if (oa->num_workers == 0) - oa->max_workers = PTX_CTA_SIZE / oa->vector_length; + { + if (oa->vector_length == 0) + oa->max_workers = PTX_WORKER_LENGTH; + else + oa->max_workers = PTX_CTA_SIZE / oa->vector_length; + } else oa->max_workers = oa->num_workers; } @@ -4535,6 +4702,9 @@ nvptx_reorg (void) populate_offload_attrs (&oa); + cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length; + cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers; + /* If there is worker neutering, there must be vector neutering. Otherwise the hardware will fail. */ gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h index 8a14507c88a..99943025a50 100644 --- a/gcc/config/nvptx/nvptx.h +++ b/gcc/config/nvptx/nvptx.h @@ -226,6 +226,14 @@ struct GTY(()) machine_function int return_mode; /* Return mode of current fn. (machine_mode not defined yet.) */ rtx axis_predicate[2]; /* Neutering predicates. */ + int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is + vector_length, dim[1] is num_workers. */ + rtx bcast_partition; /* Register containing the size of each + vector's partition of share-memory used to + broadcast state. */ + rtx red_partition; /* Similar to bcast_partition, except for vector + reductions. */ + rtx sync_bar; /* Synchronization barrier ID for vectors. */ rtx unisimt_master; /* 'Master lane index' for -muniform-simt. */ rtx unisimt_predicate; /* Predicate for -muniform-simt. */ rtx unisimt_location; /* Mask location for -muniform-simt. */ diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md index 28ae263c867..ac2731233dd 100644 --- a/gcc/config/nvptx/nvptx.md +++ b/gcc/config/nvptx/nvptx.md @@ -1418,10 +1418,16 @@ [(set_attr "atomic" "true")]) (define_insn "nvptx_barsync" - [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] + [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri") + (match_operand:SI 1 "const_int_operand")] UNSPECV_BARSYNC)] "" - "\\tbar.sync\\t%0;" + { + if (!REG_P (operands[0])) + return "\\tbar.sync\\t%0;"; + else + return "\\tbar.sync\\t%0, %1;"; + } [(set_attr "predicable" "false")]) (define_insn "nvptx_nounroll" -- 2.14.3