From patchwork Mon Jan 7 09:03:32 2019
X-Patchwork-Submitter: Tom de Vries
X-Patchwork-Id: 1021218
Subject: [nvptx] Handle large vector reductions
From: Tom de Vries
To: "Schwinge, Thomas"
Cc: "gcc-patches@gcc.gnu.org"
References: <2ece5d7b-3675-84ab-f255-3c56a2ffd7dc@suse.de> <91b927af-d854-2865-7cbd-9a9a835ab5cc@codesourcery.com> <1394d89c-896e-f6a3-5f9a-78e98b16e85c@suse.de>
Message-ID: <576fee81-6570-2057-900c-c4ceeb6f7fe7@suse.de>
Date: Mon, 7 Jan 2019 10:03:32 +0100
In-Reply-To: <1394d89c-896e-f6a3-5f9a-78e98b16e85c@suse.de>

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0024-nvptx-Handle-large-vector-reductions.patch

Committed.

Thanks,
- Tom

[nvptx] Handle large vector reductions

Add support for vector reductions with OpenACC vector_length larger than
warp-size.

2018-12-17  Tom de Vries

	* config/nvptx/nvptx-protos.h (nvptx_output_red_partition): Declare.
	* config/nvptx/nvptx.c (vector_red_size, vector_red_align,
	vector_red_partition, vector_red_sym): New global variables.
	(nvptx_option_override): Initialize vector_red_sym.
	(nvptx_declare_function_name): Restore red_partition register.
	(nvptx_file_end): Emit code to declare the vector reduction variables.
	(nvptx_output_red_partition): New function.
	(nvptx_expand_shared_addr): Add vector argument. Use it to handle
	large vector reductions.
	(enum nvptx_builtins): Add NVPTX_BUILTIN_VECTOR_ADDR.
	(nvptx_init_builtins): Add VECTOR_ADDR.
	(nvptx_expand_builtin): Update call to nvptx_expand_shared_addr.
	Handle NVPTX_BUILTIN_VECTOR_ADDR.
	(nvptx_get_shared_red_addr): Add vector argument and handle large
	vectors.
	(nvptx_goacc_reduction_setup): Add offload_attrs argument and handle
	large vectors.
	(nvptx_goacc_reduction_init): Likewise.
	(nvptx_goacc_reduction_fini): Likewise.
	(nvptx_goacc_reduction_teardown): Likewise.
	(nvptx_goacc_reduction): Update calls to nvptx_goacc_reduction_{setup,
	init,fini,teardown}.
	(nvptx_init_axis_predicate): Initialize vector_red_partition.
	(nvptx_set_current_function): Init vector_red_partition.
	* config/nvptx/nvptx.md (UNSPECV_RED_PART): New unspecv.
	(nvptx_red_partition): New insn.
	* config/nvptx/nvptx.h (struct machine_function): Add red_partition.
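
For context, the kind of user code this change is aimed at might look like
the sketch below: an OpenACC loop whose vector_length exceeds the 32-lane
warp size. This is an illustration only and not part of the patch; the
function name sum_array and the choice of 128 are assumptions. Without
-fopenacc the pragma is ignored and the loop runs serially with the same
result.

```c
#include <assert.h>

/* Hypothetical example, not taken from the patch or the GCC testsuite.
   Sum N ints with a vector-level "+" reduction.  vector_length (128)
   requests more lanes than one warp (32), which is the case the patch
   adds support for on nvptx.  */
static int
sum_array (const int *a, int n)
{
  assert (n >= 0);

  int sum = 0;
#pragma acc parallel loop vector vector_length (128) reduction (+:sum) copyin (a[0:n])
  for (int i = 0; i < n; i++)
    sum += a[i];

  return sum;
}
```

Calling sum_array on an array holding 1, 2, 3, 4 returns 10 whether or not
the loop is offloaded.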
---
 gcc/config/nvptx/nvptx-protos.h |   1 +
 gcc/config/nvptx/nvptx.c        | 154 ++++++++++++++++++++++++++++++++--------
 gcc/config/nvptx/nvptx.h        |   2 +
 gcc/config/nvptx/nvptx.md       |  12 ++++
 4 files changed, 140 insertions(+), 29 deletions(-)

diff --git a/gcc/config/nvptx/nvptx-protos.h b/gcc/config/nvptx/nvptx-protos.h
index 1a26d00ab99..be09a15e49c 100644
--- a/gcc/config/nvptx/nvptx-protos.h
+++ b/gcc/config/nvptx/nvptx-protos.h
@@ -56,5 +56,6 @@ extern const char *nvptx_output_return (void);
 extern const char *nvptx_output_set_softstack (unsigned);
 extern const char *nvptx_output_simt_enter (rtx, rtx, rtx);
 extern const char *nvptx_output_simt_exit (rtx);
+extern const char *nvptx_output_red_partition (rtx, rtx);
 #endif
 #endif
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 26c80716603..5a4b38de522 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -150,6 +150,14 @@ static unsigned worker_red_size;
 static unsigned worker_red_align;
 static GTY(()) rtx worker_red_sym;
 
+/* Buffer needed for vector reductions, when vector_length >
+   PTX_WARP_SIZE.  This has to be distinct from the worker broadcast
+   array, as both may be live concurrently.  */
+static unsigned vector_red_size;
+static unsigned vector_red_align;
+static unsigned vector_red_partition;
+static GTY(()) rtx vector_red_sym;
+
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
@@ -226,6 +234,11 @@ nvptx_option_override (void)
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
+  vector_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__vector_red");
+  SET_SYMBOL_DATA_AREA (vector_red_sym, DATA_AREA_SHARED);
+  vector_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+  vector_red_partition = 0;
+
   diagnose_openacc_conflict (TARGET_GOMP, "-mgomp");
   diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack");
   diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt");
@@ -1104,8 +1117,25 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
 {
   fprintf (file, "\t{\n");
   fprintf (file, "\t\t.reg.u32\t%%%s;\n", name);
+  if (strcmp (name, "x") == 0 && cfun->machine->red_partition)
+    {
+      fprintf (file, "\t\t.reg.u64\t%%t_red;\n");
+      fprintf (file, "\t\t.reg.u64\t%%y64;\n");
+    }
   fprintf (file, "\t\tmov.u32\t%%%s, %%tid.%s;\n", name, name);
   fprintf (file, "\t\tsetp.ne.u32\t%%r%d, %%%s, 0;\n", regno, name);
+  if (strcmp (name, "x") == 0 && cfun->machine->red_partition)
+    {
+      fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tid.y;\n");
+      fprintf (file, "\t\tcvta.shared.u64\t%%t_red, __vector_red;\n");
+      fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_red; "
+               "// vector reduction buffer\n",
+               REGNO (cfun->machine->red_partition),
+               vector_red_partition);
+    }
+  /* Verify vector_red_size.  */
+  gcc_assert (vector_red_partition * nvptx_mach_max_workers ()
+              <= vector_red_size);
   fprintf (file, "\t}\n");
 }
 
@@ -1342,6 +1372,13 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
       fprintf (file, "\t.local.align 8 .b8 %%simtstack_ar["
                HOST_WIDE_INT_PRINT_DEC "];\n", simtsz);
     }
+
+  /* Restore the vector reduction partition register, if necessary.
+     FIXME: Find out when and why this is necessary, and fix it.  */
+  if (cfun->machine->red_partition)
+    regno_reg_rtx[REGNO (cfun->machine->red_partition)]
+      = cfun->machine->red_partition;
+
   /* Declare the pseudos we have as ptx registers.  */
   int maxregs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < maxregs; i++)
@@ -5188,6 +5225,10 @@ nvptx_file_end (void)
     write_shared_buffer (asm_out_file, worker_red_sym,
                          worker_red_align, worker_red_size);
 
+  if (vector_red_size)
+    write_shared_buffer (asm_out_file, vector_red_sym,
+                         vector_red_align, vector_red_size);
+
   if (need_softstack_decl)
     {
       write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
@@ -5233,31 +5274,68 @@ nvptx_expand_shuffle (tree exp, rtx target, machine_mode mode, int ignore)
   return target;
 }
 
-/* Worker reduction address expander.  */
+const char *
+nvptx_output_red_partition (rtx dst, rtx offset)
+{
+  const char *zero_offset = "\t\tmov.u64\t%%r%d, %%r%d; // vred buffer\n";
+  const char *with_offset = "\t\tadd.u64\t%%r%d, %%r%d, %d; // vred buffer\n";
+
+  if (offset == const0_rtx)
+    fprintf (asm_out_file, zero_offset, REGNO (dst),
+             REGNO (cfun->machine->red_partition));
+  else
+    fprintf (asm_out_file, with_offset, REGNO (dst),
+             REGNO (cfun->machine->red_partition), UINTVAL (offset));
+
+  return "";
+}
+
+/* Shared-memory reduction address expander.  */
 
 static rtx
 nvptx_expand_shared_addr (tree exp, rtx target,
-                          machine_mode ARG_UNUSED (mode), int ignore)
+                          machine_mode ARG_UNUSED (mode), int ignore,
+                          int vector)
 {
   if (ignore)
     return target;
 
   unsigned align = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 2));
-  worker_red_align = MAX (worker_red_align, align);
-
   unsigned offset = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 0));
   unsigned size = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 1));
-  worker_red_size = MAX (worker_red_size, size + offset);
-
   rtx addr = worker_red_sym;
-  if (offset)
+
+  if (vector)
     {
-      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (offset));
-      addr = gen_rtx_CONST (Pmode, addr);
+      offload_attrs oa;
+
+      populate_offload_attrs (&oa);
+
+      unsigned int psize = ROUND_UP (size + offset, align);
+      unsigned int pnum = nvptx_mach_max_workers ();
+      vector_red_partition = MAX (vector_red_partition, psize);
+      vector_red_size = MAX (vector_red_size, psize * pnum);
+      vector_red_align = MAX (vector_red_align, align);
+
+      if (cfun->machine->red_partition == NULL)
+        cfun->machine->red_partition = gen_reg_rtx (Pmode);
+
+      addr = gen_reg_rtx (Pmode);
+      emit_insn (gen_nvptx_red_partition (addr, GEN_INT (offset)));
     }
+  else
+    {
+      worker_red_align = MAX (worker_red_align, align);
+      worker_red_size = MAX (worker_red_size, size + offset);
 
-  emit_move_insn (target, addr);
+      if (offset)
+        {
+          addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (offset));
+          addr = gen_rtx_CONST (Pmode, addr);
+        }
+    }
 
+  emit_move_insn (target, addr);
   return target;
 }
 
@@ -5305,6 +5383,7 @@ enum nvptx_builtins
   NVPTX_BUILTIN_SHUFFLE,
   NVPTX_BUILTIN_SHUFFLELL,
   NVPTX_BUILTIN_WORKER_ADDR,
+  NVPTX_BUILTIN_VECTOR_ADDR,
   NVPTX_BUILTIN_CMP_SWAP,
   NVPTX_BUILTIN_CMP_SWAPLL,
   NVPTX_BUILTIN_MAX
@@ -5342,6 +5421,8 @@ nvptx_init_builtins (void)
   DEF (SHUFFLELL, "shufflell", (LLUINT, LLUINT, UINT, UINT, NULL_TREE));
   DEF (WORKER_ADDR, "worker_addr",
        (PTRVOID, ST, UINT, UINT, NULL_TREE));
+  DEF (VECTOR_ADDR, "vector_addr",
+       (PTRVOID, ST, UINT, UINT, NULL_TREE));
   DEF (CMP_SWAP, "cmp_swap", (UINT, PTRVOID, UINT, UINT, NULL_TREE));
   DEF (CMP_SWAPLL, "cmp_swapll", (LLUINT, PTRVOID, LLUINT, LLUINT, NULL_TREE));
 
@@ -5370,7 +5451,10 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
       return nvptx_expand_shuffle (exp, target, mode, ignore);
 
     case NVPTX_BUILTIN_WORKER_ADDR:
-      return nvptx_expand_shared_addr (exp, target, mode, ignore);
+      return nvptx_expand_shared_addr (exp, target, mode, ignore, false);
+
+    case NVPTX_BUILTIN_VECTOR_ADDR:
+      return nvptx_expand_shared_addr (exp, target, mode, ignore, true);
 
     case NVPTX_BUILTIN_CMP_SWAP:
     case NVPTX_BUILTIN_CMP_SWAPLL:
@@ -5630,10 +5714,13 @@ nvptx_goacc_fork_join (gcall *call, const int dims[],
    data at that location.  */
 
 static tree
-nvptx_get_shared_red_addr (tree type, tree offset)
+nvptx_get_shared_red_addr (tree type, tree offset, bool vector)
 {
+  enum nvptx_builtins addr_dim = NVPTX_BUILTIN_WORKER_ADDR;
+  if (vector)
+    addr_dim = NVPTX_BUILTIN_VECTOR_ADDR;
   machine_mode mode = TYPE_MODE (type);
-  tree fndecl = nvptx_builtin_decl (NVPTX_BUILTIN_WORKER_ADDR, true);
+  tree fndecl = nvptx_builtin_decl (addr_dim, true);
   tree size = build_int_cst (unsigned_type_node, GET_MODE_SIZE (mode));
   tree align = build_int_cst (unsigned_type_node,
                               GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT);
@@ -5949,7 +6036,7 @@ nvptx_reduction_update (location_t loc, gimple_stmt_iterator *gsi,
 /* NVPTX implementation of GOACC_REDUCTION_SETUP.  */
 
 static void
-nvptx_goacc_reduction_setup (gcall *call)
+nvptx_goacc_reduction_setup (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -5968,11 +6055,13 @@ nvptx_goacc_reduction_setup (gcall *call)
       var = build_simple_mem_ref (ref_to_res);
     }
 
-  if (level == GOMP_DIM_WORKER)
+  if (level == GOMP_DIM_WORKER
+      || (level == GOMP_DIM_VECTOR && oa->vector_length > PTX_WARP_SIZE))
     {
       /* Store incoming value to worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset,
+                                             level == GOMP_DIM_VECTOR);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
@@ -5991,7 +6080,7 @@ nvptx_goacc_reduction_setup (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_INIT.  */
 
 static void
-nvptx_goacc_reduction_init (gcall *call)
+nvptx_goacc_reduction_init (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -6005,7 +6094,7 @@ nvptx_goacc_reduction_init (gcall *call)
 
   push_gimplify_context (true);
 
-  if (level == GOMP_DIM_VECTOR)
+  if (level == GOMP_DIM_VECTOR && oa->vector_length == PTX_WARP_SIZE)
     {
       /* Initialize vector-non-zeroes to INIT_VAL (OP).  */
       tree tid = make_ssa_name (integer_type_node);
@@ -6075,7 +6164,7 @@ nvptx_goacc_reduction_init (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_FINI.  */
 
 static void
-nvptx_goacc_reduction_fini (gcall *call)
+nvptx_goacc_reduction_fini (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -6089,7 +6178,7 @@ nvptx_goacc_reduction_fini (gcall *call)
 
   push_gimplify_context (true);
 
-  if (level == GOMP_DIM_VECTOR)
+  if (level == GOMP_DIM_VECTOR && oa->vector_length == PTX_WARP_SIZE)
     {
       /* Emit binary shuffle tree.  TODO. Emit this as an actual loop,
          but that requires a method of emitting a unified jump at the
@@ -6110,11 +6199,12 @@ nvptx_goacc_reduction_fini (gcall *call)
     {
       tree accum = NULL_TREE;
 
-      if (level == GOMP_DIM_WORKER)
+      if (level == GOMP_DIM_WORKER || level == GOMP_DIM_VECTOR)
         {
           /* Get reduction buffer address.  */
           tree offset = gimple_call_arg (call, 5);
-          tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
+          tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset,
+                                                 level == GOMP_DIM_VECTOR);
           tree ptr = make_ssa_name (TREE_TYPE (call));
 
           gimplify_assign (ptr, call, &seq);
@@ -6145,7 +6235,7 @@ nvptx_goacc_reduction_fini (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_TEARDOWN.  */
 
 static void
-nvptx_goacc_reduction_teardown (gcall *call)
+nvptx_goacc_reduction_teardown (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -6154,11 +6244,13 @@ nvptx_goacc_reduction_teardown (gcall *call)
   gimple_seq seq = NULL;
 
   push_gimplify_context (true);
-  if (level == GOMP_DIM_WORKER)
+  if (level == GOMP_DIM_WORKER
+      || (level == GOMP_DIM_VECTOR && oa->vector_length > PTX_WARP_SIZE))
     {
       /* Read the worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_shared_red_addr(TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset,
+                                             level == GOMP_DIM_VECTOR);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
@@ -6189,23 +6281,26 @@ static void
 nvptx_goacc_reduction (gcall *call)
 {
   unsigned code = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+  offload_attrs oa;
+
+  populate_offload_attrs (&oa);
 
   switch (code)
     {
     case IFN_GOACC_REDUCTION_SETUP:
-      nvptx_goacc_reduction_setup (call);
+      nvptx_goacc_reduction_setup (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_INIT:
-      nvptx_goacc_reduction_init (call);
+      nvptx_goacc_reduction_init (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_FINI:
-      nvptx_goacc_reduction_fini (call);
+      nvptx_goacc_reduction_fini (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_TEARDOWN:
-      nvptx_goacc_reduction_teardown (call);
+      nvptx_goacc_reduction_teardown (call, &oa);
       break;
 
     default:
@@ -6290,6 +6385,7 @@ nvptx_set_current_function (tree fndecl)
     return;
 
   nvptx_previous_fndecl = fndecl;
+  vector_red_partition = 0;
   oacc_bcast_partition = 0;
 }
 
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 76ce871a731..29e658248ab 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -224,6 +224,8 @@ struct GTY(()) machine_function
   rtx bcast_partition; /* Register containing the size of each
                           vector's partition of share-memory used to
                           broadcast state.  */
+  rtx red_partition; /* Similar to bcast_partition, except for vector
+                        reductions.  */
   rtx sync_bar; /* Synchronization barrier ID for vectors.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 271b00e1eb0..1a090a47a32 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -68,6 +68,8 @@
 
    UNSPECV_SIMT_ENTER
    UNSPECV_SIMT_EXIT
+
+   UNSPECV_RED_PART
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -1508,3 +1510,13 @@
   ""
   "\\t.pragma \\\"nounroll\\\";"
   [(set_attr "predicable" "false")])
+
+(define_insn "nvptx_red_partition"
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=R")
+        (unspec_volatile [(match_operand:DI 1 "const_int_operand")]
+         UNSPECV_RED_PART))]
+  ""
+  {
+    return nvptx_output_red_partition (operands[0], operands[1]);
+  }
+  [(set_attr "predicable" "false")])
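
As a reading aid, the buffer arithmetic in the patch can be modelled in
plain C roughly as follows. This is a sketch under assumptions, not code
from GCC: the helper names (round_up, red_partition_size, red_buffer_size,
red_slot_offset) are made up for illustration, and the worker count is
passed in as a parameter where the patch calls nvptx_mach_max_workers ().
It mirrors psize = ROUND_UP (size + offset, align), the total size
psize * pnum, and the address formed by the mad.lo.u64 in
nvptx_init_axis_predicate plus the add.u64 in nvptx_output_red_partition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round X up to a multiple of ALIGN.  */
static unsigned
round_up (unsigned x, unsigned align)
{
  assert (align != 0);
  return (x + align - 1) / align * align;
}

/* Size of one worker's slice of the reduction buffer, modelled on
   psize in nvptx_expand_shared_addr.  */
static unsigned
red_partition_size (unsigned size, unsigned offset, unsigned align)
{
  return round_up (size + offset, align);
}

/* Size of the whole buffer: one slice per worker (psize * pnum).  */
static unsigned
red_buffer_size (unsigned psize, unsigned max_workers)
{
  return psize * max_workers;
}

/* Byte offset, relative to the start of the buffer, of the reduction
   variable at OFFSET for the worker with index TID_Y.  The emitted PTX
   computes base + tid.y * partition first and adds OFFSET afterwards.  */
static size_t
red_slot_offset (unsigned tid_y, unsigned psize, unsigned offset)
{
  return (size_t) tid_y * psize + offset;
}
```

For a 4-byte variable at offset 8 with 8-byte alignment, the slice is 16
bytes; with 32 workers the buffer is 512 bytes, and worker 3 finds the
variable 56 bytes from the start of the buffer.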