From patchwork Tue Nov 29 12:23:45 2016
X-Patchwork-Submitter: Chung-Lin Tang
X-Patchwork-Id: 700492
Subject: Re: [Patch 2/5] OpenACC tile clause support, omp-low parts
To: Jakub Jelinek
References: <20161111100030.GV3541@tucnak.redhat.com>
 <20161118111536.GU3541@tucnak.redhat.com>
CC: gcc-patches, Nathan Sidwell, Cesar Philippidis
From: Chung-Lin Tang
Message-ID: <6f729e16-65f7-decc-8c57-33180a78fb9b@mentor.com>
Date: Tue, 29 Nov 2016 20:23:45 +0800
In-Reply-To: <20161118111536.GU3541@tucnak.redhat.com>

Adjusted and re-tested in the way you advised; the updated patch is attached.

Thanks,
Chung-Lin

On 2016/11/18 7:15 PM, Jakub Jelinek wrote:
> Hi!
>
> On Thu, Nov 17, 2016 at 05:31:40PM +0800, Chung-Lin Tang wrote:
>> +#ifndef ACCEL_COMPILER
>> +  span = integer_one_node;
>> +#else
>> +  if (!e_mask)
>> +    /* Not paritioning.  */
>> +    span = integer_one_node;
> ...
> This goes against the recent trend of avoiding #if/#ifdef guarded blocks
> of code as much as possible, the ACCEL_COMPILER only hunk is significant
> and will usually not be enabled, so people will not notice breakages in it
> until building an accel compiler.
> What about
> #ifndef ACCEL_COMPILER
>   if (true)
>     span = integer_one_node;
>   else
> #endif
>   if (!e_mask)
>     /* Not paritioning.  */
>     span_integer_one_node;
>   else if
>   ...
> or what I've proposed earlier:
> #ifndef ACCEL_COMPILER
>   e_mask = 0;
> #endif
>   if (!e_mask)
>   ...
>
> Ok with that fixed.
>
> 	Jakub
>

Index: omp-low.c
===================================================================
--- omp-low.c	(revision 241809)
+++ omp-low.c	(working copy)
@@ -213,7 +213,8 @@ struct omp_for_data
   tree chunk_size;
   gomp_for *for_stmt;
   tree pre, iter_type;
-  int collapse;
+  tree tiling;  /* Tiling values (if non null).  */
+  int collapse; /* Collapsed loops, 1 for a non-collapsed loop.  */
   int ordered;
   bool have_nowait, have_ordered, simd_schedule;
   unsigned char sched_modifiers;
@@ -242,9 +243,10 @@ struct oacc_loop
   tree routine;	/* Pseudo-loop enclosing a routine.  */

   unsigned mask;   /* Partitioning mask.  */
+  unsigned e_mask; /* Partitioning of element loops (when tiling).  */
   unsigned inner;  /* Partitioning of inner loops.  */
   unsigned flags;  /* Partitioning flags.  */
-  unsigned ifns;   /* Contained loop abstraction functions.  */
+  vec<gcall *> ifns;  /* Contained loop abstraction functions.  */
   tree chunk_size; /* Chunk size.  */
   gcall *head_end; /* Final marker of head sequence.  */
 };
@@ -256,9 +258,10 @@ enum oacc_loop_flags {
   OLF_AUTO	= 1u << 1,	/* Compiler chooses axes.  */
   OLF_INDEPENDENT = 1u << 2,	/* Iterations are known independent.  */
   OLF_GANG_STATIC = 1u << 3,	/* Gang partitioning is static (has op). */
-
+  OLF_TILE	= 1u << 4,	/* Tiled loop.  */
+
   /* Explicitly specified loop axes.
*/ - OLF_DIM_BASE = 4, + OLF_DIM_BASE = 5, OLF_DIM_GANG = 1u << (OLF_DIM_BASE + GOMP_DIM_GANG), OLF_DIM_WORKER = 1u << (OLF_DIM_BASE + GOMP_DIM_WORKER), OLF_DIM_VECTOR = 1u << (OLF_DIM_BASE + GOMP_DIM_VECTOR), @@ -536,13 +539,9 @@ extract_omp_for_data (gomp_for *for_stmt, struct o fd->for_stmt = for_stmt; fd->pre = NULL; - if (gimple_omp_for_collapse (for_stmt) > 1) - fd->loops = loops; - else - fd->loops = &fd->loop; - fd->have_nowait = distribute || simd; fd->have_ordered = false; + fd->tiling = NULL; fd->collapse = 1; fd->ordered = 0; fd->sched_kind = OMP_CLAUSE_SCHEDULE_STATIC; @@ -587,9 +586,22 @@ extract_omp_for_data (gomp_for *for_stmt, struct o collapse_count = &OMP_CLAUSE_COLLAPSE_COUNT (t); } break; + case OMP_CLAUSE_TILE: + fd->tiling = OMP_CLAUSE_TILE_LIST (t); + fd->collapse = list_length (fd->tiling); + gcc_assert (fd->collapse); + collapse_iter = &OMP_CLAUSE_TILE_ITERVAR (t); + collapse_count = &OMP_CLAUSE_TILE_COUNT (t); + break; default: break; } + + if (fd->collapse > 1 || fd->tiling) + fd->loops = loops; + else + fd->loops = &fd->loop; + if (fd->ordered && fd->collapse == 1 && loops != NULL) { fd->loops = loops; @@ -608,7 +620,7 @@ extract_omp_for_data (gomp_for *for_stmt, struct o fd->sched_kind = OMP_CLAUSE_SCHEDULE_STATIC; gcc_assert (fd->chunk_size == NULL); } - gcc_assert (fd->collapse == 1 || collapse_iter != NULL); + gcc_assert ((fd->collapse == 1 && !fd->tiling) || collapse_iter != NULL); if (taskloop) fd->sched_kind = OMP_CLAUSE_SCHEDULE_RUNTIME; if (fd->sched_kind == OMP_CLAUSE_SCHEDULE_RUNTIME) @@ -626,7 +638,10 @@ extract_omp_for_data (gomp_for *for_stmt, struct o int cnt = fd->ordered ? 
fd->ordered : fd->collapse; for (i = 0; i < cnt; i++) { - if (i == 0 && fd->collapse == 1 && (fd->ordered == 0 || loops == NULL)) + if (i == 0 + && fd->collapse == 1 + && !fd->tiling + && (fd->ordered == 0 || loops == NULL)) loop = &fd->loop; else if (loops != NULL) loop = loops + i; @@ -655,7 +670,7 @@ extract_omp_for_data (gomp_for *for_stmt, struct o || (fd->sched_kind == OMP_CLAUSE_SCHEDULE_STATIC && !fd->have_ordered)) { - if (fd->collapse == 1) + if (fd->collapse == 1 && !fd->tiling) iter_type = TREE_TYPE (loop->v); else if (i == 0 || TYPE_PRECISION (iter_type) @@ -786,7 +801,7 @@ extract_omp_for_data (gomp_for *for_stmt, struct o *collapse_count = create_tmp_var (iter_type, ".count"); } - if (fd->collapse > 1 || (fd->ordered && loops)) + if (fd->collapse > 1 || fd->tiling || (fd->ordered && loops)) { fd->loop.v = *collapse_iter; fd->loop.n1 = build_int_cst (TREE_TYPE (fd->loop.v), 0); @@ -2192,6 +2207,7 @@ scan_sharing_clauses (tree clauses, omp_context *c case OMP_CLAUSE_INDEPENDENT: case OMP_CLAUSE_AUTO: case OMP_CLAUSE_SEQ: + case OMP_CLAUSE_TILE: break; case OMP_CLAUSE_ALIGNED: @@ -2201,7 +2217,6 @@ scan_sharing_clauses (tree clauses, omp_context *c install_var_local (decl, ctx); break; - case OMP_CLAUSE_TILE: case OMP_CLAUSE__CACHE_: default: gcc_unreachable (); @@ -2362,10 +2377,10 @@ scan_sharing_clauses (tree clauses, omp_context *c case OMP_CLAUSE_INDEPENDENT: case OMP_CLAUSE_AUTO: case OMP_CLAUSE_SEQ: + case OMP_CLAUSE_TILE: case OMP_CLAUSE__GRIDDIM_: break; - case OMP_CLAUSE_TILE: case OMP_CLAUSE__CACHE_: default: gcc_unreachable (); @@ -6349,6 +6364,10 @@ lower_oacc_head_mark (location_t loc, tree ddvar, tag |= OLF_INDEPENDENT; break; + case OMP_CLAUSE_TILE: + tag |= OLF_TILE; + break; + default: continue; } @@ -6366,14 +6385,20 @@ lower_oacc_head_mark (location_t loc, tree ddvar, if (!tgt || is_oacc_parallel (tgt)) tag |= OLF_INDEPENDENT; - /* A loop lacking SEQ, GANG, WORKER and/or VECTOR is implicitly AUTO. 
*/ - if (!(tag & (((GOMP_DIM_MASK (GOMP_DIM_MAX) - 1) << OLF_DIM_BASE) - | OLF_SEQ))) - tag |= OLF_AUTO; + if (tag & OLF_TILE) + /* Tiling could use all 3 levels. */ + levels = 3; + else + { + /* A loop lacking SEQ, GANG, WORKER and/or VECTOR could be AUTO. + Ensure at least one level, or 2 for possible auto + partitioning */ + bool maybe_auto = !(tag & (((GOMP_DIM_MASK (GOMP_DIM_MAX) - 1) + << OLF_DIM_BASE) | OLF_SEQ)); - /* Ensure at least one level. */ - if (!levels) - levels++; + if (levels < 1u + maybe_auto) + levels = 1u + maybe_auto; + } args.quick_push (build_int_cst (integer_type_node, levels)); args.quick_push (build_int_cst (integer_type_node, tag)); @@ -7445,7 +7470,9 @@ struct oacc_collapse { tree base; /* Base value. */ tree iters; /* Number of steps. */ - tree step; /* step size. */ + tree step; /* Step size. */ + tree tile; /* Tile increment (if tiled). */ + tree outer; /* Tile iterator var. */ }; /* Helper for expand_oacc_for. Determine collapsed loop information. @@ -7455,15 +7482,20 @@ struct oacc_collapse static tree expand_oacc_collapse_init (const struct omp_for_data *fd, gimple_stmt_iterator *gsi, - oacc_collapse *counts, tree bound_type) + oacc_collapse *counts, tree bound_type, + location_t loc) { + tree tiling = fd->tiling; tree total = build_int_cst (bound_type, 1); int ix; gcc_assert (integer_onep (fd->loop.step)); gcc_assert (integer_zerop (fd->loop.n1)); - for (ix = 0; ix != fd->collapse; ix++) + /* When tiling, the first operand of the tile clause applies to the + innermost loop, and we work outwards from there. Seems + backwards, but whatever. 
*/ + for (ix = fd->collapse; ix--;) { const omp_for_data_loop *loop = &fd->loops[ix]; @@ -7478,6 +7510,30 @@ expand_oacc_collapse_init (const struct omp_for_da if (POINTER_TYPE_P (diff_type) || TYPE_UNSIGNED (diff_type)) diff_type = signed_type_for (diff_type); + if (tiling) + { + tree num = build_int_cst (integer_type_node, fd->collapse); + tree loop_no = build_int_cst (integer_type_node, ix); + tree tile = TREE_VALUE (tiling); + gcall *call + = gimple_build_call_internal (IFN_GOACC_TILE, 5, num, loop_no, tile, + /* gwv-outer=*/integer_zero_node, + /* gwv-inner=*/integer_zero_node); + + counts[ix].outer = create_tmp_var (iter_type, ".outer"); + counts[ix].tile = create_tmp_var (diff_type, ".tile"); + gimple_call_set_lhs (call, counts[ix].tile); + gimple_set_location (call, loc); + gsi_insert_before (gsi, call, GSI_SAME_STMT); + + tiling = TREE_CHAIN (tiling); + } + else + { + counts[ix].tile = NULL; + counts[ix].outer = loop->v; + } + tree b = loop->n1; tree e = loop->n2; tree s = loop->step; @@ -7531,13 +7587,14 @@ expand_oacc_collapse_init (const struct omp_for_da return total; } -/* Emit initializers for collapsed loop members. IVAR is the outer +/* Emit initializers for collapsed loop members. INNER is true if + this is for the element loop of a TILE. IVAR is the outer loop iteration variable, from which collapsed loop iteration values are calculated. COUNTS array has been initialized by expand_oacc_collapse_inits. */ static void -expand_oacc_collapse_vars (const struct omp_for_data *fd, +expand_oacc_collapse_vars (const struct omp_for_data *fd, bool inner, gimple_stmt_iterator *gsi, const oacc_collapse *counts, tree ivar) { @@ -7549,7 +7606,8 @@ static void { const omp_for_data_loop *loop = &fd->loops[ix]; const oacc_collapse *collapse = &counts[ix]; - tree iter_type = TREE_TYPE (loop->v); + tree v = inner ? 
loop->v : collapse->outer; + tree iter_type = TREE_TYPE (v); tree diff_type = TREE_TYPE (collapse->step); tree plus_type = iter_type; enum tree_code plus_code = PLUS_EXPR; @@ -7561,24 +7619,25 @@ static void plus_type = sizetype; } - expr = fold_build2 (TRUNC_MOD_EXPR, ivar_type, ivar, - fold_convert (ivar_type, collapse->iters)); + expr = ivar; + if (ix) + { + tree mod = fold_convert (ivar_type, collapse->iters); + ivar = fold_build2 (TRUNC_DIV_EXPR, ivar_type, expr, mod); + expr = fold_build2 (TRUNC_MOD_EXPR, ivar_type, expr, mod); + ivar = force_gimple_operand_gsi (gsi, ivar, true, NULL_TREE, + true, GSI_SAME_STMT); + } + expr = fold_build2 (MULT_EXPR, diff_type, fold_convert (diff_type, expr), collapse->step); - expr = fold_build2 (plus_code, iter_type, collapse->base, + expr = fold_build2 (plus_code, iter_type, + inner ? collapse->outer : collapse->base, fold_convert (plus_type, expr)); expr = force_gimple_operand_gsi (gsi, expr, false, NULL_TREE, true, GSI_SAME_STMT); - gassign *ass = gimple_build_assign (loop->v, expr); + gassign *ass = gimple_build_assign (v, expr); gsi_insert_before (gsi, ass, GSI_SAME_STMT); - - if (ix) - { - expr = fold_build2 (TRUNC_DIV_EXPR, ivar_type, ivar, - fold_convert (ivar_type, collapse->iters)); - ivar = force_gimple_operand_gsi (gsi, expr, true, NULL_TREE, - true, GSI_SAME_STMT); - } } } @@ -11148,7 +11207,8 @@ expand_omp_taskloop_for_inner (struct omp_region * where LTGT is < or >. We may have a specified chunking size, CHUNKING (constant 0 for no chunking) and we will have a GWV partitioning mask, specifying dimensions over which the loop is to be - partitioned (see note below). We generate code that looks like: + partitioned (see note below). 
We generate code that looks like + (this ignores tiling): [incoming FALL->body, BRANCH->exit] typedef signedintify (typeof (V)) T; // underlying signed integral type @@ -11178,12 +11238,8 @@ expand_omp_taskloop_for_inner (struct omp_region * [incoming] V = B + ((range -/+ 1) / S +/- 1) * S [*] - [*] Needed if V live at end of loop + [*] Needed if V live at end of loop. */ - Note: CHUNKING & GWV mask are specified explicitly here. This is a - transition, and will be specified by a more general mechanism shortly. - */ - static void expand_oacc_for (struct omp_region *region, struct omp_for_data *fd) { @@ -11245,9 +11301,16 @@ expand_oacc_for (struct omp_region *region, struct tree step = create_tmp_var (diff_type, ".step"); bool up = cond_code == LT_EXPR; tree dir = build_int_cst (diff_type, up ? +1 : -1); - bool chunking = !gimple_in_ssa_p (cfun);; + bool chunking = !gimple_in_ssa_p (cfun); bool negating; + /* Tiling vars. */ + tree tile_size = NULL_TREE; + tree element_s = NULL_TREE; + tree e_bound = NULL_TREE, e_offset = NULL_TREE, e_step = NULL_TREE; + basic_block elem_body_bb = NULL; + basic_block elem_cont_bb = NULL; + /* SSA instances. 
*/ tree offset_incr = NULL_TREE; tree offset_init = NULL_TREE; @@ -11278,11 +11341,12 @@ expand_oacc_for (struct omp_region *region, struct gwv = build_int_cst (integer_type_node, GOMP_DIM_MASK (GOMP_DIM_GANG)); } - if (fd->collapse > 1) + if (fd->collapse > 1 || fd->tiling) { + gcc_assert (!gimple_in_ssa_p (cfun) && up); counts = XALLOCAVEC (struct oacc_collapse, fd->collapse); tree total = expand_oacc_collapse_init (fd, &gsi, counts, - TREE_TYPE (fd->loop.n2)); + TREE_TYPE (fd->loop.n2), loc); if (SSA_VAR_P (fd->loop.n2)) { @@ -11291,7 +11355,6 @@ expand_oacc_for (struct omp_region *region, struct ass = gimple_build_assign (fd->loop.n2, total); gsi_insert_before (&gsi, ass, GSI_SAME_STMT); } - } tree b = fd->loop.n1; @@ -11315,6 +11378,29 @@ expand_oacc_for (struct omp_region *region, struct expr = fold_convert (diff_type, chunk_size); chunk_size = force_gimple_operand_gsi (&gsi, expr, true, NULL_TREE, true, GSI_SAME_STMT); + + if (fd->tiling) + { + /* Determine the tile size and element step, + modify the outer loop step size. */ + tile_size = create_tmp_var (diff_type, ".tile_size"); + expr = build_int_cst (diff_type, 1); + for (int ix = 0; ix < fd->collapse; ix++) + expr = fold_build2 (MULT_EXPR, diff_type, counts[ix].tile, expr); + expr = force_gimple_operand_gsi (&gsi, expr, true, + NULL_TREE, true, GSI_SAME_STMT); + ass = gimple_build_assign (tile_size, expr); + gsi_insert_before (&gsi, ass, GSI_SAME_STMT); + + element_s = create_tmp_var (diff_type, ".element_s"); + ass = gimple_build_assign (element_s, s); + gsi_insert_before (&gsi, ass, GSI_SAME_STMT); + + expr = fold_build2 (MULT_EXPR, diff_type, s, tile_size); + s = force_gimple_operand_gsi (&gsi, expr, true, + NULL_TREE, true, GSI_SAME_STMT); + } + /* Determine the range, avoiding possible unsigned->signed overflow. 
*/ negating = !up && TYPE_UNSIGNED (iter_type); expr = fold_build2 (MINUS_EXPR, plus_type, @@ -11419,8 +11505,72 @@ expand_oacc_for (struct omp_region *region, struct true, GSI_SAME_STMT); ass = gimple_build_assign (v, expr); gsi_insert_before (&gsi, ass, GSI_SAME_STMT); - if (fd->collapse > 1) - expand_oacc_collapse_vars (fd, &gsi, counts, v); + + if (fd->collapse > 1 || fd->tiling) + expand_oacc_collapse_vars (fd, false, &gsi, counts, v); + + if (fd->tiling) + { + /* Determine the range of the element loop -- usually simply + the tile_size, but could be smaller if the final + iteration of the outer loop is a partial tile. */ + tree e_range = create_tmp_var (diff_type, ".e_range"); + + expr = build2 (MIN_EXPR, diff_type, + build2 (MINUS_EXPR, diff_type, bound, offset), + build2 (MULT_EXPR, diff_type, tile_size, + element_s)); + expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE, + true, GSI_SAME_STMT); + ass = gimple_build_assign (e_range, expr); + gsi_insert_before (&gsi, ass, GSI_SAME_STMT); + + /* Determine bound, offset & step of inner loop. */ + e_bound = create_tmp_var (diff_type, ".e_bound"); + e_offset = create_tmp_var (diff_type, ".e_offset"); + e_step = create_tmp_var (diff_type, ".e_step"); + + /* Mark these as element loops. */ + tree t, e_gwv = integer_minus_one_node; + tree chunk = build_int_cst (diff_type, 0); /* Never chunked. 
*/ + + t = build_int_cst (integer_type_node, IFN_GOACC_LOOP_OFFSET); + call = gimple_build_call_internal (IFN_GOACC_LOOP, 7, t, dir, e_range, + element_s, chunk, e_gwv, chunk); + gimple_call_set_lhs (call, e_offset); + gimple_set_location (call, loc); + gsi_insert_before (&gsi, call, GSI_SAME_STMT); + + t = build_int_cst (integer_type_node, IFN_GOACC_LOOP_BOUND); + call = gimple_build_call_internal (IFN_GOACC_LOOP, 7, t, dir, e_range, + element_s, chunk, e_gwv, e_offset); + gimple_call_set_lhs (call, e_bound); + gimple_set_location (call, loc); + gsi_insert_before (&gsi, call, GSI_SAME_STMT); + + t = build_int_cst (integer_type_node, IFN_GOACC_LOOP_STEP); + call = gimple_build_call_internal (IFN_GOACC_LOOP, 6, t, dir, e_range, + element_s, chunk, e_gwv); + gimple_call_set_lhs (call, e_step); + gimple_set_location (call, loc); + gsi_insert_before (&gsi, call, GSI_SAME_STMT); + + /* Add test and split block. */ + expr = build2 (cond_code, boolean_type_node, e_offset, e_bound); + stmt = gimple_build_cond_empty (expr); + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); + split = split_block (body_bb, stmt); + elem_body_bb = split->dest; + if (cont_bb == body_bb) + cont_bb = elem_body_bb; + body_bb = split->src; + + split->flags ^= EDGE_FALLTHRU | EDGE_TRUE_VALUE; + + /* Initialize the user's loop vars. */ + gsi = gsi_start_bb (elem_body_bb); + expand_oacc_collapse_vars (fd, true, &gsi, counts, e_offset); + } } /* Loop increment goes into cont_bb. If this is not a loop, we @@ -11434,9 +11584,33 @@ expand_oacc_for (struct omp_region *region, struct gomp_continue *cont_stmt = as_a (gsi_stmt (gsi)); loc = gimple_location (cont_stmt); + if (fd->tiling) + { + /* Insert element loop increment and test. 
*/ + expr = build2 (PLUS_EXPR, diff_type, e_offset, e_step); + expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE, + true, GSI_SAME_STMT); + ass = gimple_build_assign (e_offset, expr); + gsi_insert_before (&gsi, ass, GSI_SAME_STMT); + expr = build2 (cond_code, boolean_type_node, e_offset, e_bound); + + stmt = gimple_build_cond_empty (expr); + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT); + split = split_block (cont_bb, stmt); + elem_cont_bb = split->src; + cont_bb = split->dest; + + split->flags ^= EDGE_FALLTHRU | EDGE_FALSE_VALUE; + make_edge (elem_cont_bb, elem_body_bb, EDGE_TRUE_VALUE); + + make_edge (body_bb, cont_bb, EDGE_FALSE_VALUE); + + gsi = gsi_for_stmt (cont_stmt); + } + /* Increment offset. */ if (gimple_in_ssa_p (cfun)) - expr= build2 (plus_code, iter_type, offset, + expr = build2 (plus_code, iter_type, offset, fold_convert (plus_type, step)); else expr = build2 (PLUS_EXPR, diff_type, offset, step); @@ -11510,7 +11684,7 @@ expand_oacc_for (struct omp_region *region, struct if (cont_bb) { - /* We now have one or two nested loops. Update the loop + /* We now have one, two or three nested loops. Update the loop structures. */ struct loop *parent = entry_bb->loop_father; struct loop *body = body_bb->loop_father; @@ -11537,6 +11711,15 @@ expand_oacc_for (struct omp_region *region, struct body_loop->header = body_bb; body_loop->latch = cont_bb; add_loop (body_loop, parent); + + if (fd->tiling) + { + /* Insert tiling's element loop. */ + struct loop *inner_loop = alloc_loop (); + inner_loop->header = elem_body_bb; + inner_loop->latch = elem_cont_bb; + add_loop (inner_loop, body_loop); + } } } } @@ -18555,6 +18738,23 @@ omp_finish_file (void) } } +/* Call dim_pos (POS == true) or dim_size (POS == false) builtins for + axis DIM. Return a tmp var holding the result. 
*/ + +static tree +oacc_dim_call (bool pos, int dim, gimple_seq *seq) +{ + tree arg = build_int_cst (unsigned_type_node, dim); + tree size = create_tmp_var (integer_type_node); + enum internal_fn fn = pos ? IFN_GOACC_DIM_POS : IFN_GOACC_DIM_SIZE; + gimple *call = gimple_build_call_internal (fn, 1, arg); + + gimple_call_set_lhs (call, size); + gimple_seq_add_stmt (seq, call); + + return size; +} + /* Find the number of threads (POS = false), or thread number (POS = true) for an OpenACC region partitioned as MASK. Setup code required for the calculation is added to SEQ. */ @@ -18569,29 +18769,17 @@ oacc_thread_numbers (bool pos, int mask, gimple_se for (ix = GOMP_DIM_GANG; ix != GOMP_DIM_MAX; ix++) if (GOMP_DIM_MASK (ix) & mask) { - tree arg = build_int_cst (unsigned_type_node, ix); - if (res) { /* We had an outer index, so scale that by the size of this dimension. */ - tree n = create_tmp_var (integer_type_node); - gimple *call - = gimple_build_call_internal (IFN_GOACC_DIM_SIZE, 1, arg); - - gimple_call_set_lhs (call, n); - gimple_seq_add_stmt (seq, call); + tree n = oacc_dim_call (false, ix, seq); res = fold_build2 (MULT_EXPR, integer_type_node, res, n); } if (pos) { /* Determine index in this dimension. */ - tree id = create_tmp_var (integer_type_node); - gimple *call = gimple_build_call_internal - (IFN_GOACC_DIM_POS, 1, arg); - - gimple_call_set_lhs (call, id); - gimple_seq_add_stmt (seq, call); + tree id = oacc_dim_call (true, ix, seq); if (res) res = fold_build2 (PLUS_EXPR, integer_type_node, res, id); else @@ -18790,6 +18978,85 @@ oacc_xform_loop (gcall *call) gsi_replace_with_seq (&gsi, seq, true); } +/* Transform a GOACC_TILE call. Determines the element loop span for + the specified loop of the nest. This is 1 if we're not tiling. 
+
+   GOACC_TILE (collapse_count, loop_no, tile_arg, gwv_tile, gwv_element); */
+
+static void
+oacc_xform_tile (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  unsigned collapse = tree_to_uhwi (gimple_call_arg (call, 0));
+  /* Inner loops have higher loop_nos.  */
+  unsigned loop_no = tree_to_uhwi (gimple_call_arg (call, 1));
+  tree tile_size = gimple_call_arg (call, 2);
+  unsigned e_mask = tree_to_uhwi (gimple_call_arg (call, 4));
+  tree lhs = gimple_call_lhs (call);
+  tree type = TREE_TYPE (lhs);
+  gimple_seq seq = NULL;
+  tree span = build_int_cst (type, 1);
+
+  gcc_assert (!(e_mask
+		& ~(GOMP_DIM_MASK (GOMP_DIM_VECTOR)
+		    | GOMP_DIM_MASK (GOMP_DIM_WORKER))));
+  push_gimplify_context (!seen_error ());
+
+#ifndef ACCEL_COMPILER
+  /* Partitioning disabled on host compilers.  */
+  e_mask = 0;
+#endif
+  if (!e_mask)
+    /* Not partitioning.  */
+    span = integer_one_node;
+  else if (!integer_zerop (tile_size))
+    /* User explicitly specified size.  */
+    span = tile_size;
+  else
+    {
+      /* Pick a size based on the partitioning of the element loop and
+	 the number of loop nests.  */
+      tree first_size = NULL_TREE;
+      tree second_size = NULL_TREE;
+
+      if (e_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+	first_size = oacc_dim_call (false, GOMP_DIM_VECTOR, &seq);
+      if (e_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	second_size = oacc_dim_call (false, GOMP_DIM_WORKER, &seq);
+
+      if (!first_size)
+	{
+	  first_size = second_size;
+	  second_size = NULL_TREE;
+	}
+
+      if (loop_no + 1 == collapse)
+	{
+	  span = first_size;
+	  if (!loop_no && second_size)
+	    span = fold_build2 (MULT_EXPR, TREE_TYPE (span),
+				span, second_size);
+	}
+      else if (loop_no + 2 == collapse)
+	span = second_size;
+      else
+	span = NULL_TREE;
+
+      if (!span)
+	/* There's no obvious element size for this loop.  Options
+	   are 1, first_size or some non-unity constant (32 is my
+	   favourite).  We should gather some statistics.
*/ + span = first_size; + } + + span = fold_convert (type, span); + gimplify_assign (lhs, span, &seq); + + pop_gimplify_context (NULL); + + gsi_replace_with_seq (&gsi, seq, true); +} + /* Default partitioned and minimum partitioned dimensions. */ static int oacc_default_dims[GOMP_DIM_MAX]; @@ -18948,8 +19215,7 @@ new_oacc_loop_raw (oacc_loop *parent, location_t l memset (loop->tails, 0, sizeof (loop->tails)); loop->routine = NULL_TREE; - loop->mask = loop->flags = loop->inner = 0; - loop->ifns = 0; + loop->mask = loop->e_mask = loop->flags = loop->inner = 0; loop->chunk_size = 0; loop->head_end = NULL; @@ -19012,7 +19278,7 @@ static oacc_loop * finish_oacc_loop (oacc_loop *loop) { /* If the loop has been collapsed, don't partition it. */ - if (!loop->ifns) + if (loop->ifns.is_empty ()) loop->mask = loop->flags = 0; return loop->parent; } @@ -19148,9 +19414,10 @@ oacc_loop_discover_walk (oacc_loop *loop, basic_bl break; case IFN_GOACC_LOOP: - /* Count the goacc loop abstraction fns, to determine if the - loop was collapsed already. */ - loop->ifns++; + case IFN_GOACC_TILE: + /* Record the abstraction function, so we can manipulate it + later. */ + loop->ifns.safe_push (call); break; case IFN_UNIQUE: @@ -19285,51 +19552,6 @@ oacc_loop_xform_head_tail (gcall *from, int level) } } -/* Transform the IFN_GOACC_LOOP internal functions by providing the - determined partitioning mask and chunking argument. END_MARKER - points at the end IFN_HEAD_TAIL call intgroducing the loop. IFNS - is the number of IFN_GOACC_LOOP calls for the loop. MASK_ARG is - the replacement partitioning mask and CHUNK_ARG is the replacement - chunking arg. 
 */
-
-static void
-oacc_loop_xform_loop (gcall *end_marker, unsigned ifns,
-		      tree mask_arg, tree chunk_arg)
-{
-  gimple_stmt_iterator gsi = gsi_for_stmt (end_marker);
-
-  gcc_checking_assert (ifns);
-  for (;;)
-    {
-      for (; !gsi_end_p (gsi); gsi_next (&gsi))
-	{
-	  gimple *stmt = gsi_stmt (gsi);
-
-	  if (!is_gimple_call (stmt))
-	    continue;
-
-	  gcall *call = as_a <gcall *> (stmt);
-
-	  if (!gimple_call_internal_p (call))
-	    continue;
-
-	  if (gimple_call_internal_fn (call) != IFN_GOACC_LOOP)
-	    continue;
-
-	  *gimple_call_arg_ptr (call, 5) = mask_arg;
-	  *gimple_call_arg_ptr (call, 4) = chunk_arg;
-	  ifns--;
-	  if (!ifns)
-	    return;
-	}
-
-      /* The LOOP_BOUND ifn could be in the single successor
-	 block.  */
-      basic_block bb = single_succ (gsi_bb (gsi));
-      gsi = gsi_start_bb (bb);
-    }
-}
-
 /* Process the discovered OpenACC loops, setting the correct
    partitioning level etc.  */

@@ -19342,13 +19564,34 @@ oacc_loop_process (oacc_loop *loop)
   if (loop->mask && !loop->routine)
     {
       int ix;
-      unsigned mask = loop->mask;
-      unsigned dim = GOMP_DIM_GANG;
-      tree mask_arg = build_int_cst (unsigned_type_node, mask);
+      tree mask_arg = build_int_cst (unsigned_type_node, loop->mask);
+      tree e_mask_arg = build_int_cst (unsigned_type_node, loop->e_mask);
       tree chunk_arg = loop->chunk_size;
+      gcall *call;

-      oacc_loop_xform_loop (loop->head_end, loop->ifns, mask_arg, chunk_arg);
+      for (ix = 0; loop->ifns.iterate (ix, &call); ix++)
+	switch (gimple_call_internal_fn (call))
+	  {
+	  case IFN_GOACC_LOOP:
+	    {
+	      bool is_e = gimple_call_arg (call, 5) == integer_minus_one_node;
+	      gimple_call_set_arg (call, 5, is_e ?
e_mask_arg : mask_arg);
+	    if (!is_e)
+	      gimple_call_set_arg (call, 4, chunk_arg);
+	  }
+	  break;
+	case IFN_GOACC_TILE:
+	  gimple_call_set_arg (call, 3, mask_arg);
+	  gimple_call_set_arg (call, 4, e_mask_arg);
+	  break;
+
+	default:
+	  gcc_unreachable ();
+	}
+
+      unsigned dim = GOMP_DIM_GANG;
+      unsigned mask = loop->mask | loop->e_mask;
       for (ix = 0; ix != GOMP_DIM_MAX && mask; ix++)
 	{
 	  while (!(GOMP_DIM_MASK (dim) & mask))
@@ -19388,10 +19631,16 @@ oacc_loop_fixed_partitions (oacc_loop *loop, unsig
     {
       bool auto_par = (loop->flags & OLF_AUTO) != 0;
       bool seq_par = (loop->flags & OLF_SEQ) != 0;
+      bool tiling = (loop->flags & OLF_TILE) != 0;
 
       this_mask = ((loop->flags >> OLF_DIM_BASE)
 		   & (GOMP_DIM_MASK (GOMP_DIM_MAX) - 1));
+
+      /* Apply auto partitioning if this is a non-partitioned regular
+	 loop, or (no more than) single-axis tiled loop.  */
+      bool maybe_auto
+	= !seq_par && this_mask == (tiling ? this_mask & -this_mask : 0);
+
       if ((this_mask != 0) + auto_par + seq_par > 1)
 	{
 	  if (noisy)
@@ -19399,7 +19648,7 @@ oacc_loop_fixed_partitions (oacc_loop *loop, unsig
 		      seq_par
 		      ? "%<seq%> overrides other OpenACC loop specifiers"
 		      : "%<auto%> conflicts with other OpenACC loop specifiers");
-	  auto_par = false;
+	  maybe_auto = false;
 	  loop->flags &= ~OLF_AUTO;
 	  if (seq_par)
 	    {
@@ -19408,15 +19657,18 @@ oacc_loop_fixed_partitions (oacc_loop *loop, unsig
 	      this_mask = 0;
 	    }
 	}
-      if (auto_par && (loop->flags & OLF_INDEPENDENT))
-	mask_all |= GOMP_DIM_MASK (GOMP_DIM_MAX);
+      if (maybe_auto && (loop->flags & OLF_INDEPENDENT))
+	{
+	  loop->flags |= OLF_AUTO;
+	  mask_all |= GOMP_DIM_MASK (GOMP_DIM_MAX);
+	}
     }
 
   if (this_mask & outer_mask)
     {
       const oacc_loop *outer;
       for (outer = loop->parent; outer; outer = outer->parent)
-	if (outer->mask & this_mask)
+	if ((outer->mask | outer->e_mask) & this_mask)
 	  break;
 
       if (noisy)
@@ -19462,13 +19714,33 @@ oacc_loop_fixed_partitions (oacc_loop *loop, unsig
 	}
     }
 
+  mask_all |= this_mask;
+
+  if (loop->flags & OLF_TILE)
+    {
+      /* When tiling, vector goes to the element loop, and failing
+	 that we put worker there.  The std doesn't contemplate
+	 specifying all three.  We choose to put worker and vector on
+	 the element loops in that case.  */
+      unsigned this_e_mask = this_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR);
+      if (!this_e_mask || this_mask & GOMP_DIM_MASK (GOMP_DIM_GANG))
+	this_e_mask |= this_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER);
+
+      loop->e_mask = this_e_mask;
+      this_mask ^= this_e_mask;
+    }
+
   loop->mask = this_mask;
-  mask_all |= this_mask;
-
+
+  if (dump_file)
+    fprintf (dump_file, "Loop %s:%d user specified %d & %d\n",
+	     LOCATION_FILE (loop->loc), LOCATION_LINE (loop->loc),
+	     loop->mask, loop->e_mask);
+
   if (loop->child)
     {
-      loop->inner = oacc_loop_fixed_partitions (loop->child,
-						outer_mask | this_mask);
+      unsigned tmp_mask = outer_mask | this_mask | loop->e_mask;
+      loop->inner = oacc_loop_fixed_partitions (loop->child, tmp_mask);
       mask_all |= loop->inner;
     }
 
@@ -19480,14 +19752,17 @@ oacc_loop_fixed_partitions (oacc_loop *loop, unsig
 
 /* Walk the OpenACC loop hierarchy to assign auto-partitioned loops.
    OUTER_MASK is the partitioning this loop is contained within.
+   OUTER_ASSIGN is true if an outer loop is being auto-partitioned.
    Return the cumulative partitioning used by this loop, siblings and
    children.  */
 
 static unsigned
-oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask)
+oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask,
+			   bool outer_assign)
 {
   bool assign = (loop->flags & OLF_AUTO) && (loop->flags & OLF_INDEPENDENT);
   bool noisy = true;
+  bool tiling = loop->flags & OLF_TILE;
 
 #ifdef ACCEL_COMPILER
   /* When device_type is supported, we want the device compiler to be
@@ -19495,31 +19770,52 @@ static unsigned
   noisy = false;
 #endif
 
-  if (assign && outer_mask < GOMP_DIM_MASK (GOMP_DIM_MAX - 1))
+  if (assign && (!outer_assign | loop->inner))
     {
-      /* Allocate the outermost loop at the outermost available
-	 level.  */
-      unsigned this_mask = outer_mask + 1;
+      /* Allocate outermost and non-innermost loops at the outermost
+	 non-innermost available level.  */
+      unsigned this_mask = GOMP_DIM_MASK (GOMP_DIM_GANG);
+
+      /* Find the first outermost available partition.  */
+      while (this_mask <= outer_mask)
+	this_mask <<= 1;
 
-      if (!(this_mask & loop->inner))
-	loop->mask = this_mask;
+      /* Grab two axes if tiling, and we've not assigned anything.  */
+      if (tiling && !(loop->mask | loop->e_mask))
+	this_mask |= this_mask << 1;
+
+      /* Prohibit the innermost partitioning at the moment.  */
+      this_mask &= GOMP_DIM_MASK (GOMP_DIM_MAX - 1) - 1;
+
+      /* Don't use any dimension explicitly claimed by an inner loop.  */
+      this_mask &= ~loop->inner;
+
+      if (tiling && !loop->e_mask)
+	{
+	  /* If we got two axes, allocate the inner one to the element
+	     loop.  */
+	  loop->e_mask = this_mask & (this_mask << 1);
+	  this_mask ^= loop->e_mask;
+	}
+
+      loop->mask |= this_mask;
     }
 
   if (loop->child)
     {
-      unsigned child_mask = outer_mask | loop->mask;
-
-      if (loop->mask || assign)
-	child_mask |= GOMP_DIM_MASK (GOMP_DIM_MAX);
-
-      loop->inner = oacc_loop_auto_partitions (loop->child, child_mask);
+      unsigned tmp_mask = outer_mask | loop->mask | loop->e_mask;
+      loop->inner = oacc_loop_auto_partitions (loop->child, tmp_mask,
+					       outer_assign | assign);
     }
 
-  if (assign && !loop->mask)
+  if (assign && (!loop->mask || (tiling && !loop->e_mask) || !outer_assign))
     {
-      /* Allocate the loop at the innermost available level.  */
+      /* Allocate the loop at the innermost available level.  Note
+	 that we do this even if we already assigned this loop the
+	 outermost available level above.  That way we'll partition
+	 this along 2 axes, if they are available.  */
       unsigned this_mask = 0;
-
+
       /* Determine the outermost partitioning used within this loop.  */
       this_mask = loop->inner | GOMP_DIM_MASK (GOMP_DIM_MAX);
       this_mask = least_bit_hwi (this_mask);
@@ -19530,24 +19826,44 @@ static unsigned
       /* And avoid picking one use by an outer loop.  */
       this_mask &= ~outer_mask;
 
-      if (!this_mask && noisy)
+      /* If tiling and we failed completely above, grab the next one
+	 too, making sure it doesn't hit an outer loop.  */
+      if (tiling)
+	{
+	  this_mask &= ~(loop->e_mask | loop->mask);
+	  unsigned tile_mask = ((this_mask >> 1)
+				& ~(outer_mask | loop->e_mask | loop->mask));
+
+	  if (tile_mask || loop->mask)
+	    {
+	      loop->e_mask |= this_mask;
+	      this_mask = tile_mask;
+	    }
+	  if (!loop->e_mask && noisy)
+	    warning_at (loop->loc, 0,
+			"insufficient partitioning available"
+			" to parallelize element loop");
+	}
+
+      loop->mask |= this_mask;
+      if (!loop->mask && noisy)
 	warning_at (loop->loc, 0,
-		    "insufficient partitioning available to parallelize loop");
-
-      loop->mask = this_mask;
+		    "insufficient partitioning available"
+		    " to parallelize%s loop", tiling ? " tile" : "");
     }
 
   if (assign && dump_file)
-    fprintf (dump_file, "Auto loop %s:%d assigned %d\n",
+    fprintf (dump_file, "Auto loop %s:%d assigned %d & %d\n",
 	     LOCATION_FILE (loop->loc), LOCATION_LINE (loop->loc),
-	     loop->mask);
+	     loop->mask, loop->e_mask);
 
   unsigned inner_mask = 0;
 
   if (loop->sibling)
-    inner_mask |= oacc_loop_auto_partitions (loop->sibling, outer_mask);
+    inner_mask |= oacc_loop_auto_partitions (loop->sibling,
+					     outer_mask, outer_assign);
 
-  inner_mask |= loop->inner | loop->mask;
+  inner_mask |= loop->inner | loop->mask | loop->e_mask;
 
   return inner_mask;
 }
@@ -19563,7 +19879,7 @@ oacc_loop_partition (oacc_loop *loop, unsigned out
   if (mask_all & GOMP_DIM_MASK (GOMP_DIM_MAX))
     {
       mask_all ^= GOMP_DIM_MASK (GOMP_DIM_MAX);
-      mask_all |= oacc_loop_auto_partitions (loop, outer_mask);
+      mask_all |= oacc_loop_auto_partitions (loop, outer_mask, false);
     }
 
   return mask_all;
 }
@@ -19713,6 +20029,11 @@ execute_oacc_device_lower ()
 	  {
 	  default: break;
 
+	  case IFN_GOACC_TILE:
+	    oacc_xform_tile (call);
+	    rescan = true;
+	    break;
+
 	  case IFN_GOACC_LOOP:
 	    oacc_xform_loop (call);
 	    rescan = true;
@@ -19740,7 +20061,7 @@ execute_oacc_device_lower ()
 	      switch (kind)
 		{
 		default:
-		  gcc_unreachable ();
+		  break;
 
 		case IFN_UNIQUE_OACC_FORK:
 		case IFN_UNIQUE_OACC_JOIN: