From patchwork Tue Nov 24 12:24:47 2015
X-Patchwork-Submitter: Tom de Vries
X-Patchwork-Id: 548039
Subject: Re: [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels
From: Tom de Vries
To: "gcc-patches@gnu.org"
CC: Jakub Jelinek, Richard Biener
Date: Tue, 24 Nov 2015 13:24:47 +0100
Message-ID: <5654570F.3050003@mentor.com>
In-Reply-To: <5649C508.80803@mentor.com>
References: <5640BD31.2060602@mentor.com> <5640F98B.5050601@mentor.com>
 <5649C508.80803@mentor.com>

On 16/11/15 12:59, Tom de Vries wrote:
> On 09/11/15 20:52, Tom de Vries
wrote:
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>      1  Insert new exit block only when needed in
>>>         transform_to_exit_first_loop_alt
>>>      2  Make create_parallel_loop return void
>>>      3  Ignore reduction clause on kernels directive
>>>      4  Implement -foffload-alias
>>>      5  Add in_oacc_kernels_region in struct loop
>>>      6  Add pass_oacc_kernels
>>>      7  Add pass_dominator_oacc_kernels
>>>      8  Add pass_ch_oacc_kernels
>>>      9  Add pass_parallelize_loops_oacc_kernels
>>>     10  Add pass_oacc_kernels pass group in passes.def
>>>     11  Update testcases after adding kernels pass group
>>>     12  Handle acc loop directive
>>>     13  Add c-c++-common/goacc/kernels-*.c
>>>     14  Add gfortran.dg/goacc/kernels-*.f95
>>>     15  Add libgomp.oacc-c-c++-common/kernels-*.c
>>>     16  Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16
>>> are intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Built and reg-tested with an nvidia accelerator, in combination with
>>> a patch that enables accelerator testing (submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>
>> This patch adds pass_parallelize_loops_oacc_kernels.
>>
>> There are a number of things we do differently in parloops for oacc
>> kernels:
>> - in normal parloops, we generate code to choose between a parallel
>>   version of the loop and a sequential (low iteration count) version.
>>   Since the code in an oacc kernels region is supposed to run on the
>>   accelerator anyway, we skip this check and don't add a low iteration
>>   count loop.
>> - in normal parloops, we generate a #pragma omp parallel /
>>   GIMPLE_OMP_RETURN pair to delimit the region that will be split off
>>   into a thread function.  Since the oacc kernels region is already
>>   split off, we don't add this pair.
>> - we indicate the parallelization factor by setting the oacc function
>>   attributes.
>> - we generate a #pragma oacc loop instead of a #pragma omp for, and we
>>   add the gang clause.
>> - in normal parloops, we rewrite the variable accesses in the loop
>>   into accesses relative to a thread function parameter.  For the oacc
>>   kernels region, that rewrite has already been done at omp-lower, so
>>   we skip this.
>> - we need to ensure that the entire kernels region can be run in
>>   parallel.  The loop independence check is already present, so for
>>   oacc kernels we add a check between blocks outside the loop and the
>>   entire region.
>> - we guard stores in the blocks outside the loop with gang_pos == 0.
>>   There's no need for each gang to write to a single location; we can
>>   do this in just one gang.  (Typically this is the write of the final
>>   value of the iteration variable, if that one is copied back to the
>>   host.)
>
> Reposting with loop optimizer init added in
> pass_parallelize_loops_oacc_kernels::execute.
>

Reposting with loop_optimizer_finalize, scev_initialize and scev_finalize
added in pass_parallelize_loops_oacc_kernels::execute.

Thanks,
- Tom

Add pass_parallelize_loops_oacc_kernels

2015-11-09  Tom de Vries

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	* omp-low.c (expand_omp_atomic_fetch_op): Release defs of update
	stmt.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
	(create_call_for_reduction_1): Handle case that reduc_addr is
	non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New function.
	(parallelize_loops): Add and handle function parameter
	oacc_kernels_p.  Calculate dominance info.  Skip loops that are not
	in a kernels region in oacc_kernels_p mode.  Skip inner loops of
	parallelized loops.
	(pass_parallelize_loops::execute): Call parallelize_loops with false
	argument.
	(pass_data_parallelize_loops_oacc_kernels): New pass_data.
	(class pass_parallelize_loops_oacc_kernels): New pass.
	(pass_parallelize_loops_oacc_kernels::execute)
	(make_pass_parallelize_loops_oacc_kernels): New function.
	* tree-pass.h (make_pass_parallelize_loops_oacc_kernels): Declare.
---
 gcc/omp-low.c       |   8 +-
 gcc/omp-low.h       |   1 +
 gcc/tree-parloops.c | 700 +++++++++++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-pass.h     |   2 +
 4 files changed, 647 insertions(+), 64 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 0d4c6e5..efe5d3a 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -11925,10 +11925,14 @@ expand_omp_atomic_fetch_op (basic_block load_bb,
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ATOMIC_STORE);
   gsi_remove (&gsi, true);
   gsi = gsi_last_bb (store_bb);
+  stmt = gsi_stmt (gsi);
   gsi_remove (&gsi, true);
 
   if (gimple_in_ssa_p (cfun))
-    update_ssa (TODO_update_ssa_no_phi);
+    {
+      release_defs (stmt);
+      update_ssa (TODO_update_ssa_no_phi);
+    }
 
   return true;
 }
@@ -12302,7 +12306,7 @@ replace_oacc_fn_attrib (tree fn, tree dims)
    function attribute.  Push any that are non-constant onto the ARGS list,
    along with an appropriate GOMP_LAUNCH_DIM tag.  */
 
-static void
+void
 set_oacc_fn_attrib (tree fn, tree clauses, vec<tree> *args)
 {
   /* Must match GOMP_DIM ordering.
*/
diff --git a/gcc/omp-low.h b/gcc/omp-low.h
index 194b3d1..1790f40 100644
--- a/gcc/omp-low.h
+++ b/gcc/omp-low.h
@@ -33,6 +33,7 @@ extern tree omp_member_access_dummy_var (tree);
 extern void replace_oacc_fn_attrib (tree, tree);
 extern tree build_oacc_routine_dims (tree);
 extern tree get_oacc_fn_attrib (tree);
+extern void set_oacc_fn_attrib (tree, tree, vec<tree> *);
 extern int get_oacc_ifn_dim_arg (const gimple *);
 extern int get_oacc_fn_dim_size (tree, int);
 
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 9b564ca..0403d3b 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -53,6 +53,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa.h"
 #include "params.h"
 #include "params-enum.h"
+#include "tree-ssa-alias.h"
+#include "tree-eh.h"
+#include "gomp-constants.h"
+#include "tree-dfa.h"
 
 /* This pass tries to distribute iterations of loops into several threads.
    The implementation is straightforward -- for each loop we test whether its
@@ -192,6 +196,8 @@ struct reduction_info
 				   of the reduction variable when existing the loop.  */
   tree initial_value;		/* The initial value of the reduction var before entering the loop.  */
   tree field;			/*  the name of the field in the parloop data structure intended for reduction.  */
+  tree reduc_addr;		/* The address of the reduction variable for
+				   openacc reductions.  */
   tree init;			/* reduction initialization value.  */
   gphi *new_phi;		/* (helper field) Newly created phi node whose
 				   result will be passed to the atomic operation.
Represents @@ -1085,10 +1091,29 @@ create_call_for_reduction_1 (reduction_info **slot, struct clsn_data *clsn_data) tree tmp_load, name; gimple *load; - load_struct = build_simple_mem_ref (clsn_data->load); - t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE); + if (reduc->reduc_addr == NULL_TREE) + { + load_struct = build_simple_mem_ref (clsn_data->load); + t = build3 (COMPONENT_REF, type, load_struct, reduc->field, NULL_TREE); + + addr = build_addr (t); + } + else + { + /* Set the address for the atomic store. */ + addr = reduc->reduc_addr; - addr = build_addr (t); + /* Remove the non-atomic store '*addr = sum'. */ + tree res = PHI_RESULT (reduc->keep_res); + use_operand_p use_p; + gimple *stmt; + bool single_use_p = single_imm_use (res, &use_p, &stmt); + gcc_assert (single_use_p); + replace_uses_by (gimple_vdef (stmt), + gimple_vuse (stmt)); + gimple_stmt_iterator gsi = gsi_for_stmt (stmt); + gsi_remove (&gsi, true); + } /* Create phi node. */ bb = clsn_data->load_bb; @@ -1990,7 +2015,8 @@ transform_to_exit_first_loop (struct loop *loop, static void create_parallel_loop (struct loop *loop, tree loop_fn, tree data, - tree new_data, unsigned n_threads, location_t loc) + tree new_data, unsigned n_threads, location_t loc, + bool oacc_kernels_p) { gimple_stmt_iterator gsi; basic_block bb, paral_bb, for_bb, ex_bb, continue_bb; @@ -2003,19 +2029,33 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data, gomp_continue *omp_cont_stmt; tree cvar, cvar_init, initvar, cvar_next, cvar_base, type; edge exit, nexit, guard, end, e; + tree for_clauses = NULL_TREE; /* Prepare the GIMPLE_OMP_PARALLEL statement. 
*/ bb = loop_preheader_edge (loop)->src; - paral_bb = single_pred (bb); - gsi = gsi_last_bb (paral_bb); + if (!oacc_kernels_p) + { + paral_bb = single_pred (bb); + gsi = gsi_last_bb (paral_bb); + } - t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS); - OMP_CLAUSE_NUM_THREADS_EXPR (t) - = build_int_cst (integer_type_node, n_threads); - omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data); - gimple_set_location (omp_par_stmt, loc); + if (!oacc_kernels_p) + { + t = build_omp_clause (loc, OMP_CLAUSE_NUM_THREADS); + OMP_CLAUSE_NUM_THREADS_EXPR (t) + = build_int_cst (integer_type_node, n_threads); + omp_par_stmt = gimple_build_omp_parallel (NULL, t, loop_fn, data); + gimple_set_location (omp_par_stmt, loc); - gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT); + gsi_insert_after (&gsi, omp_par_stmt, GSI_NEW_STMT); + } + else + { + tree clause = build_omp_clause (loc, OMP_CLAUSE_NUM_GANGS); + OMP_CLAUSE_NUM_GANGS_EXPR (clause) + = build_int_cst (integer_type_node, n_threads); + set_oacc_fn_attrib (cfun->decl, clause, NULL); + } /* Initialize NEW_DATA. */ if (data) @@ -2033,12 +2073,18 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data, gsi_insert_before (&gsi, assign_stmt, GSI_SAME_STMT); } - /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL. */ - bb = split_loop_exit_edge (single_dom_exit (loop)); - gsi = gsi_last_bb (bb); - omp_return_stmt1 = gimple_build_omp_return (false); - gimple_set_location (omp_return_stmt1, loc); - gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT); + /* Skip insertion of OMP_RETURN for oacc_kernels_p. We've already generated + one when lowering the oacc kernels directive in + pass_lower_omp/lower_omp (). */ + if (!oacc_kernels_p) + { + /* Emit GIMPLE_OMP_RETURN for GIMPLE_OMP_PARALLEL. 
*/ + bb = split_loop_exit_edge (single_dom_exit (loop)); + gsi = gsi_last_bb (bb); + omp_return_stmt1 = gimple_build_omp_return (false); + gimple_set_location (omp_return_stmt1, loc); + gsi_insert_after (&gsi, omp_return_stmt1, GSI_NEW_STMT); + } /* Extract data for GIMPLE_OMP_FOR. */ gcc_assert (loop->header == single_dom_exit (loop)->src); @@ -2130,7 +2176,17 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data, OMP_CLAUSE_SCHEDULE_CHUNK_EXPR (t) = build_int_cst (integer_type_node, chunk_size); - for_stmt = gimple_build_omp_for (NULL, GF_OMP_FOR_KIND_FOR, t, 1, NULL); + if (1) + { + /* In combination with the NUM_GANGS on the parallel. */ + for_clauses = build_omp_clause (loc, OMP_CLAUSE_GANG); + } + + for_stmt = gimple_build_omp_for (NULL, + (oacc_kernels_p + ? GF_OMP_FOR_KIND_OACC_LOOP + : GF_OMP_FOR_KIND_FOR), + for_clauses, 1, NULL); gimple_set_location (for_stmt, loc); gimple_omp_for_set_index (for_stmt, 0, initvar); gimple_omp_for_set_initial (for_stmt, 0, cvar_init); @@ -2172,7 +2228,8 @@ create_parallel_loop (struct loop *loop, tree loop_fn, tree data, static void gen_parallel_loop (struct loop *loop, reduction_info_table_type *reduction_list, - unsigned n_threads, struct tree_niter_desc *niter) + unsigned n_threads, struct tree_niter_desc *niter, + bool oacc_kernels_p) { tree many_iterations_cond, type, nit; tree arg_struct, new_arg_struct; @@ -2253,40 +2310,44 @@ gen_parallel_loop (struct loop *loop, if (stmts) gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts); - if (loop->inner) - m_p_thread=2; - else - m_p_thread=MIN_PER_THREAD; - - many_iterations_cond = - fold_build2 (GE_EXPR, boolean_type_node, - nit, build_int_cst (type, m_p_thread * n_threads)); - - many_iterations_cond - = fold_build2 (TRUTH_AND_EXPR, boolean_type_node, - invert_truthvalue (unshare_expr (niter->may_be_zero)), - many_iterations_cond); - many_iterations_cond - = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE); - if (stmts) - 
gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts); - if (!is_gimple_condexpr (many_iterations_cond)) + if (!oacc_kernels_p) { + if (loop->inner) + m_p_thread=2; + else + m_p_thread=MIN_PER_THREAD; + + many_iterations_cond = + fold_build2 (GE_EXPR, boolean_type_node, + nit, build_int_cst (type, m_p_thread * n_threads)); + + many_iterations_cond + = fold_build2 (TRUTH_AND_EXPR, boolean_type_node, + invert_truthvalue (unshare_expr (niter->may_be_zero)), + many_iterations_cond); many_iterations_cond - = force_gimple_operand (many_iterations_cond, &stmts, - true, NULL_TREE); + = force_gimple_operand (many_iterations_cond, &stmts, false, NULL_TREE); if (stmts) gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts); - } + if (!is_gimple_condexpr (many_iterations_cond)) + { + many_iterations_cond + = force_gimple_operand (many_iterations_cond, &stmts, + true, NULL_TREE); + if (stmts) + gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), + stmts); + } - initialize_original_copy_tables (); + initialize_original_copy_tables (); - /* We assume that the loop usually iterates a lot. */ - prob = 4 * REG_BR_PROB_BASE / 5; - loop_version (loop, many_iterations_cond, NULL, - prob, prob, REG_BR_PROB_BASE - prob, true); - update_ssa (TODO_update_ssa); - free_original_copy_tables (); + /* We assume that the loop usually iterates a lot. */ + prob = 4 * REG_BR_PROB_BASE / 5; + loop_version (loop, many_iterations_cond, NULL, + prob, prob, REG_BR_PROB_BASE - prob, true); + update_ssa (TODO_update_ssa); + free_original_copy_tables (); + } /* Base all the induction variables in LOOP on a single control one. 
*/ canonicalize_loop_ivs (loop, &nit, true); @@ -2306,6 +2367,9 @@ gen_parallel_loop (struct loop *loop, } else { + if (oacc_kernels_p) + n_threads = 1; + /* Fall back on the method that handles more cases, but duplicates the loop body: move the exit condition of LOOP to the beginning of its header, and duplicate the part of the last iteration that gets disabled @@ -2322,19 +2386,34 @@ gen_parallel_loop (struct loop *loop, entry = loop_preheader_edge (loop); exit = single_dom_exit (loop); - eliminate_local_variables (entry, exit); - /* In the old loop, move all variables non-local to the loop to a structure - and back, and create separate decls for the variables used in loop. */ - separate_decls_in_region (entry, exit, reduction_list, &arg_struct, - &new_arg_struct, &clsn_data); + /* This rewrites the body in terms of new variables. This has already + been done for oacc_kernels_p in pass_lower_omp/lower_omp (). */ + if (!oacc_kernels_p) + { + eliminate_local_variables (entry, exit); + /* In the old loop, move all variables non-local to the loop to a + structure and back, and create separate decls for the variables used in + loop. */ + separate_decls_in_region (entry, exit, reduction_list, &arg_struct, + &new_arg_struct, &clsn_data); + } + else + { + arg_struct = NULL_TREE; + new_arg_struct = NULL_TREE; + clsn_data.load = NULL_TREE; + clsn_data.load_bb = exit->dest; + clsn_data.store = NULL_TREE; + clsn_data.store_bb = NULL; + } /* Create the parallel constructs. 
*/ loc = UNKNOWN_LOCATION; cond_stmt = last_stmt (loop->header); if (cond_stmt) loc = gimple_location (cond_stmt); - create_parallel_loop (loop, create_loop_fn (loc), arg_struct, - new_arg_struct, n_threads, loc); + create_parallel_loop (loop, create_loop_fn (loc), arg_struct, new_arg_struct, + n_threads, loc, oacc_kernels_p); if (reduction_list->elements () > 0) create_call_for_reduction (loop, reduction_list, &clsn_data); @@ -2531,12 +2610,21 @@ try_get_loop_niter (loop_p loop, struct tree_niter_desc *niter) return true; } +static tree +get_omp_data_i_param (void) +{ + tree decl = DECL_ARGUMENTS (cfun->decl); + gcc_assert (DECL_CHAIN (decl) == NULL_TREE); + return ssa_default_def (cfun, decl); +} + /* Try to initialize REDUCTION_LIST for code generation part. REDUCTION_LIST describes the reductions. */ static bool try_create_reduction_list (loop_p loop, - reduction_info_table_type *reduction_list) + reduction_info_table_type *reduction_list, + bool oacc_kernels_p) { edge exit = single_dom_exit (loop); gphi_iterator gsi; @@ -2595,6 +2683,7 @@ try_create_reduction_list (loop_p loop, " FAILED: it is not a part of reduction.\n"); return false; } + red->keep_res = phi; if (dump_file && (dump_flags & TDF_DETAILS)) { fprintf (dump_file, "reduction phi is "); @@ -2629,15 +2718,402 @@ try_create_reduction_list (loop_p loop, } + if (oacc_kernels_p) + { + edge e = loop_preheader_edge (loop); + + for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi); + gsi_next (&gsi)) + { + gphi *phi = gsi.phi (); + tree def = PHI_RESULT (phi); + affine_iv iv; + + if (!virtual_operand_p (def) + && !simple_iv (loop, loop, def, &iv, true)) + { + struct reduction_info *red; + red = reduction_phi (reduction_list, phi); + + /* Look for pattern: + + + .omp_data_i = &.omp_data_arr; + addr = .omp_data_i->sum; + sum_a = *addr; + + : + sum_b = PHI + + and assign addr to reduc->reduc_addr. 
*/ + + tree arg = PHI_ARG_DEF_FROM_EDGE (phi, e); + gimple *stmt = SSA_NAME_DEF_STMT (arg); + if (!gimple_assign_single_p (stmt)) + return false; + tree memref = gimple_assign_rhs1 (stmt); + if (TREE_CODE (memref) != MEM_REF) + return false; + tree addr = TREE_OPERAND (memref, 0); + + gimple *stmt2 = SSA_NAME_DEF_STMT (addr); + if (!gimple_assign_single_p (stmt2)) + return false; + tree compref = gimple_assign_rhs1 (stmt2); + if (TREE_CODE (compref) != COMPONENT_REF) + return false; + tree addr2 = TREE_OPERAND (compref, 0); + if (TREE_CODE (addr2) != MEM_REF) + return false; + addr2 = TREE_OPERAND (addr2, 0); + if (TREE_CODE (addr2) != SSA_NAME + || addr2 != get_omp_data_i_param ()) + return false; + red->reduc_addr = addr; + } + } + } + + return true; +} + +static bool +ref_conflicts_with_region (gimple_stmt_iterator gsi, ao_ref *ref, + bool ref_is_store, vec region_bbs, + unsigned int i, gimple *skip_stmt) +{ + basic_block bb = region_bbs[i]; + gsi_next (&gsi); + + while (true) + { + for (; !gsi_end_p (gsi); + gsi_next (&gsi)) + { + gimple *stmt = gsi_stmt (gsi); + if (stmt == skip_stmt) + { + if (dump_file) + { + fprintf (dump_file, "skipping reduction store: "); + print_gimple_stmt (dump_file, stmt, 0, 0); + } + continue; + } + + if (!gimple_vdef (stmt) + && !gimple_vuse (stmt)) + continue; + + if (gimple_code (stmt) == GIMPLE_RETURN) + continue; + + if (ref_is_store) + { + if (ref_maybe_used_by_stmt_p (stmt, ref)) + { + if (dump_file) + { + fprintf (dump_file, "Stmt "); + print_gimple_stmt (dump_file, stmt, 0, 0); + } + return true; + } + } + else + { + if (stmt_may_clobber_ref_p_1 (stmt, ref)) + { + if (dump_file) + { + fprintf (dump_file, "Stmt "); + print_gimple_stmt (dump_file, stmt, 0, 0); + } + return true; + } + } + } + i++; + if (i == region_bbs.length ()) + break; + bb = region_bbs[i]; + gsi = gsi_start_bb (bb); + } + + return false; +} + +static bool +oacc_entry_exit_ok_1 (bitmap in_loop_bbs, vec region_bbs, + tree omp_data_i, + 
reduction_info_table_type *reduction_list, + bitmap reduction_stores) +{ + unsigned i; + basic_block bb; + FOR_EACH_VEC_ELT (region_bbs, i, bb) + { + if (bitmap_bit_p (in_loop_bbs, bb->index)) + continue; + + gimple_stmt_iterator gsi; + for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); + gsi_next (&gsi)) + { + gimple *stmt = gsi_stmt (gsi); + gimple *skip_stmt = NULL; + + if (is_gimple_debug (stmt) + || gimple_code (stmt) == GIMPLE_COND) + continue; + + ao_ref ref; + bool ref_is_store = false; + if (gimple_assign_load_p (stmt)) + { + tree rhs = gimple_assign_rhs1 (stmt); + tree base = get_base_address (rhs); + if (TREE_CODE (base) == MEM_REF + && operand_equal_p (TREE_OPERAND (base, 0), omp_data_i, 0)) + continue; + + tree lhs = gimple_assign_lhs (stmt); + if (TREE_CODE (lhs) == SSA_NAME + && has_single_use (lhs)) + { + use_operand_p use_p; + gimple *use_stmt; + single_imm_use (lhs, &use_p, &use_stmt); + if (gimple_code (use_stmt) == GIMPLE_PHI) + { + struct reduction_info *red; + red = reduction_phi (reduction_list, use_stmt); + tree val = PHI_RESULT (red->keep_res); + if (has_single_use (val)) + { + single_imm_use (val, &use_p, &use_stmt); + if (gimple_store_p (use_stmt)) + { + unsigned int id + = SSA_NAME_VERSION (gimple_vdef (use_stmt)); + bitmap_set_bit (reduction_stores, id); + skip_stmt = use_stmt; + if (dump_file) + { + fprintf (dump_file, "found reduction load: "); + print_gimple_stmt (dump_file, stmt, 0, 0); + } + } + } + } + } + + ao_ref_init (&ref, rhs); + } + else if (gimple_store_p (stmt)) + { + ao_ref_init (&ref, gimple_assign_lhs (stmt)); + ref_is_store = true; + } + else if (gimple_code (stmt) == GIMPLE_OMP_RETURN) + continue; + else if (!gimple_has_side_effects (stmt) + && !gimple_could_trap_p (stmt) + && !stmt_could_throw_p (stmt) + && !gimple_vdef (stmt) + && !gimple_vuse (stmt)) + continue; + else if (is_gimple_call (stmt) + && gimple_call_internal_p (stmt) + && gimple_call_internal_fn (stmt) == IFN_GOACC_DIM_POS) + continue; + else if 
(gimple_code (stmt) == GIMPLE_RETURN) + continue; + else + { + if (dump_file) + { + fprintf (dump_file, "Unhandled stmt in entry/exit: "); + print_gimple_stmt (dump_file, stmt, 0, 0); + } + return false; + } + + if (ref_conflicts_with_region (gsi, &ref, ref_is_store, region_bbs, + i, skip_stmt)) + { + if (dump_file) + { + fprintf (dump_file, "conflicts with entry/exit stmt: "); + print_gimple_stmt (dump_file, stmt, 0, 0); + } + return false; + } + } + } + return true; } +/* Find stores inside REGION_BBS and outside IN_LOOP_BBS, and guard them with + gang_pos == 0, except when the stores are REDUCTION_STORES. Return true + if any changes were made. */ + +static bool +oacc_entry_exit_single_gang (bitmap in_loop_bbs, vec region_bbs, + bitmap reduction_stores) +{ + tree gang_pos = NULL_TREE; + bool changed = false; + + unsigned i; + basic_block bb; + FOR_EACH_VEC_ELT (region_bbs, i, bb) + { + if (bitmap_bit_p (in_loop_bbs, bb->index)) + continue; + + gimple_stmt_iterator gsi; + for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);) + { + gimple *stmt = gsi_stmt (gsi); + + if (!gimple_store_p (stmt)) + { + /* Update gsi to point to next stmt. */ + gsi_next (&gsi); + continue; + } + + if (bitmap_bit_p (reduction_stores, + SSA_NAME_VERSION (gimple_vdef (stmt)))) + { + if (dump_file) + { + fprintf (dump_file, + "skipped reduction store for single-gang" + " neutering: "); + print_gimple_stmt (dump_file, stmt, 0, 0); + } + + /* Update gsi to point to next stmt. 
*/ + gsi_next (&gsi); + continue; + } + + changed = true; + + if (gang_pos == NULL_TREE) + { + tree arg = build_int_cst (integer_type_node, GOMP_DIM_GANG); + gcall *gang_single + = gimple_build_call_internal (IFN_GOACC_DIM_POS, 1, arg); + gang_pos = make_ssa_name (integer_type_node); + gimple_call_set_lhs (gang_single, gang_pos); + gimple_stmt_iterator start + = gsi_start_bb (single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun))); + tree vuse = ssa_default_def (cfun, gimple_vop (cfun)); + gimple_set_vuse (gang_single, vuse); + gsi_insert_before (&start, gang_single, GSI_SAME_STMT); + } + + if (dump_file) + { + fprintf (dump_file, + "found store that needs single-gang neutering: "); + print_gimple_stmt (dump_file, stmt, 0, 0); + } + + { + /* Split block before store. */ + gimple_stmt_iterator gsi2 = gsi; + gsi_prev (&gsi2); + edge e; + if (gsi_end_p (gsi2)) + { + e = split_block_after_labels (bb); + gsi2 = gsi_last_bb (bb); + } + else + e = split_block (bb, gsi_stmt (gsi2)); + basic_block bb2 = e->dest; + + /* Split block after store. */ + gimple_stmt_iterator gsi3 = gsi_start_bb (bb2); + edge e2 = split_block (bb2, gsi_stmt (gsi3)); + basic_block bb3 = e2->dest; + + gimple *cond + = gimple_build_cond (EQ_EXPR, gang_pos, integer_zero_node, + NULL_TREE, NULL_TREE); + gsi_insert_after (&gsi2, cond, GSI_NEW_STMT); + + edge e3 = make_edge (bb, bb3, EDGE_FALSE_VALUE); + e->flags = EDGE_TRUE_VALUE; + + tree vdef = gimple_vdef (stmt); + tree vuse = gimple_vuse (stmt); + + tree phi_res = copy_ssa_name (vdef); + gphi *new_phi = create_phi_node (phi_res, bb3); + replace_uses_by (vdef, phi_res); + add_phi_arg (new_phi, vuse, e3, UNKNOWN_LOCATION); + add_phi_arg (new_phi, vdef, e2, UNKNOWN_LOCATION); + + /* Update gsi to point to next stmt. 
*/ + bb = bb3; + gsi = gsi_start_bb (bb); + } + } + } + + return changed; +} + +static bool +oacc_entry_exit_ok (struct loop *loop, + reduction_info_table_type *reduction_list) +{ + basic_block *loop_bbs = get_loop_body_in_dom_order (loop); + tree omp_data_i = get_omp_data_i_param (); + gcc_assert (omp_data_i != NULL_TREE); + vec region_bbs + = get_all_dominated_blocks (CDI_DOMINATORS, ENTRY_BLOCK_PTR_FOR_FN (cfun)); + + bitmap in_loop_bbs = BITMAP_ALLOC (NULL); + bitmap_clear (in_loop_bbs); + for (unsigned int i = 0; i < loop->num_nodes; i++) + bitmap_set_bit (in_loop_bbs, loop_bbs[i]->index); + + bitmap reduction_stores = BITMAP_ALLOC (NULL); + bool res = oacc_entry_exit_ok_1 (in_loop_bbs, region_bbs, omp_data_i, + reduction_list, reduction_stores); + + if (res) + { + bool changed = oacc_entry_exit_single_gang (in_loop_bbs, region_bbs, + reduction_stores); + if (changed) + { + free_dominance_info (CDI_DOMINATORS); + calculate_dominance_info (CDI_DOMINATORS); + } + } + + free (loop_bbs); + + BITMAP_FREE (in_loop_bbs); + BITMAP_FREE (reduction_stores); + + return res; +} + /* Detect parallel loops and generate parallel code using libgomp primitives. Returns true if some loop was parallelized, false otherwise. */ static bool -parallelize_loops (void) +parallelize_loops (bool oacc_kernels_p) { unsigned n_threads = flag_tree_parallelize_loops; bool changed = false; @@ -2649,19 +3125,29 @@ parallelize_loops (void) source_location loop_loc; /* Do not parallelize loops in the functions created by parallelization. */ - if (parallelized_function_p (cfun->decl)) + if (!oacc_kernels_p + && parallelized_function_p (cfun->decl)) return false; + + /* Do not parallelize loops in offloaded functions. 
*/ + if (!oacc_kernels_p + && get_oacc_fn_attrib (cfun->decl) != NULL) + return false; + if (cfun->has_nonlocal_label) return false; gcc_obstack_init (&parloop_obstack); reduction_info_table_type reduction_list (10); + calculate_dominance_info (CDI_DOMINATORS); + FOR_EACH_LOOP (loop, 0) { if (loop == skip_loop) { - if (dump_file && (dump_flags & TDF_DETAILS)) + if (!loop->in_oacc_kernels_region + && dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, "Skipping loop %d as inner loop of parallelized loop\n", loop->num); @@ -2673,6 +3159,22 @@ parallelize_loops (void) skip_loop = NULL; reduction_list.empty (); + + if (oacc_kernels_p) + { + if (!loop->in_oacc_kernels_region) + continue; + + /* Don't try to parallelize inner loops in an oacc kernels region. */ + if (loop->inner) + skip_loop = loop->inner; + + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "Trying loop %d with header bb %d in oacc kernels" + " region\n", loop->num, loop->header->index); + } + if (dump_file && (dump_flags & TDF_DETAILS)) { fprintf (dump_file, "Trying loop %d as candidate\n",loop->num); @@ -2714,6 +3216,7 @@ parallelize_loops (void) /* FIXME: Bypass this check as graphite doesn't update the count and frequency correctly now. */ if (!flag_loop_parallelize_all + && !oacc_kernels_p && ((estimated != -1 && estimated <= (HOST_WIDE_INT) n_threads * MIN_PER_THREAD) /* Do not bother with loops in cold areas. */ @@ -2723,14 +3226,23 @@ parallelize_loops (void) if (!try_get_loop_niter (loop, &niter_desc)) continue; - if (!try_create_reduction_list (loop, &reduction_list)) + if (!try_create_reduction_list (loop, &reduction_list, oacc_kernels_p)) continue; if (!flag_loop_parallelize_all && !loop_parallel_p (loop, &parloop_obstack)) continue; + if (oacc_kernels_p + && !oacc_entry_exit_ok (loop, &reduction_list)) + { + if (dump_file) + fprintf (dump_file, "entry/exit not ok: FAILED\n"); + continue; + } + changed = true; + /* Skip inner loop(s) of parallelized loop. 
*/ skip_loop = loop->inner; if (dump_file && (dump_flags & TDF_DETAILS)) { @@ -2743,8 +3255,9 @@ parallelize_loops (void) fprintf (dump_file, "\nloop at %s:%d: ", LOCATION_FILE (loop_loc), LOCATION_LINE (loop_loc)); } + gen_parallel_loop (loop, &reduction_list, - n_threads, &niter_desc); + n_threads, &niter_desc, oacc_kernels_p); } obstack_free (&parloop_obstack, NULL); @@ -2794,7 +3307,7 @@ pass_parallelize_loops::execute (function *fun) if (number_of_loops (fun) <= 1) return 0; - if (parallelize_loops ()) + if (parallelize_loops (false)) { fun->curr_properties &= ~(PROP_gimple_eomp); @@ -2813,3 +3326,66 @@ make_pass_parallelize_loops (gcc::context *ctxt) { return new pass_parallelize_loops (ctxt); } + +namespace { + +const pass_data pass_data_parallelize_loops_oacc_kernels = +{ + GIMPLE_PASS, /* type */ + "parloops_oacc_kernels", /* name */ + OPTGROUP_LOOP, /* optinfo_flags */ + TV_TREE_PARALLELIZE_LOOPS, /* tv_id */ + ( PROP_cfg | PROP_ssa ), /* properties_required */ + 0, /* properties_provided */ + 0, /* properties_destroyed */ + 0, /* todo_flags_start */ + 0, /* todo_flags_finish */ +}; + +class pass_parallelize_loops_oacc_kernels : public gimple_opt_pass +{ +public: + pass_parallelize_loops_oacc_kernels (gcc::context *ctxt) + : gimple_opt_pass (pass_data_parallelize_loops_oacc_kernels, ctxt) + {} + + /* opt_pass methods: */ + virtual bool gate (function *) { return flag_tree_parallelize_loops > 1; } + virtual unsigned int execute (function *); + +}; // class pass_parallelize_loops_oacc_kernels + +unsigned +pass_parallelize_loops_oacc_kernels::execute (function *fun) +{ + unsigned int todo = 0; + + loop_optimizer_init (LOOPS_NORMAL + | LOOPS_HAVE_RECORDED_EXITS); + + if (number_of_loops (fun) <= 1) + return 0; + + rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa); + + scev_initialize (); + + if (parallelize_loops (true)) + { + fun->curr_properties &= ~(PROP_gimple_eomp); + todo |= TODO_update_ssa; + } + + scev_finalize (); + loop_optimizer_finalize (); + + 
return todo;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_parallelize_loops_oacc_kernels (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 9704918..004db77 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -385,6 +385,8 @@ extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
+extern gimple_opt_pass *
+  make_pass_parallelize_loops_oacc_kernels (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);