From patchwork Wed Jun 16 12:43:43 2021
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 1492916
Date: Wed, 16 Jun 2021 14:43:43 +0200 (CEST)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
Cc: richard.sandiford@arm.com
Subject: [PATCH] Vectorization of BB reductions
Message-ID: <228po041-312s-7413-8383-n4592016or4@fhfr.qr>

This adds a simple reduction vectorization capability to the non-loop
vectorizer.  "Simple" meaning it lacks any of the fancy ways to generate
the reduction epilogue and only supports reductions we can handle via a
direct internal function reducing a vector to a scalar.  One of the main
reasons is to avoid massive refactoring at this point, but also that more
complex epilogue operations are hardly profitable.

Mixed-sign reductions are fended off for now, and I'm not finally settled
on whether we want an explicit SLP node for the reduction epilogue
operation.  Handling mixed signs could be done by multiplying with a
{ 1, -1, .. } vector.  Also fended off are reductions with non-internal
operands (constants or register parameters, for example).

Costing is done by accounting the participating original scalar stmts for
the scalar cost, and log2 (number of vector lanes) permutes and vector
operations for the vectorized epilogue.

---
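To illustrate what gets vectorized (a sketch, not part of the patch;
the SSA names below are made up): the simplest seed is a chain like
function g from the new testcase,

  float g (float *v)
  {
    return v[0] + v[1] + v[2] + v[3];
  }

where SLP discovery starts from the four lanes v[0] .. v[3] and the
reduction epilogue is a single direct internal function call, so the
vectorized body looks roughly like

  vect_v = MEM <vector(4) float> [(float *)v_2(D)];
  _3 = .REDUC_PLUS (vect_v);
  return _3;

replacing three scalar additions with one vector load plus one
reduction call.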
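Spelling out the { 1, -1, .. } idea for mixed signs (not implemented by
this patch; the function name and code below are hypothetical): a chain
like

  float m (float *v)
  {
    return v[0] - v[1] + v[2] - v[3];
  }

could still use the direct reduction by first multiplying with a
constant sign vector, roughly

  vect_v = MEM <vector(4) float> [(float *)v_2(D)];
  vect_s = vect_v * { 1.0, -1.0, 1.0, -1.0 };
  _3 = .REDUC_PLUS (vect_s);

as it stands such chains fail the all-same-operation check in
vect_slp_check_for_constructors and are not vectorized.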
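A worked costing example (following the vectorizable_bb_reduc_epilogue
logic below): for a V4SF reduction vect_nunits_for_cost is 4, so the
epilogue is costed as floor_log2 (4) = 2 vector_stmt plus 2 vec_perm
entries, weighed against the three scalar additions accounted on the
scalar side.  A two-lane V2DF chain gets 1 + 1 epilogue entries against
a single scalar operation, which is part of why so few of the
dominantly two-lane seeds mentioned below end up profitable.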
SPEC CPU 2017 FP rate measurements (fastest of three runs picked) show
regressions for 507.cactuBSSN_r (1.5%), 508.namd_r (2.5%), 511.povray_r
(2.5%), 526.blender_r (0.5%) and 527.cam4_r (2.5%) and improvements for
510.parest_r (5%) and 538.imagick_r (1.5%).  This is with -Ofast
-march=znver2 on a Zen2.

Statistics on CPU 2017 show that the overwhelming number of seeds we
find are reductions of two lanes (well - that's basically every
associative operation), which means we put quite high pressure on the
SLP discovery process this way.  In total we find 583218 seeds that we
put to SLP discovery; 66205 of those pass discovery and only 6185 make
it through the code generation checks.  796 of those are discarded
because the reduction is part of a larger SLP instance, 4195 of the
remaining are deemed not profitable to vectorize and 1194 are finally
vectorized.  That's a poor 0.2% rate.

Of the 583218 seeds 486826 (83%) have two lanes, 60912 (10%) have
three, 28181 (5%) four, 4808 five, 909 six, and there are instances
with up to 120 lanes.  There's a set of 54086 candidate seeds we reject
because they contain a constant or invariant (not implemented yet) but
still have two or more lanes that could be put to SLP discovery.

Bootstrapped and tested on x86_64-unknown-linux-gnu; I've also built
and tested SPEC CPU 2017 with -Ofast -march=znver2 successfully.

I do think this is good enough(TM) at this point - please speak up if
you disagree and/or would like to see changes.

Thanks,
Richard.

2021-06-16  Richard Biener

	PR tree-optimization/54400
	* tree-vectorizer.h (enum slp_instance_kind): Add
	slp_inst_kind_bb_reduc.
	(reduction_fn_for_scalar_code): Declare.
	* tree-vect-data-refs.c (vect_slp_analyze_instance_dependence):
	Check SLP_INSTANCE_KIND instead of looking at the representative.
	(vect_slp_analyze_instance_alignment): Likewise.
	* tree-vect-loop.c (reduction_fn_for_scalar_code): Export.
	* tree-vect-slp.c (vect_slp_linearize_chain): Split out chain
	linearization from vect_build_slp_tree_2 and generalize for the
	use of BB reduction vectorization.
	(vect_build_slp_tree_2): Adjust accordingly.
	(vect_optimize_slp): Elide permutes at the root of BB reduction
	instances.
	(vectorizable_bb_reduc_epilogue): New function.
	(vect_slp_prune_covered_roots): Likewise.
	(vect_slp_analyze_operations): Use them.
	(vect_slp_check_for_constructors): Recognize associatable
	chains for BB reduction vectorization.
	(vectorize_slp_instance_root_stmt): Generate code for the
	BB reduction epilogue.

	* gcc.dg/vect/bb-slp-pr54400.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c |  43 +++
 gcc/tree-vect-data-refs.c                  |   9 +-
 gcc/tree-vect-loop.c                       |   2 +-
 gcc/tree-vect-slp.c                        | 383 +++++++++++++++++----
 gcc/tree-vectorizer.h                      |   2 +
 5 files changed, 367 insertions(+), 72 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
new file mode 100644
index 00000000000..6b427aac774
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-require-effective-target vect_float } */
+/* { dg-additional-options "-w -Wno-psabi -ffast-math" } */
+
+#include "tree-vect.h"
+
+typedef float v4sf __attribute__((vector_size(sizeof(float)*4)));
+
+float __attribute__((noipa))
+f(v4sf v)
+{
+  return v[0]+v[1]+v[2]+v[3];
+}
+
+float __attribute__((noipa))
+g(float *v)
+{
+  return v[0]+v[1]+v[2]+v[3];
+}
+
+float __attribute__((noipa))
+h(float *v)
+{
+  return 2*v[0]+3*v[1]+4*v[2]+5*v[3];
+}
+
+int
+main ()
+{
+  check_vect ();
+  v4sf v = (v4sf) { 1.f, 3.f, 4.f, 2.f };
+  if (f (v) != 10.f)
+    abort ();
+  if (g (&v[0]) != 10.f)
+    abort ();
+  if (h (&v[0]) != 37.f)
+    abort ();
+  return 0;
+}
+
+/* We are lacking an effective target for .REDUC_PLUS support.  */
+/* { dg-final { scan-tree-dump-times "basic block part vectorized" 3 "slp2" { target x86_64-*-* } } } */
+/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* } } } */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 2694d1ab452..bb086c6ac1c 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -843,9 +843,9 @@ vect_slp_analyze_instance_dependence (vec_info *vinfo, slp_instance instance)
   DUMP_VECT_SCOPE ("vect_slp_analyze_instance_dependence");
 
   /* The stores of this instance are at the root of the SLP tree.  */
-  slp_tree store = SLP_INSTANCE_TREE (instance);
-  if (! STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (store)))
-    store = NULL;
+  slp_tree store = NULL;
+  if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_store)
+    store = SLP_INSTANCE_TREE (instance);
 
   /* Verify we can sink stores to the vectorized stmt insert location.  */
   stmt_vec_info last_store_info = NULL;
@@ -2464,8 +2464,7 @@ vect_slp_analyze_instance_alignment (vec_info *vinfo,
     if (! vect_slp_analyze_node_alignment (vinfo, node))
       return false;
 
-  node = SLP_INSTANCE_TREE (instance);
-  if (STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (node))
+  if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_store
      && ! vect_slp_analyze_node_alignment (vinfo,
					    SLP_INSTANCE_TREE (instance)))
    return false;
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index ee79808472c..51a46a6d852 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3209,7 +3209,7 @@ fold_left_reduction_fn (tree_code code, internal_fn *reduc_fn)
    Return FALSE if CODE currently cannot be vectorized as reduction.  */
 
-static bool
+bool
 reduction_fn_for_scalar_code (enum tree_code code, internal_fn *reduc_fn)
 {
   switch (code)
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 8ec589b7948..0c1f85beeb2 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1442,6 +1442,84 @@ dt_sort_cmp (const void *op1_, const void *op2_, void *)
   return (int)op1->code - (int)op2->code;
 }
 
+/* Linearize the associatable expression chain at START with the
+   associatable operation CODE (where PLUS_EXPR also allows MINUS_EXPR),
+   filling CHAIN with the result and using WORKLIST as intermediate storage.
+   CODE_STMT and ALT_CODE_STMT are filled with the first stmt using CODE
+   or MINUS_EXPR.  *CHAIN_STMTS if not NULL is filled with all computation
+   stmts, starting with START.  */
+
+static void
+vect_slp_linearize_chain (vec_info *vinfo,
+			  vec<std::pair<tree_code, gimple *> > &worklist,
+			  vec<chain_op_t> &chain,
+			  enum tree_code code, gimple *start,
+			  gimple *&code_stmt, gimple *&alt_code_stmt,
+			  vec<gimple *> *chain_stmts)
+{
+  /* For each lane linearize the addition/subtraction (or other
+     uniform associatable operation) expression tree.  */
+  worklist.safe_push (std::make_pair (code, start));
+  while (!worklist.is_empty ())
+    {
+      auto entry = worklist.pop ();
+      gassign *stmt = as_a <gassign *> (entry.second);
+      enum tree_code in_code = entry.first;
+      enum tree_code this_code = gimple_assign_rhs_code (stmt);
+      /* Pick some stmts suitable for SLP_TREE_REPRESENTATIVE.  */
+      if (!code_stmt
+	  && gimple_assign_rhs_code (stmt) == code)
+	code_stmt = stmt;
+      else if (!alt_code_stmt
+	       && gimple_assign_rhs_code (stmt) == MINUS_EXPR)
+	alt_code_stmt = stmt;
+      if (chain_stmts)
+	chain_stmts->safe_push (stmt);
+      for (unsigned opnum = 1; opnum <= 2; ++opnum)
+	{
+	  tree op = gimple_op (stmt, opnum);
+	  vect_def_type dt;
+	  stmt_vec_info def_stmt_info;
+	  bool res = vect_is_simple_use (op, vinfo, &dt, &def_stmt_info);
+	  gcc_assert (res);
+	  if (dt == vect_internal_def)
+	    {
+	      stmt_vec_info orig_def_stmt_info = def_stmt_info;
+	      def_stmt_info = vect_stmt_to_vectorize (def_stmt_info);
+	      if (def_stmt_info != orig_def_stmt_info)
+		op = gimple_get_lhs (def_stmt_info->stmt);
+	    }
+	  gimple *use_stmt;
+	  use_operand_p use_p;
+	  if (dt == vect_internal_def
+	      && single_imm_use (op, &use_p, &use_stmt)
+	      && is_gimple_assign (def_stmt_info->stmt)
+	      && (gimple_assign_rhs_code (def_stmt_info->stmt) == code
+		  || (code == PLUS_EXPR
+		      && (gimple_assign_rhs_code (def_stmt_info->stmt)
+			  == MINUS_EXPR))))
+	    {
+	      tree_code op_def_code = this_code;
+	      if (op_def_code == MINUS_EXPR && opnum == 1)
+		op_def_code = PLUS_EXPR;
+	      if (in_code == MINUS_EXPR)
+		op_def_code = op_def_code == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR;
+	      worklist.safe_push (std::make_pair (op_def_code,
+						  def_stmt_info->stmt));
+	    }
+	  else
+	    {
+	      tree_code op_def_code = this_code;
+	      if (op_def_code == MINUS_EXPR && opnum == 1)
+		op_def_code = PLUS_EXPR;
+	      if (in_code == MINUS_EXPR)
+		op_def_code = op_def_code == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR;
+	      chain.safe_push (chain_op_t (op_def_code, dt, op));
+	    }
+	}
+    }
+}
+
 typedef hash_map <vec <stmt_vec_info>, slp_tree,
		   simple_hashmap_traits <bst_traits, slp_tree> >
   scalar_stmts_to_slp_tree_map_t;
@@ -1784,63 +1862,14 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
 	{
 	  /* For each lane linearize the addition/subtraction (or other
	     uniform associatable operation) expression tree.  */
-	  worklist.safe_push (std::make_pair (code, stmts[lane]->stmt));
-	  while (!worklist.is_empty ())
-	    {
-	      auto entry = worklist.pop ();
-	      gassign *stmt = as_a <gassign *> (entry.second);
-	      enum tree_code in_code = entry.first;
-	      enum tree_code this_code = gimple_assign_rhs_code (stmt);
-	      /* Pick some stmts suitable for SLP_TREE_REPRESENTATIVE.  */
-	      if (!op_stmt_info
-		  && gimple_assign_rhs_code (stmt) == code)
-		op_stmt_info = vinfo->lookup_stmt (stmt);
-	      else if (!other_op_stmt_info
-		       && gimple_assign_rhs_code (stmt) == MINUS_EXPR)
-		other_op_stmt_info = vinfo->lookup_stmt (stmt);
-	      for (unsigned opnum = 1; opnum <= 2; ++opnum)
-		{
-		  tree op = gimple_op (stmt, opnum);
-		  vect_def_type dt;
-		  stmt_vec_info def_stmt_info;
-		  bool res = vect_is_simple_use (op, vinfo, &dt, &def_stmt_info);
-		  gcc_assert (res);
-		  if (dt == vect_internal_def)
-		    {
-		      def_stmt_info = vect_stmt_to_vectorize (def_stmt_info);
-		      op = gimple_get_lhs (def_stmt_info->stmt);
-		    }
-		  gimple *use_stmt;
-		  use_operand_p use_p;
-		  if (dt == vect_internal_def
-		      && single_imm_use (op, &use_p, &use_stmt)
-		      && is_gimple_assign (def_stmt_info->stmt)
-		      && (gimple_assign_rhs_code (def_stmt_info->stmt) == code
-			  || (code == PLUS_EXPR
-			      && (gimple_assign_rhs_code (def_stmt_info->stmt)
-				  == MINUS_EXPR))))
-		    {
-		      tree_code op_def_code = this_code;
-		      if (op_def_code == MINUS_EXPR && opnum == 1)
-			op_def_code = PLUS_EXPR;
-		      if (in_code == MINUS_EXPR)
-			op_def_code
-			  = op_def_code == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR;
-		      worklist.safe_push (std::make_pair (op_def_code,
-							  def_stmt_info->stmt));
-		    }
-		  else
-		    {
-		      tree_code op_def_code = this_code;
-		      if (op_def_code == MINUS_EXPR && opnum == 1)
-			op_def_code = PLUS_EXPR;
-		      if (in_code == MINUS_EXPR)
-			op_def_code
-			  = op_def_code == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR;
-		      chain.safe_push (chain_op_t (op_def_code, dt, op));
-		    }
-		}
-	    }
+	  gimple *op_stmt = NULL, *other_op_stmt = NULL;
+	  vect_slp_linearize_chain (vinfo, worklist, chain, code,
+				    stmts[lane]->stmt, op_stmt, other_op_stmt,
+				    NULL);
+	  if (!op_stmt_info && op_stmt)
+	    op_stmt_info = vinfo->lookup_stmt (op_stmt);
+	  if (!other_op_stmt_info && other_op_stmt)
+	    other_op_stmt_info = vinfo->lookup_stmt (other_op_stmt);
 	  if (chain.length () == 2)
 	    {
 	      /* In a chain of just two elements resort to the regular
@@ -3915,6 +3944,48 @@ vect_optimize_slp (vec_info *vinfo)
 	    }
 	}
     }
+
+  /* And any permutations of BB reductions.  */
+  if (is_a <bb_vec_info> (vinfo))
+    {
+      for (slp_instance instance : vinfo->slp_instances)
+	{
+	  if (SLP_INSTANCE_KIND (instance) != slp_inst_kind_bb_reduc)
+	    continue;
+	  slp_tree old = SLP_INSTANCE_TREE (instance);
+	  if (SLP_TREE_CODE (old) == VEC_PERM_EXPR
+	      && SLP_TREE_CHILDREN (old).length () == 1)
+	    {
+	      slp_tree child = SLP_TREE_CHILDREN (old)[0];
+	      if (SLP_TREE_DEF_TYPE (child) == vect_external_def)
+		{
+		  /* Preserve the special VEC_PERM we use to shield existing
+		     vector defs from the rest.  But make it a no-op.  */
+		  unsigned i = 0;
+		  for (std::pair<unsigned, unsigned> &p
+		       : SLP_TREE_LANE_PERMUTATION (old))
+		    p.second = i++;
+		}
+	      else
+		{
+		  SLP_INSTANCE_TREE (instance) = child;
+		  SLP_TREE_REF_COUNT (child)++;
+		  vect_free_slp_tree (old);
+		}
+	    }
+	  else if (SLP_TREE_LOAD_PERMUTATION (old).exists ()
+		   && SLP_TREE_REF_COUNT (old) == 1)
+	    {
+	      /* ??? For loads the situation is more complex since
+		 we can't modify the permute in place in case the
+		 node is used multiple times.  In fact for loads this
+		 should be somehow handled in the propagation engine.  */
+	      auto fn = [] (const void *a, const void *b)
+		{ return *(const int *)a - *(const int *)b; };
+	      SLP_TREE_LOAD_PERMUTATION (old).qsort (fn);
+	    }
+	}
+    }
 }
 
 /* Gather loads reachable from the individual SLP graph entries.  */
@@ -4492,7 +4563,6 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
   return res;
 }
 
-
 /* Mark lanes of NODE that are live outside of the basic-block vectorized
    region and that can be vectorized using vectorizable_live_operation
    with STMT_VINFO_LIVE_P.  Not handled live operations will cause the
@@ -4596,6 +4666,55 @@ vect_bb_slp_mark_live_stmts (bb_vec_info bb_vinfo, slp_tree node,
 				cost_vec, svisited, visited);
 }
 
+/* Determine whether we can vectorize the reduction epilogue for INSTANCE.  */
+
+static bool
+vectorizable_bb_reduc_epilogue (slp_instance instance,
+				stmt_vector_for_cost *cost_vec)
+{
+  enum tree_code reduc_code
+    = gimple_assign_rhs_code (instance->root_stmts[0]->stmt);
+  if (reduc_code == MINUS_EXPR)
+    reduc_code = PLUS_EXPR;
+  internal_fn reduc_fn;
+  tree vectype = SLP_TREE_VECTYPE (SLP_INSTANCE_TREE (instance));
+  if (!reduction_fn_for_scalar_code (reduc_code, &reduc_fn)
+      || reduc_fn == IFN_LAST
+      || !direct_internal_fn_supported_p (reduc_fn, vectype,
+					  OPTIMIZE_FOR_BOTH))
+    return false;
+
+  /* There's no way to cost a horizontal vector reduction via REDUC_FN so
+     cost log2 vector operations plus shuffles.  */
+  unsigned steps = floor_log2 (vect_nunits_for_cost (vectype));
+  record_stmt_cost (cost_vec, steps, vector_stmt, instance->root_stmts[0],
+		    vectype, 0, vect_body);
+  record_stmt_cost (cost_vec, steps, vec_perm, instance->root_stmts[0],
+		    vectype, 0, vect_body);
+  return true;
+}
+
+/* Prune from ROOTS all stmts that are computed as part of lanes of NODE
+   and recurse to children.  */
+
+static void
+vect_slp_prune_covered_roots (slp_tree node, hash_set<stmt_vec_info> &roots,
+			      hash_set<slp_tree> &visited)
+{
+  if (SLP_TREE_DEF_TYPE (node) != vect_internal_def
+      || visited.add (node))
+    return;
+
+  stmt_vec_info stmt;
+  unsigned i;
+  FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt)
+    roots.remove (vect_orig_stmt (stmt));
+
+  slp_tree child;
+  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
+    if (child)
+      vect_slp_prune_covered_roots (child, roots, visited);
+}
+
 /* Analyze statements in SLP instances of VINFO.  Return true if the
    operations are supported.  */
 
@@ -4619,15 +4738,20 @@ vect_slp_analyze_operations (vec_info *vinfo)
 					     SLP_INSTANCE_TREE (instance),
 					     instance, visited, visited_vec,
 					     &cost_vec)
-	  /* Instances with a root stmt require vectorized defs for the
-	     SLP tree root.  */
-	  /* ??? Do inst->kind check instead.  */
-	  || (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ()
+	  /* CTOR instances require vectorized defs for the SLP tree root.  */
+	  || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor
 	      && (SLP_TREE_DEF_TYPE (SLP_INSTANCE_TREE (instance))
-		  != vect_internal_def)))
+		  != vect_internal_def))
+	  /* Check we can vectorize the reduction.  */
+	  || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc
+	      && !vectorizable_bb_reduc_epilogue (instance, &cost_vec)))
 	{
 	  slp_tree node = SLP_INSTANCE_TREE (instance);
-	  stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
+	  stmt_vec_info stmt_info;
+	  if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ())
+	    stmt_info = SLP_INSTANCE_ROOT_STMTS (instance)[0];
+	  else
+	    stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "removing SLP instance operations starting from: %G",
 			     stmt_info->stmt);
@@ -4654,6 +4778,34 @@ vect_slp_analyze_operations (vec_info *vinfo)
 	}
     }
 
+  /* Now look for SLP instances with a root that are covered by other
+     instances and remove them.  */
+  hash_set<stmt_vec_info> roots;
+  for (i = 0; vinfo->slp_instances.iterate (i, &instance); ++i)
+    if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ())
+      roots.add (SLP_INSTANCE_ROOT_STMTS (instance)[0]);
+  if (!roots.is_empty ())
+    {
+      visited.empty ();
+      for (i = 0; vinfo->slp_instances.iterate (i, &instance); ++i)
+	vect_slp_prune_covered_roots (SLP_INSTANCE_TREE (instance), roots,
+				      visited);
+      for (i = 0; vinfo->slp_instances.iterate (i, &instance); )
+	if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ()
+	    && !roots.contains (SLP_INSTANCE_ROOT_STMTS (instance)[0]))
+	  {
+	    stmt_vec_info root = SLP_INSTANCE_ROOT_STMTS (instance)[0];
+	    if (dump_enabled_p ())
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "removing SLP instance operations starting "
+			       "from: %G", root->stmt);
+	    vect_free_slp_instance (instance);
+	    vinfo->slp_instances.ordered_remove (i);
+	  }
+	else
+	  ++i;
+    }
+
   /* Compute vectorizable live stmts.  */
   if (bb_vec_info bb_vinfo = dyn_cast <bb_vec_info> (vinfo))
     {
@@ -5115,7 +5267,10 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
 	    continue;
 
 	  tree rhs = gimple_assign_rhs1 (assign);
-	  if (gimple_assign_rhs_code (assign) == CONSTRUCTOR)
+	  enum tree_code code = gimple_assign_rhs_code (assign);
+	  use_operand_p use_p;
+	  gimple *use_stmt;
+	  if (code == CONSTRUCTOR)
 	    {
 	      if (!VECTOR_TYPE_P (TREE_TYPE (rhs))
 		  || maybe_ne (TYPE_VECTOR_SUBPARTS (TREE_TYPE (rhs)),
@@ -5136,7 +5291,7 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
 	      stmt_vec_info stmt_info = bb_vinfo->lookup_stmt (assign);
 	      BB_VINFO_GROUPED_STORES (bb_vinfo).safe_push (stmt_info);
 	    }
-	  else if (gimple_assign_rhs_code (assign) == BIT_INSERT_EXPR
+	  else if (code == BIT_INSERT_EXPR
 		   && VECTOR_TYPE_P (TREE_TYPE (rhs))
 		   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (rhs)).is_constant ()
 		   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (rhs)).to_constant () > 1
@@ -5230,6 +5385,69 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
 	      else
 		roots.release ();
 	    }
+	  else if (!VECTOR_TYPE_P (TREE_TYPE (rhs))
+		   && (associative_tree_code (code) || code == MINUS_EXPR)
+		   /* ??? The flag_associative_math and TYPE_OVERFLOW_WRAPS
+		      checks pessimize a two-element reduction.  PR54400.
+		      ??? In-order reduction could be handled if we only
+		      traverse one operand chain in vect_slp_linearize_chain.  */
+		   && ((FLOAT_TYPE_P (TREE_TYPE (rhs)) && flag_associative_math)
+		       || (INTEGRAL_TYPE_P (TREE_TYPE (rhs))
+			   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (rhs))))
+		   /* Ops with constants at the tail can be stripped here.  */
+		   && TREE_CODE (rhs) == SSA_NAME
+		   && TREE_CODE (gimple_assign_rhs2 (assign)) == SSA_NAME
+		   /* Should be the chain end.  */
+		   && (!single_imm_use (gimple_assign_lhs (assign),
+					&use_p, &use_stmt)
+		       || !is_gimple_assign (use_stmt)
+		       || (gimple_assign_rhs_code (use_stmt) != code
+			   && ((code != PLUS_EXPR && code != MINUS_EXPR)
+			       || (gimple_assign_rhs_code (use_stmt)
+				   != (code == PLUS_EXPR
+				       ? MINUS_EXPR : PLUS_EXPR))))))
+	    {
+	      /* We start the match at the end of a possible association
+		 chain.  */
+	      auto_vec<chain_op_t> chain;
+	      auto_vec<std::pair<tree_code, gimple *> > worklist;
+	      auto_vec<gimple *> chain_stmts;
+	      gimple *code_stmt = NULL, *alt_code_stmt = NULL;
+	      if (code == MINUS_EXPR)
+		code = PLUS_EXPR;
+	      internal_fn reduc_fn;
+	      if (!reduction_fn_for_scalar_code (code, &reduc_fn)
+		  || reduc_fn == IFN_LAST)
+		continue;
+	      vect_slp_linearize_chain (bb_vinfo, worklist, chain, code,
+					assign, /* ??? */code_stmt,
+					alt_code_stmt, &chain_stmts);
+	      if (chain.length () > 1)
+		{
+		  /* Sort the chain according to def_type and operation.  */
+		  chain.sort (dt_sort_cmp, bb_vinfo);
+		  /* ??? Now we'd want to strip externals and constants
+		     but record those to be handled in the epilogue.  */
+		  /* ??? For now do not allow mixing ops or externs/constants.  */
+		  bool invalid = false;
+		  for (unsigned i = 0; i < chain.length (); ++i)
+		    if (chain[i].dt != vect_internal_def
+			|| chain[i].code != code)
+		      invalid = true;
+		  if (!invalid)
+		    {
+		      vec<stmt_vec_info> stmts;
+		      stmts.create (chain.length ());
+		      for (unsigned i = 0; i < chain.length (); ++i)
+			stmts.quick_push (bb_vinfo->lookup_def (chain[i].op));
+		      vec<stmt_vec_info> roots;
+		      roots.create (chain_stmts.length ());
+		      for (unsigned i = 0; i < chain_stmts.length (); ++i)
+			roots.quick_push (bb_vinfo->lookup_stmt (chain_stmts[i]));
+		      bb_vinfo->roots.safe_push (slp_root (slp_inst_kind_bb_reduc,
+							   stmts, roots));
+		    }
+		}
+	    }
 	}
     }
 
@@ -6861,6 +7079,39 @@ vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance)
 	      rstmt = gimple_build_assign (lhs, r_constructor);
 	    }
 	}
+  else if (instance->kind == slp_inst_kind_bb_reduc)
+    {
+      /* Largely inspired by reduction chain epilogue handling in
+	 vect_create_epilog_for_reduction.  */
+      vec<tree> vec_defs = vNULL;
+      vect_get_slp_defs (node, &vec_defs);
+      enum tree_code reduc_code
+	= gimple_assign_rhs_code (instance->root_stmts[0]->stmt);
+      /* ??? We actually have to reflect signs somewhere.  */
+      if (reduc_code == MINUS_EXPR)
+	reduc_code = PLUS_EXPR;
+      gimple_seq epilogue = NULL;
+      /* We may end up with more than one vector result, reduce them
+	 to one vector.  */
+      tree vec_def = vec_defs[0];
+      for (unsigned i = 1; i < vec_defs.length (); ++i)
+	vec_def = gimple_build (&epilogue, reduc_code, TREE_TYPE (vec_def),
+				vec_def, vec_defs[i]);
+      vec_defs.release ();
+      /* ??? Support other schemes than direct internal fn.  */
+      internal_fn reduc_fn;
+      if (!reduction_fn_for_scalar_code (reduc_code, &reduc_fn)
+	  || reduc_fn == IFN_LAST)
+	gcc_unreachable ();
+      tree scalar_def = gimple_build (&epilogue, as_combined_fn (reduc_fn),
+				      TREE_TYPE (TREE_TYPE (vec_def)), vec_def);
+
+      gimple_stmt_iterator rgsi = gsi_for_stmt (instance->root_stmts[0]->stmt);
+      gsi_insert_seq_before (&rgsi, epilogue, GSI_SAME_STMT);
+      gimple_assign_set_rhs_from_tree (&rgsi, scalar_def);
+      update_stmt (gsi_stmt (rgsi));
+      return;
+    }
   else
     gcc_unreachable ();
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 1fb46c6ba14..04c20f8bd0f 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -190,6 +190,7 @@ enum slp_instance_kind {
     slp_inst_kind_store,
     slp_inst_kind_reduc_group,
     slp_inst_kind_reduc_chain,
+    slp_inst_kind_bb_reduc,
     slp_inst_kind_ctor
 };
 
@@ -1971,6 +1972,7 @@ extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *,
 			       unsigned int, unsigned int);
 extern gimple_seq vect_gen_len (tree, tree, tree, tree);
 extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
+extern bool reduction_fn_for_scalar_code (enum tree_code, internal_fn *);
 
 /* Drive for loop transformation stage.  */
 extern class loop *vect_transform_loop (loop_vec_info, gimple *);