From patchwork Mon Jul 27 03:58:52 2020
X-Patchwork-Submitter: "Kewen.Lin"
X-Patchwork-Id: 1336661
Subject: [PATCH v4] vect/rs6000: Support vector with length cost modeling
To: GCC Patches, richard.sandiford@arm.com
References: <419f1fad-05be-115c-1a53-cb710ae7b2dc@linux.ibm.com>
 <1aeabdc7-0cf4-055b-a3ec-74c283053cf5@linux.ibm.com>
Date: Mon, 27 Jul 2020 11:58:52 +0800
From: "Kewen.Lin"
Cc: Bill Schmidt, Segher Boessenkool

Hi Richard,

Thanks for the review again!

on 2020/7/25 12:21 AM, Richard Sandiford wrote:
> "Kewen.Lin" writes:
>
> Thanks, the rearrangement of the existing code looks good.  Could you
> split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff
> out into separate patches?
>

Split out into https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550691.html.
Errr... that subject should have had the prefix "[PATCH] vect:".

[snip ...]

(Some comments in the snipped content are addressed in v4.)

>> +	 here.  */
>> +
>> +      /* For now we only operate length-based partial vectors on Power,
>> +	 which has constant VF all the time, we need some tweakings below
>> +	 if it doesn't hold in future.  */
>> +      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
>
> Where do you rely on this?  There didn't seem to be any obvious
> to_constant uses.  Since this is “only” a cost calculation, we should
> be using assumed_vf.

Sorry for the confusion.  The assert was intended for poly values such as
VF or nitems_per_ctrl, which are not compile-time constants, to draw
attention to the possible runtime cost of things like scaling up for
nitems_step.  But I just realized that computations such as multiplying by
another constant can operate on the coefficients, so it looks like there
is no runtime cost after all?  If so, I was overthinking it.  ;-)

>> -  prologue_cost_vec.release ();
>> -  epilogue_cost_vec.release ();
>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,
>> +			    NULL, NULL_TREE, 0, vect_prologue);
>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
>> +			    NULL, NULL_TREE, 0, vect_body);
>
> IMO this seems to be reproducing too much of the functions that you
> referred to.  And the danger with that is that they could easily
> get out of sync later.

Good point!  The original intention was to model as much as possible, to
avoid bad decisions caused by unmodeled pieces, for example when the loop
body is small and some computations become non-negligible.  The
out-of-sync risk also seems to apply to other code.  How about adding some
"note" comments in those functions?

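To make the coefficient point above concrete, here is a minimal standalone
sketch.  It is not GCC's poly_int; `PolySketch`, `scale` and `evaluate` are
hypothetical names used only for illustration.  It assumes a value of the
form c0 + c1 * x, where x is only known at run time, and shows that scaling
by a compile-time constant touches only the coefficients:

```cpp
#include <cstdint>

// Hypothetical stand-in for a degree-1 poly value c0 + c1 * x, where x is
// an indeterminate that is only known at run time.  Not GCC's poly_int API.
struct PolySketch
{
  std::int64_t c0;
  std::int64_t c1;
};

// Multiplying by a known constant scales each coefficient; nothing here
// needs the runtime value of x.
constexpr PolySketch
scale (PolySketch p, std::int64_t k)
{
  return PolySketch{p.c0 * k, p.c1 * k};
}

// Only evaluating the value for a concrete x is a runtime operation.
constexpr std::int64_t
evaluate (PolySketch p, std::int64_t x)
{
  return p.c0 + p.c1 * x;
}

// The scaled coefficients are available at compile time.
static_assert (scale (PolySketch{4, 4}, 2).c0 == 8, "c0 scaled at compile time");
static_assert (scale (PolySketch{4, 4}, 2).c1 == 8, "c1 scaled at compile time");
```

Under these assumptions a cost computation that only multiplies or adds
constants never has to ask whether the value is constant, which seems to
match the suggestion to drop the assert.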
The updated v4 is attached, addressing your comments as well as Segher's.

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (rs6000_adjust_vect_cost_per_loop): New function.
	(rs6000_finish_cost): Call rs6000_adjust_vect_cost_per_loop.
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost modeling
	for vector with length.
	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): Update
	function comment.
	* tree-vect-stmts.c (vect_gen_len): Likewise.

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 009afc5f894..86ef584e09b 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For some target specific vectorization cost which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if satisfied.  One typical example is to model shift cost for
+   vector with length by counting number of required lengths under condition
+   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+        if (rgc->type)
+          /* Each length needs one shift to fill into bits 0-7.  */
+          shift_cnt += num_vectors_m1 + 1;
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+                            NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      rs6000_adjust_vect_cost_per_loop (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.  The vectorization is at
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 490e7befea3..9d0e3fc525e 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -412,7 +412,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
 
    This means that we cannot guarantee that such an induction variable
    would ever hit a value that produces a set of all-false masks or zero
-   lengths for RGC.  */
+   lengths for RGC.
+
+   Note that please check cost modeling whether needed to be updated in
+   function vect_estimate_min_profitable_iters if any changes.  */
 
 static tree
 vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 06cde4b1da3..a00160a7f2d 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3798,6 +3798,70 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
                             vector_stmt, NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      /* Referring to the functions vect_set_loop_condition_partial_vectors
+         and vect_set_loop_controls_directly, we need to generate each
+         length in the prologue and in the loop body if required.  Although
+         there are some possible optimizations, we consider the worst case
+         here.  */
+
+      /* For wrap around checking.  */
+      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+      unsigned int compare_precision = TYPE_PRECISION (compare_type);
+      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+      bool need_iterate_p
+        = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+           && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+      /* Init min/max, shift and minus cost relative to single
+         scalar_stmt.  For now we only use length-based partial vectors on
+         Power, target specific cost tweaking may be needed for other
+         ports in future.  */
+      unsigned int min_max_cost = 2;
+      unsigned int shift_cost = 1, minus_cost = 1;
+
+      /* Init cost relative to single scalar_stmt.  */
+      unsigned int prologue_cnt = 0;
+      unsigned int body_cnt = 0;
+
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+        if (rgc->type)
+          {
+            unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+
+            /* May need one shift for nitems_total computation.  */
+            if (nitems != 1 && !niters_known_p)
+              prologue_cnt += shift_cost;
+
+            /* Need to handle wrap around.  */
+            if (iv_limit == -1
+                || (wi::min_precision (iv_limit * nitems, UNSIGNED)
+                    > compare_precision))
+              prologue_cnt += (min_max_cost + minus_cost);
+
+            /* Need to handle batch limit excepting for the 1st one.  */
+            prologue_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
+
+            unsigned int num_vectors = num_vectors_m1 + 1;
+            /* Need to set up lengths in prologue, only one MIN required
+               since start index is zero.  */
+            prologue_cnt += min_max_cost * num_vectors;
+
+            /* Need to update lengths in body for next iteration.  */
+            if (need_iterate_p)
+              body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+          }
+
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, prologue_cnt,
+                            scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
+                            NULL, NULL_TREE, 0, vect_body);
+    }
 
   /* FORNOW: The scalar outside cost is incremented in one of the
      following ways:
@@ -3932,8 +3996,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
          vector iterations (vniters) rather than the number of
@@ -3960,7 +4024,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
         dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
                      min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
         {
           /* Now that we know the minimum number of vector iterations,
              find the minimum niters for which the scalar cost is larger:
@@ -4015,6 +4079,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop to execute at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4032,7 +4100,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
          than - SOC.  */
@@ -4044,7 +4112,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
         min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
         {
           int threshold = (vec_inside_cost * min_vec_niters
                            + vec_outside_cost
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 31af46ae19c..8550a252f44 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -12090,7 +12090,10 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
    min_of_start_and_end = min (START_INDEX, END_INDEX);
    left_len = END_INDEX - min_of_start_and_end;
    rhs = min (left_len, LEN_LIMIT);
-   LEN = rhs;  */
+   LEN = rhs;
+
+   Note that please check cost modeling whether needed to be updated in
+   function vect_estimate_min_profitable_iters if any changes.  */
 
 gimple_seq
 vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
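
For anyone who wants to sanity-check the numbers outside the compiler, the
length-based counting in the vect_estimate_min_profitable_iters hunk can be
sketched as a standalone function.  This is only an illustration under
stated assumptions: `RgroupSketch`, `LengthCostSketch`, `bits_needed` and
`estimate_length_costs` are hypothetical names, `in_use` stands in for the
`rgc->type` check, a plain 64-bit integer replaces widest_int (so the
product is assumed not to overflow), and the unit costs are the ones the
patch hard-codes for Power.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified mirror of the rgroup_controls fields used here.
struct RgroupSketch
{
  bool in_use;                     // stands in for the rgc->type check
  unsigned max_nscalars_per_iter;
  unsigned factor;
};

struct LengthCostSketch
{
  unsigned prologue_cnt;           // in units of one scalar_stmt
  unsigned body_cnt;               // in units of one scalar_stmt
};

// Rough stand-in for wi::min_precision (..., UNSIGNED).
static unsigned
bits_needed (std::uint64_t v)
{
  unsigned bits = 0;
  for (; v != 0; v >>= 1)
    ++bits;
  return bits;
}

// iv_limit == -1 means "unknown", as in the patch.
LengthCostSketch
estimate_length_costs (const std::vector<RgroupSketch> &lens,
                       bool niters_known_p, bool need_iterate_p,
                       std::int64_t iv_limit, unsigned compare_precision)
{
  const unsigned min_max_cost = 2, shift_cost = 1, minus_cost = 1;
  LengthCostSketch cost = {0, 0};

  for (unsigned num_vectors_m1 = 0; num_vectors_m1 < lens.size ();
       ++num_vectors_m1)
    {
      const RgroupSketch &rgc = lens[num_vectors_m1];
      if (!rgc.in_use)
        continue;

      unsigned nitems = rgc.max_nscalars_per_iter * rgc.factor;

      // One shift for the nitems_total computation.
      if (nitems != 1 && !niters_known_p)
        cost.prologue_cnt += shift_cost;

      // Wrap-around handling.
      if (iv_limit == -1
          || bits_needed ((std::uint64_t) iv_limit * nitems)
             > compare_precision)
        cost.prologue_cnt += min_max_cost + minus_cost;

      // Batch limit for every vector except the first.
      cost.prologue_cnt += (min_max_cost + minus_cost) * num_vectors_m1;

      // Length setup in the prologue: one MIN per vector.
      unsigned num_vectors = num_vectors_m1 + 1;
      cost.prologue_cnt += min_max_cost * num_vectors;

      // Length update in the body for the next iteration.
      if (need_iterate_p)
        cost.body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
    }
  return cost;
}
```

With a single rgroup at index 0, nitems of 1, a small known iv_limit and an
iterating loop, this gives 2 for the prologue and 5 for the body.  With one
rgroup at index 1, nitems of 2, unknown niters and unknown iv_limit, it
gives 11 and 10.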