From patchwork Sun Sep 30 11:16:10 2012 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Hubicka X-Patchwork-Id: 188163 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) by ozlabs.org (Postfix) with SMTP id 66F332C00DA for ; Sun, 30 Sep 2012 21:16:25 +1000 (EST) Comment: DKIM? See http://www.dkim.org DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=gcc.gnu.org; s=default; x=1349608586; h=Comment: DomainKey-Signature:Received:Received:Received:Received:Date: From:To:Subject:Message-ID:MIME-Version:Content-Type: Content-Disposition:User-Agent:Mailing-List:Precedence:List-Id: List-Unsubscribe:List-Archive:List-Post:List-Help:Sender: Delivered-To; bh=qLTTdJQTAkC81WXwcS3Gt1tYBBc=; b=EJz8+oaV/CLPaRh zOEG2O0F9CPX+CZeGwoqueW8uACSZSfsfk+kFm0ONtZoCI3yX9nHP+NRMgINRZKK e0HvCbSo2X9xEwC0n0CkVXqu/xhjKkuRpYsNYx2YpJ345y/e4ErbOmV7A8thBPzN TItv67zOmBDDOLcFxYEsL4WE4srU= Comment: DomainKeys? See http://antispam.yahoo.com/domainkeys DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=default; d=gcc.gnu.org; h=Received:Received:X-SWARE-Spam-Status:X-Spam-Check-By:Received:Received:Date:From:To:Subject:Message-ID:MIME-Version:Content-Type:Content-Disposition:User-Agent:Mailing-List:Precedence:List-Id:List-Unsubscribe:List-Archive:List-Post:List-Help:Sender:Delivered-To; b=Pi9LJRCRCn4u8lfvHUQTBLrXopjLVWZEdjkv860GyS5tSH9AEtX9uLEp4Lr1EE FDWCMmR9huqckfE+90ywX7MGI/pkphk0o0Cv3dJdUFcMq3COjosjZNa+6y2yNQVi k1DOIoOSQvgqrFZfmAXIirluvcyxs096lHOGV48Ohba6A=; Received: (qmail 8923 invoked by alias); 30 Sep 2012 11:16:21 -0000 Received: (qmail 8912 invoked by uid 22791); 30 Sep 2012 11:16:19 -0000 X-SWARE-Spam-Status: No, hits=-4.8 required=5.0 tests=AWL, BAYES_00, KHOP_RCVD_UNTRUST, RCVD_IN_DNSWL_LOW, RCVD_IN_HOSTKARMA_W, RCVD_IN_HOSTKARMA_WL, RP_MATCHES_RCVD, TW_TM X-Spam-Check-By: sourceware.org Received: from nikam.ms.mff.cuni.cz (HELO nikam.ms.mff.cuni.cz) (195.113.20.16) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Sun, 30 Sep 2012 11:16:13 +0000 Received: by nikam.ms.mff.cuni.cz (Postfix, from userid 16202) id 9FFC1543048; Sun, 30 Sep 2012 13:16:10 +0200 (CEST) Date: Sun, 30 Sep 2012 13:16:10 +0200 From: Jan Hubicka To: gcc-patches@gcc.gnu.org, rguenther@suse.de Subject: [RFC] Make vectorizer to skip loops with small iteration estimate Message-ID: <20120930111610.GA7097@kam.mff.cuni.cz> MIME-Version: 1.0 Content-Disposition: inline User-Agent: Mutt/1.5.20 (2009-06-14) Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Hi, the point of the following patch is to make vectorizer to not vectorize the following testcase with profile feedback: int a[10000]; int i=500000000; int k=2; int val; __attribute__ ((noinline,noclone)) test() { int j; for(j=0;j VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC (A) where SIC = scalar iteration cost, VIC = vector iteration cost, VOC = vector outside cost, VF = vectorization factor, PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations SOC = scalar outside cost for run time cost model check. */ This value is used for both 1) decision if number of iterations is too low (max iterations is known) 2) decision on runtime whether we want to take the vectorized path or the scalar path. The vectoried loop looks like: k.1_10 = k; if (k.1_10 > 0) { pretmp_2 = val; niters.8_4 = (unsigned int) k.1_10; bnd.9_13 = niters.8_4 >> 2; ratio_mult_vf.10_1 = bnd.9_13 << 2; _18 = niters.8_4 <= 3; _19 = ratio_mult_vf.10_1 == 0; _20 = _19 | _18; if (_20 != 0) scalar loop else vector prologue } So the unvectorized cost is SIC * niters The vectorized path is SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC The scalar path of vectorizer loop is SIC * niters + SOC It makes sense to vectorize if SIC * niters > SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC (B) That is in the optimal cse where we actually vectorize the overall speed of vectorized loop including the runtime check is better. It makes sense to take the vector loop if SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC (C) Because the scalar loop is taken. The attached patch implements the formula (C) and uses it to deterine the decision based on number of iterations estimate (that is usually provided by the feedback) As a reality check, I tried my testcase. 9: Cost model analysis: Vector inside of loop cost: 1 Vector prologue cost: 7 Vector epilogue cost: 2 Scalar iteration cost: 1 Scalar outside cost: 6 Vector outside cost: 9 prologue iterations: 0 epilogue iterations: 2 Calculated minimum iters for profitability: 4 9: Profitability threshold = 3 9: Profitability estimated iterations threshold = 20 This is overrated. The loop starts to be benefical at about 4 iterations in reality. I guess the values are kind of wrong. Vector inside of loop cost and Scalar iteration cost seems to ignore the fact that the loops do contain some control flow that should account at least one extra cycle. Vector prologue cost seems bit overrated for one pack operation. Of course this is very simple benchmark, in reality the vectorizatoin can be a lot more harmful by complicating more complex control flows. So I guess we have two options 1) go with the new formula and try to make cost model a bit more realistic. 2) stay with original formula that is quite close to reality, but I think more by an accident. 2) Even when loop iterates 2 times, it is estimated to 4 iterations by estimated_stmt_executions_int with the profile feedback. The reason is loop_ch pass. Given a rolled loop with exit probability 30%, proceeds by duplicating the header with original probabilities. This makes the loop to be executed with 60% probability. Because the loop body counts remain the same (and they should), the expected number of iterations increase by the decrease of entry edge to the header. I wonder what to do about this. Obviously without path profiling loop_ch can not really do a good job. We can artifically make header to suceed more likely, that is the reality, but that requires non-trivial loop profile updating. We can also simply record the iteration bound into loop structure and ignore that the profile is not realistic Finally we can duplicate loop headers before profilng. I implemented that via early_ch pass executed only with profile generation or feedback. I guess it makes sense to do, even if it breaks the assumption that we should do strictly -Os generation on paths where Index: tree-vect-loop.c =================================================================== --- tree-vect-loop.c (revision 191852) +++ tree-vect-loop.c (working copy) @@ -1243,6 +1243,8 @@ vect_analyze_loop_operations (loop_vec_i unsigned int th; bool only_slp_in_loop = true, ok; HOST_WIDE_INT max_niter; + HOST_WIDE_INT estimated_niter; + int min_profitable_estimate; if (vect_print_dump_info (REPORT_DETAILS)) fprintf (vect_dump, "=== vect_analyze_loop_operations ==="); @@ -1436,7 +1438,8 @@ vect_analyze_loop_operations (loop_vec_i vector stmts depends on VF. */ vect_update_slp_costs_according_to_vf (loop_vinfo); - min_profitable_iters = vect_estimate_min_profitable_iters (loop_vinfo); + vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters, + &min_profitable_estimate); LOOP_VINFO_COST_MODEL_MIN_ITERS (loop_vinfo) = min_profitable_iters; if (min_profitable_iters < 0) @@ -1452,6 +1455,7 @@ vect_analyze_loop_operations (loop_vec_i min_scalar_loop_bound = ((PARAM_VALUE (PARAM_MIN_VECT_LOOP_BOUND) * vectorization_factor) - 1); + /* Use the cost model only if it is more conservative than user specified threshold. */ @@ -1474,6 +1478,18 @@ vect_analyze_loop_operations (loop_vec_i return false; } + if ((estimated_niter = estimated_stmt_executions_int (loop)) != -1 + && (unsigned HOST_WIDE_INT) estimated_niter <= MAX (th, min_profitable_estimate)) + { + if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS)) + fprintf (vect_dump, "not vectorized: estimated iteration count too small."); + if (vect_print_dump_info (REPORT_DETAILS)) + fprintf (vect_dump, "not vectorized: estimated iteration count smaller than " + "user specified loop bound parameter or minimum " + "profitable iterations (whichever is more conservative)."); + return false; + } + if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) || LOOP_VINFO_INT_NITERS (loop_vinfo) % vectorization_factor != 0 || LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo)) @@ -2512,15 +2528,15 @@ vect_get_known_peeling_cost (loop_vec_in Return the number of iterations required for the vector version of the loop to be profitable relative to the cost of the scalar version of the - loop. + loop. */ - TODO: Take profile info into account before making vectorization - decisions, if available. */ - -int -vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo) +void +vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, + int *ret_min_profitable_niters, + int *ret_min_profitable_estimate) { int min_profitable_iters; + int min_profitable_estimate; int peel_iters_prologue; int peel_iters_epilogue; unsigned vec_inside_cost = 0; @@ -2538,7 +2554,9 @@ vect_estimate_min_profitable_iters (loop { if (vect_print_dump_info (REPORT_COST)) fprintf (vect_dump, "cost model disabled."); - return 0; + *ret_min_profitable_niters = 0; + *ret_min_profitable_estimate = 0; + return; } /* Requires loop versioning tests to handle misalignment. */ @@ -2774,7 +2792,9 @@ vect_estimate_min_profitable_iters (loop "divided by the scalar iteration cost = %d " "is greater or equal to the vectorization factor = %d.", vec_inside_cost, scalar_single_iter_cost, vf); - return -1; + *ret_min_profitable_niters = -1; + *ret_min_profitable_estimate = -1; + return; } if (vect_print_dump_info (REPORT_COST)) @@ -2789,6 +2809,7 @@ vect_estimate_min_profitable_iters (loop fprintf (vect_dump, " Scalar iteration cost: %d\n", scalar_single_iter_cost); fprintf (vect_dump, " Scalar outside cost: %d\n", scalar_outside_cost); + fprintf (vect_dump, " Vector outside cost: %d\n", vec_outside_cost); fprintf (vect_dump, " prologue iterations: %d\n", peel_iters_prologue); fprintf (vect_dump, " epilogue iterations: %d\n", @@ -2809,7 +2830,34 @@ vect_estimate_min_profitable_iters (loop fprintf (vect_dump, " Profitability threshold = %d\n", min_profitable_iters); - return min_profitable_iters; + *ret_min_profitable_niters = min_profitable_iters; + /* Calculate number of iterations required to make the vector version + profitable, relative to the loop bodies only. + + Non-vectorized variant is SIC * niters and it must win over vector + variant on the expected loop trip count. The following condition must hold true: + SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC */ + + if ((scalar_single_iter_cost * vf) > (int) vec_inside_cost) + { + if (vec_outside_cost <= 0) + min_profitable_estimate = 1; + else + { + min_profitable_estimate = ((vec_outside_cost + scalar_outside_cost) * vf + - vec_inside_cost * peel_iters_prologue + - vec_inside_cost * peel_iters_epilogue) + / ((scalar_single_iter_cost * vf) + - vec_inside_cost); + } + } + min_profitable_estimate --; + min_profitable_estimate = MAX (min_profitable_estimate, min_profitable_iters); + if (vect_print_dump_info (REPORT_COST)) + fprintf (vect_dump, " Profitability estimated iterations threshold = %d\n", + min_profitable_estimate); + + *ret_min_profitable_estimate = min_profitable_estimate; }