From patchwork Thu Aug 3 20:47:25 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Hubicka X-Patchwork-Id: 1816666 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=8.43.85.97; helo=server2.sourceware.org; envelope-from=gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.a=rsa-sha256 header.s=default header.b=IhVweqSi; dkim-atps=neutral Received: from server2.sourceware.org (ip-8-43-85-97.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4RH1BY1NQTz1ydj for ; Fri, 4 Aug 2023 06:47:49 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 15452385802F for ; Thu, 3 Aug 2023 20:47:47 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 15452385802F DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1691095667; bh=+b5S0afg5XyhJ4xPyXHjC5MLQbuDOJ+SPlq19nKRJ48=; h=Date:To:Subject:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=IhVweqSivb16nIo5HRJfOwCUowI3MwrEACqKO1PgnM5kkIsgiRyCV1teeVY2kNI3+ 8tSauDfeIBf8KfELWLfr4cKG6sck8KUFLpoSskVzSW4YOU3OLCRRyq3yuhVFOGeA9e rPYE9k0vHJonqy/DLudcsDBcE3FaberSlJmIU44A= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from nikam.ms.mff.cuni.cz (nikam.ms.mff.cuni.cz [195.113.20.16]) by sourceware.org (Postfix) with ESMTPS id DD8A03857C40 for ; Thu, 3 Aug 2023 20:47:26 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org DD8A03857C40 Received: by nikam.ms.mff.cuni.cz (Postfix, from userid 16202) id 747302828D3; Thu, 3 Aug 2023 22:47:25 +0200 (CEST) Date: Thu, 3 Aug 2023 22:47:25 +0200 To: gcc-patches@gcc.gnu.org Subject: Update estimated iteraitons counts after splitting Message-ID: MIME-Version: 1.0 Content-Disposition: inline X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, GIT_PATCH_0, HEADER_FROM_DIFFERENT_DOMAINS, JMQ_SPF_NEUTRAL, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Jan Hubicka via Gcc-patches From: Jan Hubicka Reply-To: Jan Hubicka Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org Sender: "Gcc-patches" Hi, Hmmer's internal function has 4 loops. The following is the profile at start: loop 1: estimate 472 iterations by profile: 473.497707 (reliable) count in:84821 (precise, freq 0.9979) loop 2: estimate 99 iterations by profile: 100.000000 (reliable) count in:39848881 (precise, freq 468.8104) loop 3: estimate 99 iterations by profile: 100.000000 (reliable) count in:39848881 (precise, freq 468.8104) loop 4: estimate 100 iterations by profile: 100.999596 (reliable) execution count:84167 (precise, freq 0.9902) So the first loops is outer loop and second/third loops are nesed. Fourth loop is not critical. Precise iteraiton counts are unknown (473 and 100 comes from profile) Nested loop has following form: for (k = 1; k <= M; k++) { mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } We do quite some belly dancing here. 1) loop-ch slightly misupdates profile, so the estimates of 99 does not match profile setimate of 100. 2) loops-split splits on if (k < M) and produces two loops. It fails to notice that the second loop never iterates. It used to misupdate profile a lot which later caused internal loop to become cold. This is fixed now. 3) loop-dist introduces runtime aliasing checks for both loops 4) tree vectorizer vectorizes some of the copies of the loop produces and drops expected iteration counts 5) loop peeling peels the loops with expected low iteration counts 6) complete loop unrolling kills some loops in prologues/epilogues. We end up with quite many loops and run out of registers: iterations by profile: 5.312499 (unreliable, maybe flat) this is vectorized internal loops after loop peeling iterations by profile: 0.009495 (unreliable, maybe flat) iterations by profile: 0.009495 (unreliable, maybe flat) iterations by profile: 0.009495 (unreliable, maybe flat) iterations by profile: 0.009495 (unreliable, maybe flat) Those are all versioned/peeled and vectorized variants of the loop never looping iterations by profile: 100.000008 (unreliable) iterations by profile: 100.000000 (unreliable) Those are variants with failed aliasing checks iterations by profile: 9.662853 (unreliable, maybe flat) iterations by profile: 4.646072 (unreliable) iterations by profile: 100.000007 (unreliable) iterations by profile: 5.312500 (unreliable) iterations by profile: 473.497707 (reliable) This is loop 1 iterations by profile: 100.999596 (reliable) This is the loop 4. This patch fixes loop iteration estimate update after loop split so we get: iterations by profile: 5.312499 (unreliable, maybe flat) entry count:12742188 (guessed, freq 149.9081) This is remainder of the peeled vectorized loop 2. It misses estimate that is correct since after peeling it 6 times it is essentially impossible to tell what the remaining loop profile is (without histograms) iterations by profile: 0.009496 (unreliable, maybe flat) entry count:374801 (guessed, freq 4.4094) Peeled split part of loop 2 (one that never loops). We ought to work this out but at least w estimate 99 iterations by profile: 100.000008 (unreliable) entry count:3945039 (guessed, freq 46.4122) estimate 99 iterations by profile: 100.000000 (unreliable) entry count:35505353 (guessed, freq 417.7100) estimate 99 iterations by profile: 9.662853 (unreliable, maybe flat) entry count:35505353 (guessed, freq 417.7100) Profile here mismatches estimate - I will need to work out why. estimate 5 iterations by profile: 4.646072 (unreliable) entry count:31954818 (guessed, freq 375.9390) This is vectorized but not peeled loop 3 estimate 99 iterations by profile: 100.000007 (unreliable) entry count:7101070 (guessed, freq 83.5420) Unvectorized variant of loop 3 estimate 5 iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, freq 300.7512) Another vectorized variant of loop 3 estimate 472 iterations by profile: 473.497707 (reliable) entry count:84821 (precise, freq 0.9979) Outer loop estimate 100 iterations by profile: 100.999596 (reliable) entry count:84167 (precise, freq 0.9902) loop 4, not vectorized/peeled So there is still work to do on this testcase, but with the patch we prevent 3 useless loops. Bootstrapped/regtested x86_64-linux, committed. gcc/ChangeLog: * tree-ssa-loop-split.cc (split_loop): Update estimated iteration counts. diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc index 923e49e716d..2f7918c6e65 100644 --- a/gcc/tree-ssa-loop-split.cc +++ b/gcc/tree-ssa-loop-split.cc @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3. If not see #include "gimplify-me.h" #include "print-tree.h" #include "value-query.h" +#include "sreal.h" /* This file implements two kinds of loop splitting. @@ -691,6 +692,37 @@ split_loop (class loop *loop1) = loop1_prob.invert (); fix_loop_bb_probability (loop1, loop2, true_edge, false_edge); + /* If conditional we split on has reliable profilea nd both + preconditionals of loop1 and loop2 are constant true, we can + only redistribute the iteration counts to the split loops. + + If the conditionals we insert before loop1 or loop2 are non-trivial + they increase expected loop count, so account this accordingly. + If we do not know the probability of split conditional, avoid + reudcing loop estimates, since we do not really know how they are + split between of the two new loops. Keep orignal estimate since + it is likely better then completely dropping it. + + TODO: If we know that onle of the new loops has constant + number of iterations, we can do better. We could also update + upper bounds. */ + if (loop1->any_estimate + && wi::fits_shwi_p (loop1->nb_iterations_estimate)) + { + sreal scale = true_edge->probability.reliable_p () + ? true_edge->probability.to_sreal () : (sreal)1; + sreal scale2 = false_edge->probability.reliable_p () + ? false_edge->probability.to_sreal () : (sreal)1; + /* +1 to get header interations rather than latch iterations and then + -1 to convert back. */ + loop1->nb_iterations_estimate + = MAX ((((sreal)loop1->nb_iterations_estimate.to_shwi () + 1) * scale + / loop1_prob.to_sreal ()).to_nearest_int () - 1, 0); + loop2->nb_iterations_estimate + = MAX ((((sreal)loop2->nb_iterations_estimate.to_shwi () + 1) * scale2 + / profile_probability::very_likely ().to_sreal ()) + .to_nearest_int () - 1, 0); + } update_loop_exit_probability_scale_dom_bbs (loop1); update_loop_exit_probability_scale_dom_bbs (loop2); @@ -711,8 +743,6 @@ split_loop (class loop *loop1) tree guard_next = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop1)); patch_loop_exit (loop1, guard_stmt, guard_next, newend, initial_true); - /* TODO: Update any_esitmate and upper bounds. */ - /* Finally patch out the two copies of the condition to be always true/false (or opposite). */ gcond *force_true = as_a (*gsi_last_bb (bbs[i]));