diff mbox series

Update estimated iteraitons counts after splitting

Message ID ZMwSXdJyms5onlrv@kam.mff.cuni.cz
State New
Headers show
Series Update estimated iteraitons counts after splitting | expand

Commit Message

Jan Hubicka Aug. 3, 2023, 8:47 p.m. UTC
Hi,
Hmmer's internal function has 4 loops.  The following is the profile at start:

  loop 1:
  estimate 472
  iterations by profile: 473.497707 (reliable) count in:84821 (precise, freq 0.9979)

    loop 2:
    estimate 99
    iterations by profile: 100.000000 (reliable) count in:39848881 (precise, freq 468.8104)

    loop 3:
    estimate 99
    iterations by profile: 100.000000 (reliable) count in:39848881 (precise, freq 468.8104)

  loop 4:
  estimate 100
  iterations by profile: 100.999596 (reliable) execution count:84167 (precise, freq 0.9902)

So the first loops is outer loop and second/third loops are nesed. Fourth loop is not critical.
Precise iteraiton counts are unknown (473 and 100 comes from profile)
Nested loop has following form:

    for (k = 1; k <= M; k++) {
      mc[k] = mpp[k-1]   + tpmm[k-1];
      if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc;
      mc[k] += ms[k];
      if (mc[k] < -INFTY) mc[k] = -INFTY;

      dc[k] = dc[k-1] + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
      if (dc[k] < -INFTY) dc[k] = -INFTY;

      if (k < M) {
        ic[k] = mpp[k] + tpmi[k];
        if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
        ic[k] += is[k];
        if (ic[k] < -INFTY) ic[k] = -INFTY;
      }

We do quite some belly dancing here.
 1) loop-ch slightly misupdates profile, so the estimates of 99
    does not match profile setimate of 100.
 2) loops-split splits on if (k < M) and produces two loops.
    It fails to notice that the second loop never iterates.
    It used to misupdate profile a lot which later caused internal
    loop to become cold.  This is fixed now.
 3) loop-dist introduces runtime aliasing checks for both loops
 4) tree vectorizer vectorizes some of the copies of the loop produces
    and drops expected iteration counts
 5) loop peeling peels the loops with expected low iteration counts
 6) complete loop unrolling kills some loops in prologues/epilogues.

We end up with quite many loops and run out of registers:

  iterations by profile: 5.312499 (unreliable, maybe flat)
    this is vectorized internal loops after loop peeling

  iterations by profile: 0.009495 (unreliable, maybe flat)
  iterations by profile: 0.009495 (unreliable, maybe flat)
  iterations by profile: 0.009495 (unreliable, maybe flat)
  iterations by profile: 0.009495 (unreliable, maybe flat)
    Those are all versioned/peeled and vectorized variants of the loop never looping

  iterations by profile: 100.000008 (unreliable)
  iterations by profile: 100.000000 (unreliable)
    Those are variants with failed aliasing checks

  iterations by profile: 9.662853 (unreliable, maybe flat)
  iterations by profile: 4.646072 (unreliable)
  iterations by profile: 100.000007 (unreliable)
  iterations by profile: 5.312500 (unreliable)
  iterations by profile: 473.497707 (reliable)
    This is loop 1

  iterations by profile: 100.999596 (reliable)
    This is the loop 4.

This patch fixes loop iteration estimate update after loop split so we get:

  iterations by profile: 5.312499 (unreliable, maybe flat) entry count:12742188 (guessed, freq 149.9081)
    This is remainder of the peeled vectorized loop 2.  It misses estimate that is correct since after peeling it 6 times it is essentially
    impossible to tell what the remaining loop profile is (without histograms)

  iterations by profile: 0.009496 (unreliable, maybe flat) entry count:374801 (guessed, freq 4.4094)
    Peeled split part of loop 2 (one that never loops).  We ought to work this out
    but at least w

  estimate 99
  iterations by profile: 100.000008 (unreliable) entry count:3945039 (guessed, freq 46.4122)
  estimate 99
  iterations by profile: 100.000000 (unreliable) entry count:35505353 (guessed, freq 417.7100)

  estimate 99
  iterations by profile: 9.662853 (unreliable, maybe flat) entry count:35505353 (guessed, freq 417.7100)
    Profile here mismatches estimate - I will need to work out why.

  estimate 5
  iterations by profile: 4.646072 (unreliable) entry count:31954818 (guessed, freq 375.9390)
    This is vectorized but not peeled loop 3
  estimate 99
  iterations by profile: 100.000007 (unreliable) entry count:7101070 (guessed, freq 83.5420)
    Unvectorized variant of loop 3
  estimate 5
  iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, freq 300.7512)
    Another vectorized variant of loop 3
  estimate 472
  iterations by profile: 473.497707 (reliable) entry count:84821 (precise, freq 0.9979)
    Outer loop

  estimate 100
  iterations by profile: 100.999596 (reliable) entry count:84167 (precise, freq 0.9902)
    loop 4, not vectorized/peeled

So there is still work to do on this testcase, but with the patch we prevent 3 useless loops.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

	* tree-ssa-loop-split.cc (split_loop): Update estimated iteration counts.
diff mbox series

Patch

diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc
index 923e49e716d..2f7918c6e65 100644
--- a/gcc/tree-ssa-loop-split.cc
+++ b/gcc/tree-ssa-loop-split.cc
@@ -43,6 +43,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "gimplify-me.h"
 #include "print-tree.h"
 #include "value-query.h"
+#include "sreal.h"
 
 /* This file implements two kinds of loop splitting.
 
@@ -691,6 +692,37 @@  split_loop (class loop *loop1)
 			= loop1_prob.invert ();
 
 	fix_loop_bb_probability (loop1, loop2, true_edge, false_edge);
+	/* If conditional we split on has reliable profilea nd both
+	   preconditionals of loop1 and loop2 are constant true, we can
+	   only redistribute the iteration counts to the split loops.
+
+	   If the conditionals we insert before loop1 or loop2 are non-trivial
+	   they increase expected loop count, so account this accordingly.
+	   If we do not know the probability of split conditional, avoid
+	   reudcing loop estimates, since we do not really know how they are
+	   split between of the two new loops.  Keep orignal estimate since
+	   it is likely better then completely dropping it.
+
+	   TODO: If we know that onle of the new loops has constant
+	   number of iterations, we can do better.  We could also update
+	   upper bounds.  */
+	if (loop1->any_estimate
+	    && wi::fits_shwi_p (loop1->nb_iterations_estimate))
+	  {
+	    sreal scale = true_edge->probability.reliable_p ()
+			  ? true_edge->probability.to_sreal () : (sreal)1;
+	    sreal scale2 = false_edge->probability.reliable_p ()
+			  ? false_edge->probability.to_sreal () : (sreal)1;
+	    /* +1 to get header interations rather than latch iterations and then
+	       -1 to convert back.  */
+	    loop1->nb_iterations_estimate
+	      = MAX ((((sreal)loop1->nb_iterations_estimate.to_shwi () + 1) * scale
+		     / loop1_prob.to_sreal ()).to_nearest_int () - 1, 0);
+	    loop2->nb_iterations_estimate
+	      = MAX ((((sreal)loop2->nb_iterations_estimate.to_shwi () + 1) * scale2
+		     / profile_probability::very_likely ().to_sreal ())
+		     .to_nearest_int () - 1, 0);
+	  }
 	update_loop_exit_probability_scale_dom_bbs (loop1);
 	update_loop_exit_probability_scale_dom_bbs (loop2);
 
@@ -711,8 +743,6 @@  split_loop (class loop *loop1)
 	tree guard_next = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop1));
 	patch_loop_exit (loop1, guard_stmt, guard_next, newend, initial_true);
 
-	/* TODO: Update any_esitmate and upper bounds.  */
-
 	/* Finally patch out the two copies of the condition to be always
 	   true/false (or opposite).  */
 	gcond *force_true = as_a<gcond *> (*gsi_last_bb (bbs[i]));