diff mbox

Undermine the jump threading cost model to fix PR77445.

Message ID 1481801639-14286-1-git-send-email-james.greenhalgh@arm.com
State New
Headers show

Commit Message

James Greenhalgh Dec. 15, 2016, 11:33 a.m. UTC
Hi,

As mentioned in PR77445, the improvements to the jump threading cost model
this year have caused substantial regressions in the amount of jump threading
we do and the performance of workloads which rely on that threading.

This patch represents the low-bar in fixing the performance issues reported
in PR77445 - by weakening the cost model enough that we thread in a way much
closer to GCC 6. I don't think this patch is likely to be acceptable for
trunk, but I'm posting it for consideration regardless. 

Under the new cost model, if the edge doesn't pass optimize_edge_for_speed_p,
then we don't thread. The problem in late threading is bad edge profile
data makes the edge look cold, and thus it fails optimize_edge_for_speed_p
and is no longer considered a candidate for threading. As an aside, I think
this is the wrong cost model for jump threading, where you get the most
impact if you can resolve unpredictable switch statements - which by their
nature may have multiple cold edges in need of threading.

Early threading should avoid these issues, as there is no edge profile
info yet. optimize_edge_for_speed_p is therefore more likely to hold, but
the condition for threading is:

  if (speed_p && optimize_edge_for_speed_p (taken_edge))
    {
      if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
	{
          [...reject threading...]
        }
    }
  else if (n_insns > 1)
    {
      [...reject threading...]
    }

With speed_p is hardwired to false for the early threader
( pass_early_thread_jumps::execute ):

	find_jump_threads_backwards (bb, false);

So we always fall to the n_insns > 1 case and thus only rarely get to
thread.

In this patch I change that call in pass_early_thread_jumps::execute to
instead look at optimize_bb_for_speed_p (bb) . That allows the speed_p
check to pass in the main threading cost model, and then the
optimize_edge_for_speed_p can also pass. That gets the first stage of
jump-threading back working in a proprietary benchmark which is sensitive
to this optimisation.

To get the rest of the required jump threading, I also have to weaken the
cost model - and this is obviously a hack! The easy hack is to special case
when the taken edge has frequency zero, and permit jump threading there.

I know this patch is likely not the preferred way to fix this. For me that
would be a change to the cost model, which as I mentioned above I think
misses the point about which edges we want to thread. By far the best
fix would be to the junk edge profiling data we create during threading.

However, this patch does fix the performance issues identified in PR77445,
and does highlight a fundamental issue with the early threader (which
doesn't seem to me like it will be effective while it sets speed_p to
false), so I'd like it to be considered for trunk if no better fix appears
before stage 4.

Bootstrapped on x86_64 with no issues. The testsuite changes just reshuffle
which passes spot the threading opportunities.

OK?

Thanks,
James

---
gcc/

2016-12-15  James Greenhalgh  <james.greenhalgh@arm.com>

	PR tree-optimization/77445
	* tree-ssa-threadbackward.c (profitable_jump_thread_path) Work
	around sometimes corrupt edge frequency data.
	(pass_early_thread_jumps::execute): Pass
	optimize_bb_for_speed_p as the speed_p parameter to
	find_jump_threads_backwards to enable threading in more cases.

gcc/testsuite/

2016-12-15  James Greenhalgh  <james.greenhalgh@arm.com>

	PR tree-optimization/77445
	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust options and dump passes.
	* gcc.dg/tree-ssa/pr66752-3.c: Likewise.

Comments

Richard Biener Dec. 16, 2016, 12:25 p.m. UTC | #1
On Thu, Dec 15, 2016 at 12:33 PM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
>
> Hi,
>
> As mentioned in PR77445, the improvements to the jump threading cost model
> this year have caused substantial regressions in the amount of jump threading
> we do and the performance of workloads which rely on that threading.
>
> This patch represents the low-bar in fixing the performance issues reported
> in PR77445 - by weakening the cost model enough that we thread in a way much
> closer to GCC 6. I don't think this patch is likely to be acceptable for
> trunk, but I'm posting it for consideration regardless.
>
> Under the new cost model, if the edge doesn't pass optimize_edge_for_speed_p,
> then we don't thread. The problem in late threading is bad edge profile
> data makes the edge look cold, and thus it fails optimize_edge_for_speed_p
> and is no longer considered a candidate for threading. As an aside, I think
> this is the wrong cost model for jump threading, where you get the most
> impact if you can resolve unpredictable switch statements - which by their
> nature may have multiple cold edges in need of threading.
>
> Early threading should avoid these issues, as there is no edge profile
> info yet. optimize_edge_for_speed_p is therefore more likely to hold, but
> the condition for threading is:
>
>   if (speed_p && optimize_edge_for_speed_p (taken_edge))
>     {
>       if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
>         {
>           [...reject threading...]
>         }
>     }
>   else if (n_insns > 1)
>     {
>       [...reject threading...]
>     }
>
> With speed_p is hardwired to false for the early threader
> ( pass_early_thread_jumps::execute ):
>
>         find_jump_threads_backwards (bb, false);
>
> So we always fall to the n_insns > 1 case and thus only rarely get to
> thread.
>
> In this patch I change that call in pass_early_thread_jumps::execute to
> instead look at optimize_bb_for_speed_p (bb) . That allows the speed_p
> check to pass in the main threading cost model, and then the
> optimize_edge_for_speed_p can also pass. That gets the first stage of
> jump-threading back working in a proprietary benchmark which is sensitive
> to this optimisation.

But all early optimizations are supposed to never increase code size -- they
exist to get more realistic sizes to the IPA inliner heuristic code.

Iff at all then add some --param early-jump-threading-insns and use that
in place of the '1' and then eventually bump that to 2 or 3.  But using
the 100 from the late threading is too much.

Note that using optimize_bb_for_speed_p at this point in the pipeline
is nonsense and you can as well use ! optimize_size ...  because even
a guessed profile is only created at the end of the early optimization
pipeline.

> To get the rest of the required jump threading, I also have to weaken the
> cost model - and this is obviously a hack! The easy hack is to special case
> when the taken edge has frequency zero, and permit jump threading there.
>
> I know this patch is likely not the preferred way to fix this. For me that
> would be a change to the cost model, which as I mentioned above I think
> misses the point about which edges we want to thread. By far the best
> fix would be to the junk edge profiling data we create during threading.
>
> However, this patch does fix the performance issues identified in PR77445,
> and does highlight a fundamental issue with the early threader (which
> doesn't seem to me like it will be effective while it sets speed_p to
> false), so I'd like it to be considered for trunk if no better fix appears
> before stage 4.

But the PR has nothing to do with early threading but with late threading
fed a bogus profile.

> Bootstrapped on x86_64 with no issues. The testsuite changes just reshuffle
> which passes spot the threading opportunities.
>
> OK?

I don't think so.

Richard.

> Thanks,
> James
>
> ---
> gcc/
>
> 2016-12-15  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         PR tree-optimization/77445
>         * tree-ssa-threadbackward.c (profitable_jump_thread_path) Work
>         around sometimes corrupt edge frequency data.
>         (pass_early_thread_jumps::execute): Pass
>         optimize_bb_for_speed_p as the speed_p parameter to
>         find_jump_threads_backwards to enable threading in more cases.
>
> gcc/testsuite/
>
> 2016-12-15  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         PR tree-optimization/77445
>         * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust options and dump passes.
>         * gcc.dg/tree-ssa/pr66752-3.c: Likewise.
>
diff mbox

Patch

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
index 896c8bf..39ec3d6 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-ethread-details -fdump-tree-dce2" } */
 
 extern int status, pt;
 extern int count;
@@ -34,7 +34,7 @@  foo (int N, int c, int b, int *a)
 
 /* There are 4 FSM jump threading opportunities, all of which will be
    realized, which will eliminate testing of FLAG, completely.  */
-/* { dg-final { scan-tree-dump-times "Registering FSM" 4 "thread1"} } */
+/* { dg-final { scan-tree-dump-times "Registering FSM" 4 "ethread"} } */
 
 /* There should be no assignments or references to FLAG, verify they're
    eliminated as early as possible.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index 9a9d1cb..5b087fb 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -1,8 +1,9 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 16"  "thread1" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread2" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 3" "thread3" } } */
+/* { dg-options "-O2 -fdump-tree-ethread-stats -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 16"  "ethread" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 6"  "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 4" "thread2" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 2" "thread3" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom2" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom3" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "vrp2" } } */
diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
index 203e20e..0d29ab5 100644
--- a/gcc/tree-ssa-threadbackward.c
+++ b/gcc/tree-ssa-threadbackward.c
@@ -311,7 +311,20 @@  profitable_jump_thread_path (vec<basic_block, va_gc> *&path,
       return NULL;
     }
 
-  if (speed_p && optimize_edge_for_speed_p (taken_edge))
+
+  /* FIXME: Edge frequency can get badly out shape as a result of
+     the jump threading passes.  In those cases,
+     EDGE_FREQUENCY (taken_edge) == 0 , and so trivially fails the
+     test for optimize_edge_for_speed_p.  The correct fix would
+     be to ensure that profiling information coming out of jump threading
+     is meaningful, but in lieu of that add a hack check to this cost model
+     which permits jump threading in the case EDGE_FREQUENCY has been
+     corrupted.  Only do this if the profile info is present and corrupt,
+     not if it is absent.  */
+  if (speed_p
+      && (optimize_edge_for_speed_p (taken_edge)
+	  || (profile_status_for_fn (cfun) != PROFILE_ABSENT
+	      && EDGE_FREQUENCY (taken_edge) == 0)))
     {
       if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
 	{
@@ -870,7 +883,7 @@  pass_early_thread_jumps::execute (function *fun)
   FOR_EACH_BB_FN (bb, fun)
     {
       if (EDGE_COUNT (bb->succs) > 1)
-	find_jump_threads_backwards (bb, false);
+	find_jump_threads_backwards (bb, optimize_bb_for_speed_p (bb));
     }
   thread_through_all_blocks (true);
   return 0;