diff mbox

bb-reorder: Improve the simple algorithm for -Os (PR67864)

Message ID c792164a9e6f8c4ba4db346c2df0e2341c73a3df.1444322573.git.segher@kernel.crashing.org
State New
Headers show

Commit Message

Segher Boessenkool Oct. 8, 2015, 4:57 p.m. UTC
As the PR points out, the "simple" reorder algorithm makes bigger code
than the STC algorithm did, for -Os, for x86.  I now tested it for many
different targets and it turns out to be worse everywhere.

This simple patch tunes "simple" a bit; this makes it better than STC
almost everywhere.  The only exceptions (for the targets where I have
results) are x86 and mn10300.  For those targets it may be best to switch
the default algorithm for -Os to STC.

The raw results.  This is text size for vmlinux for 31 targets; as you
can see many did not build, but at least all primary targets did.
"none" is no basic block reordering; "trunk" is current trunk; "stc" is
with the STC algorithm; "nodyn" is with simple, but considering all edges
equally likely; "swaps" is that, but prefering the more likely edge from
conditional branches; and "falls" prefers the edge from conditional
branches that is already the fallthrough edge.  This patch is "falls".

   none    trunk     stc     nodyn    swaps    falls    best
 3728663  3721479  3700831  3706407  3717971  3690367  falls  arm
 2097684  2095560  2089484  2094024  2094212  2086720  falls  blackfin
 2096118  2107356  2081894  2092276  2103732  2077162  falls  cris
 3204044  3200972  3187196  3191932  3198156  3177980  falls  frv
 9254222  9340258  9208805  9265886  9331362  9247119   stc   i386
 3353034  3355482  3334726  3334466  3349710  3314970  falls  m32r
 4545720  4549824  4514256  4541832  4544540  4498416  falls  microblaze
 4276743  4266363  4246255  4259923  4261367  4227723  falls  mips
 5779199  5770567  5741663  5757663  5764475  5721803  falls  mips64
 2059078  2086000  2051005  2064365  2083923  2055705   stc   mn10300
 3311925  3320113  3293873  3305949  3317865  3284081  falls  nios2
 6738701  6747077  6710253  6742917  6740965  6696757  falls  parisc64
 8312408  8312744  8261016  8294480  8306488  8237188  falls  powerpc
17782722 17788382 17722326 17749526 17780642 17683810  falls  powerpc64
11016289 11029481 10970081 10980065 11024617 10942409  falls  s390
 1406832  1417224  1400344  1409392  1416172  1399428  falls  shnommu
 3771111  3776223  3751623  3768459  3771455  3732967  falls  sparc
 6113427  6113527  6074875  6091167  6106015  6049571  falls  sparc64
10449681 10529883 10398908 10458240 10522149 10440814   stc   x86_64
 1905733  1905733  1905733  1905733  1905733  1905733  -----  xtensa


Is this okay for trunk?


Segher


2015-10-08  Segher Boessenkool  <segher@kernel.crashing.org>

	PR rtl-optimization/67864
	* gcc/bb-reorder (reorder_basic_blocks_simple): Prefer existing
	fallthrough edges for conditional jumps.  Don't sort candidate
	edges if not optimizing for speed.

---
 gcc/bb-reorder.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

Comments

Richard Biener Oct. 9, 2015, 9:29 a.m. UTC | #1
On Thu, Oct 8, 2015 at 6:57 PM, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
> As the PR points out, the "simple" reorder algorithm makes bigger code
> than the STC algorithm did, for -Os, for x86.  I now tested it for many
> different targets and it turns out to be worse everywhere.
>
> This simple patch tunes "simple" a bit; this makes it better than STC
> almost everywhere.  The only exceptions (for the targets where I have
> results) are x86 and mn10300.  For those targets it may be best to switch
> the default algorithm for -Os to STC.
>
> The raw results.  This is text size for vmlinux for 31 targets; as you
> can see many did not build, but at least all primary targets did.
> "none" is no basic block reordering; "trunk" is current trunk; "stc" is
> with the STC algorithm; "nodyn" is with simple, but considering all edges
> equally likely; "swaps" is that, but prefering the more likely edge from
> conditional branches; and "falls" prefers the edge from conditional
> branches that is already the fallthrough edge.  This patch is "falls".
>
>    none    trunk     stc     nodyn    swaps    falls    best
>  3728663  3721479  3700831  3706407  3717971  3690367  falls  arm
>  2097684  2095560  2089484  2094024  2094212  2086720  falls  blackfin
>  2096118  2107356  2081894  2092276  2103732  2077162  falls  cris
>  3204044  3200972  3187196  3191932  3198156  3177980  falls  frv
>  9254222  9340258  9208805  9265886  9331362  9247119   stc   i386
>  3353034  3355482  3334726  3334466  3349710  3314970  falls  m32r
>  4545720  4549824  4514256  4541832  4544540  4498416  falls  microblaze
>  4276743  4266363  4246255  4259923  4261367  4227723  falls  mips
>  5779199  5770567  5741663  5757663  5764475  5721803  falls  mips64
>  2059078  2086000  2051005  2064365  2083923  2055705   stc   mn10300
>  3311925  3320113  3293873  3305949  3317865  3284081  falls  nios2
>  6738701  6747077  6710253  6742917  6740965  6696757  falls  parisc64
>  8312408  8312744  8261016  8294480  8306488  8237188  falls  powerpc
> 17782722 17788382 17722326 17749526 17780642 17683810  falls  powerpc64
> 11016289 11029481 10970081 10980065 11024617 10942409  falls  s390
>  1406832  1417224  1400344  1409392  1416172  1399428  falls  shnommu
>  3771111  3776223  3751623  3768459  3771455  3732967  falls  sparc
>  6113427  6113527  6074875  6091167  6106015  6049571  falls  sparc64
> 10449681 10529883 10398908 10458240 10522149 10440814   stc   x86_64
>  1905733  1905733  1905733  1905733  1905733  1905733  -----  xtensa
>
>
> Is this okay for trunk?

I think the patch makes sense but it also raises a question for me - how
did we decide what edge gets EDGE_FALLTHRU when going out-of-cfglayout?
And isn't _that_ mechanism then not part of basic-block reordering that
needs to be tweaked for choosing the EDGE_FALLTHRU as better?

Thanks,
Richard.

>
> Segher
>
>
> 2015-10-08  Segher Boessenkool  <segher@kernel.crashing.org>
>
>         PR rtl-optimization/67864
>         * gcc/bb-reorder (reorder_basic_blocks_simple): Prefer existing
>         fallthrough edges for conditional jumps.  Don't sort candidate
>         edges if not optimizing for speed.
>
> ---
>  gcc/bb-reorder.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/bb-reorder.c b/gcc/bb-reorder.c
> index cb001e8..3b7098e 100644
> --- a/gcc/bb-reorder.c
> +++ b/gcc/bb-reorder.c
> @@ -2318,16 +2318,24 @@ reorder_basic_blocks_simple (void)
>
>        if (any_condjump_p (end))
>         {
> -         edges[n++] = EDGE_SUCC (bb, 0);
> -         edges[n++] = EDGE_SUCC (bb, 1);
> +         edge e0 = EDGE_SUCC (bb, 0);
> +         edge e1 = EDGE_SUCC (bb, 1);
> +         /* When optimizing for size it is best to keep the original
> +            fallthrough edges.  */
> +         if (e1->flags & EDGE_FALLTHRU)
> +           std::swap (e0, e1);
> +         edges[n++] = e0;
> +         edges[n++] = e1;
>         }
>        else if (single_succ_p (bb))
>         edges[n++] = EDGE_SUCC (bb, 0);
>      }
>
> -  /* Sort the edges, the most desirable first.  */
> +  /* Sort the edges, the most desirable first.  When optimizing for size
> +     all edges are equally desirable.  */
>
> -  std::stable_sort (edges, edges + n, edge_order);
> +  if (optimize_function_for_speed_p (cfun))
> +    std::stable_sort (edges, edges + n, edge_order);
>
>    /* Now decide which of those edges to make fallthrough edges.  We set
>       BB_VISITED if a block already has a fallthrough successor assigned
> --
> 1.8.1.4
>
Bernd Schmidt Oct. 9, 2015, 10:35 a.m. UTC | #2
On 10/08/2015 06:57 PM, Segher Boessenkool wrote:
> As the PR points out, the "simple" reorder algorithm makes bigger code
> than the STC algorithm did, for -Os, for x86.  I now tested it for many
> different targets and it turns out to be worse everywhere.

That's somewhat disappointing. Wasn't it supposed to improve over it? 
What went wrong?

> Is this okay for trunk?
>
> 2015-10-08  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	PR rtl-optimization/67864
> 	* gcc/bb-reorder (reorder_basic_blocks_simple): Prefer existing
> 	fallthrough edges for conditional jumps.  Don't sort candidate
> 	edges if not optimizing for speed.

Ok. Although it would be nice to understand exactly what causes code 
growth and possibly get a real cost estimate rather than such a heuristic.


Bernd
Segher Boessenkool Oct. 9, 2015, 4:08 p.m. UTC | #3
On Fri, Oct 09, 2015 at 11:29:16AM +0200, Richard Biener wrote:
> I think the patch makes sense but it also raises a question for me - how
> did we decide what edge gets EDGE_FALLTHRU when going out-of-cfglayout?

Good question.  I think it just tries to make "natural" control flow; I'll
investigate.

> And isn't _that_ mechanism then not part of basic-block reordering that
> needs to be tweaked for choosing the EDGE_FALLTHRU as better?

Yes.

This patch uses the existing fallthrough for conditional branches, which
seems to work best; but it still can significantly improve on unconditional
branches.  I'm sure there are better heuristics possible but I don't know
them (and actually *solving* the problem isn't even polynomial of course).


Segher
Segher Boessenkool Oct. 9, 2015, 4:27 p.m. UTC | #4
On Fri, Oct 09, 2015 at 12:35:46PM +0200, Bernd Schmidt wrote:
> On 10/08/2015 06:57 PM, Segher Boessenkool wrote:
> >As the PR points out, the "simple" reorder algorithm makes bigger code
> >than the STC algorithm did, for -Os, for x86.  I now tested it for many
> >different targets and it turns out to be worse everywhere.
> 
> That's somewhat disappointing. Wasn't it supposed to improve over it? 
> What went wrong?

I first somehow tested -O2 only.  Then when you asked I tested -Os again,
but only on powerpc64, and it was 0.1% there.  It seems combine.ii is not
very representative for this :-/

> >Is this okay for trunk?
> >
> >2015-10-08  Segher Boessenkool  <segher@kernel.crashing.org>
> >
> >	PR rtl-optimization/67864
> >	* gcc/bb-reorder (reorder_basic_blocks_simple): Prefer existing
> >	fallthrough edges for conditional jumps.  Don't sort candidate
> >	edges if not optimizing for speed.
> 
> Ok. Although it would be nice to understand exactly what causes code 
> growth and possibly get a real cost estimate rather than such a heuristic.

There are two main factors: firstly, you want to have as many fallthroughs
as possible (any block that does not end in a fallthrough gets an extra
jump insn).  I haven't found any purely local heuristic that works better
than this patch.  Very likely you can do better by looking at bigger parts
of the graph (say, identify loops and diamonds).

The second thing (which still hurts x86) is that on some targets the "cheap"
conditional branches have a very short range.  "Simple" has no notion of
jump distance at all.


Segher
diff mbox

Patch

diff --git a/gcc/bb-reorder.c b/gcc/bb-reorder.c
index cb001e8..3b7098e 100644
--- a/gcc/bb-reorder.c
+++ b/gcc/bb-reorder.c
@@ -2318,16 +2318,24 @@  reorder_basic_blocks_simple (void)
 
       if (any_condjump_p (end))
 	{
-	  edges[n++] = EDGE_SUCC (bb, 0);
-	  edges[n++] = EDGE_SUCC (bb, 1);
+	  edge e0 = EDGE_SUCC (bb, 0);
+	  edge e1 = EDGE_SUCC (bb, 1);
+	  /* When optimizing for size it is best to keep the original
+	     fallthrough edges.  */
+	  if (e1->flags & EDGE_FALLTHRU)
+	    std::swap (e0, e1);
+	  edges[n++] = e0;
+	  edges[n++] = e1;
 	}
       else if (single_succ_p (bb))
 	edges[n++] = EDGE_SUCC (bb, 0);
     }
 
-  /* Sort the edges, the most desirable first.  */
+  /* Sort the edges, the most desirable first.  When optimizing for size
+     all edges are equally desirable.  */
 
-  std::stable_sort (edges, edges + n, edge_order);
+  if (optimize_function_for_speed_p (cfun))
+    std::stable_sort (edges, edges + n, edge_order);
 
   /* Now decide which of those edges to make fallthrough edges.  We set
      BB_VISITED if a block already has a fallthrough successor assigned