
bb-reorder: Remove a misfiring micro-optimization (PR96475)

Message ID ed130964e46e1941122dd75e381cca907ec28bd5.1596833936.git.segher@kernel.crashing.org
State New
Series bb-reorder: Remove a misfiring micro-optimization (PR96475)

Commit Message

Segher Boessenkool Aug. 7, 2020, 9:51 p.m. UTC
When the compgotos pass copies the tail of blocks ending in an indirect
jump, there is a micro-optimization to not copy the last one, since the
original block will then just be deleted.  This does not work properly
if cleanup_cfg does not merge all pairs of blocks we expect it to.


v2: This also deletes the other use of single_pred_p, which has the same
problem in principle; I just have never triggered it so far.

Tested on powerpc64-linux {-m32,-m64} like before.  Is this okay for
trunk?


Segher


2020-08-07  Segher Boessenkool  <segher@kernel.crashing.org>

	PR rtl-optimization/96475
	* bb-reorder.c (maybe_duplicate_computed_goto): Remove single_pred_p
	micro-optimization.
---
 gcc/bb-reorder.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

Comments

Segher Boessenkool Aug. 17, 2020, 11:27 p.m. UTC | #1
Ping (added some Cc:s).

Thanks in advance,


Segher


On Fri, Aug 07, 2020 at 09:51:04PM +0000, Segher Boessenkool wrote:
> When the compgotos pass copies the tail of blocks ending in an indirect
> jump, there is a micro-optimization to not copy the last one, since the
> original block will then just be deleted.  This does not work properly
> if cleanup_cfg does not merge all pairs of blocks we expect it to.
> 
> 
> v2: This also deletes the other use of single_pred_p, which has the same
> problem in principle, I just never have triggered it so far.
> 
> Tested on powerpc64-linux {-m32,-m64} like before.  Is this okay for
> trunk?
> 
> 
> Segher
> 
> 
> 2020-08-07  Segher Boessenkool  <segher@kernel.crashing.org>
> 
> 	PR rtl-optimization/96475
> 	* bb-reorder.c (maybe_duplicate_computed_goto): Remove single_pred_p
> 	micro-optimization.
> ---
>  gcc/bb-reorder.c | 10 +++-------
>  1 file changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/gcc/bb-reorder.c b/gcc/bb-reorder.c
> index c635010..76e56b5 100644
> --- a/gcc/bb-reorder.c
> +++ b/gcc/bb-reorder.c
> @@ -2680,9 +2680,6 @@ make_pass_reorder_blocks (gcc::context *ctxt)
>  static bool
>  maybe_duplicate_computed_goto (basic_block bb, int max_size)
>  {
> -  if (single_pred_p (bb))
> -    return false;
> -
>    /* Make sure that the block is small enough.  */
>    rtx_insn *insn;
>    FOR_BB_INSNS (bb, insn)
> @@ -2700,10 +2697,9 @@ maybe_duplicate_computed_goto (basic_block bb, int max_size)
>      {
>        basic_block pred = e->src;
>  
> -      /* Do not duplicate BB into PRED if that is the last predecessor, or if
> -	 we cannot merge a copy of BB with PRED.  */
> -      if (single_pred_p (bb)
> -	  || !single_succ_p (pred)
> +      /* Do not duplicate BB into PRED if we cannot merge a copy of BB
> +	 with PRED.  */
> +      if (!single_succ_p (pred)
>  	  || e->flags & EDGE_COMPLEX
>  	  || pred->index < NUM_FIXED_BLOCKS
>  	  || (JUMP_P (BB_END (pred)) && !simplejump_p (BB_END (pred)))
> -- 
> 1.8.3.1
Richard Sandiford Aug. 19, 2020, 12:10 p.m. UTC | #2
Segher Boessenkool <segher@kernel.crashing.org> writes:
> When the compgotos pass copies the tail of blocks ending in an indirect
> jump, there is a micro-optimization to not copy the last one, since the
> original block will then just be deleted.  This does not work properly
> if cleanup_cfg does not merge all pairs of blocks we expect it to.
>
>
> v2: This also deletes the other use of single_pred_p, which has the same
> problem in principle, I just never have triggered it so far.

Could you go into more details?  I thought the reason you wanted to
remove the test in the loop was that each successful iteration of the
loop removes one incoming edge, meaning that if all previous iterations
succeed, the final iteration is bound to see a single predecessor.
And there's no guarantee in that case that the jump in the final
block will be removed.  I.e. the loop is unnecessarily leaving
cfgcleanup to finish off the last step for it.  So I agree that the
loop part looks good.  (I assume that was v1, but I missed it, sorry.)

But it looks like the point of the initial test is to avoid “duplicating”
the block if there would still only be one copy of the code.  That test
still seems reasonable since the case it rejects should be handled by
other cfg cleanups.  If for some reason it isn't, there doesn't seem
to be any reason to apply the max_size limit for a single predecessor,
since no code duplication will take place.  So ISTM that removing the
first condition is really a separate thing that should be its own patch.

Thanks,
Richard

>
> Tested on powerpc64-linux {-m32,-m64} like before.  Is this okay for
> trunk?
>
>
> Segher
>
>
> 2020-08-07  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	PR rtl-optimization/96475
> 	* bb-reorder.c (maybe_duplicate_computed_goto): Remove single_pred_p
> 	micro-optimization.
> ---
>  gcc/bb-reorder.c | 10 +++-------
>  1 file changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/gcc/bb-reorder.c b/gcc/bb-reorder.c
> index c635010..76e56b5 100644
> --- a/gcc/bb-reorder.c
> +++ b/gcc/bb-reorder.c
> @@ -2680,9 +2680,6 @@ make_pass_reorder_blocks (gcc::context *ctxt)
>  static bool
>  maybe_duplicate_computed_goto (basic_block bb, int max_size)
>  {
> -  if (single_pred_p (bb))
> -    return false;
> -
>    /* Make sure that the block is small enough.  */
>    rtx_insn *insn;
>    FOR_BB_INSNS (bb, insn)
> @@ -2700,10 +2697,9 @@ maybe_duplicate_computed_goto (basic_block bb, int max_size)
>      {
>        basic_block pred = e->src;
>  
> -      /* Do not duplicate BB into PRED if that is the last predecessor, or if
> -	 we cannot merge a copy of BB with PRED.  */
> -      if (single_pred_p (bb)
> -	  || !single_succ_p (pred)
> +      /* Do not duplicate BB into PRED if we cannot merge a copy of BB
> +	 with PRED.  */
> +      if (!single_succ_p (pred)
>  	  || e->flags & EDGE_COMPLEX
>  	  || pred->index < NUM_FIXED_BLOCKS
>  	  || (JUMP_P (BB_END (pred)) && !simplejump_p (BB_END (pred)))
Segher Boessenkool Aug. 19, 2020, 5:40 p.m. UTC | #3
On Wed, Aug 19, 2020 at 01:10:36PM +0100, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > When the compgotos pass copies the tail of blocks ending in an indirect
> > jump, there is a micro-optimization to not copy the last one, since the
> > original block will then just be deleted.  This does not work properly
> > if cleanup_cfg does not merge all pairs of blocks we expect it to.
> >
> >
> > v2: This also deletes the other use of single_pred_p, which has the same
> > problem in principle, I just never have triggered it so far.
> 
> Could you go into more details?  I thought the reason you wanted to
> remove the test in the loop was that each successful iteration of the
> loop removes one incoming edge, meaning that if all previous iterations
> succeed, the final iteration is bound to see a single predecessor.
> And there's no guarantee in that case that the jump in the final
> block will be removed.  I.e. the loop is unnecessarily leaving
> cfgcleanup to finish off the last step for it.  So I agree that the
> loop part looks good.  (I assume that was v1, but I missed it, sorry.)

I didn't Cc: you, I should have :-)

Yes, that is part of the problem.  The case where it was noticed however
was some release of GCC (some 9 release iirc) that did something wrong
with cleanup_cfg apparently, so not everything was threaded optimally,
making duplicate_computed_gotos run into jumps to jumps and stopping.
Since it is such an important pass (for performance, for things with
bigger jump tables or with threaded FSMs etc.), it seemed a good idea to
make it more robust instead of slightly faster.

> But it looks like the point of the initial test is to avoid “duplicating”
> the block if there would still only be one copy of the code.  That test
> still seems reasonable since the case it rejects should be handled by
> other cfg cleanups.  If for some reason it isn't, there doesn't seem
> to be any reason to apply the max_size limit for a single predecessor,
> since no code duplication will take place.  So ISTM that removing the
> first condition is really a separate thing that should be its own patch.

I have never seen the second case misfiring in practice, only the first
one!

The original patch is at
https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551560.html
btw.  I'll prepare a testcase for that (it doesn't trigger with recent
GCC versions though :-/ )

Thanks,


Segher
Richard Sandiford Aug. 19, 2020, 6:18 p.m. UTC | #4
Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Wed, Aug 19, 2020 at 01:10:36PM +0100, Richard Sandiford wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > When the compgotos pass copies the tail of blocks ending in an indirect
>> > jump, there is a micro-optimization to not copy the last one, since the
>> > original block will then just be deleted.  This does not work properly
>> > if cleanup_cfg does not merge all pairs of blocks we expect it to.
>> >
>> >
>> > v2: This also deletes the other use of single_pred_p, which has the same
>> > problem in principle, I just never have triggered it so far.
>> 
>> Could you go into more details?  I thought the reason you wanted to
>> remove the test in the loop was that each successful iteration of the
>> loop removes one incoming edge, meaning that if all previous iterations
>> succeed, the final iteration is bound to see a single predecessor.
>> And there's no guarantee in that case that the jump in the final
>> block will be removed.  I.e. the loop is unnecessarily leaving
>> cfgcleanup to finish off the last step for it.  So I agree that the
>> loop part looks good.  (I assume that was v1, but I missed it, sorry.)
>
> I didn't Cc: you, I should have :-)
>
> Yes, that is part of the problem.  The case where it was noticed however
> was some release of GCC (some 9 release iirc) that did something wrong
> with cfg_cleanup apparenetly, so not everything was threaded optimally,
> making duplicate_computed_gotos run into jumps to jumps and stopping.
> Since it is such an important pass (for performance, for things with
> bigger jump tables or with threaded FSMs etc.), it seemed a good idea to
> make it more robust instead of slightly faster.
>
>> But it looks like the point of the initial test is to avoid “duplicating”
>> the block if there would still only be one copy of the code.  That test
>> still seems reasonable since the case it rejects should be handled by
>> other cfg cleanups.  If for some reason it isn't, there doesn't seem
>> to be any reason to apply the max_size limit for a single predecessor,
>> since no code duplication will take place.  So ISTM that removing the
>> first condition is really a separate thing that should be its own patch.
>
> I have never seen the second case misfiring in practice, only the first
> one!

Shucks, I guessed the wrong way round :-)

I'd argue that the first check isn't a micro-optimisation though.
It's testing whether the duplication really is a duplication
(rather than a move).

The second test does look like a micro-optimisation.

> The original patch is at
> https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551560.html
> btw.  I'll prepare a testcase for that (it doesn't trigger with recent
> GCC versions though :-/ )

Hmm, OK.  If the first check is the important one, then I think it'd
be better to make the max_size stuff conditional on !single_pred_p
rather than drop the test entirely.
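In rough pseudocode (a hypothetical shape only; the actual surrounding code in bb-reorder.c may differ, and block_larger_than is an invented helper standing in for the existing size check), that alternative would keep the test but let it only bypass the size limit:

```
/* Sketch: a sole predecessor means the "duplication" is really a move,
   so no code grows and the max_size limit need not apply.  */
if (!single_pred_p (bb) && block_larger_than (bb, max_size))
  return false;
```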

Thanks,
Richard
Segher Boessenkool Aug. 19, 2020, 11:11 p.m. UTC | #5
Hi!

On Wed, Aug 19, 2020 at 07:18:02PM +0100, Richard Sandiford wrote:
> > I have never seen the second case misfiring in practice, only the first
> > one!
> 
> Shucks, I guessed the wrong way round :-)
> 
> I'd argue that the first check isn't a micro-optimisation though.
> It's testing whether the duplication really is a duplication
> (rather than a move).

Yes, by removing it the code is duplicated like in all other cases, but
then the original is removed later (as dead code).  With the check in
place we have it appended to the predecessor block (by insn chain
grafting instead of copying, done by cleanup_cfg).  That is what I call
a micro-optimisation.  Of course it isn't correct in all cases (maybe
those cases didn't happen before, maybe no one noticed; in any case,
this patch makes it more robust, while still keeping it O(N)).

> > The original patch is at
> > https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551560.html
> > btw.  I'll prepare a testcase for that (it doesn't trigger with recent
> > GCC versions though :-/ )
> 
> Hmm, OK.  If the first check is the important one, then I think it'd
> be better to make the max_size stuff conditional on !single_pred_p
> rather than drop the test entirely.

Okay, I'll look at that.  Thanks,


Segher
Li, Pan2 via Gcc-patches Aug. 25, 2020, 8:39 p.m. UTC | #6
On Fri, 2020-08-07 at 21:51 +0000, Segher Boessenkool wrote:
> When the compgotos pass copies the tail of blocks ending in an indirect
> jump, there is a micro-optimization to not copy the last one, since the
> original block will then just be deleted.  This does not work properly
> if cleanup_cfg does not merge all pairs of blocks we expect it to.
> 
> 
> v2: This also deletes the other use of single_pred_p, which has the same
> problem in principle, I just never have triggered it so far.
> 
> Tested on powerpc64-linux {-m32,-m64} like before.  Is this okay for
> trunk?
> 
> 
> Segher
> 
> 
> 2020-08-07  Segher Boessenkool  <segher@kernel.crashing.org>
> 
> 	PR rtl-optimization/96475
> 	* bb-reorder.c (maybe_duplicate_computed_goto): Remove single_pred_p
> 	micro-optimization.
So this may have already been answered, but why didn't we merge the single pred
with its single succ?

Though I guess your patch wouldn't hurt, so OK.

jeff
Segher Boessenkool Sept. 8, 2020, 5:41 p.m. UTC | #7
Hi!

Sorry this took so long to come back to...

On Tue, Aug 25, 2020 at 02:39:56PM -0600, Jeff Law wrote:
> On Fri, 2020-08-07 at 21:51 +0000, Segher Boessenkool wrote:
> > When the compgotos pass copies the tail of blocks ending in an indirect
> > jump, there is a micro-optimization to not copy the last one, since the
> > original block will then just be deleted.  This does not work properly
> > if cleanup_cfg does not merge all pairs of blocks we expect it to.
> > 
> > 
> > v2: This also deletes the other use of single_pred_p, which has the same
> > problem in principle, I just never have triggered it so far.
> > 
> > Tested on powerpc64-linux {-m32,-m64} like before.  Is this okay for
> > trunk?
> > 
> > 
> > Segher
> > 
> > 
> > 2020-08-07  Segher Boessenkool  <segher@kernel.crashing.org>
> > 
> > 	PR rtl-optimization/96475
> > 	* bb-reorder.c (maybe_duplicate_computed_goto): Remove single_pred_p
> > 	micro-optimization.
> So this may have already been answered, but why didn't we merge the single pred
> with its single succ?

The patch makes the compgotos pass more robust, and also makes it work
in cases where the unmodified code would not (for example, with the
unmodified code, the original indirect jump is *not* copied to the
predecessors whenever possible).

I don't know if we hit this, or we hit a snag with some older GCC
releases that did not jump thread properly in all cases.  I suspect a
mixture of the two :-/

> Though I guess your patch wouldn't hurt, so OK.

Thanks!


Segher

Patch

diff --git a/gcc/bb-reorder.c b/gcc/bb-reorder.c
index c635010..76e56b5 100644
--- a/gcc/bb-reorder.c
+++ b/gcc/bb-reorder.c
@@ -2680,9 +2680,6 @@  make_pass_reorder_blocks (gcc::context *ctxt)
 static bool
 maybe_duplicate_computed_goto (basic_block bb, int max_size)
 {
-  if (single_pred_p (bb))
-    return false;
-
   /* Make sure that the block is small enough.  */
   rtx_insn *insn;
   FOR_BB_INSNS (bb, insn)
@@ -2700,10 +2697,9 @@  maybe_duplicate_computed_goto (basic_block bb, int max_size)
     {
       basic_block pred = e->src;
 
-      /* Do not duplicate BB into PRED if that is the last predecessor, or if
-	 we cannot merge a copy of BB with PRED.  */
-      if (single_pred_p (bb)
-	  || !single_succ_p (pred)
+      /* Do not duplicate BB into PRED if we cannot merge a copy of BB
+	 with PRED.  */
+      if (!single_succ_p (pred)
 	  || e->flags & EDGE_COMPLEX
 	  || pred->index < NUM_FIXED_BLOCKS
 	  || (JUMP_P (BB_END (pred)) && !simplejump_p (BB_END (pred)))