jump threading multiple paths that start from the same BB

Message ID bcff7e98-814c-e13d-9057-7be151c627d4@redhat.com

Commit Message

Aldy Hernandez Nov. 7, 2017, 5:33 p.m. UTC
[One more time, but without rejected HTML mail, because apparently this 
is my first post to gcc-patches *ever* ;-)].

Howdy!

While poking around in the backwards threader I noticed that we bail if 
we have already seen a starting BB.

       /* Do not jump-thread twice from the same block.  */
       if (bitmap_bit_p (threaded_blocks, entry->src->index)

This limitation discards paths that are sub-paths of paths that have 
already been threaded.

The following patch scans the remaining to-be-threaded paths to identify 
if any of them start from the same point, and are thus sub-paths of the 
just-threaded path.  By removing the common prefix of blocks in upcoming 
threadable paths, and then rewiring the first non-common block 
appropriately, we expose new threading opportunities, since we are no 
longer starting from the same BB.  We also simplify the would-be 
threaded paths, because we don't duplicate already duplicated paths.

This sounds confusing, but the documentation for the entry point to my 
patch (adjust_paths_after_duplication) shows an actual example:

+/* After an FSM path has been jump threaded, adjust the remaining FSM
+   paths that are subsets of this path, so these paths can be safely
+   threaded within the context of the new threaded path.
+
+   For example, suppose we have just threaded:
+
+   5 -> 6 -> 7 -> 8 -> 12      =>      5 -> 6' -> 7' -> 8' -> 12'
+
+   And we have an upcoming threading candidate:
+   5 -> 6 -> 7 -> 8 -> 15 -> 20
+
+   This function adjusts the upcoming path into:
+   8' -> 15 -> 20

Notice that we will no longer have two paths that start from the same 
BB.  One will start with bb5, while the adjusted path will start with 
bb8'.  With this we kill two birds-- we are able to thread more paths, 
and these paths will avoid duplicating a whole mess of things that have 
already been threaded.

The attached patch is a subset of some upcoming work that can live on 
its own.  It bootstraps and regtests.  Also, by running it on a handful 
of .ii files, I can see that we are able to thread sub-paths that we 
previously dropped on the floor.  More is better, right? :)

To test this, I stole Jeff's method of using cachegrind to benchmark 
instruction counts and conditional branches 
(https://gcc.gnu.org/ml/gcc-patches/2013-11/msg02434.html).

Basically, I bootstrapped two compilers, with and without improvements, 
and used each to build a stage1 trunk.  Each of these stage1-trunks were 
used on 246 .ii GCC files I have lying around from a bootstrap some 
random point last year.  I used no special flags on builds apart from 
--enable-languages=c,c++.

Although I would've wished for a larger improvement, this work comes for 
free, as it's just a subset of other work I'm hacking on.

Without further ado, here are my monumental, earth shattering improvements:

Conditional branches
    Without patch: 411846839709
    With    patch: 411831330316
         %changed: -0.0037660%

Number of instructions
    Without patch: 2271844650634
    With    patch: 2271761153578
         %changed: -0.0036754%


OK for trunk?
Aldy

p.s. There is a lot of debugging/dumping code in my patch, which I can 
gladly remove if/when approved.  It helps keep my head straight while 
looking at this spaghetti :).

Comments

Jeff Law Nov. 27, 2017, 11:27 p.m. UTC | #1
On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
> [One more time, but without rejected HTML mail, because apparently this
> is my first post to gcc-patches *ever* ;-)].
> 
> Howdy!
> 
> While poking around in the backwards threader I noticed that we bail if
> we have already seen a starting BB.
> 
>       /* Do not jump-thread twice from the same block.  */
>       if (bitmap_bit_p (threaded_blocks, entry->src->index)
> 
> This limitation discards paths that are sub-paths of paths that have
> already been threaded.
> 
> The following patch scans the remaining to-be-threaded paths to identify
> if any of them start from the same point, and are thus sub-paths of the
> just-threaded path.  By removing the common prefix of blocks in upcoming
> threadable paths, and then rewiring the first non-common block
> appropriately, we expose new threading opportunities, since we are no
> longer starting from the same BB.  We also simplify the would-be
> threaded paths, because we don't duplicate already duplicated paths.
> 
> This sounds confusing, but the documentation for the entry point to my
> patch (adjust_paths_after_duplication) shows an actual example:
> 
> +/* After an FSM path has been jump threaded, adjust the remaining FSM
> +   paths that are subsets of this path, so these paths can be safely
> +   threaded within the context of the new threaded path.
> +
> +   For example, suppose we have just threaded:
> +
> +   5 -> 6 -> 7 -> 8 -> 12      =>      5 -> 6' -> 7' -> 8' -> 12'
> +
> +   And we have an upcoming threading candidate:
> +   5 -> 6 -> 7 -> 8 -> 15 -> 20
> +
> +   This function adjusts the upcoming path into:
> +   8' -> 15 -> 20
So a nit on the comment.  The destination of the target edge in a jump
threading path is not duplicated.  So for the comment to be correct, I
think you need to change 12' to just 12.

Presumably BB8 ends in a conditional which we can't determine anything
about statically (otherwise we wouldn't have 8->12 and 8->15).  It's
something that may confuse someone not familiar with this code.  Though
full CFGs may be too painful to show here as well.




> 
> Notice that we will no longer have two paths that start from the same
> BB.  One will start with bb5, while the adjusted path will start with
> bb8'.  With this we kill two birds-- we are able to thread more paths,
> and these paths will avoid duplicating a whole mess of things that have
> already been threaded.
Which is a good thing IMHO.  Essentially with the code right now we have
to wait until the next backwards threading pass to pick up the second
jump thread.   It's not really a secondary effect -- we found the
thread, but didn't have the support we needed in the copier to DTRT.


> 
> The attached patch is a subset of some upcoming work that can live on
> its own.  It bootstraps and regtests.  Also, by running it on a handful
> of .ii files, I can see that we are able to thread sub-paths that we
> previously dropped on the floor.  More is better, right? :)
Generally, yes.  Similarly, the more thoroughly we can thread earlier,
the better.  While my immediate goal is to remove tree-vrp's jump
threading, dropping one or two (of the four!) backwards passes is also
on the hitlist.


> 
> To test this, I stole Jeff's method of using cachegrind to benchmark
> instruction counts and conditional branches
> (https://gcc.gnu.org/ml/gcc-patches/2013-11/msg02434.html).
:-)

> 
> Basically, I bootstrapped two compilers, with and without improvements,
> and used each to build a stage1 trunk.  Each of these stage1-trunks were
> used on 246 .ii GCC files I have lying around from a bootstrap some
> random point last year.  I used no special flags on builds apart from
> --enable-languages=c,c++.
> 
> Although I would've wished for a larger improvement, this work comes for
> free, as it's just a subset of other work I'm hacking on.
Yup.

> 
> Without further ado, here are my monumental, earth shattering improvements:
> 
> Conditional branches
>    Without patch: 411846839709
>    With    patch: 411831330316
>         %changed: -0.0037660%
> 
> Number of instructions
>    Without patch: 2271844650634
>    With    patch: 2271761153578
>         %changed: -0.0036754%
So that's pretty small.  A really good improvement would be on the order
of a half-percent reduction in runtime conditional branches.  I usually
test with .i files that have enable-checking turned on -- which tends to
give lots of opportunities due to the redundancies in our checking code.

On a positive note, you're eliminating roughly 4.5 other dynamic
instructions for every runtime conditional branch you remove, which is
actually a good ratio.  3.5 is what I typically see for a fairly
extensive amount of work.   Patrick P's work last year was on the order
of 7.5.  So while it didn't fire often, when it did it was highly effective.

We have installed changes of this nature when they were part of a larger
set of changes, particularly if they helped us elsewhere (like
eliminating a key path to silence a bogus warning or addressing other
regressions).

I'm going to defer judgment right now.  I will do some further testing
as I walk through the various uninitialized and similar warnings as well
as with any work to see if we can reasonably eliminate one or two of the
backwards passes.  If it helps in either of those contexts then we have
a much stronger case to include now rather than defer until the larger
work lands.



> p.s. There is a lot of debugging/dumping code in my patch, which I can
> gladly remove if/when approved.  It helps keep my head straight while
> looking at this spaghetti :).
Yea :-)  Actually keeping some of it is probably useful.  My brain turns
to mush as I move between the old style and new style jump threading
implementations.  Better dumps would probably help.


So the changes to test that path->length > 1 in mark_threaded_blocks
seem like they ought to go forward independently.

Conceptually it looks sane.  A testcase or two would help.


Jeff
Aldy Hernandez Nov. 29, 2017, 6:49 p.m. UTC | #2
On 11/27/2017 06:27 PM, Jeff Law wrote:
> On 11/07/2017 10:33 AM, Aldy Hernandez wrote:

>> Without further ado, here are my monumental, earth shattering improvements:
>>
>> Conditional branches
>>     Without patch: 411846839709
>>     With    patch: 411831330316
>>          %changed: -0.0037660%
>>
>> Number of instructions
>>     Without patch: 2271844650634
>>     With    patch: 2271761153578
>>          %changed: -0.0036754%
> So that's pretty small.  A really good improvement would be on the order
> of a half-percent reduction in runtime conditional branches.  I usually
> test with .i files that have enable-checking turned on -- which tends to
> give lots of opportunities due to the redundancies in our checking code.
> 
> On a positive note, you're eliminating roughly 4.5 other dynamic
> instructions for every runtime conditional branch you remove, which is
> actually a good ratio.  3.5 is what I typically see for a fairly
> extensive amount of work.   Patrick P's work last year was on the order
> of 7.5.  So while it didn't fire often, when it did it was highly effective.

I've retested with .ii files gathered from a stage1 build with 
--enable-checking=yes,all.  The results are an order of magnitude better 
but not impressive by any means:

Conditional branches
    Without patch: 1333666327560
    With    patch: 1333333689781
         %changed: -0.0249416%

Number of instructions
    Without patch: 7348622241739
    With    patch: 7347240547621
         %changed: -0.0188021%


> 
> We have installed changes of this nature when they were part of a larger
> set of changes, particularly if they helped us elsewhere (like
> eliminating a key path to silence a bogus warning or addressing other
> regressions).

I don't know if the above changes your view, but I am not losing sleep 
over this, so no worries.  I will just keep accumulating these along 
with the myriad of other changes I have ;-).

Thanks for taking a look at this, and making sure I'm headed in the 
right direction.

Aldy
Jeff Law Nov. 30, 2017, 11:34 p.m. UTC | #3
On 11/29/2017 11:49 AM, Aldy Hernandez wrote:
> On 11/27/2017 06:27 PM, Jeff Law wrote:
>> On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
> 
>>> Without further ado, here are my monumental, earth shattering
>>> improvements:
>>>
>>> Conditional branches
>>>     Without patch: 411846839709
>>>     With    patch: 411831330316
>>>          %changed: -0.0037660%
>>>
>>> Number of instructions
>>>     Without patch: 2271844650634
>>>     With    patch: 2271761153578
>>>          %changed: -0.0036754%
>> So that's pretty small.  A really good improvement would be on the order
>> of a half-percent reduction in runtime conditional branches.  I usually
>> test with .i files that have enable-checking turned on -- which tends to
>> give lots of opportunities due to the redundancies in our checking code.
>>
>> On a positive note, you're eliminating roughly 4.5 other dynamic
>> instructions for every runtime conditional branch you remove, which is
>> actually a good ratio.  3.5 is what I typically see for a fairly
>> extensive amount of work.   Patrick P's work last year was on the order
>> of 7.5.  So while it didn't fire often, when it did it was highly
>> effective.
> 
> I've retested with .ii files gathered from a stage1 build with
> --enable-checking=yes,all.  The results are an order of magnitude better
> but not impressive by any means:
Ack.  It'd need another order of magnitude to start getting into that
interesting range by itself :-)  You're still hitting that 4.5 ratio of
non-branch to branch instructions eliminated.  So it's higher value when
it applies than most of the stuff I've done through the years.


>> We have installed changes of this nature when they were part of a larger
>> set of changes, particularly if they helped us elsewhere (like
>> eliminating a key path to silence a bogus warning or addressing other
>> regressions).
> 
> I don't know if the above changes your view, but I am not losing sleep
> over this, so no worries.  I will just keep accumulating these along
> with the myriad of other changes I have ;-).
I'm not sure it changes anything either.   But I'll certainly keep it in
mind as I start digging deeper into the buglists for gcc-8.  If it helps
in the bugfixing effort, then we'll have an interesting decision to make.

jeff
Jeff Law June 29, 2018, 6:50 p.m. UTC | #4
[ Returning to another old patch... ]

On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
> [One more time, but without rejected HTML mail, because apparently this
> is my first post to gcc-patches *ever* ;-)].
> 
> Howdy!
> 
> While poking around in the backwards threader I noticed that we bail if
> we have already seen a starting BB.
> 
>       /* Do not jump-thread twice from the same block.  */
>       if (bitmap_bit_p (threaded_blocks, entry->src->index)
> 
> This limitation discards paths that are sub-paths of paths that have
> already been threaded.
> 
> The following patch scans the remaining to-be-threaded paths to identify
> if any of them start from the same point, and are thus sub-paths of the
> just-threaded path.  By removing the common prefix of blocks in upcoming
> threadable paths, and then rewiring the first non-common block
> appropriately, we expose new threading opportunities, since we are no
> longer starting from the same BB.  We also simplify the would-be
> threaded paths, because we don't duplicate already duplicated paths.
> 
> This sounds confusing, but the documentation for the entry point to my
> patch (adjust_paths_after_duplication) shows an actual example:
> 
> +/* After an FSM path has been jump threaded, adjust the remaining FSM
> +   paths that are subsets of this path, so these paths can be safely
> +   threaded within the context of the new threaded path.
> +
> +   For example, suppose we have just threaded:
> +
> +   5 -> 6 -> 7 -> 8 -> 12      =>      5 -> 6' -> 7' -> 8' -> 12'
> +
> +   And we have an upcoming threading candidate:
> +   5 -> 6 -> 7 -> 8 -> 15 -> 20
> +
> +   This function adjusts the upcoming path into:
> +   8' -> 15 -> 20
> 
> Notice that we will no longer have two paths that start from the same
> BB.  One will start with bb5, while the adjusted path will start with
> bb8'.  With this we kill two birds-- we are able to thread more paths,
> and these paths will avoid duplicating a whole mess of things that have
> already been threaded.
> 
> The attached patch is a subset of some upcoming work that can live on
> its own.  It bootstraps and regtests.  Also, by running it on a handful
> of .ii files, I can see that we are able to thread sub-paths that we
> previously dropped on the floor.  More is better, right? :)
> 
> To test this, I stole Jeff's method of using cachegrind to benchmark
> instruction counts and conditional branches
> (https://gcc.gnu.org/ml/gcc-patches/2013-11/msg02434.html).
> 
> Basically, I bootstrapped two compilers, with and without improvements,
> and used each to build a stage1 trunk.  Each of these stage1-trunks were
> used on 246 .ii GCC files I have lying around from a bootstrap some
> random point last year.  I used no special flags on builds apart from
> --enable-languages=c,c++.
> 
> Although I would've wished for a larger improvement, this work comes for
> free, as it's just a subset of other work I'm hacking on.
> 
> Without further ado, here are my monumental, earth shattering improvements:
> 
> Conditional branches
>    Without patch: 411846839709
>    With    patch: 411831330316
>         %changed: -0.0037660%
> 
> Number of instructions
>    Without patch: 2271844650634
>    With    patch: 2271761153578
>         %changed: -0.0036754%
> 
> 
> OK for trunk?
> Aldy
> 
> p.s. There is a lot of debugging/dumping code in my patch, which I can
> gladly remove if/when approved.  It helps keep my head straight while
> looking at this spaghetti :).
> 
> curr.patch
> 
> 
> gcc/
> 
> 	* tree-ssa-threadupdate.c (mark_threaded_blocks): Avoid
> 	dereferencing path[] beyond its length.
> 	(debug_path): New.
> 	(debug_all_paths): New.
> 	(rewire_first_differing_edge): New.
> 	(adjust_paths_after_duplication): New.
> 	(duplicate_thread_path): Call adjust_paths_after_duplication.
> 	Add new argument.
> 	(thread_through_all_blocks): Add new argument to
> 	duplicate_thread_path.
This is fine for the trunk.  I'd keep the dumping code as-is.  It'll be
useful in the future :-)

> 
> diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
> index 1dab0f1fab4..53ac7181b4b 100644
> --- a/gcc/tree-ssa-threadupdate.c
> +++ b/gcc/tree-ssa-threadupdate.c
> +
> +/* Rewire a jump_thread_edge so that the source block is now a
> +   threaded source block.
> +
> +   PATH_NUM is an index into the global path table PATHS.
> +   EDGE_NUM is the jump thread edge number into said path.
> +
> +   Returns TRUE if we were able to successfully rewire the edge.  */
> +
> +static bool
> +rewire_first_differing_edge (unsigned path_num, unsigned edge_num)
> +{
> +  vec<jump_thread_edge *> *path = paths[path_num];
> +  edge &e = (*path)[edge_num]->e;
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "rewiring edge candidate: %d -> %d\n",
> +	     e->src->index, e->dest->index);
> +  basic_block src_copy = get_bb_copy (e->src);
> +  if (src_copy == NULL)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, "ignoring candidate: there is no src COPY\n");
> +      return false;
> +    }
> +  edge new_edge = find_edge (src_copy, e->dest);
> +  /* If the previously threaded paths created a flow graph where we
> +     can no longer figure out where to go, give up.  */
> +  if (new_edge == NULL)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, "ignoring candidate: we lost our way\n");
> +      return false;
> +    }
> +  e = new_edge;
> +  return true;
I was at first a bit confused about how this function had any visible
side effects.  Then I realized that "e" is actually a reference to an
edge in the jump threading path and that seemingly dead assignment at
the end isn't really dead :-)


Jeff
Aldy Hernandez July 1, 2018, 10:56 a.m. UTC | #5
On 06/29/2018 02:50 PM, Jeff Law wrote:
> [ Returning to another old patch... ]
> 
> On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
>> [One more time, but without rejected HTML mail, because apparently this
>> is my first post to gcc-patches *ever* ;-)].
>>
>> Howdy!
>>
>> While poking around in the backwards threader I noticed that we bail if
>> we have already seen a starting BB.
>>
>>        /* Do not jump-thread twice from the same block.  */
>>        if (bitmap_bit_p (threaded_blocks, entry->src->index)
>>
>> This limitation discards paths that are sub-paths of paths that have
>> already been threaded.
>>
>> The following patch scans the remaining to-be-threaded paths to identify
>> if any of them start from the same point, and are thus sub-paths of the
>> just-threaded path.  By removing the common prefix of blocks in upcoming
>> threadable paths, and then rewiring the first non-common block
>> appropriately, we expose new threading opportunities, since we are no
>> longer starting from the same BB.  We also simplify the would-be
>> threaded paths, because we don't duplicate already duplicated paths.
>>
>> This sounds confusing, but the documentation for the entry point to my
>> patch (adjust_paths_after_duplication) shows an actual example:
>>
>> +/* After an FSM path has been jump threaded, adjust the remaining FSM
>> +   paths that are subsets of this path, so these paths can be safely
>> +   threaded within the context of the new threaded path.
>> +
>> +   For example, suppose we have just threaded:
>> +
>> +   5 -> 6 -> 7 -> 8 -> 12      =>      5 -> 6' -> 7' -> 8' -> 12'
>> +
>> +   And we have an upcoming threading candidate:
>> +   5 -> 6 -> 7 -> 8 -> 15 -> 20
>> +
>> +   This function adjusts the upcoming path into:
>> +   8' -> 15 -> 20
>>
>> Notice that we will no longer have two paths that start from the same
>> BB.  One will start with bb5, while the adjusted path will start with
>> bb8'.  With this we kill two birds-- we are able to thread more paths,
>> and these paths will avoid duplicating a whole mess of things that have
>> already been threaded.
>>
>> The attached patch is a subset of some upcoming work that can live on
>> its own.  It bootstraps and regtests.  Also, by running it on a handful
>> of .ii files, I can see that we are able to thread sub-paths that we
>> previously dropped on the floor.  More is better, right? :)
>>
>> To test this, I stole Jeff's method of using cachegrind to benchmark
>> instruction counts and conditional branches
>> (https://gcc.gnu.org/ml/gcc-patches/2013-11/msg02434.html).
>>
>> Basically, I bootstrapped two compilers, with and without improvements,
>> and used each to build a stage1 trunk.  Each of these stage1-trunks were
>> used on 246 .ii GCC files I have lying around from a bootstrap some
>> random point last year.  I used no special flags on builds apart from
>> --enable-languages=c,c++.
>>
>> Although I would've wished for a larger improvement, this work comes for
>> free, as it's just a subset of other work I'm hacking on.
>>
>> Without further ado, here are my monumental, earth shattering improvements:
>>
>> Conditional branches
>>     Without patch: 411846839709
>>     With    patch: 411831330316
>>          %changed: -0.0037660%
>>
>> Number of instructions
>>     Without patch: 2271844650634
>>     With    patch: 2271761153578
>>          %changed: -0.0036754%
>>
>>
>> OK for trunk?
>> Aldy
>>
>> p.s. There is a lot of debugging/dumping code in my patch, which I can
>> gladly remove if/when approved.  It helps keep my head straight while
>> looking at this spaghetti :).
>>
>> curr.patch
>>
>>
>> gcc/
>>
>> 	* tree-ssa-threadupdate.c (mark_threaded_blocks): Avoid
>> 	dereferencing path[] beyond its length.
>> 	(debug_path): New.
>> 	(debug_all_paths): New.
>> 	(rewire_first_differing_edge): New.
>> 	(adjust_paths_after_duplication): New.
>> 	(duplicate_thread_path): Call adjust_paths_after_duplication.
>> 	Add new argument.
>> 	(thread_through_all_blocks): Add new argument to
>> 	duplicate_thread_path.
> This is fine for the trunk.  I'd keep the dumping code as-is.  It'll be
> useful in the future :-)

Retested against current trunk and committed.

Thanks.

Aldy
Christophe Lyon July 2, 2018, 11:08 a.m. UTC | #6
On Sun, 1 Jul 2018 at 12:56, Aldy Hernandez <aldyh@redhat.com> wrote:
>
>
>
> On 06/29/2018 02:50 PM, Jeff Law wrote:
> > [ Returning to another old patch... ]
> >
> > On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
> >> [One more time, but without rejected HTML mail, because apparently this
> >> is my first post to gcc-patches *ever* ;-)].
> >>
> >> Howdy!
> >>
> >> While poking around in the backwards threader I noticed that we bail if
> >> we have already seen a starting BB.
> >>
> >>        /* Do not jump-thread twice from the same block.  */
> >>        if (bitmap_bit_p (threaded_blocks, entry->src->index)
> >>
> >> This limitation discards paths that are sub-paths of paths that have
> >> already been threaded.
> >>
> >> The following patch scans the remaining to-be-threaded paths to identify
> >> if any of them start from the same point, and are thus sub-paths of the
> >> just-threaded path.  By removing the common prefix of blocks in upcoming
> >> threadable paths, and then rewiring the first non-common block
> >> appropriately, we expose new threading opportunities, since we are no
> >> longer starting from the same BB.  We also simplify the would-be
> >> threaded paths, because we don't duplicate already duplicated paths.
> >>
> >> This sounds confusing, but the documentation for the entry point to my
> >> patch (adjust_paths_after_duplication) shows an actual example:
> >>
> >> +/* After an FSM path has been jump threaded, adjust the remaining FSM
> >> +   paths that are subsets of this path, so these paths can be safely
> >> +   threaded within the context of the new threaded path.
> >> +
> >> +   For example, suppose we have just threaded:
> >> +
> >> +   5 -> 6 -> 7 -> 8 -> 12      =>      5 -> 6' -> 7' -> 8' -> 12'
> >> +
> >> +   And we have an upcoming threading candidate:
> >> +   5 -> 6 -> 7 -> 8 -> 15 -> 20
> >> +
> >> +   This function adjusts the upcoming path into:
> >> +   8' -> 15 -> 20
> >>
> >> Notice that we will no longer have two paths that start from the same
> >> BB.  One will start with bb5, while the adjusted path will start with
> >> bb8'.  With this we kill two birds-- we are able to thread more paths,
> >> and these paths will avoid duplicating a whole mess of things that have
> >> already been threaded.
> >>
> >> The attached patch is a subset of some upcoming work that can live on
> >> its own.  It bootstraps and regtests.  Also, by running it on a handful
> >> of .ii files, I can see that we are able to thread sub-paths that we
> >> previously dropped on the floor.  More is better, right? :)
> >>
> >> To test this, I stole Jeff's method of using cachegrind to benchmark
> >> instruction counts and conditional branches
> >> (https://gcc.gnu.org/ml/gcc-patches/2013-11/msg02434.html).
> >>
> >> Basically, I bootstrapped two compilers, with and without improvements,
> >> and used each to build a stage1 trunk.  Each of these stage1-trunks were
> >> used on 246 .ii GCC files I have lying around from a bootstrap some
> >> random point last year.  I used no special flags on builds apart from
> >> --enable-languages=c,c++.
> >>
> >> Although I would've wished for a larger improvement, this work comes for
> >> free, as it's just a subset of other work I'm hacking on.
> >>
> >> Without further ado, here are my monumental, earth shattering improvements:
> >>
> >> Conditional branches
> >>     Without patch: 411846839709
> >>     With    patch: 411831330316
> >>          %changed: -0.0037660%
> >>
> >> Number of instructions
> >>     Without patch: 2271844650634
> >>     With    patch: 2271761153578
> >>          %changed: -0.0036754%
> >>
> >>
> >> OK for trunk?
> >> Aldy
> >>
> >> p.s. There is a lot of debugging/dumping code in my patch, which I can
> >> gladly remove if/when approved.  It helps keep my head straight while
> >> looking at this spaghetti :).
> >>
> >> curr.patch
> >>
> >>
> >> gcc/
> >>
> >>      * tree-ssa-threadupdate.c (mark_threaded_blocks): Avoid
> >>      dereferencing path[] beyond its length.
> >>      (debug_path): New.
> >>      (debug_all_paths): New.
> >>      (rewire_first_differing_edge): New.
> >>      (adjust_paths_after_duplication): New.
> >>      (duplicate_thread_path): Call adjust_paths_after_duplication.
> >>      Add new argument.
> >>      (thread_through_all_blocks): Add new argument to
> >>      duplicate_thread_path.
> > This is fine for the trunk.  I'd keep the dumping code as-is.  It'll be
> > useful in the future :-)
>
> Retested against current trunk and committed.
>

Hi,

I've noticed a regression on aarch64:
FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump thread3 "Jumps
threaded: 3"
very likely caused by this patch (appeared between 262282 and 262294)

Christophe

> Thanks.
>
> Aldy
Aldy Hernandez July 3, 2018, 9:31 a.m. UTC | #7
On 07/02/2018 07:08 AM, Christophe Lyon wrote:

>>> On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
>>>> While poking around in the backwards threader I noticed that we bail if
>>>> we have already seen a starting BB.
>>>>
>>>>         /* Do not jump-thread twice from the same block.  */
>>>>         if (bitmap_bit_p (threaded_blocks, entry->src->index)
>>>>
>>>> This limitation discards paths that are sub-paths of paths that have
>>>> already been threaded.
>>>>
>>>> The following patch scans the remaining to-be-threaded paths to identify
>>>> if any of them start from the same point, and are thus sub-paths of the
>>>> just-threaded path.  By removing the common prefix of blocks in upcoming
>>>> threadable paths, and then rewiring the first non-common block
>>>> appropriately, we expose new threading opportunities, since we are no
>>>> longer starting from the same BB.  We also simplify the would-be
>>>> threaded paths, because we don't duplicate already duplicated paths.
[snip]
> Hi,
> 
> I've noticed a regression on aarch64:
> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump thread3 "Jumps
> threaded: 3"
> very likely caused by this patch (appeared between 262282 and 262294)
> 
> Christophe

The test needs to be adjusted here.

The long story is that the aarch64 IL is different at thread3 time in 
that it has 2 profitable sub-paths that can now be threaded with my 
patch.  This is causing the threaded count to be 5 for aarch64, versus 3 
for x86-64.  Previously we couldn't thread these in aarch64, so the 
backwards threader would bail.

One can see the different threading opportunities by sticking 
debug_all_paths() at the top of thread_through_all_blocks().  You will 
notice that aarch64 has far more candidates to begin with.  The IL on 
the x86 backend, has no paths that start on the same BB.  The aarch64, 
on the other hand, has many to choose from:

path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11,
path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 19,

Some of these prove unprofitable, but 2 more than before are profitable now.

BTW, I see another threading related failure on aarch64 which is 
unrelated to my patch, and was previously there:

FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump-not vrp2 "Jumps 
threaded"

This is probably another IL incompatibility between architectures.

Anyways... the attached patch fixes the regression.  I have added a note 
to the test explaining the IL differences.  We really should rewrite all 
the threading tests (I am NOT volunteering ;-)).

OK for trunk?
Aldy
gcc/testsuite/

	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust test because aarch64
	has a slightly different IL that provides more threading
	opportunities.

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index 9ee8d12010b..e395de26ec0 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -2,11 +2,16 @@
 /* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
 /* { dg-final { scan-tree-dump "Jumps threaded: 16"  "thread1" } } */
 /* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread2" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 3" "thread3" } } */
 /* { dg-final { scan-tree-dump "Jumps threaded: 1"  "dom2" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom3" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "vrp2" } } */
 
+/* Most architectures get 3 threadable paths here, whereas aarch64 and
+   possibly others get 5.  We really should rewrite threading tests to
+   test a specific IL sequence, not gobs of code whose IL can vary
+   from architecture to architecture.  */
+/* { dg-final { scan-tree-dump "Jumps threaded: \[35\]" "thread3" } } */
+
 enum STATE {
   S0=0,
   SI,
Jeff Law July 4, 2018, 12:16 a.m. UTC | #8
On 07/03/2018 03:31 AM, Aldy Hernandez wrote:
> On 07/02/2018 07:08 AM, Christophe Lyon wrote:
> 
>>>> On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
>>>>> While poking around in the backwards threader I noticed that we bail if
>>>>>
>>>>> we have already seen a starting BB.
>>>>>
>>>>>         /* Do not jump-thread twice from the same block.  */
>>>>>         if (bitmap_bit_p (threaded_blocks, entry->src->index)
>>>>>
>>>>> This limitation discards paths that are sub-paths of paths that have
>>>>> already been threaded.
>>>>>
>>>>> The following patch scans the remaining to-be-threaded paths to identify
>>>>>
>>>>> if any of them start from the same point, and are thus sub-paths of the
>>>>>
>>>>> just-threaded path.  By removing the common prefix of blocks in upcoming
>>>>>
>>>>> threadable paths, and then rewiring first non-common block
>>>>> appropriately, we expose new threading opportunities, since we are no
>>>>> longer starting from the same BB.  We also simplify the would-be
>>>>> threaded paths, because we don't duplicate already duplicated paths.
> [snip]
>> Hi,
>>
>> I've noticed a regression on aarch64:
>> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump thread3 "Jumps
>> threaded: 3"
>> very likely caused by this patch (appeared between 262282 and 262294)
>>
>> Christophe
> 
> The test needs to be adjusted here.
> 
> The long story is that the aarch64 IL is different at thread3 time in
> that it has 2 profitable sub-paths that can now be threaded with my
> patch.  This is causing the threaded count to be 5 for aarch64, versus 3
> for x86 64.  Previously we couldn't thread these in aarch64, so the
> backwards threader would bail.
> 
> One can see the different threading opportunities by sticking
> debug_all_paths() at the top of thread_through_all_blocks().  You will
> notice that aarch64 has far more candidates to begin with.  The IL on
> the x86 backend, has no paths that start on the same BB.  The aarch64,
> on the other hand, has many to choose from:
> 
> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11,
> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 19,
> 
> Some of these prove unprofitable, but 2 more than before are profitable now.
> 
> 
> BTW, I see another threading related failure on aarch64 which is
> unrelated to my patch, and was previously there:
> 
> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump-not vrp2 "Jumps
> threaded"
> 
> This is probably another IL incompatibility between architectures.
> 
> Anyways... the attached path fixes the regression.  I have added a note
> to the test explaining the IL differences.  We really should rewrite all
> the threading tests (I am NOT volunteering ;-)).
> 
> OK for trunk?
> Aldy
> 
> curr.patch
> 
> 
> gcc/testsuite/
> 
> 	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust test because aarch64
> 	has a slightly different IL that provides more threading
> 	opportunities.
OK.

WRT rewriting the tests.  I'd certainly agree that we don't have the
right set of knobs to allow us to characterize the target nor do we have
the right dumping/scanning facilities to describe and query the CFG changes.

The fact that the IL changes so much across targets is a sign that
target dependency (probably BRANCH_COST) is twiddling the gimple we
generate.  I strongly suspect we'd be a lot better off if we tackled the
BRANCH_COST problem first.

jeff

ps.  That particular test is the test which led to the creation of the
backwards jump threader :-)
Aldy Hernandez July 4, 2018, 8:12 a.m. UTC | #9
On 07/03/2018 08:16 PM, Jeff Law wrote:
> On 07/03/2018 03:31 AM, Aldy Hernandez wrote:
>> On 07/02/2018 07:08 AM, Christophe Lyon wrote:
>>
>>>>> On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
>>>>>> While poking around in the backwards threader I noticed that we bail if
>>>>>>
>>>>>> we have already seen a starting BB.
>>>>>>
>>>>>>          /* Do not jump-thread twice from the same block.  */
>>>>>>          if (bitmap_bit_p (threaded_blocks, entry->src->index)
>>>>>>
>>>>>> This limitation discards paths that are sub-paths of paths that have
>>>>>> already been threaded.
>>>>>>
>>>>>> The following patch scans the remaining to-be-threaded paths to identify
>>>>>>
>>>>>> if any of them start from the same point, and are thus sub-paths of the
>>>>>>
>>>>>> just-threaded path.  By removing the common prefix of blocks in upcoming
>>>>>>
>>>>>> threadable paths, and then rewiring first non-common block
>>>>>> appropriately, we expose new threading opportunities, since we are no
>>>>>> longer starting from the same BB.  We also simplify the would-be
>>>>>> threaded paths, because we don't duplicate already duplicated paths.
>> [snip]
>>> Hi,
>>>
>>> I've noticed a regression on aarch64:
>>> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump thread3 "Jumps
>>> threaded: 3"
>>> very likely caused by this patch (appeared between 262282 and 262294)
>>>
>>> Christophe
>>
>> The test needs to be adjusted here.
>>
>> The long story is that the aarch64 IL is different at thread3 time in
>> that it has 2 profitable sub-paths that can now be threaded with my
>> patch.  This is causing the threaded count to be 5 for aarch64, versus 3
>> for x86 64.  Previously we couldn't thread these in aarch64, so the
>> backwards threader would bail.
>>
>> One can see the different threading opportunities by sticking
>> debug_all_paths() at the top of thread_through_all_blocks().  You will
>> notice that aarch64 has far more candidates to begin with.  The IL on
>> the x86 backend, has no paths that start on the same BB.  The aarch64,
>> on the other hand, has many to choose from:
>>
>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11,
>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 19,
>>
>> Some of these prove unprofitable, but 2 more than before are profitable now.
>>
>>
>> BTW, I see another threading related failure on aarch64 which is
>> unrelated to my patch, and was previously there:
>>
>> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump-not vrp2 "Jumps
>> threaded"
>>
>> This is probably another IL incompatibility between architectures.
>>
>> Anyways... the attached path fixes the regression.  I have added a note
>> to the test explaining the IL differences.  We really should rewrite all
>> the threading tests (I am NOT volunteering ;-)).
>>
>> OK for trunk?
>> Aldy
>>
>> curr.patch
>>
>>
>> gcc/testsuite/
>>
>> 	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust test because aarch64
>> 	has a slightly different IL that provides more threading
>> 	opportunities.
> OK.
> 
> WRT rewriting the tests.  I'd certainly agree that we don't have the
> right set of knobs to allow us to characterize the target nor do we have
> the right dumping/scanning facilities to describe and query the CFG changes.
> 
> The fact that the IL changes so much across targets is a sign that
> target dependency (probably BRANCH_COST) is twiddling the gimple we
> generate.  I strongly suspect we'd be a lot better off if we tackled the
> BRANCH_COST problem first.

Huh.  I've always accepted differing IL between architectures as a 
necessary evil for things like auto-vectorization and the like.

What's the ideal plan here? A knob to set default values for target 
dependent variables that can affect IL layout?  Then we could pass 
-fthis-is-an-IL-test and things be normalized?

Thanks.
Aldy
Richard Biener July 4, 2018, 9:01 a.m. UTC | #10
On Wed, Jul 4, 2018 at 10:12 AM Aldy Hernandez <aldyh@redhat.com> wrote:
>
>
>
> On 07/03/2018 08:16 PM, Jeff Law wrote:
> > On 07/03/2018 03:31 AM, Aldy Hernandez wrote:
> >> On 07/02/2018 07:08 AM, Christophe Lyon wrote:
> >>
> >>>>> On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
> >>>>>> While poking around in the backwards threader I noticed that we bail if
> >>>>>>
> >>>>>> we have already seen a starting BB.
> >>>>>>
> >>>>>>          /* Do not jump-thread twice from the same block.  */
> >>>>>>          if (bitmap_bit_p (threaded_blocks, entry->src->index)
> >>>>>>
> >>>>>> This limitation discards paths that are sub-paths of paths that have
> >>>>>> already been threaded.
> >>>>>>
> >>>>>> The following patch scans the remaining to-be-threaded paths to identify
> >>>>>>
> >>>>>> if any of them start from the same point, and are thus sub-paths of the
> >>>>>>
> >>>>>> just-threaded path.  By removing the common prefix of blocks in upcoming
> >>>>>>
> >>>>>> threadable paths, and then rewiring first non-common block
> >>>>>> appropriately, we expose new threading opportunities, since we are no
> >>>>>> longer starting from the same BB.  We also simplify the would-be
> >>>>>> threaded paths, because we don't duplicate already duplicated paths.
> >> [snip]
> >>> Hi,
> >>>
> >>> I've noticed a regression on aarch64:
> >>> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump thread3 "Jumps
> >>> threaded: 3"
> >>> very likely caused by this patch (appeared between 262282 and 262294)
> >>>
> >>> Christophe
> >>
> >> The test needs to be adjusted here.
> >>
> >> The long story is that the aarch64 IL is different at thread3 time in
> >> that it has 2 profitable sub-paths that can now be threaded with my
> >> patch.  This is causing the threaded count to be 5 for aarch64, versus 3
> >> for x86 64.  Previously we couldn't thread these in aarch64, so the
> >> backwards threader would bail.
> >>
> >> One can see the different threading opportunities by sticking
> >> debug_all_paths() at the top of thread_through_all_blocks().  You will
> >> notice that aarch64 has far more candidates to begin with.  The IL on
> >> the x86 backend, has no paths that start on the same BB.  The aarch64,
> >> on the other hand, has many to choose from:
> >>
> >> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11,
> >> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
> >> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
> >> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
> >> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
> >> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
> >> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
> >> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
> >> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 19,
> >>
> >> Some of these prove unprofitable, but 2 more than before are profitable now.
> >>
> >>
> >> BTW, I see another threading related failure on aarch64 which is
> >> unrelated to my patch, and was previously there:
> >>
> >> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump-not vrp2 "Jumps
> >> threaded"
> >>
> >> This is probably another IL incompatibility between architectures.
> >>
> >> Anyways... the attached path fixes the regression.  I have added a note
> >> to the test explaining the IL differences.  We really should rewrite all
> >> the threading tests (I am NOT volunteering ;-)).
> >>
> >> OK for trunk?
> >> Aldy
> >>
> >> curr.patch
> >>
> >>
> >> gcc/testsuite/
> >>
> >>      * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust test because aarch64
> >>      has a slightly different IL that provides more threading
> >>      opportunities.
> > OK.
> >
> > WRT rewriting the tests.  I'd certainly agree that we don't have the
> > right set of knobs to allow us to characterize the target nor do we have
> > the right dumping/scanning facilities to describe and query the CFG changes.
> >
> > The fact that the IL changes so much across targets is a sign that
> > target dependency (probably BRANCH_COST) is twiddling the gimple we
> > generate.  I strongly suspect we'd be a lot better off if we tackled the
> > BRANCH_COST problem first.
>
> Huh.  I've always accepted differing IL between architectures as a
> necessary evil for things like auto-vectorization and the like.
>
> What's the ideal plan here? A knob to set default values for target
> dependent variables that can affect IL layout?  Then we could pass
> -fthis-is-an-IL-test and things be normalized?

Use gimple testcases.  It's currently a bit awkward in some cases
but it worked for me in a few cases.

Richard.

> Thanks.
> Aldy
Jeff Law July 5, 2018, 7:27 p.m. UTC | #11
On 07/04/2018 02:12 AM, Aldy Hernandez wrote:
> 
> 
> On 07/03/2018 08:16 PM, Jeff Law wrote:
>> On 07/03/2018 03:31 AM, Aldy Hernandez wrote:
>>> On 07/02/2018 07:08 AM, Christophe Lyon wrote:
>>>
>>>>>> On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
>>>>>>> While poking around in the backwards threader I noticed that we bail if
>>>>>>>
>>>>>>>
>>>>>>> we have already seen a starting BB.
>>>>>>>
>>>>>>>          /* Do not jump-thread twice from the same block.  */
>>>>>>>          if (bitmap_bit_p (threaded_blocks, entry->src->index)
>>>>>>>
>>>>>>> This limitation discards paths that are sub-paths of paths that have
>>>>>>> already been threaded.
>>>>>>>
>>>>>>> The following patch scans the remaining to-be-threaded paths to identify
>>>>>>>
>>>>>>>
>>>>>>> if any of them start from the same point, and are thus sub-paths of the
>>>>>>>
>>>>>>>
>>>>>>> just-threaded path.  By removing the common prefix of blocks in upcoming
>>>>>>>
>>>>>>>
>>>>>>> threadable paths, and then rewiring first non-common block
>>>>>>> appropriately, we expose new threading opportunities, since we are no
>>>>>>>
>>>>>>> longer starting from the same BB.  We also simplify the would-be
>>>>>>> threaded paths, because we don't duplicate already duplicated paths.
>>> [snip]
>>>> Hi,
>>>>
>>>> I've noticed a regression on aarch64:
>>>> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump thread3 "Jumps
>>>> threaded: 3"
>>>> very likely caused by this patch (appeared between 262282 and 262294)
>>>>
>>>> Christophe
>>>
>>> The test needs to be adjusted here.
>>>
>>> The long story is that the aarch64 IL is different at thread3 time in
>>> that it has 2 profitable sub-paths that can now be threaded with my
>>> patch.  This is causing the threaded count to be 5 for aarch64, versus 3
>>> for x86 64.  Previously we couldn't thread these in aarch64, so the
>>> backwards threader would bail.
>>>
>>> One can see the different threading opportunities by sticking
>>> debug_all_paths() at the top of thread_through_all_blocks().  You will
>>> notice that aarch64 has far more candidates to begin with.  The IL on
>>> the x86 backend, has no paths that start on the same BB.  The aarch64,
>>> on the other hand, has many to choose from:
>>>
>>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11,
>>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
>>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
>>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
>>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
>>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
>>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
>>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
>>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 19,
>>>
>>> Some of these prove unprofitable, but 2 more than before are profitable now.
>>>
>>>
>>>
>>> BTW, I see another threading related failure on aarch64 which is
>>> unrelated to my patch, and was previously there:
>>>
>>> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump-not vrp2 "Jumps
>>> threaded"
>>>
>>> This is probably another IL incompatibility between architectures.
>>>
>>> Anyways... the attached path fixes the regression.  I have added a note
>>> to the test explaining the IL differences.  We really should rewrite all
>>> the threading tests (I am NOT volunteering ;-)).
>>>
>>> OK for trunk?
>>> Aldy
>>>
>>> curr.patch
>>>
>>>
>>> gcc/testsuite/
>>>
>>>     * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust test because aarch64
>>>     has a slightly different IL that provides more threading
>>>     opportunities.
>> OK.
>>
>> WRT rewriting the tests.  I'd certainly agree that we don't have the
>> right set of knobs to allow us to characterize the target nor do we have
>> the right dumping/scanning facilities to describe and query the CFG
>> changes.
>>
>> The fact that the IL changes so much across targets is a sign that
>> target dependency (probably BRANCH_COST) is twiddling the gimple we
>> generate.  I strongly suspect we'd be a lot better off if we tackled the
>> BRANCH_COST problem first.
> 
> Huh.  I've always accepted differing IL between architectures as a
> necessary evil for things like auto-vectorization and the like.
Yes.  We've made a conscious decision that introducing target
dependencies for the autovectorizer makes sense.  BRANCH_COST on the
other hand is a different beast :-)

> 
> What's the ideal plan here? A knob to set default values for target
> dependent variables that can affect IL layout?  Then we could pass
> -fthis-is-an-IL-test and things be normalized?
Well, lots of things.

I'd like decisions about how to expand branches deferred until rtl
expansion.  Kai was poking at this in the past but never really got any
traction.

Many tests should turn into gimple IL tests.

And I'd like a better framework for testing what we're doing to the IL.


Jeff
Aldy Hernandez July 9, 2018, 7:19 a.m. UTC | #12
On 07/05/2018 03:27 PM, Jeff Law wrote:
> On 07/04/2018 02:12 AM, Aldy Hernandez wrote:
>>
>>
>> On 07/03/2018 08:16 PM, Jeff Law wrote:
>>> On 07/03/2018 03:31 AM, Aldy Hernandez wrote:
>>>> On 07/02/2018 07:08 AM, Christophe Lyon wrote:
>>>>
>>>>>>> On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
>>>>>>>> While poking around in the backwards threader I noticed that we bail if
>>>>>>>>
>>>>>>>>
>>>>>>>> we have already seen a starting BB.
>>>>>>>>
>>>>>>>>           /* Do not jump-thread twice from the same block.  */
>>>>>>>>           if (bitmap_bit_p (threaded_blocks, entry->src->index)
>>>>>>>>
>>>>>>>> This limitation discards paths that are sub-paths of paths that have
>>>>>>>> already been threaded.
>>>>>>>>
>>>>>>>> The following patch scans the remaining to-be-threaded paths to identify
>>>>>>>>
>>>>>>>>
>>>>>>>> if any of them start from the same point, and are thus sub-paths of the
>>>>>>>>
>>>>>>>>
>>>>>>>> just-threaded path.  By removing the common prefix of blocks in upcoming
>>>>>>>>
>>>>>>>>
>>>>>>>> threadable paths, and then rewiring first non-common block
>>>>>>>> appropriately, we expose new threading opportunities, since we are no
>>>>>>>>
>>>>>>>> longer starting from the same BB.  We also simplify the would-be
>>>>>>>> threaded paths, because we don't duplicate already duplicated paths.
>>>> [snip]
>>>>> Hi,
>>>>>
>>>>> I've noticed a regression on aarch64:
>>>>> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump thread3 "Jumps
>>>>> threaded: 3"
>>>>> very likely caused by this patch (appeared between 262282 and 262294)
>>>>>
>>>>> Christophe
>>>>
>>>> The test needs to be adjusted here.
>>>>
>>>> The long story is that the aarch64 IL is different at thread3 time in
>>>> that it has 2 profitable sub-paths that can now be threaded with my
>>>> patch.  This is causing the threaded count to be 5 for aarch64, versus 3
>>>> for x86 64.  Previously we couldn't thread these in aarch64, so the
>>>> backwards threader would bail.
>>>>
>>>> One can see the different threading opportunities by sticking
>>>> debug_all_paths() at the top of thread_through_all_blocks().  You will
>>>> notice that aarch64 has far more candidates to begin with.  The IL on
>>>> the x86 backend, has no paths that start on the same BB.  The aarch64,
>>>> on the other hand, has many to choose from:
>>>>
>>>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11,
>>>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
>>>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
>>>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
>>>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
>>>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
>>>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
>>>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
>>>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 19,
>>>>
>>>> Some of these prove unprofitable, but 2 more than before are profitable now.
>>>>
>>>>
>>>>
>>>> BTW, I see another threading related failure on aarch64 which is
>>>> unrelated to my patch, and was previously there:
>>>>
>>>> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump-not vrp2 "Jumps
>>>> threaded"
>>>>
>>>> This is probably another IL incompatibility between architectures.
>>>>
>>>> Anyways... the attached path fixes the regression.  I have added a note
>>>> to the test explaining the IL differences.  We really should rewrite all
>>>> the threading tests (I am NOT volunteering ;-)).
>>>>
>>>> OK for trunk?
>>>> Aldy
>>>>
>>>> curr.patch
>>>>
>>>>
>>>> gcc/testsuite/
>>>>
>>>>      * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust test because aarch64
>>>>      has a slightly different IL that provides more threading
>>>>      opportunities.
>>> OK.
>>>
>>> WRT rewriting the tests.  I'd certainly agree that we don't have the
>>> right set of knobs to allow us to characterize the target nor do we have
>>> the right dumping/scanning facilities to describe and query the CFG
>>> changes.
>>>
>>> The fact that the IL changes so much across targets is a sign that
>>> target dependency (probably BRANCH_COST) is twiddling the gimple we
>>> generate.  I strongly suspect we'd be a lot better off if we tackled the
>>> BRANCH_COST problem first.
>>
>> Huh.  I've always accepted differing IL between architectures as a
>> necessary evil for things like auto-vectorization and the like.
> Yes.  We've made a conscious decision that introducing target
> dependencies for the autovectorizer makes sense.  BRANCH_COST on the
> other hand is a different beast :-)
> 
>>
>> What's the ideal plan here? A knob to set default values for target
>> dependent variables that can affect IL layout?  Then we could pass
>> -fthis-is-an-IL-test and things be normalized?
> Well, lots of things.
> 
> I'd like decisions about how to expand branches deferred until rtl
> expansion.  Kai was poking at this in the past but never really got any
> traction.

For the record, the problem in this testcase is that switch lowering is 
riddled with back-end-specific knowledge (GET_MODE_SIZE uses as well as 
some rtx cost hacks).
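
To make the shape of the problem concrete, here is a deliberately tiny toy model (not GCC's real heuristic, which consults GET_MODE_SIZE and rtx costs) of one target-dependent decision in switch lowering, showing how a single cost knob changes the control flow that later passes see:

```python
def lower_switch(case_values, jump_table_min_cases):
    """Toy model: emit a jump table only when there are at least
    `jump_table_min_cases` cases.  The threshold stands in for a
    target cost parameter; two targets with different thresholds
    lower the same source to different control flow, and hence to
    different numbers of threadable paths later."""
    if len(case_values) >= jump_table_min_cases:
        return "jump_table"
    return "branch_chain"

cases = [0, 1, 2, 3]
print(lower_switch(cases, 4))  # jump_table  (cheap indirect branch)
print(lower_switch(cases, 8))  # branch_chain (expensive indirect branch)
```

A branch chain exposes a sequence of conditional blocks the backwards threader can walk, whereas a jump table does not, which is one way the thread counts end up differing per architecture.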

> 
> Many tests should turn into gimple IL tests.

Yeah, though for tests like the threading ones, they're already 
sufficiently convoluted that turning them into gimple IL tests will make 
them even harder to read.  Oh well, I guess?

To make it easier to transition to gimple IL tests, I suppose we could 
add -fdump-tree-*-something-more-suitable-for-gimplefe ;-).  Is it by 
design that the gimple fe is sufficiently different from what we dump 
for -fdump-tree (ahem, things like phi arguments, probabilities, etc)?

> 
> And I'd like a better framework for testing what we're doing to the IL.

Sigh.  Me too.

Aldy
Richard Biener July 9, 2018, 8:29 a.m. UTC | #13
On Mon, Jul 9, 2018 at 9:19 AM Aldy Hernandez <aldyh@redhat.com> wrote:
>
>
>
> On 07/05/2018 03:27 PM, Jeff Law wrote:
> > On 07/04/2018 02:12 AM, Aldy Hernandez wrote:
> >>
> >>
> >> On 07/03/2018 08:16 PM, Jeff Law wrote:
> >>> On 07/03/2018 03:31 AM, Aldy Hernandez wrote:
> >>>> On 07/02/2018 07:08 AM, Christophe Lyon wrote:
> >>>>
> >>>>>>> On 11/07/2017 10:33 AM, Aldy Hernandez wrote:
> >>>>>>>> While poking around in the backwards threader I noticed that we bail if
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> we have already seen a starting BB.
> >>>>>>>>
> >>>>>>>>           /* Do not jump-thread twice from the same block.  */
> >>>>>>>>           if (bitmap_bit_p (threaded_blocks, entry->src->index)
> >>>>>>>>
> >>>>>>>> This limitation discards paths that are sub-paths of paths that have
> >>>>>>>> already been threaded.
> >>>>>>>>
> >>>>>>>> The following patch scans the remaining to-be-threaded paths to identify
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> if any of them start from the same point, and are thus sub-paths of the
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> just-threaded path.  By removing the common prefix of blocks in upcoming
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> threadable paths, and then rewiring first non-common block
> >>>>>>>> appropriately, we expose new threading opportunities, since we are no
> >>>>>>>>
> >>>>>>>> longer starting from the same BB.  We also simplify the would-be
> >>>>>>>> threaded paths, because we don't duplicate already duplicated paths.
> >>>> [snip]
> >>>>> Hi,
> >>>>>
> >>>>> I've noticed a regression on aarch64:
> >>>>> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump thread3 "Jumps
> >>>>> threaded: 3"
> >>>>> very likely caused by this patch (appeared between 262282 and 262294)
> >>>>>
> >>>>> Christophe
> >>>>
> >>>> The test needs to be adjusted here.
> >>>>
> >>>> The long story is that the aarch64 IL is different at thread3 time in
> >>>> that it has 2 profitable sub-paths that can now be threaded with my
> >>>> patch.  This is causing the threaded count to be 5 for aarch64, versus 3
> >>>> for x86 64.  Previously we couldn't thread these in aarch64, so the
> >>>> backwards threader would bail.
> >>>>
> >>>> One can see the different threading opportunities by sticking
> >>>> debug_all_paths() at the top of thread_through_all_blocks().  You will
> >>>> notice that aarch64 has far more candidates to begin with.  The IL on
> >>>> the x86 backend, has no paths that start on the same BB.  The aarch64,
> >>>> on the other hand, has many to choose from:
> >>>>
> >>>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11,
> >>>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
> >>>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16,
> >>>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
> >>>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
> >>>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 11, 11 -> 35,
> >>>> path: 52 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
> >>>> path: 51 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 17,
> >>>> path: 53 -> 56, 56 -> 57, 57 -> 58, 58 -> 10, 10 -> 16, 16 -> 19,
> >>>>
> >>>> Some of these prove unprofitable, but 2 more than before are profitable now.
> >>>>
> >>>>
> >>>>
> >>>> BTW, I see another threading related failure on aarch64 which is
> >>>> unrelated to my patch, and was previously there:
> >>>>
> >>>> FAIL: gcc.dg/tree-ssa/ssa-dom-thread-7.c scan-tree-dump-not vrp2 "Jumps
> >>>> threaded"
> >>>>
> >>>> This is probably another IL incompatibility between architectures.
> >>>>
> >>>> Anyways... the attached path fixes the regression.  I have added a note
> >>>> to the test explaining the IL differences.  We really should rewrite all
> >>>> the threading tests (I am NOT volunteering ;-)).
> >>>>
> >>>> OK for trunk?
> >>>> Aldy
> >>>>
> >>>> curr.patch
> >>>>
> >>>>
> >>>> gcc/testsuite/
> >>>>
> >>>>      * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust test because aarch64
> >>>>      has a slightly different IL that provides more threading
> >>>>      opportunities.
> >>> OK.
> >>>
> >>> WRT rewriting the tests.  I'd certainly agree that we don't have the
> >>> right set of knobs to allow us to characterize the target nor do we have
> >>> the right dumping/scanning facilities to describe and query the CFG
> >>> changes.
> >>>
> >>> The fact that the IL changes so much across targets is a sign that
> >>> target dependency (probably BRANCH_COST) is twiddling the gimple we
> >>> generate.  I strongly suspect we'd be a lot better off if we tackled the
> >>> BRANCH_COST problem first.
> >>
> >> Huh.  I've always accepted differing IL between architectures as a
> >> necessary evil for things like auto-vectorization and the like.
> > Yes.  We've made a conscious decision that introducing target
> > dependencies for the autovectorizer makes sense.  BRANCH_COST on the
> > other hand is a different beast :-)
> >
> >>
> >> What's the ideal plan here? A knob to set default values for target
> >> dependent variables that can affect IL layout?  Then we could pass
> >> -fthis-is-an-IL-test and things be normalized?
> > Well, lots of things.
> >
> > I'd like decisions about how to expand branches deferred until rtl
> > expansion.  Kai was poking at this in the past but never really got any
> > traction.
>
> For the record, the problem in this testcase is that switch lowering is
> riddled with back end specific knowledge (GET_MODE_SIZE uses as well as
> some rtx cost hacks).
>
> >
> > Many tests should turn into gimple IL tests.
>
> Yeah, though for tests like the threading ones, they're already
> sufficiently convoluted that turning them into gimple IL tests will make
> them even harder to read.  Oh well, I guess?
>
> To make it easier to transition to gimple IL tests, I suppose we could
> add -fdump-tree-*-something-more-suitable-for-gimplefe ;-).

There is -fdump-tree-*-gimple ;)

> Is it by
> design that the gimple fe is sufficiently different from what we dump
> for -fdump-tree (ahem, things like phi arguments, probabilities, etc)?

Well.  The only reason is that I didn't want to adjust all testcases so
I introduced the -gimple dump modifier...

Note there's still some "magic" fiddling needed to make the GIMPLE FE
grok -gimple dump files as well as working around limitations with the
current way of having SSA/CFG testcases.

> >
> > And I'd like a better framework for testing what we're doing to the IL.
>
> Sigh.  Me too.

Well, the GIMPLE FE is supposed to be that - it's just not perfect
(or rather incomplete).

Richard.

> Aldy
Aldy Hernandez July 9, 2018, 8:46 a.m. UTC | #14
On 07/09/2018 04:29 AM, Richard Biener wrote:
> On Mon, Jul 9, 2018 at 9:19 AM Aldy Hernandez <aldyh@redhat.com> wrote:
>>
>>
>>
>> On 07/05/2018 03:27 PM, Jeff Law wrote:
>>> On 07/04/2018 02:12 AM, Aldy Hernandez wrote:

>>> Many tests should turn into gimple IL tests.
>>
>> Yeah, though for tests like the threading ones, they're already
>> sufficiently convoluted that turning them into gimple IL tests will make
>> them even harder to read.  Oh well, I guess?
>>
>> To make it easier to transition to gimple IL tests, I suppose we could
>> add -fdump-tree-*-something-more-suitable-for-gimplefe ;-).
> 
> There is -fdump-tree-*-gimple ;)

Ahh!!!  Thanks.

> 
>> Is it by
>> design that the gimple fe is sufficiently different from what we dump
>> for -fdump-tree (ahem, things like phi arguments, probabilities, etc)?
> 
> Well.  The only reason is that I didn't want to adjust all testcases, so
> I introduced the -gimple dump modifier...
> 
> Note there's still some "magic" fiddling needed to make the GIMPLE FE
> grok -gimple dump files, as well as to work around limitations of the
> current way of having SSA/CFG testcases.

I'll keep it in mind as I deal with threading going forward.

> 
>>>
>>> And I'd like a better framework for testing what we're doing to the IL.
>>
>> Sigh.  Me too.
> 
> Well, the GIMPLE FE is supposed to be that - it's just not perfect
> (or rather incomplete).

Hey, it's pretty good IMO :).

Thanks.
Aldy
Jeff Law July 9, 2018, 7:56 p.m. UTC | #15
On 07/09/2018 01:19 AM, Aldy Hernandez wrote:
>>
>> I'd like decisions about how to expand branches deferred until rtl
>> expansion.  Kai was poking at this in the past but never really got any
>> traction.
> 
> For the record, the problem in this testcase is that switch lowering is
> riddled with back end specific knowledge (GET_MODE_SIZE uses as well as
> some rtx cost hacks).
Yea.  Switch lowering is going to have some of these as well, though I
think BRANCH_COST is more pervasive.

> 
>>
>> Many tests should turn into gimple IL tests.
> 
> Yeah, though for tests like the threading ones, they're already
> sufficiently convoluted that turning them into gimple IL tests will make
> them even harder to read.  Oh well, I guess?
It might make them harder to read, but it would guarantee consistent
gimple fed into the optimizer across our targets which in turn ought to
result in consistent behavior by the optimizer which in turn should
simplify the test and make them more consistent over time.

Jeff
Aldy Hernandez July 10, 2018, 8:43 a.m. UTC | #16
On 07/09/2018 03:56 PM, Jeff Law wrote:
> On 07/09/2018 01:19 AM, Aldy Hernandez wrote:
>>>
>>> I'd like decisions about how to expand branches deferred until rtl
>>> expansion.  Kai was poking at this in the past but never really got any
>>> traction.
>>
>> For the record, the problem in this testcase is that switch lowering is
>> riddled with back end specific knowledge (GET_MODE_SIZE uses as well as
>> some rtx cost hacks).
> Yea.  Switch lowering is going to have some of these as well, though I
> think BRANCH_COST is more pervasive.
> 
>>
>>>
>>> Many tests should turn into gimple IL tests.
>>
>> Yeah, though for tests like the threading ones, they're already
>> sufficiently convoluted that turning them into gimple IL tests will make
>> them even harder to read.  Oh well, I guess?
> It might make them harder to read, but it would guarantee consistent
> gimple fed into the optimizer across our targets which in turn ought to
> result in consistent behavior by the optimizer which in turn should
> simplify the test and make them more consistent over time.

Ok.  When I submit my queued-up range-based changes to the threader, I'll
see if I can convert a big chunk of the threader tests to gimple IL.

Patch
diff mbox series

gcc/

	* tree-ssa-threadupdate.c (mark_threaded_blocks): Avoid
	dereferencing path[] beyond its length.
	(debug_path): New.
	(debug_all_paths): New.
	(rewire_first_differing_edge): New.
	(adjust_paths_after_duplication): New.
	(duplicate_thread_path): Call adjust_paths_after_duplication.
	Add new argument.
	(thread_through_all_blocks): Add new argument to
	duplicate_thread_path.

diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index 1dab0f1fab4..53ac7181b4b 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -1763,7 +1763,8 @@  mark_threaded_blocks (bitmap threaded_blocks)
     {
       vec<jump_thread_edge *> *path = paths[i];
 
-      if ((*path)[1]->type != EDGE_COPY_SRC_JOINER_BLOCK)
+      if (path->length () > 1
+	  && (*path)[1]->type != EDGE_COPY_SRC_JOINER_BLOCK)
 	{
 	  edge e = (*path)[0]->e;
 	  e->aux = (void *)path;
@@ -1783,7 +1784,8 @@  mark_threaded_blocks (bitmap threaded_blocks)
     {
       vec<jump_thread_edge *> *path = paths[i];
 
-      if ((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK)
+      if (path->length () > 1
+	  && (*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK)
 	{
 	  /* Attach the path to the starting edge if none is yet recorded.  */
 	  if ((*path)[0]->e->aux == NULL)
@@ -1812,7 +1814,8 @@  mark_threaded_blocks (bitmap threaded_blocks)
       vec<jump_thread_edge *> *path = paths[i];
       edge e = (*path)[0]->e;
 
-      if ((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && e->aux == path)
+      if (path->length () > 1
+	  && (*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && e->aux == path)
 	{
 	  unsigned int j;
 	  for (j = 1; j < path->length (); j++)
@@ -1978,6 +1981,169 @@  bb_in_bbs (basic_block bb, basic_block *bbs, int n)
   return false;
 }
 
+DEBUG_FUNCTION void
+debug_path (FILE *dump_file, int pathno)
+{
+  vec<jump_thread_edge *> *p = paths[pathno];
+  fprintf (dump_file, "path: ");
+  for (unsigned i = 0; i < p->length (); ++i)
+    fprintf (dump_file, "%d -> %d, ",
+	     (*p)[i]->e->src->index, (*p)[i]->e->dest->index);
+  fprintf (dump_file, "\n");
+}
+
+DEBUG_FUNCTION void
+debug_all_paths ()
+{
+  for (unsigned i = 0; i < paths.length (); ++i)
+    debug_path (stderr, i);
+}
+
+/* Rewire a jump_thread_edge so that the source block is now a
+   threaded source block.
+
+   PATH_NUM is an index into the global path table PATHS.
+   EDGE_NUM is the jump thread edge number into said path.
+
+   Returns TRUE if we were able to successfully rewire the edge.  */
+
+static bool
+rewire_first_differing_edge (unsigned path_num, unsigned edge_num)
+{
+  vec<jump_thread_edge *> *path = paths[path_num];
+  edge &e = (*path)[edge_num]->e;
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "rewiring edge candidate: %d -> %d\n",
+	     e->src->index, e->dest->index);
+  basic_block src_copy = get_bb_copy (e->src);
+  if (src_copy == NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "ignoring candidate: there is no src COPY\n");
+      return false;
+    }
+  edge new_edge = find_edge (src_copy, e->dest);
+  /* If the previously threaded paths created a flow graph where we
+     can no longer figure out where to go, give up.  */
+  if (new_edge == NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "ignoring candidate: we lost our way\n");
+      return false;
+    }
+  e = new_edge;
+  return true;
+}
+
+/* After an FSM path has been jump threaded, adjust the remaining FSM
+   paths that are subsets of this path, so these paths can be safely
+   threaded within the context of the new threaded path.
+
+   For example, suppose we have just threaded:
+
+   5 -> 6 -> 7 -> 8 -> 12	=>	5 -> 6' -> 7' -> 8' -> 12'
+
+   And we have an upcoming threading candidate:
+   5 -> 6 -> 7 -> 8 -> 15 -> 20
+
+   This function adjusts the upcoming path into:
+   8' -> 15 -> 20
+
+   CURR_PATH_NUM is an index into the global paths table.  It
+   specifies the path that was just threaded.  */
+
+static void
+adjust_paths_after_duplication (unsigned curr_path_num)
+{
+  vec<jump_thread_edge *> *curr_path = paths[curr_path_num];
+  gcc_assert ((*curr_path)[0]->type == EDGE_FSM_THREAD);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "just threaded: ");
+      debug_path (dump_file, curr_path_num);
+    }
+
+  /* Iterate through all the other paths and adjust them.  */
+  for (unsigned cand_path_num = 0; cand_path_num < paths.length (); )
+    {
+      if (cand_path_num == curr_path_num)
+	{
+	  ++cand_path_num;
+	  continue;
+	}
+      /* Make sure the candidate to adjust starts with the same path
+	 as the recently threaded path and is an FSM thread.  */
+      vec<jump_thread_edge *> *cand_path = paths[cand_path_num];
+      if ((*cand_path)[0]->type != EDGE_FSM_THREAD
+	  || (*cand_path)[0]->e != (*curr_path)[0]->e)
+	{
+	  ++cand_path_num;
+	  continue;
+	}
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "adjusting candidate: ");
+	  debug_path (dump_file, cand_path_num);
+	}
+
+      /* Chop off from the candidate path any prefix it shares with
+	 the recently threaded path.  */
+      unsigned minlength = MIN (curr_path->length (), cand_path->length ());
+      unsigned j;
+      for (j = 0; j < minlength; ++j)
+	{
+	  edge cand_edge = (*cand_path)[j]->e;
+	  edge curr_edge = (*curr_path)[j]->e;
+
+	  /* Once the prefix no longer matches, adjust the first
+	     non-matching edge to point from an adjusted edge to
+	     wherever it was going.  */
+	  if (cand_edge != curr_edge)
+	    {
+	      gcc_assert (cand_edge->src == curr_edge->src);
+	      if (!rewire_first_differing_edge (cand_path_num, j))
+		goto remove_candidate_from_list;
+	      break;
+	    }
+	}
+      if (j == minlength)
+	{
+	  /* If we consumed the max subgraph we could look at, and
+	     still didn't find any different edges, it's the
+	     last edge after MINLENGTH.  */
+	  if (cand_path->length () > minlength)
+	    {
+	      if (!rewire_first_differing_edge (cand_path_num, j))
+		goto remove_candidate_from_list;
+	    }
+	  else if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "adjusting first edge after MINLENGTH.\n");
+	}
+      if (j > 0)
+	{
+	  /* If we are removing everything, delete the entire candidate.  */
+	  if (j == cand_path->length ())
+	    {
+	    remove_candidate_from_list:
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		fprintf (dump_file, "adjusted candidate: [EMPTY]\n");
+	      delete_jump_thread_path (cand_path);
+	      paths.unordered_remove (cand_path_num);
+	      continue;
+	    }
+	  /* Otherwise, just remove the redundant sub-path.  */
+	  cand_path->block_remove (0, j);
+	}
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "adjusted candidate: ");
+	  debug_path (dump_file, cand_path_num);
+	}
+      ++cand_path_num;
+    }
+}
+
 /* Duplicates a jump-thread path of N_REGION basic blocks.
    The ENTRY edge is redirected to the duplicate of the region.
 
@@ -1985,11 +2151,14 @@  bb_in_bbs (basic_block bb, basic_block *bbs, int n)
    and create a single fallthru edge pointing to the same destination as the
    EXIT edge.
 
+   CURRENT_PATH_NO is an index into the global paths[] table
+   specifying the jump-thread path.
+
    Returns false if it is unable to copy the region, true otherwise.  */
 
 static bool
 duplicate_thread_path (edge entry, edge exit, basic_block *region,
-		       unsigned n_region)
+		       unsigned n_region, unsigned current_path_no)
 {
   unsigned i;
   struct loop *loop = entry->dest->loop_father;
@@ -2000,6 +2169,12 @@  duplicate_thread_path (edge entry, edge exit, basic_block *region,
   if (!can_copy_bbs_p (region, n_region))
     return false;
 
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\nabout to thread: ");
+      debug_path (dump_file, current_path_no);
+    }
+
   /* Some sanity checking.  Note that we do not check for all possible
      missuses of the functions.  I.e. if you ask to copy something weird,
      it will work, but the state of structures probably will not be
@@ -2128,6 +2303,8 @@  duplicate_thread_path (edge entry, edge exit, basic_block *region,
 
   free (region_copy);
 
+  adjust_paths_after_duplication (current_path_no);
+
   free_original_copy_tables ();
   return true;
 }
@@ -2251,7 +2428,7 @@  thread_through_all_blocks (bool may_peel_loop_headers)
       for (unsigned int j = 0; j < len - 1; j++)
 	region[j] = (*path)[j]->e->dest;
 
-      if (duplicate_thread_path (entry, exit, region, len - 1))
+      if (duplicate_thread_path (entry, exit, region, len - 1, i))
 	{
 	  /* We do not update dominance info.  */
 	  free_dominance_info (CDI_DOMINATORS);