
Improving jump-thread pass for PR 54742

Message ID 20141123222207.GA24073@f1.c.bardezibar.internal
State New

Commit Message

Sebastian Pop Nov. 23, 2014, 10:22 p.m. UTC
Jeff Law wrote:
> >PS: I have run some perf analysis with the patch:
> >- on a bootstrap of GCC I see 3209 FSM jump threads
> >- libpng and libjpeg contain FSM jump threads, the perf increase is in the 1%
> >   (measured on simulators and reduced data sets)
> >- coremark gets jump threaded (as expected)
> >- I'm setting up the llvm test-suite and I will report perf differences
> So that's *far* more jump threads than I would expect this to find
> in a bootstrap of GCC -- like 3 orders of magnitude more than I'd
> expect to find.

The second patch attached limits the search for FSM jump threads to loops.  With
that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
(and 424 jump threads in a powerpc64-linux bootstrap).

> I haven't dug deep, but the first level analysis is not encouraging.
> 
> Basically I used the trunk compiler with and without your patch to
> build gcc-4.7.3's cc1 (4.7.3 simply because that's what I last used
> this testing framework).  So that gives me two cc1s that I then use
> to compile a bunch of .i files under valgrind's (cachegrind)
> control.
> 
> valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ......
> 
> That gives me two hunks of data for each input file I test.
> Specifically I get the dynamic number of instructions and the
> dynamic number of branches executed.
> 
> For jump threading those values correspond directly to the effect
> we're looking for -- a reduction in dynamic conditional jumps and a
> reduction in dynamic instructions executed.  Typically the change in
> dynamic instructions executed is 2-3X the change in dynamic
> conditional jumps -- which makes sense as removing the conditional
> jump usually means we remove a comparison and possibly some setup
> code as well.
> 
> Consistently with your patch those values get worse.   Across the
> entire set of .i files I get
> 
> For the trunk:
> 
> instructions:1339016494968
> branches:     243568982489
> 
> With your patch:
> 
> instructions:1339739533291
> branches:     243806615986
> 
> 
> So that's 723038323 more instructions and 237633497 more branches
> after installing your patch.  While we're looking a just under .1%
> regression in dynamic branches, that's a terrible result for this
> work.

One of the reasons I think we see more branches is that in sese region copying we
do not use the knowledge of the value of the condition for the last branch in a
jump-thread path: we rely on other propagation passes to remove the branch.  The
last attached patch adds:

  /* Remove the last branch in the jump thread path.  */
  remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], exit->dest);

Please let me know if the attached patches are producing better results on gcc.

I rebased the original patch on trunk and all patches bootstrapped together on
x86_64-linux and powerpc64-linux.

Thanks,
Sebastian

Comments

Jeff Law Nov. 24, 2014, 9 p.m. UTC | #1
On 11/23/14 15:22, Sebastian Pop wrote:
> The second patch attached limits the search for FSM jump threads to loops.  With
> that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
> (and 424 jump threads on powerpc64-linux bootstrap.)
>
Yea, that was one of the things I was going to poke at as well, as a 
quick scan of your patch gave me the impression it wasn't limited to loops.

Again, I haven't looked much at the patch, but I got the impression 
you're doing a backwards walk through the predecessors to discover the 
result of the COND_EXPR.  Correct?

That's something I'd been wanting to do -- basically start with a 
COND_EXPR, then walk the dataflow backwards substituting values into the 
COND_EXPR (possibly creating non-gimple).  Ultimately the goal is to 
substitute and fold, getting to a constant :-)

The forward exhaustive stuff we do now is crazy.  The backwards 
approach could be decoupled from DOM & VRP into an independent pass, 
which I think would be wise.

Using a SEME region copier is also something I really wanted to do long 
term.  In fact, I believe a lot of tree-ssa-threadupdate.c ought to be 
ripped out and replaced with a SEME based copier.

It appears you've built at least parts of the two pieces needed to do all 
this as a Bodik-style optimizer.  Which is exactly the long term direction I 
think this code ought to take.


>
> One of the reasons I think we see more branches is that in sese region copying we
> do not use the knowledge of the value of the condition for the last branch in a
> jump-thread path: we rely on other propagation passes to remove the branch.  The
> last attached patch adds:
>
>    /* Remove the last branch in the jump thread path.  */
>    remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], exit->dest);
That's certainly a possibility.  But I would expect that even with this 
limitation something would be picking up the fact that the branch is 
statically computable (even if it's an RTL optimizer).  But it's 
definitely something to look for.

>
> Please let me know if the attached patches are producing better results on gcc.

For the trunk:
   instructions:1339016494968
   branches:     243568982489

First version of your patch:

   instructions:1339739533291
   branches:     243806615986

Latest version of your patch:

   instructions:1339749122609
   branches:     243809838262

Which is in the noise for this test.  Which makes me wonder if I botched 
something on the latest run.  It doesn't appear so, but I'm re-running 
just to be sure.  I'm also turning on -g so that I can use cg_annotate 
to poke a bit deeper and perhaps identify one or more concrete examples 
where your patch is making this worse.

Jeff
Sebastian Pop Nov. 24, 2014, 10:05 p.m. UTC | #2
Jeff Law wrote:
> On 11/23/14 15:22, Sebastian Pop wrote:
> >The second patch attached limits the search for FSM jump threads to loops.  With
> >that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
> >(and 424 jump threads on powerpc64-linux bootstrap.)
> >
> Yea, that was one of the things I was going to poke at as well as a
> quick scan of your patch gave me the impression it wasn't limited to
> loops.
> 
> Again, I haven't looked much at the patch, but I got the impression
> you're doing a backwards walk through the predecessors to discover
> the result of the COND_EXPR.  Correct?

Yes.

> 
> That's something I'd been wanting to do -- basically start with a
> COND_EXPR, then walk the dataflow backwards substituting values into
> the COND_EXPR (possibly creating non-gimple).  Ultimately the goal
> is to substitute and fold, getting to a constant :-)
> 
> The forward exhaustive stuff we do now is, crazy.   The backwards
> approach could be decoupled from DOM & VRP into an independent pass,
> which I think would be wise.
> 
> Using a SEME region copier is also something I really wanted to do
> long term.  In fact, I believe a lot of tree-ssa-threadupdate.c
> ought to be ripped out and replaced with a SEME based copier.

I did an experiment along these lines over the weekend, and now that you
mention it, I feel less shy to speak about it; the patch does not yet pass
bootstrap, and there are still about 20 failing test cases.  I feel better
reading the code generation part of jump threading after this patch ;-)
Basically I think all of tree-ssa-threadupdate.c can be replaced by
duplicate_seme_region, which generalizes the code generation.

> 
> It appears you've built at least parts of two pieces needed to all
> this as a Bodik style optimizer.  Which is exactly the long term
> direction I think this code ought to take.
> 
> 
> >
> >One of the reasons I think we see more branches is that in sese region copying we
> >do not use the knowledge of the value of the condition for the last branch in a
> >jump-thread path: we rely on other propagation passes to remove the branch.  The
> >last attached patch adds:
> >
> >   /* Remove the last branch in the jump thread path.  */
> >   remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], exit->dest);
> That's certainly a possibility.  But I would expect that even with
> this limitation something would be picking up the fact that the
> branch is statically computable (even if it's an RTL optimizer).
> But it's definitely something to look for.
> 
> >
> >Please let me know if the attached patches are producing better results on gcc.
> 
> For the trunk:
>   instructions:1339016494968
>   branches     :243568982489
> 
> First version of your patch:
> 
>   instructions:1339739533291
>   branches:     243806615986
> 
> Latest version of your patch:
> 
>   instructions:1339749122609
>   branches:     243809838262

I think I got about the same results.

I got my scripts installed on the gcc-farm.  I first used gcc75, an x86_64
machine, where valgrind crashed because it did not recognize how to decode an
instruction.  Then I moved to gcc112, a powerpc64-linux machine, where I got
this data from the stage2 cc1plus compiling the same file alias.ii at -O2
(I did 3 runs of each, mostly because there is a bit of noise in all these
numbers):

$ valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ./cc1plus -O2 ~/alias.ii

all 4 patches:

==153617== I   refs:      13,914,038,211
==153617== 
==153617== Branches:       1,926,407,760  (1,879,827,481 cond + 46,580,279 ind)
==153617== Mispredicts:      144,890,904  (  132,094,105 cond + 12,796,799 ind)
==153617== Mispred rate:             7.5% (          7.0%     +       27.4%   )

==34993== I   refs:      13,915,335,629
==34993== 
==34993== Branches:       1,926,597,919  (1,880,017,558 cond + 46,580,361 ind)
==34993== Mispredicts:      144,974,266  (  132,177,440 cond + 12,796,826 ind)
==34993== Mispred rate:             7.5% (          7.0%     +       27.4%   )

==140841== I   refs:      13,915,334,459
==140841== 
==140841== Branches:       1,926,597,819  (1,880,017,458 cond + 46,580,361 ind)
==140841== Mispredicts:      144,974,296  (  132,177,470 cond + 12,796,826 ind)
==140841== Mispred rate:             7.5% (          7.0%     +       27.4%   )

patch 1:

==99902== I   refs:      13,915,069,710
==99902== 
==99902== Branches:       1,926,963,813  (1,880,376,148 cond + 46,587,665 ind)
==99902== Mispredicts:      145,501,564  (  132,656,576 cond + 12,844,988 ind)
==99902== Mispred rate:             7.5% (          7.0%     +       27.5%   )

==3907== I   refs:      13,915,082,469
==3907== 
==3907== Branches:       1,926,965,218  (1,880,377,471 cond + 46,587,747 ind)
==3907== Mispredicts:      145,501,569  (  132,656,554 cond + 12,845,015 ind)
==3907== Mispred rate:             7.5% (          7.0%     +       27.5%   )

==44271== I   refs:      13,915,111,997
==44271== 
==44271== Branches:       1,926,968,863  (1,880,380,952 cond + 46,587,911 ind)
==44271== Mispredicts:      145,501,858  (  132,656,789 cond + 12,845,069 ind)
==44271== Mispred rate:             7.5% (          7.0%     +       27.5%   )

master no-patch:

==129233== I   refs:      13,910,221,913
==129233== 
==129233== Branches:       1,925,715,095  (1,879,277,776 cond + 46,437,319 ind)
==129233== Mispredicts:      144,133,332  (  131,510,534 cond + 12,622,798 ind)
==129233== Mispred rate:             7.4% (          6.9%     +       27.1%   )

==147659== I   refs:      13,910,216,249
==147659== 
==147659== Branches:       1,925,714,029  (1,879,276,708 cond + 46,437,321 ind)
==147659== Mispredicts:      144,127,970  (  131,505,172 cond + 12,622,798 ind)
==147659== Mispred rate:             7.4% (          6.9%     +       27.1%   )

==155206== I   refs:      13,910,201,237
==155206== 
==155206== Branches:       1,925,712,267  (1,879,275,030 cond + 46,437,237 ind)
==155206== Mispredicts:      144,128,313  (  131,505,542 cond + 12,622,771 ind)
==155206== Mispred rate:             7.4% (          6.9%     +       27.1%   )


I think that there are about 5 million more instructions executed with the first
patch, and the other patches on top do not really help.

> 
> Which is in the noise for this test.  Which makes me wonder if I
> botched something on the latest run.  It doesn't appear so, but I'm
> re-running just to be sure.  I'm also turning on -g so that I can
> use cg_annotate to poke a bit deeper and perhaps identify one or
> more concrete examples where your patch is making this worse.

Thanks,
Sebastian
Jeff Law Nov. 24, 2014, 10:23 p.m. UTC | #3
On 11/24/14 15:05, Sebastian Pop wrote:
>
> I did an experiment around these lines over the week-end, and now that you
> mention it, I feel less shy to speak about; well the patch does not yet pass
> bootstrap, and there still are about 20 failing test-cases.  I feel better
> reading the code generation part of jump-threading after this patch ;-)
> Basically I think all the tree-ssa-threadupdate.c can be replaced by
> duplicate_seme_region that generalizes the code generation.
Clearly next stage1 stuff, but definitely the right direction IMHO.  If 
you get the chance, look at Bodik's thesis.  As far as I know he's the 
only person to really look at how to structure context-sensitive 
optimizations in a sane way.



>
> I got my scripts installed on the gcc-farm.  I first used an x86_64 gcc75 and
> valgrind was crashing not recognizing how to decode an instruction.  Then I
> moved to gcc112 a powerpc64-linux where I got this data from stage2 cc1plus
> compiling the same file alias.ii at -O2: (I got 3 runs of each mostly because
> there is a bit of noise in all these numbers)
Yea, glibc & valgrind really need to update in lock-step as glibc gains 
support for new ISAs.  Certain instructions are supposed to be 
interpreted as nops, but valgrind instead raises an illegal instruction.

There's a bit of noise when using valgrind like this, but it has 
definitely proven useful in the past.

I'm looking at bitmap_ior_and_compl right now.  Not sure if cg_annotate 
is sending me on a wild goose chase yet or not.

jeff
Jeff Law Nov. 24, 2014, 11:02 p.m. UTC | #4
On 11/23/14 15:22, Sebastian Pop wrote:
> Jeff Law wrote:
>>> PS: I have run some perf analysis with the patch:
>>> - on a bootstrap of GCC I see 3209 FSM jump threads
>>> - libpng and libjpeg contain FSM jump threads, the perf increase is in the 1%
>>>    (measured on simulators and reduced data sets)
>>> - coremark gets jump threaded (as expected)
>>> - I'm setting up the llvm test-suite and I will report perf differences
>> So that's *far* more jump threads than I would expect this to find
>> in a bootstrap of GCC -- like 3 orders of magnitude more than I'd
>> expect to find.
>
> The second patch attached limits the search for FSM jump threads to loops.  With
> that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
> (and 424 jump threads on powerpc64-linux bootstrap.)
[ ... ]

So why are we returning -1 (block should not be duplicated and not 
suitable for a joiner) at the end of thread_through_normal_block?


       /* When COND cannot be simplified, try to find paths from a control
	  statement back through the PHI nodes which would affect that control
	  statement.  */
       vec<basic_block, va_gc> *bb_path;
       vec_alloc (bb_path, n_basic_blocks_for_fn (cfun));
       vec_safe_push (bb_path, e->dest);
       hash_set<gimple> *visited_phis = new hash_set<gimple>;

       max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS);
       fsm_find_control_statement_thread_paths (cond, visited_phis, bb_path);

       delete visited_phis;
       vec_free (bb_path);

       return -1;

Returning -1 (instead of 0) says stop, there's no possibility of 
threading something on that path.   I think that's suppressing some 
profitable jump threads.  I haven't done more than verify the bitmap 
code returns to its prior state with that change.

jeff
Sebastian Pop Nov. 24, 2014, 11:49 p.m. UTC | #5
Jeff Law wrote:
> On 11/23/14 15:22, Sebastian Pop wrote:
> >Jeff Law wrote:
> >>>PS: I have run some perf analysis with the patch:
> >>>- on a bootstrap of GCC I see 3209 FSM jump threads
> >>>- libpng and libjpeg contain FSM jump threads, the perf increase is in the 1%
> >>>   (measured on simulators and reduced data sets)
> >>>- coremark gets jump threaded (as expected)
> >>>- I'm setting up the llvm test-suite and I will report perf differences
> >>So that's *far* more jump threads than I would expect this to find
> >>in a bootstrap of GCC -- like 3 orders of magnitude more than I'd
> >>expect to find.
> >
> >The second patch attached limits the search for FSM jump threads to loops.  With
> >that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
> >(and 424 jump threads on powerpc64-linux bootstrap.)
> [ ... ]
> 
> So why are we returning -1 (block should not be duplicated and not
> suitable for a joiner) at the end of thread_through_normal_block?
> 
> 
>       /* When COND cannot be simplified, try to find paths from a control
>          statement back through the PHI nodes which would affect that control
>          statement.  */
>       vec<basic_block, va_gc> *bb_path;
>       vec_alloc (bb_path, n_basic_blocks_for_fn (cfun));
>       vec_safe_push (bb_path, e->dest);
>       hash_set<gimple> *visited_phis = new hash_set<gimple>;
> 
>       max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS);
>       fsm_find_control_statement_thread_paths (cond, visited_phis, bb_path);
> 
>       delete visited_phis;
>       vec_free (bb_path);
> 
>       return -1;
> 
> Returning -1 (instead of 0) says stop, there's no possibility to
> threading something on that path.   I think that's suppressing some
> profitable jump threads.  

Thanks for spotting this.

> I haven't done  more than verify the
> bitmap code returns to its prior state with that change.

I removed the return -1 and started a bootstrap on powerpc64-linux.
I will report the valgrind output.

Thanks,
Sebastian
Sebastian Pop Nov. 25, 2014, 1:09 a.m. UTC | #6
Sebastian Pop wrote:
> I removed the return -1 and started a bootstrap on powerpc64-linux.

Bootstrap passed on top of the 4 previous patches on powerpc64-linux.

> I will report the valgrind output.

The output from valgrind looks closer to the output of master with no other
patches: still 1M more instructions executed, and 300K more branches.

master no-patch:

==129233== I   refs:      13,910,221,913
==129233==
==129233== Branches:       1,925,715,095  (1,879,277,776 cond + 46,437,319 ind)
==129233== Mispredicts:      144,133,332  (  131,510,534 cond + 12,622,798 ind)
==129233== Mispred rate:             7.4% (          6.9%     +       27.1%   )

4 previous patches + patch to return 0:

==149012== I   refs:      13,911,870,743
==149012== 
==149012== Branches:       1,926,092,629  (1,879,657,768 cond + 46,434,861 ind)
==149012== Mispredicts:      145,551,513  (  132,827,091 cond + 12,724,422 ind)
==149012== Mispred rate:             7.5% (          7.0%     +       27.4%   )

==4492== I   refs:      13,911,899,691
==4492== 
==4492== Branches:       1,926,096,214  (1,879,661,186 cond + 46,435,028 ind)
==4492== Mispredicts:      145,551,707  (  132,827,231 cond + 12,724,476 ind)
==4492== Mispred rate:             7.5% (          7.0%     +       27.4%   )

==19521== I   refs:      13,911,855,711
==19521== 
==19521== Branches:       1,926,090,982  (1,879,656,202 cond + 46,434,780 ind)
==19521== Mispredicts:      145,551,343  (  132,826,948 cond + 12,724,395 ind)
==19521== Mispred rate:             7.5% (          7.0%     +       27.4%   )
Jeff Law Nov. 25, 2014, 4:55 a.m. UTC | #7
On 11/24/14 18:09, Sebastian Pop wrote:
> Sebastian Pop wrote:
>> I removed the return -1 and started a bootstrap on powerpc64-linux.
>
> Bootstrap passed on top of the 4 previous patches on powerpc64-linux.
>
>> I will report the valgrind output.
>
> The output from valgrind looks closer to the output of master with no other
> patches: still 1M more instructions executed, and 300K more branches
Just ran my suite where we get ~25k more branches, which definitely puts 
us in the noise (that's with all 4 patches + fixing the return value).  
I'm going to look a little closer at this stuff tomorrow, but I 
think we've resolved the performance issue.  I'll dig deeper into the 
implementation tomorrow as well.

Cheers,
Jeff
Richard Biener Nov. 25, 2014, 9:37 a.m. UTC | #8
On Mon, Nov 24, 2014 at 11:05 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> Jeff Law wrote:
>> On 11/23/14 15:22, Sebastian Pop wrote:
>> >The second patch attached limits the search for FSM jump threads to loops.  With
>> >that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
>> >(and 424 jump threads on powerpc64-linux bootstrap.)
>> >
>> Yea, that was one of the things I was going to poke at as well as a
>> quick scan of your patch gave me the impression it wasn't limited to
>> loops.
>>
>> Again, I haven't looked much at the patch, but I got the impression
>> you're doing a backwards walk through the predecessors to discover
>> the result of the COND_EXPR.  Correct?
>
> Yes.
>
>>
>> That's something I'd been wanting to do -- basically start with a
>> COND_EXPR, then walk the dataflow backwards substituting values into
>> the COND_EXPR (possibly creating non-gimple).  Ultimately the goal
>> is to substitute and fold, getting to a constant :-)
>>
>> The forward exhaustive stuff we do now is, crazy.   The backwards
>> approach could be decoupled from DOM & VRP into an independent pass,
>> which I think would be wise.
>>
>> Using a SEME region copier is also something I really wanted to do
>> long term.  In fact, I believe a lot of tree-ssa-threadupdate.c
>> ought to be ripped out and replaced with a SEME based copier.
>
> I did an experiment around these lines over the week-end, and now that you
> mention it, I feel less shy to speak about; well the patch does not yet pass
> bootstrap, and there still are about 20 failing test-cases.  I feel better
> reading the code generation part of jump-threading after this patch ;-)
> Basically I think all the tree-ssa-threadupdate.c can be replaced by
> duplicate_seme_region that generalizes the code generation.

Btw I once thought about doing on-the-fly lattice use/update and folding
during basic-block copying (or even re-generating expressions via
simplifying gimple_build ()).  Or having a substitute-and-fold-like
facility that can run on SEME regions and do this.

Richard.

>> It appears you've built at least parts of two pieces needed to all
>> this as a Bodik style optimizer.  Which is exactly the long term
>> direction I think this code ought to take.
>>
>>
>> >
>> >One of the reasons I think we see more branches is that in sese region copying we
>> >do not use the knowledge of the value of the condition for the last branch in a
>> >jump-thread path: we rely on other propagation passes to remove the branch.  The
>> >last attached patch adds:
>> >
>> >   /* Remove the last branch in the jump thread path.  */
>> >   remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], exit->dest);
>> That's certainly a possibility.  But I would expect that even with
>> this limitation something would be picking up the fact that the
>> branch is statically computable (even if it's an RTL optimizer).
>> But it's definitely something to look for.
>>
>> >
>> >Please let me know if the attached patches are producing better results on gcc.
>>
>> For the trunk:
>>   instructions:1339016494968
>>   branches     :243568982489
>>
>> First version of your patch:
>>
>>   instructions:1339739533291
>>   branches:     243806615986
>>
>> Latest version of your patch:
>>
>>   instructions:1339749122609
>>   branches:     243809838262
>
> I think I got about the same results.
>
> I got my scripts installed on the gcc-farm.  I first used an x86_64 gcc75 and
> valgrind was crashing not recognizing how to decode an instruction.  Then I
> moved to gcc112 a powerpc64-linux where I got this data from stage2 cc1plus
> compiling the same file alias.ii at -O2: (I got 3 runs of each mostly because
> there is a bit of noise in all these numbers)
>
> $ valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ./cc1plus -O2 ~/alias.ii
>
> all 4 patches:
>
> ==153617== I   refs:      13,914,038,211
> ==153617==
> ==153617== Branches:       1,926,407,760  (1,879,827,481 cond + 46,580,279 ind)
> ==153617== Mispredicts:      144,890,904  (  132,094,105 cond + 12,796,799 ind)
> ==153617== Mispred rate:             7.5% (          7.0%     +       27.4%   )
>
> ==34993== I   refs:      13,915,335,629
> ==34993==
> ==34993== Branches:       1,926,597,919  (1,880,017,558 cond + 46,580,361 ind)
> ==34993== Mispredicts:      144,974,266  (  132,177,440 cond + 12,796,826 ind)
> ==34993== Mispred rate:             7.5% (          7.0%     +       27.4%   )
>
> ==140841== I   refs:      13,915,334,459
> ==140841==
> ==140841== Branches:       1,926,597,819  (1,880,017,458 cond + 46,580,361 ind)
> ==140841== Mispredicts:      144,974,296  (  132,177,470 cond + 12,796,826 ind)
> ==140841== Mispred rate:             7.5% (          7.0%     +       27.4%   )
>
> patch 1:
>
> ==99902== I   refs:      13,915,069,710
> ==99902==
> ==99902== Branches:       1,926,963,813  (1,880,376,148 cond + 46,587,665 ind)
> ==99902== Mispredicts:      145,501,564  (  132,656,576 cond + 12,844,988 ind)
> ==99902== Mispred rate:             7.5% (          7.0%     +       27.5%   )
>
> ==3907== I   refs:      13,915,082,469
> ==3907==
> ==3907== Branches:       1,926,965,218  (1,880,377,471 cond + 46,587,747 ind)
> ==3907== Mispredicts:      145,501,569  (  132,656,554 cond + 12,845,015 ind)
> ==3907== Mispred rate:             7.5% (          7.0%     +       27.5%   )
>
> ==44271== I   refs:      13,915,111,997
> ==44271==
> ==44271== Branches:       1,926,968,863  (1,880,380,952 cond + 46,587,911 ind)
> ==44271== Mispredicts:      145,501,858  (  132,656,789 cond + 12,845,069 ind)
> ==44271== Mispred rate:             7.5% (          7.0%     +       27.5%   )
>
> master no-patch:
>
> ==129233== I   refs:      13,910,221,913
> ==129233==
> ==129233== Branches:       1,925,715,095  (1,879,277,776 cond + 46,437,319 ind)
> ==129233== Mispredicts:      144,133,332  (  131,510,534 cond + 12,622,798 ind)
> ==129233== Mispred rate:             7.4% (          6.9%     +       27.1%   )
>
> ==147659== I   refs:      13,910,216,249
> ==147659==
> ==147659== Branches:       1,925,714,029  (1,879,276,708 cond + 46,437,321 ind)
> ==147659== Mispredicts:      144,127,970  (  131,505,172 cond + 12,622,798 ind)
> ==147659== Mispred rate:             7.4% (          6.9%     +       27.1%   )
>
> ==155206== I   refs:      13,910,201,237
> ==155206==
> ==155206== Branches:       1,925,712,267  (1,879,275,030 cond + 46,437,237 ind)
> ==155206== Mispredicts:      144,128,313  (  131,505,542 cond + 12,622,771 ind)
> ==155206== Mispred rate:             7.4% (          6.9%     +       27.1%   )
>
>
> I think that there are about 5 million more instructions executed with the first
> patch, and the other patches on top do not really help.
>
>>
>> Which is in the noise for this test.  Which makes me wonder if I
>> botched something on the latest run.  It doesn't appear so, but I'm
>> re-running just to be sure.  I'm also turning on -g so that I can
>> use cg_annotate to poke a bit deeper and perhaps identify one or
>> more concrete examples where your patch is making this worse.
>
> Thanks,
> Sebastian
Markus Trippelsdorf Nov. 25, 2014, 10:31 a.m. UTC | #9
On 2014.11.24 at 22:05 +0000, Sebastian Pop wrote:
> I got my scripts installed on the gcc-farm.  I first used an x86_64 gcc75 and
> valgrind was crashing not recognizing how to decode an instruction.  Then I
> moved to gcc112 a powerpc64-linux where I got this data from stage2 cc1plus
> compiling the same file alias.ii at -O2: (I got 3 runs of each mostly because
> there is a bit of noise in all these numbers)
> 
> $ valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ./cc1plus -O2 ~/alias.ii

BTW perf is also available on gcc112:

trippels@gcc2-power8 ~ % perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]
  dummy                                              [Software event]

  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:<addr>[:access]                                [Hardware breakpoint]
Jeff Law Nov. 25, 2014, 4:15 p.m. UTC | #10
On 11/24/14 21:55, Jeff Law wrote:
> On 11/24/14 18:09, Sebastian Pop wrote:
>> Sebastian Pop wrote:
>>> I removed the return -1 and started a bootstrap on powerpc64-linux.
>>
>> Bootstrap passed on top of the 4 previous patches on powerpc64-linux.
>>
>>> I will report the valgrind output.
>>
>> The output from valgrind looks closer to the output of master with no
>> other
>> patches: still 1M more instructions executed, and 300K more branches
> Just ran my suite where we get ~25k more branches, which definitely puts
> us in the noise.  (that's with all 4 patches + fixing the return value
> ).  I'm going to look at little closer at this stuff tomorrow, but I
> think we've resolved the performance issue.  I'll dig deeper into the
> implementation tomorrow as well.
I was running without your followup patches (must have used the wrong 
bits from my git stash), so those results are bogus, but in a good way.

After fixing that goof, I'm seeing consistent improvements with your set 
of 4 patches and the fix for the wrong return code.  Across the suite, 
~140M fewer branches, not huge, but definitely not in the noise.

So, time to dig into the implementation :-)

Jeff

ps.  In case you're curious about the noise, it's primarily address hashing.
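For reference, the branch counts being compared throughout this thread come from cachegrind's branch simulator, as described earlier in the thread. A minimal sketch of extracting and diffing the dynamic branch totals from two such runs (the log file names and the summary numbers below are made-up placeholders; real logs come from `valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ./cc1 ... 2> LOG`):

```shell
# Hypothetical cachegrind summaries for the baseline and patched compilers;
# in a real run these lines appear in valgrind's stderr output.
printf 'Branches:     1,234,567,890  (1,200,000,000 cond + 34,567,890 ind)\n' > base.log
printf 'Branches:     1,234,427,890  (1,199,860,000 cond + 34,567,890 ind)\n' > patched.log

# Pull the dynamic branch total out of a log, stripping the thousands commas.
branches() { grep -oE 'Branches:[[:space:]]+[0-9,]+' "$1" | tr -d ',' | awk '{print $2}'; }

# Report the reduction in dynamically executed branches.
echo "branch delta: $(( $(branches base.log) - $(branches patched.log) ))"
```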

Patch

From b9b6155099d81b5ee6322e8bba2e3ba5d4f00b6e Mon Sep 17 00:00:00 2001
From: Sebastian Pop <sebpop@gmail.com>
Date: Sun, 23 Nov 2014 10:52:11 -0600
Subject: [PATCH 4/4] make copied region single entry and remove last condition
 stmt

---
 gcc/tree-cfg.c              |   2 +-
 gcc/tree-cfg.h              |   1 +
 gcc/tree-ssa-threadupdate.c | 151 +++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 6d96c52..ffa5162 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -2666,7 +2666,7 @@  reinstall_phi_args (edge new_edge, edge old_edge)
    near its "logical" location.  This is of most help to humans looking
    at debugging dumps.  */
 
-static basic_block
+basic_block
 split_edge_bb_loc (edge edge_in)
 {
   basic_block dest = edge_in->dest;
diff --git a/gcc/tree-cfg.h b/gcc/tree-cfg.h
index 626e973..51f0899 100644
--- a/gcc/tree-cfg.h
+++ b/gcc/tree-cfg.h
@@ -67,6 +67,7 @@  extern void verify_gimple_in_cfg (struct function *, bool);
 extern tree gimple_block_label (basic_block);
 extern void add_phi_args_after_copy_bb (basic_block);
 extern void add_phi_args_after_copy (basic_block *, unsigned, edge);
+extern basic_block split_edge_bb_loc (edge);
 extern bool gimple_duplicate_sese_region (edge, edge, basic_block *, unsigned,
 					basic_block *, bool);
 extern bool gimple_duplicate_sese_tail (edge, edge, basic_block *, unsigned,
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index dd2b518..3ee2117 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2318,6 +2318,153 @@  bb_ends_with_multiway_branch (basic_block bb ATTRIBUTE_UNUSED)
   return false;
 }
 
+/* Duplicates a Single Entry Multiple Exit REGION (set of N_REGION basic
+   blocks).  The ENTRY edge is redirected to the duplicate of the region.  If
+   REGION is not a Single Entry region, ignore any incoming edges other than
+   ENTRY: this makes the copied region a Single Entry region.
+
+   Remove the last conditional statement in the last basic block in the REGION,
+   and create a single fallthru edge pointing to the same destination as the
+   EXIT edge.
+
+   The new basic blocks are stored to REGION_COPY in the same order as they had
+   in REGION, provided that REGION_COPY is not NULL.
+
+   Returns false if it is unable to copy the region, true otherwise.  */
+
+static bool
+duplicate_seme_region (edge entry, edge exit,
+		       basic_block *region, unsigned n_region,
+		       basic_block *region_copy)
+{
+  unsigned i;
+  bool free_region_copy = false, copying_header = false;
+  struct loop *loop = entry->dest->loop_father;
+  edge exit_copy;
+  edge redirected;
+  int total_freq = 0, entry_freq = 0;
+  gcov_type total_count = 0, entry_count = 0;
+
+  if (!can_copy_bbs_p (region, n_region))
+    return false;
+
+  /* Some sanity checking.  Note that we do not check for all possible
+     misuses of the functions.  I.e. if you ask to copy something weird,
+     it will work, but the state of structures probably will not be
+     correct.  */
+  for (i = 0; i < n_region; i++)
+    {
+      /* We do not handle subloops, i.e. all the blocks must belong to the
+	 same loop.  */
+      if (region[i]->loop_father != loop)
+	return false;
+    }
+
+  initialize_original_copy_tables ();
+
+  if (copying_header)
+    set_loop_copy (loop, loop_outer (loop));
+  else
+    set_loop_copy (loop, loop);
+
+  if (!region_copy)
+    {
+      region_copy = XNEWVEC (basic_block, n_region);
+      free_region_copy = true;
+    }
+
+  if (entry->dest->count)
+    {
+      total_count = entry->dest->count;
+      entry_count = entry->count;
+      /* Fix up corner cases, to avoid division by zero or creation of negative
+	 frequencies.  */
+      if (entry_count > total_count)
+	entry_count = total_count;
+    }
+  else
+    {
+      total_freq = entry->dest->frequency;
+      entry_freq = EDGE_FREQUENCY (entry);
+      /* Fix up corner cases, to avoid division by zero or creation of negative
+	 frequencies.  */
+      if (total_freq == 0)
+	total_freq = 1;
+      else if (entry_freq > total_freq)
+	entry_freq = total_freq;
+    }
+
+  copy_bbs (region, n_region, region_copy, &exit, 1, &exit_copy, loop,
+	    split_edge_bb_loc (entry), 0);
+  if (total_count)
+    {
+      scale_bbs_frequencies_gcov_type (region, n_region,
+				       total_count - entry_count,
+				       total_count);
+      scale_bbs_frequencies_gcov_type (region_copy, n_region, entry_count,
+				       total_count);
+    }
+  else
+    {
+      scale_bbs_frequencies_int (region, n_region, total_freq - entry_freq,
+				 total_freq);
+      scale_bbs_frequencies_int (region_copy, n_region, entry_freq, total_freq);
+    }
+
+#ifdef ENABLE_CHECKING
+  /* Make sure no edge other than ENTRY is entering the copied region.  */
+  for (i = 0; i < n_region; i++)
+    {
+      edge e;
+      edge_iterator ei;
+      basic_block bb = region_copy[i];
+
+      if (single_pred_p (bb))
+	continue;
+
+      for (ei = ei_start (bb->preds); (e = ei_safe_edge (ei)); ei_next (&ei))
+	{
+	  basic_block x = e->src;
+	  bool found = false;
+
+	  for (unsigned j = 0; j < n_region; j++)
+	    if (x == region_copy[j])
+	      {
+		found = true;
+		break;
+	      }
+
+	  gcc_assert (found);
+	}
+    }
+#endif
+
+  /* Remove the last branch in the jump thread path.  */
+  remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], exit->dest);
+  edge e = make_edge (region_copy[n_region - 1], exit->dest, EDGE_FALLTHRU);
+
+  if (e) {
+    rescan_loop_exit (e, true, false);
+    e->probability = REG_BR_PROB_BASE;
+    e->count = region_copy[n_region - 1]->count;
+    //copy_phi_args (e->dest, rd->path->last ()->e, e, rd->path, idx);
+  }
+
+  /* Redirect the entry and add the phi node arguments.  */
+  redirected = redirect_edge_and_branch (entry, get_bb_copy (entry->dest));
+  gcc_assert (redirected != NULL);
+  flush_pending_stmts (entry);
+
+  /* Add the other PHI node arguments.  */
+  add_phi_args_after_copy (region_copy, n_region, NULL);
+
+  if (free_region_copy)
+    free (region_copy);
+
+  free_original_copy_tables ();
+  return true;
+}
+
 /* Walk through all blocks and thread incoming edges to the appropriate
    outgoing edge for each edge pair recorded in THREADED_EDGES.
 
@@ -2363,8 +2510,8 @@  thread_through_all_blocks (bool may_peel_loop_headers)
       for (unsigned int j = 0; j < len - 1; j++)
 	region[j] = (*path)[j]->e->dest;
 
-      bool success = gimple_duplicate_sese_region (entry, exit, region,
-						   len - 1, NULL, 0);
+      bool success = duplicate_seme_region (entry, exit, region,
+					    len - 1, NULL);
       if (success)
 	{
 	  dump_jump_thread_path (stderr, *path, false);
-- 
1.9.1