
[PR 10474] Schedule pass_cprop_hardreg before pass_thread_prologue_and_epilogue

Message ID 20130417154935.GC3656@virgil.suse
State New

Commit Message

Martin Jambor April 17, 2013, 3:49 p.m. UTC
Hi,

I have discovered that scheduling pass_cprop_hardreg before
pass_thread_prologue_and_epilogue leads to a significant increase in
the number of shrink-wrappings performed.  For one thing, it solves
PR 10474 (at least on x86_64-linux), but it also boosts the number of
shrink-wrappings performed during GCC bootstrap by nearly 80%
(3165 -> 5692 functions).  It is also necessary (although not
sufficient) for performing shrink-wrapping in at least one function in
the povray benchmark.

The reason why it helps so much is that before register allocation
there are instructions moving the value of actual arguments from the
"originally hard" registers (e.g. SI, DI, etc.) to pseudos at the
beginning of each function.  When an argument is live across a
function call, its pseudo is likely to be assigned to a callee-saved
register and then also accessed from that register, even in the first
BB, making it require a prologue, though the value could be fetched
from the original one.  When we convert all uses (at least in the first BB) to
the original register, the preparatory stage of shrink wrapping is
often capable of moving the register moves to a later BB, thus
creating fast paths which do not require prologue and epilogue.
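
To make this concrete, below is a hand-written sketch of the kind of
function that benefits.  The register names assume the x86_64 SysV ABI
and the function (and its callee g) is made up for illustration; it is
not part of the patch:

    extern void g (int *p);

    int
    f (int x, int *p)
    {
      if (!p)
        return x + 1;  /* Fast path: x arrives in %edi and can be used
                          there directly.  */

      /* Slow path: x is live across the call to g, so its pseudo is
         likely to get a callee-saved register such as %ebx.  Without the
         early cprop_hardreg run, the copy %edi -> %ebx sits in the first
         BB and the fast path above also reads %ebx, so the prologue
         saving %rbx is needed on every path.  With the pass moved before
         shrink-wrapping, the first BB reads %edi again and the copy can
         be sunk below the branch, leaving the early return without
         prologue and epilogue.  */
      g (p);
      return x;
    }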

We believe this change in the pipeline should not bring about any
negative effects.  During GCC bootstrap, the number of instructions
changed by pass_cprop_hardreg dropped, but only by 1.2%.  We have also
run SPEC CPU 2006 benchmarks on recent Intel and AMD hardware and all
run-time differences could be attributed to noise.  The changes in
binary sizes were also small:

    |                | Trunk produced | New         |        |
    | Benchmark      |    binary size | binary size | % diff |
    |----------------+----------------+-------------+--------|
    | 400.perlbench  |        6219603 |     6136803 |  -1.33 |
    | 401.bzip2      |         359291 |      351659 |  -2.12 |
    | 403.gcc        |       16249718 |    15915774 |  -2.06 |
    | 410.bwaves     |         145249 |      145769 |   0.36 |
    | 416.gamess     |       40269686 |    40270270 |   0.00 |
    | 429.mcf        |          97142 |       97126 |  -0.02 |
    | 433.milc       |         715444 |      713236 |  -0.31 |
    | 434.zeusmp     |        1444596 |     1444676 |   0.01 |
    | 435.gromacs    |        6609207 |     6470039 |  -2.11 |
    | 436.cactusADM  |        4571319 |     4532607 |  -0.85 |
    | 437.leslie3d   |         492197 |      492357 |   0.03 |
    | 444.namd       |        1001921 |     1007001 |   0.51 |
    | 445.gobmk      |        8193495 |     8163839 |  -0.36 |
    | 450.soplex     |        5565070 |     5530734 |  -0.62 |
    | 453.povray     |        7468446 |     7340142 |  -1.72 |
    | 454.calculix   |        8474754 |     8464954 |  -0.12 |
    | 456.hmmer      |        1662315 |     1650147 |  -0.73 |
    | 458.sjeng      |         623065 |      620817 |  -0.36 |
    | 459.GemsFDTD   |        1456669 |     1461573 |   0.34 |
    | 462.libquantum |         249809 |      248401 |  -0.56 |
    | 464.h264ref    |        2784806 |     2772806 |  -0.43 |
    | 465.tonto      |       15511395 |    15480899 |  -0.20 |
    | 470.lbm        |          64327 |       64215 |  -0.17 |
    | 471.omnetpp    |        5325418 |     5293874 |  -0.59 |
    | 473.astar      |         365853 |      363261 |  -0.71 |
    | 481.wrf        |       22002287 |    21950783 |  -0.23 |
    | 482.sphinx3    |        1153616 |     1145248 |  -0.73 |
    | 483.xalancbmk  |       62458676 |    62001540 |  -0.73 |
    |----------------+----------------+-------------+--------|
    | TOTAL          |      221535374 |   220130550 |  -0.63 |

I have successfully bootstrapped and tested the patch on
x86_64-linux.  Is it OK for trunk?  Or should I also examine some
other aspect?

Thanks,

Martin


2013-03-28  Martin Jambor  <mjambor@suse.cz>

	PR middle-end/10474
	* passes.c (init_optimization_passes): Move pass_cprop_hardreg before
	pass_thread_prologue_and_epilogue.

testsuite/
	* gcc.dg/pr10474.c: New test.

Comments

Jeff Law April 17, 2013, 6:43 p.m. UTC | #1
On 04/17/2013 09:49 AM, Martin Jambor wrote:
>
> The reason why it helps so much is that before register allocation
> there are instructions moving the value of actual arguments from
> "originally hard" register (e.g. SI, DI, etc.) to a pseudo at the
> beginning of each function.  When the argument is live across a
> function call, the pseudo is likely to be assigned to a callee-saved
> register and then also accessed from that register, even in the first
> BB, making it require prologue, though it could be fetched from the
> original one.  When we convert all uses (at least in the first BB) to
> the original register, the preparatory stage of shrink wrapping is
> often capable of moving the register moves to a later BB, thus
> creating fast paths which do not require prologue and epilogue.
I noticed similar effects when looking at range splitting.  Being able 
to move those calls into a deeper control level in the CFG would 
definitely be an improvement.

>
> We believe this change in the pipeline should not bring about any
> negative effects.  During gcc bootstrap, the number of instructions
> changed by pass_cprop_hardreg dropped but by only 1.2%.  We have also
> ran SPEC 2006 CPU benchmarks on recent Intel and AMD hardware and all
> run time differences could be attributed to noise.  The changes in
> binary sizes were also small:
Did anyone ponder just doing the hard register propagation on argument
registers prior to the prologue/epilogue handling, then the full-blown
propagation pass in its current location in the pipeline?

That would get you the benefit you're seeking and minimize other 
effects.  Of course if you try that and get effectively the same results 
as moving the full propagation pass before prologue/epilogue handling 
then the complexity of only propagating argument registers early is 
clearly not needed and we'd probably want to go with your patch as-is.


jeff
Martin Jambor April 18, 2013, 10:09 p.m. UTC | #2
Hi,

On Wed, Apr 17, 2013 at 12:43:59PM -0600, Jeff Law wrote:
> On 04/17/2013 09:49 AM, Martin Jambor wrote:
> >
> >The reason why it helps so much is that before register allocation
> >there are instructions moving the value of actual arguments from
> >"originally hard" register (e.g. SI, DI, etc.) to a pseudo at the
> >beginning of each function.  When the argument is live across a
> >function call, the pseudo is likely to be assigned to a callee-saved
> >register and then also accessed from that register, even in the first
> >BB, making it require prologue, though it could be fetched from the
> >original one.  When we convert all uses (at least in the first BB) to
> >the original register, the preparatory stage of shrink wrapping is
> >often capable of moving the register moves to a later BB, thus
> >creating fast paths which do not require prologue and epilogue.
> I noticed similar effects when looking at range splitting.  Being
> able to move those calls into a deeper control level in the CFG
> would definitely be an improvement.
> 
> >
> >We believe this change in the pipeline should not bring about any
> >negative effects.  During gcc bootstrap, the number of instructions
> >changed by pass_cprop_hardreg dropped but by only 1.2%.  We have also
> >ran SPEC 2006 CPU benchmarks on recent Intel and AMD hardware and all
> >run time differences could be attributed to noise.  The changes in
> >binary sizes were also small:

> Did anyone ponder just doing the hard register propagation on
> argument registers prior the prologue/epilogue handling, then the
> full blown propagation pass in its current location in the pipeline?

I did not because I did not think it would be substantially faster
than running the pass as-is twice.  I may be wrong, but it would still
have to look at all statements and examine them at a very similar level
of detail (to look for clobbers and manage value_data_entry chains)
and it would not really do that much less work fiddling with its own
data structures.

What would very likely be a working alternative for shrink-wrapping is
to have shrink-wrapping preparation invoke copyprop_hardreg_forward_1
on the first BB and the few BBs it tries to move stuff across.  But of
course that would be a bit ugly and so I think we should do it only if
there is a reason not to move the pass (or schedule it twice).

I also have not tried scheduling the hard register copy propagation
pass twice and measuring the impact on compile times.  Any suggestion
what might be a good testcase for that?

Thanks,

Martin

> 
> That would get you the benefit you're seeking and minimize other
> effects.  Of course if you try that and get effectively the same
> results as moving the full propagation pass before prologue/epilogue
> handling then the complexity of only propagating argument registers
> early is clearly not needed and we'd probably want to go with your
> patch as-is.
> 
> 
> jeff
>
Steven Bosscher April 18, 2013, 10:37 p.m. UTC | #3
On Fri, Apr 19, 2013 at 12:09 AM, Martin Jambor wrote:
> I also have not tried scheduling the hard register copy propagation
> pass twice and measuring the impact on compile times.  Any suggestion
> what might be a good testcase for that?

I think a better question is when this would be useful in the first
place, and why. In other words: If you propagate hardregs before
shrink wrapping, what could be a source of new opportunities after
shrink wrapping?


The only things I can think of, given the current pass order, are:

* different basic block order due to shrink wrapping
 regcprop's effectiveness depends on the order of the basic blocks
(unfortunately)

* different basic block contents due to head/tail merging (pass_jump2)
 Head/tail merging extends some basic blocks and shortens others. The
elongated basic blocks may present new opportunities (regcprop is a
local pass).

* different basic block contents due to dead store elimination (pass_dse2)
 A removed dead store may also make an address calculation redundant,
changing the regcprop value chains.

* different basic block contents due to peephole2
 A peephole2 may present new regcprop opportunities; peephole2 misses
the context to avoid trivial copies.


On the other hand, running regcprop earlier also helps some other
passes. For example, I think regcprop before jump2 may result in more
successful head/tail merging attempts by making more input operands
match, but it could hurt if_after_reload by extending the lifetimes of
registers.


But wouldn't it be better to avoid these argument-register pseudos
being assigned to callee-saved registers? Perhaps splitting the live
range of the pseudos before the first call on each path will do the
trick, and let IRA pick the right registers for you instead.

Ciao!
Steven
Martin Jambor April 18, 2013, 11:08 p.m. UTC | #4
Hi,

On Fri, Apr 19, 2013 at 12:37:58AM +0200, Steven Bosscher wrote:
> On Fri, Apr 19, 2013 at 12:09 AM, Martin Jambor wrote:
> > I also have not tried scheduling the hard register copy propagation
> > pass twice and measuring the impact on compile times.  Any suggestion
> > what might be a good testcase for that?
> 
> I think a better question is when this would be useful in the first
> place, and why. In other words: If you propagate hardregs before
> shrink wrapping, what could be a source of new opportunities after
> shrink wrapping?

Yes, we also did that and neither I nor Honza could think of any
potential problems there.  And of course, I'd also measure how many
statements the second run of the pass changed.  I'll probably do that
tomorrow anyway.

> 
> 
> The only things I can think of, given the current pass order, are:
> 
> * different basic block order due to shrink wrapping
>  regcprop's effectiveness depends on the order of the basic blocks
> (unfortunately)
> 
> * different basic block contents due to head/tail merging (pass_jump2)
>  Head/tail merging extends some basic blocks and shortens others. The
> elongated basic blocks may present new opportunities (regcprop is a
> local pass).
> 
> * different basic block contents due to dead store elimination (pass_dse2)
>  A removed dead store may also make an address calculation redundant,
> changing the regcprop value chains.
> 
> * different basic block contents due to peephole2
>  A peephole2 may present new regcprop opportunities, peephole2 misses
> the context to avoid trivial copies.
> 
> 
> On the other hand, running regcprop earlier also helps some other
> passes. For example, I think regcprop before jump2 may result in more
> successful head/tail merging attempts by making more input operands
> match, but it could hurt if_after_reload by extending live times of
> registers.
> 
> 
> But wouldn't it be better to avoid these argument-register pseudos
> being assigned to callee-saved registers? Perhaps splitting the live
> range of the pseudos before the first call on each path will do the
> trick, and let IRA pick the right registers for you instead.

First, where can I have a look at how a live range is split?  ;-)

But second, if such a call is in a loop (or reachable by more than
one path), wouldn't it be wrong to do that?  To avoid that, I
suppose I might end up doing another shrink-wrapping-like search for
the right edge for the prologue and actually coming up with a very similar
result to the propagation and shrink-wrapping preparation.  But I'm
willing to try.

Thanks a lot,

Martin
Richard Biener April 19, 2013, 8:21 a.m. UTC | #5
On Fri, Apr 19, 2013 at 1:08 AM, Martin Jambor <mjambor@suse.cz> wrote:
> Hi,
>
> On Fri, Apr 19, 2013 at 12:37:58AM +0200, Steven Bosscher wrote:
>> On Fri, Apr 19, 2013 at 12:09 AM, Martin Jambor wrote:
>> > I also have not tried scheduling the hard register copy propagation
>> > pass twice and measuring the impact on compile times.  Any suggestion
>> > what might be a good testcase for that?
>>
>> I think a better question is when this would be useful in the first
>> place, and why. In other words: If you propagate hardregs before
>> shrink wrapping, what could be a source of new opportunities after
>> shrink wrapping?
>
> Yes, we also did that and neither I nor Honza could think of any
> potential problems there.  And of course, I'd also measure how many
> statements the second run of the pass changed.  I'll probably do that
> tomorrow anyway.
>
>>
>>
>> The only things I can think of, given the current pass order, are:
>>
>> * different basic block order due to shrink wrapping
>>  regcprop's effectiveness depends on the order of the basic blocks
>> (unfortunately)
>>
>> * different basic block contents due to head/tail merging (pass_jump2)
>>  Head/tail merging extends some basic blocks and shortens others. The
>> elongated basic blocks may present new opportunities (regcprop is a
>> local pass).
>>
>> * different basic block contents due to dead store elimination (pass_dse2)
>>  A removed dead store may also make an address calculation redundant,
>> changing the regcprop value chains.
>>
>> * different basic block contents due to peephole2
>>  A peephole2 may present new regcprop opportunities, peephole2 misses
>> the context to avoid trivial copies.
>>
>>
>> On the other hand, running regcprop earlier also helps some other
>> passes. For example, I think regcprop before jump2 may result in more
>> successful head/tail merging attempts by making more input operands
>> match, but it could hurt if_after_reload by extending live times of
>> registers.
>>
>>
>> But wouldn't it be better to avoid these argument-register pseudos
>> being assigned to callee-saved registers? Perhaps splitting the live
>> range of the pseudos before the first call on each path will do the
>> trick, and let IRA pick the right registers for you instead.
>
> First, where can I have a look how a live range is split?  ;-)

Insert a copy and adjust all dominated uses:

 (set (new-pseudo old-pseudo))

... adjust downstream uses of old-pseudo to use new-pseudo ...
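
In source-level terms the split would look roughly like this (an
illustrative sketch only; the real transformation operates on RTL
pseudos):

    void
    f (int *i)
    {
      if (!i)
        return;                  /* i can stay in the incoming argument
                                    register on this path.  */

      int *i2 = i;               /* new-pseudo = old-pseudo, inserted
                                    just before the call.  */
      __builtin_printf ("Hi");   /* Only i2 is live across the call, so
                                    only i2 competes for a callee-saved
                                    register.  */
      *i2 = 0;
    }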

> But second, if such a call is in a loop (or accessible by more than
> one path), I wouldn't it be wrong to do that?  To avoid that, I
> suppose I might end up doing another shrink-wrapping-like search for
> the right edge for prologue and actually coming up with a very similar
> result to the propagation and shrink-wrapping preparation.  But I'm
> willing to try.

I suppose splitting life-ranges in general before argument setup might
make sense - I see hardreg copyprop as a hack around limitations
in register allocation.  Note that life-range splitting is undone by
regular copy propagation.

ISTR IRA splits life-ranges in loop code, so there must be already
some helpers that could be used.  Vlad?

Thanks,
Richard.

> Thanks a lot,
>
> Martin
>
Vladimir Makarov April 19, 2013, 1:37 p.m. UTC | #6
On 13-04-19 4:21 AM, Richard Biener wrote:
> On Fri, Apr 19, 2013 at 1:08 AM, Martin Jambor <mjambor@suse.cz> wrote:
>> Hi,
>>
>> On Fri, Apr 19, 2013 at 12:37:58AM +0200, Steven Bosscher wrote:
>>> On Fri, Apr 19, 2013 at 12:09 AM, Martin Jambor wrote:
>>>> I also have not tried scheduling the hard register copy propagation
>>>> pass twice and measuring the impact on compile times.  Any suggestion
>>>> what might be a good testcase for that?
>>> I think a better question is when this would be useful in the first
>>> place, and why. In other words: If you propagate hardregs before
>>> shrink wrapping, what could be a source of new opportunities after
>>> shrink wrapping?
>> Yes, we also did that and neither I nor Honza could think of any
>> potential problems there.  And of course, I'd also measure how many
>> statements the second run of the pass changed.  I'll probably do that
>> tomorrow anyway.
>>
>>>
>>> The only things I can think of, given the current pass order, are:
>>>
>>> * different basic block order due to shrink wrapping
>>>   regcprop's effectiveness depends on the order of the basic blocks
>>> (unfortunately)
>>>
>>> * different basic block contents due to head/tail merging (pass_jump2)
>>>   Head/tail merging extends some basic blocks and shortens others. The
>>> elongated basic blocks may present new opportunities (regcprop is a
>>> local pass).
>>>
>>> * different basic block contents due to dead store elimination (pass_dse2)
>>>   A removed dead store may also make an address calculation redundant,
>>> changing the regcprop value chains.
>>>
>>> * different basic block contents due to peephole2
>>>   A peephole2 may present new regcprop opportunities, peephole2 misses
>>> the context to avoid trivial copies.
>>>
>>>
>>> On the other hand, running regcprop earlier also helps some other
>>> passes. For example, I think regcprop before jump2 may result in more
>>> successful head/tail merging attempts by making more input operands
>>> match, but it could hurt if_after_reload by extending live times of
>>> registers.
>>>
>>>
>>> But wouldn't it be better to avoid these argument-register pseudos
>>> being assigned to callee-saved registers? Perhaps splitting the live
>>> range of the pseudos before the first call on each path will do the
>>> trick, and let IRA pick the right registers for you instead.
>> First, where can I have a look how a live range is split?  ;-)
> Insert a copy and adjust all dominated uses:
>
>   (set (new-pseudo old-pseudo))
>
> ... adjust downstream uses of old-pseudo to use new-pseudo ...
>
>> But second, if such a call is in a loop (or accessible by more than
>> one path), I wouldn't it be wrong to do that?  To avoid that, I
>> suppose I might end up doing another shrink-wrapping-like search for
>> the right edge for prologue and actually coming up with a very similar
>> result to the propagation and shrink-wrapping preparation.  But I'm
>> willing to try.
> I suppose splitting life-ranges in general before argument setup might
> make sense - I see hardreg copyprop as a hack around limitations
> in register allocation.  Note that life-range splitting is undone by
> regular copy propagation.
>
> ISTR IRA splits life-ranges in loop code, so there must be already
> some helpers that could be used.  Vlad?
>
>
   I'd not recommend reusing this code, as the actual live-range
splitting is buried under a lot of code that modifies IRA's IR, since
we need the IR after live-range splitting.  Long ago I used splitting
and then rebuilding the IR, but building the IR is a very
time-consuming procedure (I guess 1/2 of IRA), therefore the current
solution is used.

There is an alternative simpler code for this in IRA.  The code was 
written by Bernd:

http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01531.html

   By the way, I have plans to do a separate register-pressure-decreasing
pass based on live-range shrinkage and rematerialization.  I found that
we need this, as more optimizations have a tendency to deal with this
issue by themselves.  Maybe I'll find time to do this this year (but
most probably not for GCC 4.9).  I am not even sure about the release
after GCC 4.9.
Jeff Law April 19, 2013, 3:27 p.m. UTC | #7
On 04/18/2013 05:08 PM, Martin Jambor wrote:
> Hi,
>
> On Fri, Apr 19, 2013 at 12:37:58AM +0200, Steven Bosscher wrote:
>> On Fri, Apr 19, 2013 at 12:09 AM, Martin Jambor wrote:
>>> I also have not tried scheduling the hard register copy propagation
>>> pass twice and measuring the impact on compile times.  Any suggestion
>>> what might be a good testcase for that?
>>
>> I think a better question is when this would be useful in the first
>> place, and why. In other words: If you propagate hardregs before
>> shrink wrapping, what could be a source of new opportunities after
>> shrink wrapping?
>
> Yes, we also did that and neither I nor Honza could think of any
> potential problems there.  And of course, I'd also measure how many
> statements the second run of the pass changed.  I'll probably do that
> tomorrow anyway.
I'd be very curious to see those numbers.  I tend to think the 
opportunities missed by just running it early will be in the noise and 
nothing we can or should do anything about given the compile-time cost 
of running it twice.  However, experience has shown it's worth doing the 
investigative work to be sure.

>>
>>
>> But wouldn't it be better to avoid these argument-register pseudos
>> being assigned to callee-saved registers? Perhaps splitting the live
>> range of the pseudos before the first call on each path will do the
>> trick, and let IRA pick the right registers for you instead.
Isn't one of the difficulties here that the pseudo might correspond to 
an argument that wasn't passed in a register?  Thus you need alias 
analysis to know if it's valid to sink the load?

At least that's one of the issues I recall when I looked at this a 
couple years ago.

If we constrain ourselves to just sinking argreg->pseudo copies then we 
can obviously avoid that problem.

Rather than necessarily looking at this as a range splitting problem, 
can it be looked as a sinking problem?  Ultimately what we want is to 
sink those annoying arg->pseudo setups.  It seems like it'd be a fairly 
simple dataflow problem to determine those points.

>
> First, where can I have a look how a live range is split?  ;-)
There's been several implementations through the years; none that I'd 
say is suitable for reuse.
>
> But second, if such a call is in a loop (or accessible by more than
> one path), I wouldn't it be wrong to do that?  To avoid that, I
> suppose I might end up doing another shrink-wrapping-like search for
> the right edge for prologue and actually coming up with a very similar
> result to the propagation and shrink-wrapping preparation.  But I'm
> willing to try.
>
> Thanks a lot,
>
> Martin
>
Jeff Law April 19, 2013, 3:35 p.m. UTC | #8
On 04/18/2013 04:09 PM, Martin Jambor wrote:
>
> I did not because I did not think it would be substantially faster
> than running the pass as-is twice.
I wasn't looking to improve compile-time performance, but at it from the 
standpoint of not losing optimizations.  If we can show that a second 
pass does virtually nothing then this becomes a moot question.

>
> What would very likely be a working alternative for shrink-wrapping is
> to have shrink-wrapping preparation invoke copyprop_hardreg_forward_1
> on the first BB and the few BBs it tries to move stuff across.  But of
> course that would be a bit ugly and so I think we should do it only if
> there is a reason not to move the pass (or schedule it twice).
Or a pass to sink just the moves out of the incoming argument registers. 
  From a dataflow standpoint that should be a relatively cheap problem 
to solve.

For each argument register you need to know the block where it's 
generated (usually the first block), the blocks which clobber it and 
the blocks which use it.

You then run a latest-insertion-point algorithm, or a variant thereof, 
to determine the latest safe insertion points.
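
As a minimal, hypothetical sketch of those per-block facts and of the
backward "copy still needed below this point" computation (plain C, not
GCC code; picking the actual latest insertion points from these facts,
the LCM-like step, is left out):

    #include <stdbool.h>
    #include <stdio.h>

    #define NBLOCKS 4  /* Diamond CFG: entry(0) -> {1,2}, 1 -> 3, 2 -> 3.  */

    static const int  nsucc[NBLOCKS]           = { 2, 1, 1, 0 };
    static const int  succ[NBLOCKS][2]         = { {1, 2}, {3, 0}, {3, 0}, {0, 0} };
    static const bool uses_pseudo[NBLOCKS]     = { false, true, false, false };
    static const bool clobbers_argreg[NBLOCKS] = { false, false, true, false };

    int
    main (void)
    {
      bool need_in[NBLOCKS] = { false };

      /* Backward fixed point: the arg->pseudo copy is still needed at the
         start of a block if the block itself uses the pseudo or if any
         successor still needs it.  */
      for (bool changed = true; changed;)
        {
          changed = false;
          for (int b = NBLOCKS - 1; b >= 0; b--)
            {
              bool need = uses_pseudo[b];
              for (int i = 0; i < nsucc[b]; i++)
                need = need || need_in[succ[b][i]];
              if (need != need_in[b])
                need_in[b] = need, changed = true;
            }
        }

      /* Blocks that never need the copy (here 2 and 3) can run without it;
         only the path through block 1 has to materialize it, provided the
         argument register is not clobbered on the way there.  */
      for (int b = 0; b < NBLOCKS; b++)
        printf ("block %d: need_in=%d clobbers_argreg=%d\n",
                b, need_in[b], clobbers_argreg[b]);
      return 0;
    }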

Jeff
Martin Jambor April 24, 2013, 6:24 p.m. UTC | #9
Hi,

On Fri, Apr 19, 2013 at 09:27:28AM -0600, Jeff Law wrote:
> On 04/18/2013 05:08 PM, Martin Jambor wrote:
> >On Fri, Apr 19, 2013 at 12:37:58AM +0200, Steven Bosscher wrote:
> >>On Fri, Apr 19, 2013 at 12:09 AM, Martin Jambor wrote:
> >>>I also have not tried scheduling the hard register copy propagation
> >>>pass twice and measuring the impact on compile times.  Any suggestion
> >>>what might be a good testcase for that?
> >>
> >>I think a better question is when this would be useful in the first
> >>place, and why. In other words: If you propagate hardregs before
> >>shrink wrapping, what could be a source of new opportunities after
> >>shrink wrapping?
> >
> >Yes, we also did that and neither I nor Honza could think of any
> >potential problems there.  And of course, I'd also measure how many
> >statements the second run of the pass changed.  I'll probably do that
> >tomorrow anyway.
> I'd be very curious to see those numbers.  While I tend to think the
> opportunities missed by just running it early will be in the noise
> and nothing we can or should do anything about given the
> compile-time cost of running it twice.  However, experience has
> shown it's worth doing the investigative work to be sure.
> 

Here they are.  First, I simply looked at how many instructions would
be changed by a second run of the pass in its current position during
C and C++ bootstrap:

    |                                     | Insns changed |      % |
    |-------------------------------------+---------------+--------|
    | Trunk - only pass in original place |        172608 | 100.00 |
    | First pass before pro/epilogue      |        170322 |  98.68 |
    | Second pass in the original place   |          8778 |   5.09 |

5% was worth investigating more.  The 20 source files with the highest
number of instructions affected by the second run were:

      939 mine/src/libgcc/config/libbid/bid_binarydecimal.c
      909 mine/src/libgcc/config/libbid/bid128_div.c
      813 mine/src/libgcc/config/libbid/bid64_div.c
      744 mine/src/libgcc/config/libbid/bid128_compare.c
      615 mine/src/libgcc/config/libbid/bid128_to_int32.c
      480 mine/src/libgcc/config/libbid/bid128_to_int64.c
      450 mine/src/libgcc/config/libbid/bid128_to_uint32.c
      408 mine/src/libgcc/config/libbid/bid128_fma.c
      354 mine/src/libgcc/config/libbid/bid128_to_uint64.c
      327 mine/src/libgcc/config/libbid/bid128_add.c
      246 mine/src/libgcc/libgcc2.c
      141 mine/src/libgcc/config/libbid/bid_round.c
      129 mine/src/libgcc/config/libbid/bid64_mul.c
      117 mine/src/libgcc/config/libbid/bid64_to_int64.c
       96 mine/src/libsanitizer/tsan/tsan_interceptors.cc
       96 mine/src/libgcc/config/libbid/bid64_compare.c
       87 mine/src/libgcc/config/libbid/bid128_noncomp.c
       84 mine/src/libgcc/config/libbid/bid64_to_bid128.c
       81 mine/src/libgcc/config/libbid/bid64_to_uint64.c
       63 mine/src/libgcc/config/libbid/bid64_to_int32.c

I have manually examined some of the late opportunities for
propagation in mine/src/libgcc/config/libbid/bid_binarydecimal.c and
the majority of them were a result of peephole2.

Still, the list of files showed that the config sources of libraries
which might have been built too many times (I do not know how many, but
for example I had multilib enabled, which changes things a lot)
probably skew the numbers a lot.

So next time I measured only the number of instructions changed during
make stage2-bubble with multilib disabled.  In order to find out where
the new opportunities come from, I scheduled
pass_cprop_hardreg after every pass between
pass_branch_target_load_optimize1 and pass_fast_rtl_dce and counted
how many instructions are modified (relative to just having the pass
where it is now):

    |                                          | Insns changed |      % |
    |------------------------------------------+---------------+--------|
    | Trunk - only pass in original place      |         76225 | 100.00 |
    |------------------------------------------+---------------+--------|
    | Before pro/epilogue                      |         77906 | 102.21 |
    | After pro/epilogue                       |           267 |   0.35 |
    | After pass_rtl_dse2                      |             0 |   0.00 |
    | After pass_stack_adjustments             |             0 |   0.00 |
    | After pass_jump2                         |           372 |   0.49 |
    | After pass_peephole2                     |           119 |   0.16 |
    | After pass_if_after_reload               |            37 |   0.05 |
    | After pass_regrename - original position |             0 |   0.00 |

This seems much better.  The 12 source files with the most instructions
changed now were:

      116 src/libgcc/libgcc2.c
       64 src/libsanitizer/tsan/tsan_interceptors.cc
       36 src/libsanitizer/tsan/tsan_fd.cc
       31 src/gcc/cp/parser.c
       20 cp-demangle.c
       19 src/libiberty/cp-demangle.c
       12 gtype-desc.c
       12 src/libgcc/unwind-dw2.c
       11 src/gcc/config/i386/i386.c
       10 src/gcc/gengtype.c
       10 src/gcc/dwarf2out.c
        9 src/gcc/fold-const.c

I'm not sure what the conclusion is.  Probably that there are cases
where doing propagation late can be a good thing but these do not
occur that often.  And that more measurements should probably be done.
Anyway, I'll look into alternatives (see below) before pushing this
further.

By the way, scheduling pass_cprop_hardreg after pass_jump2 or
pass_peephole2 (or both) and doing a full bootstrap leads to bootstrap
miscompares.  I have not examined why.



> >>
> >>
> >>But wouldn't it be better to avoid these argument-register pseudos
> >>being assigned to callee-saved registers? Perhaps splitting the live
> >>range of the pseudos before the first call on each path will do the
> >>trick, and let IRA pick the right registers for you instead.
> Isn't one of the difficulties here that the pseudo might correspond
> to an argument that wasn't passed in a register?  Thus you need
> alias analysis to know if it's valid to sink the load?
> 
> At least that's one of the issues I recall when I looked at this a
> couple years ago.
> 
> If we constrain ourselves to just sinking argreg->pseudo copies then
> we can obviously avoid that problem.
> 
> Rather than necessarily looking at this as a range splitting
> problem, can it be looked as a sinking problem?  Ultimately what we
> want is to sink those annoying arg->pseudo setups.  It seems like
> it'd be a fairly simple dataflow problem to determine those points.
> 
> >
> >First, where can I have a look how a live range is split?  ;-)
> There's been several implementations through the years; none that
> I'd say is suitable for reuse.

I have looked at the patch Vlad suggested (most things are new to me
in RTL land and so almost everything takes me ages) and I'm certainly
willing to try and mimic some of it in order to (hopefully) get the
same effect that the propagation and the moves done by shrink-wrapping
preparation achieve.  Yes, this is not enough to deal with parameters
loaded from the stack, but unlike latest insertion, it could also work
when the parameters are also used on the fast path, which is often the
case.  In fact, propagation helps exactly because they are used in the
entry BB.  Hopefully they will end up in a caller-saved register on the
fast path and we'll flip them over to the problematic callee-saved one
only on (slow) paths going through calls.

Of course, the two approaches are not mutually exclusive and load
sinking might help too.

Thanks a lot for all suggestions,

Martin
Jeff Law April 24, 2013, 7:29 p.m. UTC | #10
On 04/24/2013 12:24 PM, Martin Jambor wrote:
>
> Here they are.  First, I simply looked at how many instructions would
> be changed by a second run of the pass in its current position during
> C and C++ bootstrap:
>
>      |                                     | Insns changed |      % |
>      |-------------------------------------+---------------+--------|
>      | Trunk - only pass in original place |        172608 | 100.00 |
>      | First pass before pro/eipilogue     |        170322 |  98.68 |
>      | Second pass in the original place   |          8778 |   5.09 |
>
> 5% was worth investigating more.  The 20 source files with highest
> number of affected instructions by the second run were:
>
>        939 mine/src/libgcc/config/libbid/bid_binarydecimal.c
>        909 mine/src/libgcc/config/libbid/bid128_div.c
>        813 mine/src/libgcc/config/libbid/bid64_div.c
>        744 mine/src/libgcc/config/libbid/bid128_compare.c
>        615 mine/src/libgcc/config/libbid/bid128_to_int32.c
>        480 mine/src/libgcc/config/libbid/bid128_to_int64.c
>        450 mine/src/libgcc/config/libbid/bid128_to_uint32.c
>        408 mine/src/libgcc/config/libbid/bid128_fma.c
>        354 mine/src/libgcc/config/libbid/bid128_to_uint64.c
>        327 mine/src/libgcc/config/libbid/bid128_add.c
>        246 mine/src/libgcc/libgcc2.c
>        141 mine/src/libgcc/config/libbid/bid_round.c
>        129 mine/src/libgcc/config/libbid/bid64_mul.c
>        117 mine/src/libgcc/config/libbid/bid64_to_int64.c
>         96 mine/src/libsanitizer/tsan/tsan_interceptors.cc
>         96 mine/src/libgcc/config/libbid/bid64_compare.c
>         87 mine/src/libgcc/config/libbid/bid128_noncomp.c
>         84 mine/src/libgcc/config/libbid/bid64_to_bid128.c
>         81 mine/src/libgcc/config/libbid/bid64_to_uint64.c
>         63 mine/src/libgcc/config/libbid/bid64_to_int32.c
The first thing that jumps out at me here is there's probably some idiom 
used in the BID code that is triggering this.

> I have manually examined some of the late opportunities for
> propagation in mine/src/libgcc/config/libbid/bid_binarydecimal.c and
> majority of them was a result of peephole2.
I can pretty easily see how peep2 may expose opportunities for 
hard-cprop.  Of course, those opportunities may actually be undoing some 
of the benefit of the peep2 patterns.



>
> So next time I measured only the number of instructions changed during
> make stage2-bubble with multilib disabled.  In order to find out where
> do the new opportunities come from, I added scheduled
> pass_cprop_hardreg after every pass between
> pass_branch_target_load_optimize1 and pass_fast_rtl_dce and counted
> how many instructions are modified (relative to just having the pass
> where it is now):
Thanks.  That's a real interesting hunk of data.  Interesting that we 
have so many after {pro,epi}logue generation, a full 33% of the changed 
insns stem from here and I can't think of why that should be the case. 
Perhaps there's some second order effect that shows itself after the 
first pass of cprop-hardreg.

I can see several ways jump2 could open new propagation possibilities. 
As I noted earlier in this message, the opportunities after peep2 may 
actually be doing more harm than good.

It's probably not worth the work involved, but a more sensible 
visitation order for reg-cprop would probably be good.  Similarly we 
could have the capability to mark interesting blocks and just reg-cprop 
the interesting blocks after threading the prologue/epilogue.

>
> I'm not sure what the conclusion is.  Probably that there are cases
> where doing propagation late can be a good thing but these do not
> occur that often.  And that more measurements should probably be done.
> Anyway, I'll look into alternatives before (see below) pushing this
> further.
Knowing more about those opportunities would be useful.  The most 
interesting ones to me would be those right after the prologue/epilogue. 
  Having just run the cprop, then attached the prologue/epilogue, I 
wouldn't expect there to be many propagation opportunities.

>
> I have looked at the patch Vlad suggested (most things are new to me
> in RTL land and so almost everything takes me ages) and I'm certainly
> willing to try and mimic some of it in order to (hopefully) get the
> same effect that propagating and shrink-wrapping preparation moves can
> do.  Yes, this is not enough to deal with parameters loaded from stack
> but unlike latest insertion, it could also work when the parameters
> are also used on the fast path, which is often the case.  In fact,
> propagation helps exactly because they are used in the entry BB.
> Hopefully they will end up in a caller-saved register on the fast path
> and we'll flip it over to the callee-saved problematic one only on
> (slow) paths going through calls.
>
> Of course, the two approaches are not mutually exclusive and load
> sinking might help too.
Note that copy sinking is formulated as sinking copies one at a time in 
Morgan's text.  Not sure that's needed in this case since we're just 
sinking a few well-defined copies.

And I agree, the approaches are not mutually exclusive; sinking a load 
out of the prologue and out of a hot path has a lot of value.  But 
sinking the loads is much more constrained than just sinking the 
argument copies.

jeff

Patch

Index: src/gcc/passes.c
===================================================================
--- src.orig/gcc/passes.c
+++ src/gcc/passes.c
@@ -1630,6 +1630,7 @@  init_optimization_passes (void)
 	  NEXT_PASS (pass_ree);
 	  NEXT_PASS (pass_compare_elim_after_reload);
 	  NEXT_PASS (pass_branch_target_load_optimize1);
+	  NEXT_PASS (pass_cprop_hardreg);
 	  NEXT_PASS (pass_thread_prologue_and_epilogue);
 	  NEXT_PASS (pass_rtl_dse2);
 	  NEXT_PASS (pass_stack_adjustments);
@@ -1637,7 +1638,6 @@  init_optimization_passes (void)
 	  NEXT_PASS (pass_peephole2);
 	  NEXT_PASS (pass_if_after_reload);
 	  NEXT_PASS (pass_regrename);
-	  NEXT_PASS (pass_cprop_hardreg);
 	  NEXT_PASS (pass_fast_rtl_dce);
 	  NEXT_PASS (pass_reorder_blocks);
 	  NEXT_PASS (pass_branch_target_load_optimize2);
Index: src/gcc/testsuite/gcc.dg/pr10474.c
===================================================================
--- /dev/null
+++ src/gcc/testsuite/gcc.dg/pr10474.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-rtl-pro_and_epilogue"  } */
+
+void f(int *i)
+{
+	if (!i)
+		return;
+	else
+	{
+		__builtin_printf("Hi");
+		*i=0;
+	}
+}
+
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue"  } } */
+/* { dg-final { cleanup-rtl-dump "pro_and_epilogue" } } */