Message ID: 20130417154935.GC3656@virgil.suse
State: New
On 04/17/2013 09:49 AM, Martin Jambor wrote:
> The reason why it helps so much is that before register allocation
> there are instructions moving the value of actual arguments from the
> "originally hard" register (e.g. SI, DI, etc.) to a pseudo at the
> beginning of each function. When the argument is live across a
> function call, the pseudo is likely to be assigned to a callee-saved
> register and then also accessed from that register, even in the first
> BB, making it require a prologue, though it could be fetched from the
> original one. When we convert all uses (at least in the first BB) to
> the original register, the preparatory stage of shrink wrapping is
> often capable of moving the register moves to a later BB, thus
> creating fast paths which do not require a prologue and epilogue.

I noticed similar effects when looking at range splitting. Being able to move those moves into a deeper control level in the CFG would definitely be an improvement.

> We believe this change in the pipeline should not bring about any
> negative effects. During GCC bootstrap, the number of instructions
> changed by pass_cprop_hardreg dropped, but only by 1.2%. We have also
> run SPEC 2006 CPU benchmarks on recent Intel and AMD hardware and all
> run-time differences could be attributed to noise. The changes in
> binary sizes were also small:

Did anyone ponder just doing the hard register propagation on argument registers prior to the prologue/epilogue handling, then the full-blown propagation pass in its current location in the pipeline? That would get you the benefit you're seeking and minimize other effects.

Of course, if you try that and get effectively the same results as moving the full propagation pass before prologue/epilogue handling, then the complexity of only propagating argument registers early is clearly not needed and we'd probably want to go with your patch as-is.

jeff
Hi,

On Wed, Apr 17, 2013 at 12:43:59PM -0600, Jeff Law wrote:
> Did anyone ponder just doing the hard register propagation on
> argument registers prior to the prologue/epilogue handling, then the
> full-blown propagation pass in its current location in the pipeline?

I did not, because I did not think it would be substantially faster than running the pass as-is twice. I may be wrong, but it would still have to look at all insns and examine them at a very similar level of detail (to look for clobbers and manage the value_data_entry chains), and it would not really do that much less work fiddling with its own data structures.

What would very likely be a working alternative for shrink-wrapping is to have the shrink-wrapping preparation invoke copyprop_hardreg_forward_1 on the first BB and the few BBs it tries to move stuff across. But of course that would be a bit ugly, and so I think we should do it only if there is a reason not to move the pass (or schedule it twice).

I also have not tried scheduling the hard register copy propagation pass twice and measuring the impact on compile times. Any suggestion what might be a good testcase for that?

Thanks,

Martin
On Fri, Apr 19, 2013 at 12:09 AM, Martin Jambor wrote:
> I also have not tried scheduling the hard register copy propagation
> pass twice and measuring the impact on compile times. Any suggestion
> what might be a good testcase for that?

I think a better question is when this would be useful in the first place, and why. In other words: if you propagate hardregs before shrink wrapping, what could be a source of new opportunities after shrink wrapping?

The only things I can think of, given the current pass order, are:

* different basic block order due to shrink wrapping
  regcprop's effectiveness depends on the order of the basic blocks
  (unfortunately)

* different basic block contents due to head/tail merging (pass_jump2)
  Head/tail merging extends some basic blocks and shortens others. The
  elongated basic blocks may present new opportunities (regcprop is a
  local pass).

* different basic block contents due to dead store elimination (pass_dse2)
  A removed dead store may also make an address calculation redundant,
  changing the regcprop value chains.

* different basic block contents due to peephole2
  A peephole2 may present new regcprop opportunities; peephole2 lacks
  the context to avoid trivial copies.

On the other hand, running regcprop earlier also helps some other passes. For example, I think regcprop before jump2 may result in more successful head/tail merging attempts by making more input operands match, but it could hurt if_after_reload by extending the lifetimes of registers.

But wouldn't it be better to avoid these argument-register pseudos being assigned to callee-saved registers? Perhaps splitting the live range of the pseudos before the first call on each path will do the trick, and let IRA pick the right registers for you instead.

Ciao!
Steven
Hi,

On Fri, Apr 19, 2013 at 12:37:58AM +0200, Steven Bosscher wrote:
> I think a better question is when this would be useful in the first
> place, and why. In other words: if you propagate hardregs before
> shrink wrapping, what could be a source of new opportunities after
> shrink wrapping?

Yes, we also considered that, and neither I nor Honza could think of any potential problems there. And of course, I'd also measure how many insns the second run of the pass changed. I'll probably do that tomorrow anyway.

> But wouldn't it be better to avoid these argument-register pseudos
> being assigned to callee-saved registers? Perhaps splitting the live
> range of the pseudos before the first call on each path will do the
> trick, and let IRA pick the right registers for you instead.

First, where can I have a look at how a live range is split? ;-)

But second, if such a call is in a loop (or reachable by more than one path), wouldn't it be wrong to do that? To avoid that, I suppose I might end up doing another shrink-wrapping-like search for the right edge for the prologue, and actually coming up with a very similar result to the propagation and shrink-wrapping preparation. But I'm willing to try.

Thanks a lot,

Martin
On Fri, Apr 19, 2013 at 1:08 AM, Martin Jambor <mjambor@suse.cz> wrote:
> First, where can I have a look at how a live range is split? ;-)

Insert a copy and adjust all dominated uses:

  (set (new-pseudo old-pseudo))

  ... adjust downstream uses of old-pseudo to use new-pseudo ...

> But second, if such a call is in a loop (or reachable by more than
> one path), wouldn't it be wrong to do that? To avoid that, I suppose
> I might end up doing another shrink-wrapping-like search for the
> right edge for the prologue, and actually coming up with a very
> similar result to the propagation and shrink-wrapping preparation.
> But I'm willing to try.

I suppose splitting live ranges in general before argument setup might make sense - I see hardreg copyprop as a hack around limitations in register allocation. Note that live-range splitting is undone by regular copy propagation.

ISTR IRA splits live ranges in loop code, so there must already be some helpers that could be used. Vlad?

Thanks,
Richard.
On 13-04-19 4:21 AM, Richard Biener wrote:
> ISTR IRA splits live ranges in loop code, so there must already be
> some helpers that could be used. Vlad?

I'd not recommend reusing this code: the actual live-range splitting is buried under a lot of code that modifies IRA's IR, because we need the IR to stay valid after the splitting. Long ago I used to split and then rebuild the IR, but building the IR is a very time-consuming procedure (I guess 1/2 of IRA), therefore the current solution is used.

There is an alternative, simpler implementation of this in IRA. The code was written by Bernd:

http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01531.html

By the way, I have plans to do a separate register-pressure-decreasing pass based on live-range shrinkage and rematerialization. I found that we need this, as more and more optimizations tend to deal with this issue by themselves. Maybe I'll find time to do this this year (but most probably not for GCC 4.9; I am not sure even about the release after GCC 4.9).
On 04/18/2013 05:08 PM, Martin Jambor wrote:
> Yes, we also considered that, and neither I nor Honza could think of
> any potential problems there. And of course, I'd also measure how
> many insns the second run of the pass changed. I'll probably do that
> tomorrow anyway.

I'd be very curious to see those numbers. I tend to think the opportunities missed by just running it early will be in the noise - nothing we can or should do anything about, given the compile-time cost of running it twice. However, experience has shown it's worth doing the investigative work to be sure.

> But wouldn't it be better to avoid these argument-register pseudos
> being assigned to callee-saved registers? Perhaps splitting the live
> range of the pseudos before the first call on each path will do the
> trick, and let IRA pick the right registers for you instead.

Isn't one of the difficulties here that the pseudo might correspond to an argument that wasn't passed in a register? Thus you'd need alias analysis to know whether it's valid to sink the load. At least that's one of the issues I recall from when I looked at this a couple of years ago.

If we constrain ourselves to just sinking argreg->pseudo copies, then we can obviously avoid that problem.

Rather than necessarily looking at this as a range-splitting problem, can it be looked at as a sinking problem? Ultimately what we want is to sink those annoying arg->pseudo setups. It seems like it'd be a fairly simple dataflow problem to determine those points.

> First, where can I have a look at how a live range is split? ;-)

There have been several implementations through the years; none that I'd say is suitable for reuse.
On 04/18/2013 04:09 PM, Martin Jambor wrote:
> I did not, because I did not think it would be substantially faster
> than running the pass as-is twice.

I wasn't looking to improve compile-time performance; I was looking at it from the standpoint of not losing optimizations. If we can show that a second pass does virtually nothing, then this becomes a moot question.

> What would very likely be a working alternative for shrink-wrapping
> is to have the shrink-wrapping preparation invoke
> copyprop_hardreg_forward_1 on the first BB and the few BBs it tries
> to move stuff across. But of course that would be a bit ugly, and so
> I think we should do it only if there is a reason not to move the
> pass (or schedule it twice).

Or a pass to sink just the moves out of the incoming argument registers. From a dataflow standpoint that should be a relatively cheap problem to solve. For each argument register you need to know the block where it's generated (usually the first block), the blocks which clobber it, and the blocks which use it. You then run a latest-insertion-point algorithm, or a variant thereof, to determine the latest safe insertion points.

Jeff
Hi,

On Fri, Apr 19, 2013 at 09:27:28AM -0600, Jeff Law wrote:
> I'd be very curious to see those numbers. While I tend to think the
> opportunities missed by just running it early will be in the noise,
> experience has shown it's worth doing the investigative work to be
> sure.

Here they are. First, I simply looked at how many instructions would be changed by a second run of the pass in its current position during a C and C++ bootstrap:

|                                     | Insns changed |      % |
|-------------------------------------+---------------+--------|
| Trunk - only pass in original place |        172608 | 100.00 |
| First pass before pro/epilogue      |        170322 |  98.68 |
| Second pass in the original place   |          8778 |   5.09 |

5% was worth investigating more.
The 20 source files with the highest number of instructions affected by the second run were:

  939 mine/src/libgcc/config/libbid/bid_binarydecimal.c
  909 mine/src/libgcc/config/libbid/bid128_div.c
  813 mine/src/libgcc/config/libbid/bid64_div.c
  744 mine/src/libgcc/config/libbid/bid128_compare.c
  615 mine/src/libgcc/config/libbid/bid128_to_int32.c
  480 mine/src/libgcc/config/libbid/bid128_to_int64.c
  450 mine/src/libgcc/config/libbid/bid128_to_uint32.c
  408 mine/src/libgcc/config/libbid/bid128_fma.c
  354 mine/src/libgcc/config/libbid/bid128_to_uint64.c
  327 mine/src/libgcc/config/libbid/bid128_add.c
  246 mine/src/libgcc/libgcc2.c
  141 mine/src/libgcc/config/libbid/bid_round.c
  129 mine/src/libgcc/config/libbid/bid64_mul.c
  117 mine/src/libgcc/config/libbid/bid64_to_int64.c
   96 mine/src/libsanitizer/tsan/tsan_interceptors.cc
   96 mine/src/libgcc/config/libbid/bid64_compare.c
   87 mine/src/libgcc/config/libbid/bid128_noncomp.c
   84 mine/src/libgcc/config/libbid/bid64_to_bid128.c
   81 mine/src/libgcc/config/libbid/bid64_to_uint64.c
   63 mine/src/libgcc/config/libbid/bid64_to_int32.c

I have manually examined some of the late opportunities for propagation in mine/src/libgcc/config/libbid/bid_binarydecimal.c, and the majority of them were a result of peephole2. Still, the list of files showed that the config sources of libraries which might have been built too many times (I do not know how many, but for example I had multilib enabled, which changes things a lot) probably skew the numbers a lot. So next time I measured only the number of instructions changed during make stage2-bubble with multilib disabled.
In order to find out where the new opportunities come from, I scheduled pass_cprop_hardreg after every pass between pass_branch_target_load_optimize1 and pass_fast_rtl_dce and counted how many instructions were modified (relative to just having the pass where it is now):

|                                          | Insns changed |      % |
|------------------------------------------+---------------+--------|
| Trunk - only pass in original place      |         76225 | 100.00 |
|------------------------------------------+---------------+--------|
| Before pro/epilogue                      |         77906 | 102.21 |
| After pro/epilogue                       |           267 |   0.35 |
| After pass_rtl_dse2                      |             0 |   0.00 |
| After pass_stack_adjustments             |             0 |   0.00 |
| After pass_jump2                         |           372 |   0.49 |
| After pass_peephole2                     |           119 |   0.16 |
| After pass_if_after_reload               |            37 |   0.05 |
| After pass_regrename - original position |             0 |   0.00 |

Which seems much better. The 12 source files with the most instructions changed now were:

  116 src/libgcc/libgcc2.c
   64 src/libsanitizer/tsan/tsan_interceptors.cc
   36 src/libsanitizer/tsan/tsan_fd.cc
   31 src/gcc/cp/parser.c
   20 cp-demangle.c
   19 src/libiberty/cp-demangle.c
   12 gtype-desc.c
   12 src/libgcc/unwind-dw2.c
   11 src/gcc/config/i386/i386.c
   10 src/gcc/gengtype.c
   10 src/gcc/dwarf2out.c
    9 src/gcc/fold-const.c

I'm not sure what the conclusion is. Probably that there are cases where doing the propagation late can be a good thing, but these do not occur that often. And that more measurements should probably be done. Anyway, I'll look into alternatives (see below) before pushing this further.

By the way, scheduling pass_cprop_hardreg after pass_jump2 or pass_peephole2 (or both) and doing a full bootstrap leads to bootstrap miscompares. I have not examined why.

> But wouldn't it be better to avoid these argument-register pseudos
> being assigned to callee-saved registers? Perhaps splitting the live
> range of the pseudos before the first call on each path will do the
> trick, and let IRA pick the right registers for you instead.
> Isn't one of the difficulties here that the pseudo might correspond
> to an argument that wasn't passed in a register? Thus you'd need
> alias analysis to know whether it's valid to sink the load.
>
> If we constrain ourselves to just sinking argreg->pseudo copies,
> then we can obviously avoid that problem.
>
> Rather than necessarily looking at this as a range-splitting
> problem, can it be looked at as a sinking problem? Ultimately what
> we want is to sink those annoying arg->pseudo setups. It seems like
> it'd be a fairly simple dataflow problem to determine those points.

I have looked at the patch Vlad suggested (most things in RTL land are new to me, so almost everything takes me ages) and I'm certainly willing to try to mimic some of it in order to (hopefully) get the same effect that the propagation plus the shrink-wrapping preparation moves can achieve. Yes, this is not enough to deal with parameters loaded from the stack, but unlike latest insertion, it could also work when the parameters are used on the fast path, which is often the case. In fact, the propagation helps exactly because they are used in the entry BB. Hopefully they will end up in a caller-saved register on the fast path, and we'll flip them over to the problematic callee-saved one only on (slow) paths going through calls.

Of course, the two approaches are not mutually exclusive, and load sinking might help too.

Thanks a lot for all the suggestions,

Martin
On 04/24/2013 12:24 PM, Martin Jambor wrote:
> Here they are. First, I simply looked at how many instructions would
> be changed by a second run of the pass in its current position during
> a C and C++ bootstrap:
>
> [table snipped]
>
> 5% was worth investigating more. The 20 source files with the highest
> number of instructions affected by the second run were:
>
> [file list snipped]

The first thing that jumps out at me here is that there's probably some idiom used in the BID code that is triggering this.

> I have manually examined some of the late opportunities for
> propagation in mine/src/libgcc/config/libbid/bid_binarydecimal.c,
> and the majority of them were a result of peephole2.

I can pretty easily see how peep2 may expose opportunities for hard-cprop.
Of course, those opportunities may actually be undoing some of the benefit of the peep2 patterns.

> So next time I measured only the number of instructions changed
> during make stage2-bubble with multilib disabled. In order to find
> out where the new opportunities come from, I scheduled
> pass_cprop_hardreg after every pass between
> pass_branch_target_load_optimize1 and pass_fast_rtl_dce and counted
> how many instructions were modified (relative to just having the
> pass where it is now):
>
> [table snipped]

Thanks. That's a real interesting hunk of data. Interesting that we have so many after {pro,epi}logue generation - a full 33% of the remaining changed insns stem from there, and I can't think of why that should be the case. Perhaps there's some second-order effect that shows itself after the first pass of cprop-hardreg.

I can see several ways jump2 could open new propagation possibilities. As I noted earlier in this message, the opportunities after peep2 may actually be doing more harm than good.

It's probably not worth the work involved, but a more sensible visitation order for reg-cprop would probably be good. Similarly, we could have the capability to mark interesting blocks and just reg-cprop the interesting blocks after threading the prologue/epilogue.

> I'm not sure what the conclusion is. Probably that there are cases
> where doing the propagation late can be a good thing, but these do
> not occur that often. And that more measurements should probably be
> done. Anyway, I'll look into alternatives (see below) before pushing
> this further.

Knowing more about those opportunities would be useful. The most interesting ones to me would be those right after the prologue/epilogue. Having just run the cprop, then attached the prologue/epilogue, I wouldn't expect there to be many propagation opportunities.
> I have looked at the patch Vlad suggested (most things in RTL land
> are new to me, so almost everything takes me ages) and I'm certainly
> willing to try to mimic some of it in order to (hopefully) get the
> same effect that the propagation plus the shrink-wrapping
> preparation moves can achieve.
> [...]
> Of course, the two approaches are not mutually exclusive, and load
> sinking might help too.

Note that copy sinking is formulated as sinking copies one at a time in Morgan's text. I'm not sure that's needed in this case, since we're just sinking a few well-defined copies.

And I agree, the approaches are not mutually exclusive; sinking a load out of the prologue and out of a hot path has a lot of value. But sinking the loads is much more constrained than just sinking the argument copies.

jeff
Index: src/gcc/passes.c
===================================================================
--- src.orig/gcc/passes.c
+++ src/gcc/passes.c
@@ -1630,6 +1630,7 @@ init_optimization_passes (void)
       NEXT_PASS (pass_ree);
       NEXT_PASS (pass_compare_elim_after_reload);
       NEXT_PASS (pass_branch_target_load_optimize1);
+      NEXT_PASS (pass_cprop_hardreg);
       NEXT_PASS (pass_thread_prologue_and_epilogue);
       NEXT_PASS (pass_rtl_dse2);
       NEXT_PASS (pass_stack_adjustments);
@@ -1637,7 +1638,6 @@ init_optimization_passes (void)
       NEXT_PASS (pass_peephole2);
       NEXT_PASS (pass_if_after_reload);
       NEXT_PASS (pass_regrename);
-      NEXT_PASS (pass_cprop_hardreg);
       NEXT_PASS (pass_fast_rtl_dce);
       NEXT_PASS (pass_reorder_blocks);
       NEXT_PASS (pass_branch_target_load_optimize2);
Index: src/gcc/testsuite/gcc.dg/pr10474.c
===================================================================
--- /dev/null
+++ src/gcc/testsuite/gcc.dg/pr10474.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-rtl-pro_and_epilogue" } */
+
+void f(int *i)
+{
+  if (!i)
+    return;
+  else
+    {
+      __builtin_printf("Hi");
+      *i=0;
+    }
+}
+
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" } } */
+/* { dg-final { cleanup-rtl-dump "pro_and_epilogue" } } */