Message ID | e21c6260ba313fd4905f00fa0da29f92af14b13e.1470015604.git.segher@kernel.crashing.org |
---|---|
State | New |
Headers | show |
On 07/31/2016 07:42 PM, Segher Boessenkool wrote: > > Deciding what blocks should run with a certain component active so that > the total cost of executing the prologues (and epilogues) is optimal, is > not a computationally feasible problem. Really? It's just a dataflow problem is it not and one that ought to converge very quickly I'd think. Or is it more a function of having to run it on so many independent components? I'm still pondering the value of having every GPR be an independent component :-) ISTM this adds a fair amount of complexity to the implementation in that prologues and epilogues for a particular component can run more than once. Can you give a concrete example where this happens so that we can all understand it better? If we keep this aspect of the implementation it seems like a note in the developer section would be in order. It's certainly non-intuitive. I only glanced over the code that seems related to this aspect of the implementation. If we decide to go forward, I'd like to look at it again more closely. > > Now all that is left is inserting prologues and epilogues on all edges > that jump into resp. out of the "active" set of blocks. Often we need > to insert some components' prologues (or epilogues) on all edges into > (or out of) a block. In theory cross-jumping can unify all such, but > in practice that often fails; besides, that is a lot of work. So in > this case we insert the prologue and epilogue components at the "head" > or "tail" of a block, instead. Cross jumping is rather simplistic, so I'm not surprised that it doesn't catch all this stuff. > > As a final optimisation, if a block needs a prologue and its immediate > dominator has the block as a post-dominator, the dominator gets the > prologue as well. So why not just put it in the idom and not in the dominated block? Doesn't this just end up running that component's prologue twice? > > 2016-06-07 Segher Boessenkool <segher@kernel.crashing.org> > > * function.c (thread_prologue_and_epilogue_insns): Recompute the > live info. Call try_shrink_wrapping_separate. Compute the > prologue_seq afterwards, if it has possibly changed. Compute the > split_prologue_seq and epilogue_seq later, too. > * shrink-wrap.c: #include cfgbuild.h. > (dump_components): New function. > (struct sw): New struct. > (SW): New function. > (init_separate_shrink_wrap): New function. > (fini_separate_shrink_wrap): New function. > (place_prologue_for_one_component): New function. > (spread_components): New function. > (disqualify_problematic_components): New function. > (emit_common_heads_for_components): New function. > (emit_common_tails_for_components): New function. > (insert_prologue_epilogue_for_components): New function. > (try_shrink_wrapping_separate): New function. > * shrink-wrap.h: Declare try_shrink_wrapping_separate. > --- > gcc/function.c | 15 +- > gcc/shrink-wrap.c | 715 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > gcc/shrink-wrap.h | 1 + > 3 files changed, 729 insertions(+), 2 deletions(-) > > diff --git a/gcc/function.c b/gcc/function.c > index bba0705..390e9a6 100644 > --- a/gcc/function.c > +++ b/gcc/function.c > @@ -5912,6 +5912,12 @@ make_epilogue_seq (void) > void > thread_prologue_and_epilogue_insns (void) > { > + if (optimize > 1) > + { > + df_live_add_problem (); > + df_live_set_all_dirty (); > + } Perhaps conditional on separate shrink wrapping? > @@ -5922,9 +5928,7 @@ thread_prologue_and_epilogue_insns (void) > edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun)); > edge orig_entry_edge = entry_edge; > > - rtx_insn *split_prologue_seq = make_split_prologue_seq (); > rtx_insn *prologue_seq = make_prologue_seq (); > - rtx_insn *epilogue_seq = make_epilogue_seq (); > > /* Try to perform a kind of shrink-wrapping, making sure the > prologue/epilogue is emitted only around those parts of the > @@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void) > > try_shrink_wrapping (&entry_edge, prologue_seq); > > + df_analyze (); > + try_shrink_wrapping_separate (entry_edge->dest); > + if (crtl->shrink_wrapped_separate) > + prologue_seq = make_prologue_seq (); Perhaps push the df_analyze call into try_shrink_wrapping_separate? ANd if it was successful, do you need to update the DF information? Is that perhaps the cause of some of the issues we're seeing with DCE, regrename, the scheduler? > diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c > index b85b1c3..643e375 100644 > --- a/gcc/shrink-wrap.c > +++ b/gcc/shrink-wrap.c > @@ -34,6 +34,7 @@ along with GCC; see the file COPYING3. If not see > #include "output.h" > #include "tree-pass.h" > #include "cfgrtl.h" > +#include "cfgbuild.h" > #include "params.h" > #include "bb-reorder.h" > #include "shrink-wrap.h" > @@ -1006,3 +1007,717 @@ try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq) > BITMAP_FREE (bb_with); > free_dominance_info (CDI_DOMINATORS); > } > + > +/* Separate shrink-wrapping > + > + Instead of putting all of the prologue and epilogue in one spot, we > + can put parts of it in places where those components are executed less > + frequently. The following code does this, for prologue and epilogue > + components that can be put in more than one location, and where those > + components can be executed more than once (the epilogue component will > + always be executed before the prologue component is executed a second > + time). > + > + What exactly is a component is target-dependent. The more usual > + components are simple saves/restores to/from the frame of callee-saved > + registers. This code treats components abstractly (as an sbitmap), > + letting the target handle all details. > + > + Prologue components are placed in such a way that for every component > + the prologue is executed as infrequently as possible. We do this by > + walking the dominator tree, comparing the cost of placing a prologue > + component before a block to the sum of costs determined for all subtrees > + of that block. > + > + From this placement, we then determine for each component all blocks > + where at least one of this block's dominators (including itself) will > + get a prologue inserted. That then is how the components are placed. > + We could place the epilogue components a bit smarter (we can save a > + bit of code size sometimes); this is a possible future improvement. > + > + Prologues and epilogues are preferably placed into a block, either at > + the beginning or end of it, if it is needed for all predecessor resp. > + successor edges; or placed on the edge otherwise. > + > + If the placement of any prologue/epilogue leads to a situation we cannot > + handle (for example, an abnormal edge would need to be split, or some > + targets want to use some specific registers that may not be available > + where we want to put them), separate shrink-wrapping for the components > + in that prologue/epilogue is aborted. */ > + > + > +/* Print the sbitmap COMPONENTS to the DUMP_FILE if not empty, with the > + label LABEL. */ > +static void > +dump_components (const char *label, sbitmap components) > +{ > + if (bitmap_empty_p (components)) > + return; > + > + fprintf (dump_file, " [%s", label); > + > + for (unsigned int j = 0; j < components->n_bits; j++) > + if (bitmap_bit_p (components, j)) > + fprintf (dump_file, " %u", j); > + > + fprintf (dump_file, "]"); Consider allowing the target to provide a mapping from the component to a symbolic name of some kind and using that in the dumps. Related, consider using an enum rather than magic constants in the target bits (I noted seeing component #0 as a magic constant in the ppc implementation). > + > +/* A helper function for accessing the pass-specific info. */ > +static inline struct sw * > +SW (basic_block bb) > +{ > + gcc_assert (bb->aux); > + return (struct sw *) bb->aux; > +} > + > +/* Create the pass-specific data structures for separately shrink-wrapping > + with components COMPONENTS. */ > +static void > +init_separate_shrink_wrap (sbitmap components) So it seems like there's a toplevel list of components that's owned by the target and state at each block components that are owned by the generic code. That's fine. Just make sure we doc that the toplevel list of components is allocated by the backend (and where does it get freed?) Consider using auto_sbitmap rather than manually managing allocation/releasing of the per-block structures. In fact, can't all of SW become a class and we lose the explicit init/fini routines in favor of a ctor/dtor? > + > +/* Place code for prologues and epilogues for COMPONENTS where we can put > + that code at the start of basic blocks. */ > +static void > +emit_common_heads_for_components (sbitmap components) > +{ > + sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components)); > + sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components)); > + sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components)); > + > + basic_block bb; > + FOR_EACH_BB_FN (bb, cfun) > + { > + /* Find which prologue resp. epilogue components are needed for all > + predecessor edges to this block. */ Nit. Avoid ".resp". I actually had to look that abbreviation up :-) Being a bit more verbose in the comments would be preferred over ".resp". Having looked at this in more detail now, I'm wondering if we want to talk at all in the docs about when selection vs disqualifying happens. ISTM that we build the set of components we want to insert for each edge/block. When that's complete, we then prune those results with the disqualifying sets. For the PPC R0 vs LR is the only thing that causes disqualification right? Can't that be handled when we build the set of components we want to insert for each edge/block? Is there some advantage to handling disqualifications after all the potential insertion points have been handled? Jeff
On Thu, Sep 08, 2016 at 12:34:07PM -0600, Jeff Law wrote: > On 07/31/2016 07:42 PM, Segher Boessenkool wrote: > > > >Deciding what blocks should run with a certain component active so that > >the total cost of executing the prologues (and epilogues) is optimal, is > >not a computationally feasible problem. > Really? It's just a dataflow problem is it not and one that ought to > converge very quickly I'd think. Or is it more a function of having to > run it on so many independent components? I'm still pondering the value > of having every GPR be an independent component :-) The cost function (as a function of which BBs runs with a certain component enabled) is not monotonic: the cost for having a component active for a subset of BBs can be higher, the same, or equal to the cost for all such nodes (where "cost" means "how often do we execute this prologue"). > >Now all that is left is inserting prologues and epilogues on all edges > >that jump into resp. out of the "active" set of blocks. Often we need > >to insert some components' prologues (or epilogues) on all edges into > >(or out of) a block. In theory cross-jumping can unify all such, but > >in practice that often fails; besides, that is a lot of work. So in > >this case we insert the prologue and epilogue components at the "head" > >or "tail" of a block, instead. > Cross jumping is rather simplistic, so I'm not surprised that it doesn't > catch all this stuff. I hoped it would, so I could have so much simpler code. Sniff. > >As a final optimisation, if a block needs a prologue and its immediate > >dominator has the block as a post-dominator, the dominator gets the > >prologue as well. > So why not just put it in the idom and not in the dominated block? That's what it does :-) > > void > > thread_prologue_and_epilogue_insns (void) > > { > >+ if (optimize > 1) > >+ { > >+ df_live_add_problem (); > >+ df_live_set_all_dirty (); > >+ } > Perhaps conditional on separate shrink wrapping? Actually, I think we need to do this once more, one of the times always (also when only using "regular" shrink-wrapping), because otherwise the info in the dump file is out of whack. Everything is okay once the next pass starts, of course. I'll have a look. > >@@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void) > > > > try_shrink_wrapping (&entry_edge, prologue_seq); > > > >+ df_analyze (); > >+ try_shrink_wrapping_separate (entry_edge->dest); > >+ if (crtl->shrink_wrapped_separate) > >+ prologue_seq = make_prologue_seq (); > Perhaps push the df_analyze call into try_shrink_wrapping_separate? Yeah possibly. > ANd if it was successful, do you need to update the DF information? Is > that perhaps the cause of some of the issues we're seeing with DCE, > regrename, the scheduler? See above. The DF info is correct when the next pass starts (or ends, I do not remember; the dump file does show correct information). > Consider allowing the target to provide a mapping from the component to > a symbolic name of some kind and using that in the dumps. Yeah that will help, once we do more than just GPRs anyway -- not everyone remembers the GCC register #s for all registers ;-) > Related, > consider using an enum rather than magic constants in the target bits (I > noted seeing component #0 as a magic constant in the ppc implementation). But 0 is the hard register used there :-) And since we now are C++, I cannot use enums as integers. Sigh. The meaning of "0" should of course be documented. > So it seems like there's a toplevel list of components that's owned by > the target and state at each block components that are owned by the > generic code. That's fine. Just make sure we doc that the toplevel > list of components is allocated by the backend (and where does it get > freed?) Every sbitmap is owned by the separate shrink-wrapping pass, not by the target. As the documentation says: +@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS (sbitma +Mark the components in the parameter as handled, so that the +@code{prologue} and @code{epilogue} named patterns know to ignore those +components. The target code should not hang on to the @code{sbitmap}, it +will be deleted after this call. +@end deftypefn > Consider using auto_sbitmap rather than manually managing > allocation/releasing of the per-block structures. In fact, can't all of > SW become a class and we lose the explicit init/fini routines in favor > of a ctor/dtor? Yes, you can always add indirection. I do not think the code becomes more readable that way (quite the opposite). Explicit is *good*. The init/fini code is small, and that is not an accident. > >+ /* Find which prologue resp. epilogue components are needed for all > >+ predecessor edges to this block. */ > Nit. Avoid ".resp". I actually had to look that abbreviation up :-) > Being a bit more verbose in the comments would be preferred over ".resp". "resp." is very common in mathematics papers. It means either "and" or "or", and it magically means either or both *and* exactly the thing you mean there. But I'll try to use it less often, sure. > Having looked at this in more detail now, I'm wondering if we want to > talk at all in the docs about when selection vs disqualifying happens. > ISTM that we build the set of components we want to insert for each > edge/block. When that's complete, we then prune those results with the > disqualifying sets. We decide which blocks should have which components active. From that, it is easy to derive on which edges we need a prologue or an epilogue. Now if the target says putting some prologue or epilogue on some certain edge will not work, we just don't do that component at all. It doesn't happen very often. In theory we could be smarter. > For the PPC R0 vs LR is the only thing that causes disqualification > right? Currently, yes. > Can't that be handled when we build the set of components we > want to insert for each edge/block? Is there some advantage to handling > disqualifications after all the potential insertion points have been > handled? We do not know if an edge needs a prologue, epilogue, or neither, until we have decided whether *both* ends of that edge want the component active or not. Segher
On 09/09/2016 03:57 PM, Segher Boessenkool wrote: > On Thu, Sep 08, 2016 at 12:34:07PM -0600, Jeff Law wrote: >> On 07/31/2016 07:42 PM, Segher Boessenkool wrote: >>> >>> Deciding what blocks should run with a certain component active so that >>> the total cost of executing the prologues (and epilogues) is optimal, is >>> not a computationally feasible problem. >> Really? It's just a dataflow problem is it not and one that ought to >> converge very quickly I'd think. Or is it more a function of having to >> run it on so many independent components? I'm still pondering the value >> of having every GPR be an independent component :-) > > The cost function (as a function of which BBs runs with a certain component > enabled) is not monotonic: the cost for having a component active for a > subset of BBs can be higher, the same, or equal to the cost for all such > nodes (where "cost" means "how often do we execute this prologue"). Understood. You covered this reasonably well in another reply. Thanks. >> Cross jumping is rather simplistic, so I'm not surprised that it doesn't >> catch all this stuff. > > I hoped it would, so I could have so much simpler code. Sniff. In general, I think our ability to identify and de-duplicate code is poor at best, especially at the RTL level. It's never been a major area of focus. So, yea. Sniff. > >>> As a final optimisation, if a block needs a prologue and its immediate >>> dominator has the block as a post-dominator, the dominator gets the >>> prologue as well. >> So why not just put it in the idom and not in the dominated block? > > That's what it does :-) Then I must have mis-parsed. Thanks for clarifying. > > >>> void >>> thread_prologue_and_epilogue_insns (void) >>> { >>> + if (optimize > 1) >>> + { >>> + df_live_add_problem (); >>> + df_live_set_all_dirty (); >>> + } >> Perhaps conditional on separate shrink wrapping? > > Actually, I think we need to do this once more, one of the times always > (also when only using "regular" shrink-wrapping), because otherwise the > info in the dump file is out of whack. Everything is okay once the next > pass starts, of course. I'll have a look. OK. It struck me as a bit odd, so just a verify on your side that it was intended is fine with me. > >>> @@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void) >>> >>> try_shrink_wrapping (&entry_edge, prologue_seq); >>> >>> + df_analyze (); >>> + try_shrink_wrapping_separate (entry_edge->dest); >>> + if (crtl->shrink_wrapped_separate) >>> + prologue_seq = make_prologue_seq (); >> Perhaps push the df_analyze call into try_shrink_wrapping_separate? > > Yeah possibly. Your call. > >> ANd if it was successful, do you need to update the DF information? Is >> that perhaps the cause of some of the issues we're seeing with DCE, >> regrename, the scheduler? > > See above. The DF info is correct when the next pass starts (or ends, > I do not remember; the dump file does show correct information). Hmm, then explain again why DCE is mucking up? I don't immediately see how EPILOGUE_BEG notes come into play with DCE. It seems to rely on the DF data and AFAICT DF only cares about the EPILOGUE_BEG note in can_move_insns_across which shouldn't be used by DCE. > >> Related, >> consider using an enum rather than magic constants in the target bits (I >> noted seeing component #0 as a magic constant in the ppc implementation). > > But 0 is the hard register used there :-) Oh, missed that. Nevermind :-) >> So it seems like there's a toplevel list of components that's owned by >> the target and state at each block components that are owned by the >> generic code. That's fine. Just make sure we doc that the toplevel >> list of components is allocated by the backend (and where does it get >> freed?) > > Every sbitmap is owned by the separate shrink-wrapping pass, not by the > target. As the documentation says: I'm specifically referring to: +@deftypefn {Target Hook} sbitmap TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS (void) +@deftypefn {Target Hook} sbitmap TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB (basic_block) Which in the rs6000 implementation allocate and return an sbitmap. I don't see anywhere we could reasonably free them in the target with the existing hooks, so it seems like they have to be free'd by the generic code. So ownership of those sbitmaps isn't particularly clear. > >> Consider using auto_sbitmap rather than manually managing >> allocation/releasing of the per-block structures. In fact, can't all of >> SW become a class and we lose the explicit init/fini routines in favor >> of a ctor/dtor? > > Yes, you can always add indirection. I do not think the code becomes > more readable that way (quite the opposite). Explicit is *good*. The GCC project is moving away from this kind of explicit allocation/deallocation and more towards a RAII. Unless there is a clear need for the explicit allocation/deallocation, please put this stuff into a class with an appropriate ctor/dtor. FWIW, I was a big opponent of how much stuff happens "behind your back" with some languages (including C++). But over the last few years my personal stance has softened considerably after seeing how cleanly RAII solves certain problems. > >> Having looked at this in more detail now, I'm wondering if we want to >> talk at all in the docs about when selection vs disqualifying happens. >> ISTM that we build the set of components we want to insert for each >> edge/block. When that's complete, we then prune those results with the >> disqualifying sets. > > We decide which blocks should have which components active. From that, > it is easy to derive on which edges we need a prologue or an epilogue. > Now if the target says putting some prologue or epilogue on some certain > edge will not work, we just don't do that component at all. It doesn't > happen very often. In theory we could be smarter. Right. I was obviously over-simplifying, the key point I was verifying was that they're separate steps... Which led to the question about when to disqualify for the R0 vs LR issue. > >> For the PPC R0 vs LR is the only thing that causes disqualification >> right? > > Currently, yes. > >> Can't that be handled when we build the set of components we >> want to insert for each edge/block? Is there some advantage to handling >> disqualifications after all the potential insertion points have been >> handled? > > We do not know if an edge needs a prologue, epilogue, or neither, until > we have decided whether *both* ends of that edge want the component active > or not. Right. Hmm, maybe I'm not asking the question clearly. Whether or not an edge needs a prologue or epilogue is a function not just of the state at the head or tail of the edge, but instead is a function of global dataflow propagation? Thus we can't disqualify until after we've done the dataflow propagation? Right? Jeff
On Mon, Sep 12, 2016 at 12:02:50PM -0600, Jeff Law wrote: > >>>As a final optimisation, if a block needs a prologue and its immediate > >>>dominator has the block as a post-dominator, the dominator gets the > >>>prologue as well. > >>So why not just put it in the idom and not in the dominated block? > > > >That's what it does :-) > Then I must have mis-parsed. Thanks for clarifying. "As a final optimisation, if a block needs a prologue and its immediate dominator has the block as a post-dominator, ***that immediate dominator*** gets the prologue as well." That is clearer I hope :-) > Hmm, then explain again why DCE is mucking up? I don't immediately see > how EPILOGUE_BEG notes come into play with DCE. It seems to rely on the > DF data and AFAICT DF only cares about the EPILOGUE_BEG note in > can_move_insns_across which shouldn't be used by DCE. The register restore *is* dead code, but we need to have the same CFI for all convergent paths. > >>Consider using auto_sbitmap rather than manually managing > >>allocation/releasing of the per-block structures. In fact, can't all of > >>SW become a class and we lose the explicit init/fini routines in favor > >>of a ctor/dtor? > > > >Yes, you can always add indirection. I do not think the code becomes > >more readable that way (quite the opposite). Explicit is *good*. > The GCC project is moving away from this kind of explicit > allocation/deallocation and more towards a RAII. Unless there is a > clear need for the explicit allocation/deallocation, please put this > stuff into a class with an appropriate ctor/dtor. > > FWIW, I was a big opponent of how much stuff happens "behind your back" > with some languages (including C++). But over the last few years my > personal stance has softened considerably after seeing how cleanly RAII > solves certain problems. We then still cannot get rid of SW, which is a convenience macro to do a nasty cast on bb->aux. If bb->aux was some pretty class hierarchy, easy to use and all that, I would of course agree with your suggestion. But as it is it is just a bare pointer, so the less we hide the safer it is. > >>For the PPC R0 vs LR is the only thing that causes disqualification > >>right? > > > >Currently, yes. > > > >>Can't that be handled when we build the set of components we > >>want to insert for each edge/block? Is there some advantage to handling > >>disqualifications after all the potential insertion points have been > >>handled? > > > >We do not know if an edge needs a prologue, epilogue, or neither, until > >we have decided whether *both* ends of that edge want the component active > >or not. > Right. Hmm, maybe I'm not asking the question clearly. > > Whether or not an edge needs a prologue or epilogue is a function not > just of the state at the head or tail of the edge, but instead is a > function of global dataflow propagation? Thus we can't disqualify until > after we've done the dataflow propagation? Right? We can figure out before we decide what blocks need what components, what edges can not get a prologue or epilogue for which components. This complicates the selection algorithm a whole lot, for not much gain that I have seen so far, so I just give up in the cases that end up "bad". It is not easy at all to see what edges will need to get a *logue, because not always both blocks that edge connects are in the same dominator subtree (or tree even, for an epilogue-aware placement algorithm, but this patch doesn't do that yet; it's a more minor optimisation, only reduces code size a little). Segher
On 09/14/2016 07:38 AM, Segher Boessenkool wrote: > On Mon, Sep 12, 2016 at 12:02:50PM -0600, Jeff Law wrote: >>>>> As a final optimisation, if a block needs a prologue and its immediate >>>>> dominator has the block as a post-dominator, the dominator gets the >>>>> prologue as well. >>>> So why not just put it in the idom and not in the dominated block? >>> >>> That's what it does :-) >> Then I must have mis-parsed. Thanks for clarifying. > > "As a final optimisation, if a block needs a prologue and its immediate > dominator has the block as a post-dominator, ***that immediate dominator*** > gets the prologue as well." > > That is clearer I hope :-) It is :-) > >> Hmm, then explain again why DCE is mucking up? I don't immediately see >> how EPILOGUE_BEG notes come into play with DCE. It seems to rely on the >> DF data and AFAICT DF only cares about the EPILOGUE_BEG note in >> can_move_insns_across which shouldn't be used by DCE. > > The register restore *is* dead code, but we need to have the same CFI > for all convergent paths. OK. I think I was conflating multiple issues. So we need to keep the restore alive so that we have the same CFI across those paths, even though it appears dead on one or more paths. I think this points us back to what you were experimenting with to address the regrename problems -- specifically creating "uses" at those key points. That solves the DCE problem as well as one of the regrename problems, right? >> >> Whether or not an edge needs a prologue or epilogue is a function not >> just of the state at the head or tail of the edge, but instead is a >> function of global dataflow propagation? Thus we can't disqualify until >> after we've done the dataflow propagation? Right? > > We can figure out before we decide what blocks need what components, what > edges can not get a prologue or epilogue for which components. This > complicates the selection algorithm a whole lot, for not much gain that > I have seen so far, so I just give up in the cases that end up "bad". OK. I'll drop it :-) It was more a mental exercise in understanding then something I think needed to change. Jeff
diff --git a/gcc/function.c b/gcc/function.c index bba0705..390e9a6 100644 --- a/gcc/function.c +++ b/gcc/function.c @@ -5912,6 +5912,12 @@ make_epilogue_seq (void) void thread_prologue_and_epilogue_insns (void) { + if (optimize > 1) + { + df_live_add_problem (); + df_live_set_all_dirty (); + } + df_analyze (); /* Can't deal with multiple successors of the entry block at the @@ -5922,9 +5928,7 @@ thread_prologue_and_epilogue_insns (void) edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun)); edge orig_entry_edge = entry_edge; - rtx_insn *split_prologue_seq = make_split_prologue_seq (); rtx_insn *prologue_seq = make_prologue_seq (); - rtx_insn *epilogue_seq = make_epilogue_seq (); /* Try to perform a kind of shrink-wrapping, making sure the prologue/epilogue is emitted only around those parts of the @@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void) try_shrink_wrapping (&entry_edge, prologue_seq); + df_analyze (); + try_shrink_wrapping_separate (entry_edge->dest); + if (crtl->shrink_wrapped_separate) + prologue_seq = make_prologue_seq (); + + rtx_insn *split_prologue_seq = make_split_prologue_seq (); + rtx_insn *epilogue_seq = make_epilogue_seq (); rtl_profile_for_bb (EXIT_BLOCK_PTR_FOR_FN (cfun)); diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c index b85b1c3..643e375 100644 --- a/gcc/shrink-wrap.c +++ b/gcc/shrink-wrap.c @@ -34,6 +34,7 @@ along with GCC; see the file COPYING3. If not see #include "output.h" #include "tree-pass.h" #include "cfgrtl.h" +#include "cfgbuild.h" #include "params.h" #include "bb-reorder.h" #include "shrink-wrap.h" @@ -1006,3 +1007,717 @@ try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq) BITMAP_FREE (bb_with); free_dominance_info (CDI_DOMINATORS); } + +/* Separate shrink-wrapping + + Instead of putting all of the prologue and epilogue in one spot, we + can put parts of it in places where those components are executed less + frequently. The following code does this, for prologue and epilogue + components that can be put in more than one location, and where those + components can be executed more than once (the epilogue component will + always be executed before the prologue component is executed a second + time). + + What exactly is a component is target-dependent. The more usual + components are simple saves/restores to/from the frame of callee-saved + registers. This code treats components abstractly (as an sbitmap), + letting the target handle all details. + + Prologue components are placed in such a way that for every component + the prologue is executed as infrequently as possible. We do this by + walking the dominator tree, comparing the cost of placing a prologue + component before a block to the sum of costs determined for all subtrees + of that block. + + From this placement, we then determine for each component all blocks + where at least one of this block's dominators (including itself) will + get a prologue inserted. That then is how the components are placed. + We could place the epilogue components a bit smarter (we can save a + bit of code size sometimes); this is a possible future improvement. + + Prologues and epilogues are preferably placed into a block, either at + the beginning or end of it, if it is needed for all predecessor resp. + successor edges; or placed on the edge otherwise. + + If the placement of any prologue/epilogue leads to a situation we cannot + handle (for example, an abnormal edge would need to be split, or some + targets want to use some specific registers that may not be available + where we want to put them), separate shrink-wrapping for the components + in that prologue/epilogue is aborted. */ + + +/* Print the sbitmap COMPONENTS to the DUMP_FILE if not empty, with the + label LABEL. */ +static void +dump_components (const char *label, sbitmap components) +{ + if (bitmap_empty_p (components)) + return; + + fprintf (dump_file, " [%s", label); + + for (unsigned int j = 0; j < components->n_bits; j++) + if (bitmap_bit_p (components, j)) + fprintf (dump_file, " %u", j); + + fprintf (dump_file, "]"); +} + +/* The data we collect for each bb. */ +struct sw { + /* What components does this BB need? */ + sbitmap needs_components; + + /* What components does this BB have? This is the main decision this + pass makes. */ + sbitmap has_components; + + /* The components for which we placed code at the start of the BB (instead + of on all incoming edges). */ + sbitmap head_components; + + /* The components for which we placed code at the end of the BB (instead + of on all outgoing edges). */ + sbitmap tail_components; + + /* The frequency of executing the prologue for this BB, if a prologue is + placed on this BB. This is a pessimistic estimate (no prologue is + needed for edges from blocks that have the component under consideration + active already). */ + gcov_type own_cost; + + /* The frequency of executing the prologue for this BB and all BBs + dominated by it. */ + gcov_type total_cost; +}; + +/* A helper function for accessing the pass-specific info. */ +static inline struct sw * +SW (basic_block bb) +{ + gcc_assert (bb->aux); + return (struct sw *) bb->aux; +} + +/* Create the pass-specific data structures for separately shrink-wrapping + with components COMPONENTS. */ +static void +init_separate_shrink_wrap (sbitmap components) +{ + basic_block bb; + FOR_ALL_BB_FN (bb, cfun) + { + bb->aux = xcalloc (1, sizeof (struct sw)); + + SW (bb)->needs_components = targetm.shrink_wrap.components_for_bb (bb); + + if (dump_file) + { + fprintf (dump_file, "bb %d components:", bb->index); + dump_components ("has", SW (bb)->needs_components); + fprintf (dump_file, "\n"); + } + + SW (bb)->has_components = sbitmap_alloc (SBITMAP_SIZE (components)); + SW (bb)->head_components = sbitmap_alloc (SBITMAP_SIZE (components)); + SW (bb)->tail_components = sbitmap_alloc (SBITMAP_SIZE (components)); + bitmap_clear (SW (bb)->has_components); + bitmap_clear (SW (bb)->head_components); + bitmap_clear (SW (bb)->tail_components); + } +} + +/* Destroy the pass-specific data. */ +static void +fini_separate_shrink_wrap (void) +{ + basic_block bb; + FOR_ALL_BB_FN (bb, cfun) + if (bb->aux) + { + sbitmap_free (SW (bb)->needs_components); + sbitmap_free (SW (bb)->has_components); + sbitmap_free (SW (bb)->head_components); + sbitmap_free (SW (bb)->tail_components); + free (bb->aux); + bb->aux = 0; + } +} + +/* Place the prologue for component WHICH, in the basic blocks dominated + by HEAD. Do a DFS over the dominator tree, and set bit WHICH in the + HAS_COMPONENTS of a block if either the block has that bit set in + NEEDS_COMPONENTS, or it is cheaper to place the prologue here than in all + dominator subtrees separately. */ +static void +place_prologue_for_one_component (unsigned int which, basic_block head) +{ + /* The block we are currently dealing with. */ + basic_block bb = head; + /* Is this the first time we visit this block, i.e. have we just gone + down the tree. */ + bool first_visit = true; + + /* Walk the dominator tree, visit one block per iteration of this loop. + Each basic block is visited twice: once before visiting any children + of the block, and once after visiting all of them (leaf nodes are + visited only once). As an optimization, we do not visit subtrees + that can no longer influence the prologue placement. */ + for (;;) + { + /* First visit of a block: set the (children) cost accumulator to zero; + if the block does not have the component itself, walk down. */ + if (first_visit) + { + /* Initialize the cost. The cost is the block execution frequency + that does not come from backedges. Calculating this by simply + adding the cost of all edges that aren't backedges does not + work: this does not always add up to the block frequency at + all, and even if it does, rounding error makes for bad + decisions. */ + SW (bb)->own_cost = bb->frequency; + + edge e; + edge_iterator ei; + FOR_EACH_EDGE (e, ei, bb->preds) + if (dominated_by_p (CDI_DOMINATORS, e->src, bb)) + { + if (SW (bb)->own_cost > EDGE_FREQUENCY (e)) + SW (bb)->own_cost -= EDGE_FREQUENCY (e); + else + SW (bb)->own_cost = 0; + } + + SW (bb)->total_cost = 0; + + if (!bitmap_bit_p (SW (bb)->needs_components, which) + && first_dom_son (CDI_DOMINATORS, bb)) + { + bb = first_dom_son (CDI_DOMINATORS, bb); + continue; + } + } + + /* If this block does need the component itself, or it is cheaper to + put the prologue here than in all the descendants that need it, + mark it so. If this block's immediate post-dominator is dominated + by this block, and that needs the prologue, we can put it on this + block as well (earlier is better). */ + if (bitmap_bit_p (SW (bb)->needs_components, which) + || SW (bb)->total_cost > SW (bb)->own_cost) + { + SW (bb)->total_cost = SW (bb)->own_cost; + bitmap_set_bit (SW (bb)->has_components, which); + } + else + { + basic_block kid = get_immediate_dominator (CDI_POST_DOMINATORS, bb); + if (dominated_by_p (CDI_DOMINATORS, kid, bb) + && bitmap_bit_p (SW (kid)->has_components, which)) + { + SW (bb)->total_cost = SW (bb)->own_cost; + bitmap_set_bit (SW (bb)->has_components, which); + } + } + + /* We are back where we started, so we are done now. */ + if (bb == head) + return; + + /* We now know the cost of the subtree rooted at the current block. + Accumulate this cost in the parent. */ + basic_block parent = get_immediate_dominator (CDI_DOMINATORS, bb); + SW (parent)->total_cost += SW (bb)->total_cost; + + /* Don't walk the tree down unless necessary. */ + if (next_dom_son (CDI_DOMINATORS, bb) + && SW (parent)->total_cost <= SW (parent)->own_cost) + { + bb = next_dom_son (CDI_DOMINATORS, bb); + first_visit = true; + } + else + { + bb = parent; + first_visit = false; + } + } +} + +/* Mark HAS_COMPONENTS for every block dominated by at least one block with + PRO_COMPONENTS set for the respective components, starting at HEAD. */ +static void +spread_components (basic_block head) +{ + basic_block bb = head; + bool first_visit = true; + /* This keeps a tally of all components active. */ + sbitmap components = SW (head)->has_components; + + for (;;) + { + if (first_visit) + { + bitmap_ior (SW (bb)->has_components, SW (bb)->has_components, + components); + + if (first_dom_son (CDI_DOMINATORS, bb)) + { + components = SW (bb)->has_components; + bb = first_dom_son (CDI_DOMINATORS, bb); + continue; + } + } + + components = SW (bb)->has_components; + + if (next_dom_son (CDI_DOMINATORS, bb)) + { + bb = next_dom_son (CDI_DOMINATORS, bb); + basic_block parent = get_immediate_dominator (CDI_DOMINATORS, bb); + components = SW (parent)->has_components; + first_visit = true; + } + else + { + if (bb == head) + return; + bb = get_immediate_dominator (CDI_DOMINATORS, bb); + first_visit = false; + } + } +} + +/* If we cannot handle placing some component's prologues or epilogues where + we decided we should place them, unmark that component in COMPONENTS so + that it is not wrapped separately. */ +static void +disqualify_problematic_components (sbitmap components) +{ + sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components)); + sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components)); + + basic_block bb; + FOR_EACH_BB_FN (bb, cfun) + { + edge e; + edge_iterator ei; + FOR_EACH_EDGE (e, ei, bb->succs) + { + /* Find which components we want pro/epilogues for here. */ + bitmap_and_compl (epi, SW (e->src)->has_components, + SW (e->dest)->has_components); + bitmap_and_compl (pro, SW (e->dest)->has_components, + SW (e->src)->has_components); + + /* Ask the target what it thinks about things. */ + if (!bitmap_empty_p (epi)) + targetm.shrink_wrap.disqualify_components (components, e, epi, + false); + if (!bitmap_empty_p (pro)) + targetm.shrink_wrap.disqualify_components (components, e, pro, + true); + + /* If this edge doesn't need splitting, we're fine. */ + if (single_pred_p (e->dest) + && e->dest != EXIT_BLOCK_PTR_FOR_FN (cfun)) + continue; + + /* If the edge can be split, that is fine too. */ + if ((e->flags & EDGE_ABNORMAL) == 0) + continue; + + /* We also can handle sibcalls. */ + if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)) + { + gcc_assert (e->flags & EDGE_SIBCALL); + continue; + } + + /* Remove from consideration those components we would need + pro/epilogues for on edges where we cannot insert them. */ + bitmap_and_compl (components, components, epi); + bitmap_and_compl (components, components, pro); + + if (dump_file && !bitmap_subset_p (epi, components)) + { + fprintf (dump_file, " BAD epi %d->%d", e->src->index, + e->dest->index); + if (e->flags & EDGE_EH) + fprintf (dump_file, " for EH"); + dump_components ("epi", epi); + fprintf (dump_file, "\n"); + } + + if (dump_file && !bitmap_subset_p (pro, components)) + { + fprintf (dump_file, " BAD pro %d->%d", e->src->index, + e->dest->index); + if (e->flags & EDGE_EH) + fprintf (dump_file, " for EH"); + dump_components ("pro", pro); + fprintf (dump_file, "\n"); + } + } + } + + sbitmap_free (pro); + sbitmap_free (epi); +} + +/* Place code for prologues and epilogues for COMPONENTS where we can put + that code at the start of basic blocks. */ +static void +emit_common_heads_for_components (sbitmap components) +{ + sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components)); + sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components)); + sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components)); + + basic_block bb; + FOR_EACH_BB_FN (bb, cfun) + { + /* Find which prologue resp. epilogue components are needed for all + predecessor edges to this block. */ + + /* First, select all possible components. */ + bitmap_copy (epi, components); + bitmap_copy (pro, components); + + edge e; + edge_iterator ei; + FOR_EACH_EDGE (e, ei, bb->preds) + { + if (e->flags & EDGE_ABNORMAL) + { + bitmap_clear (epi); + bitmap_clear (pro); + break; + } + + /* Deselect those epilogue components that should not be inserted + for this edge. */ + bitmap_and_compl (tmp, SW (e->src)->has_components, + SW (e->dest)->has_components); + bitmap_and (epi, epi, tmp); + + /* Similar, for the prologue. */ + bitmap_and_compl (tmp, SW (e->dest)->has_components, + SW (e->src)->has_components); + bitmap_and (pro, pro, tmp); + } + + if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro))) + fprintf (dump_file, " bb %d", bb->index); + + if (dump_file && !bitmap_empty_p (epi)) + dump_components ("epi", epi); + if (dump_file && !bitmap_empty_p (pro)) + dump_components ("pro", pro); + + if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro))) + fprintf (dump_file, "\n"); + + /* Place code after the BB note. */ + if (!bitmap_empty_p (pro)) + { + start_sequence (); + targetm.shrink_wrap.emit_prologue_components (pro); + rtx_insn *seq = get_insns (); + end_sequence (); + + emit_insn_after (seq, bb_note (bb)); + + bitmap_ior (SW (bb)->head_components, SW (bb)->head_components, pro); + } + + if (!bitmap_empty_p (epi)) + { + start_sequence (); + targetm.shrink_wrap.emit_epilogue_components (epi); + rtx_insn *seq = get_insns (); + end_sequence (); + + emit_insn_after (seq, bb_note (bb)); + + bitmap_ior (SW (bb)->head_components, SW (bb)->head_components, epi); + } + } + + sbitmap_free (pro); + sbitmap_free (epi); + sbitmap_free (tmp); +} + +/* Place code for prologues and epilogues for COMPONENTS where we can put + that code at the end of basic blocks. */ +static void +emit_common_tails_for_components (sbitmap components) +{ + sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components)); + sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components)); + sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components)); + + basic_block bb; + FOR_EACH_BB_FN (bb, cfun) + { + /* Find which prologue resp. epilogue components are needed for all + successor edges from this block. */ + if (EDGE_COUNT (bb->succs) == 0) + continue; + + /* First, select all possible components. */ + bitmap_copy (epi, components); + bitmap_copy (pro, components); + + edge e; + edge_iterator ei; + FOR_EACH_EDGE (e, ei, bb->succs) + { + if (e->flags & EDGE_ABNORMAL) + { + bitmap_clear (epi); + bitmap_clear (pro); + break; + } + + /* Deselect those epilogue components that should not be inserted + for this edge, and also those that are already put at the head + of the successor block. */ + bitmap_and_compl (tmp, SW (e->src)->has_components, + SW (e->dest)->has_components); + bitmap_and_compl (tmp, tmp, SW (e->dest)->head_components); + bitmap_and (epi, epi, tmp); + + /* Similarly, for the prologue. */ + bitmap_and_compl (tmp, SW (e->dest)->has_components, + SW (e->src)->has_components); + bitmap_and_compl (tmp, tmp, SW (e->dest)->head_components); + bitmap_and (pro, pro, tmp); + } + + if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro))) + fprintf (dump_file, " bb %d", bb->index); + + if (dump_file && !bitmap_empty_p (epi)) + dump_components ("epi", epi); + if (dump_file && !bitmap_empty_p (pro)) + dump_components ("pro", pro); + + if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro))) + fprintf (dump_file, "\n"); + + /* Put the code at the end of the BB, but before any final jump. */ + if (!bitmap_empty_p (epi)) + { + start_sequence (); + targetm.shrink_wrap.emit_epilogue_components (epi); + rtx_insn *seq = get_insns (); + end_sequence (); + + rtx_insn *insn = BB_END (bb); + if (control_flow_insn_p (insn)) + emit_insn_before (seq, insn); + else + emit_insn_after (seq, insn); + + bitmap_ior (SW (bb)->tail_components, SW (bb)->tail_components, epi); + } + + if (!bitmap_empty_p (pro)) + { + start_sequence (); + targetm.shrink_wrap.emit_prologue_components (pro); + rtx_insn *seq = get_insns (); + end_sequence (); + + rtx_insn *insn = BB_END (bb); + if (JUMP_P (insn) || (CALL_P (insn) && SIBLING_CALL_P (insn))) + emit_insn_before (seq, insn); + else + emit_insn_after (seq, insn); + + bitmap_ior (SW (bb)->tail_components, SW (bb)->tail_components, pro); + } + } + + sbitmap_free (pro); + sbitmap_free (epi); + sbitmap_free (tmp); +} + +/* Place prologues and epilogues for COMPONENTS on edges, if we haven't already + placed them inside blocks directly. */ +static void +insert_prologue_epilogue_for_components (sbitmap components) +{ + sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components)); + sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components)); + + basic_block bb; + FOR_EACH_BB_FN (bb, cfun) + { + if (!bb->aux) + continue; + + edge e; + edge_iterator ei; + FOR_EACH_EDGE (e, ei, bb->succs) + { + /* Find which pro/epilogue components are needed on this edge. */ + bitmap_and_compl (epi, SW (e->src)->has_components, + SW (e->dest)->has_components); + bitmap_and_compl (pro, SW (e->dest)->has_components, + SW (e->src)->has_components); + bitmap_and (epi, epi, components); + bitmap_and (pro, pro, components); + + /* Deselect those we already have put at the head or tail of the + edge's dest resp. src. */ + bitmap_and_compl (epi, epi, SW (e->dest)->head_components); + bitmap_and_compl (pro, pro, SW (e->dest)->head_components); + bitmap_and_compl (epi, epi, SW (e->src)->tail_components); + bitmap_and_compl (pro, pro, SW (e->src)->tail_components); + + if (!bitmap_empty_p (epi) || !bitmap_empty_p (pro)) + { + if (dump_file) + { + fprintf (dump_file, " %d->%d", e->src->index, + e->dest->index); + dump_components ("epi", epi); + dump_components ("pro", pro); + fprintf (dump_file, "\n"); + } + + /* Put the epilogue components in place. */ + start_sequence (); + targetm.shrink_wrap.emit_epilogue_components (epi); + rtx_insn *seq = get_insns (); + end_sequence (); + + if (e->flags & EDGE_SIBCALL) + { + gcc_assert (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)); + + rtx_insn *insn = BB_END (e->src); + gcc_assert (CALL_P (insn) && SIBLING_CALL_P (insn)); + emit_insn_before (seq, insn); + } + else if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)) + { + gcc_assert (e->flags & EDGE_FALLTHRU); + basic_block new_bb = split_edge (e); + emit_insn_after (seq, BB_END (new_bb)); + } + else + insert_insn_on_edge (seq, e); + + /* Put the prologue components in place. */ + start_sequence (); + targetm.shrink_wrap.emit_prologue_components (pro); + seq = get_insns (); + end_sequence (); + + insert_insn_on_edge (seq, e); + } + } + } + + sbitmap_free (pro); + sbitmap_free (epi); + + commit_edge_insertions (); +} + +/* The main entry point to this subpass. FIRST_BB is where the prologue + would be normally put. */ +void +try_shrink_wrapping_separate (basic_block first_bb) +{ + if (!(SHRINK_WRAPPING_ENABLED + && flag_shrink_wrap_separate + && optimize_function_for_speed_p (cfun) + && targetm.shrink_wrap.get_separate_components)) + return; + + /* We need LIVE info. */ + if (!df_live) + return; + + /* We don't handle "strange" functions. */ + if (cfun->calls_alloca + || cfun->calls_setjmp + || cfun->can_throw_non_call_exceptions + || crtl->calls_eh_return + || crtl->has_nonlocal_goto + || crtl->saves_all_registers) + return; + + /* Ask the target what components there are. If it returns NULL, don't + do anything. */ + sbitmap components = targetm.shrink_wrap.get_separate_components (); + if (!components) + return; + + calculate_dominance_info (CDI_DOMINATORS); + calculate_dominance_info (CDI_POST_DOMINATORS); + + init_separate_shrink_wrap (components); + + sbitmap_iterator sbi; + unsigned int j; + EXECUTE_IF_SET_IN_BITMAP (components, 0, j, sbi) + place_prologue_for_one_component (j, first_bb); + + spread_components (first_bb); + + disqualify_problematic_components (components); + + /* Don't separately shrink-wrap anything where the "main" prologue will + go; the target code can often optimize things if it is presented with + all components together (say, if it generates store-multiple insns). */ + bitmap_and_compl (components, components, SW (first_bb)->has_components); + + if (bitmap_empty_p (components)) + { + if (dump_file) + fprintf (dump_file, "Not wrapping anything separately.\n"); + } + else + { + if (dump_file) + { + fprintf (dump_file, "The components we wrap separately are"); + dump_components ("sep", components); + fprintf (dump_file, "\n"); + + fprintf (dump_file, "... Inserting common heads...\n"); + } + + emit_common_heads_for_components (components); + + if (dump_file) + fprintf (dump_file, "... Inserting common tails...\n"); + + emit_common_tails_for_components (components); + + if (dump_file) + fprintf (dump_file, "... Inserting the more difficult ones...\n"); + + insert_prologue_epilogue_for_components (components); + + if (dump_file) + fprintf (dump_file, "... Done.\n"); + + targetm.shrink_wrap.set_handled_components (components); + + crtl->shrink_wrapped_separate = true; + } + + fini_separate_shrink_wrap (); + + sbitmap_free (components); + free_dominance_info (CDI_DOMINATORS); + free_dominance_info (CDI_POST_DOMINATORS); +} diff --git a/gcc/shrink-wrap.h b/gcc/shrink-wrap.h index e06ab37..05fcb41 100644 --- a/gcc/shrink-wrap.h +++ b/gcc/shrink-wrap.h @@ -25,6 +25,7 @@ along with GCC; see the file COPYING3. If not see /* In shrink-wrap.c. */ extern bool requires_stack_frame_p (rtx_insn *, HARD_REG_SET, HARD_REG_SET); extern void try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq); +extern void try_shrink_wrapping_separate (basic_block first_bb); #define SHRINK_WRAPPING_ENABLED \ (flag_shrink_wrap && targetm.have_simple_return ())