Message ID: alpine.LSU.2.20.1809191455250.16707@zhemvz.fhfr.qr
State:      New
Series:     [RFC] Fix PR63155 (some more)
On Wed, Sep 19, 2018 at 03:06:30PM +0200, Richard Biener wrote:
> The second testcase in the above PR runs into our O(N) bitmap element
> search limitation and spends 8s (60%) of the compile-time in the SSA
> propagator engine (when optimizing).  The patch improves that to 0.9s
> (15%).  For the first testcase it isn't that bad but still the patch
> improves CCP from 36% to 14%.
>
> The "solution" is to use sbitmap instead of a bitmap to avoid
> the linearity when doing add_ssa_edge.  We pay for that (but not
> actually with the testcases) with a linear-time bitmap_first_set_bit
> in process_ssa_edge_worklist.  I do not (yet...) have a testcase

Perhaps you could remember the first set bit number in an integral
variable next to the bitmap.  On bitmap_set_bit this would be just a
comparison + conditional update; in simulate_stmt it would again be
conditional on clearing the first set bit, and in that case it could
start from that bit onwards rather than from the beginning
(EXECUTE_IF_SET_IN_BITMAP); similarly in process_ssa_edge_worklist
(non-conditional in that case, since we are always clearing the first
bit).  Or instead of tracking the exact first bit, track it lazily,
i.e. just remember the smallest bitnum we've set, and in
process_ssa_edge_worklist don't use bitmap_first_set_bit but walk from
that minimum using EXECUTE_IF_SET_IN_BITMAP and update the minimum to
whatever we've found.

Similarly, empty_p can be made cheap by keeping on the side the count
of set bits and updating it on set_bit when previously not set, as
well as on clear_bit.

	Jakub
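[Editor's note: the scheme Jakub describes can be sketched as below.  This is a standalone toy, not GCC's bitmap/sbitmap API — the struct and the wl_* names are invented for illustration: a flat bitmap plus a lazily maintained lower bound on the first set bit, and an on-the-side population count for the cheap empty test.]

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NBITS 256
#define BPW ((int) (sizeof (unsigned long) * CHAR_BIT))

struct worklist
{
  unsigned long bits[NBITS / BPW + 1];
  int min_bit;   /* Lazy lower bound: no bit below this index is set.  */
  int count;     /* Set-bit count, for an O(1) empty test.  */
};

static void
wl_init (struct worklist *w)
{
  memset (w->bits, 0, sizeof w->bits);
  w->min_bit = NBITS;
  w->count = 0;
}

static int
wl_empty_p (const struct worklist *w)
{
  return w->count == 0;
}

static void
wl_set (struct worklist *w, int bit)
{
  unsigned long mask = 1UL << (bit % BPW);
  if (!(w->bits[bit / BPW] & mask))
    {
      w->bits[bit / BPW] |= mask;
      w->count++;
      if (bit < w->min_bit)   /* Just a comparison + conditional update.  */
	w->min_bit = bit;
    }
}

/* Pop the first set bit, scanning from the cached minimum rather than
   from bit 0, and re-tighten the cached minimum as we go.  */
static int
wl_pop_first (struct worklist *w)
{
  for (int bit = w->min_bit; bit < NBITS; bit++)
    if (w->bits[bit / BPW] & (1UL << (bit % BPW)))
      {
	w->bits[bit / BPW] &= ~(1UL << (bit % BPW));
	w->count--;
	w->min_bit = bit + 1;   /* Everything below is now clear.  */
	return bit;
      }
  w->min_bit = NBITS;
  return -1;
}
```

As Richard notes below, the bound is only a starting point for the search and is easily invalidated by a later wl_set of a smaller bit, but each pop re-tightens it.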
On Wed, 19 Sep 2018, Jakub Jelinek wrote:

> On Wed, Sep 19, 2018 at 03:06:30PM +0200, Richard Biener wrote:
> > The second testcase in the above PR runs into our O(N) bitmap element
> > search limitation and spends 8s (60%) of the compile-time in the SSA
> > propagator engine (when optimizing).  The patch improves that to 0.9s
> > (15%).  For the first testcase it isn't that bad but still the patch
> > improves CCP from 36% to 14%.
> >
> > The "solution" is to use sbitmap instead of a bitmap to avoid
> > the linearity when doing add_ssa_edge.  We pay for that (but not
> > actually with the testcases) with a linear-time bitmap_first_set_bit
> > in process_ssa_edge_worklist.  I do not (yet...) have a testcase
>
> Perhaps you could remember the first set bit number in an integral
> variable next to the bitmap.  On bitmap_set_bit this would be just a
> comparison + conditional update, in simulate_stmt it would be again
> conditional on clearing the first set bit and in that case it could
> start from that bit onwards rather than from the beginning
> (EXECUTE_IF_SET_IN_BITMAP), similarly in process_ssa_edge_worklist
> (non-conditional in that case, we are always clearing the first bit).
> Or instead of tracking the exact first bit track it lazily, i.e. just
> remember the smallest bitnum we've set and in process_ssa_edge_worklist
> don't use bitmap_first_set_bit, but walk from that minimum using
> EXECUTE_IF_SET_IN_BITMAP and update the minimum to whatever we've found.

I thought about that, but as you say it's easily invalidated and we can
only optimize the search start point.  So it felt somewhat like a hack ;)
I can still do that if you think that otherwise using an sbitmap is fine
(I'm not 100% convinced about that yet).

> Similarly, empty_p can be done by keeping on the side the count of set
> bits and updating it on set_bit when previously not set, as well as on
> clear_bit.
I elided empty_p by re-using the fact that bitmap_first_set_bit already
has to do the walk (and may fail).  Of course using an on-the-side count
might be more optimal (I was wondering why sbitmap itself didn't have
that).

I wonder if reviving Steven's [patch][RFC] bitmaps as lists *or* trees
makes more sense, so we can turn this kind of random-access bitmap into
tree view.

Richard.
On Wed, Sep 19, 2018 at 3:06 PM Richard Biener wrote:
> If we'd only had a O(log n) search sparse bitmap implementation ...
> (Steven posted patches to switch bitmap from/to such one but IIRC
> that at least lacked bitmap_first_set_bit).

But bitmap_first_set_bit would be easy to implement.  Just take the
left-most node of the splay tree.

Actually all bit-tests would be easy to implement.  It's only
enumeration and set operations on the tree views that would be
complicated fluff (easier to "switch views" than re-implement).

Impressive that you remember >5yr old patches like that ;-)

Ciao!
Steven
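[Editor's note: Steven's observation can be illustrated with a toy ordered-tree "bitmap".  The sketched GCC tree view uses a splay tree; a plain unbalanced BST stands in for it here, and all names are invented.  Membership tests are ordinary tree lookups, and first_set_bit is just the walk to the left-most node, O(tree height).]

```c
#include <assert.h>
#include <stdlib.h>

struct node { int bit; struct node *left, *right; };

/* Set a bit: insert its index into the ordered tree (no-op if present).
   Allocation failures are ignored for brevity in this sketch.  */
static struct node *
tree_set_bit (struct node *t, int bit)
{
  if (!t)
    {
      t = malloc (sizeof *t);
      t->bit = bit;
      t->left = t->right = NULL;
    }
  else if (bit < t->bit)
    t->left = tree_set_bit (t->left, bit);
  else if (bit > t->bit)
    t->right = tree_set_bit (t->right, bit);
  return t;
}

/* Bit-test: an ordinary tree lookup.  */
static int
tree_bit_p (const struct node *t, int bit)
{
  while (t && t->bit != bit)
    t = bit < t->bit ? t->left : t->right;
  return t != NULL;
}

/* first_set_bit: descend to the left-most node; -1 when empty.  */
static int
tree_first_set_bit (const struct node *t)
{
  if (!t)
    return -1;
  while (t->left)
    t = t->left;
  return t->bit;
}
```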
On Wed, 19 Sep 2018, Steven Bosscher wrote:

> On Wed, Sep 19, 2018 at 3:06 PM Richard Biener wrote:
> > If we'd only had a O(log n) search sparse bitmap implementation ...
> > (Steven posted patches to switch bitmap from/to such one but IIRC
> > that at least lacked bitmap_first_set_bit).
>
> But bitmap_first_set_bit would be easy to implement.  Just take the
> left-most node of the splay tree.
>
> Actually all bit-tests would be easy to implement.  It's only
> enumeration and set operations on the tree views that would be
> complicated fluff (easier to "switch views" than re-implement).
>
> Impressive that you remember >5yr old patches like that ;-)

;)  Well, it's still useful (but obviously doesn't apply).  Not sure if
the worst-case behavior of splay trees makes it a pointless exercise
though ;)

Richard.
On Wed, 19 Sep 2018, Richard Biener wrote:

> The second testcase in the above PR runs into our O(N) bitmap element
> search limitation and spends 8s (60%) of the compile-time in the SSA
> propagator engine (when optimizing).  The patch improves that to 0.9s
> (15%).  For the first testcase it isn't that bad but still the patch
> improves CCP from 36% to 14%.
>
> The "solution" is to use sbitmap instead of a bitmap to avoid
> the linearity when doing add_ssa_edge.  We pay for that (but not
> actually with the testcases) with a linear-time bitmap_first_set_bit
> in process_ssa_edge_worklist.  I do not (yet...) have a testcase
> that overall gets slower with this approach.  I suppose using
> std::set<int> would "solve" the complexity issue but we'd pay
> back with horribly inefficient memory use.  Similarly with
> our sparse bitmap implementation which lacks an ordered
> first_set_bit (it only can get any set bit fast, breaking optimal
> iteration order).
>
> If we'd only had a O(log n) search sparse bitmap implementation ...
> (Steven posted patches to switch bitmap from/to such one but IIRC
> that at least lacked bitmap_first_set_bit).
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> OK for trunk?

So it turns out that while the bitmap data structure isn't optimal, the
issue with the testcase is that we end up with a full-set universe for
the SSA worklist, mostly because we are queueing PHIs via uses on
backedges that are not yet executable (so we'd re-simulate the PHI
anyway once that became executable - no need to set the individual
bits).

So I'm testing the following instead.

Bootstrap & regtest running on x86_64-unknown-linux-gnu.

Richard.

2018-09-20  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63155
	* tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to the
	worklist when the edge of the respective argument isn't
	executable.
Index: gcc/tree-ssa-propagate.c
===================================================================
--- gcc/tree-ssa-propagate.c	(revision 264438)
+++ gcc/tree-ssa-propagate.c	(working copy)
@@ -168,10 +168,18 @@ add_ssa_edge (tree var)
   FOR_EACH_IMM_USE_FAST (use_p, iter, var)
     {
       gimple *use_stmt = USE_STMT (use_p);
+      basic_block use_bb = gimple_bb (use_stmt);
 
       /* If we did not yet simulate the block wait for this to happen
	 and do not add the stmt to the SSA edge worklist.  */
-      if (! (gimple_bb (use_stmt)->flags & BB_VISITED))
+      if (! (use_bb->flags & BB_VISITED))
+	continue;
+
+      /* If this is a use on a not yet executable edge do not bother to
+	 queue it.  */
+      if (gimple_code (use_stmt) == GIMPLE_PHI
+	  && !(EDGE_PRED (use_bb, PHI_ARG_INDEX_FROM_USE (use_p))->flags
+	       & EDGE_EXECUTABLE))
	continue;
 
       if (prop_simulate_again_p (use_stmt)
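[Editor's note: the effect of the guard this patch adds can be shown with a toy queueing routine.  Everything below is invented for illustration (it is not GCC's add_ssa_edge): a use reached over a not-yet-executable edge is skipped, because the target PHI will be re-simulated anyway when that edge first becomes executable.]

```c
#include <assert.h>

struct ssa_use
{
  int target_uid;        /* UID of the using stmt/PHI.  */
  int edge_executable;   /* Is the corresponding CFG edge executable?  */
};

/* Append the worth-simulating uses of a changed def to WORKLIST;
   return the new worklist length.  */
static int
queue_uses (const struct ssa_use *uses, int n, int *worklist, int wl_len)
{
  for (int i = 0; i < n; i++)
    {
      if (!uses[i].edge_executable)
	continue;   /* Re-queued later, when the edge turns executable.  */
      worklist[wl_len++] = uses[i].target_uid;
    }
  return wl_len;
}
```

On the PR's testcase this is what keeps the worklist from filling up with the whole universe of PHI UIDs.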
Index: gcc/tree-ssa-propagate.h
===================================================================
--- gcc/tree-ssa-propagate.h	(revision 264418)
+++ gcc/tree-ssa-propagate.h	(working copy)
@@ -94,7 +94,7 @@ class ssa_propagation_engine
  private:
   /* Internal implementation details.  */
   void simulate_stmt (gimple *stmt);
-  void process_ssa_edge_worklist (void);
+  bool process_ssa_edge_worklist (void);
   void simulate_block (basic_block);
 };
 
Index: gcc/tree-ssa-propagate.c
===================================================================
--- gcc/tree-ssa-propagate.c	(revision 264418)
+++ gcc/tree-ssa-propagate.c	(working copy)
@@ -119,7 +119,7 @@ static int *cfg_order_to_bb;
    definition has changed.  SSA edges are def-use edges in the SSA
    web.  For each D-U edge, we store the target statement or PHI node
    UID in a bitmap.  UIDs order stmts in execution order.  */
-static bitmap ssa_edge_worklist;
+static sbitmap ssa_edge_worklist;
 static vec<gimple *> uid_to_stmt;
 
 /* Return true if the block worklist empty.  */
@@ -175,8 +175,9 @@ add_ssa_edge (tree var)
 	continue;
 
       if (prop_simulate_again_p (use_stmt)
-	  && bitmap_set_bit (ssa_edge_worklist, gimple_uid (use_stmt)))
+	  && !bitmap_bit_p (ssa_edge_worklist, gimple_uid (use_stmt)))
 	{
+	  bitmap_set_bit (ssa_edge_worklist, gimple_uid (use_stmt));
 	  uid_to_stmt[gimple_uid (use_stmt)] = use_stmt;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    {
@@ -317,11 +318,14 @@ ssa_propagation_engine::simulate_stmt (g
    when an SSA edge is added to it in simulate_stmt.  Return true if a stmt
    was simulated.  */
 
-void
+bool
 ssa_propagation_engine::process_ssa_edge_worklist (void)
 {
   /* Process the next entry from the worklist.  */
-  unsigned stmt_uid = bitmap_first_set_bit (ssa_edge_worklist);
+  int stmt_uid = bitmap_first_set_bit (ssa_edge_worklist);
+  if (stmt_uid == -1)
+    return false;
+
   bitmap_clear_bit (ssa_edge_worklist, stmt_uid);
   gimple *stmt = uid_to_stmt[stmt_uid];
 
@@ -335,6 +339,7 @@ ssa_propagation_engine::process_ssa_edge
     }
 
   simulate_stmt (stmt);
+  return true;
 }
 
 
@@ -412,9 +417,6 @@ ssa_prop_init (void)
   edge_iterator ei;
   basic_block bb;
 
-  /* Worklists of SSA edges.  */
-  ssa_edge_worklist = BITMAP_ALLOC (NULL);
-
   /* Worklist of basic-blocks.  */
   bb_to_cfg_order = XNEWVEC (int, last_basic_block_for_fn (cfun) + 1);
   cfg_order_to_bb = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
@@ -455,6 +457,10 @@ ssa_prop_init (void)
     }
   uid_to_stmt.safe_grow (gimple_stmt_max_uid (cfun));
 
+  /* Worklists of SSA edges.  */
+  ssa_edge_worklist = sbitmap_alloc (gimple_stmt_max_uid (cfun));
+  bitmap_clear (ssa_edge_worklist);
+
   /* Seed the algorithm by adding the successors of the entry block to the
      edge worklist.  */
   FOR_EACH_EDGE (e, ei, ENTRY_BLOCK_PTR_FOR_FN (cfun)->succs)
@@ -473,7 +479,8 @@ ssa_prop_fini (void)
   BITMAP_FREE (cfg_blocks);
   free (bb_to_cfg_order);
   free (cfg_order_to_bb);
-  BITMAP_FREE (ssa_edge_worklist);
+  sbitmap_free (ssa_edge_worklist);
+  ssa_edge_worklist = NULL;
   uid_to_stmt.release ();
 }
 
@@ -789,8 +796,7 @@ ssa_propagation_engine::ssa_propagate (v
   ssa_prop_init ();
 
   /* Iterate until the worklists are empty.  */
-  while (! cfg_blocks_empty_p ()
-	 || ! bitmap_empty_p (ssa_edge_worklist))
+  while (1)
     {
       /* First simulate whole blocks.  */
       if (! cfg_blocks_empty_p ())
@@ -802,7 +808,10 @@ ssa_propagation_engine::ssa_propagate (v
 	}
 
       /* Then simulate from the SSA edge worklist.  */
-      process_ssa_edge_worklist ();
+      if (process_ssa_edge_worklist ())
+	continue;
+
+      break;
     }
 
   ssa_prop_fini ();