
[RFC] Fix PR63155 (some more)

Message ID alpine.LSU.2.20.1809191455250.16707@zhemvz.fhfr.qr
State New
Series [RFC] Fix PR63155 (some more)

Commit Message

Richard Biener Sept. 19, 2018, 1:06 p.m. UTC
The second testcase in the above PR runs into our O(N) bitmap element
search limitation and spends 8s (60%) of the compile-time in the SSA propagator
engine (when optimizing).  The patch improves that to 0.9s (15%).  For the 
first testcase it isn't as bad, but the patch still improves CCP from 36% to 
14%.

The "solution" is to use sbitmap instead of a bitmap to avoid
the linearity when doing add_ssa_edge.  We pay for that (but not
actually with the testcases) with a linear-time bitmap_first_set_bit
in process_ssa_edge_worklist.  I do not (yet...) have a testcase
that overall gets slower with this approach.  I suppose using
std::set<int> would "solve" the complexity issue but we'd pay
back with horribly inefficient memory use.  Similarly with
our sparse bitmap implementation, which lacks an ordered
first_set_bit (it can only get some arbitrary set bit fast, breaking
the optimal iteration order).

If only we had an O(log n) search sparse bitmap implementation ...
(Steven posted patches to switch bitmap from/to such an implementation,
but IIRC that at least lacked bitmap_first_set_bit).
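
To illustrate the difference, here is a toy sketch with made-up names (not
GCC's actual bitmap code, which also caches the most recently used element,
so only out-of-order access hits the chain walk): setting a bit in the
element-chain representation has to search the chain for the right word,
while a flat sbitmap just indexes the word directly.

#include <cstdint>
#include <cstdio>

struct toy_bitmap_element
{
  toy_bitmap_element *next;
  unsigned index;   /* which 64-bit word this element covers */
  uint64_t bits;
};

struct toy_bitmap
{
  toy_bitmap_element *head = nullptr;

  /* O(number of elements): walk the ordered element chain to find (or
     create) the element covering BIT.  A flat sbitmap instead computes
     the word index directly, which is the O(1) add_ssa_edge we want.  */
  void set (unsigned bit)
  {
    unsigned index = bit / 64;
    toy_bitmap_element **p = &head;
    while (*p && (*p)->index < index)
      p = &(*p)->next;
    if (!*p || (*p)->index != index)
      *p = new toy_bitmap_element { *p, index, 0 };
    (*p)->bits |= uint64_t (1) << (bit % 64);
  }
};

int
main ()
{
  toy_bitmap b;
  b.set (65536);   /* creates the element covering word 1024 */
  b.set (1);       /* out-of-order: has to search the chain again */
  printf ("bits set\n");
}

Since stmt uids bound the universe, the flat representation stays
proportional to the function size (the patch below allocates it with
gimple_stmt_max_uid).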

Bootstrapped and tested on x86_64-unknown-linux-gnu.

OK for trunk?

Thanks,
Richard.

2018-09-19  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63155
	* tree-ssa-propagate.h
	(ssa_propagation_engine::process_ssa_edge_worklist): Change
	prototype.
	* tree-ssa-propagate.c (ssa_edge_worklist): Turn into an sbitmap.
	(add_ssa_edge): Adjust.
	(ssa_propagation_engine::process_ssa_edge_worklist): If there
	was nothing to do return false.
	(ssa_prop_init): Allocate and clear ssa_edge_worklist.
	(ssa_prop_fini): Adjust.
	(ssa_propagation_engine::ssa_propagate): Elide the check
	for an empty ssa_edge_worklist by using the
	process_ssa_edge_worklist return value.

Comments

Jakub Jelinek Sept. 19, 2018, 1:27 p.m. UTC | #1
On Wed, Sep 19, 2018 at 03:06:30PM +0200, Richard Biener wrote:
> 
> The second testcase in the above PR runs into our O(N) bitmap element
> search limitation and spends 8s (60%) of the compile-time in the SSA propagator
> engine (when optimizing).  The patch improves that to 0.9s (15%).  For the 
> first testcase it isn't that bad but still the patch improves CCP from 36% to 
> 14%.
> 
> The "solution" is to use sbitmap instead of a bitmap to avoid
> the linearity when doing add_ssa_edge.  We pay for that (but not
> actually with the testcases) with a linear-time bitmap_first_set_bit
> in process_ssa_edge_worklist.  I do not (yet...) have a testcase

Perhaps you could remember the first set bit number in an integral variable
next to the bitmap.  On bitmap_set_bit this would be just a comparison +
conditional update, in simulate_stmt it would be again conditional on
clearing the first set bit and in that case it could start from that bit
onwards rather than from the beginning (EXECUTE_IF_SET_IN_BITMAP), similarly
in process_ssa_edge_worklist (non-conditional in that case, we are always
clearing the first bit).  Or instead of tracking exact first bit track
it lazily, i.e. just remember the smallest bitnum we've set and in
process_ssa_edge_worklist don't use bitmap_first_set_bit, but walk from
that minimum using EXECUTE_IF_SET_IN_BITMAP and update the minimum to
whatever we've found.

Similarly, empty_p can be done in constant time by keeping on the side a
count of set bits and updating it on set_bit when the bit was previously not
set, as well as on clear_bit.
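
A standalone sketch of that bookkeeping (hypothetical names, not a patch
against the actual sbitmap API): a lazy lower bound for the first set bit
plus a population count kept next to the flat bit vector.

#include <cstdint>
#include <vector>

struct worklist_bitmap
{
  std::vector<uint64_t> words;
  unsigned min_hint;  /* invariant: no bit below min_hint is set */
  unsigned count;     /* number of set bits; makes empty_p () O(1) */

  explicit worklist_bitmap (unsigned nbits)
    : words ((nbits + 63) / 64, 0), min_hint (nbits), count (0) {}

  void set (unsigned bit)
  {
    uint64_t mask = uint64_t (1) << (bit % 64);
    if (!(words[bit / 64] & mask))
      {
        words[bit / 64] |= mask;
        count++;
        if (bit < min_hint)   /* comparison + conditional update */
          min_hint = bit;
      }
  }

  void clear (unsigned bit)
  {
    uint64_t mask = uint64_t (1) << (bit % 64);
    if (words[bit / 64] & mask)
      {
        /* Bits below min_hint stay clear, so the lazy lower bound
           remains valid even when we clear the current minimum.  */
        words[bit / 64] &= ~mask;
        count--;
      }
  }

  bool empty_p () const { return count == 0; }

  /* Scan from the hint instead of from bit 0, then tighten the hint
     to whatever we found.  */
  int first_set_bit ()
  {
    for (unsigned i = min_hint / 64; i < words.size (); ++i)
      if (words[i])
        {
          unsigned bit = i * 64 + __builtin_ctzll (words[i]);
          min_hint = bit;
          return bit;
        }
    min_hint = (unsigned) (words.size () * 64);
    return -1;
  }
};

int
main ()
{
  worklist_bitmap w (1 << 16);
  w.set (40000);
  while (!w.empty_p ())     /* O(1) emptiness test via the count */
    {
      int uid = w.first_set_bit ();
      w.clear (uid);
    }
}

The hint only improves the scan start point (and is easily invalidated, as
noted below), but the count makes empty_p constant-time unconditionally.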

	Jakub
Richard Biener Sept. 19, 2018, 2:04 p.m. UTC | #2
On Wed, 19 Sep 2018, Jakub Jelinek wrote:

> On Wed, Sep 19, 2018 at 03:06:30PM +0200, Richard Biener wrote:
> > 
> > The second testcase in the above PR runs into our O(N) bitmap element
> > search limitation and spends 8s (60%) of the compile-time in the SSA propagator
> > engine (when optimizing).  The patch improves that to 0.9s (15%).  For the 
> > first testcase it isn't that bad but still the patch improves CCP from 36% to 
> > 14%.
> > 
> > The "solution" is to use sbitmap instead of a bitmap to avoid
> > the linearity when doing add_ssa_edge.  We pay for that (but not
> > actually with the testcases) with a linear-time bitmap_first_set_bit
> > in process_ssa_edge_worklist.  I do not (yet...) have a testcase
> 
> Perhaps you could remember the first set bit number in an integral variable
> next to the bitmap.  On bitmap_set_bit this would be just a comparison +
> conditional update, in simulate_stmt it would be again conditional on
> clearing the first set bit and in that case it could start from that bit
> onwards rather than from the beginning (EXECUTE_IF_SET_IN_BITMAP), similarly
> in process_ssa_edge_worklist (non-conditional in that case, we are always
> clearing the first bit).  Or instead of tracking exact first bit track
> it lazily, i.e. just remember the smallest bitnum we've set and in
> process_ssa_edge_worklist don't use bitmap_first_set_bit, but walk from
> that minimum using EXECUTE_IF_SET_IN_BITMAP and update the minimum to
> whatever we've found.

I thought about that but as you say it's easily invalidated and we can
only optimize the searching start point.  So it felt somewhat like
a hack ;)

I can still do that if you think that otherwise using an sbitmap
is fine (I'm not 100% convinced about that yet).

> Similarly, empty_p can be done linearly by keeping on the side the count of
> set bits and updating it on set_bit when previously not set, as well on
> clear_bit.

I elided empty_p by re-using the fact that bitmap_first_set_bit
already has to do the walk (and may fail).  Of course using an
on-the-side count might be more efficient (I was wondering why
sbitmap itself doesn't have that).

I wonder if reviving Steven's [patch][RFC] bitmaps as lists *or* trees
makes more sense so we can switch this kind of random-access bitmap
to tree view.

Richard.
Steven Bosscher Sept. 19, 2018, 2:30 p.m. UTC | #3
On Wed, Sep 19, 2018 at 3:06 PM Richard Biener wrote:
> If we'd only had a O(log n) search sparse bitmap implementation ...
> (Steven posted patches to switch bitmap from/to such one but IIRC
> that at least lacked bitmap_first_set_bit).

But bitmap_first_set_bit would be easy to implement. Just take the
left-most node of the splay tree.

Actually all bit-tests would be easy to implement. It's only
enumeration and set operations on the tree-views that would be
complicated stuff (easier to "switch views" than re-implement).
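
A standalone sketch of that observation (hypothetical types, not the posted
patches): with the elements kept in a search tree keyed by element index,
first_set_bit is the left-most node plus a ctz on its word.

#include <cstdint>
#include <cstdio>

struct tree_element
{
  tree_element *left, *right;
  unsigned index;   /* which 64-bit word this element covers */
  uint64_t bits;    /* non-zero in a live element */
};

/* O(tree depth): descend left from the root.  For a splay tree the
   amortized depth is O(log n); splaying the node found afterwards keeps
   the usual amortized bounds.  */
static int
first_set_bit (tree_element *root)
{
  if (!root)
    return -1;
  while (root->left)
    root = root->left;
  return root->index * 64 + __builtin_ctzll (root->bits);
}

int
main ()
{
  tree_element l = { nullptr, nullptr, 0, uint64_t (1) << 5 };
  tree_element r = { nullptr, nullptr, 7, uint64_t (1) << 3 };
  tree_element root = { &l, &r, 3, uint64_t (1) << 0 };
  printf ("%d\n", first_set_bit (&root));   /* prints 5 */
}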

Impressive that you remember >5yr old patches like that ;-)

Ciao!
Steven
Richard Biener Sept. 20, 2018, 7 a.m. UTC | #4
On Wed, 19 Sep 2018, Steven Bosscher wrote:

> On Wed, Sep 19, 2018 at 3:06 PM Richard Biener wrote:
> > If we'd only had a O(log n) search sparse bitmap implementation ...
> > (Steven posted patches to switch bitmap from/to such one but IIRC
> > that at least lacked bitmap_first_set_bit).
> 
> But bitmap_first_set_bit would be easy to implement. Just take the
> left-most node of the splay tree.
> 
> Actually all bit-tests would be easy to implement. It's only
> enumeration and set operations on the tree-views that would be
> complicated fluff (easier to "switch views" than re-implement).
> 
> Impressive that you remember >5yr old patches like that ;-)

;)

Well, it's still useful (but obviously doesn't apply).  Not sure
if the worst-case behavior of splay trees makes it a pointless
exercise though ;)

Richard.
Richard Biener Sept. 20, 2018, 12:31 p.m. UTC | #5
On Wed, 19 Sep 2018, Richard Biener wrote:

> 
> The second testcase in the above PR runs into our O(N) bitmap element
> search limitation and spends 8s (60%) of the compile-time in the SSA propagator
> engine (when optimizing).  The patch improves that to 0.9s (15%).  For the 
> first testcase it isn't that bad but still the patch improves CCP from 36% to 
> 14%.
> 
> The "solution" is to use sbitmap instead of a bitmap to avoid
> the linearity when doing add_ssa_edge.  We pay for that (but not
> actually with the testcases) with a linear-time bitmap_first_set_bit
> in process_ssa_edge_worklist.  I do not (yet...) have a testcase
> that overall gets slower with this approach.  I suppose using
> std::set<int> would "solve" the complexity issue but we'd pay
> back with horribly inefficient memory use.  Similarly with
> our sparse bitmap implementation which lacks an ordered
> first_set_bit (it only can get any set bit fast, breaking optimal
> iteration order).
> 
> If we'd only had a O(log n) search sparse bitmap implementation ...
> (Steven posted patches to switch bitmap from/to such one but IIRC
> that at least lacked bitmap_first_set_bit).
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
> OK for trunk?

So it turns out that while the bitmap data structure isn't optimal,
the issue with the testcase is that we end up with a full-set-universe
for the SSA worklist, mostly because we are queueing PHIs via uses
on backedges that are not yet executable (so we'd re-simulate the PHI
anyway once that edge became executable - no need to set the individual bits).

So I'm testing the following instead.

Bootstrap & regtest running on x86_64-unknown-linux-gnu.

Richard.

2018-09-20  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/63155
	* tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to
	the worklist when the edge of the respective argument isn't
	executable.

Index: gcc/tree-ssa-propagate.c
===================================================================
--- gcc/tree-ssa-propagate.c	(revision 264438)
+++ gcc/tree-ssa-propagate.c	(working copy)
@@ -168,10 +168,18 @@ add_ssa_edge (tree var)
   FOR_EACH_IMM_USE_FAST (use_p, iter, var)
     {
       gimple *use_stmt = USE_STMT (use_p);
+      basic_block use_bb = gimple_bb (use_stmt);
 
       /* If we did not yet simulate the block wait for this to happen
          and do not add the stmt to the SSA edge worklist.  */
-      if (! (gimple_bb (use_stmt)->flags & BB_VISITED))
+      if (! (use_bb->flags & BB_VISITED))
+	continue;
+
+      /* If this is a use on a not yet executable edge do not bother to
+	 queue it.  */
+      if (gimple_code (use_stmt) == GIMPLE_PHI
+	  && !(EDGE_PRED (use_bb, PHI_ARG_INDEX_FROM_USE (use_p))->flags
+	       & EDGE_EXECUTABLE))
 	continue;
 
       if (prop_simulate_again_p (use_stmt)

Patch

Index: gcc/tree-ssa-propagate.h
===================================================================
--- gcc/tree-ssa-propagate.h	(revision 264418)
+++ gcc/tree-ssa-propagate.h	(working copy)
@@ -94,7 +94,7 @@  class ssa_propagation_engine
  private:
   /* Internal implementation details.  */
   void simulate_stmt (gimple *stmt);
-  void process_ssa_edge_worklist (void);
+  bool process_ssa_edge_worklist (void);
   void simulate_block (basic_block);
 
 };
Index: gcc/tree-ssa-propagate.c
===================================================================
--- gcc/tree-ssa-propagate.c	(revision 264418)
+++ gcc/tree-ssa-propagate.c	(working copy)
@@ -119,7 +119,7 @@  static int *cfg_order_to_bb;
    definition has changed.  SSA edges are def-use edges in the SSA
    web.  For each D-U edge, we store the target statement or PHI node
    UID in a bitmap.  UIDs order stmts in execution order.   */
-static bitmap ssa_edge_worklist;
+static sbitmap ssa_edge_worklist;
 static vec<gimple *> uid_to_stmt;
 
 /* Return true if the block worklist empty.  */
@@ -175,8 +175,9 @@  add_ssa_edge (tree var)
 	continue;
 
       if (prop_simulate_again_p (use_stmt)
-	  && bitmap_set_bit (ssa_edge_worklist, gimple_uid (use_stmt)))
+	  && !bitmap_bit_p (ssa_edge_worklist, gimple_uid (use_stmt)))
 	{
+	  bitmap_set_bit (ssa_edge_worklist, gimple_uid (use_stmt));
 	  uid_to_stmt[gimple_uid (use_stmt)] = use_stmt;
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    {
@@ -317,11 +318,14 @@  ssa_propagation_engine::simulate_stmt (g
    when an SSA edge is added to it in simulate_stmt.  Return true if a stmt
    was simulated.  */
 
-void
+bool
 ssa_propagation_engine::process_ssa_edge_worklist (void)
 {
   /* Process the next entry from the worklist.  */
-  unsigned stmt_uid = bitmap_first_set_bit (ssa_edge_worklist);
+  int stmt_uid = bitmap_first_set_bit (ssa_edge_worklist);
+  if (stmt_uid == -1)
+    return false;
+
   bitmap_clear_bit (ssa_edge_worklist, stmt_uid);
   gimple *stmt = uid_to_stmt[stmt_uid];
 
@@ -335,6 +339,7 @@  ssa_propagation_engine::process_ssa_edge
     }
 
   simulate_stmt (stmt);
+  return true;
 }
 
 
@@ -412,9 +417,6 @@  ssa_prop_init (void)
   edge_iterator ei;
   basic_block bb;
 
-  /* Worklists of SSA edges.  */
-  ssa_edge_worklist = BITMAP_ALLOC (NULL);
-
   /* Worklist of basic-blocks.  */
   bb_to_cfg_order = XNEWVEC (int, last_basic_block_for_fn (cfun) + 1);
   cfg_order_to_bb = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
@@ -455,6 +457,10 @@  ssa_prop_init (void)
     }
   uid_to_stmt.safe_grow (gimple_stmt_max_uid (cfun));
 
+  /* Worklists of SSA edges.  */
+  ssa_edge_worklist = sbitmap_alloc (gimple_stmt_max_uid (cfun));
+  bitmap_clear (ssa_edge_worklist);
+
   /* Seed the algorithm by adding the successors of the entry block to the
      edge worklist.  */
   FOR_EACH_EDGE (e, ei, ENTRY_BLOCK_PTR_FOR_FN (cfun)->succs)
@@ -473,7 +479,8 @@  ssa_prop_fini (void)
   BITMAP_FREE (cfg_blocks);
   free (bb_to_cfg_order);
   free (cfg_order_to_bb);
-  BITMAP_FREE (ssa_edge_worklist);
+  sbitmap_free (ssa_edge_worklist);
+  ssa_edge_worklist = NULL;
   uid_to_stmt.release ();
 }
 
@@ -789,8 +796,7 @@  ssa_propagation_engine::ssa_propagate (v
   ssa_prop_init ();
 
   /* Iterate until the worklists are empty.  */
-  while (! cfg_blocks_empty_p ()
-	 || ! bitmap_empty_p (ssa_edge_worklist))
+  while (1)
     {
       /* First simulate whole blocks.  */
       if (! cfg_blocks_empty_p ())
@@ -802,7 +808,10 @@  ssa_propagation_engine::ssa_propagate (v
 	}
 
       /* Then simulate from the SSA edge worklist.  */
-      process_ssa_edge_worklist ();
+      if (process_ssa_edge_worklist ())
+	continue;
+
+      break;
     }
 
   ssa_prop_fini ();