Patchwork Use a smaller, dynamic worklist in compute_global_livein

login
register
mail settings
Submitter Richard Guenther
Date Aug. 7, 2012, 8:31 a.m.
Message ID <CAFiYyc0UCZEDN083psy-tYdk3jOiSVAKgCwQGwy6=X53W=CChA@mail.gmail.com>
Download mbox | patch
Permalink /patch/175535/
State New
Headers show

Comments

Richard Guenther - Aug. 7, 2012, 8:31 a.m.
On Tue, Aug 7, 2012 at 8:24 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> Hello,
>
> In the test case for PR54146, compute_global_livein allocates/frees a
> worklist for >400,000 basic blocks on each invocation. And it's called
> a lot, for rewrite_into_loop_closed_ssa. But the maximum number of
> basic blocks ever on the work list was only ~6500. So the work list
> can be much smaller most of the time.
>
> Bootstrapped&tested on x86_64-unknown-linux-gnu. OK for trunk?

comments elsewhere about tracking vops can you fix that to
s/SSA_OP_ALL_USES/SSA_OP_USE/?

 void
-compute_global_livein (bitmap livein ATTRIBUTE_UNUSED, bitmap
def_blocks ATTRIBUTE_UNUSED)
+compute_global_livein (bitmap livein, bitmap def_blocks)
 {
-  basic_block bb, *worklist, *tos;
   unsigned i;
   bitmap_iterator bi;
+  VEC (basic_block, heap) *worklist;

-  tos = worklist
-    = (basic_block *) xmalloc (sizeof (basic_block) * (last_basic_block + 1));
+  /* Normally the work list size is bounded by the number of basic
+     blocks in the largest loop.  We don't know this number, but we
+     can be fairly sure that it will be relatively small.  */
+  worklist = VEC_alloc (basic_block, heap, MAX (8, n_basic_blocks / 100));

I suppose if we really want to optimize this we can make worklist static
so we at most grow once.  (in case you want to optimize allocation overhead,
not memory used).  Other structures in the ssa rewriter have this kind of
lifetime, too (the per-SSA name info for example).

Ok with the first change, whatever you prefer on the
compute_global_livein thing.

Thanks,
Richard.

> Ciao!
> Steven
Steven Bosscher - Aug. 7, 2012, 9:03 a.m.
On Tue, Aug 7, 2012 at 10:31 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Tue, Aug 7, 2012 at 8:24 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> Hello,
>>
>> In the test case for PR54146, compute_global_livein allocates/frees a
>> worklist for >400,000 basic blocks on each invocation. And it's called
>> a lot, for rewrite_into_loop_closed_ssa. But the maximum number of
>> basic blocks ever on the work list was only ~6500. So the work list
>> can be much smaller most of the time.
>>
>> Bootstrapped&tested on x86_64-unknown-linux-gnu. OK for trunk?
>
> Index: tree-ssa-loop-manip.c
> ===================================================================
> --- tree-ssa-loop-manip.c       (revision 190176)
> +++ tree-ssa-loop-manip.c       (working copy)
> @@ -162,10 +162,8 @@ add_exit_phis_var (tree var, bitmap live
>    basic_block def_bb = gimple_bb (SSA_NAME_DEF_STMT (var));
>    bitmap_iterator bi;
>
> -  if (is_gimple_reg (var))
> -    bitmap_clear_bit (livein, def_bb->index);
> -  else
> -    bitmap_set_bit (livein, def_bb->index);
> +  gcc_checking_assert (is_gimple_reg (var));
> +  bitmap_clear_bit (livein, def_bb->index);
>
> !is_gimple_reg is true for virtual operand PHIs ... and at least
> find_uses_to_rename_stmt loops over all uses (so, I suppose given the
> comments elsewhere about tracking vops can you fix that to
> s/SSA_OP_ALL_USES/SSA_OP_USE/?

I'll give that a try.


>  void
> -compute_global_livein (bitmap livein ATTRIBUTE_UNUSED, bitmap
> def_blocks ATTRIBUTE_UNUSED)
> +compute_global_livein (bitmap livein, bitmap def_blocks)
>  {
> -  basic_block bb, *worklist, *tos;
>    unsigned i;
>    bitmap_iterator bi;
> +  VEC (basic_block, heap) *worklist;
>
> -  tos = worklist
> -    = (basic_block *) xmalloc (sizeof (basic_block) * (last_basic_block + 1));
> +  /* Normally the work list size is bounded by the number of basic
> +     blocks in the largest loop.  We don't know this number, but we
> +     can be fairly sure that it will be relatively small.  */
> +  worklist = VEC_alloc (basic_block, heap, MAX (8, n_basic_blocks / 100));
>
> I suppose if we really want to optimize this we can make worklist static
> so we at most grow once.  (in case you want to optimize allocation overhead,
> not memory used).  Other structures in the ssa rewriter have this kind of
> lifetime, too (the per-SSA name info for example).

I don't want this to be static. This shouldn't be a persistent piece of data.
But I actually did try a static work list (the full list, not the
VEC), and it made no difference at all on the timings.

All time is spent in the loop in compute_global_livein itself. There
are >400,000 basic blocks but the maximum loop depth is only 3. I've
tried out livein as a sparseset last night (allocated and filled in
add_exit_phis_var and re-used for every call to add_exit_phis_edge),
that reduces the time spent in compute_global_livein by a factor 2.5
for this test case (i.e. not the order-of-magnitude change I'm looking
for).

Why can't the normal SSA updater be used to rewrite into loop-closed SSA form?

Ciao!
Steven
Richard Guenther - Aug. 7, 2012, 9:22 a.m.
On Tue, Aug 7, 2012 at 11:03 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Tue, Aug 7, 2012 at 10:31 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Tue, Aug 7, 2012 at 8:24 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>>> Hello,
>>>
>>> In the test case for PR54146, compute_global_livein allocates/frees a
>>> worklist for >400,000 basic blocks on each invocation. And it's called
>>> a lot, for rewrite_into_loop_closed_ssa. But the maximum number of
>>> basic blocks ever on the work list was only ~6500. So the work list
>>> can be much smaller most of the time.
>>>
>>> Bootstrapped&tested on x86_64-unknown-linux-gnu. OK for trunk?
>>
>> Index: tree-ssa-loop-manip.c
>> ===================================================================
>> --- tree-ssa-loop-manip.c       (revision 190176)
>> +++ tree-ssa-loop-manip.c       (working copy)
>> @@ -162,10 +162,8 @@ add_exit_phis_var (tree var, bitmap live
>>    basic_block def_bb = gimple_bb (SSA_NAME_DEF_STMT (var));
>>    bitmap_iterator bi;
>>
>> -  if (is_gimple_reg (var))
>> -    bitmap_clear_bit (livein, def_bb->index);
>> -  else
>> -    bitmap_set_bit (livein, def_bb->index);
>> +  gcc_checking_assert (is_gimple_reg (var));
>> +  bitmap_clear_bit (livein, def_bb->index);
>>
>> !is_gimple_reg is true for virtual operand PHIs ... and at least
>> find_uses_to_rename_stmt loops over all uses (so, I suppose given the
>> comments elsewhere about tracking vops can you fix that to
>> s/SSA_OP_ALL_USES/SSA_OP_USE/?
>
> I'll give that a try.
>
>
>>  void
>> -compute_global_livein (bitmap livein ATTRIBUTE_UNUSED, bitmap
>> def_blocks ATTRIBUTE_UNUSED)
>> +compute_global_livein (bitmap livein, bitmap def_blocks)
>>  {
>> -  basic_block bb, *worklist, *tos;
>>    unsigned i;
>>    bitmap_iterator bi;
>> +  VEC (basic_block, heap) *worklist;
>>
>> -  tos = worklist
>> -    = (basic_block *) xmalloc (sizeof (basic_block) * (last_basic_block + 1));
>> +  /* Normally the work list size is bounded by the number of basic
>> +     blocks in the largest loop.  We don't know this number, but we
>> +     can be fairly sure that it will be relatively small.  */
>> +  worklist = VEC_alloc (basic_block, heap, MAX (8, n_basic_blocks / 100));
>>
>> I suppose if we really want to optimize this we can make worklist static
>> so we at most grow once.  (in case you want to optimize allocation overhead,
>> not memory used).  Other structures in the ssa rewriter have this kind of
>> lifetime, too (the per-SSA name info for example).
>
> I don't want this to be static. This shouldn't be a persistent piece of data.
> But I actually did try a static work list (the full list, not the
> VEC), and it made no difference at all on the timings.

Ah, so the patch only reduces allocation but not compile-time itself.

> All time is spent in the loop in compute_global_livein itself. There
> are >400,000 basic blocks but the maximum loop depth is only 3. I've
> tried out livein as a sparseset last night (allocated and filled in
> add_exit_phis_var and re-used for every call to add_exit_phis_edge),
> that reduces the time spent in compute_global_livein by a factor 2.5
> for this test case (i.e. not the order-of-magnitude change I'm looking
> for).

Another optimization would be to do

@@ -440,13 +442,13 @@ compute_global_livein (bitmap livein ATT
              && ! bitmap_bit_p (def_blocks, pred_index)
              && bitmap_set_bit (livein, pred_index))
            {
-             *tos++ = pred;
+             VEC_safe_push (basic_block, heap, worklist, pred);

thus combine the bitmap_set_bit and bitmap_bit_p tests on livein.

> Why can't the normal SSA updater be used to rewrite into loop-closed SSA form?

It does not have a mode to trigger PHI inserts for loop-closed SSA
form.  I suppose
most of the time we should try harder and not call rewrite_into_loop_closed_ssa
so many times (eventually for the easy cases where we _do_ have an old->new
name mapping or symbols to rename the SSA updater should deal with this).

Richard.

> Ciao!
> Steven
Steven Bosscher - Aug. 7, 2012, 9:45 a.m.
On Tue, Aug 7, 2012 at 11:22 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> Another optimization would be to do
>
> @@ -440,13 +442,13 @@ compute_global_livein (bitmap livein ATT
>               && ! bitmap_bit_p (def_blocks, pred_index)
>               && bitmap_set_bit (livein, pred_index))
>             {
> -             *tos++ = pred;
> +             VEC_safe_push (basic_block, heap, worklist, pred);
>
> thus combine the bitmap_set_bit and bitmap_bit_p tests on livein.

That's just a micro-optimization that doesn't help much, because after
the bitmap_bit_p in the if, the last accessed bitmap member is cached
and the bitmap_set_bit is almost free.

What is needed here, is a (much) smaller computed livein. In the end,
the whole set of ~140,000 livein blocks is pruned to just 3, namely
the loop exits...

Ciao!
Steven
Richard Guenther - Aug. 7, 2012, 9:52 a.m.
On Tue, Aug 7, 2012 at 11:45 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Tue, Aug 7, 2012 at 11:22 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> Another optimization would be to do
>>
>> @@ -440,13 +442,13 @@ compute_global_livein (bitmap livein ATT
>>               && ! bitmap_bit_p (def_blocks, pred_index)
>>               && bitmap_set_bit (livein, pred_index))
>>             {
>> -             *tos++ = pred;
>> +             VEC_safe_push (basic_block, heap, worklist, pred);
>>
>> thus combine the bitmap_set_bit and bitmap_bit_p tests on livein.
>
> That's just a micro-optimization that doesn't help much, because after
> the bitmap_bit_p in the if, the last accessed bitmap member is cached
> and the bitmap_set_bit is almost free.
>
> What is needed here, is a (much) smaller computed livein. In the end,
> the whole set of ~140,000 livein blocks is pruned to just 3, namely
> the loop exits...

So I wonder why simply looping over all SSA defs in a loop body and checking
whether a use is outside of it is not enough to compute this information ...
(yes, we might end up creating too many loop closed PHIs, namely on all
exits rather than just on those that the name is live over, but ... at
least maybe
we can prune it with DOM info)

Richard.

> Ciao!
> Steven
Steven Bosscher - Aug. 7, 2012, 9:58 a.m.
On Tue, Aug 7, 2012 at 11:52 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> So I wonder why simply looping over all SSA defs in a loop body and checking
> whether a use is outside of it is not enough to compute this information ...
> (yes, we might end up creating too many loop closed PHIs, namely on all
> exits rather than just on those that the name is live over, but ... at
> least maybe we can prune it with DOM info)

I've been thinking the same thing, but I don't understand loop-closed
SSA form well enough to try and rewrite it along those lines.

Ciao!
Steven
Richard Guenther - Aug. 7, 2012, 10:39 a.m.
On Tue, Aug 7, 2012 at 11:58 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Tue, Aug 7, 2012 at 11:52 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> So I wonder why simply looping over all SSA defs in a loop body and checking
>> whether a use is outside of it is not enough to compute this information ...
>> (yes, we might end up creating too many loop closed PHIs, namely on all
>> exits rather than just on those that the name is live over, but ... at
>> least maybe we can prune it with DOM info)
>
> I've been thinking the same thing, but I don't understand loop-closed
> SSA form well enough to try and rewrite it along those lines.

I've put it on my TODO to look into.

Richard.

> Ciao!
> Steven

Patch

Index: tree-ssa-loop-manip.c
===================================================================
--- tree-ssa-loop-manip.c       (revision 190176)
+++ tree-ssa-loop-manip.c       (working copy)
@@ -162,10 +162,8 @@  add_exit_phis_var (tree var, bitmap live
   basic_block def_bb = gimple_bb (SSA_NAME_DEF_STMT (var));
   bitmap_iterator bi;

-  if (is_gimple_reg (var))
-    bitmap_clear_bit (livein, def_bb->index);
-  else
-    bitmap_set_bit (livein, def_bb->index);
+  gcc_checking_assert (is_gimple_reg (var));
+  bitmap_clear_bit (livein, def_bb->index);

!is_gimple_reg is true for virtual operand PHIs ... and at least
find_uses_to_rename_stmt loops over all uses (so, I suppose given the