
[RFC,P2,PR,tree-optimization/33562] Lowering more complex assignments.

Message ID 56BD1EFB.90008@redhat.com
State New

Commit Message

Jeff Law Feb. 11, 2016, 11:53 p.m. UTC
So I've never thought much about our Complex support, and I don't have a 
lot of confidence in our testsuite's coverage for these changes.  So I'd 
really appreciate someone with more experience thinking about this issue 
for a bit.

I was looking at 33562 (P2), figuring that since it was DSE, which I wrote 
~15 years ago, I could probably get up to speed and do something about it 
without major surgery (and for Complex I'm pretty sure I can).

But while looking at the gimple code we handed off to DSE, it occurred 
to me that this would be easy to optimize if it were lowered.  Then, of 
course, I remembered that we do lower complex stuff.

So this turned into a short debugging session in tree-complex.c.

The key statement in the test looks like



complex int t = 0;

Where t is a complex object *and* has its address taken.  So the IL 
looks something like:

t = __complex__ (0, 0);



init_dont_simulate_again ignores this statement because the LHS is not 
an SSA_NAME (remember, t has had its address taken elsewhere in the 
sources).

So complex lowering ends up totally ignoring the function in question.

ISTM that we can and should still lower this code.  We don't want to set 
sim_again_p because the LHS is not in SSA form, so we don't really 
want/need to set up and track a lattice for this object.
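
For concreteness, the lowering we're after would turn that single store 
into two component stores, roughly like this (the shape is sketched by 
hand, not taken from a dump of the patched compiler):

   REALPART_EXPR <t> = 0;
   IMAGPART_EXPR <t> = 0;

and those scalar stores are exactly the kind of thing DSE can then prove 
dead piecewise once the components are overwritten later.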

So the first step was to figure out the conditions under which we ought 
to detect an assignment to/from an aggregate that is not in SSA form.

I think we can do that with something like this in the GIMPLE_ASSIGN case:
              /* Simple assignments involving complex types where
                  the RHS or LHS is addressable should be lowered, but
                  do not inherently trigger resimulation.  */
               if (TREE_CODE (TREE_TYPE (gimple_assign_lhs (stmt)))
                   == COMPLEX_TYPE)
                 saw_a_complex_op = true;


Essentially anytime we see a simple assignment (which would include 
initialization) we go ahead and let the complex lowering pass run, but 
we don't set sim_again_p.


Then we need to actually lower the construct.  expand_complex_move has 
99% of what we need.   If we take the code from this clause:

    else if (rhs && TREE_CODE (rhs) == SSA_NAME && !TREE_SIDE_EFFECTS (lhs))
[ ... ]

Refactor it into its own function and call it when this condition is true:

+  /* Assignment to/from a complex object that is addressable.  */
+  else if (TREE_CODE (lhs) == VAR_DECL
+          && TREE_CODE (TREE_TYPE (lhs)) == COMPLEX_TYPE
+          && (TREE_CODE (rhs) == VAR_DECL
+              || TREE_CODE (rhs) == COMPLEX_CST
+              || TREE_CODE (rhs) == COMPLEX_EXPR)
+          && TREE_CODE (TREE_TYPE (rhs)) == COMPLEX_TYPE)

Then complex-4.c and complex-5.c both work.  A variant of complex-4.c 
that I hacked up, where we pass in a complex int and assign it to a 
local addressable complex int, gets lowered (and better optimized) as well.

So what am I missing here?  Is there any kind of aliasing issues I need 
to be aware of?  Any other dragons lurking?
jeff

Comments

Richard Biener Feb. 12, 2016, 9:04 a.m. UTC | #1
On Fri, Feb 12, 2016 at 12:53 AM, Jeff Law <law@redhat.com> wrote:
>
> So I've never thought much about our Complex support and I don't have a lot
> of confidence in the coverage for our testsuite for these changes. So I'd
> really appreciate someone with more experience thinking about this issue for
> a bit.
>
> I was looking at 33562 (P2), figuring it was DSE, which I wrote ~15 years
> ago, I could probably get up-to-speed and so something about it without
> major surgery (and for Complex I'm pretty sure I can).
>
> But while looking at the gimple code we handed off to DSE, it occurred to me
> that this would be easy to optimize if it were lowered.  Then, of course i
> remembered that we did lower complex stuff.
>
> So this turned into a short debugging session in tree-complex.c.
>
> The key statement in the test looks like
>
>
>
> complex int t = 0
>
> Where x is a complex object *and* has its address taken.  So the IL looks
> something like:
>
> t = __complex__ (0, 0);
>
>
>
> init_dont_simulate_again ignores this statement because the LHS is not an
> SSA_NAME (remember, t has had its address taken elsewhere in the sources).
>
> So complex lowering ends up totally ignoring the function in question.
>
> ISTM that we can and should still lower this code.  We don't want to set
> sim_again_p because the LHS is not in SSA form, so we don't really want/need
> to set up and track a lattice for this object.
>
> So the first step was to figure out the conditions under which we ought to
> detect an assignment to/from an aggregate that is not in SSA_FORM.
>
> I think we can do that with something like this in the GIMPLE_ASSIGN case:
>              /* Simple assignments involving complex types where
>                  the RHS or LHS is addressable should be lowered, but
>                  do not inherently trigger resimulation.  */
>               if (TREE_CODE (TREE_TYPE (gimple_assign_lhs (stmt)))
>                   == COMPLEX_TYPE)
>                 saw_a_complex_op = true;
>
>
> Essentially anytime we see a simple assignment (which would include
> initialization) we go ahead and let the complex lowering pass run, but we
> don't set sim_again_p.
>
>
> Then we need to actually lower the construct.  expand_complex_move has 99%
> of what we need.   If we take the code from this clause:
>
>    else if (rhs && TREE_CODE (rhs) == SSA_NAME && !TREE_SIDE_EFFECTS (lhs))
> [ ... ]
>
> Refactor it into its own function and call it when this condition is true:
>
> +  /* Assignment to/from a complex object that is addressable.  */
> +  else if (TREE_CODE (lhs) == VAR_DECL
> +          && TREE_CODE (TREE_TYPE (lhs)) == COMPLEX_TYPE
> +          && (TREE_CODE (rhs) == VAR_DECL
> +              || TREE_CODE (rhs) == COMPLEX_CST
> +              || TREE_CODE (rhs) == COMPLEX_EXPR)
> +          && TREE_CODE (TREE_TYPE (rhs)) == COMPLEX_TYPE)
>
> Then complex-4.c and complex-5.c both work.  A variant of complex-4.c that I
> hacked up were we pass in a complex int, and assign that to a local
> addressable complex int gets lowered (and better optimized) as well.
>
> So what am I missing here?  Is there any kind of aliasing issues I need to
> be aware of?  Any other dragons lurking?

I think what you are missing is that you'll "optimize"

_Complex double x, y;

void foo (void)
{
  x = y;
}

but then, depending on the target's capabilities, DCmode moves might be
cheaper than four DFmode moves and nothing will combine them back
together again.

In complex lowering we're careful to only "optimize" when we know how to
extract components in an efficient way (well, mostly).

That the DSE opportunities in complex-[45].c are exposed if you lower
the aggregate inits is obvious but the lowering is only a workaround for
our lack of handling for this case.  I think the specific cases can be
easily made to work by enhancing tree DSE in dse_possible_dead_store_p
to pick up partial kills by adjusting *ref (its offset/size) - keeping the
original ref for picking up uses.
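
To make the *ref-adjustment idea concrete, here is a minimal standalone
sketch in plain C; the type and function names are invented for
illustration (GCC's ao_ref tracks offset/size in bits, and the real change
would live in dse_possible_dead_store_p):

#include <stdbool.h>

/* Byte range of the candidate store that has not yet been overwritten:
   [offset, offset + size).  */
struct byte_range { unsigned offset, size; };

/* Trim R by a later store to [koff, koff + ksize).  A kill strictly in
   the middle of R cannot be represented by a single range and is ignored
   here - one reason a live-bytes bitmap is the more general choice.  */
void
trim_by_kill (struct byte_range *r, unsigned koff, unsigned ksize)
{
  unsigned kend = koff + ksize, rend = r->offset + r->size;

  if (koff <= r->offset && kend >= rend)
    r->size = 0;                            /* fully covered: store is dead */
  else if (koff <= r->offset && kend > r->offset && kend < rend)
    {
      r->size = rend - kend;                /* a prefix was killed */
      r->offset = kend;
    }
  else if (koff > r->offset && koff < rend && kend >= rend)
    r->size = koff - r->offset;             /* a suffix was killed */
}

bool
ref_is_dead (const struct byte_range *r)
{
  return r->size == 0;
}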

But really we simply need a better DSE algorithm.

I think the two testcases are quite artificial and any "workaround" we
implement will cause more regressions elsewhere (code-size, RA, etc.)
if there are no followup optimization opportunities.

Richard.


> jeff
>
>
>
>
>
> diff --git a/gcc/tree-complex.c b/gcc/tree-complex.c
> index d781a8a..64346a5 100644
> --- a/gcc/tree-complex.c
> +++ b/gcc/tree-complex.c
> @@ -240,6 +240,14 @@ init_dont_simulate_again (void)
>                 op0 = gimple_assign_rhs1 (stmt);
>               if (gimple_num_ops (stmt) > 2)
>                 op1 = gimple_assign_rhs2 (stmt);
> +
> +             /* Simple assignments involving complex types where
> +                the RHS or LHS is addressable should be lowered, but
> +                do not inherently trigger resimulation.  */
> +             if (TREE_CODE (TREE_TYPE (gimple_assign_lhs (stmt)))
> +                 == COMPLEX_TYPE)
> +               saw_a_complex_op = true;
> +
>               break;
>
>             case GIMPLE_COND:
> @@ -777,6 +785,67 @@ update_phi_components (basic_block bb)
>      }
>  }
>
> +/* Extract the real and imag parts from RHS of the statement at GSI
> +   into R and I respectively.
> +
> +   This wouldn't be necessary except that extracting these
> +   components from a COMPLEX_EXPR is different from everything
> +   else.  */
> +
> +static void
> +extract_real_and_imag_component (gimple_stmt_iterator *gsi, tree *r, tree
> *i)
> +{
> +  gimple *stmt = gsi_stmt (*gsi);
> +  if (gimple_assign_rhs_code (stmt) == COMPLEX_EXPR)
> +    {
> +      *r = gimple_assign_rhs1 (stmt);
> +      *i = gimple_assign_rhs2 (stmt);
> +    }
> +  else
> +    {
> +      tree tmp = gimple_assign_rhs1 (stmt);
> +      *r = extract_component (gsi, tmp, 0, true);
> +      *i = extract_component (gsi, tmp, 1, true);
> +    }
> +}
> +
> +/* Helper for expand_move_complex.  */
> +
> +static void
> +expand_complex_move_1 (gimple_stmt_iterator *gsi, gimple *stmt,
> +                      tree lhs, tree inner_type)
> +{
> +  tree x, r, i;
> +  gimple *t;
> +  location_t loc = gimple_location (stmt);
> +
> +  extract_real_and_imag_component (gsi, &r, &i);
> +  x = build1 (REALPART_EXPR, inner_type, unshare_expr (lhs));
> +  t = gimple_build_assign (x, r);
> +  gimple_set_location (t, loc);
> +  gsi_insert_before (gsi, t, GSI_SAME_STMT);
> +
> +  if (stmt == gsi_stmt (*gsi))
> +    {
> +      x = build1 (IMAGPART_EXPR, inner_type, unshare_expr (lhs));
> +      gimple_assign_set_lhs (stmt, x);
> +      gimple_assign_set_rhs1 (stmt, i);
> +    }
> +  else
> +    {
> +      x = build1 (IMAGPART_EXPR, inner_type, unshare_expr (lhs));
> +      t = gimple_build_assign (x, i);
> +      gimple_set_location (t, loc);
> +      gsi_insert_before (gsi, t, GSI_SAME_STMT);
> +
> +      stmt = gsi_stmt (*gsi);
> +      gcc_assert (gimple_code (stmt) == GIMPLE_RETURN);
> +      gimple_return_set_retval (as_a <greturn *> (stmt), lhs);
> +    }
> +
> +  update_stmt (stmt);
> +}
> +
>  /* Expand a complex move to scalars.  */
>
>  static void
> @@ -829,54 +898,20 @@ expand_complex_move (gimple_stmt_iterator *gsi, tree
> type)
>         }
>        else
>         {
> -         if (gimple_assign_rhs_code (stmt) != COMPLEX_EXPR)
> -           {
> -             r = extract_component (gsi, rhs, 0, true);
> -             i = extract_component (gsi, rhs, 1, true);
> -           }
> -         else
> -           {
> -             r = gimple_assign_rhs1 (stmt);
> -             i = gimple_assign_rhs2 (stmt);
> -           }
> +         extract_real_and_imag_component (gsi, &r, &i);
>           update_complex_assignment (gsi, r, i);
>         }
>      }
>    else if (rhs && TREE_CODE (rhs) == SSA_NAME && !TREE_SIDE_EFFECTS (lhs))
> -    {
> -      tree x;
> -      gimple *t;
> -      location_t loc;
> -
> -      loc = gimple_location (stmt);
> -      r = extract_component (gsi, rhs, 0, false);
> -      i = extract_component (gsi, rhs, 1, false);
> -
> -      x = build1 (REALPART_EXPR, inner_type, unshare_expr (lhs));
> -      t = gimple_build_assign (x, r);
> -      gimple_set_location (t, loc);
> -      gsi_insert_before (gsi, t, GSI_SAME_STMT);
> -
> -      if (stmt == gsi_stmt (*gsi))
> -       {
> -         x = build1 (IMAGPART_EXPR, inner_type, unshare_expr (lhs));
> -         gimple_assign_set_lhs (stmt, x);
> -         gimple_assign_set_rhs1 (stmt, i);
> -       }
> -      else
> -       {
> -         x = build1 (IMAGPART_EXPR, inner_type, unshare_expr (lhs));
> -         t = gimple_build_assign (x, i);
> -         gimple_set_location (t, loc);
> -         gsi_insert_before (gsi, t, GSI_SAME_STMT);
> -
> -         stmt = gsi_stmt (*gsi);
> -         gcc_assert (gimple_code (stmt) == GIMPLE_RETURN);
> -         gimple_return_set_retval (as_a <greturn *> (stmt), lhs);
> -       }
> -
> -      update_stmt (stmt);
> -    }
> +    expand_complex_move_1 (gsi, stmt, lhs, inner_type);
> +  /* Initialization of a complex object that is addressable.  */
> +  else if (TREE_CODE (lhs) == VAR_DECL
> +          && TREE_CODE (TREE_TYPE (lhs)) == COMPLEX_TYPE
> +          && (TREE_CODE (rhs) == VAR_DECL
> +              || TREE_CODE (rhs) == COMPLEX_CST
> +              || TREE_CODE (rhs) == COMPLEX_EXPR)
> +          && TREE_CODE (TREE_TYPE (rhs)) == COMPLEX_TYPE)
> +    expand_complex_move_1 (gsi, stmt, lhs, inner_type);
>  }
>
>  /* Expand complex addition to scalars:
>
Jeff Law Feb. 12, 2016, 5:21 p.m. UTC | #2
On 02/12/2016 02:04 AM, Richard Biener wrote:
>>
>> So what am I missing here?  Is there any kind of aliasing issues I need to
>> be aware of?  Any other dragons lurking?
>
> I think what you are missing is that you'll "optimize"
>
> _Complex double x, y;
>
> void foo (void)
> {
>    x = y;
> }
>
> then but dependent on the targets capability DCmode moves might be
> cheaper than four DFmode moves and nothing will combine them together
> again.
True.  In today's world where we can load/store large objects 
efficiently, that effect probably dwarfs anything we get from handling 
the trivial/artificial cases with more lowering.


>
> In complex lowering we're careful to only "optimize" when we know how to
> extract components in an efficient way (well, mostly).
Right.  If I understood the stuff in tree-complex correctly, it's mostly 
concerned with lowering when the complex object is actually a degenerate.

>
> That the DSE opportunities in complex-[45].c are exposed if you lower
> the aggregate inits is obvious but the lowering is only a workaround for
> our lack of handling for this case.  I think the specific cases can be
> easily made work by enhancing tree DSE in dse_possible_dead_store_p
> to pick up partial kills by adjusting *ref (its offset/size) - keeping the
> original ref for picking up uses.
>
> But really we simply need a better DSE algorithm.
So the way to fix DSE is to keep the existing algorithm and track the 
hunks of Complex/aggregates that have been set a second time.

My first thought was to implement this as an inverted bitmap, i.e., set 
it to 1 for every byte in the complex/aggregate that is set by the first 
store.  Clear bits for each byte in subsequent stores to the pieces. 
If the bitmap reaches an empty state, then the initial store is dead.

Adjusting *ref could work too, but would probably be painful if the 
subsequent stores don't happen in a convenient order.
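
A standalone sketch of that inverted-bitmap bookkeeping, in plain C with
made-up names (the real pass would use GCC's bitmap API and byte offsets
computed through the alias oracle; a single candidate store of at most 64
bytes is assumed here):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_BYTES 64

/* One live/dead flag per byte of the candidate store.  */
struct live_bytes
{
  bool live[MAX_BYTES];
  size_t size;
};

/* Start tracking a candidate store of SIZE bytes: every byte is live.  */
void
init_live_bytes (struct live_bytes *lb, size_t size)
{
  lb->size = size;
  memset (lb->live, true, size);
}

/* A later store covers [offset, offset + size); clear those bytes.  */
void
kill_bytes (struct live_bytes *lb, size_t offset, size_t size)
{
  for (size_t i = offset; i < offset + size && i < lb->size; i++)
    lb->live[i] = false;
}

/* The candidate store is dead once no byte of it remains live.  */
bool
store_is_dead (const struct live_bytes *lb)
{
  for (size_t i = 0; i < lb->size; i++)
    if (lb->live[i])
      return false;
  return true;
}

int
main (void)
{
  struct live_bytes lb;

  /* Candidate: a 16-byte store, e.g. a _Complex double initialization.  */
  init_live_bytes (&lb, 16);

  kill_bytes (&lb, 0, 8);       /* later store to the real part */
  printf ("dead after first kill: %d\n", store_is_dead (&lb));  /* 0 */

  kill_bytes (&lb, 8, 8);       /* later store to the imaginary part */
  printf ("dead after second kill: %d\n", store_is_dead (&lb)); /* 1 */
  return 0;
}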


jeff
Jeff Law Feb. 14, 2016, 4:35 p.m. UTC | #3
On 02/12/2016 10:21 AM, Jeff Law wrote:
>> But really we simply need a better DSE algorithm.
> So the way to fix DSE is to keep the existing algorithm and track the
> hunks of Complex/aggregates that have been set a second time.
>
> My first thought was to implement this as an inverted bitmap.  ie, set
> it to 1 for every byte in the complex/aggregate that is set by the first
> store.  Clear bits for each byte in subsequent stores to the pieces. If
> the bitmap reaches an empty state, then the initial store is dead.
>
> Adjusting *ref could work too, but would probably be painful if the
> subsequent stores don't happen in a convenient order.
So that was scary easy.  We should have done this a long time ago.

Essentially I call ao_get_ref_base to get the offset/size/max_size 
filled in for the first statement.  Those are used to initialize the 
live bytes bitfield, as long as max_size != -1.

Then when we have a possible killing statement, we use the ao in a 
similar manner to determine which bytes to clear (taking care that the 
base is the same between the two references and that, in the killing 
statement, the size/max_size are the same).

When all the live bytes are zero then we've killed the original statement.

It's ~20 lines of code.

I need to pull together some additional tests, but it looks likely we'll 
be able to wrap this up easily for gcc-6.

Jeff
Richard Biener Feb. 14, 2016, 6:38 p.m. UTC | #4
On February 14, 2016 5:35:13 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>On 02/12/2016 10:21 AM, Jeff Law wrote:
>>> But really we simply need a better DSE algorithm.
>> So the way to fix DSE is to keep the existing algorithm and track the
>> hunks of Complex/aggregates that have been set a second time.
>>
>> My first thought was to implement this as an inverted bitmap.  ie,
>set
>> it to 1 for every byte in the complex/aggregate that is set by the
>first
>> store.  Clear bits for each byte in subsequent stores to the pieces.
>If
>> the bitmap reaches an empty state, then the initial store is dead.
>>
>> Adjusting *ref could work too, but would probably be painful if the
>> subsequent stores don't happen in a convenient order.
>So that was scary easy.  We should have done this a long time ago.
>
>Essentially I call ao_get_ref_base to get the offset/size/max_size 
>filled in for the first statement.  Those are used to initialize the 
>live bytes bitfield, as long as max_size != -1.
>
>Then when we have a possible killing statement, we use the ao in a 
>similar manner to determine which bytes to clear (taking care that the 
>base is the same between the two references and that in the killing 
>statement that the size/max_size are the same.).
>
>When all the live bytes are zero then we've killed the original
>statement.
>
>It's ~20 lines of code.
>
>I need to pull together some additional tests, but it looks likely
>we'll 
>be able to wrap this up easily for gcc-6.

BTW, we had something like this before, but it had both correctness and, more importantly, scalability issues.

Richard.

>Jeff
Jeff Law Feb. 16, 2016, 3:54 p.m. UTC | #5
On 02/14/2016 11:38 AM, Richard Biener wrote:
> On February 14, 2016 5:35:13 PM GMT+01:00, Jeff Law <law@redhat.com>
> wrote:
>> On 02/12/2016 10:21 AM, Jeff Law wrote:
>>>> But really we simply need a better DSE algorithm.
>>> So the way to fix DSE is to keep the existing algorithm and track
>>> the hunks of Complex/aggregates that have been set a second
>>> time.
>>>
>>> My first thought was to implement this as an inverted bitmap.
>>> ie,
>> set
>>> it to 1 for every byte in the complex/aggregate that is set by
>>> the
>> first
>>> store.  Clear bits for each byte in subsequent stores to the
>>> pieces.
>> If
>>> the bitmap reaches an empty state, then the initial store is
>>> dead.
>>>
>>> Adjusting *ref could work too, but would probably be painful if
>>> the subsequent stores don't happen in a convenient order.
>> So that was scary easy.  We should have done this a long time ago.
>>
>> Essentially I call ao_get_ref_base to get the offset/size/max_size
>> filled in for the first statement.  Those are used to initialize
>> the live bytes bitfield, as long as max_size != -1.
>>
>> Then when we have a possible killing statement, we use the ao in a
>> similar manner to determine which bytes to clear (taking care that
>> the base is the same between the two references and that in the
>> killing statement that the size/max_size are the same.).
>>
>> When all the live bytes are zero then we've killed the original
>> statement.
>>
>> It's ~20 lines of code.
>>
>> I need to pull together some additional tests, but it looks likely
>> we'll be able to wrap this up easily for gcc-6.
>
> BTW, we had sth like this before but it had both correctness and more
> importantly scalability issues.
Really?  Do you have a pointer to any kind of discussion?  I certainly 
don't want to re-introduce something that was ultimately abandoned due 
to unfixable problems.

I don't offhand see any scalability issues -- unless 
ao_ref_init/ao_ref_base don't scale well.  This change doesn't cause any 
additional looping in dse_possible_dead_store_p (it can only cause us to 
exit the loop earlier) and the bitmap stuff is cheap because we're going 
to hit the cache as we clear consecutive ranges of bits.

jeff
Jeff Law Feb. 16, 2016, 9:20 p.m. UTC | #6
On 02/14/2016 11:38 AM, Richard Biener wrote:
> On February 14, 2016 5:35:13 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>> On 02/12/2016 10:21 AM, Jeff Law wrote:
>>>> But really we simply need a better DSE algorithm.
>>> So the way to fix DSE is to keep the existing algorithm and track the
>>> hunks of Complex/aggregates that have been set a second time.
>>>
>>> My first thought was to implement this as an inverted bitmap.  ie,
>> set
>>> it to 1 for every byte in the complex/aggregate that is set by the
>> first
>>> store.  Clear bits for each byte in subsequent stores to the pieces.
>> If
>>> the bitmap reaches an empty state, then the initial store is dead.
>>>
>>> Adjusting *ref could work too, but would probably be painful if the
>>> subsequent stores don't happen in a convenient order.
>> So that was scary easy.  We should have done this a long time ago.
>>
>> Essentially I call ao_get_ref_base to get the offset/size/max_size
>> filled in for the first statement.  Those are used to initialize the
>> live bytes bitfield, as long as max_size != -1.
>>
>> Then when we have a possible killing statement, we use the ao in a
>> similar manner to determine which bytes to clear (taking care that the
>> base is the same between the two references and that in the killing
>> statement that the size/max_size are the same.).
>>
>> When all the live bytes are zero then we've killed the original
>> statement.
>>
>> It's ~20 lines of code.
>>
>> I need to pull together some additional tests, but it looks likely
>> we'll
>> be able to wrap this up easily for gcc-6.
>
> BTW, we had sth like this before but it had both correctness and more importantly scalability issues.
If you're referring to 48141, that was a completely different issue and 
I don't think there's any significant parallels between how that 
affected RTL DSE and tree DSE.

The issue in 48141 is that RTL DSE keeps a list of active stores.  Those 
lists were getting long, often with things that just weren't interesting.

Tree DSE doesn't maintain a list of that nature (it used to, but you 
fixed all that a few years back).  These days it does a walk of the IL. 
  When it finds a memory store, it walks immediate uses of the virtual 
operands to see if an immediate use happens to write the same memory 
location.  If it does, then the first write is dead.  That's (of course) 
over-simplified, but captures the essence of how tree DSE works.

As long as the calls to ao_ref_init/ao_ref are cheap, my code shouldn't 
change the overall performance of tree DSE.

Jeff
Jeff Law Feb. 17, 2016, 7:30 a.m. UTC | #7
On 02/14/2016 11:38 AM, Richard Biener wrote:
> On February 14, 2016 5:35:13 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>> On 02/12/2016 10:21 AM, Jeff Law wrote:
>>>> But really we simply need a better DSE algorithm.
>>> So the way to fix DSE is to keep the existing algorithm and track the
>>> hunks of Complex/aggregates that have been set a second time.
>>>
>>> My first thought was to implement this as an inverted bitmap.  ie,
>> set
>>> it to 1 for every byte in the complex/aggregate that is set by the
>> first
>>> store.  Clear bits for each byte in subsequent stores to the pieces.
>> If
>>> the bitmap reaches an empty state, then the initial store is dead.
>>>
>>> Adjusting *ref could work too, but would probably be painful if the
>>> subsequent stores don't happen in a convenient order.
>> So that was scary easy.  We should have done this a long time ago.
>>
>> Essentially I call ao_get_ref_base to get the offset/size/max_size
>> filled in for the first statement.  Those are used to initialize the
>> live bytes bitfield, as long as max_size != -1.
>>
>> Then when we have a possible killing statement, we use the ao in a
>> similar manner to determine which bytes to clear (taking care that the
>> base is the same between the two references and that in the killing
>> statement that the size/max_size are the same.).
>>
>> When all the live bytes are zero then we've killed the original
>> statement.
>>
>> It's ~20 lines of code.
>>
>> I need to pull together some additional tests, but it looks likely
>> we'll
>> be able to wrap this up easily for gcc-6.
>
> BTW, we had sth like this before but it had both correctness and more importantly scalability issues.
Just a couple more tidbits.

I instrumented a bootstrap -- the improved DSE finds ~20k additional DSE 
opportunities during a GCC bootstrap that could not be found by the 
current DSE.  Yes, 20k additional statements deleted by tree DSE.  Yow!

Of those additional opportunities, none require bit-level tracking.  So 
we can just punt when the size/offset is not byte-sized/aligned.

Of those additional opportunities > 99% are for sizes of 64 bytes or 
smaller.  Thus we can pack those into 1 or 2 bitmap elements, depending 
on the starting offset.  So the bitmap side will be efficient with no 
real searching if we choose our PARAM value wisely.

The outliers are, well, strange.  There were cases where we found new 
DSE opportunities for objects of size 2k bytes or larger.  There weren't 
many of these, but I was surprised at the size.  Most likely it's a 
clobber or mem* thing that's participating in DSE.  But I haven't looked 
closely at those cases yet.

There's a ton of statements that are clobbering zero-sized objects.  My 
code can determine when those clobbers are redundant (with some later 
clobber), but I haven't looked closely to see if that's actually a good 
thing to do or not.

Anyway, I still don't see anything which makes me think this can't 
wrap up in the immediate future.

jeff
Richard Biener Feb. 17, 2016, 10:48 a.m. UTC | #8
On Wed, Feb 17, 2016 at 8:30 AM, Jeff Law <law@redhat.com> wrote:
> On 02/14/2016 11:38 AM, Richard Biener wrote:
>>
>> On February 14, 2016 5:35:13 PM GMT+01:00, Jeff Law <law@redhat.com>
>> wrote:
>>>
>>> On 02/12/2016 10:21 AM, Jeff Law wrote:
>>>>>
>>>>> But really we simply need a better DSE algorithm.
>>>>
>>>> So the way to fix DSE is to keep the existing algorithm and track the
>>>> hunks of Complex/aggregates that have been set a second time.
>>>>
>>>> My first thought was to implement this as an inverted bitmap.  ie,
>>>
>>> set
>>>>
>>>> it to 1 for every byte in the complex/aggregate that is set by the
>>>
>>> first
>>>>
>>>> store.  Clear bits for each byte in subsequent stores to the pieces.
>>>
>>> If
>>>>
>>>> the bitmap reaches an empty state, then the initial store is dead.
>>>>
>>>> Adjusting *ref could work too, but would probably be painful if the
>>>> subsequent stores don't happen in a convenient order.
>>>
>>> So that was scary easy.  We should have done this a long time ago.
>>>
>>> Essentially I call ao_get_ref_base to get the offset/size/max_size
>>> filled in for the first statement.  Those are used to initialize the
>>> live bytes bitfield, as long as max_size != -1.
>>>
>>> Then when we have a possible killing statement, we use the ao in a
>>> similar manner to determine which bytes to clear (taking care that the
>>> base is the same between the two references and that in the killing
>>> statement that the size/max_size are the same.).
>>>
>>> When all the live bytes are zero then we've killed the original
>>> statement.
>>>
>>> It's ~20 lines of code.
>>>
>>> I need to pull together some additional tests, but it looks likely
>>> we'll
>>> be able to wrap this up easily for gcc-6.
>>
>>
>> BTW, we had sth like this before but it had both correctness and more
>> importantly scalability issues.
>
> Just a couple more tibits.
>
> I instrumented a bootstrap -- the improved DSE finds ~20k additional DSE
> opportunities during a GCC bootstrap that could not be found by the current
> DSE.  Yes, 20k additional statements deleted by tree DSE.  Yow!

Well, DCE can also do quite a bit of DSE, and it runs after DSE - did those 20k
additional DSE opportunities affect the overall end result?

> Of those additional opportunities, none require bit level tracking.  So we
> can just punt when the size/offset is not byte sized/aligned.

Yep.  I expect us to eventually lower all that bit-precision stuff.

> Of those additional opportunities > 99% are for sizes of 64 bytes or
> smaller.  Thus we can pack those into 1 or 2 bitmap elements, depending on
> the starting offset.  So the bitmap side will be efficient with no real
> searching if we choose our PARAM value wisely.

Then please use a uint64_t or even a uint32_t mask, which means
a fixed-size SBITMAP (32 bits) if you'd like to use the bitmap interface.
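
For what it's worth, the fixed-width variant is tiny.  A sketch under the
assumption (from the numbers above) that a tracked store is at most 64
bytes, with invented names rather than the actual sbitmap interface:

#include <stdbool.h>
#include <stdint.h>

/* One uint64_t holds a live/dead flag per byte of the candidate store.
   Assumes size <= 64 and offset + size <= 64 for every range below.  */
typedef uint64_t live_mask;

live_mask
live_mask_init (unsigned size)
{
  return size == 64 ? ~UINT64_C (0) : (UINT64_C (1) << size) - 1;
}

/* Clear the bytes overwritten by a later store to [offset, offset + size).  */
live_mask
live_mask_kill (live_mask m, unsigned offset, unsigned size)
{
  live_mask killed
    = (size == 64 ? ~UINT64_C (0) : (UINT64_C (1) << size) - 1) << offset;
  return m & ~killed;
}

bool
live_mask_dead_p (live_mask m)
{
  return m == 0;    /* nothing of the candidate store remains live */
}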

> The outliers are, well, strange.  There were cases where we found new DSE
> opportunities for objects of size 2k bytes or larger.  There weren't many of
> these, but I was surprised at the size.  Most likely it's a clobber or mem*
> thing that's participating in DSE.  But I haven't looked closely at those
> cases yet.

I suspect it's memset followed by actually initializing all elements.  We have
quite a few of those, I think.

> There's a ton of statements that are clobbering zero-sized objects.  My code
> can determine when those clobbers are redundant (with some later clobber),
> but I haven't looked closely to see if that's actually a good thing to do or
> not.
>
> Anyway, I still don't see anything which makes me think this can't wrap-up
> in the immediate future.

Can you share your work-in-progress patch?

Thanks,
Richard.

> jeff

Patch

diff --git a/gcc/tree-complex.c b/gcc/tree-complex.c
index d781a8a..64346a5 100644
--- a/gcc/tree-complex.c
+++ b/gcc/tree-complex.c
@@ -240,6 +240,14 @@  init_dont_simulate_again (void)
 		op0 = gimple_assign_rhs1 (stmt);
 	      if (gimple_num_ops (stmt) > 2)
 		op1 = gimple_assign_rhs2 (stmt);
+
+	      /* Simple assignments involving complex types where
+		 the RHS or LHS is addressable should be lowered, but
+		 do not inherently trigger resimulation.  */
+	      if (TREE_CODE (TREE_TYPE (gimple_assign_lhs (stmt)))
+		  == COMPLEX_TYPE)
+		saw_a_complex_op = true;
+
 	      break;
 
 	    case GIMPLE_COND:
@@ -777,6 +785,67 @@  update_phi_components (basic_block bb)
     }
 }
 
+/* Extract the real and imag parts from RHS of the statement at GSI
+   into R and I respectively.
+
+   This wouldn't be necessary except that extracting these
+   components from a COMPLEX_EXPR is different from everything
+   else.  */
+
+static void
+extract_real_and_imag_component (gimple_stmt_iterator *gsi, tree *r, tree *i)
+{
+  gimple *stmt = gsi_stmt (*gsi);
+  if (gimple_assign_rhs_code (stmt) == COMPLEX_EXPR)
+    {
+      *r = gimple_assign_rhs1 (stmt);
+      *i = gimple_assign_rhs2 (stmt);
+    }
+  else
+    {
+      tree tmp = gimple_assign_rhs1 (stmt);
+      *r = extract_component (gsi, tmp, 0, true);
+      *i = extract_component (gsi, tmp, 1, true);
+    }
+}
+
+/* Helper for expand_move_complex.  */
+
+static void
+expand_complex_move_1 (gimple_stmt_iterator *gsi, gimple *stmt,
+		       tree lhs, tree inner_type)
+{
+  tree x, r, i;
+  gimple *t;
+  location_t loc = gimple_location (stmt);
+
+  extract_real_and_imag_component (gsi, &r, &i);
+  x = build1 (REALPART_EXPR, inner_type, unshare_expr (lhs));
+  t = gimple_build_assign (x, r);
+  gimple_set_location (t, loc);
+  gsi_insert_before (gsi, t, GSI_SAME_STMT);
+
+  if (stmt == gsi_stmt (*gsi))
+    {
+      x = build1 (IMAGPART_EXPR, inner_type, unshare_expr (lhs));
+      gimple_assign_set_lhs (stmt, x);
+      gimple_assign_set_rhs1 (stmt, i);
+    }
+  else
+    {
+      x = build1 (IMAGPART_EXPR, inner_type, unshare_expr (lhs));
+      t = gimple_build_assign (x, i);
+      gimple_set_location (t, loc);
+      gsi_insert_before (gsi, t, GSI_SAME_STMT);
+
+      stmt = gsi_stmt (*gsi);
+      gcc_assert (gimple_code (stmt) == GIMPLE_RETURN);
+      gimple_return_set_retval (as_a <greturn *> (stmt), lhs);
+    }
+
+  update_stmt (stmt);
+}
+
 /* Expand a complex move to scalars.  */
 
 static void
@@ -829,54 +898,20 @@  expand_complex_move (gimple_stmt_iterator *gsi, tree type)
 	}
       else
 	{
-	  if (gimple_assign_rhs_code (stmt) != COMPLEX_EXPR)
-	    {
-	      r = extract_component (gsi, rhs, 0, true);
-	      i = extract_component (gsi, rhs, 1, true);
-	    }
-	  else
-	    {
-	      r = gimple_assign_rhs1 (stmt);
-	      i = gimple_assign_rhs2 (stmt);
-	    }
+	  extract_real_and_imag_component (gsi, &r, &i);
 	  update_complex_assignment (gsi, r, i);
 	}
     }
   else if (rhs && TREE_CODE (rhs) == SSA_NAME && !TREE_SIDE_EFFECTS (lhs))
-    {
-      tree x;
-      gimple *t;
-      location_t loc;
-
-      loc = gimple_location (stmt);
-      r = extract_component (gsi, rhs, 0, false);
-      i = extract_component (gsi, rhs, 1, false);
-
-      x = build1 (REALPART_EXPR, inner_type, unshare_expr (lhs));
-      t = gimple_build_assign (x, r);
-      gimple_set_location (t, loc);
-      gsi_insert_before (gsi, t, GSI_SAME_STMT);
-
-      if (stmt == gsi_stmt (*gsi))
-	{
-	  x = build1 (IMAGPART_EXPR, inner_type, unshare_expr (lhs));
-	  gimple_assign_set_lhs (stmt, x);
-	  gimple_assign_set_rhs1 (stmt, i);
-	}
-      else
-	{
-	  x = build1 (IMAGPART_EXPR, inner_type, unshare_expr (lhs));
-	  t = gimple_build_assign (x, i);
-	  gimple_set_location (t, loc);
-	  gsi_insert_before (gsi, t, GSI_SAME_STMT);
-
-	  stmt = gsi_stmt (*gsi);
-	  gcc_assert (gimple_code (stmt) == GIMPLE_RETURN);
-	  gimple_return_set_retval (as_a <greturn *> (stmt), lhs);
-	}
-
-      update_stmt (stmt);
-    }
+    expand_complex_move_1 (gsi, stmt, lhs, inner_type);
+  /* Initialization of a complex object that is addressable.  */
+  else if (TREE_CODE (lhs) == VAR_DECL
+	   && TREE_CODE (TREE_TYPE (lhs)) == COMPLEX_TYPE
+	   && (TREE_CODE (rhs) == VAR_DECL
+	       || TREE_CODE (rhs) == COMPLEX_CST
+	       || TREE_CODE (rhs) == COMPLEX_EXPR)
+	   && TREE_CODE (TREE_TYPE (rhs)) == COMPLEX_TYPE)
+    expand_complex_move_1 (gsi, stmt, lhs, inner_type);
 }
 
 /* Expand complex addition to scalars: