[RFA,PR,tree-optimization/33562,1/4] Byte tracking in DSE

Message ID 79d271d8-739e-23c6-ed24-e591cb7fb941@redhat.com
State New

Commit Message

Jeff Law Dec. 16, 2016, 1:54 a.m. UTC
This is the first of the 4 part patchkit to address deficiencies in our 
DSE implementation.


This patch addresses the P2 regression 33562 which has been a low 
priority regression since gcc-4.3.  To summarize, DSE no longer has the 
ability to detect an aggregate store as dead if subsequent stores are 
done in a piecemeal fashion.

I originally tackled this by changing how we lower complex objects. 
That was sufficient to address 33562, but was reasonably rejected.

This version attacks the problem by improving DSE to track stores to 
memory at a byte level.  That allows us to determine if a series of 
stores completely covers an earlier store (thus making the earlier store 
dead).

A useful side effect of this is we can detect when parts of a store are 
dead and potentially rewrite the store.  This patch implements that for 
complex object initializations.  While not strictly part of 33562, it's 
so closely related that I felt it belongs as part of this patch.
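
As a concrete illustration (roughly the shape of the new tests, not
additional code in the patch), the complex case looks like:

  _Complex int t = 0;   /* whole-object store, only partially live */
  __imag__ t = 2;       /* overwrites the imaginary half */
  return g (&t);

Byte tracking shows that only the real half of the initialization
survives, so the first store can be rewritten to set just __real__ t;
when both halves are later overwritten, the initialization is removed
outright.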

This originally limited the size of the tracked memory space to 64 
bytes.  I bumped the limit after working through the CONSTRUCTOR and 
mem* trimming patches.  The 256 byte limit is still fairly arbitrary and 
I wouldn't lose sleep if we throttled back to 64 or 128 bytes.

Later patches in the kit will build upon this patch.  So if pieces look
like skeleton code, that's because they are.


Bootstrapped and regression tested on x86_64-linux-gnu.  OK for the trunk?
PR tree-optimization/33562
	* params.def (PARM_DSE_MAX_OBJECT_SIZE): New PARAM.
	* tree-ssa-dse.c: Include params.h.
	(initialize_ao_ref_for_dse): New, partially extracted from
	dse_optimize_stmt.
	(valid_ao_ref_for_dse): New.
	(clear_bytes_written_by, trim_complex_store): Likewise.
	(trim_partially_dead_store): Likewise.
	(dse_partially_dead_store_p): Track what bytes were originally stored
	into memory by the statement as well as the subset of bytes that
	are still live.   If we "fail", but have identified the store as
	partially dead, try to rewrite it to store fewer bytes of data.
	Exit the main loop if we find a full kill as a single statement
	or via group of statements.
	(dse_optimize_stmt): Use initialize_ao_ref_for_dse.


	* gcc.dg/tree-ssa/complex-4.c: No longer xfailed.
	* gcc.dg/tree-ssa/complex-5.c: Likewise.
	* gcc.dg/tree-ssa/ssa-dse-9.c: Likewise.
	* gcc.dg/tree-ssa/ssa-dse-18.c: New test.
	* gcc.dg/tree-ssa/ssa-dse-19.c: Likewise.
	* gcc.dg/tree-ssa/ssa-dse-20.c: Likewise.
	* gcc.dg/tree-ssa/ssa-dse-21.c: Likewise.

Comments

Trevor Saunders Dec. 16, 2016, 7:29 a.m. UTC | #1
On Thu, Dec 15, 2016 at 06:54:43PM -0700, Jeff Law wrote:
>    unsigned cnt = 0;
> +  bitmap live_bytes = NULL;
> +  bitmap orig_live_bytes = NULL;
>  
>    *use_stmt = NULL;
>  
> +  /* REF is a memory write.  Go ahead and get its base, size, extent
> +     information and encode the bytes written into LIVE_BYTES.  We can
> +     handle any case where we have a known base and maximum size.
> +
> +     However, experimentation has shown that bit level tracking is not
> +     useful in practice, so we only track at the byte level.
> +
> +     Furthermore, experimentation has shown that 99% of the cases
> +     that require byte tracking are 64 bytes or less.  */
> +  if (valid_ao_ref_for_dse (ref)
> +      && (ref->max_size / BITS_PER_UNIT
> +	  <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE)))
> +    {
> +      live_bytes = BITMAP_ALLOC (NULL);
> +      orig_live_bytes = BITMAP_ALLOC (NULL);
> +      bitmap_set_range (live_bytes,
> +			ref->offset / BITS_PER_UNIT,
> +			ref->max_size / BITS_PER_UNIT);
> +      bitmap_copy (orig_live_bytes, live_bytes);

would it maybe make sense to use sbitmaps since the length is known to
be short, and doesn't change after allocation?

> @@ -164,7 +341,15 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt)
>  	}
>  
>        if (fail)
> -	return false;
> +	{
> +	  /* STMT might be partially dead and we may be able to reduce
> +	     how many memory locations it stores into.  */
> +	  if (live_bytes
> +	      && !bitmap_equal_p (live_bytes, orig_live_bytes)
> +	      && !gimple_clobber_p (stmt))
> +	    trim_partially_dead_store (orig_live_bytes, live_bytes, stmt);
> +	  return false;

shouldn't you free the bitmaps here?

Trev
Richard Biener Dec. 16, 2016, 1:57 p.m. UTC | #2
On Fri, Dec 16, 2016 at 2:54 AM, Jeff Law <law@redhat.com> wrote:
>
> This is the first of the 4 part patchkit to address deficiencies in our DSE
> implementation.
>
>
> This patch addresses the P2 regression 33562 which has been a low priority
> regression since gcc-4.3.  To summarize, DSE no longer has the ability to
> detect an aggregate store as dead if subsequent stores are done in a
> piecemeal fashion.
>
> I originally tackled this by changing how we lower complex objects. That was
> sufficient to address 33562, but was reasonably rejected.
>
> This version attacks the problem by improving DSE to track stores to memory
> at a byte level.  That allows us to determine if a series of stores
> completely covers an earlier store (thus making the earlier store dead).
>
> A useful side effect of this is we can detect when parts of a store are dead
> and potentially rewrite the store.  This patch implements that for complex
> object initializations.  While not strictly part of 33562, it's so closely
> related that I felt it belongs as part of this patch.
>
> This originally limited the size of the tracked memory space to 64 bytes.  I
> bumped the limit after working through the CONSTRUCTOR and mem* trimming
> patches.  The 256 byte limit is still fairly arbitrary and I wouldn't lose
> sleep if we throttled back to 64 or 128 bytes.
>
> Later patches in the kit will build upon this patch.  So if pieces look like
> skeleton code, that's because it is.
>
>
> Bootstrapped and regression tested on x86_64-linux-gnu.  OK for the trunk?

Apart from what Trevor says about using sbitmaps (try to avoid the initial
zeroing please) and the missed freeing (you can use auto_[s]bitmap?)
some comments below.

>         PR tree-optimization/33562
>         * params.def (PARM_DSE_MAX_OBJECT_SIZE): New PARAM.
>         * tree-ssa-dse.c: Include params.h.
>         (initialize_ao_ref_for_dse): New, partially extracted from
>         dse_optimize_stmt.
>         (valid_io_ref_for_dse): New.
>         (clear_bytes_written_by, trim_complex_store): Likewise.
>         (trim_partially_dead_store): Likewise.
>         (dse_partially_dead_store_p): Track what bytes were originally
> stored
>         into memory by the statement as well as the subset of bytes that
>         are still live.   If we "fail", but have identified the store as
>         partially dead, try to rewrite it to store fewer bytes of data.
>         Exit the main loop if we find a full kill as a single statement
>         or via group of statements.
>         (dse_optimize_stmt): Use initialize_ao_ref_for_dse.
>
>
>         * gcc.dg/tree-ssa/complex-4.c: No longer xfailed.
>         * gcc.dg/tree-ssa/complex-5.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-9.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-18.c: New test.
>         * gcc.dg/tree-ssa/ssa-dse-19.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-20.c: Likewise.
>         * gcc.dg/tree-ssa/ssa-dse-21.c: Likewise.
>
> diff --git a/gcc/params.def b/gcc/params.def
> index 50f75a7..ddc3d65 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -532,6 +532,11 @@ DEFPARAM(PARAM_AVG_LOOP_NITER,
>          "Average number of iterations of a loop.",
>          10, 1, 0)
>
> +DEFPARAM(PARAM_DSE_MAX_OBJECT_SIZE,
> +        "dse-max-object-size",
> +        "Maximum size (in bytes) of objects tracked by dead store
> elimination.",
> +        256, 0, 0)
> +
>  DEFPARAM(PARAM_SCEV_MAX_EXPR_SIZE,
>          "scev-max-expr-size",
>          "Bound on size of expressions used in the scalar evolutions
> analyzer.",
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
> b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
> index 87a2638..3155741 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
> @@ -10,4 +10,4 @@ int f(void)
>    return g(&t);
>  }
>
> -/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail
> *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
> b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
> index e2cd403..e6d027f 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
> @@ -8,4 +8,4 @@ int f(void)
>   __imag__ t = 2;
>  }
>
> -/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail
> *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c
> new file mode 100644
> index 0000000..92b2df8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +int g(_Complex int*);
> +int f(void)
> +{
> +  _Complex int t = 0;
> +  int i, j;
> + __imag__ t += 2;
> +  return g(&t);
> +}
> +
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c
> new file mode 100644
> index 0000000..718b746
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +int g(_Complex int*);
> +int f(void)
> +{
> +  _Complex int t = 0;
> +  int i, j;
> + __real__ t += 2;
> +  return g(&t);
> +}
> +
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c
> new file mode 100644
> index 0000000..4e14d9b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -fno-tree-dce -fdump-tree-optimized" } */
> +_Complex int t = 0;
> +int f(void)
> +{
> +  t = 0;
> + __imag__ t = 2;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c
> new file mode 100644
> index 0000000..d1e0b85
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -fno-tree-dce -fdump-tree-optimized" } */
> +_Complex int t = 0;
> +int f(void)
> +{
> +  t = 0;
> + __real__ t = 2;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
> index 594c20c..ae48ddd 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
> @@ -11,4 +11,4 @@ foo ()
>  }
>
>  /* We should eliminate the first assignment.  */
> -/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" { xfail *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" } } */
> diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
> index 778b363..eea185c 100644
> --- a/gcc/tree-ssa-dse.c
> +++ b/gcc/tree-ssa-dse.c
> @@ -33,6 +33,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-dfa.h"
>  #include "domwalk.h"
>  #include "tree-cfgcleanup.h"
> +#include "params.h"
>
>  /* This file implements dead store elimination.
>
> @@ -68,6 +69,158 @@ along with GCC; see the file COPYING3.  If not see
>     remove their dead edges eventually.  */
>  static bitmap need_eh_cleanup;
>
> +/* STMT is a statement that may write into memory.  Analyze it and
> +   initialize WRITE to describe how STMT affects memory.
> +
> +   Return TRUE if the the statement was analyzed, FALSE otherwise.
> +
> +   It is always safe to return FALSE.  But typically better optimziation
> +   can be achieved by analyzing more statements.  */
> +
> +static bool
> +initialize_ao_ref_for_dse (gimple *stmt, ao_ref *write)
> +{
> +  /* It's advantageous to handle certain mem* functions.  */
> +  if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
> +    {
> +      switch (DECL_FUNCTION_CODE (gimple_call_fndecl (stmt)))
> +       {
> +         case BUILT_IN_MEMCPY:
> +         case BUILT_IN_MEMMOVE:
> +         case BUILT_IN_MEMSET:
> +           {
> +             tree size = NULL_TREE;
> +             if (gimple_call_num_args (stmt) == 3)
> +               size = gimple_call_arg (stmt, 2);
> +             tree ptr = gimple_call_arg (stmt, 0);
> +             ao_ref_init_from_ptr_and_size (write, ptr, size);
> +             return true;
> +           }
> +         default:
> +           break;
> +       }
> +    }
> +  else if (is_gimple_assign (stmt))
> +    {
> +      ao_ref_init (write, gimple_assign_lhs (stmt));
> +      return true;
> +    }
> +  return false;
> +}
> +
> +/* Given REF from the the alias oracle, return TRUE if it is a valid
> +   memory reference for dead store elimination, false otherwise.
> +
> +   In particular, the reference must have a known base, known maximum
> +   size, start at a byte offset and have a size that is one or more
> +   bytes.  */
> +
> +static bool
> +valid_ao_ref_for_dse (ao_ref *ref)
> +{
> +  return (ao_ref_base (ref)
> +         && ref->max_size != -1
> +         && (ref->offset % BITS_PER_UNIT) == 0
> +         && (ref->size % BITS_PER_UNIT) == 0);
> +}
> +
> +/* Clear any bytes written by STMT from the bitmap LIVE_BYTES.  The base
> +   address written by STMT must match the one found in REF, which must
> +   have its base address previously initialized.
> +
> +   This routine must be conservative.  If we don't know the offset or
> +   actual size written, assume nothing was written.  */
> +
> +static void
> +clear_bytes_written_by (bitmap live_bytes, gimple *stmt, ao_ref *ref)
> +{
> +  ao_ref write;
> +  if (!initialize_ao_ref_for_dse (stmt, &write))
> +    return;
> +
> +  /* Verify we have the same base memory address and that the write
> +     has a known size.  If so, then clear the appropriate bytes in
> +     the LIVE_BYTES bitmap.  */
> +  if (valid_ao_ref_for_dse (&write)
> +      && write.base == ref->base
> +      && write.size == write.max_size)
> +    bitmap_clear_range (live_bytes,
> +                       write.offset / BITS_PER_UNIT,
> +                       write.size / BITS_PER_UNIT);
> +}
> +
> +/* STMT initializes an object from COMPLEX_CST where one or more of the
> +   bytes written are dead stores.  ORIG is the bitmap of bytes stored by
> +   STMT.  LIVE is the bitmap of stores that are actually live.
> +
> +   Attempt to rewrite STMT so that only the real or imaginary part of
> +   the object is actually stored.  */
> +
> +static void
> +trim_complex_store (bitmap orig, bitmap live, gimple *stmt)
> +{
> +  bitmap dead = BITMAP_ALLOC (NULL);
> +  bitmap_and_compl (dead, orig, live);
> +
> +  /* So we have to verify that either the real or imag part as a whole
> +     is dead.  They are always the same size.  Thus for one to be dead
> +     the number of live bytes would have to be the same as the number of
> +     dead bytes.  */
> +  if (bitmap_count_bits (live) == bitmap_count_bits (dead))

popcount is expensive, so is this really a short-cut?

> +    {
> +      /* Hopefully we have live bits that look like 0xff00 or 0xff
> (assuming
> +        8 bytes for the underlying real/imag part).  If so we can optimize
> +        this case.  */
> +      if (bitmap_first_set_bit (live) == 0
> +         && bitmap_first_set_bit (dead) == bitmap_count_bits (live))

Hmm, but does that properly handle byte-wise reads from it?  Like
reading the realpart plus the last byte from the imagpart?

> +       {
> +         /* TREE_REALPART is live */
> +         tree x = TREE_REALPART (gimple_assign_rhs1 (stmt));
> +         tree y = gimple_assign_lhs (stmt);
> +         y = build1 (REALPART_EXPR, TREE_TYPE (x), y);
> +         gimple_assign_set_lhs (stmt, y);
> +         gimple_assign_set_rhs1 (stmt, x);
> +       }
> +      else if (bitmap_first_set_bit (dead) == 0
> +         && bitmap_first_set_bit (live) == bitmap_count_bits (dead))
> +       {

Likewise.

> +         /* TREE_IMAGPART is live */
> +         tree x = TREE_IMAGPART (gimple_assign_rhs1 (stmt));
> +         tree y = gimple_assign_lhs (stmt);
> +         y = build1 (IMAGPART_EXPR, TREE_TYPE (x), y);
> +         gimple_assign_set_lhs (stmt, y);
> +         gimple_assign_set_rhs1 (stmt, x);
> +       }
> +      /* Other cases indicate parts of both the real and imag subobjects
> +        are live.  We do not try to optimize those cases.  */
> +    }
> +  BITMAP_FREE (dead);
> +}
> +
> +/* STMT is a memory write where one or more bytes written are dead
> +   stores.  ORIG is the bitmap of bytes stored by STMT.  LIVE is the
> +   bitmap of stores that are actually live.
> +
> +   Attempt to rewrite STMT so that it writes fewer memory locations.  Right
> +   now we only support trimming at the start or end of the memory region.
> +   It's not clear how much there is to be gained by trimming from the
> middle
> +   of the region.  */
> +
> +static void
> +trim_partially_dead_store (bitmap orig, bitmap live, gimple *stmt)
> +{
> +  if (is_gimple_assign (stmt))
> +    {
> +      switch (gimple_assign_rhs_code (stmt))
> +       {
> +       case COMPLEX_CST:
> +         trim_complex_store (orig, live, stmt);

so patch #1 only handles stores from COMPLEX_CST?  Ok...

> +         break;
> +       default:
> +         break;
> +       }
> +    }
> +}
>
>  /* A helper of dse_optimize_stmt.
>     Given a GIMPLE_ASSIGN in STMT that writes to REF, find a candidate
> @@ -79,9 +232,32 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt,
> gimple **use_stmt)
>  {
>    gimple *temp;
>    unsigned cnt = 0;
> +  bitmap live_bytes = NULL;
> +  bitmap orig_live_bytes = NULL;
>
>    *use_stmt = NULL;
>
> +  /* REF is a memory write.  Go ahead and get its base, size, extent
> +     information and encode the bytes written into LIVE_BYTES.  We can
> +     handle any case where we have a known base and maximum size.
> +
> +     However, experimentation has shown that bit level tracking is not
> +     useful in practice, so we only track at the byte level.
> +
> +     Furthermore, experimentation has shown that 99% of the cases
> +     that require byte tracking are 64 bytes or less.  */
> +  if (valid_ao_ref_for_dse (ref)
> +      && (ref->max_size / BITS_PER_UNIT
> +         <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE)))
> +    {
> +      live_bytes = BITMAP_ALLOC (NULL);
> +      orig_live_bytes = BITMAP_ALLOC (NULL);
> +      bitmap_set_range (live_bytes,
> +                       ref->offset / BITS_PER_UNIT,
> +                       ref->max_size / BITS_PER_UNIT);
> +      bitmap_copy (orig_live_bytes, live_bytes);

So I'd use a once-per-pass allocated sbitmap here.  I don't see why you need
the orig_live_bytes bitmap though (just keep that implicitely by the known
range?)

> +    }
> +
>    /* Find the first dominated statement that clobbers (part of) the
>       memory stmt stores to with no intermediate statement that may use
>       part of the memory stmt stores.  That is, find a store that may
> @@ -164,7 +341,15 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt,
> gimple **use_stmt)
>         }
>
>        if (fail)
> -       return false;
> +       {
> +         /* STMT might be partially dead and we may be able to reduce
> +            how many memory locations it stores into.  */
> +         if (live_bytes
> +             && !bitmap_equal_p (live_bytes, orig_live_bytes)
> +             && !gimple_clobber_p (stmt))
> +           trim_partially_dead_store (orig_live_bytes, live_bytes, stmt);

The actual transform in dse_possible_dead_store_p looks a bit misplaced.
I see it's somehow convenient but then maybe returning a enum from this
function might be cleaner.  Well, I'm not too torn about this, so maybe
just rename the function a bit (no good suggestion either).

The rest of the patch (the infrastructure) looks reasonable.

Richard.

> +         return false;
> +       }
>
>        /* If we didn't find any definition this means the store is dead
>           if it isn't a store to global reachable memory.  In this case
> @@ -177,12 +362,18 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt,
> gimple **use_stmt)
>           temp = stmt;
>           break;
>         }
> +
> +      if (live_bytes && temp)
> +       clear_bytes_written_by (live_bytes, temp, ref);
>      }
> -  /* Continue walking until we reach a kill.  */
> -  while (!stmt_kills_ref_p (temp, ref));
> +  /* Continue walking until we reach a full kill as a single statement
> +     or there are no more live bytes.  */
> +  while (!stmt_kills_ref_p (temp, ref)
> +        && !(live_bytes && bitmap_empty_p (live_bytes)));
>
>    *use_stmt = temp;
> -
> +  BITMAP_FREE (live_bytes);
> +  BITMAP_FREE (orig_live_bytes);
>    return true;
>  }
>
> @@ -214,6 +405,10 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>           || TREE_CODE (gimple_assign_lhs (stmt)) != MEM_REF))
>      return;
>
> +  ao_ref ref;
> +  if (!initialize_ao_ref_for_dse (stmt, &ref))
> +    return;
> +
>    /* We know we have virtual definitions.  We can handle assignments and
>       some builtin calls.  */
>    if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
> @@ -225,12 +420,6 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>           case BUILT_IN_MEMSET:
>             {
>               gimple *use_stmt;
> -             ao_ref ref;
> -             tree size = NULL_TREE;
> -             if (gimple_call_num_args (stmt) == 3)
> -               size = gimple_call_arg (stmt, 2);
> -             tree ptr = gimple_call_arg (stmt, 0);
> -             ao_ref_init_from_ptr_and_size (&ref, ptr, size);
>               if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
>                 return;
>
> @@ -244,6 +433,7 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>               tree lhs = gimple_call_lhs (stmt);
>               if (lhs)
>                 {
> +                 tree ptr = gimple_call_arg (stmt, 0);
>                   gimple *new_stmt = gimple_build_assign (lhs, ptr);
>                   unlink_stmt_vdef (stmt);
>                   if (gsi_replace (gsi, new_stmt, true))
> @@ -274,13 +464,8 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi)
>        if (operand_equal_p (gimple_assign_rhs1 (stmt),
>                            gimple_assign_lhs (stmt), 0))
>         use_stmt = stmt;
> -      else
> -       {
> -         ao_ref ref;
> -         ao_ref_init (&ref, gimple_assign_lhs (stmt));
> -         if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
> -           return;
> -       }
> +      else if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
> +       return;
>
>        /* Now we know that use_stmt kills the LHS of stmt.  */
>
>
Jeff Law Dec. 16, 2016, 3:32 p.m. UTC | #3
On 12/16/2016 12:29 AM, Trevor Saunders wrote:
> On Thu, Dec 15, 2016 at 06:54:43PM -0700, Jeff Law wrote:
>>    unsigned cnt = 0;
>> +  bitmap live_bytes = NULL;
>> +  bitmap orig_live_bytes = NULL;
>>
>>    *use_stmt = NULL;
>>
>> +  /* REF is a memory write.  Go ahead and get its base, size, extent
>> +     information and encode the bytes written into LIVE_BYTES.  We can
>> +     handle any case where we have a known base and maximum size.
>> +
>> +     However, experimentation has shown that bit level tracking is not
>> +     useful in practice, so we only track at the byte level.
>> +
>> +     Furthermore, experimentation has shown that 99% of the cases
>> +     that require byte tracking are 64 bytes or less.  */
>> +  if (valid_ao_ref_for_dse (ref)
>> +      && (ref->max_size / BITS_PER_UNIT
>> +	  <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE)))
>> +    {
>> +      live_bytes = BITMAP_ALLOC (NULL);
>> +      orig_live_bytes = BITMAP_ALLOC (NULL);
>> +      bitmap_set_range (live_bytes,
>> +			ref->offset / BITS_PER_UNIT,
>> +			ref->max_size / BITS_PER_UNIT);
>> +      bitmap_copy (orig_live_bytes, live_bytes);
>
> would it maybe make sense to use sbitmaps since the length is known to
> be short, and doesn't change after allocation?
It would.  I tend to default to bitmaps these days (odd since I was 
probably the largest user of sbitmaps for a long time).
>
>> @@ -164,7 +341,15 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt)
>>  	}
>>
>>        if (fail)
>> -	return false;
>> +	{
>> +	  /* STMT might be partially dead and we may be able to reduce
>> +	     how many memory locations it stores into.  */
>> +	  if (live_bytes
>> +	      && !bitmap_equal_p (live_bytes, orig_live_bytes)
>> +	      && !gimple_clobber_p (stmt))
>> +	    trim_partially_dead_store (orig_live_bytes, live_bytes, stmt);
>> +	  return false;
>
> shouldn't you free the bitmaps here?
Yes.  I think switching to an auto-sbitmap would resolve this as well.


jeff
Jeff Law Dec. 16, 2016, 5:07 p.m. UTC | #4
On 12/16/2016 06:57 AM, Richard Biener wrote:
>
> Apart from what Trevor says about using sbitmaps (try to avoid the initial
> zeroing please) and the missed freeing (you can use auto_[s]bitmap?)
> some comments below.
In progress.  We'll need a few routines for sbitmaps that don't 
currently exist, but nothing that should be hard to implement correctly.

>> +
>> +static void
>> +trim_complex_store (bitmap orig, bitmap live, gimple *stmt)
>> +{
>> +  bitmap dead = BITMAP_ALLOC (NULL);
>> +  bitmap_and_compl (dead, orig, live);
>> +
>> +  /* So we have to verify that either the real or imag part as a whole
>> +     is dead.  They are always the same size.  Thus for one to be dead
>> +     the number of live bytes would have to be the same as the number of
>> +     dead bytes.  */
>> +  if (bitmap_count_bits (live) == bitmap_count_bits (dead))
>
> popcount is expensive, so is this really a short-cut?
It's less about being a short cut and more about realizing both halves 
are the same size.  Thus to be trimmable the number of live bytes must 
exactly equal the number of dead bytes.  This works regardless of the 
size of the underlying components of the complex type.


>
>> +    {
>> +      /* Hopefully we have live bits that look like 0xff00 or 0xff
>> (assuming
>> +        8 bytes for the underlying real/imag part).  If so we can optimize
>> +        this case.  */
>> +      if (bitmap_first_set_bit (live) == 0
>> +         && bitmap_first_set_bit (dead) == bitmap_count_bits (live))
>
> Hmm, but does that properly handle byte-wise reads from it?  Like
> reading the realpart plus the last byte from the imagpart?
It should.  We've already verified that exactly half of the object is 
live and half is dead.  So we just need to make sure that the live part 
is contiguous at the start or end of the object.

Again, written this way we don't have to care about the size of the 
components of the COMPLEX_EXPR.

You're right in that if we keep the test in this form we should avoid 
the redundant popcount calls.
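
A worked example of the two checks, assuming (purely for illustration)
an 8-byte component and a store starting at byte 0, i.e. a 16-byte
complex object:

  live:  bytes 0..7   (realpart)  -> bitmap_first_set_bit (live) == 0
  dead:  bytes 8..15  (imagpart)  -> bitmap_first_set_bit (dead) == 8
                                       == bitmap_count_bits (live)

So the first test recognizes "realpart live, imagpart dead" without
knowing the component size, and the mirrored test catches the opposite
case.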

>> +
>> +/* STMT is a memory write where one or more bytes written are dead
>> +   stores.  ORIG is the bitmap of bytes stored by STMT.  LIVE is the
>> +   bitmap of stores that are actually live.
>> +
>> +   Attempt to rewrite STMT so that it writes fewer memory locations.  Right
>> +   now we only support trimming at the start or end of the memory region.
>> +   It's not clear how much there is to be gained by trimming from the
>> middle
>> +   of the region.  */
>> +
>> +static void
>> +trim_partially_dead_store (bitmap orig, bitmap live, gimple *stmt)
>> +{
>> +  if (is_gimple_assign (stmt))
>> +    {
>> +      switch (gimple_assign_rhs_code (stmt))
>> +       {
>> +       case COMPLEX_CST:
>> +         trim_complex_store (orig, live, stmt);
>
> so patch #1 only handles stores from COMPLEX_CST?  Ok...
Patch #1 handles detecting of any aggregate store that is made fully 
dead by subsequent partial stores.   It also handles trimming of a 
partially dead complex constant initialization.

>> +  if (valid_ao_ref_for_dse (ref)
>> +      && (ref->max_size / BITS_PER_UNIT
>> +         <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE)))
>> +    {
>> +      live_bytes = BITMAP_ALLOC (NULL);
>> +      orig_live_bytes = BITMAP_ALLOC (NULL);
>> +      bitmap_set_range (live_bytes,
>> +                       ref->offset / BITS_PER_UNIT,
>> +                       ref->max_size / BITS_PER_UNIT);
>> +      bitmap_copy (orig_live_bytes, live_bytes);
>
> So I'd use a once-per-pass allocated sbitmap here.  I don't see why you need
> the orig_live_bytes bitmap though (just keep that implicitely by the known
> range?)
That would require resizing the bitmap or biasing.  While we have a 
consistent maximum size, ref->offset varies.  And that's probably the 
one case that sparse bitmaps handle better than simple bitmaps.

So I guess we'll have to bias everything.  It's not particularly hard, 
but it can be error prone.

>
>> +    }
>> +
>>    /* Find the first dominated statement that clobbers (part of) the
>>       memory stmt stores to with no intermediate statement that may use
>>       part of the memory stmt stores.  That is, find a store that may
>> @@ -164,7 +341,15 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt,
>> gimple **use_stmt)
>>         }
>>
>>        if (fail)
>> -       return false;
>> +       {
>> +         /* STMT might be partially dead and we may be able to reduce
>> +            how many memory locations it stores into.  */
>> +         if (live_bytes
>> +             && !bitmap_equal_p (live_bytes, orig_live_bytes)
>> +             && !gimple_clobber_p (stmt))
>> +           trim_partially_dead_store (orig_live_bytes, live_bytes, stmt);
>
> The actual transform in dse_possible_dead_store_p looks a bit misplaced.
> I see it's somehow convenient but then maybe returning a enum from this
> function might be cleaner.  Well, I'm not too torn about this, so maybe
> just rename the function a bit (no good suggestion either).
I pondered that as well.  I don't have a strong opinion either way. 
I'll look into returning a tri-state (live, partially dead, dead).


Jeff
Joseph Myers Dec. 16, 2016, 6:35 p.m. UTC | #5
On Thu, 15 Dec 2016, Jeff Law wrote:

> This version attacks the problem by improving DSE to track stores to memory at
> a byte level.  That allows us to determine if a series of stores completely
> covers an earlier store (thus making the earlier store dead).

Question: suppose you have an assignment for a struct with padding, then 
stores for all the elements.  Does it detect that the original assignment 
is dead (because there is no need to copy padding on struct assignment)?
Jakub Jelinek Dec. 16, 2016, 6:43 p.m. UTC | #6
On Fri, Dec 16, 2016 at 06:35:58PM +0000, Joseph Myers wrote:
> On Thu, 15 Dec 2016, Jeff Law wrote:
> 
> > This version attacks the problem by improving DSE to track stores to memory at
> > a byte level.  That allows us to determine if a series of stores completely
> > covers an earlier store (thus making the earlier store dead).
> 
> Question: suppose you have an assignment for a struct with padding, then 
> stores for all the elements.  Does it detect that the original assignment 
> is dead (because there is no need to copy padding on struct assignment)?

We fold memcpy into struct assignment early, does the same apply also to
memcpy?  Or shall we fold memcpy into struct assignment only when there is
no padding (or find different IL representation of that)?

	Jakub
Jeff Law Dec. 16, 2016, 6:44 p.m. UTC | #7
On 12/16/2016 11:35 AM, Joseph Myers wrote:
> On Thu, 15 Dec 2016, Jeff Law wrote:
>
>> This version attacks the problem by improving DSE to track stores to memory at
>> a byte level.  That allows us to determine if a series of stores completely
>> covers an earlier store (thus making the earlier store dead).
>
> Question: suppose you have an assignment for a struct with padding, then
> stores for all the elements.  Does it detect that the original assignment
> is dead (because there is no need to copy padding on struct assignment)?
It would not detect the original assignment as dead in that case as it 
does not know anything special about the padding bytes.

So we'd likely end up just trimming the original assignment, 
particularly if it was done via a CONSTRUCTOR assignment or memset.

This situation could be addressed by handling the padding bytes 
specially -- ie, just clear them from LIVE when LIVE is initialized. 
It'd likely "just work" then.

Seems like it might be a reasonable follow-up as it'd hopefully make 
more of the original assignments fully dead rather than just trimmable.
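
A hypothetical example of the case Joseph describes (assuming a 4-byte
int and the usual alignment, so bytes 1..3 are padding):

  struct S { char c; int i; } s, tmp;
  ...
  s = tmp;      /* writes all 8 bytes, padding included */
  s.c = 'x';    /* covers byte 0      */
  s.i = 42;     /* covers bytes 4..7  */

With the patch as posted, the padding bytes stay in LIVE_BYTES, so the
aggregate copy is at best trimmed by the later patches in the kit rather
than eliminated; clearing known padding bytes when LIVE is initialized
would let the whole copy be recognized as dead.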

jeff
Jeff Law Dec. 16, 2016, 6:52 p.m. UTC | #8
On 12/16/2016 11:43 AM, Jakub Jelinek wrote:
> On Fri, Dec 16, 2016 at 06:35:58PM +0000, Joseph Myers wrote:
>> On Thu, 15 Dec 2016, Jeff Law wrote:
>>
>>> This version attacks the problem by improving DSE to track stores to memory at
>>> a byte level.  That allows us to determine if a series of stores completely
>>> covers an earlier store (thus making the earlier store dead).
>>
>> Question: suppose you have an assignment for a struct with padding, then
>> stores for all the elements.  Does it detect that the original assignment
>> is dead (because there is no need to copy padding on struct assignment)?
>
> We fold memcpy into struct assignment early, does the same apply also to
> memcpy?  Or shall we fold memcpy into struct assignment only when there is
> no padding (or find different IL representation of that)?
So a memcpy would presumably cover the entire object, padding included.

Subsequent component stores would not cover the padding.  So we'd trim 
the head/tail of the memcpy.


A memcpy folded to a structure assignment almost certainly covers the 
whole assignment when represented in an ao_ref.  So it'd be no different 
-- except that I haven't written code to trim structure assignments 
(unless they're represented by a CONSTRUCTOR assignment).  Regardless I 
would not suggest changing if/how/when we fold memcpy into a structure 
assignment.

Jeff
Joseph Myers Dec. 16, 2016, 6:56 p.m. UTC | #9
On Fri, 16 Dec 2016, Jakub Jelinek wrote:

> On Fri, Dec 16, 2016 at 06:35:58PM +0000, Joseph Myers wrote:
> > On Thu, 15 Dec 2016, Jeff Law wrote:
> > 
> > > This version attacks the problem by improving DSE to track stores to memory at
> > > a byte level.  That allows us to determine if a series of stores completely
> > > covers an earlier store (thus making the earlier store dead).
> > 
> > Question: suppose you have an assignment for a struct with padding, then 
> > stores for all the elements.  Does it detect that the original assignment 
> > is dead (because there is no need to copy padding on struct assignment)?
> 
> We fold memcpy into struct assignment early, does the same apply also to
> memcpy?  Or shall we fold memcpy into struct assignment only when there is
> no padding (or find different IL representation of that)?

There is at least an arguable case that memset needs to set padding to 0 
(unlike initializers), and memcpy needs to copy it (unlike struct 
assignment).  Atomic compare-exchange definitely needs to work on the 
whole representation including padding.

However, when you store in a member of a structure (or union), that 
results in all the padding becoming undefined.  So in the cases of memcpy 
/ memset where this optimization is interesting (element stores follow the 
memcpy / memset), you can still optimize.

(Note the desire for an option to avoid such padding-related 
optimizations, by treating padding for optimization purposes as if there 
were dummy fields there that can all be cleared by a {} initializer or 
memset: PR 77992.)
Jeff Law Dec. 16, 2016, 6:57 p.m. UTC | #10
On 12/16/2016 06:57 AM, Richard Biener wrote:
> On Fri, Dec 16, 2016 at 2:54 AM, Jeff Law <law@redhat.com> wrote:
>> +       {
>> +         /* STMT might be partially dead and we may be able to reduce
>> +            how many memory locations it stores into.  */
>> +         if (live_bytes
>> +             && !bitmap_equal_p (live_bytes, orig_live_bytes)
>> +             && !gimple_clobber_p (stmt))
>> +           trim_partially_dead_store (orig_live_bytes, live_bytes, stmt);
>
> The actual transform in dse_possible_dead_store_p looks a bit misplaced.
> I see it's somehow convenient but then maybe returning a enum from this
> function might be cleaner.  Well, I'm not too torn about this, so maybe
> just rename the function a bit (no good suggestion either).
So the problem with returning a tri-state is that we'd also need to 
return the bitmaps or trimming data so that our caller would know how to 
trim in the partially dead case.  Certainly possible, though marginally 
more awkward from an implementation standpoint.  Renaming seems 
appropriate since dse_possible_dead_store_p does more than just detect a 
dead store with these changes.

Jeff
Jeff Law Dec. 16, 2016, 7:29 p.m. UTC | #11
On 12/16/2016 06:57 AM, Richard Biener wrote:
> On Fri, Dec 16, 2016 at 2:54 AM, Jeff Law <law@redhat.com> wrote:
>> +  /* REF is a memory write.  Go ahead and get its base, size, extent
>> +     information and encode the bytes written into LIVE_BYTES.  We can
>> +     handle any case where we have a known base and maximum size.
>> +
>> +     However, experimentation has shown that bit level tracking is not
>> +     useful in practice, so we only track at the byte level.
>> +
>> +     Furthermore, experimentation has shown that 99% of the cases
>> +     that require byte tracking are 64 bytes or less.  */
>> +  if (valid_ao_ref_for_dse (ref)
>> +      && (ref->max_size / BITS_PER_UNIT
>> +         <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE)))
>> +    {
>> +      live_bytes = BITMAP_ALLOC (NULL);
>> +      orig_live_bytes = BITMAP_ALLOC (NULL);
>> +      bitmap_set_range (live_bytes,
>> +                       ref->offset / BITS_PER_UNIT,
>> +                       ref->max_size / BITS_PER_UNIT);
>> +      bitmap_copy (orig_live_bytes, live_bytes);
>
> So I'd use a once-per-pass allocated sbitmap here.  I don't see why you need
> the orig_live_bytes bitmap though (just keep that implicitely by the known
> range?)
So if we use a once-per-pass allocated bitmap, that actually facilitates 
returning a tri-state from dse_possible_dead_store_p and moving the 
trimming into dse_optimize_stmt.

ORIG_LIVE_BYTES was more convenience than anything -- we want it so that 
we can compute the dead bytes for trimming.  But we can certainly 
compute it on-demand at that time.

Jeff
Jeff Law Dec. 17, 2016, 8:19 a.m. UTC | #12
On 12/16/2016 12:29 AM, Trevor Saunders wrote:
> On Thu, Dec 15, 2016 at 06:54:43PM -0700, Jeff Law wrote:
>>    unsigned cnt = 0;
>> +  bitmap live_bytes = NULL;
>> +  bitmap orig_live_bytes = NULL;
>>
>>    *use_stmt = NULL;
>>
>> +  /* REF is a memory write.  Go ahead and get its base, size, extent
>> +     information and encode the bytes written into LIVE_BYTES.  We can
>> +     handle any case where we have a known base and maximum size.
>> +
>> +     However, experimentation has shown that bit level tracking is not
>> +     useful in practice, so we only track at the byte level.
>> +
>> +     Furthermore, experimentation has shown that 99% of the cases
>> +     that require byte tracking are 64 bytes or less.  */
>> +  if (valid_ao_ref_for_dse (ref)
>> +      && (ref->max_size / BITS_PER_UNIT
>> +	  <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE)))
>> +    {
>> +      live_bytes = BITMAP_ALLOC (NULL);
>> +      orig_live_bytes = BITMAP_ALLOC (NULL);
>> +      bitmap_set_range (live_bytes,
>> +			ref->offset / BITS_PER_UNIT,
>> +			ref->max_size / BITS_PER_UNIT);
>> +      bitmap_copy (orig_live_bytes, live_bytes);
>
> would it maybe make sense to use sbitmaps since the length is known to
> be short, and doesn't change after allocation?
So a few interesting things have to be dealt with if we want to make this 
change.  I already mentioned the need to bias based on ref->offset so 
that the range of bytes we're tracking is represented 0..size.

While we know the length of the potential dead store we don't know the 
length of the subsequent stores that we're hoping make the original a 
dead store.  Thus when we start clearing LIVE_BYTES based on those 
subsequent stores, we have to normalize those against the ref->offset of 
the original store.

What's even more fun is sizing.  Those subsequent stores may be 
considerably larger.  Which means that a bitmap_clear_range call has to 
be a hell of a lot more careful when working with sbitmaps (we just 
happily stop all over memory in that case) whereas a bitmap the right 
things will "just happen".

On the positive side, since we've normalized the potential dead store's 
byte range to 0..size, it means computing trims is easier because we 
inherently know how many bits were originally set.  So compute_trims 
becomes trivial and we can simplify trim_complex_store a bit as well.

And, of course, we don't have a bitmap_{clear,set}_range or a 
bitmap_count_bits implementation for sbitmaps.
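
A rough sketch of the normalization and clamping being described (the
W_OFFSET/W_SIZE names are illustrative, and it assumes the missing
sbitmap range-clear helper mentioned above gets written):

  /* LIVE_BYTES tracks bytes 0 .. REF->max_size/BITS_PER_UNIT - 1 of the
     potentially dead store, i.e. biased by REF->offset.  A later store
     with bit offset W_OFFSET and bit size W_SIZE must be rebased and
     clamped before clearing.  */
  HOST_WIDE_INT start = (w_offset - ref->offset) / BITS_PER_UNIT;
  HOST_WIDE_INT size = w_size / BITS_PER_UNIT;
  HOST_WIDE_INT limit = ref->max_size / BITS_PER_UNIT;
  if (start < 0)
    {
      size += start;    /* drop the part before the tracked object */
      start = 0;
    }
  if (start < limit && size > 0)
    bitmap_clear_range (live_bytes, start, MIN (size, limit - start));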


It's all a bit of "ugh", but should be manageable.

Jeff
Richard Biener Dec. 17, 2016, 8:31 a.m. UTC | #13
On December 16, 2016 7:43:22 PM GMT+01:00, Jakub Jelinek <jakub@redhat.com> wrote:
>On Fri, Dec 16, 2016 at 06:35:58PM +0000, Joseph Myers wrote:
>> On Thu, 15 Dec 2016, Jeff Law wrote:
>> 
>> > This version attacks the problem by improving DSE to track stores to
>> > memory at a byte level.  That allows us to determine if a series of
>> > stores completely covers an earlier store (thus making the earlier
>> > store dead).
>> 
>> Question: suppose you have an assignment for a struct with padding, then
>> stores for all the elements.  Does it detect that the original assignment
>> is dead (because there is no need to copy padding on struct assignment)?

In GIMPLE a struct assignment is equal to memcpy, so that info is lost after leaving the frontend realm.

>We fold memcpy into struct assignment early, does the same apply also to
>memcpy?  Or shall we fold memcpy into struct assignment only when there is
>no padding (or find different IL representation of that)?

If we want to change that we can use a char[] type for copying on GIMPLE.

Richard.

>	Jakub
Richard Sandiford Dec. 17, 2016, 4:57 p.m. UTC | #14
Jeff Law <law@redhat.com> writes:
> This is the first of the 4 part patchkit to address deficiencies in our 
> DSE implementation.
>
> This patch addresses the P2 regression 33562 which has been a low 
> priority regression since gcc-4.3.  To summarize, DSE no longer has the 
> ability to detect an aggregate store as dead if subsequent stores are 
> done in a piecemeal fashion.
>
> I originally tackled this by changing how we lower complex objects. 
> That was sufficient to address 33562, but was reasonably rejected.
>
> This version attacks the problem by improving DSE to track stores to 
> memory at a byte level.  That allows us to determine if a series of 
> stores completely covers an earlier store (thus making the earlier store 
> dead).
>
> A useful side effect of this is we can detect when parts of a store are 
> dead and potentially rewrite the store.  This patch implements that for 
> complex object initializations.  While not strictly part of 33562, it's 
> so closely related that I felt it belongs as part of this patch.
>
> This originally limited the size of the tracked memory space to 64 
> bytes.  I bumped the limit after working through the CONSTRUCTOR and 
> mem* trimming patches.  The 256 byte limit is still fairly arbitrary and 
> I wouldn't lose sleep if we throttled back to 64 or 128 bytes.

FWIW (and shouldn't affect whether the patch goes in)...

If SVE support is accepted for GCC 8 then we have the additional problem
that the sizes of useful aggregates can be runtime invariants.  For example,
in the SVE equivalent of float64x2x2_t, it would be good to be able to
detect that an assignment to the full structure is dead if we later
assign to both of the individual vectors.  Probably the most convenient
way of doing that would be to track ranges rather than individual bytes,
like the local part of the RTL DSE pass does.  (It was relatively easy
to convert local RTL DSE to variable-length modes, but the global part
also uses byte bitmaps.)

I guess bitmaps are going to be more efficient for small structures or
for accesses that occur in an arbitrary order.  But tracking ranges
means adding at most one fixed-size range record per update, so I suppose
the throttle would be on the number of records (= number of gaps + 1)
rather than the size of the structure.
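
For what it's worth, a minimal sketch of what such a range record might
look like (purely illustrative, going no further than the idea above):

  /* Still-live part of the original store, kept as a small sorted
     vector of byte ranges relative to its base.  A later store kills a
     subrange by splitting or shrinking entries; the original store is
     fully dead once the vector is empty, and we give up when the number
     of entries (gaps + 1) exceeds a threshold.  For SVE the offsets
     would need to be able to carry runtime invariants.  */
  struct live_range
  {
    HOST_WIDE_INT start;    /* first live byte, inclusive */
    HOST_WIDE_INT end;      /* one past the last live byte */
  };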

Thanks,
Richard
Trevor Saunders Dec. 21, 2016, 1:43 p.m. UTC | #15
On Sat, Dec 17, 2016 at 01:19:41AM -0700, Jeff Law wrote:
> On 12/16/2016 12:29 AM, Trevor Saunders wrote:
> > On Thu, Dec 15, 2016 at 06:54:43PM -0700, Jeff Law wrote:
> > >    unsigned cnt = 0;
> > > +  bitmap live_bytes = NULL;
> > > +  bitmap orig_live_bytes = NULL;
> > > 
> > >    *use_stmt = NULL;
> > > 
> > > +  /* REF is a memory write.  Go ahead and get its base, size, extent
> > > +     information and encode the bytes written into LIVE_BYTES.  We can
> > > +     handle any case where we have a known base and maximum size.
> > > +
> > > +     However, experimentation has shown that bit level tracking is not
> > > +     useful in practice, so we only track at the byte level.
> > > +
> > > +     Furthermore, experimentation has shown that 99% of the cases
> > > +     that require byte tracking are 64 bytes or less.  */
> > > +  if (valid_ao_ref_for_dse (ref)
> > > +      && (ref->max_size / BITS_PER_UNIT
> > > +	  <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE)))
> > > +    {
> > > +      live_bytes = BITMAP_ALLOC (NULL);
> > > +      orig_live_bytes = BITMAP_ALLOC (NULL);
> > > +      bitmap_set_range (live_bytes,
> > > +			ref->offset / BITS_PER_UNIT,
> > > +			ref->max_size / BITS_PER_UNIT);
> > > +      bitmap_copy (orig_live_bytes, live_bytes);
> > 
> > would it maybe make sense to use sbitmaps since the length is known to
> > be short, and doesn't change after allocation?
> So a few interesting things have to be dealt if we want to make this change.
> I already mentioned the need to bias based on ref->offset so that the range
> of bytes we're tracking is represented 0..size.
> 
> While we know the length of the potential dead store we don't know the
> length of the subsequent stores that we're hoping make the original a dead
> store.  Thus when we start clearing LIVE_BYTES based on those subsequent
> stores, we have to normalize those against the ref->offset of the original
> store.
> 
> What's even more fun is sizing.  Those subsequent stores may be considerably
> larger.  Which means that a bitmap_clear_range call has to be a hell of a
> lot more careful when working with sbitmaps (we just happily stop all over
> memory in that case) whereas a bitmap the right things will "just happen".
> 
> On a positive size since we've normalized the potential dead store's byte
> range to 0..size, it means computing trims is easier because we inherently
> know how many bits were originally set.  So compute_trims becomes trivial
> and we can simplify trim_complex_store a bit as well.
> 
> And, of course, we don't have a bitmap_{clear,set}_range or a
> bitmap_count_bits implementation for sbitmaps.
> 
> 
> It's all a bit of "ugh", but should be manageable.

yeah, but that seems like enough work that you could reasonably stick
with bitmap instead.

Trev

p.s. sorry I've been falling behind lately.

> 
> Jeff
Jeff Law Dec. 21, 2016, 4:02 p.m. UTC | #16
On 12/21/2016 06:43 AM, Trevor Saunders wrote:
>> So a few interesting things have to be dealt if we want to make this change.
>> I already mentioned the need to bias based on ref->offset so that the range
>> of bytes we're tracking is represented 0..size.
>>
>> While we know the length of the potential dead store we don't know the
>> length of the subsequent stores that we're hoping make the original a dead
>> store.  Thus when we start clearing LIVE_BYTES based on those subsequent
>> stores, we have to normalize those against the ref->offset of the original
>> store.
>>
>> What's even more fun is sizing.  Those subsequent stores may be considerably
>> larger.  Which means that a bitmap_clear_range call has to be a hell of a
>> lot more careful when working with sbitmaps (we just happily stop all over
>> memory in that case) whereas a bitmap the right things will "just happen".
>>
>> On a positive size since we've normalized the potential dead store's byte
>> range to 0..size, it means computing trims is easier because we inherently
>> know how many bits were originally set.  So compute_trims becomes trivial
>> and we can simplify trim_complex_store a bit as well.
>>
>> And, of course, we don't have a bitmap_{clear,set}_range or a
>> bitmap_count_bits implementation for sbitmaps.
>>
>>
>> It's all a bit of "ugh", but should be manageable.
>
> yeah, but that seems like enough work that you could reasonably stick
> with bitmap instead.
Well, the conversion is done :-)  It was largely as I expected with the 
devil being in the normalization details which are well isolated.

>
> p.s. sorry I've been falling behind lately.
It happens.  You might want to peek briefly at Aldy's auto_bitmap class 
as part of the 71691 patches.  It looks like a fairly straightforward 
conversion to auto_*.  I'm sure there's all kinds of places we could use it.

Jeff
Jeff Law Dec. 21, 2016, 6:16 p.m. UTC | #17
On 12/16/2016 12:29 AM, Trevor Saunders wrote:
> On Thu, Dec 15, 2016 at 06:54:43PM -0700, Jeff Law wrote:
>>    unsigned cnt = 0;
>> +  bitmap live_bytes = NULL;
>> +  bitmap orig_live_bytes = NULL;
>>
>>    *use_stmt = NULL;
>>
>> +  /* REF is a memory write.  Go ahead and get its base, size, extent
>> +     information and encode the bytes written into LIVE_BYTES.  We can
>> +     handle any case where we have a known base and maximum size.
>> +
>> +     However, experimentation has shown that bit level tracking is not
>> +     useful in practice, so we only track at the byte level.
>> +
>> +     Furthermore, experimentation has shown that 99% of the cases
>> +     that require byte tracking are 64 bytes or less.  */
>> +  if (valid_ao_ref_for_dse (ref)
>> +      && (ref->max_size / BITS_PER_UNIT
>> +	  <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE)))
>> +    {
>> +      live_bytes = BITMAP_ALLOC (NULL);
>> +      orig_live_bytes = BITMAP_ALLOC (NULL);
>> +      bitmap_set_range (live_bytes,
>> +			ref->offset / BITS_PER_UNIT,
>> +			ref->max_size / BITS_PER_UNIT);
>> +      bitmap_copy (orig_live_bytes, live_bytes);
>
> would it maybe make sense to use sbitmaps since the length is known to
> be short, and doesn't change after allocation?
New version will use auto_sbitmap and sbitmaps.

Jeff
Jeff Law Dec. 21, 2016, 6:23 p.m. UTC | #18
On 12/16/2016 06:57 AM, Richard Biener wrote:
>
> Apart from what Trevor says about using sbitmaps (try to avoid the initial
> zeroing please) and the missed freeing (you can use auto_[s]bitmap?)
> some comments below.
New version uses sbitmaps and avoids zero-ing when we can.

>> +static void
>> +trim_complex_store (bitmap orig, bitmap live, gimple *stmt)
>> +{
>> +  bitmap dead = BITMAP_ALLOC (NULL);
>> +  bitmap_and_compl (dead, orig, live);
>> +
>> +  /* So we have to verify that either the real or imag part as a whole
>> +     is dead.  They are always the same size.  Thus for one to be dead
>> +     the number of live bytes would have to be the same as the number of
>> +     dead bytes.  */
>> +  if (bitmap_count_bits (live) == bitmap_count_bits (dead))
>
> popcount is expensive, so is this really a short-cut?
Note this can all be done more efficiently.  Using sbitmaps forces us to 
normalize the offset/size and I've dropped the original bitmap as the 
ao_ref carries the data we need.  Those two changes make other 
simplifications fall out, and what we get in compute_trims is a last_set_bit 
and first_set_bit on the live bitmap -- no other bitmap 
traversals/searches/counts.

That makes compute_trims more efficient than the code in 
trim_complex_store.  So I've made trim_complex_store re-use the more 
general trimming code, and it's still agnostic of the underlying sizes, 
it just cares that the underlying components are the same size.
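
Roughly, the trimming computation described above reduces to something
like this sketch (assuming LIVE has been normalized so bit 0 is the
first byte of the original store and NBYTES is its size in bytes):

  /* The dead head and tail are just the gaps before the first and after
     the last live byte; the caller only asks when LIVE is non-empty.  */
  static void
  compute_trims (sbitmap live, int nbytes, int *trim_head, int *trim_tail)
  {
    *trim_head = bitmap_first_set_bit (live);
    *trim_tail = nbytes - 1 - bitmap_last_set_bit (live);
  }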

>
> The actual transform in dse_possible_dead_store_p looks a bit misplaced.
> I see it's somehow convenient but then maybe returning a enum from this
> function might be cleaner.  Well, I'm not too torn about this, so maybe
> just rename the function a bit (no good suggestion either).
>
> The rest of the patch (the infrastructure) looks reasonable.
I think I mentioned it earlier, but moving towards a single allocated 
bitmap for the entire pass makes it much easier to have 
dse_possible_dead_store_p return a tri-state and live information. 
You'll see that change in the upcoming revision to the patchkit.

Jeff
>

Patch

diff --git a/gcc/params.def b/gcc/params.def
index 50f75a7..ddc3d65 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -532,6 +532,11 @@  DEFPARAM(PARAM_AVG_LOOP_NITER,
 	 "Average number of iterations of a loop.",
 	 10, 1, 0)
 
+DEFPARAM(PARAM_DSE_MAX_OBJECT_SIZE,
+	 "dse-max-object-size",
+	 "Maximum size (in bytes) of objects tracked by dead store elimination.",
+	 256, 0, 0)
+
 DEFPARAM(PARAM_SCEV_MAX_EXPR_SIZE,
  	 "scev-max-expr-size",
 	 "Bound on size of expressions used in the scalar evolutions analyzer.",
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
index 87a2638..3155741 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c
@@ -10,4 +10,4 @@  int f(void)
   return g(&t);
 }
 
-/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
index e2cd403..e6d027f 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c
@@ -8,4 +8,4 @@  int f(void)
  __imag__ t = 2;
 }
 
-/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c
new file mode 100644
index 0000000..92b2df8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+int g(_Complex int*);
+int f(void)
+{
+  _Complex int t = 0;
+  int i, j;
+ __imag__ t += 2;
+  return g(&t);
+}
+
+
+/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c
new file mode 100644
index 0000000..718b746
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+int g(_Complex int*);
+int f(void)
+{
+  _Complex int t = 0;
+  int i, j;
+ __real__ t += 2;
+  return g(&t);
+}
+
+
+/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c
new file mode 100644
index 0000000..4e14d9b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O1 -fno-tree-dce -fdump-tree-optimized" } */
+_Complex int t = 0;
+int f(void)
+{
+  t = 0;
+ __imag__ t = 2;
+}
+
+/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c
new file mode 100644
index 0000000..d1e0b85
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O1 -fno-tree-dce -fdump-tree-optimized" } */
+_Complex int t = 0;
+int f(void)
+{
+  t = 0;
+ __real__ t = 2;
+}
+
+/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
index 594c20c..ae48ddd 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c
@@ -11,4 +11,4 @@  foo ()
 }
 
 /* We should eliminate the first assignment.  */
-/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" } } */
diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
index 778b363..eea185c 100644
--- a/gcc/tree-ssa-dse.c
+++ b/gcc/tree-ssa-dse.c
@@ -33,6 +33,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-dfa.h"
 #include "domwalk.h"
 #include "tree-cfgcleanup.h"
+#include "params.h"
 
 /* This file implements dead store elimination.
 
@@ -68,6 +69,158 @@  along with GCC; see the file COPYING3.  If not see
    remove their dead edges eventually.  */
 static bitmap need_eh_cleanup;
 
+/* STMT is a statement that may write into memory.  Analyze it and
+   initialize WRITE to describe how STMT affects memory.
+
+   Return TRUE if the statement was analyzed, FALSE otherwise.
+
+   It is always safe to return FALSE.  But typically better optimization
+   can be achieved by analyzing more statements.  */
+
+static bool
+initialize_ao_ref_for_dse (gimple *stmt, ao_ref *write)
+{
+  /* It's advantageous to handle certain mem* functions.  */
+  if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
+    {
+      switch (DECL_FUNCTION_CODE (gimple_call_fndecl (stmt)))
+	{
+	  case BUILT_IN_MEMCPY:
+	  case BUILT_IN_MEMMOVE:
+	  case BUILT_IN_MEMSET:
+	    {
+	      tree size = NULL_TREE;
+	      if (gimple_call_num_args (stmt) == 3)
+		size = gimple_call_arg (stmt, 2);
+	      tree ptr = gimple_call_arg (stmt, 0);
+	      ao_ref_init_from_ptr_and_size (write, ptr, size);
+	      return true;
+	    }
+	  default:
+	    break;
+	}
+    }
+  else if (is_gimple_assign (stmt))
+    {
+      ao_ref_init (write, gimple_assign_lhs (stmt));
+      return true;
+    }
+  return false;
+}
+
+/* Given REF from the alias oracle, return TRUE if it is a valid
+   memory reference for dead store elimination, false otherwise.
+
+   In particular, the reference must have a known base, known maximum
+   size, start at a byte offset and have a size that is one or more
+   bytes.  */
+
+static bool
+valid_ao_ref_for_dse (ao_ref *ref)
+{
+  return (ao_ref_base (ref)
+	  && ref->max_size != -1
+	  && (ref->offset % BITS_PER_UNIT) == 0
+	  && (ref->size % BITS_PER_UNIT) == 0);
+}
+
+/* Clear any bytes written by STMT from the bitmap LIVE_BYTES.  The base
+   address written by STMT must match the one found in REF, which must
+   have its base address previously initialized.
+
+   This routine must be conservative.  If we don't know the offset or
+   actual size written, assume nothing was written.  */
+
+static void
+clear_bytes_written_by (bitmap live_bytes, gimple *stmt, ao_ref *ref)
+{
+  ao_ref write;
+  if (!initialize_ao_ref_for_dse (stmt, &write))
+    return;
+
+  /* Verify we have the same base memory address and that the write
+     has a known size.  If so, then clear the appropriate bytes in
+     the LIVE_BYTES bitmap.  */
+  if (valid_ao_ref_for_dse (&write)
+      && write.base == ref->base
+      && write.size == write.max_size)
+    bitmap_clear_range (live_bytes,
+			write.offset / BITS_PER_UNIT,
+			write.size / BITS_PER_UNIT);
+}
+
+/* STMT initializes an object from COMPLEX_CST where one or more of the
+   bytes written are dead stores.  ORIG is the bitmap of bytes stored by
+   STMT.  LIVE is the bitmap of bytes that are actually live.
+
+   Attempt to rewrite STMT so that only the real or imaginary part of
+   the object is actually stored.  */
+
+static void
+trim_complex_store (bitmap orig, bitmap live, gimple *stmt)
+{
+  bitmap dead = BITMAP_ALLOC (NULL);
+  bitmap_and_compl (dead, orig, live);
+
+  /* So we have to verify that either the real or imag part as a whole
+     is dead.  They are always the same size.  Thus for one to be dead
+     the number of live bytes would have to be the same as the number of
+     dead bytes.  */
+  if (bitmap_count_bits (live) == bitmap_count_bits (dead))
+    {
+      /* Hopefully we have live bits that look like 0xff00 or 0xff (assuming
+	 8 bytes for the underlying real/imag part).  If so we can optimize
+	 this case.  */
+      if (bitmap_first_set_bit (live) == 0
+	  && bitmap_first_set_bit (dead) == bitmap_count_bits (live))
+	{
+	  /* TREE_REALPART is live.  */
+	  tree x = TREE_REALPART (gimple_assign_rhs1 (stmt));
+	  tree y = gimple_assign_lhs (stmt);
+	  y = build1 (REALPART_EXPR, TREE_TYPE (x), y);
+	  gimple_assign_set_lhs (stmt, y);
+	  gimple_assign_set_rhs1 (stmt, x);
+	}
+      else if (bitmap_first_set_bit (dead) == 0
+	  && bitmap_first_set_bit (live) == bitmap_count_bits (dead))
+	{
+	  /* TREE_IMAGPART is live.  */
+	  tree x = TREE_IMAGPART (gimple_assign_rhs1 (stmt));
+	  tree y = gimple_assign_lhs (stmt);
+	  y = build1 (IMAGPART_EXPR, TREE_TYPE (x), y);
+	  gimple_assign_set_lhs (stmt, y);
+	  gimple_assign_set_rhs1 (stmt, x);
+	}
+      /* Other cases indicate parts of both the real and imag subobjects
+	 are live.  We do not try to optimize those cases.  */
+    }
+  BITMAP_FREE (dead);
+}
+
+/* STMT is a memory write where one or more bytes written are dead
+   stores.  ORIG is the bitmap of bytes stored by STMT.  LIVE is the
+   bitmap of bytes that are actually live.
+
+   Attempt to rewrite STMT so that it writes fewer memory locations.  Right
+   now we only support trimming at the start or end of the memory region.
+   It's not clear how much there is to be gained by trimming from the middle
+   of the region.  */
+
+static void
+trim_partially_dead_store (bitmap orig, bitmap live, gimple *stmt)
+{
+  if (is_gimple_assign (stmt))
+    {
+      switch (gimple_assign_rhs_code (stmt))
+	{
+	case COMPLEX_CST:
+	  trim_complex_store (orig, live, stmt);
+	  break;
+	default:
+	  break;
+	}
+    }
+}
 
 /* A helper of dse_optimize_stmt.
    Given a GIMPLE_ASSIGN in STMT that writes to REF, find a candidate
@@ -79,9 +232,32 @@  dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt)
 {
   gimple *temp;
   unsigned cnt = 0;
+  bitmap live_bytes = NULL;
+  bitmap orig_live_bytes = NULL;
 
   *use_stmt = NULL;
 
+  /* REF is a memory write.  Go ahead and get its base, size, extent
+     information and encode the bytes written into LIVE_BYTES.  We can
+     handle any case where we have a known base and maximum size.
+
+     However, experimentation has shown that bit level tracking is not
+     useful in practice, so we only track at the byte level.
+
+     Furthermore, experimentation has shown that 99% of the cases
+     that require byte tracking are 64 bytes or less.  */
+  if (valid_ao_ref_for_dse (ref)
+      && (ref->max_size / BITS_PER_UNIT
+	  <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE)))
+    {
+      live_bytes = BITMAP_ALLOC (NULL);
+      orig_live_bytes = BITMAP_ALLOC (NULL);
+      bitmap_set_range (live_bytes,
+			ref->offset / BITS_PER_UNIT,
+			ref->max_size / BITS_PER_UNIT);
+      bitmap_copy (orig_live_bytes, live_bytes);
+    }
+
   /* Find the first dominated statement that clobbers (part of) the
      memory stmt stores to with no intermediate statement that may use
      part of the memory stmt stores.  That is, find a store that may
@@ -164,7 +341,15 @@  dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt)
 	}
 
       if (fail)
-	return false;
+	{
+	  /* STMT might be partially dead and we may be able to reduce
+	     how many memory locations it stores into.  */
+	  if (live_bytes
+	      && !bitmap_equal_p (live_bytes, orig_live_bytes)
+	      && !gimple_clobber_p (stmt))
+	    trim_partially_dead_store (orig_live_bytes, live_bytes, stmt);
+	  return false;
+	}
 
       /* If we didn't find any definition this means the store is dead
          if it isn't a store to global reachable memory.  In this case
@@ -177,12 +362,18 @@  dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt)
 	  temp = stmt;
 	  break;
 	}
+
+      if (live_bytes && temp)
+	clear_bytes_written_by (live_bytes, temp, ref);
     }
-  /* Continue walking until we reach a kill.  */
-  while (!stmt_kills_ref_p (temp, ref));
+  /* Continue walking until we reach a full kill as a single statement
+     or there are no more live bytes.  */
+  while (!stmt_kills_ref_p (temp, ref)
+	 && !(live_bytes && bitmap_empty_p (live_bytes)));
 
   *use_stmt = temp;
-
+  BITMAP_FREE (live_bytes);
+  BITMAP_FREE (orig_live_bytes);
   return true;
 }
 
@@ -214,6 +405,10 @@  dse_optimize_stmt (gimple_stmt_iterator *gsi)
 	  || TREE_CODE (gimple_assign_lhs (stmt)) != MEM_REF))
     return;
 
+  ao_ref ref;
+  if (!initialize_ao_ref_for_dse (stmt, &ref))
+    return;
+
   /* We know we have virtual definitions.  We can handle assignments and
      some builtin calls.  */
   if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL))
@@ -225,12 +420,6 @@  dse_optimize_stmt (gimple_stmt_iterator *gsi)
 	  case BUILT_IN_MEMSET:
 	    {
 	      gimple *use_stmt;
-	      ao_ref ref;
-	      tree size = NULL_TREE;
-	      if (gimple_call_num_args (stmt) == 3)
-		size = gimple_call_arg (stmt, 2);
-	      tree ptr = gimple_call_arg (stmt, 0);
-	      ao_ref_init_from_ptr_and_size (&ref, ptr, size);
 	      if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
 		return;
 
@@ -244,6 +433,7 @@  dse_optimize_stmt (gimple_stmt_iterator *gsi)
 	      tree lhs = gimple_call_lhs (stmt);
 	      if (lhs)
 		{
+		  tree ptr = gimple_call_arg (stmt, 0);
 		  gimple *new_stmt = gimple_build_assign (lhs, ptr);
 		  unlink_stmt_vdef (stmt);
 		  if (gsi_replace (gsi, new_stmt, true))
@@ -274,13 +464,8 @@  dse_optimize_stmt (gimple_stmt_iterator *gsi)
       if (operand_equal_p (gimple_assign_rhs1 (stmt),
 			   gimple_assign_lhs (stmt), 0))
 	use_stmt = stmt;
-      else
-	{
-	  ao_ref ref;
-	  ao_ref_init (&ref, gimple_assign_lhs (stmt));
-  	  if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
-	    return;
-	}
+      else if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt))
+	return;
 
       /* Now we know that use_stmt kills the LHS of stmt.  */