Message ID: 79d271d8-739e-23c6-ed24-e591cb7fb941@redhat.com
State: New
On Thu, Dec 15, 2016 at 06:54:43PM -0700, Jeff Law wrote: > unsigned cnt = 0; > + bitmap live_bytes = NULL; > + bitmap orig_live_bytes = NULL; > > *use_stmt = NULL; > > + /* REF is a memory write. Go ahead and get its base, size, extent > + information and encode the bytes written into LIVE_BYTES. We can > + handle any case where we have a known base and maximum size. > + > + However, experimentation has shown that bit level tracking is not > + useful in practice, so we only track at the byte level. > + > + Furthermore, experimentation has shown that 99% of the cases > + that require byte tracking are 64 bytes or less. */ > + if (valid_ao_ref_for_dse (ref) > + && (ref->max_size / BITS_PER_UNIT > + <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE))) > + { > + live_bytes = BITMAP_ALLOC (NULL); > + orig_live_bytes = BITMAP_ALLOC (NULL); > + bitmap_set_range (live_bytes, > + ref->offset / BITS_PER_UNIT, > + ref->max_size / BITS_PER_UNIT); > + bitmap_copy (orig_live_bytes, live_bytes); would it maybe make sense to use sbitmaps since the length is known to be short, and doesn't change after allocation? > @@ -164,7 +341,15 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt) > } > > if (fail) > - return false; > + { > + /* STMT might be partially dead and we may be able to reduce > + how many memory locations it stores into. */ > + if (live_bytes > + && !bitmap_equal_p (live_bytes, orig_live_bytes) > + && !gimple_clobber_p (stmt)) > + trim_partially_dead_store (orig_live_bytes, live_bytes, stmt); > + return false; shouldn't you free the bitmaps here? Trev
On Fri, Dec 16, 2016 at 2:54 AM, Jeff Law <law@redhat.com> wrote: > > This is the first of the 4 part patchkit to address deficiencies in our DSE > implementation. > > > This patch addresses the P2 regression 33562 which has been a low priority > regression since gcc-4.3. To summarize, DSE no longer has the ability to > detect an aggregate store as dead if subsequent stores are done in a > piecemeal fashion. > > I originally tackled this by changing how we lower complex objects. That was > sufficient to address 33562, but was reasonably rejected. > > This version attacks the problem by improving DSE to track stores to memory > at a byte level. That allows us to determine if a series of stores > completely covers an earlier store (thus making the earlier store dead). > > A useful side effect of this is we can detect when parts of a store are dead > and potentially rewrite the store. This patch implements that for complex > object initializations. While not strictly part of 33562, it's so closely > related that I felt it belongs as part of this patch. > > This originally limited the size of the tracked memory space to 64 bytes. I > bumped the limit after working through the CONSTRUCTOR and mem* trimming > patches. The 256 byte limit is still fairly arbitrary and I wouldn't lose > sleep if we throttled back to 64 or 128 bytes. > > Later patches in the kit will build upon this patch. So if pieces look like > skeleton code, that's because it is. > > > Bootstrapped and regression tested on x86_64-linux-gnu. OK for the trunk? Apart from what Trevor says about using sbitmaps (try to avoid the initial zeroing please) and the missed freeing (you can use auto_[s]bitmap?) some comments below. > PR tree-optimization/33562 > * params.def (PARAM_DSE_MAX_OBJECT_SIZE): New PARAM. > * tree-ssa-dse.c: Include params.h. > (initialize_ao_ref_for_dse): New, partially extracted from > dse_optimize_stmt. > (valid_ao_ref_for_dse): New.
> (clear_bytes_written_by, trim_complex_store): Likewise. > (trim_partially_dead_store): Likewise. > (dse_partially_dead_store_p): Track what bytes were originally > stored > into memory by the statement as well as the subset of bytes that > are still live. If we "fail", but have identified the store as > partially dead, try to rewrite it to store fewer bytes of data. > Exit the main loop if we find a full kill as a single statement > or via group of statements. > (dse_optimize_stmt): Use initialize_ao_ref_for_dse. > > > * gcc.dg/tree-ssa/complex-4.c: No longer xfailed. > * gcc.dg/tree-ssa/complex-5.c: Likewise. > * gcc.dg/tree-ssa/ssa-dse-9.c: Likewise. > * gcc.dg/tree-ssa/ssa-dse-18.c: New test. > * gcc.dg/tree-ssa/ssa-dse-19.c: Likewise. > * gcc.dg/tree-ssa/ssa-dse-20.c: Likewise. > * gcc.dg/tree-ssa/ssa-dse-21.c: Likewise. > > diff --git a/gcc/params.def b/gcc/params.def > index 50f75a7..ddc3d65 100644 > --- a/gcc/params.def > +++ b/gcc/params.def > @@ -532,6 +532,11 @@ DEFPARAM(PARAM_AVG_LOOP_NITER, > "Average number of iterations of a loop.", > 10, 1, 0) > > +DEFPARAM(PARAM_DSE_MAX_OBJECT_SIZE, > + "dse-max-object-size", > + "Maximum size (in bytes) of objects tracked by dead store > elimination.", > + 256, 0, 0) > + > DEFPARAM(PARAM_SCEV_MAX_EXPR_SIZE, > "scev-max-expr-size", > "Bound on size of expressions used in the scalar evolutions > analyzer.", > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c > b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c > index 87a2638..3155741 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c > @@ -10,4 +10,4 @@ int f(void) > return g(&t); > } > > -/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail > *-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c > b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c > index e2cd403..e6d027f 100644 > --- 
a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c > @@ -8,4 +8,4 @@ int f(void) > __imag__ t = 2; > } > > -/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail > *-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c > new file mode 100644 > index 0000000..92b2df8 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c > @@ -0,0 +1,15 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-optimized" } */ > +int g(_Complex int*); > +int f(void) > +{ > + _Complex int t = 0; > + int i, j; > + __imag__ t += 2; > + return g(&t); > +} > + > + > +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ > +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */ > +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c > new file mode 100644 > index 0000000..718b746 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c > @@ -0,0 +1,15 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-optimized" } */ > +int g(_Complex int*); > +int f(void) > +{ > + _Complex int t = 0; > + int i, j; > + __real__ t += 2; > + return g(&t); > +} > + > + > +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ > +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */ > +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c > new file mode 100644 > index 0000000..4e14d9b > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c > @@ -0,0 +1,13 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O1 -fno-tree-dce 
-fdump-tree-optimized" } */ > +_Complex int t = 0; > +int f(void) > +{ > + t = 0; > + __imag__ t = 2; > +} > + > +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ > +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */ > +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */ > + > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c > new file mode 100644 > index 0000000..d1e0b85 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c > @@ -0,0 +1,12 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O1 -fno-tree-dce -fdump-tree-optimized" } */ > +_Complex int t = 0; > +int f(void) > +{ > + t = 0; > + __real__ t = 2; > +} > + > +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ > +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */ > +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c > index 594c20c..ae48ddd 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c > @@ -11,4 +11,4 @@ foo () > } > > /* We should eliminate the first assignment. */ > -/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" { xfail *-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" } } */ > diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c > index 778b363..eea185c 100644 > --- a/gcc/tree-ssa-dse.c > +++ b/gcc/tree-ssa-dse.c > @@ -33,6 +33,7 @@ along with GCC; see the file COPYING3. If not see > #include "tree-dfa.h" > #include "domwalk.h" > #include "tree-cfgcleanup.h" > +#include "params.h" > > /* This file implements dead store elimination. > > @@ -68,6 +69,158 @@ along with GCC; see the file COPYING3. If not see > remove their dead edges eventually. 
*/ > static bitmap need_eh_cleanup; > > +/* STMT is a statement that may write into memory. Analyze it and > + initialize WRITE to describe how STMT affects memory. > + > + Return TRUE if the statement was analyzed, FALSE otherwise. > + > + It is always safe to return FALSE. But typically better optimization > + can be achieved by analyzing more statements. */ > + > +static bool > +initialize_ao_ref_for_dse (gimple *stmt, ao_ref *write) > +{ > + /* It's advantageous to handle certain mem* functions. */ > + if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL)) > + { > + switch (DECL_FUNCTION_CODE (gimple_call_fndecl (stmt))) > + { > + case BUILT_IN_MEMCPY: > + case BUILT_IN_MEMMOVE: > + case BUILT_IN_MEMSET: > + { > + tree size = NULL_TREE; > + if (gimple_call_num_args (stmt) == 3) > + size = gimple_call_arg (stmt, 2); > + tree ptr = gimple_call_arg (stmt, 0); > + ao_ref_init_from_ptr_and_size (write, ptr, size); > + return true; > + } > + default: > + break; > + } > + } > + else if (is_gimple_assign (stmt)) > + { > + ao_ref_init (write, gimple_assign_lhs (stmt)); > + return true; > + } > + return false; > +} > + > +/* Given REF from the alias oracle, return TRUE if it is a valid > + memory reference for dead store elimination, false otherwise. > + > + In particular, the reference must have a known base, known maximum > + size, start at a byte offset and have a size that is one or more > + bytes. */ > + > +static bool > +valid_ao_ref_for_dse (ao_ref *ref) > +{ > + return (ao_ref_base (ref) > + && ref->max_size != -1 > + && (ref->offset % BITS_PER_UNIT) == 0 > + && (ref->size % BITS_PER_UNIT) == 0); > +} > + > +/* Clear any bytes written by STMT from the bitmap LIVE_BYTES. The base > + address written by STMT must match the one found in REF, which must > + have its base address previously initialized. > + > + This routine must be conservative. If we don't know the offset or > + actual size written, assume nothing was written.
*/ > + > +static void > +clear_bytes_written_by (bitmap live_bytes, gimple *stmt, ao_ref *ref) > +{ > + ao_ref write; > + if (!initialize_ao_ref_for_dse (stmt, &write)) > + return; > + > + /* Verify we have the same base memory address and that the write > + has a known size. If so, then clear the appropriate bytes in > + the LIVE_BYTES bitmap. */ > + if (valid_ao_ref_for_dse (&write) > + && write.base == ref->base > + && write.size == write.max_size) > + bitmap_clear_range (live_bytes, > + write.offset / BITS_PER_UNIT, > + write.size / BITS_PER_UNIT); > +} > + > +/* STMT initializes an object from COMPLEX_CST where one or more of the > + bytes written are dead stores. ORIG is the bitmap of bytes stored by > + STMT. LIVE is the bitmap of stores that are actually live. > + > + Attempt to rewrite STMT so that only the real or imaginary part of > + the object is actually stored. */ > + > +static void > +trim_complex_store (bitmap orig, bitmap live, gimple *stmt) > +{ > + bitmap dead = BITMAP_ALLOC (NULL); > + bitmap_and_compl (dead, orig, live); > + > + /* So we have to verify that either the real or imag part as a whole > + is dead. They are always the same size. Thus for one to be dead > + the number of live bytes would have to be the same as the number of > + dead bytes. */ > + if (bitmap_count_bits (live) == bitmap_count_bits (dead)) popcount is expensive, so is this really a short-cut? > + { > + /* Hopefully we have live bits that look like 0xff00 or 0xff > (assuming > + 8 bytes for the underlying real/imag part). If so we can optimize > + this case. */ > + if (bitmap_first_set_bit (live) == 0 > + && bitmap_first_set_bit (dead) == bitmap_count_bits (live)) Hmm, but does that properly handle byte-wise reads from it? Like reading the realpart plus the last byte from the imagpart? 
> + { > + /* TREE_REALPART is live */ > + tree x = TREE_REALPART (gimple_assign_rhs1 (stmt)); > + tree y = gimple_assign_lhs (stmt); > + y = build1 (REALPART_EXPR, TREE_TYPE (x), y); > + gimple_assign_set_lhs (stmt, y); > + gimple_assign_set_rhs1 (stmt, x); > + } > + else if (bitmap_first_set_bit (dead) == 0 > + && bitmap_first_set_bit (live) == bitmap_count_bits (dead)) > + { Likewise. > + /* TREE_IMAGPART is live */ > + tree x = TREE_IMAGPART (gimple_assign_rhs1 (stmt)); > + tree y = gimple_assign_lhs (stmt); > + y = build1 (IMAGPART_EXPR, TREE_TYPE (x), y); > + gimple_assign_set_lhs (stmt, y); > + gimple_assign_set_rhs1 (stmt, x); > + } > + /* Other cases indicate parts of both the real and imag subobjects > + are live. We do not try to optimize those cases. */ > + } > + BITMAP_FREE (dead); > +} > + > +/* STMT is a memory write where one or more bytes written are dead > + stores. ORIG is the bitmap of bytes stored by STMT. LIVE is the > + bitmap of stores that are actually live. > + > + Attempt to rewrite STMT so that it writes fewer memory locations. Right > + now we only support trimming at the start or end of the memory region. > + It's not clear how much there is to be gained by trimming from the > middle > + of the region. */ > + > +static void > +trim_partially_dead_store (bitmap orig, bitmap live, gimple *stmt) > +{ > + if (is_gimple_assign (stmt)) > + { > + switch (gimple_assign_rhs_code (stmt)) > + { > + case COMPLEX_CST: > + trim_complex_store (orig, live, stmt); so patch #1 only handles stores from COMPLEX_CST? Ok... > + break; > + default: > + break; > + } > + } > +} > > /* A helper of dse_optimize_stmt. > Given a GIMPLE_ASSIGN in STMT that writes to REF, find a candidate > @@ -79,9 +232,32 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, > gimple **use_stmt) > { > gimple *temp; > unsigned cnt = 0; > + bitmap live_bytes = NULL; > + bitmap orig_live_bytes = NULL; > > *use_stmt = NULL; > > + /* REF is a memory write. 
Go ahead and get its base, size, extent > + information and encode the bytes written into LIVE_BYTES. We can > + handle any case where we have a known base and maximum size. > + > + However, experimentation has shown that bit level tracking is not > + useful in practice, so we only track at the byte level. > + > + Furthermore, experimentation has shown that 99% of the cases > + that require byte tracking are 64 bytes or less. */ > + if (valid_ao_ref_for_dse (ref) > + && (ref->max_size / BITS_PER_UNIT > + <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE))) > + { > + live_bytes = BITMAP_ALLOC (NULL); > + orig_live_bytes = BITMAP_ALLOC (NULL); > + bitmap_set_range (live_bytes, > + ref->offset / BITS_PER_UNIT, > + ref->max_size / BITS_PER_UNIT); > + bitmap_copy (orig_live_bytes, live_bytes); So I'd use a once-per-pass allocated sbitmap here. I don't see why you need the orig_live_bytes bitmap though (just keep that implicitly by the known range?) > + } > + > /* Find the first dominated statement that clobbers (part of) the > memory stmt stores to with no intermediate statement that may use > part of the memory stmt stores. That is, find a store that may > @@ -164,7 +341,15 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, > gimple **use_stmt) > } > > if (fail) > - return false; > + { > + /* STMT might be partially dead and we may be able to reduce > + how many memory locations it stores into. */ > + if (live_bytes > + && !bitmap_equal_p (live_bytes, orig_live_bytes) > + && !gimple_clobber_p (stmt)) > + trim_partially_dead_store (orig_live_bytes, live_bytes, stmt); The actual transform in dse_possible_dead_store_p looks a bit misplaced. I see it's somehow convenient but then maybe returning an enum from this function might be cleaner. Well, I'm not too torn about this, so maybe just rename the function a bit (no good suggestion either). The rest of the patch (the infrastructure) looks reasonable. Richard.
> + return false; > + } > > /* If we didn't find any definition this means the store is dead > if it isn't a store to global reachable memory. In this case > @@ -177,12 +362,18 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, > gimple **use_stmt) > temp = stmt; > break; > } > + > + if (live_bytes && temp) > + clear_bytes_written_by (live_bytes, temp, ref); > } > - /* Continue walking until we reach a kill. */ > - while (!stmt_kills_ref_p (temp, ref)); > + /* Continue walking until we reach a full kill as a single statement > + or there are no more live bytes. */ > + while (!stmt_kills_ref_p (temp, ref) > + && !(live_bytes && bitmap_empty_p (live_bytes))); > > *use_stmt = temp; > - > + BITMAP_FREE (live_bytes); > + BITMAP_FREE (orig_live_bytes); > return true; > } > > @@ -214,6 +405,10 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi) > || TREE_CODE (gimple_assign_lhs (stmt)) != MEM_REF)) > return; > > + ao_ref ref; > + if (!initialize_ao_ref_for_dse (stmt, &ref)) > + return; > + > /* We know we have virtual definitions. We can handle assignments and > some builtin calls. 
*/ > if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL)) > @@ -225,12 +420,6 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi) > case BUILT_IN_MEMSET: > { > gimple *use_stmt; > - ao_ref ref; > - tree size = NULL_TREE; > - if (gimple_call_num_args (stmt) == 3) > - size = gimple_call_arg (stmt, 2); > - tree ptr = gimple_call_arg (stmt, 0); > - ao_ref_init_from_ptr_and_size (&ref, ptr, size); > if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt)) > return; > > @@ -244,6 +433,7 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi) > tree lhs = gimple_call_lhs (stmt); > if (lhs) > { > + tree ptr = gimple_call_arg (stmt, 0); > gimple *new_stmt = gimple_build_assign (lhs, ptr); > unlink_stmt_vdef (stmt); > if (gsi_replace (gsi, new_stmt, true)) > @@ -274,13 +464,8 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi) > if (operand_equal_p (gimple_assign_rhs1 (stmt), > gimple_assign_lhs (stmt), 0)) > use_stmt = stmt; > - else > - { > - ao_ref ref; > - ao_ref_init (&ref, gimple_assign_lhs (stmt)); > - if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt)) > - return; > - } > + else if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt)) > + return; > > /* Now we know that use_stmt kills the LHS of stmt. */ > >
On 12/16/2016 12:29 AM, Trevor Saunders wrote: > On Thu, Dec 15, 2016 at 06:54:43PM -0700, Jeff Law wrote: >> unsigned cnt = 0; >> + bitmap live_bytes = NULL; >> + bitmap orig_live_bytes = NULL; >> >> *use_stmt = NULL; >> >> + /* REF is a memory write. Go ahead and get its base, size, extent >> + information and encode the bytes written into LIVE_BYTES. We can >> + handle any case where we have a known base and maximum size. >> + >> + However, experimentation has shown that bit level tracking is not >> + useful in practice, so we only track at the byte level. >> + >> + Furthermore, experimentation has shown that 99% of the cases >> + that require byte tracking are 64 bytes or less. */ >> + if (valid_ao_ref_for_dse (ref) >> + && (ref->max_size / BITS_PER_UNIT >> + <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE))) >> + { >> + live_bytes = BITMAP_ALLOC (NULL); >> + orig_live_bytes = BITMAP_ALLOC (NULL); >> + bitmap_set_range (live_bytes, >> + ref->offset / BITS_PER_UNIT, >> + ref->max_size / BITS_PER_UNIT); >> + bitmap_copy (orig_live_bytes, live_bytes); > > would it maybe make sense to use sbitmaps since the length is known to > be short, and doesn't change after allocation? It would. I tend to default to bitmaps these days (odd since I was probably the largest user of sbitmaps for a long time). > >> @@ -164,7 +341,15 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt) >> } >> >> if (fail) >> - return false; >> + { >> + /* STMT might be partially dead and we may be able to reduce >> + how many memory locations it stores into. */ >> + if (live_bytes >> + && !bitmap_equal_p (live_bytes, orig_live_bytes) >> + && !gimple_clobber_p (stmt)) >> + trim_partially_dead_store (orig_live_bytes, live_bytes, stmt); >> + return false; > > shouldn't you free the bitmaps here? Yes. I think switching to an auto-sbitmap would resolve this as well. jeff
On 12/16/2016 06:57 AM, Richard Biener wrote: > > Apart from what Trevor says about using sbitmaps (try to avoid the initial > zeroing please) and the missed freeing (you can use auto_[s]bitmap?) > some comments below. In progress. We'll need a few routines for sbitmaps that don't currently exist, but nothing that should be hard to implement correctly. >> + >> +static void >> +trim_complex_store (bitmap orig, bitmap live, gimple *stmt) >> +{ >> + bitmap dead = BITMAP_ALLOC (NULL); >> + bitmap_and_compl (dead, orig, live); >> + >> + /* So we have to verify that either the real or imag part as a whole >> + is dead. They are always the same size. Thus for one to be dead >> + the number of live bytes would have to be the same as the number of >> + dead bytes. */ >> + if (bitmap_count_bits (live) == bitmap_count_bits (dead)) > > popcount is expensive, so is this really a short-cut? It's less about being a short cut and more about realizing both halves are the same size. Thus to be trimmable the number of live bytes must exactly equal the number of dead bytes. This works regardless of the size of the underlying components of the complex type. > >> + { >> + /* Hopefully we have live bits that look like 0xff00 or 0xff >> (assuming >> + 8 bytes for the underlying real/imag part). If so we can optimize >> + this case. */ >> + if (bitmap_first_set_bit (live) == 0 >> + && bitmap_first_set_bit (dead) == bitmap_count_bits (live)) > > Hmm, but does that properly handle byte-wise reads from it? Like > reading the realpart plus the last byte from the imagpart? It should. We've already verified that exactly half of the object is live and half is dead. So we just need to make sure that the live part is contiguous at the start or end of the object. Again, written this way we don't have to care about the size of the components of the COMPLEX_EXPR. You're right in that if we keep the test in this form we should avoid the redundant popcount calls. 
>> + >> +/* STMT is a memory write where one or more bytes written are dead >> + stores. ORIG is the bitmap of bytes stored by STMT. LIVE is the >> + bitmap of stores that are actually live. >> + >> + Attempt to rewrite STMT so that it writes fewer memory locations. Right >> + now we only support trimming at the start or end of the memory region. >> + It's not clear how much there is to be gained by trimming from the >> middle >> + of the region. */ >> + >> +static void >> +trim_partially_dead_store (bitmap orig, bitmap live, gimple *stmt) >> +{ >> + if (is_gimple_assign (stmt)) >> + { >> + switch (gimple_assign_rhs_code (stmt)) >> + { >> + case COMPLEX_CST: >> + trim_complex_store (orig, live, stmt); > > so patch #1 only handles stores from COMPLEX_CST? Ok... Patch #1 handles detecting of any aggregate store that is made fully dead by subsequent partial stores. It also handles trimming of a partially dead complex constant initialization. >> + if (valid_ao_ref_for_dse (ref) >> + && (ref->max_size / BITS_PER_UNIT >> + <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE))) >> + { >> + live_bytes = BITMAP_ALLOC (NULL); >> + orig_live_bytes = BITMAP_ALLOC (NULL); >> + bitmap_set_range (live_bytes, >> + ref->offset / BITS_PER_UNIT, >> + ref->max_size / BITS_PER_UNIT); >> + bitmap_copy (orig_live_bytes, live_bytes); > > So I'd use a once-per-pass allocated sbitmap here. I don't see why you need > the orig_live_bytes bitmap though (just keep that implicitely by the known > range?) That would require resizing the bitmap or biasing. While we have a consistent maximum size, ref->offset varies. And that's probably the one case that sparse bitmaps handle better than simple bitmaps. So I guess we'll have to bias everything. It's not particularly hard, but it can be error prone. > >> + } >> + >> /* Find the first dominated statement that clobbers (part of) the >> memory stmt stores to with no intermediate statement that may use >> part of the memory stmt stores. 
That is, find a store that may >> @@ -164,7 +341,15 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, >> gimple **use_stmt) >> } >> >> if (fail) >> - return false; >> + { >> + /* STMT might be partially dead and we may be able to reduce >> + how many memory locations it stores into. */ >> + if (live_bytes >> + && !bitmap_equal_p (live_bytes, orig_live_bytes) >> + && !gimple_clobber_p (stmt)) >> + trim_partially_dead_store (orig_live_bytes, live_bytes, stmt); > > The actual transform in dse_possible_dead_store_p looks a bit misplaced. > I see it's somehow convenient but then maybe returning a enum from this > function might be cleaner. Well, I'm not too torn about this, so maybe > just rename the function a bit (no good suggestion either). I pondered that as well. I don't have a strong opinion either way. I'll look into returning a tri-state (live, partially dead, dead). Jeff
On Thu, 15 Dec 2016, Jeff Law wrote: > This version attacks the problem by improving DSE to track stores to memory at > a byte level. That allows us to determine if a series of stores completely > covers an earlier store (thus making the earlier store dead). Question: suppose you have an assignment for a struct with padding, then stores for all the elements. Does it detect that the original assignment is dead (because there is no need to copy padding on struct assignment)?
On Fri, Dec 16, 2016 at 06:35:58PM +0000, Joseph Myers wrote: > On Thu, 15 Dec 2016, Jeff Law wrote: > > > This version attacks the problem by improving DSE to track stores to memory at > > a byte level. That allows us to determine if a series of stores completely > > covers an earlier store (thus making the earlier store dead). > > Question: suppose you have an assignment for a struct with padding, then > stores for all the elements. Does it detect that the original assignment > is dead (because there is no need to copy padding on struct assignment)? We fold memcpy into struct assignment early, does the same apply also to memcpy? Or shall we fold memcpy into struct assignment only when there is no padding (or find different IL representation of that)? Jakub
On 12/16/2016 11:35 AM, Joseph Myers wrote: > On Thu, 15 Dec 2016, Jeff Law wrote: > >> This version attacks the problem by improving DSE to track stores to memory at >> a byte level. That allows us to determine if a series of stores completely >> covers an earlier store (thus making the earlier store dead). > > Question: suppose you have an assignment for a struct with padding, then > stores for all the elements. Does it detect that the original assignment > is dead (because there is no need to copy padding on struct assignment)? It would not detect the original assignment as dead in that case as it does not know anything special about the padding bytes. So we'd likely end up just trimming the original assignment, particularly if it was done via a CONSTRUCTOR assignment or memset. This situation could be addressed by handling the padding bytes specially -- ie, just clear them from LIVE when LIVE is initialized. It'd likely "just work" then. Seems like it might be a reasonable follow-up as it'd hopefully make more of the original assignments fully dead rather than just trimmable. jeff
On 12/16/2016 11:43 AM, Jakub Jelinek wrote: > On Fri, Dec 16, 2016 at 06:35:58PM +0000, Joseph Myers wrote: >> On Thu, 15 Dec 2016, Jeff Law wrote: >> >>> This version attacks the problem by improving DSE to track stores to memory at >>> a byte level. That allows us to determine if a series of stores completely >>> covers an earlier store (thus making the earlier store dead). >> >> Question: suppose you have an assignment for a struct with padding, then >> stores for all the elements. Does it detect that the original assignment >> is dead (because there is no need to copy padding on struct assignment)? > > We fold memcpy into struct assignment early, does the same apply also to > memcpy? Or shall we fold memcpy into struct assignment only when there is > no padding (or find different IL representation of that)? So a memcpy would presumably cover the entire object, padding included. Subsequent component stores would not cover the padding. So we'd trim the head/tail of the memcpy. A memcpy folded to a structure assignment almost certainly covers the whole assignment when represented in an ao_ref. So it'd be no different -- except that I haven't written code to trim structure assignments (unless they're represented by a CONSTRUCTOR assignment). Regardless I would not suggest changing if/how/when we fold memcpy into a structure assignment. Jeff
On Fri, 16 Dec 2016, Jakub Jelinek wrote: > On Fri, Dec 16, 2016 at 06:35:58PM +0000, Joseph Myers wrote: > > On Thu, 15 Dec 2016, Jeff Law wrote: > > > > > This version attacks the problem by improving DSE to track stores to memory at > > > a byte level. That allows us to determine if a series of stores completely > > > covers an earlier store (thus making the earlier store dead). > > > > Question: suppose you have an assignment for a struct with padding, then > > stores for all the elements. Does it detect that the original assignment > > is dead (because there is no need to copy padding on struct assignment)? > > We fold memcpy into struct assignment early, does the same apply also to > memcpy? Or shall we fold memcpy into struct assignment only when there is > no padding (or find different IL representation of that)? There is at least an arguable case that memset needs to set padding to 0 (unlike initializers), and memcpy needs to copy it (unlike struct assignment). Atomic compare-exchange definitely needs to work on the whole representation including padding. However, when you store in a member of a structure (or union), that results in all the padding becoming undefined. So in the cases of memcpy / memset where this optimization is interesting (element stores follow the memcpy / memset), you can still optimize. (Note the desire for an option to avoid such padding-related optimizations, by treating padding for optimization purposes as if there were dummy fields there that can all be cleared by a {} initializer or memset: PR 77992.)
On 12/16/2016 06:57 AM, Richard Biener wrote: > On Fri, Dec 16, 2016 at 2:54 AM, Jeff Law <law@redhat.com> wrote: >> + { >> + /* STMT might be partially dead and we may be able to reduce >> + how many memory locations it stores into. */ >> + if (live_bytes >> + && !bitmap_equal_p (live_bytes, orig_live_bytes) >> + && !gimple_clobber_p (stmt)) >> + trim_partially_dead_store (orig_live_bytes, live_bytes, stmt); > > The actual transform in dse_possible_dead_store_p looks a bit misplaced. > I see it's somehow convenient but then maybe returning a enum from this > function might be cleaner. Well, I'm not too torn about this, so maybe > just rename the function a bit (no good suggestion either). So the problem with returning a tri-state is that we'd also need to return the bitmaps or trimming data so that our caller would know how to trim in the partially dead case. Certainly possible, though marginally more awkward from an implementation standpoint. Renaming seems appropriate since dse_possible_dead_store_p does more than just detect a dead store with these changes. Jeff
On 12/16/2016 06:57 AM, Richard Biener wrote: > On Fri, Dec 16, 2016 at 2:54 AM, Jeff Law <law@redhat.com> wrote: >> + /* REF is a memory write. Go ahead and get its base, size, extent >> + information and encode the bytes written into LIVE_BYTES. We can >> + handle any case where we have a known base and maximum size. >> + >> + However, experimentation has shown that bit level tracking is not >> + useful in practice, so we only track at the byte level. >> + >> + Furthermore, experimentation has shown that 99% of the cases >> + that require byte tracking are 64 bytes or less. */ >> + if (valid_ao_ref_for_dse (ref) >> + && (ref->max_size / BITS_PER_UNIT >> + <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE))) >> + { >> + live_bytes = BITMAP_ALLOC (NULL); >> + orig_live_bytes = BITMAP_ALLOC (NULL); >> + bitmap_set_range (live_bytes, >> + ref->offset / BITS_PER_UNIT, >> + ref->max_size / BITS_PER_UNIT); >> + bitmap_copy (orig_live_bytes, live_bytes); > > So I'd use a once-per-pass allocated sbitmap here. I don't see why you need > the orig_live_bytes bitmap though (just keep that implicitely by the known > range?) So if we use a once-per-pass allocated bitmap, that actually facilitates returning a tri-state from dse_possible_dead_store_p and moving the trimming into dse_optimize_stmt. ORIG_LIVE_BYTES was more convenience than anything -- we want it so that we can compute the dead bytes for trimming. But we can certainly compute it on-demand at that time. Jeff
On 12/16/2016 12:29 AM, Trevor Saunders wrote: > On Thu, Dec 15, 2016 at 06:54:43PM -0700, Jeff Law wrote: >> unsigned cnt = 0; >> + bitmap live_bytes = NULL; >> + bitmap orig_live_bytes = NULL; >> >> *use_stmt = NULL; >> >> + /* REF is a memory write. Go ahead and get its base, size, extent >> + information and encode the bytes written into LIVE_BYTES. We can >> + handle any case where we have a known base and maximum size. >> + >> + However, experimentation has shown that bit level tracking is not >> + useful in practice, so we only track at the byte level. >> + >> + Furthermore, experimentation has shown that 99% of the cases >> + that require byte tracking are 64 bytes or less. */ >> + if (valid_ao_ref_for_dse (ref) >> + && (ref->max_size / BITS_PER_UNIT >> + <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE))) >> + { >> + live_bytes = BITMAP_ALLOC (NULL); >> + orig_live_bytes = BITMAP_ALLOC (NULL); >> + bitmap_set_range (live_bytes, >> + ref->offset / BITS_PER_UNIT, >> + ref->max_size / BITS_PER_UNIT); >> + bitmap_copy (orig_live_bytes, live_bytes); > > would it maybe make sense to use sbitmaps since the length is known to > be short, and doesn't change after allocation? So a few interesting things have to be dealt with if we want to make this change. I already mentioned the need to bias based on ref->offset so that the range of bytes we're tracking is represented 0..size. While we know the length of the potential dead store we don't know the length of the subsequent stores that we're hoping make the original a dead store. Thus when we start clearing LIVE_BYTES based on those subsequent stores, we have to normalize those against the ref->offset of the original store. What's even more fun is sizing. Those subsequent stores may be considerably larger. Which means that a bitmap_clear_range call has to be a hell of a lot more careful when working with sbitmaps (we would just happily stomp all over memory in that case) whereas with a bitmap the right things will "just happen". 
On the positive side, since we've normalized the potential dead store's byte range to 0..size, computing trims is easier because we inherently know how many bits were originally set. So compute_trims becomes trivial and we can simplify trim_complex_store a bit as well. And, of course, we don't have a bitmap_{clear,set}_range or a bitmap_count_bits implementation for sbitmaps. It's all a bit of "ugh", but should be manageable. Jeff
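The rebasing-and-clamping Jeff describes can be sketched outside GCC. This is purely illustrative: the names and the byte-array stand-in for an sbitmap are mine, not the patch's. Byte 0 of the map is the candidate dead store's first byte (i.e. everything is biased by ref->offset), and a later store's range must be clamped before clearing, since fixed-size sbitmap operations do not bounds-check:

```c
#include <string.h>

#define MAX_TRACKED 256		/* mirrors the dse-max-object-size param */

/* Live bytes of the candidate dead store, normalized to 0..size-1.  */
typedef struct { unsigned char live[MAX_TRACKED]; long size; } live_map;

static void
live_map_init (live_map *m, long size)
{
  memset (m->live, 1, (size_t) size);
  m->size = size;
}

/* Clear the bytes written by a later store at absolute byte OFFSET of
   length LEN.  Rebase against the original store's ORIG_OFFSET and
   clamp: the later store may start before, and extend past, the
   tracked region.  */
static void
live_map_clear (live_map *m, long orig_offset, long offset, long len)
{
  long start = offset - orig_offset;
  long end = start + len;
  if (start < 0)
    start = 0;
  if (end > m->size)
    end = m->size;
  for (long i = start; i < end; i++)
    m->live[i] = 0;
}

/* The original store is fully dead when no live bytes remain.  */
static int
live_map_empty (const live_map *m)
{
  for (long i = 0; i < m->size; i++)
    if (m->live[i])
      return 0;
  return 1;
}
```

For example, an 8-byte store at byte offset 100 is killed by a later store covering [96, 106) plus one covering [104, 168), even though both later ranges spill outside the tracked region.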
On December 16, 2016 7:43:22 PM GMT+01:00, Jakub Jelinek <jakub@redhat.com> wrote: >On Fri, Dec 16, 2016 at 06:35:58PM +0000, Joseph Myers wrote: >> On Thu, 15 Dec 2016, Jeff Law wrote: >> >> > This version attacks the problem by improving DSE to track stores >to memory at >> > a byte level. That allows us to determine if a series of stores >completely >> > covers an earlier store (thus making the earlier store dead). >> >> Question: suppose you have an assignment for a struct with padding, >then >> stores for all the elements. Does it detect that the original >assignment >> is dead (because there is no need to copy padding on struct >assignment)? In GIMPLE struct assignment is equal to memcpy, so that info is lost after leaving frontend realm. >We fold memcpy into struct assignment early, does the same apply also >to >memcpy? Or shall we fold memcpy into struct assignment only when there >is >no padding (or find different IL representation of that)? If we want to change that we can use a char[] type for copying on GIMPLE. Richard. > Jakub
Jeff Law <law@redhat.com> writes: > This is the first of the 4 part patchkit to address deficiencies in our > DSE implementation. > > This patch addresses the P2 regression 33562 which has been a low > priority regression since gcc-4.3. To summarize, DSE no longer has the > ability to detect an aggregate store as dead if subsequent stores are > done in a piecemeal fashion. > > I originally tackled this by changing how we lower complex objects. > That was sufficient to address 33562, but was reasonably rejected. > > This version attacks the problem by improving DSE to track stores to > memory at a byte level. That allows us to determine if a series of > stores completely covers an earlier store (thus making the earlier store > dead). > > A useful side effect of this is we can detect when parts of a store are > dead and potentially rewrite the store. This patch implements that for > complex object initializations. While not strictly part of 33562, it's > so closely related that I felt it belongs as part of this patch. > > This originally limited the size of the tracked memory space to 64 > bytes. I bumped the limit after working through the CONSTRUCTOR and > mem* trimming patches. The 256 byte limit is still fairly arbitrary and > I wouldn't lose sleep if we throttled back to 64 or 128 bytes. FWIW (and shouldn't affect whether the patch goes in)... If SVE support is accepted for GCC 8 then we have the additional problem that the sizes of useful aggregates can be runtime invariants. For example, in the SVE equivalent of float64x2x2_t, it would be good to be able to detect that an assignment to the full structure is dead if we later assign to both of the individual vectors. Probably the most convenient way of doing that would be to track ranges rather than individual bytes, like the local part of the RTL DSE pass does. (It was relatively easy to convert local RTL DSE to variable-length modes, but the global part also uses byte bitmaps.) 
I guess bitmaps are going to be more efficient for small structures or for accesses that occur in an arbitrary order. But tracking ranges means adding at most one fixed-size range record per update, so I suppose the throttle would be on the number of records (= number of gaps + 1) rather than the size of the structure. Thanks, Richard
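Richard's gap-based alternative might look like the following sketch (entirely hypothetical code, not from any patch): the live area starts as one range, each later store punches a hole in it, and only a fully interior store creates a new record, so the record count grows by at most one per update:

```c
#include <assert.h>

#define MAX_GAPS 16

typedef struct { long start, end; } range;	/* half-open [start, end) */
typedef struct { range live[MAX_GAPS]; int n; } live_ranges;

static void
live_ranges_init (live_ranges *r, long size)
{
  r->live[0].start = 0;
  r->live[0].end = size;
  r->n = 1;
}

/* Remove [START, END) from the live set.  Only the fully-interior case
   adds a record, so each update grows the set by at most one.  */
static void
live_ranges_clear (live_ranges *r, long start, long end)
{
  for (int i = 0; i < r->n; i++)
    {
      range *g = &r->live[i];
      if (end <= g->start || start >= g->end)
	continue;				/* no overlap */
      if (start <= g->start && end >= g->end)
	{					/* fully covered: drop */
	  *g = r->live[--r->n];
	  i--;
	  continue;
	}
      if (start > g->start && end < g->end)
	{					/* interior: split in two */
	  assert (r->n < MAX_GAPS);
	  r->live[r->n].start = end;
	  r->live[r->n].end = g->end;
	  r->n++;
	  g->end = start;
	  continue;
	}
      if (start <= g->start)
	g->start = end;				/* trim the front */
      else
	g->end = start;				/* trim the back */
    }
}

static int
fully_dead (const live_ranges *r)
{
  return r->n == 0;
}
```

The throttle then naturally becomes MAX_GAPS rather than the object size, which is what would let this representation cope with runtime-invariant sizes like SVE vectors.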
On Sat, Dec 17, 2016 at 01:19:41AM -0700, Jeff Law wrote: > On 12/16/2016 12:29 AM, Trevor Saunders wrote: > > On Thu, Dec 15, 2016 at 06:54:43PM -0700, Jeff Law wrote: > > > unsigned cnt = 0; > > > + bitmap live_bytes = NULL; > > > + bitmap orig_live_bytes = NULL; > > > > > > *use_stmt = NULL; > > > > > > + /* REF is a memory write. Go ahead and get its base, size, extent > > > + information and encode the bytes written into LIVE_BYTES. We can > > > + handle any case where we have a known base and maximum size. > > > + > > > + However, experimentation has shown that bit level tracking is not > > > + useful in practice, so we only track at the byte level. > > > + > > > + Furthermore, experimentation has shown that 99% of the cases > > > + that require byte tracking are 64 bytes or less. */ > > > + if (valid_ao_ref_for_dse (ref) > > > + && (ref->max_size / BITS_PER_UNIT > > > + <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE))) > > > + { > > > + live_bytes = BITMAP_ALLOC (NULL); > > > + orig_live_bytes = BITMAP_ALLOC (NULL); > > > + bitmap_set_range (live_bytes, > > > + ref->offset / BITS_PER_UNIT, > > > + ref->max_size / BITS_PER_UNIT); > > > + bitmap_copy (orig_live_bytes, live_bytes); > > > > would it maybe make sense to use sbitmaps since the length is known to > > be short, and doesn't change after allocation? > So a few interesting things have to be dealt if we want to make this change. > I already mentioned the need to bias based on ref->offset so that the range > of bytes we're tracking is represented 0..size. > > While we know the length of the potential dead store we don't know the > length of the subsequent stores that we're hoping make the original a dead > store. Thus when we start clearing LIVE_BYTES based on those subsequent > stores, we have to normalize those against the ref->offset of the original > store. > > What's even more fun is sizing. Those subsequent stores may be considerably > larger. 
Which means that a bitmap_clear_range call has to be a hell of a > lot more careful when working with sbitmaps (we just happily stop all over > memory in that case) whereas a bitmap the right things will "just happen". > > On a positive size since we've normalized the potential dead store's byte > range to 0..size, it means computing trims is easier because we inherently > know how many bits were originally set. So compute_trims becomes trivial > and we can simplify trim_complex_store a bit as well. > > And, of course, we don't have a bitmap_{clear,set}_range or a > bitmap_count_bits implementation for sbitmaps. > > > It's all a bit of "ugh", but should be manageable. yeah, but that seems like enough work that you could reasonably stick with bitmap instead. Trev p.s. sorry I've been falling behind lately. > > Jeff
On 12/21/2016 06:43 AM, Trevor Saunders wrote: >> So a few interesting things have to be dealt if we want to make this change. >> I already mentioned the need to bias based on ref->offset so that the range >> of bytes we're tracking is represented 0..size. >> >> While we know the length of the potential dead store we don't know the >> length of the subsequent stores that we're hoping make the original a dead >> store. Thus when we start clearing LIVE_BYTES based on those subsequent >> stores, we have to normalize those against the ref->offset of the original >> store. >> >> What's even more fun is sizing. Those subsequent stores may be considerably >> larger. Which means that a bitmap_clear_range call has to be a hell of a >> lot more careful when working with sbitmaps (we just happily stop all over >> memory in that case) whereas a bitmap the right things will "just happen". >> >> On a positive size since we've normalized the potential dead store's byte >> range to 0..size, it means computing trims is easier because we inherently >> know how many bits were originally set. So compute_trims becomes trivial >> and we can simplify trim_complex_store a bit as well. >> >> And, of course, we don't have a bitmap_{clear,set}_range or a >> bitmap_count_bits implementation for sbitmaps. >> >> >> It's all a bit of "ugh", but should be manageable. > > yeah, but that seems like enough work that you could reasonably stick > with bitmap instead. Well, the conversion is done :-) It was largely as I expected with the devil being in the normalization details which are well isolated. > > p.s. sorry I've been falling behind lately. It happens. You might want to peek briefly at Aldy's auto_bitmap class as part of the 71691 patches. It looks like a fairly straightforward conversion to auto_*. I'm sure there's all kinds of places we could use it. Jeff
On 12/16/2016 12:29 AM, Trevor Saunders wrote: > On Thu, Dec 15, 2016 at 06:54:43PM -0700, Jeff Law wrote: >> unsigned cnt = 0; >> + bitmap live_bytes = NULL; >> + bitmap orig_live_bytes = NULL; >> >> *use_stmt = NULL; >> >> + /* REF is a memory write. Go ahead and get its base, size, extent >> + information and encode the bytes written into LIVE_BYTES. We can >> + handle any case where we have a known base and maximum size. >> + >> + However, experimentation has shown that bit level tracking is not >> + useful in practice, so we only track at the byte level. >> + >> + Furthermore, experimentation has shown that 99% of the cases >> + that require byte tracking are 64 bytes or less. */ >> + if (valid_ao_ref_for_dse (ref) >> + && (ref->max_size / BITS_PER_UNIT >> + <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE))) >> + { >> + live_bytes = BITMAP_ALLOC (NULL); >> + orig_live_bytes = BITMAP_ALLOC (NULL); >> + bitmap_set_range (live_bytes, >> + ref->offset / BITS_PER_UNIT, >> + ref->max_size / BITS_PER_UNIT); >> + bitmap_copy (orig_live_bytes, live_bytes); > > would it maybe make sense to use sbitmaps since the length is known to > be short, and doesn't change after allocation? New version will use auto_sbitmap and sbitmaps. Jeff
On 12/16/2016 06:57 AM, Richard Biener wrote: > > Apart from what Trevor says about using sbitmaps (try to avoid the initial > zeroing please) and the missed freeing (you can use auto_[s]bitmap?) > some comments below. New version uses sbitmaps and avoids zero-ing when we can. >> +static void >> +trim_complex_store (bitmap orig, bitmap live, gimple *stmt) >> +{ >> + bitmap dead = BITMAP_ALLOC (NULL); >> + bitmap_and_compl (dead, orig, live); >> + >> + /* So we have to verify that either the real or imag part as a whole >> + is dead. They are always the same size. Thus for one to be dead >> + the number of live bytes would have to be the same as the number of >> + dead bytes. */ >> + if (bitmap_count_bits (live) == bitmap_count_bits (dead)) > > popcount is expensive, so is this really a short-cut? Note this can all be done more efficiently. Using sbitmaps forces us to normalize the offset/size and I've dropped the original bitmap as the ao_ref carries the data we need. Those two changes make other simplifications fall out, and what we get in compute_trims is a last_set_bit and first_set_bit on the live bitmap -- no other bitmap traversals/searches/counts. That makes compute_trims more efficient than the code in trim_complex_store. So I've made trim_complex_store re-use the more general trimming code, and it's still agnostic of the underlying sizes, it just cares that the underlying components are the same size. > > The actual transform in dse_possible_dead_store_p looks a bit misplaced. > I see it's somehow convenient but then maybe returning a enum from this > function might be cleaner. Well, I'm not too torn about this, so maybe > just rename the function a bit (no good suggestion either). > > The rest of the patch (the infrastructure) looks reasonable. I think I mentioned it earlier, but moving towards a single allocated bitmap for the entire pass makes it much easier to have dse_possible_dead_store_p return a tri-state and live information. 
You'll see that change in the upcoming revision to the patchkit. Jeff
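The compute_trims Jeff describes reduces, once live bytes are normalized to 0..size-1, to finding the first and last live byte. A standalone sketch (the names and the byte-array representation are mine, modeled loosely on what the revised compute_trims is described as doing):

```c
/* Given a normalized live-byte map for a store of SIZE bytes, compute
   how many dead bytes can be trimmed from the head and from the tail.
   Interior dead bytes are deliberately ignored, matching the patch's
   head/tail-only trimming.  */
static void
compute_trims (const unsigned char *live, int size,
	       int *trim_head, int *trim_tail)
{
  int first = 0, last = size - 1;

  while (first < size && !live[first])
    first++;					/* first live byte */
  while (last >= 0 && !live[last])
    last--;					/* last live byte */

  *trim_head = (first < size) ? first : size;	/* fully dead: all head */
  *trim_tail = (last >= 0) ? size - 1 - last : 0;
}
```

For a live map like {0,0,1,1,1,0,0,0} this yields a head trim of 2 bytes and a tail trim of 3, with no per-bit counting anywhere.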
diff --git a/gcc/params.def b/gcc/params.def index 50f75a7..ddc3d65 100644 --- a/gcc/params.def +++ b/gcc/params.def @@ -532,6 +532,11 @@ DEFPARAM(PARAM_AVG_LOOP_NITER, "Average number of iterations of a loop.", 10, 1, 0) +DEFPARAM(PARAM_DSE_MAX_OBJECT_SIZE, + "dse-max-object-size", + "Maximum size (in bytes) of objects tracked by dead store elimination.", + 256, 0, 0) + DEFPARAM(PARAM_SCEV_MAX_EXPR_SIZE, "scev-max-expr-size", "Bound on size of expressions used in the scalar evolutions analyzer.", diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c index 87a2638..3155741 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-4.c @@ -10,4 +10,4 @@ int f(void) return g(&t); } -/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail *-*-* } } } */ +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c index e2cd403..e6d027f 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/complex-5.c @@ -8,4 +8,4 @@ int f(void) __imag__ t = 2; } -/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" { xfail *-*-* } } } */ +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c new file mode 100644 index 0000000..92b2df8 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-18.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ +int g(_Complex int*); +int f(void) +{ + _Complex int t = 0; + int i, j; + __imag__ t += 2; + return g(&t); +} + + +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" 
} } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c new file mode 100644 index 0000000..718b746 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-19.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ +int g(_Complex int*); +int f(void) +{ + _Complex int t = 0; + int i, j; + __real__ t += 2; + return g(&t); +} + + +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c new file mode 100644 index 0000000..4e14d9b --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-20.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-O1 -fno-tree-dce -fdump-tree-optimized" } */ +_Complex int t = 0; +int f(void) +{ + t = 0; + __imag__ t = 2; +} + +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c new file mode 100644 index 0000000..d1e0b85 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-21.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O1 -fno-tree-dce -fdump-tree-optimized" } */ +_Complex int t = 0; +int f(void) +{ + t = 0; + __real__ t = 2; +} + +/* { dg-final { scan-tree-dump-times "__complex__" 0 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "REALPART_EXPR" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "IMAGPART_EXPR" 1 "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c index 594c20c..ae48ddd 100644 --- 
a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-9.c @@ -11,4 +11,4 @@ foo () } /* We should eliminate the first assignment. */ -/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" { xfail *-*-* } } } */ +/* { dg-final { scan-tree-dump-times "VDEF" 2 "dse1" } } */ diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c index 778b363..eea185c 100644 --- a/gcc/tree-ssa-dse.c +++ b/gcc/tree-ssa-dse.c @@ -33,6 +33,7 @@ along with GCC; see the file COPYING3. If not see #include "tree-dfa.h" #include "domwalk.h" #include "tree-cfgcleanup.h" +#include "params.h" /* This file implements dead store elimination. @@ -68,6 +69,158 @@ along with GCC; see the file COPYING3. If not see remove their dead edges eventually. */ static bitmap need_eh_cleanup; +/* STMT is a statement that may write into memory. Analyze it and + initialize WRITE to describe how STMT affects memory. + + Return TRUE if the statement was analyzed, FALSE otherwise. + + It is always safe to return FALSE. But typically better optimization + can be achieved by analyzing more statements. */ + +static bool +initialize_ao_ref_for_dse (gimple *stmt, ao_ref *write) +{ + /* It's advantageous to handle certain mem* functions. */ + if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL)) + { + switch (DECL_FUNCTION_CODE (gimple_call_fndecl (stmt))) + { + case BUILT_IN_MEMCPY: + case BUILT_IN_MEMMOVE: + case BUILT_IN_MEMSET: + { + tree size = NULL_TREE; + if (gimple_call_num_args (stmt) == 3) + size = gimple_call_arg (stmt, 2); + tree ptr = gimple_call_arg (stmt, 0); + ao_ref_init_from_ptr_and_size (write, ptr, size); + return true; + } + default: + break; + } + } + else if (is_gimple_assign (stmt)) + { + ao_ref_init (write, gimple_assign_lhs (stmt)); + return true; + } + return false; +} + +/* Given REF from the alias oracle, return TRUE if it is a valid + memory reference for dead store elimination, false otherwise. 
+ + In particular, the reference must have a known base, known maximum + size, start at a byte offset and have a size that is one or more + bytes. */ + +static bool +valid_ao_ref_for_dse (ao_ref *ref) +{ + return (ao_ref_base (ref) + && ref->max_size != -1 + && (ref->offset % BITS_PER_UNIT) == 0 + && (ref->size % BITS_PER_UNIT) == 0); +} + +/* Clear any bytes written by STMT from the bitmap LIVE_BYTES. The base + address written by STMT must match the one found in REF, which must + have its base address previously initialized. + + This routine must be conservative. If we don't know the offset or + actual size written, assume nothing was written. */ + +static void +clear_bytes_written_by (bitmap live_bytes, gimple *stmt, ao_ref *ref) +{ + ao_ref write; + if (!initialize_ao_ref_for_dse (stmt, &write)) + return; + + /* Verify we have the same base memory address and that the write + has a known size. If so, then clear the appropriate bytes in + the LIVE_BYTES bitmap. */ + if (valid_ao_ref_for_dse (&write) + && write.base == ref->base + && write.size == write.max_size) + bitmap_clear_range (live_bytes, + write.offset / BITS_PER_UNIT, + write.size / BITS_PER_UNIT); +} + +/* STMT initializes an object from COMPLEX_CST where one or more of the + bytes written are dead stores. ORIG is the bitmap of bytes stored by + STMT. LIVE is the bitmap of stores that are actually live. + + Attempt to rewrite STMT so that only the real or imaginary part of + the object is actually stored. */ + +static void +trim_complex_store (bitmap orig, bitmap live, gimple *stmt) +{ + bitmap dead = BITMAP_ALLOC (NULL); + bitmap_and_compl (dead, orig, live); + + /* So we have to verify that either the real or imag part as a whole + is dead. They are always the same size. Thus for one to be dead + the number of live bytes would have to be the same as the number of + dead bytes. 
*/ + if (bitmap_count_bits (live) == bitmap_count_bits (dead)) + { + /* Hopefully we have live bits that look like 0xff00 or 0xff (assuming + 8 bytes for the underlying real/imag part). If so we can optimize + this case. */ + if (bitmap_first_set_bit (live) == 0 + && bitmap_first_set_bit (dead) == bitmap_count_bits (live)) + { + /* TREE_REALPART is live */ + tree x = TREE_REALPART (gimple_assign_rhs1 (stmt)); + tree y = gimple_assign_lhs (stmt); + y = build1 (REALPART_EXPR, TREE_TYPE (x), y); + gimple_assign_set_lhs (stmt, y); + gimple_assign_set_rhs1 (stmt, x); + } + else if (bitmap_first_set_bit (dead) == 0 + && bitmap_first_set_bit (live) == bitmap_count_bits (dead)) + { + /* TREE_IMAGPART is live */ + tree x = TREE_IMAGPART (gimple_assign_rhs1 (stmt)); + tree y = gimple_assign_lhs (stmt); + y = build1 (IMAGPART_EXPR, TREE_TYPE (x), y); + gimple_assign_set_lhs (stmt, y); + gimple_assign_set_rhs1 (stmt, x); + } + /* Other cases indicate parts of both the real and imag subobjects + are live. We do not try to optimize those cases. */ + } + BITMAP_FREE (dead); +} + +/* STMT is a memory write where one or more bytes written are dead + stores. ORIG is the bitmap of bytes stored by STMT. LIVE is the + bitmap of stores that are actually live. + + Attempt to rewrite STMT so that it writes fewer memory locations. Right + now we only support trimming at the start or end of the memory region. + It's not clear how much there is to be gained by trimming from the middle + of the region. */ + +static void +trim_partially_dead_store (bitmap orig, bitmap live, gimple *stmt) +{ + if (is_gimple_assign (stmt)) + { + switch (gimple_assign_rhs_code (stmt)) + { + case COMPLEX_CST: + trim_complex_store (orig, live, stmt); + break; + default: + break; + } + } +} /* A helper of dse_optimize_stmt. 
Given a GIMPLE_ASSIGN in STMT that writes to REF, find a candidate @@ -79,9 +232,32 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt) { gimple *temp; unsigned cnt = 0; + bitmap live_bytes = NULL; + bitmap orig_live_bytes = NULL; *use_stmt = NULL; + /* REF is a memory write. Go ahead and get its base, size, extent + information and encode the bytes written into LIVE_BYTES. We can + handle any case where we have a known base and maximum size. + + However, experimentation has shown that bit level tracking is not + useful in practice, so we only track at the byte level. + + Furthermore, experimentation has shown that 99% of the cases + that require byte tracking are 64 bytes or less. */ + if (valid_ao_ref_for_dse (ref) + && (ref->max_size / BITS_PER_UNIT + <= PARAM_VALUE (PARAM_DSE_MAX_OBJECT_SIZE))) + { + live_bytes = BITMAP_ALLOC (NULL); + orig_live_bytes = BITMAP_ALLOC (NULL); + bitmap_set_range (live_bytes, + ref->offset / BITS_PER_UNIT, + ref->max_size / BITS_PER_UNIT); + bitmap_copy (orig_live_bytes, live_bytes); + } + /* Find the first dominated statement that clobbers (part of) the memory stmt stores to with no intermediate statement that may use part of the memory stmt stores. That is, find a store that may @@ -164,7 +341,15 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt) } if (fail) - return false; + { + /* STMT might be partially dead and we may be able to reduce + how many memory locations it stores into. */ + if (live_bytes + && !bitmap_equal_p (live_bytes, orig_live_bytes) + && !gimple_clobber_p (stmt)) + trim_partially_dead_store (orig_live_bytes, live_bytes, stmt); + return false; + } /* If we didn't find any definition this means the store is dead if it isn't a store to global reachable memory. 
In this case @@ -177,12 +362,18 @@ dse_possible_dead_store_p (ao_ref *ref, gimple *stmt, gimple **use_stmt) temp = stmt; break; } + + if (live_bytes && temp) + clear_bytes_written_by (live_bytes, temp, ref); } - /* Continue walking until we reach a kill. */ - while (!stmt_kills_ref_p (temp, ref)); + /* Continue walking until we reach a full kill as a single statement + or there are no more live bytes. */ + while (!stmt_kills_ref_p (temp, ref) + && !(live_bytes && bitmap_empty_p (live_bytes))); *use_stmt = temp; - + BITMAP_FREE (live_bytes); + BITMAP_FREE (orig_live_bytes); return true; } @@ -214,6 +405,10 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi) || TREE_CODE (gimple_assign_lhs (stmt)) != MEM_REF)) return; + ao_ref ref; + if (!initialize_ao_ref_for_dse (stmt, &ref)) + return; + /* We know we have virtual definitions. We can handle assignments and some builtin calls. */ if (gimple_call_builtin_p (stmt, BUILT_IN_NORMAL)) @@ -225,12 +420,6 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi) case BUILT_IN_MEMSET: { gimple *use_stmt; - ao_ref ref; - tree size = NULL_TREE; - if (gimple_call_num_args (stmt) == 3) - size = gimple_call_arg (stmt, 2); - tree ptr = gimple_call_arg (stmt, 0); - ao_ref_init_from_ptr_and_size (&ref, ptr, size); if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt)) return; @@ -244,6 +433,7 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi) tree lhs = gimple_call_lhs (stmt); if (lhs) { + tree ptr = gimple_call_arg (stmt, 0); gimple *new_stmt = gimple_build_assign (lhs, ptr); unlink_stmt_vdef (stmt); if (gsi_replace (gsi, new_stmt, true)) @@ -274,13 +464,8 @@ dse_optimize_stmt (gimple_stmt_iterator *gsi) if (operand_equal_p (gimple_assign_rhs1 (stmt), gimple_assign_lhs (stmt), 0)) use_stmt = stmt; - else - { - ao_ref ref; - ao_ref_init (&ref, gimple_assign_lhs (stmt)); - if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt)) - return; - } + else if (!dse_possible_dead_store_p (&ref, stmt, &use_stmt)) + return; /* Now we know that use_stmt 
kills the LHS of stmt. */
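The decision rule in trim_complex_store above can be restated compactly: with live bytes normalized so the real part occupies the low half and the imaginary part the high half, the store is trimmable exactly when the live set is one whole half. A bitmask sketch of just that rule (illustrative only; names are mine, and the mask form limits it to halves of at most 16 bytes so everything fits in 32 bits):

```c
enum trim_kind { TRIM_NONE, TRIM_TO_REAL, TRIM_TO_IMAG };

/* LIVE_MASK has bit i set iff byte i of the complex object is still
   live; HALF is the size in bytes of each of the real/imag parts.  */
static enum trim_kind
choose_part (unsigned live_mask, unsigned half)
{
  unsigned real_mask = (1u << half) - 1;	/* low half: real part */
  unsigned imag_mask = real_mask << half;	/* high half: imag part */

  if (live_mask == real_mask)
    return TRIM_TO_REAL;			/* store only __real__ */
  if (live_mask == imag_mask)
    return TRIM_TO_IMAG;			/* store only __imag__ */
  return TRIM_NONE;				/* parts of both live */
}
```

So for a complex with 4-byte components, a live mask of 0x0F trims the store to the real part, 0xF0 to the imaginary part, and anything mixing the two halves is left alone, exactly mirroring the first_set_bit tests in the patch.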