[PR tree-optimization/79090] Fix two minor DSE bugs

Message ID bd59c1d4-0628-58dc-7505-c396ab87c542@redhat.com
State New

Commit Message

Jeff Law Jan. 15, 2017, 9:34 a.m. UTC
At one time I know I had the max_size == size test in 
valid_ao_ref_for_dse.  But it got lost at some point.  This is what 
caused the Ada failure.

Technically it'd be OK for the potentially dead store to have a variable 
size as long as the later stores covered the entire range of the 
potentially dead store.  I doubt this happens enough to be worth checking.
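
For illustration only (not the testcase from the PR), a store whose
exact extent the alias oracle can't pin down -- say a store through a
variable array index -- yields an ao_ref where max_size covers the
whole object while size is just the piece accessed, so the restored
max_size == size check rejects it:

struct S { char buf[16]; } s;

void
f (int i)
{
  /* Assuming the usual get_ref_base_and_extent behavior, the ao_ref
     for this store has size = 8 bits but max_size = 128 bits.  The
     later memset covers the entire possible range, so deleting this
     store would be safe, but the check conservatively punts.  */
  s.buf[i] = 0;
  __builtin_memset (&s, 0, sizeof s);
}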

The ppc64 big endian failures were more interesting.  We had this in the IL:

memmove (dst, src, 0)

The trimming code assumes that there's at least one live byte in the 
store, which obviously isn't the case here.  The net result is we 
compute an incorrect trim and the copy goes wild with incorrect 
addresses and lengths.  This is trivial to fix by validating that the 
store has a nonzero length.
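
To see where it goes wrong, look at the tail-trim computation in
compute_trims (the same lines the patch touches below).  With a
zero-sized ref no bit in LIVE is ever set, so the positions the
arithmetic works from are meaningless -- a sketch of the failure mode:

  int last_orig = (ref->size / BITS_PER_UNIT) - 1;  /* -1 when size == 0 */
  int last_live = bitmap_last_set_bit (live);       /* no bit is set, so
                                                       this is not a real
                                                       bit position */
  *trim_tail = (last_orig - last_live) & ~0x1;      /* bogus trim */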

I was a bit curious how often this happened in practice because such a 
call is trivially dead.  ~80 during a bootstrap and a few dozen in the 
testsuite.  Given how trivial it is to detect and optimize, this patch 
includes removal of such calls.  This hunk makes the check for zero size 
in valid_ao_ref_for_dse redundant, but I'd like to keep the check -- if 
we add more builtin support without filtering zero size we'd regress again.
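
A minimal example of the kind of call that is now deleted outright
(hand-written for illustration, not the reduced test from the PR):

void
f (char *dst, const char *src)
{
  /* Explicit zero length: trivially dead, so dse_optimize_stmt now
     deletes the call before any trim analysis runs.  */
  __builtin_memmove (dst, src, 0);
}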


I wanted to get this in tonight given the Ada and ppc64 breakage, so I
didn't do testcase reductions for the ppc test or the zero length
optimization (ACATS covers the Ada breakage).  I'll get testcases done
Monday.  I'll also finish generating suitable baselines for
ppc64-linux-gnu over the rest of the weekend to verify whether this
fixes all the ppc64 regressions or just a subset of them.

Bootstrapped on x86_64-linux-gnu and ppc64-linux-gnu.  Regression tested
on x86_64-linux-gnu.  Also verified the ppc64 test results are improved
(anything where sel-sched was faulting ought to be fixed now, maybe
others).  Installing on the trunk.

Jeff
commit 04c5d9cab65e2f5b219f9610621bea06173b9cb8
Author: Jeff Law <law@redhat.com>
Date:   Sun Jan 15 01:37:32 2017 -0700

    	PR tree-optimization/79090
    	* tree-ssa-dse.c (valid_ao_ref_for_dse): Reject zero length and
    	variable length stores.
    	(compute_trims): Delete dead assignment to *trim_tail.
    	(dse_dom_walker::dse_optimize_stmt): Optimize mem* calls with
    	zero length.

Comments

Richard Biener Jan. 16, 2017, 8:51 a.m. UTC | #1
On Sun, Jan 15, 2017 at 10:34 AM, Jeff Law <law@redhat.com> wrote:
> [...]
> The ppc64 big endian failures were more interesting.  We had this in the IL:
>
> memmove (dst, src, 0)
>
> The trimming code assumes that there's at least one live byte in the store,
> which obviously isn't the case here.  The net result is we compute an
> incorrect trim and the copy goes wild with incorrect addresses and lengths.
> This is trivial to fix by validating that the store has a nonzero length.
>
> I was a bit curious how often this happened in practice because such a call
> is trivially dead.  ~80 during a bootstrap and a few dozen in the testsuite.
> Given how trivial it is to detect and optimize, this patch includes removal
> of such calls.  This hunk makes the check for zero size in
> valid_ao_ref_for_dse redundant, but I'd like to keep the check -- if we add
> more builtin support without filtering zero size we'd regress again.

Interesting - we do fold memset (..., 0) away, so this means we have an
unfolded memset stmt in the IL before DSE.

Jeff Law Jan. 16, 2017, 6:27 p.m. UTC | #2
On 01/16/2017 01:51 AM, Richard Biener wrote:
> On Sun, Jan 15, 2017 at 10:34 AM, Jeff Law <law@redhat.com> wrote:
>> [...]
>
> Interesting - we do fold memset (..., 0) away, so this means we have an
> unfolded memset stmt in the IL before DSE.
It's actually exposed by fre3, both in the original test and in the 
reduced testcase.  In the reduced testcase we have this just prior to FRE3:

;;   basic block 3, loop depth 0, count 0, freq 7326, maybe hot
;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
;;    pred:       2 [73.3%]  (TRUE_VALUE,EXECUTABLE)
   _3 = MEM[(const struct vec *)_4].m_num;
   if (_3 != 0)
     goto <bb 4>; [36.64%]
   else
     goto <bb 5>; [63.36%]
;;    succ:       4 [36.6%]  (TRUE_VALUE,EXECUTABLE)
;;                5 [63.4%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 4, loop depth 0, count 0, freq 2684, maybe hot
;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
;;    pred:       3 [36.6%]  (TRUE_VALUE,EXECUTABLE)
   _6 = vec_av_set.m_vec;
   _7 = _6->m_num;
   _8 = _7 - _3;
   _6->m_num = _8;
   _9 = (long unsigned int) _8;
   _10 = _9 * 4;
   slot.2_11 = slot;
   dest.3_12 = dest;
   memmove (dest.3_12, slot.2_11, _10);
;;    succ:       5 [100.0%]  (FALLTHRU,EXECUTABLE)


_3 has the value _6->m_num.  Thus _8 will have the value 0, which in 
turn makes _10 have the value zero as seen in the .fre3 dump:

;;   basic block 4, loop depth 0, count 0, freq 2684, maybe hot
;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
;;    pred:       3 [36.6%]  (TRUE_VALUE,EXECUTABLE)
   _4->m_num = 0;
   slot.2_11 = slot;
   dest.3_12 = dest;
   memmove (dest.3_12, slot.2_11, 0);

In the full test it's similar.
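
In source terms the shape is roughly this (a hand-written sketch of the
pattern, not the actual reduced testcase):

  int n = v->m_num;                /* _3 */
  if (n != 0)
    {
      int m = v->m_num;            /* _7: FRE proves m == n */
      v->m_num = m - n;            /* folds to v->m_num = 0 */
      /* The length operand folds to zero as well.  */
      memmove (dest, slot, (m - n) * sizeof (int));
    }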

I don't know if you want to try and catch this in FRE though.

If we detect this in DCE (where it makes reasonable sense) rather than
in DSE, then we detect the dead mem* about 17 passes earlier and the
dead argument setup about 20 passes earlier.  In the testcase I looked
at, I didn't see additional secondary optimizations enabled, but I could
imagine cases where it might happen.  Seems like a gcc-8 thing though.

Jeff
Richard Biener Jan. 16, 2017, 10:36 p.m. UTC | #3
On January 16, 2017 7:27:53 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>[...]
>I don't know if you want to try and catch this in FRE though.

Ah, I think I have had patches for this in my tree for a long time...  We're folding calls in a restricted way for some historical reason.

>If we detect this in DCE (where it makes reasonable sense) rather than
>in DSE, then we detect the dead mem* about 17 passes earlier and the
>dead argument setup about 20 passes earlier.  In the testcase I looked
>at, I didn't see additional secondary optimizations enabled, but I could
>imagine cases where it might happen.  Seems like a gcc-8 thing though.

I'll give it a quick look tomorrow.

Richard.

Richard Biener Jan. 17, 2017, 9:15 a.m. UTC | #4
On Mon, Jan 16, 2017 at 11:36 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> [...]
> I'll give it a quick look tomorrow.

The comment before fold_stmt_inplace no longer applies (but it seems I
simply forgot to push this change...).  It's better to not keep unfolded
stmts around, so I'll commit this as the last bit of stage3 if testing
is fine.
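
The change is along these lines (a sketch only, not the committed hunk):
rather than folding just a restricted set of builtins during
elimination, run the generic folder on calls, so a memmove whose length
FRE has just proven zero is cleaned up on the spot:

  /* In eliminate_dom_walker::before_dom_children, when visiting a call
     statement, fold it like any other stmt.  */
  if (is_gimple_call (stmt))
    fold_stmt (&gsi);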

Bootstrap / regtest on x86_64-unknown-linux-gnu in progress.

Richard.

2017-01-17  Richard Biener  <rguenther@suse.de>

        * tree-ssa-pre.c (eliminate_dom_walker::before_dom_children):
        Fold calls regularly.

        * gcc.dg/tree-ssa/ssa-fre-57.c: New testcase.
Jeff Law Jan. 17, 2017, 2:42 p.m. UTC | #5
On 01/17/2017 02:15 AM, Richard Biener wrote:
> [...]
>
> 2017-01-17  Richard Biener  <rguenther@suse.de>
>
>         * tree-ssa-pre.c (eliminate_dom_walker::before_dom_children):
>         Fold calls regularly.
>
>         * gcc.dg/tree-ssa/ssa-fre-57.c: New testcase.
Note you'll need the trivial update to the new ssa-dse testcase as it 
verifies removal of the dead memmove.

jeff
Richard Biener Jan. 18, 2017, 12:41 p.m. UTC | #6
On Tue, Jan 17, 2017 at 3:42 PM, Jeff Law <law@redhat.com> wrote:
> [...]
> Note you'll need the trivial update to the new ssa-dse testcase as it
> verifies removal of the dead memmove.

Yep, the patch had other fallout so I'm postponing it for GCC 8.

Richard.


Patch

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 746f6af..36982c6 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,12 @@ 
+2017-01-14  Jeff Law  <law@redhat.com>
+
+	PR tree-optimization/79090
+	* tree-ssa-dse.c (valid_ao_ref_for_dse): Reject zero length and
+	variable length stores.
+	(compute_trims): Delete dead assignment to *trim_tail.
+	(dse_dom_walker::dse_optimize_stmt): Optimize mem* calls with
+	zero length.
+
 2017-01-14  Bernd Schmidt  <bschmidt@redhat.com>
 
 	PR rtl-optimization/78626
diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
index 2e6f8ff..65dcb12 100644
--- a/gcc/tree-ssa-dse.c
+++ b/gcc/tree-ssa-dse.c
@@ -129,6 +129,8 @@  valid_ao_ref_for_dse (ao_ref *ref)
 {
   return (ao_ref_base (ref)
 	  && ref->max_size != -1
+	  && ref->size != 0
+	  && ref->max_size == ref->size
 	  && (ref->offset % BITS_PER_UNIT) == 0
 	  && (ref->size % BITS_PER_UNIT) == 0
 	  && (ref->size != -1));
@@ -221,7 +223,6 @@  compute_trims (ao_ref *ref, sbitmap live, int *trim_head, int *trim_tail)
      the REF to compute the trims.  */
 
   /* Now identify how much, if any of the tail we can chop off.  */
-  *trim_tail = 0;
   int last_orig = (ref->size / BITS_PER_UNIT) - 1;
   int last_live = bitmap_last_set_bit (live);
   *trim_tail = (last_orig - last_live) & ~0x1;
@@ -700,6 +701,16 @@  dse_dom_walker::dse_optimize_stmt (gimple_stmt_iterator *gsi)
 	  case BUILT_IN_MEMMOVE:
 	  case BUILT_IN_MEMSET:
 	    {
+	      /* Occasionally calls with an explicit length of zero
+		 show up in the IL.  It's pointless to do analysis
+		 on them, they're trivially dead.  */
+	      tree size = gimple_call_arg (stmt, 2);
+	      if (integer_zerop (size))
+		{
+		  delete_dead_call (gsi);
+		  return;
+		}
+
 	      gimple *use_stmt;
 	      enum dse_store_status store_status;
 	      m_byte_tracking_enabled