diff mbox

Run DCE after if conversion

Message ID 20150310075708.GA7213@maggie
State New
Headers show

Commit Message

Andreas Krebbel March 10, 2015, 7:57 a.m. UTC
Hi,

the combine pass sometimes gets confused by already dead compares
which are remains of the if conversion pass.  This e.g. happens in
gcc.dg/builtin-bswap-7.c.  Compiling with -march=z196 the if blocks
are modified to make use of load on condition. This duplicates the
compare insn but unfortunately ifcvt fails to clean it up and the
additional compare survives until the second ifcvt pass. Combine fails
to do the simplification of the bswap since the bswapped value appears
to be used in the second compare as well.

ifcvt runs df_analyze in a loop until nothing changes anymore. So this
loop is always left with all df solutions being clean. However, dce is
only run once, before the first iteration.

The attached patch fixes the builtin-bswap-7.c testcase for s390x
(-march=z196) but is probably helpful in other situations as well.

Ok?

Bye,

-Andreas-


2015-03-10  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>

	* gcc/ifcvt.c (if_convert):

Comments

Steven Bosscher March 10, 2015, 9:12 a.m. UTC | #1
On Tue, Mar 10, 2015 at 8:57 AM, Andreas Krebbel wrote:
>
>         * gcc/ifcvt.c (if_convert):
>

...yes...?


> diff --git a/gcc/ifcvt.c b/gcc/ifcvt.c
> index a3e3e5c..d2040af 100644
> --- a/gcc/ifcvt.c
> +++ b/gcc/ifcvt.c
> @@ -4626,6 +4626,13 @@ if_convert (bool after_combine)
>                num_true_changes);
>      }
>
> +  if (num_true_changes > 0)
> +    {
> +      df_set_flags (DF_LR_RUN_DCE);
> +      df_mark_solutions_dirty ();
> +      df_analyze ();
> +    }
> +
>   if (optimize == 1)
>     df_remove_problem (df_live);

Tiny nail, huge hammer. This triggers a full re-scan of all insns and
a re-calculation of all dataflow problems.

The transformations in ifcvt are all simple enough that it should be
possible to just clean up that redundant insn at the site where the
code transformation is performed.

Ciao!
Steven
Andreas Krebbel March 10, 2015, 9:19 a.m. UTC | #2
On 03/10/2015 10:12 AM, Steven Bosscher wrote:
> On Tue, Mar 10, 2015 at 8:57 AM, Andreas Krebbel wrote:
>>
>>         * gcc/ifcvt.c (if_convert):
>>
> 
> ...yes...?

Damn. mklog is still not able to do the complete job for me ;)

> Tiny nail, huge hammer. This triggers a full re-scan of all insns and
> a re-calculation of all dataflow problems.
> 
> The transformations in ifcvt are all simple enough that it should be
> possible to just clean up that redundant insn at the site where the
> code transformation is performed.

Agreed. I was hoping to get this low risk version into 5.0.

Bye,

-Andreas-
Richard Biener March 10, 2015, 10:27 a.m. UTC | #3
On Tue, Mar 10, 2015 at 10:19 AM, Andreas Krebbel
<krebbel@linux.vnet.ibm.com> wrote:
> On 03/10/2015 10:12 AM, Steven Bosscher wrote:
>> On Tue, Mar 10, 2015 at 8:57 AM, Andreas Krebbel wrote:
>>>
>>>         * gcc/ifcvt.c (if_convert):
>>>
>>
>> ...yes...?
>
> Damn. mklog is still not able to do the complete job for me ;)
>
>> Tiny nail, huge hammer. This triggers a full re-scan of all insns and
>> a re-calculation of all dataflow problems.
>>
>> The transformations in ifcvt are all simple enough that it should be
>> possible to just clean up that redundant insn at the site where the
>> code transformation is performed.
>
> Agreed. I was hoping to get this low risk version into 5.0.

Low risk?  High risk of heavy compile-time regression I'd say
(make a testcase that triggers a lot of iteration).

Is this fixing a regression in some way?

Richard.

> Bye,
>
> -Andreas-
>
>
Andreas Krebbel March 10, 2015, 12:22 p.m. UTC | #4
On 03/10/2015 11:27 AM, Richard Biener wrote:
....
> Is this fixing a regression in some way?

Not really. The optimization supposed to fold the bswap in that case is not that old:
https://gcc.gnu.org/ml/gcc-patches/2013-05/msg01378.html
The underlying problem however is probably visible in one way or the other since we have "load on
condition" instructions introduced with z196 (GCC 4.6).

-Andreas-
Andreas Krebbel March 17, 2015, 8:17 a.m. UTC | #5
On 03/10/2015 11:27 AM, Richard Biener wrote:
> On Tue, Mar 10, 2015 at 10:19 AM, Andreas Krebbel
> <krebbel@linux.vnet.ibm.com> wrote:
>> On 03/10/2015 10:12 AM, Steven Bosscher wrote:
>>> On Tue, Mar 10, 2015 at 8:57 AM, Andreas Krebbel wrote:
>>>>
>>>>         * gcc/ifcvt.c (if_convert):
>>>>
>>>
>>> ...yes...?
>>
>> Damn. mklog is still not able to do the complete job for me ;)
>>
>>> Tiny nail, huge hammer. This triggers a full re-scan of all insns and
>>> a re-calculation of all dataflow problems.
>>>
>>> The transformations in ifcvt are all simple enough that it should be
>>> possible to just clean up that redundant insn at the site where the
>>> code transformation is performed.
>>
>> Agreed. I was hoping to get this low risk version into 5.0.
> 
> Low risk?  High risk of heavy compile-time regression I'd say
> (make a testcase that triggers a lot of iteration).

Just to have some numbers I did run a -j1 GCC bootstrap twice with and without the patch on x86_64.
Best results for both are:

clean:   21459s
patched: 21314s

There rather appears to be a trend towards reduced compile time perhaps due to the reduced number of
INSNs to be processed in the RTL passes between the two ifcvt runs (loop optimization, combine,
fwprop, dse,...)?!

I also tried to measure the testsuite runs but the results show a big variance. So what I have right
now does not qualify as a benchmark.

Bye,

-Andreas-

> 
> Is this fixing a regression in some way?
> 
> Richard.
> 
>> Bye,
>>
>> -Andreas-
>>
>>
>
Jeff Law March 17, 2015, 6:29 p.m. UTC | #6
On 03/17/2015 02:17 AM, Andreas Krebbel wrote:
>
> Just to have some numbers I did run a -j1 GCC bootstrap twice with and without the patch on x86_64.
> Best results for both are:
>
> clean:   21459s
> patched: 21314s
>
> There rather appears to be a trend towards reduced compile time perhaps due to the reduced number of
> INSNs to be processed in the RTL passes between the two ifcvt runs (loop optimization, combine,
> fwprop, dse,...)?!
>
> I also tried to measure the testsuite runs but the results show a big variance. So what I have right
> now does not qualify as a benchmark.
And reality is it's getting harder and harder to benchmark this kind of 
thing with turbo modes and such.    A single run isn't sufficient unless 
you've locked the box into a particular cpu frequency.

jeff
>
Richard Biener March 18, 2015, 11:04 a.m. UTC | #7
On Tue, Mar 17, 2015 at 7:29 PM, Jeff Law <law@redhat.com> wrote:
> On 03/17/2015 02:17 AM, Andreas Krebbel wrote:
>>
>>
>> Just to have some numbers I did run a -j1 GCC bootstrap twice with and
>> without the patch on x86_64.
>> Best results for both are:
>>
>> clean:   21459s
>> patched: 21314s
>>
>> There rather appears to be a trend towards reduced compile time perhaps
>> due to the reduced number of
>> INSNs to be processed in the RTL passes between the two ifcvt runs (loop
>> optimization, combine,
>> fwprop, dse,...)?!
>>
>> I also tried to measure the testsuite runs but the results show a big
>> variance. So what I have right
>> now does not qualify as a benchmark.
>
> And reality is it's getting harder and harder to benchmark this kind of
> thing with turbo modes and such.    A single run isn't sufficient unless
> you've locked the box into a particular cpu frequency.

For the particular patch I wonder if you really need to change all
three if-conversion pass instances or if changing the one before
combine (pass_rtl_ifcvt, thus rest_of_handle_if_conversion) is enough.

That already runs an unconditonal (huh...) cleanup_cfg (0) at the end
which could be changed so that DCE is performed (CLEANUP_EXPENSIVE,
runs delete_trivially_dead_insns).

At least that makes the patch smaller and its impact restricted to
one of the three ifcvt passes.

OTOH ifcvt performs a DCE at its start (to be not confused by dead
instructions I guess), so why doesn't combine do that as well
(oh, it does!?)?  And maybe _that_ DCE can be removed as if_convert ()
already performs a DF_LR_RUN_DCE on the first pass.

Richard.

> jeff
>>
>>
>
Andreas Krebbel March 20, 2015, 8:50 a.m. UTC | #8
On 03/18/2015 12:04 PM, Richard Biener wrote:
> On Tue, Mar 17, 2015 at 7:29 PM, Jeff Law <law@redhat.com> wrote:
>> On 03/17/2015 02:17 AM, Andreas Krebbel wrote:
>>>
>>>
>>> Just to have some numbers I did run a -j1 GCC bootstrap twice with and
>>> without the patch on x86_64.
>>> Best results for both are:
>>>
>>> clean:   21459s
>>> patched: 21314s
>>>
>>> There rather appears to be a trend towards reduced compile time perhaps
>>> due to the reduced number of
>>> INSNs to be processed in the RTL passes between the two ifcvt runs (loop
>>> optimization, combine,
>>> fwprop, dse,...)?!
>>>
>>> I also tried to measure the testsuite runs but the results show a big
>>> variance. So what I have right
>>> now does not qualify as a benchmark.
>>
>> And reality is it's getting harder and harder to benchmark this kind of
>> thing with turbo modes and such.    A single run isn't sufficient unless
>> you've locked the box into a particular cpu frequency.
> 
> For the particular patch I wonder if you really need to change all
> three if-conversion pass instances or if changing the one before
> combine (pass_rtl_ifcvt, thus rest_of_handle_if_conversion) is enough.
Right. For this particular case it would be good enough to do it only in the first ifcvt run. But
perhaps there are cases where later passes get confused by the leftovers from ifcvt?

> That already runs an unconditonal (huh...) cleanup_cfg (0) at the end
> which could be changed so that DCE is performed (CLEANUP_EXPENSIVE,
> runs delete_trivially_dead_insns).
> 
> At least that makes the patch smaller and its impact restricted to
> one of the three ifcvt passes.
This does not seem to work. The DCE run by cleanup_cfg only deals with dead pseudos. In my case it
is a set to the CC hard reg which becomes dead.

> OTOH ifcvt performs a DCE at its start (to be not confused by dead
> instructions I guess), so why doesn't combine do that as well
> (oh, it does!?)?  
When reaching combine the LR solution is clean and therefore also DCE isn't performed. This is
because ifcvt disables running DCE with LR. So LR is always clean after ifcvt although there are
still dead insns left.

> And maybe _that_ DCE can be removed as if_convert ()
> already performs a DF_LR_RUN_DCE on the first pass.
You mean removing the DCE run in combine? That one probably can go away then given that the passes
running between ifcvt and combine (loop and fwprop) get rid of dead insns properly.

-Andreas-

> 
> Richard.
> 
>> jeff
>>>
>>>
>>
>
Richard Biener March 20, 2015, 9:48 a.m. UTC | #9
On Fri, Mar 20, 2015 at 9:50 AM, Andreas Krebbel
<krebbel@linux.vnet.ibm.com> wrote:
> On 03/18/2015 12:04 PM, Richard Biener wrote:
>> On Tue, Mar 17, 2015 at 7:29 PM, Jeff Law <law@redhat.com> wrote:
>>> On 03/17/2015 02:17 AM, Andreas Krebbel wrote:
>>>>
>>>>
>>>> Just to have some numbers I did run a -j1 GCC bootstrap twice with and
>>>> without the patch on x86_64.
>>>> Best results for both are:
>>>>
>>>> clean:   21459s
>>>> patched: 21314s
>>>>
>>>> There rather appears to be a trend towards reduced compile time perhaps
>>>> due to the reduced number of
>>>> INSNs to be processed in the RTL passes between the two ifcvt runs (loop
>>>> optimization, combine,
>>>> fwprop, dse,...)?!
>>>>
>>>> I also tried to measure the testsuite runs but the results show a big
>>>> variance. So what I have right
>>>> now does not qualify as a benchmark.
>>>
>>> And reality is it's getting harder and harder to benchmark this kind of
>>> thing with turbo modes and such.    A single run isn't sufficient unless
>>> you've locked the box into a particular cpu frequency.
>>
>> For the particular patch I wonder if you really need to change all
>> three if-conversion pass instances or if changing the one before
>> combine (pass_rtl_ifcvt, thus rest_of_handle_if_conversion) is enough.
> Right. For this particular case it would be good enough to do it only in the first ifcvt run. But
> perhaps there are cases where later passes get confused by the leftovers from ifcvt?

Well, "perhaps" ...

>> That already runs an unconditonal (huh...) cleanup_cfg (0) at the end
>> which could be changed so that DCE is performed (CLEANUP_EXPENSIVE,
>> runs delete_trivially_dead_insns).
>>
>> At least that makes the patch smaller and its impact restricted to
>> one of the three ifcvt passes.
> This does not seem to work. The DCE run by cleanup_cfg only deals with dead pseudos. In my case it
> is a set to the CC hard reg which becomes dead.

I see.

>> OTOH ifcvt performs a DCE at its start (to be not confused by dead
>> instructions I guess), so why doesn't combine do that as well
>> (oh, it does!?)?
> When reaching combine the LR solution is clean and therefore also DCE isn't performed. This is
> because ifcvt disables running DCE with LR. So LR is always clean after ifcvt although there are
> still dead insns left.

Ok...

>> And maybe _that_ DCE can be removed as if_convert ()
>> already performs a DF_LR_RUN_DCE on the first pass.
> You mean removing the DCE run in combine? That one probably can go away then given that the passes
> running between ifcvt and combine (loop and fwprop) get rid of dead insns properly.

So I think the best solution is still removing the dead code manually
in ifcvt.  Then there is also
pass_ud_rtl_dce run before the combine pass - if that isn't enough
because of clean solutions or
whatever then that should be strenghthened - it was most definitely
designed to fix the issue you see.

Richard.

> -Andreas-
>
>>
>> Richard.
>>
>>> jeff
>>>>
>>>>
>>>
>>
>
diff mbox

Patch

diff --git a/gcc/ifcvt.c b/gcc/ifcvt.c
index a3e3e5c..d2040af 100644
--- a/gcc/ifcvt.c
+++ b/gcc/ifcvt.c
@@ -4626,6 +4626,13 @@  if_convert (bool after_combine)
 	       num_true_changes);
     }
 
+  if (num_true_changes > 0)
+    {
+      df_set_flags (DF_LR_RUN_DCE);
+      df_mark_solutions_dirty ();
+      df_analyze ();
+    }
+
   if (optimize == 1)
     df_remove_problem (df_live);