diff mbox

combine/dce patch for PR36003, PR42575

Message ID 4C20938E.2060606@codesourcery.com
State New
Headers show

Commit Message

Bernd Schmidt June 22, 2010, 10:42 a.m. UTC
PR36003 is a problem where enabling pass_fast_rtl_byte_dce causes
crashes in a cris compiler when the host is 64 bit.  combine produces an
insn of the form

(set (cc0)
        (zero_extract:DI (subreg:SI (reg/v:DI 38 [ high ]) 0)
            (const_int 1 [0x1])
            (const_int 0 [0x0]))) 16 {*btst} (nil))

The cris backend accepts this since it does not specify a mode for the
zero_extract, but IMO this is invalid, as the documentation states

     If LOC is in a register, the mode to use is specified by the
     operand of the `insv' or `extv' pattern (*note Standard Names::)
     and is usually a full-word integer mode, which is the default if
     none is specified.
[...]
     The mode M is the same as the mode that would be used for LOC if
     it were a register.

which seems to imply that the zero_extract and its first operand should
both have word_mode.  The DImode zero_extract of a SImode value seems to
confuse pass_fsat_rtl_byte_dce.

In make_extraction, we have

  if (mode != VOIDmode
      && GET_MODE_SIZE (extraction_mode) < GET_MODE_SIZE (mode))
    extraction_mode = mode;

where extraction_mode will become the mode of the zero_extract, and mode
is the mode of the object we're performing the extraction on.  In this
case, mode is DImode; we have (reg:DI 38) at this point - later in
make_extraction we will create a subreg around it due to
wanted_inner_reg_mode, which is SImode == word_mode.

It seems to me that we shouldn't widen extraction_mode beyond
wanted_inner_mode, so I've deleted the if statement quoted above.  I've
built various compilers (pa, m68k, arm, i686, cris) and compiled a large
set of input files with and without this change, and I've not observed
any differences in code generation.

With this problem fixed, we can reenable byte DCE, which eliminates one
insn in the PR42575 testcase.

Bootstrap succeeded on i686-linux, regression tests in progress.  Ok?


Bernd
PR middle-end/36003
	PR rtl-optimization/42575
	* combine.c (make_extraction): Don't widen extraction_mode to mode.
	* passes.c (init_optimization_passes): Reenable pass_fast_rtl_byte_dce.

Comments

Eric Botcazou June 23, 2010, 10:41 a.m. UTC | #1
> In make_extraction, we have
>
>   if (mode != VOIDmode
>       && GET_MODE_SIZE (extraction_mode) < GET_MODE_SIZE (mode))
>     extraction_mode = mode;
>
> where extraction_mode will become the mode of the zero_extract, and mode
> is the mode of the object we're performing the extraction on.  In this
> case, mode is DImode; we have (reg:DI 38) at this point - later in
> make_extraction we will create a subreg around it due to
> wanted_inner_reg_mode, which is SImode == word_mode.

Revision history shows that wanted_inner_reg_mode was added later, i.e. the 
original wanted mode would have been extraction_mode otherwise.  In light of 
this, I think the safest change is to keep the widening for MEM_P (inner), 
i.e. move the block to within the 'else' branch a few lines below.

OK with this change for the bugfix (PR middle-end/36003 should be renamed PR 
rtl-optimization/36003) if it passes bootstrap/regtesting.

Reenabling pass_fast_rtl_byte_dce cannot be done without first evaluating the 
speed/performance trade-off, DF is known to be costly.
Paolo Bonzini June 23, 2010, 12:15 p.m. UTC | #2
On 06/23/2010 12:41 PM, Eric Botcazou wrote:
> Reenabling pass_fast_rtl_byte_dce cannot be done without first evaluating the
> speed/performance trade-off, DF is known to be costly.

It's not these parts that are _extremely_ costly (it's only what is 
inherently quadratic, and cannot be helped without rewriting the 
algorithms to avoid that complexity).  But I agree that it should only 
be enabled at -O3.

Paolo
Bernd Schmidt June 23, 2010, 11:48 p.m. UTC | #3
On 06/23/2010 12:41 PM, Eric Botcazou wrote:

> Reenabling pass_fast_rtl_byte_dce cannot be done without first evaluating the 
> speed/performance trade-off, DF is known to be costly.

It also seems to create some regressions on arm-linux.

Would you be opposed to adding a simpler, basic-block-local pass that
only tries to eliminate unnecessary operations on DImode subregs?


Bernd
Paolo Bonzini June 24, 2010, 12:04 a.m. UTC | #4
On 06/24/2010 01:48 AM, Bernd Schmidt wrote:
> On 06/23/2010 12:41 PM, Eric Botcazou wrote:
>
>> Reenabling pass_fast_rtl_byte_dce cannot be done without first evaluating the
>> speed/performance trade-off, DF is known to be costly.
>
> It also seems to create some regressions on arm-linux.
>
> Would you be opposed to adding a simpler, basic-block-local pass that
> only tries to eliminate unnecessary operations on DImode subregs?

That would be nice.  I never liked byte DCE actually, it always sounded 
a bit overengineered.

Paolo
Eric Botcazou June 24, 2010, 12:53 p.m. UTC | #5
> Would you be opposed to adding a simpler, basic-block-local pass that
> only tries to eliminate unnecessary operations on DImode subregs?

The 3 words "adding a pass" always make me cringe. :-)  Couldn't this be 
integrated into one of the existing RTL passes?
Bernd Schmidt June 24, 2010, 1 p.m. UTC | #6
On 06/24/2010 02:53 PM, Eric Botcazou wrote:
>> Would you be opposed to adding a simpler, basic-block-local pass that
>> only tries to eliminate unnecessary operations on DImode subregs?
> 
> The 3 words "adding a pass" always make me cringe. :-)  Couldn't this be 
> integrated into one of the existing RTL passes?

I can tack it onto something else, but I don't see how this would reduce
the amount of work we need to do?


Bernd
Eric Botcazou June 24, 2010, 1:16 p.m. UTC | #7
> I can tack it onto something else, but I don't see how this would reduce
> the amount of work we need to do?

The overhead of traversing the RTL or invoking DF isn't negligible so if we 
can avoid doing it one more time...  Steven's measurements showed that the RTL 
optimizers still consume 40-45% of the compilation time at -O2.
Mike Stump June 24, 2010, 1:54 p.m. UTC | #8
On Jun 24, 2010, at 6:00 AM, Bernd Schmidt wrote:
> On 06/24/2010 02:53 PM, Eric Botcazou wrote:
>> 
>> The 3 words "adding a pass" always make me cringe. :-)  Couldn't this be 
>> integrated into one of the existing RTL passes?
> 
> I can tack it onto something else, but I don't see how this would reduce
> the amount of work we need to do?

Work needed to do is measured in cache misses.  A pass adds nothing but cache misses.  Inside a loop, if you get lucky enough, extra work adds 0.  Anyway, thats a very rough idea behind his statement.  Worse, the tradeoff generally just gets worse over time.  If you collect sample data at the instruction level, you can see the instructions that just sit and wait around for memory.  Very striking, if you expected the old, this instruction take 3 cycles, this one takes 11 mentality.  In the new world, all the instructions take 0, and the 1 trivial instruction, that should be free, that waits for memory take 60% of the processor time....
Paolo Bonzini June 24, 2010, 2:03 p.m. UTC | #9
On 06/24/2010 03:00 PM, Bernd Schmidt wrote:
> On 06/24/2010 02:53 PM, Eric Botcazou wrote:
>>> Would you be opposed to adding a simpler, basic-block-local pass that
>>> only tries to eliminate unnecessary operations on DImode subregs?
>>
>> The 3 words "adding a pass" always make me cringe. :-)  Couldn't this be
>> integrated into one of the existing RTL passes?
>
> I can tack it onto something else, but I don't see how this would reduce
> the amount of work we need to do?

It may be possible to piggy-back it onto an existing DCE pass, or onto 
lower-subreg, but I'm not sure without understanding exactly what 
transform you need to do (especially if you want to run this before 
register allocation).

But you're talking about replacing a buggy and expensive pass 
(byte-level DCE) with a smaller one, which isn't a bad idea.

Paolo
Steven Bosscher June 24, 2010, 2:27 p.m. UTC | #10
On Thu, Jun 24, 2010 at 2:53 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> Would you be opposed to adding a simpler, basic-block-local pass that
>> only tries to eliminate unnecessary operations on DImode subregs?
>
> The 3 words "adding a pass" always make me cringe. :-)  Couldn't this be
> integrated into one of the existing RTL passes?

Byte DCE is also a separate pass, isn't it? The new pass would
(should) just replace it. Byte DCE never really worked anyway, and
*replacing* it with something simpler, instead of adding something
new, would be a good thing IMHO.

Ciao!
Steven
Hans-Peter Nilsson June 24, 2010, 5:58 p.m. UTC | #11
On Tue, 22 Jun 2010, Bernd Schmidt wrote:
> PR36003 is a problem where enabling pass_fast_rtl_byte_dce causes
> crashes in a cris compiler when the host is 64 bit.  combine produces an
> insn of the form
>
> (set (cc0)
>         (zero_extract:DI (subreg:SI (reg/v:DI 38 [ high ]) 0)
>             (const_int 1 [0x1])
>             (const_int 0 [0x0]))) 16 {*btst} (nil))
>
> The cris backend accepts this

Look again!

> since it does not specify a mode for the
> zero_extract, but IMO this is invalid, as the documentation states
>
>      If LOC is in a register, the mode to use is specified by the
>      operand of the `insv' or `extv' pattern (*note Standard Names::)
>      and is usually a full-word integer mode, which is the default if
>      none is specified.

The RTL may be invalid, but none was specified in the port, just
like the doc says.  So, I guess rtl.texi should change too?

Anyway.  That changed about a year ago: Bonzini's condition-code
rewrite (r147425).  It is now has a mode (and is wrapped in a
modeless compare).

> [...]
>      The mode M is the same as the mode that would be used for LOC if
>      it were a register.
>
> which seems to imply that the zero_extract and its first operand should
> both have word_mode.  The DImode zero_extract of a SImode value seems to
> confuse pass_fsat_rtl_byte_dce.

I find that reference in rtl.texi so IIUC is regarding valid
RTX, not what a port has to assert or refuse.  In this respect,
the port was consistent with the compare insns; there's no mode
on the compare, just the operands of the compare.  I can't find
whatever documentation specified this, maybe I found nothing at
the time was just using the m68k-port as effective
documentation, which had the same modeless zero_extract
assigning to cc0 *at the time*.  Whatever, all this *has*
changed.  Water under the bridges...

brgds, H-P
Bernd Schmidt June 24, 2010, 6:12 p.m. UTC | #12
On 06/24/2010 07:58 PM, Hans-Peter Nilsson wrote:
> On Tue, 22 Jun 2010, Bernd Schmidt wrote:
>> PR36003 is a problem where enabling pass_fast_rtl_byte_dce causes
>> crashes in a cris compiler when the host is 64 bit.  combine produces an
>> insn of the form
>>
>> (set (cc0)
>>         (zero_extract:DI (subreg:SI (reg/v:DI 38 [ high ]) 0)
>>             (const_int 1 [0x1])
>>             (const_int 0 [0x0]))) 16 {*btst} (nil))
>>
>> The cris backend accepts this
> 
> Look again!

Well, I was working with the revision mentioned in the PR so that I
could reproduce the problem.

> I find that reference in rtl.texi so IIUC is regarding valid
> RTX, not what a port has to assert or refuse.  In this respect,
> the port was consistent with the compare insns; there's no mode
> on the compare, just the operands of the compare.  I can't find
> whatever documentation specified this, maybe I found nothing at
> the time was just using the m68k-port as effective
> documentation, which had the same modeless zero_extract
> assigning to cc0 *at the time*.  Whatever, all this *has*
> changed.  Water under the bridges...

I'm not actually sure what you're trying to say here.  I wasn't
criticizing the cris backend, although I think it's good that it now has
a mode on the zero_extract since that'll avoid such problems in the future.


Bernd
Eric Botcazou June 24, 2010, 8:40 p.m. UTC | #13
> Byte DCE is also a separate pass, isn't it? The new pass would
> (should) just replace it. Byte DCE never really worked anyway, and
> *replacing* it with something simpler, instead of adding something
> new, would be a good thing IMHO.

Yes, but byte DCE isn't enabled so a new pass will slow down the compiler out 
of its mere existence, no matter how faster than byte DCE it is.

I'm not saying that adding a new pass is out of question, but rather that it 
should be the last resort solution.
Jan Hubicka June 27, 2010, 9:21 p.m. UTC | #14
> On 06/23/2010 12:41 PM, Eric Botcazou wrote:
> 
> > Reenabling pass_fast_rtl_byte_dce cannot be done without first evaluating the 
> > speed/performance trade-off, DF is known to be costly.
> 
> It also seems to create some regressions on arm-linux.
> 
> Would you be opposed to adding a simpler, basic-block-local pass that
> only tries to eliminate unnecessary operations on DImode subregs?

I can do some testing on my WHOPR build (where I was testing various DF speedups
for a while).

fast_dce and DF's note adding code are indeed one of most expensive RTL passes.
Both are very similar in their nature - i.e. they compute liveness and then walk
BB updating life.

I did some work on speeding up notes that I hope to get into mainline soonish
and considered testing replacing fast_dce by the du/ud implementation.

I think the slowness, from large part, comes from two reasons.
1) DCE iterates and calls dataflow at each step.  This is naturaly slow but
   this probably can be limited.  Pass don't get dead induction vars removed
   and other loop carried liveness is not too important.
2) It is slow on bitmap operations.  At each BB it copies a liveness bitmap
   and then it does a lot of sets/clears that are rather complicated at least
   for more than 128 pseudos.  Implementation is generally O(nregs*ninsns)
   and since the functions tends to get more noticeable on big functions I think
   it counts.

   I think we can avoid this by using two sbitmaps.  First sbitmap carry local
   liveness and second sbitmap carry info if reg was seen (and thus if liveness
   should be tested by looking into local sbitmaps or DF's live at exit bitmap).
   TO get things O(ninsns+nregs) one would need to track non-zero words of
   the first sbitmap and use the list to clear both bitmaps at BB boundary.

Also perhaps byte-dce can replace one of existing DCE passes?
fast_dce is currently done post-reload and when crossjumping happens.

I can get you some numbers and profiles for the patch on my testing setup.
In a way I would like to see the byte DF code either used or removed, so
I think we should give at least a try to the first alternative.

Honza
> 
> 
> Bernd
Steven Bosscher June 27, 2010, 9:45 p.m. UTC | #15
On Sun, Jun 27, 2010 at 11:21 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> I did some work on speeding up notes that I hope to get into mainline soonish
> and considered testing replacing fast_dce by the du/ud implementation.

The du/ud implementation has its drawbacks too.  Such as not working
very well for very large functions.


> I think the slowness, from large part, comes from two reasons.
> 1) DCE iterates and calls dataflow at each step.  This is naturaly slow but
>   this probably can be limited.  Pass don't get dead induction vars removed
>   and other loop carried liveness is not too important.

Originally DCE did not iterate, but this caused bugs, e.g. PR35805.


> Also perhaps byte-dce can replace one of existing DCE passes?

It doesn't remove the same insns.

Ciao!
Steven
Jan Hubicka June 27, 2010, 9:49 p.m. UTC | #16
> On Sun, Jun 27, 2010 at 11:21 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > I did some work on speeding up notes that I hope to get into mainline soonish
> > and considered testing replacing fast_dce by the du/ud implementation.
> 
> The du/ud implementation has its drawbacks too.  Such as not working
> very well for very large functions.

Yep, we need FUD to get around DU/UD memory use ;))
> 
> 
> > I think the slowness, from large part, comes from two reasons.
> > 1) DCE iterates and calls dataflow at each step.  This is naturaly slow but
> >   this probably can be limited.  Pass don't get dead induction vars removed
> >   and other loop carried liveness is not too important.
> 
> Originally DCE did not iterate, but this caused bugs, e.g. PR35805.

But the iteration is very ugly.  I wonder if it can't be done as usual liveness
based DCE (i.e. computing dataflow just once but having transfer function smart
enough to not mark values used to computed dead values live)
Just like old DCE did.
> 
> 
> > Also perhaps byte-dce can replace one of existing DCE passes?
> 
> It doesn't remove the same insns.

Hmm, I was under impression that byte DCE is same as fast_dce just with dataflow
done on byte granuality (so getting subregs and multiword values right)?
What kind of dead code byte-dce miss?

Honza
> 
> Ciao!
> Steven
Hans-Peter Nilsson July 1, 2010, 10:48 p.m. UTC | #17
On Thu, 24 Jun 2010, Bernd Schmidt wrote:
> On 06/24/2010 07:58 PM, Hans-Peter Nilsson wrote:
> > On Tue, 22 Jun 2010, Bernd Schmidt wrote:
> > I find that reference in rtl.texi so IIUC is regarding valid
> > RTX, not what a port has to assert or refuse.  In this respect,
> > the port was consistent with the compare insns; there's no mode
> > on the compare, just the operands of the compare.  I can't find
> > whatever documentation specified this, maybe I found nothing at
> > the time was just using the m68k-port as effective
> > documentation, which had the same modeless zero_extract
> > assigning to cc0 *at the time*.  Whatever, all this *has*
> > changed.  Water under the bridges...
>
> I'm not actually sure what you're trying to say here.

Lots of defense and excuses for not pursuing the path you took
here at the time (you missed the IRC debate, all fingerpointing
and fat-calling), but in the paragraph above the above one was
an actual point:  "So, I guess rtl.texi should change too?" (to
require a mode at the zero_extract rather than mentioning it to
default elsewhere).

brgds, H-P
Bernd Schmidt July 1, 2010, 11:03 p.m. UTC | #18
On 07/02/2010 12:48 AM, Hans-Peter Nilsson wrote:

> "So, I guess rtl.texi should change too?" (to
> require a mode at the zero_extract rather than mentioning it to
> default elsewhere).

I'm not sure.  In theory it sounds like it would be better, but I think
given past practice the compiler is prepared to deal with it, and I
don't know whether all ports now define a mode for their zero_extracts.


Bernd
Jeff Law July 1, 2010, 11:43 p.m. UTC | #19
On 07/01/10 17:03, Bernd Schmidt wrote:
> On 07/02/2010 12:48 AM, Hans-Peter Nilsson wrote:
>
>    
>> "So, I guess rtl.texi should change too?" (to
>> require a mode at the zero_extract rather than mentioning it to
>> default elsewhere).
>>      
> I'm not sure.  In theory it sounds like it would be better, but I think
> given past practice the compiler is prepared to deal with it, and I
> don't know whether all ports now define a mode for their zero_extracts.
>    
I don't think there has ever been a directive that they must; however, 
it's certainly advisable for ports to specify a mode on their extracts 
-- this is particularly true for ports where WORD_SIZE varies based on 
target switches.

jeff
diff mbox

Patch

Index: combine.c
===================================================================
--- combine.c	(revision 161116)
+++ combine.c	(working copy)
@@ -6835,12 +6835,6 @@  make_extraction (enum machine_mode mode,
       extraction_mode = mode_for_extraction (EP_extv, 0);
     }
 
-  /* Never narrow an object, since that might not be safe.  */
-
-  if (mode != VOIDmode
-      && GET_MODE_SIZE (extraction_mode) < GET_MODE_SIZE (mode))
-    extraction_mode = mode;
-
   if (pos_rtx && GET_MODE (pos_rtx) != VOIDmode
       && GET_MODE_SIZE (pos_mode) < GET_MODE_SIZE (GET_MODE (pos_rtx)))
     pos_mode = GET_MODE (pos_rtx);
Index: passes.c
===================================================================
--- passes.c	(revision 161116)
+++ passes.c	(working copy)
@@ -1014,6 +1014,7 @@  init_optimization_passes (void)
       NEXT_PASS (pass_regmove);
       NEXT_PASS (pass_outof_cfg_layout_mode);
       NEXT_PASS (pass_split_all_insns);
+      NEXT_PASS (pass_fast_rtl_byte_dce);
       NEXT_PASS (pass_lower_subreg2);
       NEXT_PASS (pass_df_initialize_no_opt);
       NEXT_PASS (pass_stack_ptr_mod);