Message ID | 4C20938E.2060606@codesourcery.com |
---|---|
State | New |
Headers | show |
> In make_extraction, we have > > if (mode != VOIDmode > && GET_MODE_SIZE (extraction_mode) < GET_MODE_SIZE (mode)) > extraction_mode = mode; > > where extraction_mode will become the mode of the zero_extract, and mode > is the mode of the object we're performing the extraction on. In this > case, mode is DImode; we have (reg:DI 38) at this point - later in > make_extraction we will create a subreg around it due to > wanted_inner_reg_mode, which is SImode == word_mode. Revision history shows that wanted_inner_reg_mode was added later, i.e. the original wanted mode would have been extraction_mode otherwise. In light of this, I think the safest change is to keep the widening for MEM_P (inner), i.e. move the block to within the 'else' branch a few lines below. OK with this change for the bugfix (PR middle-end/36003 should be renamed PR rtl-optimization/36003) if it passes bootstrap/regtesting. Reenabling pass_fast_rtl_byte_dce cannot be done without first evaluating the speed/performance trade-off, DF is known to be costly.
On 06/23/2010 12:41 PM, Eric Botcazou wrote: > Reenabling pass_fast_rtl_byte_dce cannot be done without first evaluating the > speed/performance trade-off, DF is known to be costly. It's not these parts that are _extremely_ costly (it's only what is inherently quadratic, and cannot be helped without rewriting the algorithms to avoid that complexity). But I agree that it should only be enabled at -O3. Paolo
On 06/23/2010 12:41 PM, Eric Botcazou wrote: > Reenabling pass_fast_rtl_byte_dce cannot be done without first evaluating the > speed/performance trade-off, DF is known to be costly. It also seems to create some regressions on arm-linux. Would you be opposed to adding a simpler, basic-block-local pass that only tries to eliminate unnecessary operations on DImode subregs? Bernd
On 06/24/2010 01:48 AM, Bernd Schmidt wrote: > On 06/23/2010 12:41 PM, Eric Botcazou wrote: > >> Reenabling pass_fast_rtl_byte_dce cannot be done without first evaluating the >> speed/performance trade-off, DF is known to be costly. > > It also seems to create some regressions on arm-linux. > > Would you be opposed to adding a simpler, basic-block-local pass that > only tries to eliminate unnecessary operations on DImode subregs? That would be nice. I never liked byte DCE actually, it always sounded a bit overengineered. Paolo
> Would you be opposed to adding a simpler, basic-block-local pass that > only tries to eliminate unnecessary operations on DImode subregs? The 3 words "adding a pass" always make me cringe. :-) Couldn't this be integrated into one of the existing RTL passes?
On 06/24/2010 02:53 PM, Eric Botcazou wrote: >> Would you be opposed to adding a simpler, basic-block-local pass that >> only tries to eliminate unnecessary operations on DImode subregs? > > The 3 words "adding a pass" always make me cringe. :-) Couldn't this be > integrated into one of the existing RTL passes? I can tack it onto something else, but I don't see how this would reduce the amount of work we need to do? Bernd
> I can tack it onto something else, but I don't see how this would reduce > the amount of work we need to do? The overhead of traversing the RTL or invoking DF isn't negligible so if we can avoid doing it one more time... Steven's measurements showed that the RTL optimizers still consume 40-45% of the compilation time at -O2.
On Jun 24, 2010, at 6:00 AM, Bernd Schmidt wrote: > On 06/24/2010 02:53 PM, Eric Botcazou wrote: >> >> The 3 words "adding a pass" always make me cringe. :-) Couldn't this be >> integrated into one of the existing RTL passes? > > I can tack it onto something else, but I don't see how this would reduce > the amount of work we need to do? Work needed to do is measured in cache misses. A pass adds nothing but cache misses. Inside a loop, if you get lucky enough, extra work adds 0. Anyway, thats a very rough idea behind his statement. Worse, the tradeoff generally just gets worse over time. If you collect sample data at the instruction level, you can see the instructions that just sit and wait around for memory. Very striking, if you expected the old, this instruction take 3 cycles, this one takes 11 mentality. In the new world, all the instructions take 0, and the 1 trivial instruction, that should be free, that waits for memory take 60% of the processor time....
On 06/24/2010 03:00 PM, Bernd Schmidt wrote: > On 06/24/2010 02:53 PM, Eric Botcazou wrote: >>> Would you be opposed to adding a simpler, basic-block-local pass that >>> only tries to eliminate unnecessary operations on DImode subregs? >> >> The 3 words "adding a pass" always make me cringe. :-) Couldn't this be >> integrated into one of the existing RTL passes? > > I can tack it onto something else, but I don't see how this would reduce > the amount of work we need to do? It may be possible to piggy-back it onto an existing DCE pass, or onto lower-subreg, but I'm not sure without understanding exactly what transform you need to do (especially if you want to run this before register allocation). But you're talking about replacing a buggy and expensive pass (byte-level DCE) with a smaller one, which isn't a bad idea. Paolo
On Thu, Jun 24, 2010 at 2:53 PM, Eric Botcazou <ebotcazou@adacore.com> wrote: >> Would you be opposed to adding a simpler, basic-block-local pass that >> only tries to eliminate unnecessary operations on DImode subregs? > > The 3 words "adding a pass" always make me cringe. :-) Couldn't this be > integrated into one of the existing RTL passes? Byte DCE is also a separate pass, isn't it? The new pass would (should) just replace it. Byte DCE never really worked anyway, and *replacing* it with something simpler, instead of adding something new, would be a good thing IMHO. Ciao! Steven
On Tue, 22 Jun 2010, Bernd Schmidt wrote: > PR36003 is a problem where enabling pass_fast_rtl_byte_dce causes > crashes in a cris compiler when the host is 64 bit. combine produces an > insn of the form > > (set (cc0) > (zero_extract:DI (subreg:SI (reg/v:DI 38 [ high ]) 0) > (const_int 1 [0x1]) > (const_int 0 [0x0]))) 16 {*btst} (nil)) > > The cris backend accepts this Look again! > since it does not specify a mode for the > zero_extract, but IMO this is invalid, as the documentation states > > If LOC is in a register, the mode to use is specified by the > operand of the `insv' or `extv' pattern (*note Standard Names::) > and is usually a full-word integer mode, which is the default if > none is specified. The RTL may be invalid, but none was specified in the port, just like the doc says. So, I guess rtl.texi should change too? Anyway. That changed about a year ago: Bonzini's condition-code rewrite (r147425). It is now has a mode (and is wrapped in a modeless compare). > [...] > The mode M is the same as the mode that would be used for LOC if > it were a register. > > which seems to imply that the zero_extract and its first operand should > both have word_mode. The DImode zero_extract of a SImode value seems to > confuse pass_fsat_rtl_byte_dce. I find that reference in rtl.texi so IIUC is regarding valid RTX, not what a port has to assert or refuse. In this respect, the port was consistent with the compare insns; there's no mode on the compare, just the operands of the compare. I can't find whatever documentation specified this, maybe I found nothing at the time was just using the m68k-port as effective documentation, which had the same modeless zero_extract assigning to cc0 *at the time*. Whatever, all this *has* changed. Water under the bridges... brgds, H-P
On 06/24/2010 07:58 PM, Hans-Peter Nilsson wrote: > On Tue, 22 Jun 2010, Bernd Schmidt wrote: >> PR36003 is a problem where enabling pass_fast_rtl_byte_dce causes >> crashes in a cris compiler when the host is 64 bit. combine produces an >> insn of the form >> >> (set (cc0) >> (zero_extract:DI (subreg:SI (reg/v:DI 38 [ high ]) 0) >> (const_int 1 [0x1]) >> (const_int 0 [0x0]))) 16 {*btst} (nil)) >> >> The cris backend accepts this > > Look again! Well, I was working with the revision mentioned in the PR so that I could reproduce the problem. > I find that reference in rtl.texi so IIUC is regarding valid > RTX, not what a port has to assert or refuse. In this respect, > the port was consistent with the compare insns; there's no mode > on the compare, just the operands of the compare. I can't find > whatever documentation specified this, maybe I found nothing at > the time was just using the m68k-port as effective > documentation, which had the same modeless zero_extract > assigning to cc0 *at the time*. Whatever, all this *has* > changed. Water under the bridges... I'm not actually sure what you're trying to say here. I wasn't criticizing the cris backend, although I think it's good that it now has a mode on the zero_extract since that'll avoid such problems in the future. Bernd
> Byte DCE is also a separate pass, isn't it? The new pass would > (should) just replace it. Byte DCE never really worked anyway, and > *replacing* it with something simpler, instead of adding something > new, would be a good thing IMHO. Yes, but byte DCE isn't enabled so a new pass will slow down the compiler out of its mere existence, no matter how faster than byte DCE it is. I'm not saying that adding a new pass is out of question, but rather that it should be the last resort solution.
> On 06/23/2010 12:41 PM, Eric Botcazou wrote: > > > Reenabling pass_fast_rtl_byte_dce cannot be done without first evaluating the > > speed/performance trade-off, DF is known to be costly. > > It also seems to create some regressions on arm-linux. > > Would you be opposed to adding a simpler, basic-block-local pass that > only tries to eliminate unnecessary operations on DImode subregs? I can do some testing on my WHOPR build (where I was testing various DF speedups for a while). fast_dce and DF's note adding code are indeed one of most expensive RTL passes. Both are very similar in their nature - i.e. they compute liveness and then walk BB updating life. I did some work on speeding up notes that I hope to get into mainline soonish and considered testing replacing fast_dce by the du/ud implementation. I think the slowness, from large part, comes from two reasons. 1) DCE iterates and calls dataflow at each step. This is naturaly slow but this probably can be limited. Pass don't get dead induction vars removed and other loop carried liveness is not too important. 2) It is slow on bitmap operations. At each BB it copies a liveness bitmap and then it does a lot of sets/clears that are rather complicated at least for more than 128 pseudos. Implementation is generally O(nregs*ninsns) and since the functions tends to get more noticeable on big functions I think it counts. I think we can avoid this by using two sbitmaps. First sbitmap carry local liveness and second sbitmap carry info if reg was seen (and thus if liveness should be tested by looking into local sbitmaps or DF's live at exit bitmap). TO get things O(ninsns+nregs) one would need to track non-zero words of the first sbitmap and use the list to clear both bitmaps at BB boundary. Also perhaps byte-dce can replace one of existing DCE passes? fast_dce is currently done post-reload and when crossjumping happens. I can get you some numbers and profiles for the patch on my testing setup. In a way I would like to see the byte DF code either used or removed, so I think we should give at least a try to the first alternative. Honza > > > Bernd
On Sun, Jun 27, 2010 at 11:21 PM, Jan Hubicka <hubicka@ucw.cz> wrote: > I did some work on speeding up notes that I hope to get into mainline soonish > and considered testing replacing fast_dce by the du/ud implementation. The du/ud implementation has its drawbacks too. Such as not working very well for very large functions. > I think the slowness, from large part, comes from two reasons. > 1) DCE iterates and calls dataflow at each step. This is naturaly slow but > this probably can be limited. Pass don't get dead induction vars removed > and other loop carried liveness is not too important. Originally DCE did not iterate, but this caused bugs, e.g. PR35805. > Also perhaps byte-dce can replace one of existing DCE passes? It doesn't remove the same insns. Ciao! Steven
> On Sun, Jun 27, 2010 at 11:21 PM, Jan Hubicka <hubicka@ucw.cz> wrote: > > I did some work on speeding up notes that I hope to get into mainline soonish > > and considered testing replacing fast_dce by the du/ud implementation. > > The du/ud implementation has its drawbacks too. Such as not working > very well for very large functions. Yep, we need FUD to get around DU/UD memory use ;)) > > > > I think the slowness, from large part, comes from two reasons. > > 1) DCE iterates and calls dataflow at each step. This is naturaly slow but > > this probably can be limited. Pass don't get dead induction vars removed > > and other loop carried liveness is not too important. > > Originally DCE did not iterate, but this caused bugs, e.g. PR35805. But the iteration is very ugly. I wonder if it can't be done as usual liveness based DCE (i.e. computing dataflow just once but having transfer function smart enough to not mark values used to computed dead values live) Just like old DCE did. > > > > Also perhaps byte-dce can replace one of existing DCE passes? > > It doesn't remove the same insns. Hmm, I was under impression that byte DCE is same as fast_dce just with dataflow done on byte granuality (so getting subregs and multiword values right)? What kind of dead code byte-dce miss? Honza > > Ciao! > Steven
On Thu, 24 Jun 2010, Bernd Schmidt wrote: > On 06/24/2010 07:58 PM, Hans-Peter Nilsson wrote: > > On Tue, 22 Jun 2010, Bernd Schmidt wrote: > > I find that reference in rtl.texi so IIUC is regarding valid > > RTX, not what a port has to assert or refuse. In this respect, > > the port was consistent with the compare insns; there's no mode > > on the compare, just the operands of the compare. I can't find > > whatever documentation specified this, maybe I found nothing at > > the time was just using the m68k-port as effective > > documentation, which had the same modeless zero_extract > > assigning to cc0 *at the time*. Whatever, all this *has* > > changed. Water under the bridges... > > I'm not actually sure what you're trying to say here. Lots of defense and excuses for not pursuing the path you took here at the time (you missed the IRC debate, all fingerpointing and fat-calling), but in the paragraph above the above one was an actual point: "So, I guess rtl.texi should change too?" (to require a mode at the zero_extract rather than mentioning it to default elsewhere). brgds, H-P
On 07/02/2010 12:48 AM, Hans-Peter Nilsson wrote: > "So, I guess rtl.texi should change too?" (to > require a mode at the zero_extract rather than mentioning it to > default elsewhere). I'm not sure. In theory it sounds like it would be better, but I think given past practice the compiler is prepared to deal with it, and I don't know whether all ports now define a mode for their zero_extracts. Bernd
On 07/01/10 17:03, Bernd Schmidt wrote: > On 07/02/2010 12:48 AM, Hans-Peter Nilsson wrote: > > >> "So, I guess rtl.texi should change too?" (to >> require a mode at the zero_extract rather than mentioning it to >> default elsewhere). >> > I'm not sure. In theory it sounds like it would be better, but I think > given past practice the compiler is prepared to deal with it, and I > don't know whether all ports now define a mode for their zero_extracts. > I don't think there has ever been a directive that they must; however, it's certainly advisable for ports to specify a mode on their extracts -- this is particularly true for ports where WORD_SIZE varies based on target switches. jeff
Index: combine.c =================================================================== --- combine.c (revision 161116) +++ combine.c (working copy) @@ -6835,12 +6835,6 @@ make_extraction (enum machine_mode mode, extraction_mode = mode_for_extraction (EP_extv, 0); } - /* Never narrow an object, since that might not be safe. */ - - if (mode != VOIDmode - && GET_MODE_SIZE (extraction_mode) < GET_MODE_SIZE (mode)) - extraction_mode = mode; - if (pos_rtx && GET_MODE (pos_rtx) != VOIDmode && GET_MODE_SIZE (pos_mode) < GET_MODE_SIZE (GET_MODE (pos_rtx))) pos_mode = GET_MODE (pos_rtx); Index: passes.c =================================================================== --- passes.c (revision 161116) +++ passes.c (working copy) @@ -1014,6 +1014,7 @@ init_optimization_passes (void) NEXT_PASS (pass_regmove); NEXT_PASS (pass_outof_cfg_layout_mode); NEXT_PASS (pass_split_all_insns); + NEXT_PASS (pass_fast_rtl_byte_dce); NEXT_PASS (pass_lower_subreg2); NEXT_PASS (pass_df_initialize_no_opt); NEXT_PASS (pass_stack_ptr_mod);