Message ID: CAFULd4aeN69+bzzKxKyHR93N3O7K47B3=QmQFhSR7ghJpqjK=g@mail.gmail.com
State: New
On 12/12/2012 10:32 AM, Uros Bizjak wrote:
> Please check the attached patch, it implements this limitation in a correct way:
> - keeps memory operands for -Os or cold parts of the executable
> - doesn't increase register pressure
> - handles all situations where a memory operand can propagate into the RTX

I agree this is the right way to attack this problem.


r~
On Wed, Dec 12, 2012 at 7:32 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Wed, Dec 12, 2012 at 3:45 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>
>>> I assume that this is not the right way to fix such a simple performance
>>> anomaly, since we need to do redundant work - combine the load into the
>>> conditional move and then split it back in peephole2. Does that look
>>> reasonable? Why should we produce an inefficient instruction that must
>>> be split later?
>>
>> Well, either don't allow this instruction variant from the start, or allow
>> the extra freedom for register allocation this creates. It doesn't make
>> sense to just reject it being generated by combine - that doesn't address
>> when it materializes in another way.
>
> Please check the attached patch, it implements this limitation in a correct way:
> - keeps memory operands for -Os or cold parts of the executable
> - doesn't increase register pressure
> - handles all situations where a memory operand can propagate into the RTX
>
> Yuri, can you please check if this patch fixes the performance problem for you?
>
> BTW: I would really like to add some TARGET_USE_CMOVE_WITH_MEMOP
> target macro and conditionalize the new peephole2 patterns on it.

Looks good to me. I believe optimize_insn_for_speed_p () only works reliably
during RTL expansion, as it relies on crtl->maybe_hot_insn_p to be better than
function granular. I'm quite sure split does not adjust this (it probably
should, as those predicates are definitely the correct ones to use) via
rtl_profile_for_bb (). I think passes that do not adjust this get whatever was
left over by previous passes (instead of the default).

Honza, I think the pass manager should call default_rtl_profile () before each
RTL pass to avoid this, no?

Thanks,
Richard.

> Uros.
On Thu, Dec 13, 2012 at 10:51 AM, Richard Biener
<richard.guenther@gmail.com> wrote:

[...]

> Looks good to me. I believe optimize_insn_for_speed_p () only works
> reliably during RTL expansion, as it relies on crtl->maybe_hot_insn_p
> to be better than function granular. I'm quite sure split does not adjust
> this (it probably should, as those predicates are definitely the correct
> ones to use) via rtl_profile_for_bb (). I think passes that do not adjust
> this get whatever was left over by previous passes (instead of the default).
>
> Honza, I think the pass manager should call default_rtl_profile () before each
> RTL pass to avoid this, no?

Please note that we have plenty of existing peephole2s that use the
optimize_insn_for_speed_p predicate. It is assumed to work ...

Uros.
On Thu, Dec 13, 2012 at 11:20 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Thu, Dec 13, 2012 at 10:51 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:

[...]

>> Honza, I think the pass manager should call default_rtl_profile () before each
>> RTL pass to avoid this, no?
>
> Please note that we have plenty of existing peephole2s that use the
> optimize_insn_for_speed_p predicate. It is assumed to work ...

Indeed - it calls rtl_profile_for_bb already. (It's in recog.c ... I had just
looked for something like "split.c" in my grep.)

Richard.

> Uros.
Hi Guys,

The patch proposed by Uros is useless, since we don't have a free scratch
register to do the splitting of the memory operand:

;; regs ever live 0[ax] 1[dx] 2[cx] 3[bx] 4[si] 5[di] 6[bp] 7[sp] 17[flags]

...

(insn 96 131 132 7 (set (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])
        (if_then_else:SI (ne (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (mem/f:SI (plus:SI (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
                    (const_int 12 [0xc])) [2 trie_root_23->rlink+0 S4 A32])
            (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])))
     routelookup/route_lookup.c:639 940 {*movsicc_noc}
     (expr_list:REG_DEAD (reg:CCZ 17 flags)
        (expr_list:REG_DEAD (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
            (nil))))

How can we cope with this? I still assume that we can apply my patch with an
additional check on optimization for speed.

Best regards.
Yuri.

2012/12/12 Richard Henderson <rth@redhat.com>:

[...]
On Thu, Dec 13, 2012 at 3:23 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:

> The patch proposed by Uros is useless, since we don't have a free scratch
> register to do the splitting of the memory operand:

[...]

> How can we cope with this? I still assume that we can apply my patch with an
> additional check on optimization for speed.

Can you please post the testcase?

Uros.
On Thu, Dec 13, 2012 at 3:23 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi Guys,
>
> The patch proposed by Uros is useless, since we don't have a free scratch
> register to do the splitting of the memory operand:

[...]

> How can we cope with this? I still assume that we can apply my patch with an
> additional check on optimization for speed.

What's the important benchmark you are register-starved for?

Richard.

> Best regards.
> Yuri.

[...]
On Thu, Dec 13, 2012 at 3:27 PM, Uros Bizjak <ubizjak@gmail.com> wrote:

[...]

>> How can we cope with this? I still assume that we can apply my patch with an
>> additional check on optimization for speed.
>
> Can you please post the testcase?

Also, the peephole2 pass recalculates life information by itself before every
insn, so the "regs ever live" array is not something to look at. The question
was whether the patch solves the runtime regression for you, not whether it
removes all memory operands from the conditional moves. The split depends on
the optimize_insn_for_speed_p predicate.

Uros.
Uros,

We did not see any performance improvement on Atom in 32-bit mode on
routelookup from eembc_2_0 (eembc_1_1).

Best regards.
Yuri.

2012/12/13 Uros Bizjak <ubizjak@gmail.com>:

[...]
[...]

> Honza, I think the pass manager should call default_rtl_profile () before each
> RTL pass to avoid this, no?

Yep, we should get to default_rtl_profile after each pass. We should also fix
passes that do splitting/expansion without properly setting the current basic
block.

Honza

> Thanks,
> Richard.
>
> > Uros.
> > Honza, I think the pass manager should call default_rtl_profile () before each
> > RTL pass to avoid this, no?
>
> Please note that we have plenty of existing peephole2s that use the
> optimize_insn_for_speed_p predicate. It is assumed to work ...

It is set by the peep2 pass:

static void
peephole2_optimize (void)
{
  rtx insn;
  bitmap live;
  int i;
  basic_block bb;

  peep2_do_cleanup_cfg = false;
  peep2_do_rebuild_jump_labels = false;

  df_set_flags (DF_LR_RUN_DCE);
  df_note_add_problem ();
  df_analyze ();

  /* Initialize the regsets we're going to use.  */
  for (i = 0; i < MAX_INSNS_PER_PEEP2 + 1; ++i)
    peep2_insn_data[i].live_before = BITMAP_ALLOC (&reg_obstack);
  live = BITMAP_ALLOC (&reg_obstack);

  FOR_EACH_BB_REVERSE (bb)
    {
      bool past_end = false;
      int pos;

      rtl_profile_for_bb (bb);
      ^^^ here.

Splitters/peepholes should be safe, as the splitting/peephole and combine
passes take care of the profile. What is somewhat unreliable is expansion done
late, like from the loop optimizer. I am not sure how many of those we have
left.

Honza

> Uros.
Index: i386.md
===================================================================
--- i386.md	(revision 194451)
+++ i386.md	(working copy)
@@ -16122,6 +16122,31 @@
   operands[3] = gen_lowpart (SImode, operands[3]);
 })
 
+;; Don't do conditional moves with memory inputs
+(define_peephole2
+  [(match_scratch:SWI248 2 "r")
+   (set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_dup 0)
+	  (match_operand:SWI248 3 "memory_operand")))]
+  "TARGET_CMOVE && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 0) (match_dup 2)))])
+
+(define_peephole2
+  [(match_scratch:SWI248 2 "r")
+   (set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:SWI248 3 "memory_operand")
+	  (match_dup 0)))]
+  "TARGET_CMOVE && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 2) (match_dup 0)))])
+
 (define_expand "mov<mode>cc"
   [(set (match_operand:X87MODEF 0 "register_operand")
 	(if_then_else:X87MODEF
@@ -16209,6 +16234,35 @@
   [(set_attr "type" "fcmov,fcmov,icmov,icmov")
    (set_attr "mode" "SF,SF,SI,SI")])
 
+;; Don't do conditional moves with memory inputs
+(define_peephole2
+  [(match_scratch:MODEF 2 "r")
+   (set (match_operand:MODEF 0 "register_and_not_any_fp_reg_operand")
+	(if_then_else:MODEF (match_operator 1 "fcmov_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_dup 0)
+	  (match_operand:MODEF 3 "memory_operand")))]
+  "(<MODE>mode != DFmode || TARGET_64BIT)
+   && TARGET_80387 && TARGET_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 0) (match_dup 2)))])
+
+(define_peephole2
+  [(match_scratch:MODEF 2 "r")
+   (set (match_operand:MODEF 0 "register_and_not_any_fp_reg_operand")
+	(if_then_else:MODEF (match_operator 1 "fcmov_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:MODEF 3 "memory_operand")
+	  (match_dup 0)))]
+  "(<MODE>mode != DFmode || TARGET_64BIT)
+   && TARGET_80387 && TARGET_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 2) (match_dup 0)))])
+
 ;; All moves in XOP pcmov instructions are 128 bits and hence we restrict
 ;; the scalar versions to have only XMM registers as operands.