
[x86] Fix combine for conditional instructions.

Message ID CAFULd4aeN69+bzzKxKyHR93N3O7K47B3=QmQFhSR7ghJpqjK=g@mail.gmail.com
State New

Commit Message

Uros Bizjak Dec. 12, 2012, 6:32 p.m. UTC
On Wed, Dec 12, 2012 at 3:45 PM, Richard Biener
<richard.guenther@gmail.com> wrote:

>> I assume that this is not the right way to fix such a simple
>> performance anomaly, since we need to do redundant work - combine the
>> load into the conditional and then split it back in peephole2? Does
>> that look reasonable? Why should we produce an inefficient instruction
>> that must be split later?
>
> Well, either don't allow this instruction variant from the start, or allow
> the extra freedom for register allocation this creates.  It doesn't make
> sense to just reject it being generated by combine - that doesn't address
> when it materializes in another way.

Please check the attached patch; it implements this limitation correctly:
- keeps memory operands for -Os or cold parts of the executable
- doesn't increase register pressure
- handles all situations where a memory operand can propagate into the RTX
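
For example (a minimal sketch, not the actual testcase):

/* The unconditional load of *p feeds only the conditional move, so
   combine can merge it into a memory operand ("cmovne (%edx), %eax"
   or similar).  The new peephole2 splits it back into a load to a
   scratch register plus a register-register cmov when the insn is
   optimized for speed.  */
int
f (int c, const int *p, int x)
{
  int t = *p;
  return c ? t : x;
}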

Yuri, can you please check if this patch fixes the performance problem for you?

BTW: I would really like to add a TARGET_USE_CMOVE_WITH_MEMOP
target macro and conditionalize the new peephole2 patterns on it.
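
Something along these lines, perhaps (purely a sketch - the tune flag
and its wiring below are hypothetical):

/* Hypothetical macro, modeled on the existing ix86_tune_features[]
   mechanism; the X86_TUNE_* flag does not exist yet.  The new
   peephole2s would then also test it (negated, so targets that are
   happy with memory operands in cmov keep them).  */
#define TARGET_USE_CMOVE_WITH_MEMOP \
  ix86_tune_features[X86_TUNE_USE_CMOVE_WITH_MEMOP]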

Uros.

Comments

Richard Henderson Dec. 12, 2012, 7:32 p.m. UTC | #1
On 12/12/2012 10:32 AM, Uros Bizjak wrote:
> Please check the attached patch; it implements this limitation correctly:
> - keeps memory operands for -Os or cold parts of the executable
> - doesn't increase register pressure
> - handles all situations where a memory operand can propagate into the RTX

I agree this is the right way to attack this problem.


r~
Richard Biener Dec. 13, 2012, 9:51 a.m. UTC | #2
On Wed, Dec 12, 2012 at 7:32 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Wed, Dec 12, 2012 at 3:45 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>
>>> I assume that this is not the right way to fix such a simple
>>> performance anomaly, since we need to do redundant work - combine the
>>> load into the conditional and then split it back in peephole2? Does
>>> that look reasonable? Why should we produce an inefficient instruction
>>> that must be split later?
>>
>> Well, either don't allow this instruction variant from the start, or allow
>> the extra freedom for register allocation this creates.  It doesn't make
>> sense to just reject it being generated by combine - that doesn't address
>> when it materializes in another way.
>
> Please check the attached patch; it implements this limitation correctly:
> - keeps memory operands for -Os or cold parts of the executable
> - doesn't increase register pressure
> - handles all situations where a memory operand can propagate into the RTX
>
> Yuri, can you please check if this patch fixes the performance problem for you?
>
> BTW: I would really like to add a TARGET_USE_CMOVE_WITH_MEMOP
> target macro and conditionalize the new peephole2 patterns on it.

Looks good to me.  I believe optimize_insn_for_speed_p () only works
reliably during RTL expansion, as it relies on crtl->maybe_hot_insn_p
being better than function-granular.  I'm quite sure the split pass
does not adjust this via rtl_profile_for_bb () (it probably should,
as those predicates are definitely the correct ones to use).  I think
passes that do not adjust this get whatever is left over from previous
passes (instead of the default).

Honza, I think the pass manager should call default_rtl_profile () before each
RTL pass to avoid this, no?
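
Something like the following, perhaps (just a sketch; the exact spot
in passes.c:execute_one_pass is a guess):

  /* Reset to the function-granular default before each RTL pass so a
     stale per-bb profile from a previous pass cannot leak through.  */
  if (pass->type == RTL_PASS)
    default_rtl_profile ();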

Thanks,
Richard.

> Uros.
Uros Bizjak Dec. 13, 2012, 10:20 a.m. UTC | #3
On Thu, Dec 13, 2012 at 10:51 AM, Richard Biener
<richard.guenther@gmail.com> wrote:

>>>> I assume that this is not the right way to fix such a simple
>>>> performance anomaly, since we need to do redundant work - combine the
>>>> load into the conditional and then split it back in peephole2? Does
>>>> that look reasonable? Why should we produce an inefficient instruction
>>>> that must be split later?
>>>
>>> Well, either don't allow this instruction variant from the start, or allow
>>> the extra freedom for register allocation this creates.  It doesn't make
>>> sense to just reject it being generated by combine - that doesn't address
>>> when it materializes in another way.
>>
>> Please check the attached patch; it implements this limitation correctly:
>> - keeps memory operands for -Os or cold parts of the executable
>> - doesn't increase register pressure
>> - handles all situations where a memory operand can propagate into the RTX
>>
>> Yuri, can you please check if this patch fixes the performance problem for you?
>>
>> BTW: I would really like to add a TARGET_USE_CMOVE_WITH_MEMOP
>> target macro and conditionalize the new peephole2 patterns on it.
>
> Looks good to me.  I believe optimize_insn_for_speed_p () only works
> reliably during RTL expansion, as it relies on crtl->maybe_hot_insn_p
> being better than function-granular.  I'm quite sure the split pass
> does not adjust this via rtl_profile_for_bb () (it probably should,
> as those predicates are definitely the correct ones to use).  I think
> passes that do not adjust this get whatever is left over from previous
> passes (instead of the default).
>
> Honza, I think the pass manager should call default_rtl_profile () before each
> RTL pass to avoid this, no?

Please note that we have plenty of existing peephole2s that use the
optimize_insn_for_speed_p predicate.  It is assumed to work ...

Uros.
Richard Biener Dec. 13, 2012, 10:33 a.m. UTC | #4
On Thu, Dec 13, 2012 at 11:20 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Thu, Dec 13, 2012 at 10:51 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>
>>>>> I assume that this is not the right way to fix such a simple
>>>>> performance anomaly, since we need to do redundant work - combine the
>>>>> load into the conditional and then split it back in peephole2? Does
>>>>> that look reasonable? Why should we produce an inefficient instruction
>>>>> that must be split later?
>>>>
>>>> Well, either don't allow this instruction variant from the start, or allow
>>>> the extra freedom for register allocation this creates.  It doesn't make
>>>> sense to just reject it being generated by combine - that doesn't address
>>>> when it materializes in another way.
>>>
>>> Please check the attached patch; it implements this limitation correctly:
>>> - keeps memory operands for -Os or cold parts of the executable
>>> - doesn't increase register pressure
>>> - handles all situations where a memory operand can propagate into the RTX
>>>
>>> Yuri, can you please check if this patch fixes the performance problem for you?
>>>
>>> BTW: I would really like to add a TARGET_USE_CMOVE_WITH_MEMOP
>>> target macro and conditionalize the new peephole2 patterns on it.
>>
>> Looks good to me.  I believe optimize_insn_for_speed_p () only works
>> reliably during RTL expansion, as it relies on crtl->maybe_hot_insn_p
>> being better than function-granular.  I'm quite sure the split pass
>> does not adjust this via rtl_profile_for_bb () (it probably should,
>> as those predicates are definitely the correct ones to use).  I think
>> passes that do not adjust this get whatever is left over from previous
>> passes (instead of the default).
>>
>> Honza, I think the pass manager should call default_rtl_profile () before each
>> RTL pass to avoid this, no?
>
> Please note that we have plenty of existing peephole2s that use the
> optimize_insn_for_speed_p predicate.  It is assumed to work ...

Indeed - it already calls rtl_profile_for_bb.  (It's in recog.c; I had
just grepped for something like "split.c".)

Richard.

> Uros.
Yuri Rumyantsev Dec. 13, 2012, 2:23 p.m. UTC | #5
Hi Guys,

The patch proposed by Uros is useless, since we don't have a free
scratch register to split the memory operand:

;;  regs ever live 	 0[ax] 1[dx] 2[cx] 3[bx] 4[si] 5[di] 6[bp] 7[sp] 17[flags]

...

(insn 96 131 132 7 (set (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])
        (if_then_else:SI (ne (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (mem/f:SI (plus:SI (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
                    (const_int 12 [0xc])) [2 trie_root_23->rlink+0 S4 A32])
            (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])))
routelookup/route_lookup.c:639 940 {*movsicc_noc}
     (expr_list:REG_DEAD (reg:CCZ 17 flags)
        (expr_list:REG_DEAD (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
            (nil))))
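
The insn boils down to source of this shape (a reconstruction from the
MEM annotation; details are approximate):

/* Per the regs-ever-live dump above, every allocatable integer
   register appears taken here, so match_scratch has nothing to
   allocate for the split.  */
if (cond)                          /* sets the CCZ flags            */
  trie_root = trie_root->rlink;    /* cmovne with a memory operand  */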

How can we cope with this? I still think that we can apply my patch
with an additional check on optimizing for speed.

Best regards.
Yuri.

2012/12/12 Richard Henderson <rth@redhat.com>:
> On 12/12/2012 10:32 AM, Uros Bizjak wrote:
>> Please check the attached patch; it implements this limitation correctly:
>> - keeps memory operands for -Os or cold parts of the executable
>> - doesn't increase register pressure
>> - handles all situations where a memory operand can propagate into the RTX
>
> I agree this is the right way to attack this problem.
>
>
> r~
Uros Bizjak Dec. 13, 2012, 2:27 p.m. UTC | #6
On Thu, Dec 13, 2012 at 3:23 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:

> The patch proposed by Uros is useless, since we don't have a free
> scratch register to split the memory operand:
>
> ;;  regs ever live       0[ax] 1[dx] 2[cx] 3[bx] 4[si] 5[di] 6[bp] 7[sp] 17[flags]
>
> ...
>
> (insn 96 131 132 7 (set (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])
>         (if_then_else:SI (ne (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (mem/f:SI (plus:SI (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>                     (const_int 12 [0xc])) [2 trie_root_23->rlink+0 S4 A32])
>             (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])))
> routelookup/route_lookup.c:639 940 {*movsicc_noc}
>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>         (expr_list:REG_DEAD (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>             (nil))))
>
> How can we cope with this? I still think that we can apply my patch
> with an additional check on optimizing for speed.

Can you please post the testcase?

Uros.
Richard Biener Dec. 13, 2012, 2:36 p.m. UTC | #7
On Thu, Dec 13, 2012 at 3:23 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi Guys,
>
> The patch proposed by Uros is useless, since we don't have a free
> scratch register to split the memory operand:
>
> ;;  regs ever live       0[ax] 1[dx] 2[cx] 3[bx] 4[si] 5[di] 6[bp] 7[sp] 17[flags]
>
> ...
>
> (insn 96 131 132 7 (set (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])
>         (if_then_else:SI (ne (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (mem/f:SI (plus:SI (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>                     (const_int 12 [0xc])) [2 trie_root_23->rlink+0 S4 A32])
>             (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])))
> routelookup/route_lookup.c:639 940 {*movsicc_noc}
>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>         (expr_list:REG_DEAD (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>             (nil))))
>
> How can we cope with this? I still think that we can apply my patch
> with an additional check on optimizing for speed.

What's the important benchmark you are register-starved for?

Richard.

> Best regards.
> Yuri.
>
> 2012/12/12 Richard Henderson <rth@redhat.com>:
>> On 12/12/2012 10:32 AM, Uros Bizjak wrote:
>>> Please check the attached patch; it implements this limitation correctly:
>>> - keeps memory operands for -Os or cold parts of the executable
>>> - doesn't increase register pressure
>>> - handles all situations where a memory operand can propagate into the RTX
>>
>> I agree this is the right way to attack this problem.
>>
>>
>> r~
Uros Bizjak Dec. 13, 2012, 2:40 p.m. UTC | #8
On Thu, Dec 13, 2012 at 3:27 PM, Uros Bizjak <ubizjak@gmail.com> wrote:

>> The patch proposed by Uros is useless, since we don't have a free
>> scratch register to split the memory operand:
>>
>> ;;  regs ever live       0[ax] 1[dx] 2[cx] 3[bx] 4[si] 5[di] 6[bp] 7[sp] 17[flags]
>>
>> ...
>>
>> (insn 96 131 132 7 (set (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])
>>         (if_then_else:SI (ne (reg:CCZ 17 flags)
>>                 (const_int 0 [0]))
>>             (mem/f:SI (plus:SI (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>>                     (const_int 12 [0xc])) [2 trie_root_23->rlink+0 S4 A32])
>>             (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])))
>> routelookup/route_lookup.c:639 940 {*movsicc_noc}
>>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>>         (expr_list:REG_DEAD (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>>             (nil))))
>>
>> How can we cope with this? I still think that we can apply my patch
>> with an additional check on optimizing for speed.
>
> Can you please post the testcase?

Also, the peephole2 pass recalculates life information by itself
before every insn, so the "regs ever live" array is not something to
look at. The question was whether the patch solves the runtime
regression for you, not whether it removes all memory operands from
the conditional moves. The split depends on the
optimize_insn_for_speed_p predicate.

Uros.
Yuri Rumyantsev Dec. 13, 2012, 3:02 p.m. UTC | #9
Uros,

We did not see any performance improvement on Atom in 32-bit mode on
routelookup from eembc_2_0 (eembc_1_1).

Best regards.
Yuri.

2012/12/13 Uros Bizjak <ubizjak@gmail.com>:
> On Thu, Dec 13, 2012 at 3:27 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>
>>> The patch proposed by Uros is useless, since we don't have a free
>>> scratch register to split the memory operand:
>>>
>>> ;;  regs ever live       0[ax] 1[dx] 2[cx] 3[bx] 4[si] 5[di] 6[bp] 7[sp] 17[flags]
>>>
>>> ...
>>>
>>> (insn 96 131 132 7 (set (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])
>>>         (if_then_else:SI (ne (reg:CCZ 17 flags)
>>>                 (const_int 0 [0]))
>>>             (mem/f:SI (plus:SI (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>>>                     (const_int 12 [0xc])) [2 trie_root_23->rlink+0 S4 A32])
>>>             (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])))
>>> routelookup/route_lookup.c:639 940 {*movsicc_noc}
>>>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>>>         (expr_list:REG_DEAD (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>>>             (nil))))
>>>
>>> How can we cope with this? I still think that we can apply my patch
>>> with an additional check on optimizing for speed.
>>
>> Can you please post the testcase?
>
> Also, the peephole2 pass recalculates life information by itself
> before every insn, so the "regs ever live" array is not something to
> look at. The question was whether the patch solves the runtime
> regression for you, not whether it removes all memory operands from
> the conditional moves. The split depends on the
> optimize_insn_for_speed_p predicate.
>
> Uros.
Jan Hubicka Dec. 13, 2012, 7:30 p.m. UTC | #10
> On Wed, Dec 12, 2012 at 7:32 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> > On Wed, Dec 12, 2012 at 3:45 PM, Richard Biener
> > <richard.guenther@gmail.com> wrote:
> >
> >>> I assume that this is not the right way to fix such a simple
> >>> performance anomaly, since we need to do redundant work - combine the
> >>> load into the conditional and then split it back in peephole2? Does
> >>> that look reasonable? Why should we produce an inefficient instruction
> >>> that must be split later?
> >>
> >> Well, either don't allow this instruction variant from the start, or allow
> >> the extra freedom for register allocation this creates.  It doesn't make
> >> sense to just reject it being generated by combine - that doesn't address
> >> when it materializes in another way.
> >
> > Please check the attached patch; it implements this limitation correctly:
> > - keeps memory operands for -Os or cold parts of the executable
> > - doesn't increase register pressure
> > - handles all situations where a memory operand can propagate into the RTX
> >
> > Yuri, can you please check if this patch fixes the performance problem for you?
> >
> > BTW: I would really like to add a TARGET_USE_CMOVE_WITH_MEMOP
> > target macro and conditionalize the new peephole2 patterns on it.
> 
> Looks good to me.  I believe optimize_insn_for_speed_p () only works
> reliably during RTL expansion, as it relies on crtl->maybe_hot_insn_p
> being better than function-granular.  I'm quite sure the split pass
> does not adjust this via rtl_profile_for_bb () (it probably should,
> as those predicates are definitely the correct ones to use).  I think
> passes that do not adjust this get whatever is left over from previous
> passes (instead of the default).
> 
> Honza, I think the pass manager should call default_rtl_profile () before each
> RTL pass to avoid this, no?

Yep, we should get back to default_rtl_profile after each pass.  We
should also fix passes that do splitting/expansion without properly
setting the current basic block.

Honza
> 
> Thanks,
> Richard.
> 
> > Uros.
Jan Hubicka Dec. 13, 2012, 7:33 p.m. UTC | #11
> > Honza, I think the pass manager should call default_rtl_profile () before each
> > RTL pass to avoid this, no?
> 
> Please note that we have plenty of existing peephole2s that use the
> optimize_insn_for_speed_p predicate.  It is assumed to work ...

It is set by the peep2 pass:
static void
peephole2_optimize (void)
{
  rtx insn;
  bitmap live;
  int i;
  basic_block bb;

  peep2_do_cleanup_cfg = false;
  peep2_do_rebuild_jump_labels = false;

  df_set_flags (DF_LR_RUN_DCE);
  df_note_add_problem ();
  df_analyze ();

  /* Initialize the regsets we're going to use.  */
  for (i = 0; i < MAX_INSNS_PER_PEEP2 + 1; ++i)
    peep2_insn_data[i].live_before = BITMAP_ALLOC (&reg_obstack);
  live = BITMAP_ALLOC (&reg_obstack);

  FOR_EACH_BB_REVERSE (bb)
    {
      bool past_end = false;
      int pos;

      rtl_profile_for_bb (bb);
      ^^^ here.

Splitters/peepholes should be safe, as the splitting, peephole and
combine passes care about the profile.  What is somewhat unreliable is
expansion done late, like from the loop optimizer.  I am not sure how
many of those we have left.

Honza
> 
> Uros.

Patch

Index: i386.md
===================================================================
--- i386.md	(revision 194451)
+++ i386.md	(working copy)
@@ -16122,6 +16122,31 @@ 
   operands[3] = gen_lowpart (SImode, operands[3]);
 })
 
+;; Don't do conditional moves with memory inputs
+(define_peephole2
+  [(match_scratch:SWI248 2 "r")
+   (set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_dup 0)
+	  (match_operand:SWI248 3 "memory_operand")))]
+  "TARGET_CMOVE && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 0) (match_dup 2)))])
+
+(define_peephole2
+  [(match_scratch:SWI248 2 "r")
+   (set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:SWI248 3 "memory_operand")
+	  (match_dup 0)))]
+  "TARGET_CMOVE && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 2) (match_dup 0)))])
+
 (define_expand "mov<mode>cc"
   [(set (match_operand:X87MODEF 0 "register_operand")
 	(if_then_else:X87MODEF
@@ -16209,6 +16234,35 @@ 
   [(set_attr "type" "fcmov,fcmov,icmov,icmov")
    (set_attr "mode" "SF,SF,SI,SI")])
 
+;; Don't do conditional moves with memory inputs
+(define_peephole2
+  [(match_scratch:MODEF 2 "r")
+   (set (match_operand:MODEF 0 "register_and_not_any_fp_reg_operand")
+	(if_then_else:MODEF (match_operator 1 "fcmov_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_dup 0)
+	  (match_operand:MODEF 3 "memory_operand")))]
+  "(<MODE>mode != DFmode || TARGET_64BIT)
+   && TARGET_80387 && TARGET_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 0) (match_dup 2)))])
+
+(define_peephole2
+  [(match_scratch:MODEF 2 "r")
+   (set (match_operand:MODEF 0 "register_and_not_any_fp_reg_operand")
+	(if_then_else:MODEF (match_operator 1 "fcmov_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:MODEF 3 "memory_operand")
+	  (match_dup 0)))]
+  "(<MODE>mode != DFmode || TARGET_64BIT)
+   && TARGET_80387 && TARGET_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 2) (match_dup 0)))])
+
 ;; All moves in XOP pcmov instructions are 128 bits and hence we restrict
 ;; the scalar versions to have only XMM registers as operands.