
[RFA] Improve strcmp expansion when one input is a constant string.

Message ID a4d5b600-2e21-3fea-17f8-4c2764880409@gmail.com
State New

Commit Message

Jeff Law June 4, 2023, 9:41 p.m. UTC
While investigating a RISC-V backend patch from Jivan I noticed a 
regression in terms of dynamic instruction counts for the omnetpp 
benchmark in spec2017.

https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620577.html
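For reference, a hypothetical reproducer (the original test case isn't 
shown; the literal below is invented, chosen so it has 11 characters plus 
the NUL — matching the char[1:12] type — and starts with 'l' (108), 
matching the const_int -108 in the dumps):

```c
#include <string.h>

/* Hypothetical reproducer: strcmp against a string literal, which GCC
   can expand inline as one compare-and-branch per character.  The
   literal "lighthouses" is a guess -- 11 characters plus the NUL gives
   char[12], and 'l' is 108, matching the -108 in the RTL dumps.  */
int
f (const char *x)
{
  return strcmp (x, "lighthouses");
}
```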

The code we generate with Jivan's patch at expansion time looks like 
this for each character in the input string:



(insn 6 5 7 (set (reg:SI 137)
         (zero_extend:SI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM 
<char[1:12]> [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 -1
      (nil))

(insn 7 6 8 (set (reg:DI 138)
         (sign_extend:DI (plus:SI (reg:SI 137)
                 (const_int -108 [0xffffffffffffff94])))) "j.c":5:11 -1
      (nil))

(insn 8 7 9 (set (reg:SI 136)
         (subreg/s/u:SI (reg:DI 138) 0)) "j.c":5:11 -1
      (expr_list:REG_EQUAL (plus:SI (reg:SI 137)
             (const_int -108 [0xffffffffffffff94]))
         (nil)))

(insn 9 8 10 (set (reg:DI 139)
         (sign_extend:DI (reg:SI 136))) "j.c":5:11 -1
      (nil))

(jump_insn 10 9 11 (set (pc)
         (if_then_else (ne (reg:DI 139)
                 (const_int 0 [0]))
             (label_ref 64)
             (pc))) "j.c":5:11 -1
      (nil))


Ignore insn 9.  fwprop will turn it into a trivial copy from r138->r139 
which will ultimately propagate away.


All the paths eventually transfer control to the label in question, 
either by jumping or by falling through on the last character.  After a 
bit of cleanup by fwprop & friends we have:



> (insn 6 3 7 2 (set (reg:SI 137 [ MEM <char[1:12]> [(void *)x_2(D)] ])
>         (zero_extend:SI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM <char[1:12]> [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 114 {zero_extendqisi2}
>      (nil)) 
> (insn 7 6 8 2 (set (reg:DI 138)
>         (sign_extend:DI (plus:SI (reg:SI 137 [ MEM <char[1:12]> [(void *)x_2(D)] ])
>                 (const_int -108 [0xffffffffffffff94])))) "j.c":5:11 6 {addsi3_extended}
>      (expr_list:REG_DEAD (reg:SI 137 [ MEM <char[1:12]> [(void *)x_2(D)] ])
>         (nil)))
> (insn 8 7 10 2 (set (reg:SI 136 [ MEM <char[1:12]> [(void *)x_2(D)]+11 ])
>         (subreg/s/u:SI (reg:DI 138) 0)) "j.c":5:11 180 {*movsi_internal}
>      (nil))
> (jump_insn 10 8 73 2 (set (pc)
>         (if_then_else (ne (reg:DI 138)
>                 (const_int 0 [0]))
>             (label_ref 64)
>             (pc))) "j.c":5:11 243 {*branchdi}
>      (expr_list:REG_DEAD (reg:DI 138)
>         (int_list:REG_BR_PROB 536870916 (nil)))
>  -> 64)


insn 8 is the result of wanting the ultimate result of the strcmp to be 
an "int" (SImode).  Note that (reg 136) is the result of the strcmp.  
It gets set in each fragment of code that compares one element in the 
string.  It's also live after the strcmp sequence.  As a result, 
combine isn't going to be able to clean this up.

Note how (reg 136) comes into existence while (reg 138) is still live, 
and even though (reg 136) is a copy of (reg 138), IRA doesn't have the 
necessary code to determine that the regs do not conflict.  As a result 
(reg 136) and (reg 138) must be allocated different hard registers and 
we get code like this:

>         lbu     a5,0(a0)        # 6     [c=28 l=4]  zero_extendqisi2/1
>         addiw   a5,a5,-108      # 7     [c=8 l=4]  addsi3_extended/1
>         mv      a4,a5   # 8     [c=4 l=4]  *movsi_internal/0
>         bne     a5,zero,.L2     # 10    [c=4 l=4]  *branchdi

Note the annoying "mv".


Rather than do a conversion for each character, we could do each step in 
word_mode and do the conversion once at the end of the whole sequence.
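In C terms (a sketch of the semantics only, not of the GCC internals — 
the function and parameter names here are made up), the revised 
expansion computes:

```c
/* Sketch of what the proposed inline strcmp expansion computes: keep
   the per-character difference in a word-mode-sized accumulator (long
   here) and narrow to int exactly once, at the end of the whole
   sequence, instead of narrowing after every character.  */
int
strcmp_const_sketch (const char *x, const char *cst, int len)
{
  long diff = 0;                /* word_mode intermediate result */
  for (int i = 0; i < len; i++)
    {
      /* Zero-extend the loads, subtract.  */
      diff = (long) (unsigned char) x[i] - (unsigned char) cst[i];
      if (diff != 0)
        break;                  /* jump to the common label */
    }
  return (int) diff;            /* single narrowing conversion */
}
```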

So for each character we expand to:

> (insn 6 5 7 (set (reg:DI 138)
>         (zero_extend:DI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM <char[1:12]> [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 -1
>      (nil))
> 
> (insn 7 6 8 (set (reg:DI 137)
>         (plus:DI (reg:DI 138)
>             (const_int -108 [0xffffffffffffff94]))) "j.c":5:11 -1
>      (nil))
> 
> (jump_insn 8 7 9 (set (pc)
>         (if_then_else (ne (reg:DI 137)
>                 (const_int 0 [0]))
>             (label_ref 41)
>             (pc))) "j.c":5:11 -1
>      (nil))

Good.  Then at the end of the sequence we have:
> (code_label 41 40 42 2 (nil) [0 uses])
> 
> (insn 42 41 43 (set (reg:SI 136)
>         (subreg:SI (reg:DI 137) 0)) "j.c":5:11 -1
>      (nil))

Which seems like exactly what we want.  At the assembly level we get:
         lbu     a5,0(a0)        # 6     [c=28 l=4]  zero_extendqidi2/1
         addi    a0,a5,-108      # 7     [c=4 l=4]  adddi3/1
         bne     a0,zero,.L2     # 8     [c=4 l=4]  *branchdi
[ ... ]

At the end of the sequence we realize the narrowing subreg followed by 
an extension isn't necessary and just remove them.

The ultimate result is omnetpp goes from a small regression to a small 
overall improvement with Jivan's patch.

Bootstrapped and regression tested on x86.  Also built and ran spec2017 
on riscv64.

OK for the trunk?

Jeff

Comments

Richard Biener June 5, 2023, 6:29 a.m. UTC | #1
On Sun, Jun 4, 2023 at 11:41 PM Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
> [ ... ]

But then, for example, x86 has smaller encodings for byte ops, and while
widening is easily done later, truncation is not.

Btw, you failed to update the overall function comment which lists
the conversions applied.

Note I would have expected to use the mode of the load so we truly
elide some extensions; using word_mode looks like just another
mode here?  The key to note is probably

      op0 = convert_modes (mode, unit_mode, op0, 1);
      op1 = convert_modes (mode, unit_mode, op1, 1);
      rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
                                      result, 1, OPTAB_WIDEN);

which uses OPTAB_WIDEN - wouldn't it be better to pass in the
unconverted modes and leave the decision which mode to use
to OPTAB_WIDEN?  Should we somehow query the target for
the smallest mode from unit_mode it can do both the MINUS
and the compare?

Thanks,
Richard.

>
> Jeff
Jeff Law June 5, 2023, 1:37 p.m. UTC | #2
On 6/5/23 00:29, Richard Biener wrote:

> 
> But then for example x86 has smaller encoding for byte ops and while
> widening is easily done later, truncation is not.
Which probably argues we need to be checking costs.

> 
> Btw, you failed to update the overall function comment which lists
> the conversions applied.
ACK.  It occurred to me when I woke up today that I also failed to 
handle the case where word_mode is actually smaller than an int.

> 
> Note I would have expected to use the mode of the load so we truly
> elide some extensions, using word_mode looks like just another
> mode here?  The key to note is probably
> 
>        op0 = convert_modes (mode, unit_mode, op0, 1);
>        op1 = convert_modes (mode, unit_mode, op1, 1);
>        rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
>                                        result, 1, OPTAB_WIDEN);
On many (most?) targets the loads can cheaply extend to word_mode.

> 
> which uses OPTAB_WIDEN - wouldn't it be better to pass in the
> unconverted modes and leave the decision which mode to use
> to OPTAB_WIDEN?  Should we somehow query the target for
> the smallest mode from unit_mode it can do both the MINUS
> and the compare?
I'll play with it.  My worry is that the optabs are going to leave 
things in SImode.  RV64 has 32 bit add/subtract which implicitly sign 
extends the result to 64 bits and Jivan's patch models that behavior 
with generally very good results.  But again, I'll play with it.

I do agree that we need to be looking at cost modeling more in here.  So 
I'll poke at that too.


jeff
Jeff Law June 5, 2023, 6:41 p.m. UTC | #3
On 6/5/23 00:29, Richard Biener wrote:

> 
> But then for example x86 has smaller encoding for byte ops and while
> widening is easily done later, truncation is not.
Sadly, the x86 costing looks totally bogus here.  We actually emit the 
exact same code for a QImode load vs a zero-extending load from QI to 
SI.  But the costing is different and would tend to prefer QImode.  That 
in turn is going to force an extension at the end of the sequence which 
would be a regression relative to the current code.  Additionally we may 
get partial register stalls for the byte ops to implement the comparison 
steps.

The net result is that querying the backend's costs would do the exact 
opposite of what I think we want on x86.  One could argue the x86 
maintainers should improve this situation...

> 
> Note I would have expected to use the mode of the load so we truly
> elide some extensions, using word_mode looks like just another
> mode here?  The key to note is probably
> 
>        op0 = convert_modes (mode, unit_mode, op0, 1);
>        op1 = convert_modes (mode, unit_mode, op1, 1);
>        rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
>                                        result, 1, OPTAB_WIDEN);
> 
> which uses OPTAB_WIDEN - wouldn't it be better to pass in the
> unconverted modes and leave the decision which mode to use
> to OPTAB_WIDEN?  Should we somehow query the target for
> the smallest mode from unit_mode it can do both the MINUS
> and the compare?
And avoiding OPTAB_WIDEN isn't going to help rv64 at all.  The core 
issue being that we do define 32bit ops.  With Jivan's patch those 32bit 
ops expose the sign extending nature.  So a 32bit add would look 
something like

(set (temp:DI) (sign_extend:DI (plus:SI (op:SI) (op:SI))))
(set (res:SI) (subreg:SI (temp:DI) 0))

Where we mark the subreg with SUBREG_PROMOTED_VAR_P.
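The C-level analogue of that RTL pair (a sketch, assuming the usual 
RV64 codegen; the function name is made up) is a 32-bit add whose 
result is immediately widened — free on RV64 because addw already 
produces the sign-extended 64-bit value:

```c
/* On RV64, "addw" computes the 32-bit sum and sign-extends it into the
   full 64-bit register, so the widening below costs no extra
   instruction.  Jivan's patch models exactly that: the DImode
   sign_extend wraps the SImode plus, as in the RTL pair above.  */
long
add_then_widen (int a, int b)
{
  int sum = a + b;      /* (plus:SI ...)                        */
  return (long) sum;    /* (sign_extend:DI ...) -- free on RV64 */
}
```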


I'm not sure the best way to proceed now.  I could just put this on the 
back-burner as it's RISC-V specific and the gains elsewhere dwarf this 
issue.


jeff
Richard Biener June 6, 2023, 6:47 a.m. UTC | #4
On Mon, Jun 5, 2023 at 8:41 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
> [ ... ]

I wonder if there's some more generic target macro we can key the
behavior off - SLOW_BYTE_ACCESS isn't a good fit, WORD_REGISTER_OPERATIONS
is maybe closer but its exact implications are unknown to me.  Maybe
there's something else as well ...

The point about OPTAB_WIDEN above was that I wonder why we
extend 'op0' and 'op1' before emitting the binop when we allow WIDEN
anyway.  Yes, we want the result in 'mode' (but why?  As you say we
can extend at the end) and there's likely no way to tell expand_simple_binop
to "expand as needed and not narrow the result".  So I wonder if we should
emulate that somehow (also taking into consideration the compare).

Richard.

>
> jeff
Jeff Law June 7, 2023, 3:02 a.m. UTC | #5
On 6/6/23 00:47, Richard Biener wrote:

> 
> I wonder if there's some more generic target macro we can key the
> behavior off - SLOW_BYTE_ACCESS isn't a good fit, WORD_REGISTER_OPERATIONS
> is maybe closer but it's exact implications are unknown to me.  Maybe
> there's something else as well ...
LOAD_EXTEND_OP might help here, at least on some targets.  Though not on 
x86.

> 
> The point about OPTAB_WIDEN above was that I wonder why we
> extend 'op0' and 'op1' before emitting the binop when we allow WIDEN
> anyway. 
Ahh.  I misunderstood.  However, I think dropping the pre-widening will 
result in byte ops on x86 which may not be wise given the partial 
register stall problem that exists on some variants.



> Yes, we want the result in 'mode' (but why?  As you say we
> can extend at the end) and there's likely no way to tell expand_simple_binop
> to "expand as needed and not narrow the result".  So I wonder if we should
> emulate that somehow (also taking into consideration the compare).
That's what I felt I was starting to build.  Essentially looking at 
costing (and probably other stuff eventually, like the ability to 
compare/branch on narrower modes) to make a determination about whether 
or not to do the operations in narrow or wider modes.  With the costing 
so mucked up on x86 though, I'm hesitant to pursue this path further at 
this time.

Jeff

Patch

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 8400adaf5b4..f2e0d3b7d7f 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -7135,6 +7135,9 @@  inline_string_cmp (rtx target, tree var_str, const char *const_str,
   scalar_int_mode unit_mode
     = as_a <scalar_int_mode> TYPE_MODE (unit_type_node);
 
+  /* We do the intermediate steps in WORD_MODE, then convert
+     to mode at the end of the sequence.  */
+  rtx intermediate_result = gen_reg_rtx (word_mode);
   start_sequence ();
 
   for (unsigned HOST_WIDE_INT i = 0; i < length; i++)
@@ -7145,25 +7148,27 @@  inline_string_cmp (rtx target, tree var_str, const char *const_str,
       rtx op0 = (const_str_n == 1) ? const_rtx : var_rtx;
       rtx op1 = (const_str_n == 1) ? var_rtx : const_rtx;
 
-      op0 = convert_modes (mode, unit_mode, op0, 1);
-      op1 = convert_modes (mode, unit_mode, op1, 1);
-      rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
-				      result, 1, OPTAB_WIDEN);
+      op0 = convert_modes (word_mode, unit_mode, op0, 1);
+      op1 = convert_modes (word_mode, unit_mode, op1, 1);
+      rtx diff = expand_simple_binop (word_mode, MINUS, op0, op1,
+				      intermediate_result, 1, OPTAB_WIDEN);
 
-      /* Force the difference into result register.  We cannot reassign
-	 result here ("result = diff") or we may end up returning
-	 uninitialized result when expand_simple_binop allocates a new
-	 pseudo-register for returning.  */
-      if (diff != result)
-	emit_move_insn (result, diff);
+      /* Force the difference into intermediate result register.  We cannot
+	 reassign the intermediate result here ("intermediate_result = diff")
+	 or we may end up returning uninitialized result when
+	 expand_simple_binop allocates a new pseudo-register for returning.  */
+      if (diff != intermediate_result)
+	emit_move_insn (intermediate_result, diff);
 
       if (i < length - 1)
-	emit_cmp_and_jump_insns (result, CONST0_RTX (mode), NE, NULL_RTX,
-	    			 mode, true, ne_label);
+	emit_cmp_and_jump_insns (intermediate_result, CONST0_RTX (word_mode),
+				 NE, NULL_RTX, word_mode, true, ne_label);
       offset += GET_MODE_SIZE (unit_mode);
     }
 
   emit_label (ne_label);
+  /* Now convert the intermediate result to the final result.  */
+  emit_move_insn (result, gen_lowpart (mode, intermediate_result));
   rtx_insn *insns = get_insns ();
   end_sequence ();
   emit_insn (insns);