
[x86] PR target/105930: Split *xordi3_doubleword after reload.

Message ID 00e701d8862c$bb2d2290$318767b0$@nextmovesoftware.com

Commit Message

Roger Sayle June 22, 2022, 11:39 a.m. UTC
This patch addresses PR target/105930, an ia32 stack frame size
regression in high-register-pressure, XOR-rich cryptography functions
reported by Linus Torvalds.  The underlying problem is that once the
limited number of registers on the x86 is exhausted, the register
allocator has to decide which values to spill, and some eviction
choices lead to much poorer code; these consequences are difficult
to predict in advance.

The patch below, which splits xordi3_doubleword and iordi3_doubleword
after reload (instead of before), significantly reduces the amount of
spill code and stack frame size, in what might appear to be an arbitrary
choice.

My explanation of this behaviour is that the mixing of pre-reload split
SImode instructions and post-reload split DImode instructions is
confusing some of the heuristics used by reload.  One might think
that splitting early gives the register allocator more freedom to
use available registers, but in practice the constraint that double
word values occupy consecutive registers (when ultimately used as a
DImode value) is the greater constraint.  Instead, I believe that, in
this case, the pseudo registers used in memory addressing appear to be
double counted for split SImode instructions compared to unsplit
DImode instructions.  For the reduced test case in comment #13, this
leads to %eax being used to hold the long-lived argument pointer "v",
blocking the use of the ax:dx pair for processing double word values.
The important lines are at the very top of the assembly output:

GCC 11  [use %ecx to address memory, require a 24-byte stack frame]
        sub     esp, 24
        mov     ecx, DWORD PTR [esp+40]

GCC 12 [use %eax to address memory, require a 44-byte stack frame]
        sub     esp, 44
        mov     eax, DWORD PTR [esp+64]

Jakub's alternative proposed patch in comment #17 is to improve
consistency by splitting more instructions (rotates) before reload,
which shows a small improvement, but not to GCC v11 levels.

I have some follow-up patches (based on other approaches I've tried),
including splitting rotations by 32 after reload and supporting
TImode operations via <dwi> (this same pattern is mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596201.html),
but the patch below is the core minimal fix that's hopefully
suitable for benchmarking and possibly backporting to the 12 branch.
I believe that changes to the register allocator itself, tweaking how
stack slots are assigned and which values can be cheaply materialized,
are out of scope for the release branches.


I'm curious what folks (especially Uros and Jakub) think?
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  CSiBE also shows a (minor) code size reduction.
It would be great if someone could benchmark this patch; alternatively,
it can be baked on mainline to let the automatic benchmarking evaluate
it, and the patch reverted if any performance issues are observed.

Thoughts?


2022-06-22  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	PR target/105930
	* config/i386/i386.md (*<any_or>di3_doubleword): Split after
	reload.  Use rtx_equal_p to avoid creating memory-to-memory moves,
	and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0).
	(define_insn <any_or><mode>_1): Renamed from *<any_or><mode>_1
	to provide gen_<any_or>si_1 functions.

Roger
--

Comments

Uros Bizjak June 23, 2022, 7:03 a.m. UTC | #1
On Wed, Jun 22, 2022 at 1:39 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch addresses PR target/105930, an ia32 stack frame size
> regression in high-register-pressure, XOR-rich cryptography functions
> reported by Linus Torvalds.  The underlying problem is that once the
> limited number of registers on the x86 is exhausted, the register
> allocator has to decide which values to spill, and some eviction
> choices lead to much poorer code; these consequences are difficult
> to predict in advance.
>
> The patch below, which splits xordi3_doubleword and iordi3_doubleword
> after reload (instead of before), significantly reduces the amount of
> spill code and stack frame size, in what might appear to be an arbitrary
> choice.
>
> My explanation of this behaviour is that the mixing of pre-reload split
> SImode instructions and post-reload split DImode instructions is
> confusing some of the heuristics used by reload.  One might think
> that splitting early gives the register allocator more freedom to
> use available registers, but in practice the constraint that double
> word values occupy consecutive registers (when ultimately used as a
> DImode value) is the greater constraint.  Instead, I believe that, in
> this case, the pseudo registers used in memory addressing appear to be
> double counted for split SImode instructions compared to unsplit
> DImode instructions.  For the reduced test case in comment #13, this
> leads to %eax being used to hold the long-lived argument pointer "v",
> blocking the use of the ax:dx pair for processing double word values.
> The important lines are at the very top of the assembly output:
>
> GCC 11  [use %ecx to address memory, require a 24-byte stack frame]
>         sub     esp, 24
>         mov     ecx, DWORD PTR [esp+40]
>
> GCC 12 [use %eax to address memory, require a 44-byte stack frame]
>         sub     esp, 44
>         mov     eax, DWORD PTR [esp+64]

The pre-reload vs. post-reload approach is a constant source of
various regressions, since each has its benefits and drawbacks.
Please note that arithmetic and shift instructions remain on the
post-reload approach, since various regressions w.r.t. the register
allocator were observed.

I think that the best approach for now would be to move *all*
double-word splitters to the post-reload approach. As you exposed
above, mixing pre-reload and post-reload approaches confuses reload
too much due to the blocking of the part of a register pair.

> Jakub's alternative proposed patch in comment #17 is to improve
> consistency by splitting more instructions (rotates) before reload,
> which shows a small improvement, but not to GCC v11 levels.

So, I guess the experiment shows that we should revert to post-reload
splitters, while keeping the improvements the splitters gained from
the post-reload implementation.

> I have some follow-up patches (based on other approaches I've tried),
> including splitting rotations by 32 after reload and supporting
> TImode operations via <dwi> (this same pattern is mentioned in
> https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596201.html),
> but the patch below is the core minimal fix that's hopefully
> suitable for benchmarking and possibly backporting to the 12 branch.
> I believe that changes to the register allocator itself, to tweak how
> stack slots are assigned and which values can be cheaply materialized
> are out-of-scope for the release branches.

I assume that and/andnot should also be moved to a post-reload split approach.
>
>
> I'm curious what folks (especially Uros and Jakub) think?
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  CSiBE also shows a (minor) code size reduction.
> It would be great if someone could benchmark this patch; alternatively,
> it can be baked on mainline to let the automatic benchmarking evaluate
> it, and the patch reverted if any performance issues are observed.
>
> Thoughts?
>
>
> 2022-06-22  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR target/105930
>         * config/i386/i386.md (*<any_or>di3_doubleword): Split after
>         reload.  Use rtx_equal_p to avoid creating memory-to-memory moves,
>         and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0).
>         (define_insn <any_or><mode>_1): Renamed from *<any_or><mode>_1
>         to provide gen_<any_or>si_1 functions.

Is there a reason to not use ix86_expand_{binary|unary}_operator? It
fits perfectly here.

Otherwise, the patch looks OK to me.

Uros.

Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 3093cb5..537fba21 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -10539,48 +10539,61 @@ 
   "ix86_expand_binary_operator (<CODE>, <MODE>mode, operands); DONE;")
 
 (define_insn_and_split "*<code>di3_doubleword"
-  [(set (match_operand:DI 0 "nonimmediate_operand")
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=ro,r")
 	(any_or:DI
-	 (match_operand:DI 1 "nonimmediate_operand")
-	 (match_operand:DI 2 "x86_64_szext_general_operand")))
+	 (match_operand:DI 1 "nonimmediate_operand" "0,0")
+	 (match_operand:DI 2 "x86_64_szext_general_operand" "re,o")))
    (clobber (reg:CC FLAGS_REG))]
   "!TARGET_64BIT
-   && ix86_binary_operator_ok (<CODE>, DImode, operands)
-   && ix86_pre_reload_split ()"
+   && ix86_binary_operator_ok (<CODE>, DImode, operands)"
   "#"
-  "&& 1"
+  "&& reload_completed"
   [(const_int 0)]
 {
+  /* This insn may disappear completely when operands[2] == const0_rtx
+     and operands[0] == operands[1], which requires a NOTE_INSN_DELETED.  */
+  bool emit_insn_deleted_note_p = false;
+
   split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
 
   if (operands[2] == const0_rtx)
-    emit_move_insn (operands[0], operands[1]);
+    {
+      if (!rtx_equal_p (operands[0], operands[1]))
+	emit_move_insn (operands[0], operands[1]);
+      else
+	emit_insn_deleted_note_p = true;
+    }
   else if (operands[2] == constm1_rtx)
     {
       if (<CODE> == IOR)
 	emit_move_insn (operands[0], constm1_rtx);
       else
-	ix86_expand_unary_operator (NOT, SImode, &operands[0]);
+	emit_insn (gen_one_cmplsi2 (operands[0], operands[1]));
     }
   else
-    ix86_expand_binary_operator (<CODE>, SImode, &operands[0]);
+    emit_insn (gen_<code>si_1 (operands[0], operands[1], operands[2]));
 
   if (operands[5] == const0_rtx)
-    emit_move_insn (operands[3], operands[4]);
+    {
+      if (!rtx_equal_p (operands[3], operands[4]))
+	emit_move_insn (operands[3], operands[4]);
+      else if (emit_insn_deleted_note_p)
+	emit_note (NOTE_INSN_DELETED);
+    }
   else if (operands[5] == constm1_rtx)
     {
       if (<CODE> == IOR)
 	emit_move_insn (operands[3], constm1_rtx);
       else
-	ix86_expand_unary_operator (NOT, SImode, &operands[3]);
+	emit_insn (gen_one_cmplsi2 (operands[3], operands[4]));
     }
   else
-    ix86_expand_binary_operator (<CODE>, SImode, &operands[3]);
+    emit_insn (gen_<code>si_1 (operands[3], operands[4], operands[5]));
 
   DONE;
 })
 
-(define_insn "*<code><mode>_1"
+(define_insn "<code><mode>_1"
   [(set (match_operand:SWI248 0 "nonimmediate_operand" "=rm,r,?k")
 	(any_or:SWI248
 	 (match_operand:SWI248 1 "nonimmediate_operand" "%0,0,k")