diff mbox series

[i386] PR 104732: Simplify/fix DI mode logic expansion/splitting on -m32.

Message ID 004501d82fb2$d9e77b20$8db67160$@nextmovesoftware.com
State New
Headers show
Series [i386] PR 104732: Simplify/fix DI mode logic expansion/splitting on -m32. | expand

Commit Message

Roger Sayle March 4, 2022, 10:30 a.m. UTC
This clean-up patch resolves PR testsuite/104732, the failure of the recent
test gcc.target/i386/pr100711-1.c on 32-bit Solaris/x86.  Rather than just
tweak the testcase, the proposed approach is to fix the underlying problem
by removing the "TARGET_STV && TARGET_SSE2" conditionals from the DI mode
logical operation expanders and pre-reload splitters in i386.md, which as
I'll show generate inferior code (even a GCC 12 regression) on !TARGET_64BIT
whenever -mno-stv (such as Solaris) or -msse (but not -msse2).

First a little bit of history.  In the beginning, DImode operations on
i386 weren't defined by the machine description, and lowered during RTL
expansion to SI mode operations.  The with PR 65105 in 2015, -mstv was
added, together with a SWIM1248x mode iterator (later renamed to SWIM1248x)
together with several *<code>di3_doubleword post-reload splitters that
made use of register allocation to perform some double word operations
in 64-but XMM registers.  A short while later in 2016, PR 70322 added
similar support for one_cmpldi2.  All of this logic was dependent upon
"!TARGET_64BIT && TARGET_STV && TARGET_SSE2".  With the passing of time,
these conditions became irrelevant when in 2019, it was decided to split
these double-word patterns before reload.
https://gcc.gnu.org/pipermail/gcc-patches/2019-June/523877.html
https://gcc.gnu.org/pipermail/gcc-patches/2019-October/532236.html
Hence the current situation, where on most modern CPU architectures
(where "TARGET_STV && TARGET_SSE2" is true), RTL is expanded with DI
mode operations, that are then split into two SI mode instructions
before reload, except on Solaris and other odd cases, where the splitting
is to two SI mode instructions is done during RTL expansion.  By the
time compilation reaches register allocation both paths in theory
produce identical or similar code, so the vestigial legacy/logic would
appear to be harmless.

Unfortunately, there is one place where this arbitrary choice of how
to lower DI mode doubleword operations is visible to the middle-end,
it controls whether the backend appears to have a suitable optab, and
the presence (or not) of DImode optabs can influence vectorization
cost models and veclower decisions.

The issue (and code quality regression) can be seen in this test case:

typedef long long v2di __attribute__((vector_size (16)));
v2di x;
void foo (long long a)
{
    v2di t = {a, a};
    x = ~t;
}

which when compiled with "-O2 -m32 -msse -march=pentiumpro" produces:

foo:    subl    $28, %esp
        movl    %ebx, 16(%esp)
        movl    32(%esp), %eax
        movl    %esi, 20(%esp)
        movl    36(%esp), %edx
        movl    %edi, 24(%esp)
        movl    %eax, %esi
        movl    %eax, %edi
        movl    %edx, %ebx
        movl    %edx, %ecx
        notl    %esi
        notl    %ebx
        movl    %esi, (%esp)
        notl    %edi
        notl    %ecx
        movl    %ebx, 4(%esp)
        movl    20(%esp), %esi
        movl    %edi, 8(%esp)
        movl    16(%esp), %ebx
        movl    %ecx, 12(%esp)
        movl    24(%esp), %edi
        movss   8(%esp), %xmm1
        movss   12(%esp), %xmm2
        movss   (%esp), %xmm0
        movss   4(%esp), %xmm3
        unpcklps        %xmm2, %xmm1
        unpcklps        %xmm3, %xmm0
        movlhps %xmm1, %xmm0
        movaps  %xmm0, x
        addl    $28, %esp
        ret


Importantly notice the four "notl" instructions.  With this patch:

foo:    subl    $28, %esp
        movl    32(%esp), %edx
        movl    36(%esp), %eax
        notl    %edx
        movl    %edx, (%esp)
        notl    %eax
        movl    %eax, 4(%esp)
        movl    %edx, 8(%esp)
        movl    %eax, 12(%esp)
        movaps  (%esp), %xmm1
        movaps  %xmm1, x
        addl    $28, %esp
        ret

Notice only two "notl" instructions.  Checking with Godbolt.org, GCC
generated 4 NOTs in GCC 4.x and 5.x, 2 NOTs between GCC 6.x and 9.x,
and regressed to 4 NOTs since GCC 10.x [which hopefully qualifies
this clean-up as suitable for stage 4].

Most significantly, this patch allows pr100711-1.c to pass with
-mno-stv, allowing pandn to be used with V2DImode on Solaris/x86.
Fingers-crossed this should reduce the number of discrepancies
that Rainer Orth encounters supporting Solaris/x86.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, with/without --target_board='unix{-m32\ -march=
cascadelake}' with no new failures.  Ok for mainline?


2022-03-04  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	PR testsuite/104732
	* config/i386/i386.md (SWIM1248s): Include DI mode unconditionally.
	(*anddi3_doubleword): Remove && TARGET_STV && TARGET_SSE2 condition,
	i.e. always split on !TARGET_64BIT.
	(*<any_or>di3_doubleword): Likewise.
	(*one_cmpldi2_doubleword): Likewise.

gcc/testsuite/ChangeLog
	PR testsuite/104732
	* gcc.target/i386/pr104732.c: New test case.


Thanks in advance,
Roger
--

Comments

Uros Bizjak March 4, 2022, 11:08 a.m. UTC | #1
On Fri, Mar 4, 2022 at 11:30 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This clean-up patch resolves PR testsuite/104732, the failure of the recent
> test gcc.target/i386/pr100711-1.c on 32-bit Solaris/x86.  Rather than just
> tweak the testcase, the proposed approach is to fix the underlying problem
> by removing the "TARGET_STV && TARGET_SSE2" conditionals from the DI mode
> logical operation expanders and pre-reload splitters in i386.md, which as
> I'll show generate inferior code (even a GCC 12 regression) on !TARGET_64BIT
> whenever -mno-stv (such as Solaris) or -msse (but not -msse2).
>
> First a little bit of history.  In the beginning, DImode operations on
> i386 weren't defined by the machine description, and lowered during RTL
> expansion to SI mode operations.  The with PR 65105 in 2015, -mstv was
> added, together with a SWIM1248x mode iterator (later renamed to SWIM1248x)
> together with several *<code>di3_doubleword post-reload splitters that
> made use of register allocation to perform some double word operations
> in 64-but XMM registers.  A short while later in 2016, PR 70322 added
> similar support for one_cmpldi2.  All of this logic was dependent upon
> "!TARGET_64BIT && TARGET_STV && TARGET_SSE2".  With the passing of time,
> these conditions became irrelevant when in 2019, it was decided to split
> these double-word patterns before reload.
> https://gcc.gnu.org/pipermail/gcc-patches/2019-June/523877.html
> https://gcc.gnu.org/pipermail/gcc-patches/2019-October/532236.html
> Hence the current situation, where on most modern CPU architectures
> (where "TARGET_STV && TARGET_SSE2" is true), RTL is expanded with DI
> mode operations, that are then split into two SI mode instructions
> before reload, except on Solaris and other odd cases, where the splitting
> is to two SI mode instructions is done during RTL expansion.  By the
> time compilation reaches register allocation both paths in theory
> produce identical or similar code, so the vestigial legacy/logic would
> appear to be harmless.
>
> Unfortunately, there is one place where this arbitrary choice of how
> to lower DI mode doubleword operations is visible to the middle-end,
> it controls whether the backend appears to have a suitable optab, and
> the presence (or not) of DImode optabs can influence vectorization
> cost models and veclower decisions.
>
> The issue (and code quality regression) can be seen in this test case:
>
> typedef long long v2di __attribute__((vector_size (16)));
> v2di x;
> void foo (long long a)
> {
>     v2di t = {a, a};
>     x = ~t;
> }
>
> which when compiled with "-O2 -m32 -msse -march=pentiumpro" produces:
>
> foo:    subl    $28, %esp
>         movl    %ebx, 16(%esp)
>         movl    32(%esp), %eax
>         movl    %esi, 20(%esp)
>         movl    36(%esp), %edx
>         movl    %edi, 24(%esp)
>         movl    %eax, %esi
>         movl    %eax, %edi
>         movl    %edx, %ebx
>         movl    %edx, %ecx
>         notl    %esi
>         notl    %ebx
>         movl    %esi, (%esp)
>         notl    %edi
>         notl    %ecx
>         movl    %ebx, 4(%esp)
>         movl    20(%esp), %esi
>         movl    %edi, 8(%esp)
>         movl    16(%esp), %ebx
>         movl    %ecx, 12(%esp)
>         movl    24(%esp), %edi
>         movss   8(%esp), %xmm1
>         movss   12(%esp), %xmm2
>         movss   (%esp), %xmm0
>         movss   4(%esp), %xmm3
>         unpcklps        %xmm2, %xmm1
>         unpcklps        %xmm3, %xmm0
>         movlhps %xmm1, %xmm0
>         movaps  %xmm0, x
>         addl    $28, %esp
>         ret
>
>
> Importantly notice the four "notl" instructions.  With this patch:
>
> foo:    subl    $28, %esp
>         movl    32(%esp), %edx
>         movl    36(%esp), %eax
>         notl    %edx
>         movl    %edx, (%esp)
>         notl    %eax
>         movl    %eax, 4(%esp)
>         movl    %edx, 8(%esp)
>         movl    %eax, 12(%esp)
>         movaps  (%esp), %xmm1
>         movaps  %xmm1, x
>         addl    $28, %esp
>         ret
>
> Notice only two "notl" instructions.  Checking with Godbolt.org, GCC
> generated 4 NOTs in GCC 4.x and 5.x, 2 NOTs between GCC 6.x and 9.x,
> and regressed to 4 NOTs since GCC 10.x [which hopefully qualifies
> this clean-up as suitable for stage 4].
>
> Most significantly, this patch allows pr100711-1.c to pass with
> -mno-stv, allowing pandn to be used with V2DImode on Solaris/x86.
> Fingers-crossed this should reduce the number of discrepancies
> that Rainer Orth encounters supporting Solaris/x86.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, with/without --target_board='unix{-m32\ -march=
> cascadelake}' with no new failures.  Ok for mainline?

The idea was to leave decomposition of double-word operations to the
generic middle-end, where simplification and propagation of constants
will be handled in a generic way. However, several releases later,
these simplifications were also introduced to STV-enabeld patterns.
So, if the latter approach is demonstrably better, then I see no
problem to enable it for all targets, not only STV-enabled ones.

>
> 2022-03-04  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR testsuite/104732
>         * config/i386/i386.md (SWIM1248s): Include DI mode unconditionally.
>         (*anddi3_doubleword): Remove && TARGET_STV && TARGET_SSE2 condition,
>         i.e. always split on !TARGET_64BIT.
>         (*<any_or>di3_doubleword): Likewise.
>         (*one_cmpldi2_doubleword): Likewise.
>
> gcc/testsuite/ChangeLog
>         PR testsuite/104732
>         * gcc.target/i386/pr104732.c: New test case.

OK with a nit below.

Thanks,
Uros.

-;; Math-dependant integer modes with DImode (enabled for 32bit with STV).
+;; Math-dependant integer modes with DImode.
 (define_mode_iterator SWIM1248s
  [(QI "TARGET_QIMODE_MATH")
  (HI "TARGET_HIMODE_MATH")
- SI (DI "TARGET_64BIT || (TARGET_STV && TARGET_SSE2)")])
+ SI DI])

Please rename this iterator to SWIM1248x, as is the case with other
iterators where DImode is used unconditionally.
diff mbox series

Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 5e0a980..8eae4b0 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1079,11 +1079,11 @@ 
 			       (HI "TARGET_HIMODE_MATH")
 			       SI])
 
-;; Math-dependant integer modes with DImode (enabled for 32bit with STV).
+;; Math-dependant integer modes with DImode.
 (define_mode_iterator SWIM1248s
 	[(QI "TARGET_QIMODE_MATH")
 	 (HI "TARGET_HIMODE_MATH")
-	 SI (DI "TARGET_64BIT || (TARGET_STV && TARGET_SSE2)")])
+	 SI DI])
 
 ;; Math-dependant single word integer modes without QImode.
 (define_mode_iterator SWIM248 [(HI "TARGET_HIMODE_MATH")
@@ -9733,7 +9733,7 @@ 
 	 (match_operand:DI 1 "nonimmediate_operand")
 	 (match_operand:DI 2 "x86_64_szext_general_operand")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT && TARGET_STV && TARGET_SSE2
+  "!TARGET_64BIT
    && ix86_binary_operator_ok (AND, DImode, operands)
    && ix86_pre_reload_split ()"
   "#"
@@ -10349,7 +10349,7 @@ 
 	 (match_operand:DI 1 "nonimmediate_operand")
 	 (match_operand:DI 2 "x86_64_szext_general_operand")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT && TARGET_STV && TARGET_SSE2
+  "!TARGET_64BIT
    && ix86_binary_operator_ok (<CODE>, DImode, operands)
    && ix86_pre_reload_split ()"
   "#"
@@ -11435,7 +11435,7 @@ 
 (define_insn_and_split "*one_cmpldi2_doubleword"
   [(set (match_operand:DI 0 "nonimmediate_operand")
 	(not:DI (match_operand:DI 1 "nonimmediate_operand")))]
-  "!TARGET_64BIT && TARGET_STV && TARGET_SSE2
+  "!TARGET_64BIT
    && ix86_unary_operator_ok (NOT, DImode, operands)
    && ix86_pre_reload_split ()"
   "#"
diff --git a/gcc/testsuite/gcc.target/i386/pr104732.c b/gcc/testsuite/gcc.target/i386/pr104732.c
new file mode 100644
index 0000000..c8954366c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr104732.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile { target ia32 } } */
+/* { dg-options "-O2 -m32 -msse -march=pentiumpro" } */
+
+typedef long long v2di __attribute__((vector_size (16)));
+
+v2di x;
+
+void foo (long long a)
+{
+    v2di t = {a, a};
+    x = ~t;
+}
+
+/* { dg-final { scan-assembler-times "notl" 2 } } */