diff mbox series

[i386] Fix PR 88998, bad codegen with mmx instructions

Message ID CAFULd4ZN1jGnbRdjhKTjNVDqEAMYdU3nEKxuAfB=AhyWRuPkOg@mail.gmail.com
State New
Headers show
Series [i386] Fix PR 88998, bad codegen with mmx instructions | expand

Commit Message

Uros Bizjak Jan. 23, 2019, 7:22 p.m. UTC
Attached patch adds SSE alternatives to sse2_cvtpi2pd, sse2_cvtpd2pi
and sse2_cvttpd2pi to avoid MMX registers when e.g. _mm_cvtepi32_pd
intrinsics is used. Without the patch, the testcase compiles to (-O2
-mavx):

_Z7prepareii:
        vmovd   %edi, %xmm1
        vpinsrd $1, %esi, %xmm1, %xmm0
        movdq2q %xmm0, %mm0
        cvtpi2pd        %mm0, %xmm0
        vhaddpd %xmm0, %xmm0, %xmm0
        ret

while patched gcc generates:

        vmovd   %edi, %xmm1
        vpinsrd $1, %esi, %xmm1, %xmm0
        vcvtdq2pd       %xmm0, %xmm0
        vhaddpd %xmm0, %xmm0, %xmm0
        ret

The later avoids transition of FPU to MMX mode.

2019-01-23  Uroš Bizjak  <ubizjak@gmail.com>

    PR target/88998
    * config/i386/sse.md (sse2_cvtpi2pd): Add SSE alternatives.
    Disparage MMX alternative.
    (sse2_cvtpd2pi): Ditto.
    (sse2_cvttpd2pi): Ditto.

testsuite/ChangeLog:

2019-01-23  Uroš Bizjak  <ubizjak@gmail.com>

    PR target/88998
    * g++.target/i386/pr88998.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Committed to mainline, will be backported to release branches.

Uros.

Comments

H.J. Lu Jan. 23, 2019, 7:52 p.m. UTC | #1
On Wed, Jan 23, 2019 at 11:22 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Attached patch adds SSE alternatives to sse2_cvtpi2pd, sse2_cvtpd2pi
> and sse2_cvttpd2pi to avoid MMX registers when e.g. _mm_cvtepi32_pd
> intrinsics is used. Without the patch, the testcase compiles to (-O2
> -mavx):
>
> _Z7prepareii:
>         vmovd   %edi, %xmm1
>         vpinsrd $1, %esi, %xmm1, %xmm0
>         movdq2q %xmm0, %mm0
>         cvtpi2pd        %mm0, %xmm0
>         vhaddpd %xmm0, %xmm0, %xmm0
>         ret
>
> while patched gcc generates:
>
>         vmovd   %edi, %xmm1
>         vpinsrd $1, %esi, %xmm1, %xmm0
>         vcvtdq2pd       %xmm0, %xmm0
>         vhaddpd %xmm0, %xmm0, %xmm0
>         ret
>
> The later avoids transition of FPU to MMX mode.
>

Is that possible to support 64-bit vectors, like V2SI, with SSE
instead of MMX for x86-64 under a command-line switch?
Uros Bizjak Jan. 23, 2019, 8:26 p.m. UTC | #2
On Wed, Jan 23, 2019 at 8:52 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Jan 23, 2019 at 11:22 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > Attached patch adds SSE alternatives to sse2_cvtpi2pd, sse2_cvtpd2pi
> > and sse2_cvttpd2pi to avoid MMX registers when e.g. _mm_cvtepi32_pd
> > intrinsics is used. Without the patch, the testcase compiles to (-O2
> > -mavx):
> >
> > _Z7prepareii:
> >         vmovd   %edi, %xmm1
> >         vpinsrd $1, %esi, %xmm1, %xmm0
> >         movdq2q %xmm0, %mm0
> >         cvtpi2pd        %mm0, %xmm0
> >         vhaddpd %xmm0, %xmm0, %xmm0
> >         ret
> >
> > while patched gcc generates:
> >
> >         vmovd   %edi, %xmm1
> >         vpinsrd $1, %esi, %xmm1, %xmm0
> >         vcvtdq2pd       %xmm0, %xmm0
> >         vhaddpd %xmm0, %xmm0, %xmm0
> >         ret
> >
> > The later avoids transition of FPU to MMX mode.
> >
>
> Is that possible to support 64-bit vectors, like V2SI, with SSE
> instead of MMX for x86-64 under a command-line switch?

SSE registers are preferred for 64bit vectors (see number of
exclamation marks in *mov<mode>_internal in mmx.md), so the value will
be passed in SSE regs unless there is pure MMX instruction, where due
to missing SSE alternatives, RA will need to allocate MMX register.

Uros.
H.J. Lu Jan. 23, 2019, 8:56 p.m. UTC | #3
On Wed, Jan 23, 2019 at 12:27 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Wed, Jan 23, 2019 at 8:52 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Wed, Jan 23, 2019 at 11:22 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > Attached patch adds SSE alternatives to sse2_cvtpi2pd, sse2_cvtpd2pi
> > > and sse2_cvttpd2pi to avoid MMX registers when e.g. _mm_cvtepi32_pd
> > > intrinsics is used. Without the patch, the testcase compiles to (-O2
> > > -mavx):
> > >
> > > _Z7prepareii:
> > >         vmovd   %edi, %xmm1
> > >         vpinsrd $1, %esi, %xmm1, %xmm0
> > >         movdq2q %xmm0, %mm0
> > >         cvtpi2pd        %mm0, %xmm0
> > >         vhaddpd %xmm0, %xmm0, %xmm0
> > >         ret
> > >
> > > while patched gcc generates:
> > >
> > >         vmovd   %edi, %xmm1
> > >         vpinsrd $1, %esi, %xmm1, %xmm0
> > >         vcvtdq2pd       %xmm0, %xmm0
> > >         vhaddpd %xmm0, %xmm0, %xmm0
> > >         ret
> > >
> > > The later avoids transition of FPU to MMX mode.
> > >
> >
> > Is that possible to support 64-bit vectors, like V2SI, with SSE
> > instead of MMX for x86-64 under a command-line switch?
>
> SSE registers are preferred for 64bit vectors (see number of
> exclamation marks in *mov<mode>_internal in mmx.md), so the value will
> be passed in SSE regs unless there is pure MMX instruction, where due
> to missing SSE alternatives, RA will need to allocate MMX register.
>

[hjl@gnu-cfl-1 v64-1]$ cat x.i
typedef float __v2sf __attribute__ ((__vector_size__ (8)));

extern __v2sf z;

void
add (__v2sf x, __v2sf y)
{
  z = x + y;
}
[hjl@gnu-cfl-1 v64-1]$ make x.s
/export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/  -O2
-S x.i
[hjl@gnu-cfl-1 v64-1]$ cat x.s
.file "x.i"
.text
.p2align 4
.globl add
.type add, @function
add:
.LFB0:
.cfi_startproc
movlps %xmm0, -32(%rsp)
movss -32(%rsp), %xmm0
movlps %xmm1, -40(%rsp)
addss -40(%rsp), %xmm0
movss %xmm0, -56(%rsp)
movss -28(%rsp), %xmm0
addss -36(%rsp), %xmm0
movss %xmm0, -52(%rsp)
movq -56(%rsp), %rax
movq %rax, z(%rip)
ret
.cfi_endproc
.LFE0:
.size add, .-add
.ident "GCC: (GNU) 9.0.0 20190122 (experimental)"
.section .note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 v64-1]$

I am expecting:

addps %xmm1, %xmm0
movlps %xmm0, z(%rip)
retq
diff mbox series

Patch

Index: config/i386/sse.md
===================================================================
--- config/i386/sse.md	(revision 268188)
+++ config/i386/sse.md	(working copy)
@@ -4997,37 +4997,49 @@ 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
 (define_insn "sse2_cvtpi2pd"
-  [(set (match_operand:V2DF 0 "register_operand" "=x,x")
-	(float:V2DF (match_operand:V2SI 1 "nonimmediate_operand" "y,m")))]
+  [(set (match_operand:V2DF 0 "register_operand" "=v,x")
+	(float:V2DF (match_operand:V2SI 1 "nonimmediate_operand" "vBm,?!y")))]
   "TARGET_SSE2"
-  "cvtpi2pd\t{%1, %0|%0, %1}"
+  "@
+   %vcvtdq2pd\t{%1, %0|%0, %1}
+   cvtpi2pd\t{%1, %0|%0, %1}"
   [(set_attr "type" "ssecvt")
-   (set_attr "unit" "mmx,*")
-   (set_attr "prefix_data16" "1,*")
+   (set_attr "unit" "*,mmx")
+   (set_attr "prefix_data16" "*,1")
+   (set_attr "prefix" "maybe_vex,*")
    (set_attr "mode" "V2DF")])
 
 (define_insn "sse2_cvtpd2pi"
-  [(set (match_operand:V2SI 0 "register_operand" "=y")
-	(unspec:V2SI [(match_operand:V2DF 1 "nonimmediate_operand" "xm")]
+  [(set (match_operand:V2SI 0 "register_operand" "=v,?!y")
+	(unspec:V2SI [(match_operand:V2DF 1 "nonimmediate_operand" "vBm,xm")]
 		     UNSPEC_FIX_NOTRUNC))]
   "TARGET_SSE2"
-  "cvtpd2pi\t{%1, %0|%0, %1}"
+  "@
+   * return TARGET_AVX ? \"vcvtpd2dq{x}\t{%1, %0|%0, %1}\" : \"cvtpd2dq\t{%1, %0|%0, %1}\";
+   cvtpd2pi\t{%1, %0|%0, %1}"
   [(set_attr "type" "ssecvt")
-   (set_attr "unit" "mmx")
+   (set_attr "unit" "*,mmx")
+   (set_attr "amdfam10_decode" "double")
+   (set_attr "athlon_decode" "vector")
    (set_attr "bdver1_decode" "double")
-   (set_attr "btver2_decode" "direct")
-   (set_attr "prefix_data16" "1")
-   (set_attr "mode" "DI")])
+   (set_attr "prefix_data16" "*,1")
+   (set_attr "prefix" "maybe_vex,*")
+   (set_attr "mode" "TI")])
 
 (define_insn "sse2_cvttpd2pi"
-  [(set (match_operand:V2SI 0 "register_operand" "=y")
-	(fix:V2SI (match_operand:V2DF 1 "nonimmediate_operand" "xm")))]
+  [(set (match_operand:V2SI 0 "register_operand" "=v,?!y")
+	(fix:V2SI (match_operand:V2DF 1 "nonimmediate_operand" "vBm,xm")))]
   "TARGET_SSE2"
-  "cvttpd2pi\t{%1, %0|%0, %1}"
+  "@
+   * return TARGET_AVX ? \"vcvttpd2dq{x}\t{%1, %0|%0, %1}\" : \"cvttpd2dq\t{%1, %0|%0, %1}\";
+   cvttpd2pi\t{%1, %0|%0, %1}"
   [(set_attr "type" "ssecvt")
-   (set_attr "unit" "mmx")
+   (set_attr "unit" "*,mmx")
+   (set_attr "amdfam10_decode" "double")
+   (set_attr "athlon_decode" "vector")
    (set_attr "bdver1_decode" "double")
-   (set_attr "prefix_data16" "1")
+   (set_attr "prefix_data16" "*,1")
+   (set_attr "prefix" "maybe_vex,*")
    (set_attr "mode" "TI")])
 
 (define_insn "sse2_cvtsi2sd"
Index: testsuite/g++.target/i386/pr88998.C
===================================================================
--- testsuite/g++.target/i386/pr88998.C	(nonexistent)
+++ testsuite/g++.target/i386/pr88998.C	(working copy)
@@ -0,0 +1,31 @@ 
+// PR target/88998
+// { dg-do run { target sse2_runtime } }
+// { dg-options "-O2 -msse2 -mfpmath=387" }
+// { dg-require-effective-target c++11 }
+
+#include <cassert>
+#include <unordered_map>
+#include <x86intrin.h>
+
+double
+__attribute__((noinline))
+prepare (int a, int b)
+{
+  __m128i is = _mm_setr_epi32 (a, b, 0, 0);
+  __m128d ds = _mm_cvtepi32_pd (is);
+  return ds[0] + ds[1];
+}
+
+int
+main (int, char **)
+{
+  double d = prepare (1, 2);
+
+  std::unordered_map < int, int >m;
+  m.insert ({0, 0});
+  m.insert ({1, 1});
+  assert (m.load_factor () <= m.max_load_factor ());
+
+  assert (d == 3);
+  return 0;
+}