diff mbox series

i386: Optimize vpsubusw compared to 0 into vpcmpleuw or vpcmpnltuw[PR96906]

Message ID CAMZc-bzDDRgssko3D72J79aNVw3YwLN1gVedx5X_zWMrPL7=gA@mail.gmail.com
State New
Headers show
Series i386: Optimize vpsubusw compared to 0 into vpcmpleuw or vpcmpnltuw[PR96906] | expand

Commit Message

Hongtao Liu Nov. 30, 2020, 1:11 p.m. UTC
Hi:
  This patch is quite similar like what jakub did in
https://gcc.gnu.org/pipermail/gcc-patches/2020-November/560151.html
but for target avx512bw.

.i.e. for -mavx512bw -mavx512vl transform code from

        vpsubusw        %xmm1, %xmm0, %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        vpcmpw  $0, %xmm1, %xmm0, %k0
to
        vpcmpleuw       %xmm1, %xmm0, %k0

   Bootstrapped/regtested on x86_64-linux is ok.

gcc/ChangeLog
        PR target/96906
         * config/i386/sse.md
        (<avx512>_ucmp<mode>3<mask_scalar_merge_name>): Add a new
        define_split after this insn.

gcc/testsuite/ChangeLog

        * gcc.target/i386/avx512bw-pr96906-1.c: New test.
        * gcc.target/i386/pr96906-1.c: Add -mno-avx512f.

Comments

Jakub Jelinek Nov. 30, 2020, 1:46 p.m. UTC | #1
On Mon, Nov 30, 2020 at 09:11:10PM +0800, Hongtao Liu wrote:
> +;; PR96906 - optimize vpsubusw compared to 0 into vpcmpleuw or vpcmpnltuw.
> +(define_split
> +  [(set (match_operand:<avx512fmaskmode> 0 "register_operand")
> +        (unspec:<avx512fmaskmode>
> +          [(us_minus:VI12_AVX512VL
> +             (match_operand:VI12_AVX512VL 1 "vector_operand")
> +             (match_operand:VI12_AVX512VL 2 "vector_operand"))
> +           (match_operand:VI12_AVX512VL 3 "const0_operand")
> +           (match_operand:SI 4 "const0_operand")]
> +          UNSPEC_PCMP))]
> +  "TARGET_AVX512BW && ix86_binary_operator_ok (US_MINUS, <MODE>mode, operands)"

Too long line, please wrap it.
Also, INTVAL (operands[4]) == 0 is EQ comparison, can't we handle also
NE (i.e. INTVAL (operands[4]) == 4?
I.e. replace the "const0_operand" in there with "const_0_to_7_operand"
and check in conditions that (INTVAL (operands[4]) & 3) == 0.

> +  [(const_int 0)]
> +  {
> +    /* LE: 2, NLT: 5.  */
> +    rtx cmp_predicate = GEN_INT (2);
> +    if (MEM_P (operands[1]))
> +      {
> +        std::swap (operands[1], operands[2]);
> +        cmp_predicate = GEN_INT (5);

For INTVAL (operands[4]) == 4 it would then be cmp_predictate NLE: 4 resp.
LT: 3 I think.

Also, this handles only UNSPEC_PCMP, can't we handle UNSPEC_UNSIGNED_PCMP
too?  I mean, for equality comparisons it doesn't really matter if we have
signed or unsigned == or !=.  And for unsigned
x == 0U is equivalent to x <= 0U, and x != 0U equivalent to x > 0U.

	Jakub
Hongtao Liu Dec. 1, 2020, 4:49 a.m. UTC | #2
On Mon, Nov 30, 2020 at 9:46 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Mon, Nov 30, 2020 at 09:11:10PM +0800, Hongtao Liu wrote:
> > +;; PR96906 - optimize vpsubusw compared to 0 into vpcmpleuw or vpcmpnltuw.
> > +(define_split
> > +  [(set (match_operand:<avx512fmaskmode> 0 "register_operand")
> > +        (unspec:<avx512fmaskmode>
> > +          [(us_minus:VI12_AVX512VL
> > +             (match_operand:VI12_AVX512VL 1 "vector_operand")
> > +             (match_operand:VI12_AVX512VL 2 "vector_operand"))
> > +           (match_operand:VI12_AVX512VL 3 "const0_operand")
> > +           (match_operand:SI 4 "const0_operand")]
> > +          UNSPEC_PCMP))]
> > +  "TARGET_AVX512BW && ix86_binary_operator_ok (US_MINUS, <MODE>mode, operands)"
>
> Too long line, please wrap it.
> Also, INTVAL (operands[4]) == 0 is EQ comparison, can't we handle also
> NE (i.e. INTVAL (operands[4]) == 4?
> I.e. replace the "const0_operand" in there with "const_0_to_7_operand"
> and check in conditions that (INTVAL (operands[4]) & 3) == 0.
>
> > +  [(const_int 0)]
> > +  {
> > +    /* LE: 2, NLT: 5.  */
> > +    rtx cmp_predicate = GEN_INT (2);
> > +    if (MEM_P (operands[1]))
> > +      {
> > +        std::swap (operands[1], operands[2]);
> > +        cmp_predicate = GEN_INT (5);
>
> For INTVAL (operands[4]) == 4 it would then be cmp_predictate NLE: 4 resp.
> LT: 3 I think.
>
> Also, this handles only UNSPEC_PCMP, can't we handle UNSPEC_UNSIGNED_PCMP
> too?  I mean, for equality comparisons it doesn't really matter if we have
> signed or unsigned == or !=.  And for unsigned
> x == 0U is equivalent to x <= 0U, and x != 0U equivalent to x > 0U.
>
>         Jakub
>

Yes, Update patch.

+(define_int_iterator UNSPEC_PCMP_ITER
+  [UNSPEC_PCMP UNSPEC_UNSIGNED_PCMP])
+
+(define_int_attr pcmp_signed_mask
+  [(UNSPEC_PCMP "3") (UNSPEC_UNSIGNED_PCMP "1")])
+
+;; PR96906 - optimize vpsubusw compared to 0 into vpcmpleuw or vpcmpnltuw.
+;; For signed comparison, handle EQ 0: NEQ 4,
+;; for unsigned comparison extra handle LE:2, NLE:6, equivalent to EQ and NEQ.
+
+(define_split
+  [(set (match_operand:<avx512fmaskmode> 0 "register_operand")
+       (unspec:<avx512fmaskmode>
+         [(us_minus:VI12_AVX512VL
+            (match_operand:VI12_AVX512VL 1 "vector_operand")
+            (match_operand:VI12_AVX512VL 2 "vector_operand"))
+          (match_operand:VI12_AVX512VL 3 "const0_operand")
+          (match_operand:SI 4 "const_0_to_7_operand")]
+         UNSPEC_PCMP_ITER))]
+  "TARGET_AVX512BW
+  && ix86_binary_operator_ok (US_MINUS, <MODE>mode, operands)
+  && (INTVAL (operands[4]) & <pcmp_signed_mask>) == 0"
+  [(const_int 0)]
+  {
+    bool neq_p = INTVAL (operands[4]) >> 2;
+    /* LE: 2, NLT: 5, NLE: 6, LT: 1  */
+    rtx cmp_predicate = neq_p ? GEN_INT (6) : GEN_INT (2);
+    if (MEM_P (operands[1]))
+      {
+       std::swap (operands[1], operands[2]);
+       cmp_predicate = neq_p ? GEN_INT (1) : GEN_INT (5);
+      }
+    emit_insn (gen_<avx512>_ucmp<mode>3 (operands[0], operands[1],
+                                       operands[2], cmp_predicate));
+    DONE;
+  })
+
Jakub Jelinek Dec. 2, 2020, 6:22 p.m. UTC | #3
On Tue, Dec 01, 2020 at 12:49:03PM +0800, Hongtao Liu via Gcc-patches wrote:
> +    bool neq_p = INTVAL (operands[4]) >> 2;
> +    /* LE: 2, NLT: 5, NLE: 6, LT: 1  */
> +    rtx cmp_predicate = neq_p ? GEN_INT (6) : GEN_INT (2);
> +    if (MEM_P (operands[1]))
> +      {
> +	std::swap (operands[1], operands[2]);
> +	cmp_predicate = neq_p ? GEN_INT (1) : GEN_INT (5);
> +      }
> +    emit_insn (gen_<avx512>_ucmp<mode>3 (operands[0], operands[1],
> +					operands[2], cmp_predicate));

I'd suggest instead:
+    /* LE: 2, NLT: 5, NLE: 6, LT: 1  */
+    int cmp_predicate = 2; /* LE  */
+    if (MEM_P (operands[1]))
+      {
+	std::swap (operands[1], operands[2]);
+	cmp_predicate = 5; /* NLT (GE)  */
+      }
+    if ((INTVAL (operands[4]) & 4) != 0)
+      cmp_predictate ^= 4; /* Invert the comparison to NLE (GT) or LT.  */
+    emit_insn (gen_<avx512>_ucmp<mode>3 (operands[0], operands[1], operands[2],
+					 GEN_INT (cmp_predicate)));
so that you don't create the rtx CONST_INTs in 4 places and don't do that
unnecessarily when you will need another constant.

Otherwise LGTM, thanks.

	Jakub
Hongtao Liu Dec. 3, 2020, 5:50 a.m. UTC | #4
On Thu, Dec 3, 2020 at 2:22 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Tue, Dec 01, 2020 at 12:49:03PM +0800, Hongtao Liu via Gcc-patches wrote:
> > +    bool neq_p = INTVAL (operands[4]) >> 2;
> > +    /* LE: 2, NLT: 5, NLE: 6, LT: 1  */
> > +    rtx cmp_predicate = neq_p ? GEN_INT (6) : GEN_INT (2);
> > +    if (MEM_P (operands[1]))
> > +      {
> > +     std::swap (operands[1], operands[2]);
> > +     cmp_predicate = neq_p ? GEN_INT (1) : GEN_INT (5);
> > +      }
> > +    emit_insn (gen_<avx512>_ucmp<mode>3 (operands[0], operands[1],
> > +                                     operands[2], cmp_predicate));
>
> I'd suggest instead:
> +    /* LE: 2, NLT: 5, NLE: 6, LT: 1  */
> +    int cmp_predicate = 2; /* LE  */
> +    if (MEM_P (operands[1]))
> +      {
> +       std::swap (operands[1], operands[2]);
> +       cmp_predicate = 5; /* NLT (GE)  */
> +      }
> +    if ((INTVAL (operands[4]) & 4) != 0)
> +      cmp_predictate ^= 4; /* Invert the comparison to NLE (GT) or LT.  */
> +    emit_insn (gen_<avx512>_ucmp<mode>3 (operands[0], operands[1], operands[2],
> +                                        GEN_INT (cmp_predicate)));
> so that you don't create the rtx CONST_INTs in 4 places and don't do that
> unnecessarily when you will need another constant.
Thanks for the review,committed.
>
> Otherwise LGTM, thanks.
>
>         Jakub
>
diff mbox series

Patch

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 4aad462f882..eebc3750584 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -3006,6 +3006,30 @@  (define_insn
"<avx512>_ucmp<mode>3<mask_scalar_merge_name>"
    (set_attr "prefix" "evex")
    (set_attr "mode" "<sseinsnmode>")])

+;; PR96906 - optimize vpsubusw compared to 0 into vpcmpleuw or vpcmpnltuw.
+(define_split
+  [(set (match_operand:<avx512fmaskmode> 0 "register_operand")
+        (unspec:<avx512fmaskmode>
+          [(us_minus:VI12_AVX512VL
+             (match_operand:VI12_AVX512VL 1 "vector_operand")
+             (match_operand:VI12_AVX512VL 2 "vector_operand"))
+           (match_operand:VI12_AVX512VL 3 "const0_operand")
+           (match_operand:SI 4 "const0_operand")]
+          UNSPEC_PCMP))]
+  "TARGET_AVX512BW && ix86_binary_operator_ok (US_MINUS, <MODE>mode, operands)"
+  [(const_int 0)]
+  {
+    /* LE: 2, NLT: 5.  */
+    rtx cmp_predicate = GEN_INT (2);
+    if (MEM_P (operands[1]))
+      {
+        std::swap (operands[1], operands[2]);
+        cmp_predicate = GEN_INT (5);
+      }
+    emit_insn (gen_<avx512>_ucmp<mode>3 (operands[0], operands[1],
+                                        operands[2], cmp_predicate));
+    DONE;
+  })
+
 (define_insn "avx512f_vmcmp<mode>3<round_saeonly_name>"
   [(set (match_operand:<avx512fmaskmode> 0 "register_operand" "=k")
         (and:<avx512fmaskmode>
diff --git a/gcc/testsuite/gcc.target/i386/avx512bw-pr96906-1.c
b/gcc/testsuite/gcc.target/i386/avx512bw-pr96906-1.c
new file mode 100644
index 00000000000..ae7ec7abed1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512bw-pr96906-1.c
@@ -0,0 +1,80 @@ 
+/* PR target/96906 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512bw -mavx512vl -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)vpcmpub[ \t]*\$2} 6 } } */
+/* { dg-final { scan-assembler-times {(?n)vpcmpuw[ \t]*\$2} 6 } } */
+
+
+#include<immintrin.h>
+
+__mmask8
+ff1 (__m128i x, __m128i y)
+{
+  return _mm_cmp_epi16_mask (_mm_subs_epu16 (x, y), _mm_setzero_si128 (), 0);
+}
+
+__mmask8
+ff2 (__m128i x, __m128i y)
+{
+  return _mm_cmple_epu16_mask (x, y);
+}
+
+__mmask16
+ff3 (__m128i x, __m128i y)
+{
+  return _mm_cmp_epi8_mask (_mm_subs_epu8 (x, y), _mm_setzero_si128 (), 0);
+}
+
+__mmask16
+ff4 (__m128i x, __m128i y)
+{
+  return _mm_cmple_epu8_mask (x, y);
+}
+
+__mmask16
+ff5 (__m256i x, __m256i y)
+{
+  return _mm256_cmp_epi16_mask (_mm256_subs_epu16 (x, y),
_mm256_setzero_si256 (), 0);
+}
+
+__mmask16
+ff6 (__m256i x, __m256i y)
+{
+  return _mm256_cmple_epu16_mask (x, y);
+}
+
+__mmask32
+ff7 (__m256i x, __m256i y)
+{
+  return _mm256_cmp_epi8_mask (_mm256_subs_epu8 (x, y),
_mm256_setzero_si256 (), 0);
+}
+
+__mmask32
+ff8 (__m256i x, __m256i y)
+{
+  return _mm256_cmple_epu8_mask (x, y);
+}
+
+__mmask32
+ff9 (__m512i x, __m512i y)
+{
+  return _mm512_cmp_epi16_mask (_mm512_subs_epu16 (x, y),
_mm512_setzero_si512 (), 0);
+}
+
+__mmask32
+ff10 (__m512i x, __m512i y)
+{
+  return _mm512_cmple_epu16_mask (x, y);
+}
+
+__mmask64
+ff11 (__m512i x, __m512i y)
+{
+  return _mm512_cmp_epi8_mask (_mm512_subs_epu8 (x, y),
_mm512_setzero_si512 (), 0);
+}
+
+__mmask64
+ff12 (__m512i x, __m512i y)
+{
+  return _mm512_cmple_epu8_mask (x, y);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr96906-1.c
b/gcc/testsuite/gcc.target/i386/pr96906-1.c
index 9d836eb2bdd..b1b41bf522d 100644
--- a/gcc/testsuite/gcc.target/i386/pr96906-1.c
+++ b/gcc/testsuite/gcc.target/i386/pr96906-1.c
@@ -1,6 +1,6 @@ 
 /* PR target/96906 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -mavx2" } */
+/* { dg-options "-O2 -mavx2 -mno-avx512f" } */
 /* { dg-final { scan-assembler-times "\tvpminub\[^\n\r]*xmm" 2 } } */
 /* { dg-final { scan-assembler-times "\tvpminuw\[^\n\r]*xmm" 2 } } */
 /* { dg-final { scan-assembler-times "\tvpminub\[^\n\r]*ymm" 2 } } */