From patchwork Mon Jun 7 11:20:07 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Christophe Lyon
X-Patchwork-Id: 1488572
To: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com
Subject: [PATCH 2/2] arm: Auto-vectorization for MVE: add pack/unpack patterns
Date: Mon, 7 Jun 2021 11:20:07 +0000
Message-Id: <20210607112007.8659-2-christophe.lyon@linaro.org>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20210607112007.8659-1-christophe.lyon@linaro.org>
References: <20210607112007.8659-1-christophe.lyon@linaro.org>
List-Id: Gcc-patches mailing list
From: Christophe Lyon
Reply-To: Christophe Lyon

This patch adds vec_unpack_hi_, vec_unpack_lo_, vec_pack_trunc_
patterns for MVE.

It does so by moving the unpack patterns from neon.md to
vec-common.md and adding MVE support to them. The pack expander is
derived from the Neon one (which in turn is renamed to
neon_quad_vec_pack_trunc_).

The patch introduces mve_vec_pack_trunc_ to avoid the need for a
zero-initialized temporary, which would be needed if the
vec_pack_trunc_ expander called @mve_vmovn[bt]q_ instead.

With this patch, we can now vectorize the 16 and 8-bit versions of
vclz and vshl, although the generated code could still be improved.
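To illustrate why the 16-bit case needs unpack/pack at all, here is a
minimal sketch of the kind of loop being vectorized. The function name and
the fixed trip count of 8 are assumptions made for illustration; the real
testcase generates its functions through the FUNC macro. The point is only
that __builtin_clz takes and returns 32-bit values, so each 16-bit element
is widened before the call and narrowed again on the store.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the loop behind test_clz_s16; the name and the
   trip count (8 x 16 bits = one 128-bit vector) are assumptions.  */
void
test_clz_s16_sketch (int16_t *dest, const int16_t *a)
{
  for (int i = 0; i < 8; i++)
    {
      /* __builtin_clz is undefined for a zero argument.  */
      assert (a[i] != 0);
      /* a[i] is promoted to a 32-bit int for the call (the unpack step),
         and the 32-bit result is truncated on the store (the pack step).  */
      dest[i] = (int16_t) __builtin_clz (a[i]);
    }
}
```

Because the count is taken on the promoted 32-bit value, an input of 1
yields 31 here, not 15.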
For test_clz_s16, we now generate
        vldrh.16        q3, [r1]
        vmovlb.s16      q2, q3
        vmovlt.s16      q3, q3
        vclz.i32        q2, q2
        vclz.i32        q3, q3
        vmovnb.i32      q1, q2
        vmovnt.i32      q1, q3
        vstrh.16        q1, [r0]
which could be improved to
        vldrh.16        q3, [r1]
        vclz.i16        q1, q3
        vstrh.16        q1, [r0]
if we could avoid the need for unpack/pack steps.

For reference, clang-12 generates:
        vldrh.s32       q0, [r1]
        vldrh.s32       q1, [r1, #8]
        vclz.i32        q0, q0
        vstrh.32        q0, [r0]
        vclz.i32        q0, q1
        vstrh.32        q0, [r0, #8]

2021-06-03  Christophe Lyon

	gcc/
	* config/arm/mve.md (mve_vmovltq_): Prefix with '@'.
	(mve_vmovlbq_): Likewise.
	(mve_vmovnbq_): Likewise.
	(mve_vmovntq_): Likewise.
	(@mve_vec_pack_trunc_): New pattern.
	* config/arm/neon.md (vec_unpack_hi_): Move to vec-common.md.
	(vec_unpack_lo_): Likewise.
	(vec_pack_trunc_): Rename to neon_quad_vec_pack_trunc_.
	* config/arm/vec-common.md (vec_unpack_hi_): New pattern.
	(vec_unpack_lo_): New.
	(vec_pack_trunc_): New.

	gcc/testsuite/
	* gcc.target/arm/simd/mve-vclz.c: Update expected results.
	* gcc.target/arm/simd/mve-vshl.c: Likewise.
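The unpack/pack steps in the sequence above can be modelled in scalar C as
follows. This is a sketch, not code from the patch: the model_* function
names are hypothetical, and each array stands for one 128-bit Q register.
vmovlb/vmovlt widen the bottom (even-numbered) and top (odd-numbered)
16-bit elements, and vmovnb/vmovnt write truncated results back into the
same positions, so the round trip preserves element order.

```c
#include <stdint.h>

/* vmovlb.s16: sign-extend the bottom (even-numbered) 16-bit elements.  */
void model_vmovlb_s16 (int32_t out[4], const int16_t in[8])
{
  for (int i = 0; i < 4; i++)
    out[i] = in[2 * i];
}

/* vmovlt.s16: sign-extend the top (odd-numbered) 16-bit elements.  */
void model_vmovlt_s16 (int32_t out[4], const int16_t in[8])
{
  for (int i = 0; i < 4; i++)
    out[i] = in[2 * i + 1];
}

/* vmovnb.i32: truncate into the bottom 16-bit elements of the destination,
   leaving the top elements untouched.  */
void model_vmovnb_i32 (int16_t out[8], const int32_t in[4])
{
  for (int i = 0; i < 4; i++)
    out[2 * i] = (int16_t) in[i];
}

/* vmovnt.i32: truncate into the top 16-bit elements of the destination,
   leaving the bottom elements untouched.  */
void model_vmovnt_i32 (int16_t out[8], const int32_t in[4])
{
  for (int i = 0; i < 4; i++)
    out[2 * i + 1] = (int16_t) in[i];
}

/* Unpack then pack with no operation in between: dest ends up equal to
   src, element for element.  */
void model_unpack_pack_roundtrip (int16_t dest[8], const int16_t src[8])
{
  int32_t bottom[4], top[4];
  model_vmovlb_s16 (bottom, src);
  model_vmovlt_s16 (top, src);
  model_vmovnb_i32 (dest, bottom);
  model_vmovnt_i32 (dest, top);
}
```

In the generated code the two vclz.i32 instructions sit between the widen
and narrow halves of this round trip.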
---
 gcc/config/arm/mve.md                        | 20 ++++-
 gcc/config/arm/neon.md                       | 39 +--------
 gcc/config/arm/vec-common.md                 | 89 ++++++++++++++++++++
 gcc/testsuite/gcc.target/arm/simd/mve-vclz.c |  7 +-
 gcc/testsuite/gcc.target/arm/simd/mve-vshl.c |  5 +-
 5 files changed, 114 insertions(+), 46 deletions(-)

diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 99e46d0bc69..b18292c07d3 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_"
 ;;
 ;; [vmovltq_u, vmovltq_s])
 ;;
-(define_insn "mve_vmovltq_"
+(define_insn "@mve_vmovltq_"
   [
    (set (match_operand: 0 "s_register_operand" "=w")
         (unspec: [(match_operand:MVE_3 1 "s_register_operand" "w")]
@@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_"
 ;;
 ;; [vmovlbq_s, vmovlbq_u])
 ;;
-(define_insn "mve_vmovlbq_"
+(define_insn "@mve_vmovlbq_"
   [
    (set (match_operand: 0 "s_register_operand" "=w")
         (unspec: [(match_operand:MVE_3 1 "s_register_operand" "w")]
@@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s"
 ;;
 ;; [vmovnbq_u, vmovnbq_s])
 ;;
-(define_insn "mve_vmovnbq_"
+(define_insn "@mve_vmovnbq_"
   [
    (set (match_operand: 0 "s_register_operand" "=w")
         (unspec: [(match_operand: 1 "s_register_operand" "0")
@@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_"
 ;;
 ;; [vmovntq_s, vmovntq_u])
 ;;
-(define_insn "mve_vmovntq_"
+(define_insn "@mve_vmovntq_"
   [
    (set (match_operand: 0 "s_register_operand" "=w")
         (unspec: [(match_operand: 1 "s_register_operand" "0")
@@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_"
   [(set_attr "type" "mve_move")
 ])
 
+(define_insn "@mve_vec_pack_trunc_"
+ [(set (match_operand: 0 "register_operand" "=&w")
+       (vec_concat:
+         (truncate:
+           (match_operand:MVE_5 1 "register_operand" "w"))
+         (truncate:
+           (match_operand:MVE_5 2 "register_operand" "w"))))]
+ "TARGET_HAVE_MVE"
+ "vmovnb.i %q0, %q1\;vmovnt.i %q0, %q2"
+ [(set_attr "type" "mve_move")]
+)
+
 ;;
 ;; [vmulq_f])
 ;;
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 0fdffaf4ec4..392d9607919 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -5924,43 +5924,6 @@ (define_insn "neon_vec_unpack_hi_"
   [(set_attr "type" "neon_shift_imm_long")]
 )
 
-(define_expand "vec_unpack_hi_"
-  [(match_operand: 0 "register_operand")
-   (SE: (match_operand:VU 1 "register_operand"))]
- "TARGET_NEON && !BYTES_BIG_ENDIAN"
-  {
-   rtvec v = rtvec_alloc (/2) ;
-   rtx t1;
-   int i;
-   for (i = 0; i < (/2); i++)
-     RTVEC_ELT (v, i) = GEN_INT ((/2) + i);
-
-   t1 = gen_rtx_PARALLEL (mode, v);
-   emit_insn (gen_neon_vec_unpack_hi_ (operands[0],
-                                       operands[1],
-                                       t1));
-   DONE;
-  }
-)
-
-(define_expand "vec_unpack_lo_"
-  [(match_operand: 0 "register_operand")
-   (SE: (match_operand:VU 1 "register_operand"))]
- "TARGET_NEON && !BYTES_BIG_ENDIAN"
-  {
-   rtvec v = rtvec_alloc (/2) ;
-   rtx t1;
-   int i;
-   for (i = 0; i < (/2) ; i++)
-     RTVEC_ELT (v, i) = GEN_INT (i);
-   t1 = gen_rtx_PARALLEL (mode, v);
-   emit_insn (gen_neon_vec_unpack_lo_ (operands[0],
-                                       operands[1],
-                                       t1));
-   DONE;
-  }
-)
-
 (define_insn "neon_vec_mult_lo_"
  [(set (match_operand: 0 "register_operand" "=w")
        (mult: (SE: (vec_select:
@@ -6176,7 +6139,7 @@ (define_expand "vec_widen_shiftl_lo_"
 ; because the ordering of vector elements in Q registers is different from what
 ; the semantics of the instructions require.
 
-(define_insn "vec_pack_trunc_"
+(define_insn "neon_quad_vec_pack_trunc_"
  [(set (match_operand: 0 "register_operand" "=&w")
        (vec_concat:
          (truncate:
diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
index 1ba1e5eb008..0ffc7a9322c 100644
--- a/gcc/config/arm/vec-common.md
+++ b/gcc/config/arm/vec-common.md
@@ -638,3 +638,92 @@ (define_expand "clz2"
   emit_insn (gen_mve_vclzq_s (mode, operands[0], operands[1]));
   DONE;
 })
+
+;; vmovl[tb] are not available for V4SI on MVE
+(define_expand "vec_unpack_hi_"
+  [(match_operand: 0 "register_operand")
+   (SE: (match_operand:VU 1 "register_operand"))]
+ "ARM_HAVE__ARITH
+  && !TARGET_REALLY_IWMMXT
+  && ! (mode == V4SImode && TARGET_HAVE_MVE)
+  && !BYTES_BIG_ENDIAN"
+  {
+    if (TARGET_NEON)
+      {
+        rtvec v = rtvec_alloc (/2);
+        rtx t1;
+        int i;
+        for (i = 0; i < (/2); i++)
+          RTVEC_ELT (v, i) = GEN_INT ((/2) + i);
+
+        t1 = gen_rtx_PARALLEL (mode, v);
+        emit_insn (gen_neon_vec_unpack_hi_ (operands[0],
+                                            operands[1],
+                                            t1));
+      }
+    else
+      {
+        emit_insn (gen_mve_vmovltq (VMOVLTQ_S, mode, operands[0],
+                                    operands[1]));
+      }
+    DONE;
+  }
+)
+
+;; vmovl[tb] are not available for V4SI on MVE
+(define_expand "vec_unpack_lo_"
+  [(match_operand: 0 "register_operand")
+   (SE: (match_operand:VU 1 "register_operand"))]
+ "ARM_HAVE__ARITH
+  && !TARGET_REALLY_IWMMXT
+  && ! (mode == V4SImode && TARGET_HAVE_MVE)
+  && !BYTES_BIG_ENDIAN"
+  {
+    if (TARGET_NEON)
+      {
+        rtvec v = rtvec_alloc (/2);
+        rtx t1;
+        int i;
+        for (i = 0; i < (/2) ; i++)
+          RTVEC_ELT (v, i) = GEN_INT (i);
+
+        t1 = gen_rtx_PARALLEL (mode, v);
+        emit_insn (gen_neon_vec_unpack_lo_ (operands[0],
+                                            operands[1],
+                                            t1));
+      }
+    else
+      {
+        emit_insn (gen_mve_vmovlbq (VMOVLBQ_S, mode, operands[0],
+                                    operands[1]));
+      }
+    DONE;
+  }
+)
+
+;; vmovn[tb] are not available for V2DI on MVE
+(define_expand "vec_pack_trunc_"
+ [(set (match_operand: 0 "register_operand" "=&w")
+       (vec_concat:
+         (truncate:
+           (match_operand:VN 1 "register_operand" "w"))
+         (truncate:
+           (match_operand:VN 2 "register_operand" "w"))))]
+ "ARM_HAVE__ARITH
+  && !TARGET_REALLY_IWMMXT
+  && ! (mode == V2DImode && TARGET_HAVE_MVE)
+  && !BYTES_BIG_ENDIAN"
+  {
+    if (TARGET_NEON)
+      {
+        emit_insn (gen_neon_quad_vec_pack_trunc_ (operands[0], operands[1],
+                                                  operands[2]));
+      }
+    else
+      {
+        emit_insn (gen_mve_vec_pack_trunc (mode, operands[0], operands[1],
+                                           operands[2]));
+      }
+    DONE;
+  }
+)
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
index 7068736bc28..5d6e991cfc6 100644
--- a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
@@ -21,8 +21,9 @@ FUNC(u, uint, 16, clz)
 FUNC(s, int, 8, clz)
 FUNC(u, uint, 8, clz)
 
-/* 16 and 8-bit versions are not vectorized because they need pack/unpack
-   patterns since __builtin_clz uses 32-bit parameter and return value. */
-/* { dg-final { scan-assembler-times {vclz\.i32 q[0-9]+, q[0-9]+} 2 } } */
+/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
+   instance instead of using vclz.i8, we need 4 vclz.i32, leading to a total of
+   14 vclz.i32 expected in this testcase. */
+/* { dg-final { scan-assembler-times {vclz\.i32 q[0-9]+, q[0-9]+} 14 } } */
 /* { dg-final { scan-assembler-times {vclz\.i16 q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {vclz\.i8 q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
index 7a0644997c8..91dd942d818 100644
--- a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
@@ -56,7 +56,10 @@ FUNC_IMM(u, uint, 8, 16, <<, vshlimm)
 /* MVE has only 128-bit vectors, so we can vectorize only half of the
    functions above. */
 /* We only emit vshl.u, which is equivalent to vshl.s anyway. */
-/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 2 } } */
+/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
+   instance instead of using vshl.u8, we need 4 vshl.i32, leading to a total of
+   14 vshl.i32 expected in this testcase. */
+/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 14 } } */
 
 /* We emit vshl.i when the shift amount is an immediate. */
 /* { dg-final { scan-assembler-times {vshl.i[0-9]+\tq[0-9]+, q[0-9]+} 6 } } */
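
The expected count of 14 in the updated testcase comments can be checked
with a little arithmetic. The sketch below is not part of the patch; the
function names are hypothetical. It assumes that each vectorized test
function handles exactly one 128-bit vector of narrow elements, and that
there is one signed and one unsigned function for each of the 32, 16 and
8-bit widths. Widening a 128-bit vector of BITS-wide elements to 32 bits
then takes 32 / BITS vectors, hence that many 32-bit operations.

```c
#include <assert.h>

/* Assumption: one 128-bit vector of BITS-wide elements per test function,
   with every element widened to 32 bits before the operation.  */
int wide_ops_per_function (int bits)
{
  assert (bits == 8 || bits == 16 || bits == 32);
  return 32 / bits;
}

/* Sum over the signed and unsigned variants of each element width.  */
int expected_wide_ops (void)
{
  static const int widths[] = { 32, 16, 8 };
  int total = 0;
  for (unsigned i = 0; i < sizeof widths / sizeof widths[0]; i++)
    total += 2 * wide_ops_per_function (widths[i]);
  return total;
}
```

Under these assumptions the total is 2 * (1 + 2 + 4) = 14, which agrees
with the "4 per 8-bit function" figure quoted in the comments.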