From patchwork Mon Jun 12 13:29:01 2023
From: juzhe.zhong@rivai.ai
To: gcc-patches@gcc.gnu.org
Cc: kito.cheng@gmail.com, kito.cheng@sifive.com, palmer@dabbelt.com, palmer@rivosinc.com, jeffreyalaw@gmail.com, rdapp.gcc@gmail.com, Juzhe-Zhong
Subject: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation
Date: Mon, 12 Jun 2023 21:29:01 +0800
Message-Id: <20230612132901.1727002-1-juzhe.zhong@rivai.ai>

From: Juzhe-Zhong

According to the RVV ISA
(https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc),
we can enhance VLA SLP auto-vectorization with the decompress operation
(16.5.1. Synthesizing vdecompress).

Case 1 (nunits = POLY_INT_CST [16, 16]):
  _48 = VEC_PERM_EXPR <_37, _35, { 0, POLY_INT_CST [16, 16], 1, POLY_INT_CST [17, 16], 2, POLY_INT_CST [18, 16], ... }>;

We can optimize such a VLA SLP permutation pattern into:
  _48 = vdecompress (_37, _35, mask = { 0, 1, 0, 1, ...
 });

Case 2 (nunits = POLY_INT_CST [16, 16]):
  _23 = VEC_PERM_EXPR <_46, _44, { POLY_INT_CST [1, 1], POLY_INT_CST [3, 3], POLY_INT_CST [2, 1], POLY_INT_CST [4, 3], POLY_INT_CST [3, 1], POLY_INT_CST [5, 3], ... }>;

We can optimize such a VLA SLP permutation pattern into:
  _48 = vdecompress (slidedown (_46, 1/2 nunits), slidedown (_44, 1/2 nunits), mask = { 0, 1, 0, 1, ... });

For example:

void __attribute__ ((noinline, noclone))
vec_slp (uint64_t *restrict a, uint64_t b, uint64_t c, int n)
{
  for (int i = 0; i < n; ++i)
    {
      a[i * 2] += b;
      a[i * 2 + 1] += c;
    }
}

ASM:
        ...
        vid.v     v0
        vand.vi   v0,v0,1
        vmseq.vi  v0,v0,1       ===> mask = { 0, 1, 0, 1, ... }
        vdecompress:
        viota.m      v3,v0
        vrgather.vv  v2,v1,v3,v0.t
        ...

gcc/ChangeLog:

	* config/riscv/riscv-v.cc (emit_vlmax_decompress_insn): New function.
	(expand_const_vector): Enhance repeating sequence mask.
	(shuffle_decompress_patterns): New function.
	(expand_vec_perm_const_1): Add decompress optimization.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/partial/slp-8.c: New test.
	* gcc.target/riscv/rvv/autovec/partial/slp-9.c: New test.
	* gcc.target/riscv/rvv/autovec/partial/slp_run-8.c: New test.
	* gcc.target/riscv/rvv/autovec/partial/slp_run-9.c: New test.
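Before the diff itself, the viota.m + masked vrgather synthesis described above can be sanity-checked with a scalar model. This is an illustrative sketch only; the function and variable names are assumptions, not part of the patch or of GCC.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of the vdecompress synthesis: viota.m gives each element
   the count of set mask bits strictly before it, and the masked
   vrgather.vv then pulls packed element [iota] into each active
   destination position, leaving inactive positions undisturbed.  */
static void
vdecompress_model (char *dest, const char *packed,
                   const unsigned char *mask, size_t n)
{
  size_t iota = 0;              /* running viota.m value */
  for (size_t i = 0; i < n; i++)
    if (mask[i])
      dest[i] = packed[iota++]; /* masked vrgather.vv via the iota index */
}
```

Reading the spec's example tables right to left (element 0 first), destination "p q r s t u v w" with mask 1 0 0 1 1 1 0 1 and packed "e d c b a" yields "e q r d c b v a", which this model reproduces.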
---
 gcc/config/riscv/riscv-v.cc                   | 146 +++++++++++++++++-
 .../riscv/rvv/autovec/partial/slp-8.c         |  30 ++++
 .../riscv/rvv/autovec/partial/slp-9.c         |  31 ++++
 .../riscv/rvv/autovec/partial/slp_run-8.c     |  30 ++++
 .../riscv/rvv/autovec/partial/slp_run-9.c     |  30 ++++
 5 files changed, 260 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-8.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-9.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-8.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-9.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index e1b85a5af91..3cea6b25261 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -836,6 +836,46 @@ emit_vlmax_masked_gather_mu_insn (rtx target, rtx op, rtx sel, rtx mask)
   emit_vlmax_masked_mu_insn (icode, RVV_BINOP_MU, ops);
 }
 
+/* According to RVV ISA spec (16.5.1. Synthesizing vdecompress):
+   https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc
+
+   There is no inverse vdecompress provided, as this operation can be readily
+   synthesized using iota and a masked vrgather:
+
+   Desired functionality of 'vdecompress'
+    7 6 5 4 3 2 1 0     # vid
+
+          e d c b a     # packed vector of 5 elements
+    1 0 0 1 1 1 0 1     # mask vector of 8 elements
+    p q r s t u v w     # destination register before vdecompress
+
+    e q r d c b v a     # result of vdecompress
+
+                                    # v0 holds mask
+                                    # v1 holds packed data
+                                    # v11 holds input expanded vector and result
+   viota.m v10, v0                  # Calc iota from mask in v0
+   vrgather.vv v11, v1, v10, v0.t   # Expand into destination
+
+    p q r s t u v w     # v11 destination register
+          e d c b a     # v1 source vector
+    1 0 0 1 1 1 0 1     # v0 mask vector
+
+    4 4 4 3 2 1 1 0     # v10 result of viota.m
+    e q r d c b v a     # v11 destination after vrgather using viota.m under mask
+*/
+static void
+emit_vlmax_decompress_insn (rtx target, rtx op, rtx mask)
+{
+  machine_mode
+    data_mode = GET_MODE (target);
+  machine_mode sel_mode = related_int_vector_mode (data_mode).require ();
+  if (GET_MODE_INNER (data_mode) == QImode)
+    sel_mode = get_vector_mode (HImode, GET_MODE_NUNITS (data_mode)).require ();
+
+  rtx sel = gen_reg_rtx (sel_mode);
+  rtx iota_ops[] = {sel, mask};
+  emit_vlmax_insn (code_for_pred_iota (sel_mode), RVV_UNOP, iota_ops);
+  emit_vlmax_masked_gather_mu_insn (target, op, sel, mask);
+}
+
 /* Emit merge instruction.  */
 
 static machine_mode
@@ -934,14 +974,41 @@ expand_const_vector (rtx target, rtx src)
 {
   machine_mode mode = GET_MODE (target);
   scalar_mode elt_mode = GET_MODE_INNER (mode);
+  poly_uint64 nunits = GET_MODE_NUNITS (mode);
+  unsigned int nelts_per_pattern = CONST_VECTOR_NELTS_PER_PATTERN (src);
+  unsigned int npatterns = CONST_VECTOR_NPATTERNS (src);
 
   if (GET_MODE_CLASS (mode) == MODE_VECTOR_BOOL)
     {
       rtx elt;
-      gcc_assert (
-	const_vec_duplicate_p (src, &elt)
-	&& (rtx_equal_p (elt, const0_rtx) || rtx_equal_p (elt, const1_rtx)));
-      rtx ops[] = {target, src};
-      emit_vlmax_insn (code_for_pred_mov (mode), RVV_UNOP, ops);
+      if (const_vec_duplicate_p (src, &elt))
+	{
+	  rtx ops[] = {target, src};
+	  emit_vlmax_insn (code_for_pred_mov (mode), RVV_UNOP, ops);
+	}
+      else
+	{
+	  gcc_assert (CONST_VECTOR_DUPLICATE_P (src));
+	  if (npatterns == 2)
+	    {
+	      /* Generate mask with repeating sequence:
+		 1. { 0, 1, 0, 1, ... }.
+		 2. { 1, 0, 1, 0, ... }.
+	      */
+	      rtx ele1 = CONST_VECTOR_ELT (src, 1);
+	      machine_mode vid_mode
+		= get_vector_mode (QImode, nunits).require ();
+	      rtx vid = gen_reg_rtx (vid_mode);
+	      rtx vid_repeat = gen_reg_rtx (vid_mode);
+	      emit_insn (
+		gen_vec_series (vid_mode, vid, const0_rtx, const1_rtx));
+	      rtx and_ops[] = {vid_repeat, vid, const1_rtx};
+	      emit_vlmax_insn (code_for_pred_scalar (AND, vid_mode), RVV_BINOP,
+			       and_ops);
+	      rtx const_vec = gen_const_vector_dup (vid_mode, INTVAL (ele1));
+	      expand_vec_cmp (target, EQ, vid_repeat, const_vec);
+	    }
+	  else
+	    gcc_unreachable ();
+	}
       return;
     }
 
@@ -977,8 +1044,6 @@ expand_const_vector (rtx target, rtx src)
     }
 
   /* Handle variable-length vector.  */
-  unsigned int nelts_per_pattern = CONST_VECTOR_NELTS_PER_PATTERN (src);
-  unsigned int npatterns = CONST_VECTOR_NPATTERNS (src);
   rvv_builder builder (mode, npatterns, nelts_per_pattern);
   for (unsigned int i = 0; i < nelts_per_pattern; i++)
     {
@@ -2337,6 +2402,71 @@ struct expand_vec_perm_d
   bool testing_p;
 };
 
+/* Recognize decompress patterns:
+
+   1. VEC_PERM_EXPR op0 and op1
+      with isel = { 0, nunits, 1, nunits + 1, ... }.
+      Decompress op0 and op1 vector with the mask = { 0, 1, 0, 1, ... }.
+
+   2. VEC_PERM_EXPR op0 and op1
+      with isel = { 1/2 nunits, 3/2 nunits, 1/2 nunits + 1, 3/2 nunits + 1, ... }.
+      Slide down op0 and op1 with OFFSET = 1/2 nunits.
+      Decompress op0 and op1 vector with the mask = { 0, 1, 0, 1, ... }.  */
+static bool
+shuffle_decompress_patterns (struct expand_vec_perm_d *d)
+{
+  poly_uint64 nelt = d->perm.length ();
+  machine_mode mask_mode = get_mask_mode (d->vmode).require ();
+
+  /* For constant size indices, we don't need to handle it here.
+     Just leave it to vec_perm.  */
+  if (d->perm.length ().is_constant ())
+    return false;
+
+  poly_uint64 first = d->perm[0];
+  if ((maybe_ne (first, 0U) && maybe_ne (first * 2, nelt))
+      || !d->perm.series_p (0, 2, first, 1)
+      || !d->perm.series_p (1, 2, first + nelt, 1))
+    return false;
+
+  /* Permuting two SEW8 variable-length vectors needs vrgatherei16.vv.
+     Otherwise, it could overflow the index range.  */
+  machine_mode sel_mode;
+  if (GET_MODE_INNER (d->vmode) == QImode
+      && !get_vector_mode (HImode, nelt).exists (&sel_mode))
+    return false;
+
+  /* Success!  */
+  if (d->testing_p)
+    return true;
+
+  rtx op0, op1;
+  if (known_eq (first, 0U))
+    {
+      op0 = d->op0;
+      op1 = d->op1;
+    }
+  else
+    {
+      op0 = gen_reg_rtx (d->vmode);
+      op1 = gen_reg_rtx (d->vmode);
+      insn_code icode = code_for_pred_slide (UNSPEC_VSLIDEDOWN, d->vmode);
+      rtx ops0[] = {op0, d->op0, gen_int_mode (first, Pmode)};
+      rtx ops1[] = {op1, d->op1, gen_int_mode (first, Pmode)};
+      emit_vlmax_insn (icode, RVV_BINOP, ops0);
+      emit_vlmax_insn (icode, RVV_BINOP, ops1);
+    }
+
+  /* Generate { 0, 1, ... } mask.  */
+  rvv_builder builder (mask_mode, 2, 1);
+  builder.quick_push (CONST0_RTX (BImode));
+  builder.quick_push (CONST1_RTX (BImode));
+  emit_move_insn (d->target, op0);
+  emit_vlmax_decompress_insn (d->target, op1,
+			      force_reg (mask_mode, builder.build ()));
+  return true;
+}
+
 /* Recognize the pattern that can be shuffled by generic approach.
  */
 
 static bool
@@ -2388,6 +2518,8 @@ expand_vec_perm_const_1 (struct expand_vec_perm_d *d)
 {
   if (d->vmode == d->op_mode)
     {
+      if (shuffle_decompress_patterns (d))
+	return true;
       if (shuffle_generic_patterns (d))
	return true;
       return false;
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-8.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-8.c
new file mode 100644
index 00000000000..2568d6947a2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-8.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-optimized-details" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)                                                \
+  TYPE __attribute__ ((noinline, noclone))                            \
+  vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int n)            \
+  {                                                                   \
+    for (int i = 0; i < n; ++i)                                       \
+      {                                                               \
+	a[i * 2] += b;                                                \
+	a[i * 2 + 1] += c;                                            \
+      }                                                               \
+  }
+
+#define TEST_ALL(T)                                                   \
+  T (int8_t)                                                          \
+  T (uint8_t)                                                         \
+  T (int16_t)                                                         \
+  T (uint16_t)                                                        \
+  T (int32_t)                                                         \
+  T (uint32_t)                                                        \
+  T (int64_t)                                                         \
+  T (uint64_t)
+
+TEST_ALL (VEC_PERM)
+
+/* { dg-final { scan-tree-dump-times "\.VEC_PERM" 2 "optimized" } } */
+/* { dg-final { scan-assembler-times {viota.m} 2 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-9.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-9.c
new file mode 100644
index 00000000000..d410e57adbd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-9.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-optimized-details" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)                                                \
+  TYPE __attribute__ ((noinline, noclone))                            \
+  vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int n)            \
+  {                                                                   \
+    for (int i = 0; i < n; ++i)                                       \
+      {                                                               \
+	a[i * 4] += b;                                                \
+	a[i * 4 + 1] += c;                                            \
+	a[i * 4 + 2] += b;                                            \
+	a[i * 4 + 3] += c;                                            \
+      }                                                               \
+  }
+
+#define TEST_ALL(T)                                                   \
+  T (int8_t)                                                          \
+  T (uint8_t)                                                         \
+  T (int16_t)                                                         \
+  T (uint16_t)                                                        \
+  T (int32_t)                                                         \
+  T (uint32_t)                                                        \
+  T (int64_t)                                                         \
+  T (uint64_t)
+
+TEST_ALL (VEC_PERM)
+
+/* { dg-final { scan-assembler-times {viota.m} 2 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-8.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-8.c
new file mode 100644
index 00000000000..39ae513812b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-8.c
@@ -0,0 +1,30 @@
+/* { dg-do run { target { riscv_vector } } } */
+/* { dg-additional-options "--param riscv-autovec-preference=scalable -fno-vect-cost-model" } */
+
+#include "slp-8.c"
+
+#define N (103 * 2)
+
+#define HARNESS(TYPE)                                                 \
+  {                                                                   \
+    TYPE a[N], b[2] = { 3, 11 };                                      \
+    for (unsigned int i = 0; i < N; ++i)                              \
+      {                                                               \
+	a[i] = i * 2 + i % 5;                                         \
+	asm volatile ("" ::: "memory");                               \
+      }                                                               \
+    vec_slp_##TYPE (a, b[0], b[1], N / 2);                            \
+    for (unsigned int i = 0; i < N; ++i)                              \
+      {                                                               \
+	TYPE orig = i * 2 + i % 5;                                    \
+	TYPE expected = orig + b[i % 2];                              \
+	if (a[i] != expected)                                         \
+	  __builtin_abort ();                                         \
+      }                                                               \
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-9.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-9.c
new file mode 100644
index 00000000000..791cfbc2b47
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-9.c
@@ -0,0 +1,30 @@
+/* { dg-do run { target { riscv_vector } } } */
+/* { dg-additional-options "--param riscv-autovec-preference=scalable -fno-vect-cost-model" } */
+
+#include "slp-9.c"
+
+#define N (103 * 4)
+
+#define HARNESS(TYPE)                                                 \
+  {                                                                   \
+    TYPE a[N], b[2] = { 3, 11 };                                      \
+    for (unsigned int i = 0; i < N; ++i)                              \
+      {                                                               \
+	a[i] = i * 2 + i % 5;                                         \
+	asm volatile ("" ::: "memory");                               \
+      }                                                               \
+    vec_slp_##TYPE (a, b[0], b[1], N / 4);                            \
+    for (unsigned int i = 0; i < N; ++i)                              \
+      {                                                               \
+	TYPE orig = i * 2 + i % 5;                                    \
+	TYPE expected
+	  = orig + b[i % 2];                                          \
+	if (a[i] != expected)                                         \
+	  __builtin_abort ();                                         \
+      }                                                               \
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
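The even/odd access pattern these run tests verify is exactly the Case 1 interleave from the cover text. A scalar sketch of what that VEC_PERM_EXPR computes (illustrative names, not code from the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of Case 1's VEC_PERM_EXPR with
   isel = { 0, nunits, 1, nunits + 1, ... }: even result elements come
   from op0 and odd ones from op1, which the patch lowers to a move of
   op0 plus a vdecompress of op1 under mask = { 0, 1, 0, 1, ... }.  */
static void
interleave_model (uint64_t *result, const uint64_t *op0,
                  const uint64_t *op1, size_t nunits)
{
  for (size_t i = 0; i < nunits / 2; i++)
    {
      result[2 * i] = op0[i];     /* isel picks 0, 1, 2, ...          */
      result[2 * i + 1] = op1[i]; /* isel picks nunits, nunits+1, ... */
    }
}
```

For op0 = { 1, 2, 3, 4 } and op1 = { 10, 20, 30, 40 } with nunits = 8, the result is { 1, 10, 2, 20, 3, 30, 4, 40 } — the per-iteration { +b, +c } update pattern that vec_slp exercises.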