From patchwork Tue May 15 08:20:12 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kyrill Tkachov
X-Patchwork-Id: 913472
Message-ID: <5AFA983C.5070609@foss.arm.com>
Date: Tue, 15 May 2018 09:20:12 +0100
From: Kyrill Tkachov
To: "gcc-patches@gcc.gnu.org"
CC: "Richard Earnshaw (lists)", James Greenhalgh, Marcus Shawcroft
Subject: [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful

Hi all,

This is a respin of James's patch from:
https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
The original patch was approved and committed but was later reverted
because of failures on big-endian. This tweaked version fixes the
big-endian failures in aarch64_expand_vector_init by picking the right
element of VALS to move into the low part of the vector register
depending on endianness. The rest of the patch stays the same.
I'm looking for approval on the aarch64 parts, as they are the ones that
have changed since the last approved version of the patch.

-----------------------------------------------------------------------

In the testcase in this patch we create an SLP vector with only two
elements. Our current vector initialisation code will first duplicate
the first element to both lanes, then overwrite the top lane with a new
value.

This duplication can be clunky and wasteful.

Better would be to use the fact that we will always be overwriting the
remaining bits, and simply move the first element to the correct place
(implicitly zeroing all other bits).

This reduces the code generated for this case, and can allow more
efficient addressing modes, and other second order benefits for AArch64
code which has been vectorized to V2DI mode.

Note that the change is generic enough to catch the case for any vector
mode, but is expected to be most useful for 2x64-bit vectorization.

Unfortunately, on its own, this would cause failures in
gcc.target/aarch64/load_v2vec_lanes_1.c and
gcc.target/aarch64/store_v2vec_lanes.c, which expect to see many more
vec_merge and vec_duplicate for their simplifications to apply. To fix
this, add a special case to the AArch64 code if we are loading from two
memory addresses, and use the load_pair_lanes patterns directly.

We also need a new pattern in simplify-rtx.c:simplify_ternary_operation,
to catch:

  (vec_merge:OUTER
     (vec_duplicate:OUTER x:INNER)
     (subreg:OUTER y:INNER 0)
     (const_int N))

And simplify it to:

  (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)

This is similar to the existing patterns which are tested in this
function, without requiring the second operand to also be a
vec_duplicate.

Bootstrapped and tested on aarch64-none-linux-gnu and tested on
aarch64-none-elf.

Note that this requires
https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
if we don't want to ICE creating broken vector zero extends.

Are the non-AArch64 parts OK?
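
As a rough illustration of the N == 2 case of the new simplify-rtx.c
rule, here is a toy model in plain C. It is not GCC code: the struct and
every toy_* name are made up for this sketch, lane[0] stands for the low
part of the register, and the "upper" argument stands in for whatever
the upper part of the paradoxical subreg happens to hold. It relies only
on the documented vec_merge rule that a set mask bit takes the element
from the first operand and a clear bit takes it from the second. The
sketch covers N == 2 only.

```c
#include <stdbool.h>

/* Toy two-lane vector; lane[0] models the low part of the register.
   Hypothetical type, for illustration only.  */
struct toy_v2 { long lane[2]; };

/* Models (vec_duplicate x): the same value in both lanes.  */
static struct toy_v2
toy_vec_duplicate (long x)
{
  struct toy_v2 r = { { x, x } };
  return r;
}

/* Models (subreg:outer y:inner 0): y sits in lane 0.  UPPER is a
   stand-in for the contents of the upper lane, which the model lets
   the caller choose freely.  */
static struct toy_v2
toy_paradoxical_subreg (long y, long upper)
{
  struct toy_v2 r = { { y, upper } };
  return r;
}

/* Models (vec_merge a b mask): bit i set takes lane i from A,
   bit i clear takes lane i from B.  */
static struct toy_v2
toy_vec_merge (struct toy_v2 a, struct toy_v2 b, unsigned mask)
{
  struct toy_v2 r;
  for (int i = 0; i < 2; i++)
    r.lane[i] = (mask & (1u << i)) ? a.lane[i] : b.lane[i];
  return r;
}

/* Models (vec_concat lo hi).  */
static struct toy_v2
toy_vec_concat (long lo, long hi)
{
  struct toy_v2 r = { { lo, hi } };
  return r;
}

/* True if the N == 2 merge of (vec_duplicate x) over the subreg of y
   gives the same lanes as (vec_concat y x).  */
static bool
toy_n2_matches_concat (long x, long y, long upper)
{
  struct toy_v2 merged
    = toy_vec_merge (toy_vec_duplicate (x),
                     toy_paradoxical_subreg (y, upper), 2);
  struct toy_v2 concat = toy_vec_concat (y, x);
  return merged.lane[0] == concat.lane[0]
         && merged.lane[1] == concat.lane[1];
}
```

In this model toy_n2_matches_concat (11, 22, upper) holds for any value
of upper, because with mask 2 the upper lane is always taken from the
duplicated operand and never from the subreg.
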

Thanks,
James

---

2018-05-15  James Greenhalgh
            Kyrylo Tkachov

        * config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify
        code generation for cases where splatting a value is not useful.
        * simplify-rtx.c (simplify_ternary_operation): Simplify vec_merge
        across a vec_duplicate and a paradoxical subreg forming a vector
        mode to a vec_concat.

2018-05-15  James Greenhalgh

        * gcc.target/aarch64/vect-slp-dup.c: New.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 4b5183b602b8786307deb8e3d8056323028b50a2..562eb315f881a1e0d8aa3ba946c99b4c6f25949b 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13901,9 +13901,54 @@ aarch64_expand_vector_init (rtx target, rtx vals)
             maxv = matches[i][1];
           }
 
-      /* Create a duplicate of the most common element.  */
-      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
-      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+      /* Create a duplicate of the most common element, unless all elements
+         are equally useless to us, in which case just immediately set the
+         vector register using the first element.  */
+
+      if (maxv == 1)
+        {
+          /* For vectors of two 64-bit elements, we can do even better.  */
+          if (n_elts == 2
+              && (inner_mode == E_DImode
+                  || inner_mode == E_DFmode))
+
+            {
+              rtx x0 = XVECEXP (vals, 0, 0);
+              rtx x1 = XVECEXP (vals, 0, 1);
+              /* Combine can pick up this case, but handling it directly
+                 here leaves clearer RTL.
+
+                 This is load_pair_lanes, and also gives us a clean-up
+                 for store_pair_lanes.  */
+              if (memory_operand (x0, inner_mode)
+                  && memory_operand (x1, inner_mode)
+                  && !STRICT_ALIGNMENT
+                  && rtx_equal_p (XEXP (x1, 0),
+                                  plus_constant (Pmode,
+                                                 XEXP (x0, 0),
+                                                 GET_MODE_SIZE (inner_mode))))
+                {
+                  rtx t;
+                  if (inner_mode == DFmode)
+                    t = gen_load_pair_lanesdf (target, x0, x1);
+                  else
+                    t = gen_load_pair_lanesdi (target, x0, x1);
+                  emit_insn (t);
+                  return;
+                }
+            }
+          /* The subreg-move sequence below will move into lane zero of the
+             vector register.  For big-endian we want that position to hold
+             the last element of VALS.  */
+          maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
+          rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+          aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+        }
+      else
+        {
+          rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+          aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+        }
 
       /* Insert the rest.  */
       for (int i = 0; i < n_elts; i++)
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 23244a12545ba2f9db21f66a63a6d36ff8fd29fc..d32cdd19ecd9b8ca6a2957d115bec9c6613d3836 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5891,6 +5891,36 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
               return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
             }
 
+          /* Replace:
+
+              (vec_merge:outer (vec_duplicate:outer x:inner)
+                               (subreg:outer y:inner 0)
+                               (const_int N))
+
+             with (vec_concat:outer x:inner y:inner) if N == 1,
+             or (vec_concat:outer y:inner x:inner) if N == 2.
+
+             Implicitly, this means we have a paradoxical subreg, but such
+             a check is cheap, so make it anyway.
+
+             Only applies for vectors of two elements.  */
+          if (GET_CODE (op0) == VEC_DUPLICATE
+              && GET_CODE (op1) == SUBREG
+              && GET_MODE (op1) == GET_MODE (op0)
+              && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
+              && paradoxical_subreg_p (op1)
+              && subreg_lowpart_p (op1)
+              && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
+              && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
+              && IN_RANGE (sel, 1, 2))
+            {
+              rtx newop0 = XEXP (op0, 0);
+              rtx newop1 = SUBREG_REG (op1);
+              if (sel == 2)
+                std::swap (newop0, newop1);
+              return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
+            }
+
           /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
                                  (const_int n))
              with (vec_concat x y) or (vec_concat y x) depending on value
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
new file mode 100644
index 0000000000000000000000000000000000000000..0541e480d1f8561dbd9b2a56926c8df60d667a54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+
+/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
+
+void bar (double);
+
+void
+foo (double *restrict in, double *restrict in2,
+     double *restrict out1, double *restrict out2)
+{
+  for (int i = 0; i < 1024; i++)
+    {
+      out1[i] = in[i] + 2.0 * in[i+128];
+      out1[i+1] = in[i+1] + 2.0 * in2[i];
+      bar (in[i]);
+    }
+}
+
+/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */
+
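
For reviewers who want the control flow of the aarch64.c hunk at a
glance, here is a plain-C paraphrase of the decision it makes once maxv
is known. This is a sketch, not GCC code: the enum, struct toy_elt and
both toy_* functions are invented here, addresses are modelled as plain
integers, and "element size of 8 bytes" is an assumed stand-in for the
inner_mode == DImode/DFmode test in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three paths in the patched code.  */
enum toy_strategy
{
  TOY_DUP_THEN_INSERT,       /* Splat the most common element first.  */
  TOY_LOAD_PAIR,             /* Two adjacent 64-bit memory elements.  */
  TOY_MOVE_LOW_THEN_INSERT   /* Move one element into lane zero.  */
};

/* Toy description of one element of VALS.  */
struct toy_elt
{
  bool in_memory;   /* Models memory_operand () being true.  */
  uint64_t addr;    /* Only meaningful when in_memory is true.  */
};

/* Mirrors the shape of the new branch: the splat is kept unless every
   element occurs exactly once (maxv == 1); in that case prefer the
   load-pair path for two adjacent 64-bit memory elements, otherwise
   seed lane zero and insert the rest.  */
static enum toy_strategy
toy_pick_strategy (int maxv, int n_elts, unsigned elt_size,
                   bool strict_alignment, const struct toy_elt *elts)
{
  if (maxv != 1)
    return TOY_DUP_THEN_INSERT;

  if (n_elts == 2
      && elt_size == 8
      && elts[0].in_memory
      && elts[1].in_memory
      && !strict_alignment
      && elts[1].addr == elts[0].addr + elt_size)
    return TOY_LOAD_PAIR;

  return TOY_MOVE_LOW_THEN_INSERT;
}

/* Index of the element of VALS that is moved into lane zero on the
   move-low path: the last element for big-endian, else the first.  */
static int
toy_lane_zero_source (int n_elts, bool big_endian)
{
  return big_endian ? n_elts - 1 : 0;
}
```

With two elements in memory at addresses A and A + 8 and maxv == 1, the
sketch picks TOY_LOAD_PAIR; if the second address is anything else, or
either element is not in memory, it falls back to
TOY_MOVE_LOW_THEN_INSERT.
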