From patchwork Thu Sep 14 09:12:35 2017
X-Patchwork-Submitter: Jackson Woodruff
X-Patchwork-Id: 813751
To: Richard Earnshaw, James Greenhalgh, GCC Patches
From: Jackson Woodruff
Subject: [AArch64] Improve LDP/STP generation that requires a base register
Message-ID: <8c82cf45-fbb2-3bbe-8c9e-990c140650d2@foss.arm.com>
Date: Thu, 14 Sep 2017 10:12:35 +0100

Hi all,

This patch generalizes the formation of LDP/STP instructions that require a base register.

Previously, we would only accept address pairs whose offsets were in ascending or descending order, and only strictly sequential loads/stores. This patch improves that by accepting the loads/stores in any valid order, and by extending the range that the LDP/STP addresses can reach.

This patch is based on https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00741.html

OK for trunk?

Jackson

ChangeLog:

gcc/

2017-08-09  Jackson Woodruff

	* config/aarch64/aarch64.c (aarch64_host_wide_int_compare): New.
	(aarch64_ldrstr_offset_compare): New.
	(aarch64_operands_adjust_ok_for_ldpstp): Change to consider all
	load/store orderings.
	(aarch64_gen_adjusted_ldpstp): Likewise.

gcc/testsuite/

2017-08-09  Jackson Woodruff

	* gcc.target/aarch64/ldp_stp_9.c: New test.
	* gcc.target/aarch64/ldp_stp_10.c: New test.
	* gcc.target/aarch64/ldp_stp_11.c: New test.
	* gcc.target/aarch64/ldp_stp_12.c: New test.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 4c5ed9610cb8bbb337bbfcb9260d7fd227c68ce8..e015bc440e0c5e4cd85b6b92a9058bb69ada6fa1 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14799,6 +14799,49 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
   return true;
 }
 
+int
+aarch64_host_wide_int_compare (const void *x, const void *y)
+{
+  return wi::cmps (* ((const HOST_WIDE_INT *) x),
+		   * ((const HOST_WIDE_INT *) y));
+}
+
+/* Taking X and Y to be pairs of RTX, one pointing to a MEM rtx and the
+   other pointing to a REG rtx containing an offset, compare the offsets
+   of the two pairs.
+
+   Return:
+
+	1 iff offset (X) > offset (Y)
+	0 iff offset (X) == offset (Y)
+	-1 iff offset (X) < offset (Y)  */
+int
+aarch64_ldrstr_offset_compare (const void *x, const void *y)
+{
+  const rtx *operands_1 = (const rtx *) x;
+  const rtx *operands_2 = (const rtx *) y;
+  rtx mem_1, mem_2, base, offset_1, offset_2;
+
+  if (GET_CODE (operands_1[0]) == MEM)
+    mem_1 = operands_1[0];
+  else
+    mem_1 = operands_1[1];
+
+  if (GET_CODE (operands_2[0]) == MEM)
+    mem_2 = operands_2[0];
+  else
+    mem_2 = operands_2[1];
+
+  /* Extract the offsets.  */
+  extract_base_offset_in_addr (mem_1, &base, &offset_1);
+  extract_base_offset_in_addr (mem_2, &base, &offset_2);
+
+  gcc_assert (offset_1 != NULL_RTX && offset_2 != NULL_RTX);
+
+  return wi::cmps (INTVAL (offset_1), INTVAL (offset_2));
+}
+
 /* Given OPERANDS of consecutive load/store that can be merged,
    swap them if they are not in ascending order.  */
 void
@@ -14859,7 +14902,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
 				       scalar_mode mode)
 {
   enum reg_class rclass_1, rclass_2, rclass_3, rclass_4;
-  HOST_WIDE_INT offval_1, offval_2, offval_3, offval_4, msize;
+  HOST_WIDE_INT offvals[4], msize;
   rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4;
   rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4;
 
@@ -14875,8 +14918,12 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
       mem_4 = operands[7];
       gcc_assert (REG_P (reg_1) && REG_P (reg_2) && REG_P (reg_3)
 		  && REG_P (reg_4));
-      if (REGNO (reg_1) == REGNO (reg_2) || REGNO (reg_3) == REGNO (reg_4))
-	return false;
+
+      /* Do not attempt to merge the loads if the loads clobber each other.  */
+      for (int i = 0; i < 8; i += 2)
+	for (int j = i + 2; j < 8; j += 2)
+	  if (REGNO (operands[i]) == REGNO (operands[j]))
+	    return false;
     }
   else
     {
@@ -14918,34 +14965,36 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
       || !rtx_equal_p (base_3, base_4))
     return false;
 
-  offval_1 = INTVAL (offset_1);
-  offval_2 = INTVAL (offset_2);
-  offval_3 = INTVAL (offset_3);
-  offval_4 = INTVAL (offset_4);
+  offvals[0] = INTVAL (offset_1);
+  offvals[1] = INTVAL (offset_2);
+  offvals[2] = INTVAL (offset_3);
+  offvals[3] = INTVAL (offset_4);
   msize = GET_MODE_SIZE (mode);
-  /* Check if the offsets are consecutive.  */
-  if ((offval_1 != (offval_2 + msize)
-       || offval_1 != (offval_3 + msize * 2)
-       || offval_1 != (offval_4 + msize * 3))
-      && (offval_4 != (offval_3 + msize)
-	  || offval_4 != (offval_2 + msize * 2)
-	  || offval_4 != (offval_1 + msize * 3)))
+
+  /* Check if the offsets can be put in the right order to do a ldp/stp.  */
+  qsort (offvals, 4, sizeof (HOST_WIDE_INT), aarch64_host_wide_int_compare);
+
+  if (!(offvals[1] == offvals[0] + msize
+	&& offvals[3] == offvals[2] + msize))
     return false;
 
-  /* Check if the addresses are clobbered by load.  */
-  if (load)
-    {
-      if (reg_mentioned_p (reg_1, mem_1)
-	  || reg_mentioned_p (reg_2, mem_2)
-	  || reg_mentioned_p (reg_3, mem_3))
-	return false;
+  /* Check that the offsets are close enough together.  The ldp/stp
+     instructions have 7 bit immediate offsets, so use 0x80.  */
+  if (offvals[2] - offvals[0] >= msize * 0x80)
+    return false;
 
-      /* In increasing order, the last load can clobber the address.  */
-      if (offval_1 > offval_2 && reg_mentioned_p (reg_4, mem_4))
-	return false;
-    }
+  /* The offsets must be aligned with respect to each other.  */
+  if (offvals[0] % msize != offvals[2] % msize)
+    return false;
 
-  /* If we have SImode and slow unaligned ldp,
+  /* Check if the addresses are clobbered by load.  */
+  if (load && (reg_mentioned_p (reg_1, mem_1)
+	       || reg_mentioned_p (reg_2, mem_2)
+	       || reg_mentioned_p (reg_3, mem_3)
+	       || reg_mentioned_p (reg_4, mem_4)))
+    return false;
+
+  /* If we have SImode and slow unaligned ldp,
      check the alignment to be at least 8 byte.  */
   if (mode == SImode
       && (aarch64_tune_params.extra_tuning_flags
@@ -14983,7 +15032,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
 
 /* Given OPERANDS of consecutive load/store, this function pairs them
    into ldp/stp after adjusting the offset.  It depends on the fact
-   that addresses of load/store instructions are in increasing order.
+   that the operands can be sorted so the offsets are correct for stp.
    MODE is the mode of memory operands.  CODE is the rtl operator
    which should be applied to all memory operands, it's SIGN_EXTEND,
    ZERO_EXTEND or UNKNOWN.  */
@@ -14992,87 +15041,111 @@
 bool
 aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
			     scalar_mode mode, RTX_CODE code)
 {
-  rtx base, offset_1, offset_2, t1, t2;
+  rtx base, offset_1, offset_3, t1, t2;
   rtx mem_1, mem_2, mem_3, mem_4;
-  HOST_WIDE_INT off_val, abs_off, adj_off, new_off, stp_off_limit, msize;
+  rtx temp_operands[8];
+  HOST_WIDE_INT off_val_1, off_val_3, base_off, new_off_1, new_off_3,
+		stp_off_upper_limit, stp_off_lower_limit, msize;
+
+  /* We make changes on a copy as we may still abort.  */
+  for (int i = 0; i < 8; i++)
+    temp_operands[i] = operands[i];
+
+  /* Sort the operands.  */
+  qsort (temp_operands, 4, 2 * sizeof (rtx *), aarch64_ldrstr_offset_compare);
 
   if (load)
     {
-      mem_1 = operands[1];
-      mem_2 = operands[3];
-      mem_3 = operands[5];
-      mem_4 = operands[7];
+      mem_1 = temp_operands[1];
+      mem_2 = temp_operands[3];
+      mem_3 = temp_operands[5];
+      mem_4 = temp_operands[7];
     }
   else
     {
-      mem_1 = operands[0];
-      mem_2 = operands[2];
-      mem_3 = operands[4];
-      mem_4 = operands[6];
+      mem_1 = temp_operands[0];
+      mem_2 = temp_operands[2];
+      mem_3 = temp_operands[4];
+      mem_4 = temp_operands[6];
       gcc_assert (code == UNKNOWN);
     }
 
   extract_base_offset_in_addr (mem_1, &base, &offset_1);
-  extract_base_offset_in_addr (mem_2, &base, &offset_2);
+  extract_base_offset_in_addr (mem_3, &base, &offset_3);
   gcc_assert (base != NULL_RTX && offset_1 != NULL_RTX
-	      && offset_2 != NULL_RTX);
-
-  if (INTVAL (offset_1) > INTVAL (offset_2))
-    {
-      std::swap (operands[0], operands[6]);
-      std::swap (operands[1], operands[7]);
-      std::swap (operands[2], operands[4]);
-      std::swap (operands[3], operands[5]);
-    }
-
+	      && offset_3 != NULL_RTX);
 
-  /* Adjust offset thus it can fit in ldp/stp instruction.  */
+  /* Adjust offset so it can fit in ldp/stp instruction.  */
   msize = GET_MODE_SIZE (mode);
-  stp_off_limit = msize * 0x40;
-  off_val = INTVAL (offset_1);
-  abs_off = (off_val < 0) ? -off_val : off_val;
-  new_off = abs_off % stp_off_limit;
-  adj_off = abs_off - new_off;
+  stp_off_upper_limit = msize * (0x40 - 1);
+  stp_off_lower_limit = - msize * 0x40;
 
-  /* Further adjust to make sure all offsets are OK.  */
-  if ((new_off + msize * 2) >= stp_off_limit)
-    {
-      adj_off += stp_off_limit;
-      new_off -= stp_off_limit;
-    }
+  off_val_1 = INTVAL (offset_1);
+  off_val_3 = INTVAL (offset_3);
 
-  /* Make sure the adjustment can be done with ADD/SUB instructions.  */
-  if (adj_off >= 0x1000)
+  /* The base offset is optimally half way between the two stp/ldp offsets.  */
+  if (msize <= 4)
+    base_off = (off_val_1 + off_val_3) / 2;
+  else
+    /* However, due to issues with negative LDP/STP offset generation for
+       larger modes, for DF and DI modes (and any vector modes that have
+       since been added) we must not use negative addresses smaller than
+       9 signed unadjusted bits can store.  This provides the most range
+       in this case.  */
+    base_off = off_val_1;
+
+  /* Adjust the base so that it is aligned with the addresses but still
+     optimal.  */
+  if (base_off % msize != off_val_1 % msize)
+    /* Fix the offset, bearing in mind we want to make it bigger not
+       smaller.  */
+    base_off += (((base_off % msize) - (off_val_1 % msize)) + msize) % msize;
+  else if (msize <= 4)
+    /* The negative range of LDP/STP is one larger than the positive range.  */
+    base_off += msize;
+
+  /* Check if base offset is too big or too small.  We can attempt to resolve
+     this issue by setting it to the maximum value and seeing if the offsets
+     still fit.  */
+  if (base_off >= 0x1000)
+    {
+      base_off = 0x1000 - 1;
+      /* We must still make sure that the base offset is aligned with respect
+	 to the address.  But it may not be made any bigger.  */
+      base_off -= (((base_off % msize) - (off_val_1 % msize)) + msize) % msize;
+    }
+
+  /* Likewise for the case where the base is too small.  */
+  if (base_off <= -0x1000)
+    {
+      base_off = -0x1000 + 1;
+      base_off += (((base_off % msize) - (off_val_1 % msize)) + msize) % msize;
+    }
+
+  /* Offset of the first stp/ldp.  */
+  new_off_1 = off_val_1 - base_off;
+
+  /* Offset of the second stp/ldp.  */
+  new_off_3 = off_val_3 - base_off;
+
+  /* The offsets must be within the range of the LDP/STP instructions.  */
+  if (new_off_1 > stp_off_upper_limit || new_off_1 < stp_off_lower_limit
+      || new_off_3 > stp_off_upper_limit || new_off_3 < stp_off_lower_limit)
     return false;
 
-  if (off_val < 0)
-    {
-      adj_off = -adj_off;
-      new_off = -new_off;
-    }
-
-  /* Create new memory references.  */
-  mem_1 = change_address (mem_1, VOIDmode,
-			  plus_constant (DImode, operands[8], new_off));
-
-  /* Check if the adjusted address is OK for ldp/stp.  */
-  if (!aarch64_mem_pair_operand (mem_1, mode))
+  mem_1 = gen_rtx_MEM (mode, plus_constant (Pmode, operands[8],
+					    new_off_1));
+  mem_2 = gen_rtx_MEM (mode, plus_constant (Pmode, operands[8],
+					    new_off_1 + msize));
+  mem_3 = gen_rtx_MEM (mode, plus_constant (Pmode, operands[8],
+					    new_off_3));
+  mem_4 = gen_rtx_MEM (mode, plus_constant (Pmode, operands[8],
+					    new_off_3 + msize));
+
+  if (!aarch64_mem_pair_operand (mem_1, mode)
+      || !aarch64_mem_pair_operand (mem_3, mode))
     return false;
 
-  msize = GET_MODE_SIZE (mode);
-  mem_2 = change_address (mem_2, VOIDmode,
-			  plus_constant (DImode,
-					 operands[8],
-					 new_off + msize));
-  mem_3 = change_address (mem_3, VOIDmode,
-			  plus_constant (DImode,
-					 operands[8],
-					 new_off + msize * 2));
-  mem_4 = change_address (mem_4, VOIDmode,
-			  plus_constant (DImode,
-					 operands[8],
-					 new_off + msize * 3));
-
   if (code == ZERO_EXTEND)
     {
       mem_1 = gen_rtx_ZERO_EXTEND (DImode, mem_1);
@@ -15090,21 +15163,29 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
 
   if (load)
     {
+      operands[0] = temp_operands[0];
       operands[1] = mem_1;
+      operands[2] = temp_operands[2];
       operands[3] = mem_2;
+      operands[4] = temp_operands[4];
       operands[5] = mem_3;
+      operands[6] = temp_operands[6];
       operands[7] = mem_4;
     }
   else
     {
       operands[0] = mem_1;
+      operands[1] = temp_operands[1];
       operands[2] = mem_2;
+      operands[3] = temp_operands[3];
       operands[4] = mem_3;
+      operands[5] = temp_operands[5];
       operands[6] = mem_4;
+      operands[7] = temp_operands[7];
     }
 
   /* Emit adjusting instruction.  */
-  emit_insn (gen_rtx_SET (operands[8], plus_constant (DImode, base, adj_off)));
+  emit_insn (gen_rtx_SET (operands[8], plus_constant (DImode, base, base_off)));
   /* Emit ldp/stp instructions.  */
   t1 = gen_rtx_SET (operands[0], operands[1]);
   t2 = gen_rtx_SET (operands[2], operands[3]);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_10.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_10.c
new file mode 100644
index 0000000000000000000000000000000000000000..8fb9c44f41a2d03eb30c99f4bbd48e106536f045
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_10.c
@@ -0,0 +1,33 @@
+/* { dg-options "-O2" } */
+
+int
+load (int *arr)
+{
+  return arr[527] + arr[400] + arr[401] + arr[528];
+}
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]+, " 2 } } */
+
+float
+load_float (float *arr)
+{
+  return arr[404] + arr[403] + arr[400] + arr[401];
+}
+
+/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]+, " 2 } } */
+
+long
+load_long (long int *arr)
+{
+  return arr[400] + arr[401] + arr[403] + arr[404];
+}
+
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]+, " 2 } } */
+
+double
+load_double (double *arr)
+{
+  return arr[200] + arr[201] + arr[263] + arr[264];
+}
+
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]+, " 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_11.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_11.c
new file mode 100644
index 0000000000000000000000000000000000000000..3a7a2ca3ad19ae4c90ea2ee6f63ad5e913df72da
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_11.c
@@ -0,0 +1,15 @@
+/* { dg-options "-O2" } */
+
+double
+load_one (double *in)
+{
+  return in[400] + in[401] + in[527] + in[528];
+}
+
+double
+load_two (double *in)
+{
+  return in[400] + in[401] + in[464] + in[465];
+}
+
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]+," 4 { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_12.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_12.c
new file mode 100644
index 0000000000000000000000000000000000000000..718e82b53f0ccfd09a19afa26ebdb88654359e33
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_12.c
@@ -0,0 +1,13 @@
+/* { dg-options "-O2" } */
+
+void
+store_offset (int *array, int x, int y)
+{
+  array[1085] = x;
+  array[1084] = y;
+
+  array[1086] = y;
+  array[1087] = 5;
+}
+
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]+, " 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_9.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_9.c
new file mode 100644
index 0000000000000000000000000000000000000000..ca438695a5f2fd6892c7ac92a3f85623ed375aa2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_9.c
@@ -0,0 +1,49 @@
+/* { dg-options "-O2" } */
+
+void
+store (int *arr, int x, int y, int z)
+{
+  arr[400] = x;
+  arr[401] = y;
+
+  arr[500] = z;
+  arr[501] = x;
+}
+
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]+, " 2 } } */
+
+void
+store_float (float *arr, float x, float y)
+{
+  arr[404] = x;
+  arr[403] = y;
+
+  arr[400] = x;
+  arr[401] = y;
+}
+
+/* { dg-final { scan-assembler-times "stp\ts\[0-9\]+, s\[0-9\]+, " 2 } } */
+
+void
+store_long (long int *arr, long int x, long int y)
+{
+  arr[400] = x;
+  arr[401] = y;
+
+  arr[403] = y;
+  arr[404] = x;
+}
+
+/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]+, " 2 } } */
+
+void
+store_double (double *arr, double x, double y)
+{
+  arr[200] = x;
+  arr[201] = y;
+
+  arr[263] = y;
+  arr[264] = x;
+}
+
+/* { dg-final { scan-assembler-times "stp\td\[0-9\]+, d\[0-9\]+, " 2 } } */