[AArch64] Use Q-reg loads/stores in movmem expansion

Message ID 5C1CDCF9.7030001@foss.arm.com
State New
Series [AArch64] Use Q-reg loads/stores in movmem expansion

Commit Message

Kyrill Tkachov Dec. 21, 2018, 12:30 p.m. UTC
Hi all,

Our movmem expansion currently emits TImode loads and stores when copying 128-bit chunks.
This generates X-register LDP/STP sequences as these are the most preferred registers for that mode.

For the purpose of copying memory, however, we want to prefer Q-registers.
This uses one register instead of two, helping with register pressure.
It also allows merging of 256-bit and larger copies into Q-reg LDP/STP, further helping code size.

The implementation of that is simple: we just use a 128-bit vector mode (V4SImode in this patch)
rather than TImode.

With this patch the testcase:
#define N 8
int src[N], dst[N];

void
foo (void)
{
   __builtin_memcpy (dst, src, N * sizeof (int));
}

generates:
foo:
         adrp    x1, src
         add     x1, x1, :lo12:src
         adrp    x0, dst
         add     x0, x0, :lo12:dst
         ldp     q1, q0, [x1]
         stp     q1, q0, [x0]
         ret

instead of:
foo:
         adrp    x1, src
         add     x1, x1, :lo12:src
         adrp    x0, dst
         add     x0, x0, :lo12:dst
         ldp     x2, x3, [x1]
         stp     x2, x3, [x0]
         ldp     x2, x3, [x1, 16]
         stp     x2, x3, [x0, 16]
         ret

Bootstrapped and tested on aarch64-none-linux-gnu.
I hope this is a small enough change for GCC 9.
One could argue that it finishes up the work done this cycle to support Q-register LDP/STPs.

I've seen this give about 1.8% on 541.leela_r on Cortex-A57, with the other changes in SPEC2017 in the noise,
but there is a reduction in code size everywhere (due to more LDP/STP-Q pairs being formed).
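As a rough illustration of the larger-copy case (a hypothetical variant of the testcase above, not part of the patch): a 64-byte copy is expanded as four 128-bit chunks, which later passes can merge into two Q-reg LDP/STP pairs. Its behaviour can be sanity-checked at the C level:

```c
#include <assert.h>
#include <string.h>

#define N 16                    /* 64 bytes: four 128-bit chunks */
int big_src[N], big_dst[N];

/* With the patch, this copy expands as V4SImode moves, which later
   passes can pair into two LDP/STP-Q instructions.  */
void
copy_big (void)
{
  __builtin_memcpy (big_dst, big_src, N * sizeof (int));
}
```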

Ok for trunk?

Thanks,
Kyrill

2018-12-21  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.c (aarch64_expand_movmem): Use V4SImode for
     128-bit moves.

2018-12-21  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/movmem-q-reg_1.c: New test.

Comments

Kyrill Tkachov Jan. 4, 2019, 11:28 a.m. UTC | #1
Ping.
https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01560.html

Thanks,
Kyrill

James Greenhalgh Jan. 9, 2019, 5:50 p.m. UTC | #2
On Fri, Dec 21, 2018 at 06:30:49AM -0600, Kyrill Tkachov wrote:
> Ok for trunk?

I'm surprised by the logic. If we want to use 256-bit copies, shouldn't we
be explicit about that in the movmem code, rather than using 128-bit copies
that get merged? Why do TImode loads require two X registers? Shouldn't we
just fix TImode loads to use Q registers if that is preferable?

I'm not opposed to the principle of using LDP-Q in our movmem, but is this
the best way to make that happen?

Thanks,
James

Kyrill Tkachov Jan. 10, 2019, 10:15 a.m. UTC | #3
Hi James,

On 09/01/19 17:50, James Greenhalgh wrote:
> I'm surprised by the logic. If we want to use 256-bit copies, shouldn't we
> be explicit about that in the movmem code, rather than using 128-bit copies
> that get merged?

To emit the Q-reg pairs here we'd need to:
1) Adjust the copy_limit to 256 bits, after checking AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS in the tuning.
2) Adjust aarch64_copy_one_block_and_progress_pointers to handle 256-bit moves. There are a couple of ways to do that:
   i) Emit OImode moves. For that we'd need to add patterns/alternatives that use LDP/STP-Q for OImode moves (currently we only use OImode as a container mode for LD1/ST1 operations).

   ii) Emit explicit load/store pairs of TImode values. For that we'd need to generate two MEMs and two registers, which would complicate aarch64_copy_one_block_and_progress_pointers a bit more. Furthermore, we'd need to add {load,store}_pairtiti patterns in aarch64-simd.md that actually
   handle TImode pairs. These patterns wouldn't be very useful in other contexts: for the compiler to form them, register allocation would have to have chosen Q regs for the individual TImode operations (for the peepholing to match them), which is unlikely. So these patterns would exist largely for this bit of code alone.

   iii) Emit explicit V4SI (or any other 128-bit vector mode) pair LDP/STPs. This wouldn't need any adjustments to MD patterns, but would make aarch64_copy_one_block_and_progress_pointers more complex, as it would now have two paths, one of which handles two adjacent memory addresses in one call.
>   Why do TImode loads require two X registers? Shouldn't we
> just fix TImode loads to use Q registers if that is preferable?
We do have both alternatives in our movti pattern. The Q-reg alternative comes later in the list, so I guess it's slightly less preferred.

I believe the most common use of TImode is for wide integer arithmetic (our add-with-overflow patterns),
which is usually done on X registers, so preferring X registers makes sense in that scenario.
We only care about using Q-regs for TImode when moving memory.
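(As a hypothetical illustration of that kind of TImode use, not taken from the patch: a 128-bit add with overflow detection, which GCC expands on AArch64 as an adds/adcs sequence on pairs of X registers:)

```c
#include <stdbool.h>

/* 128-bit (TImode) addition with overflow detection.  The __int128
   operands are carried in pairs of X registers, not in Q registers.  */
static bool
add_u128 (unsigned __int128 a, unsigned __int128 b,
          unsigned __int128 *res)
{
  return __builtin_add_overflow (a, b, res);
}
```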

>
> I'm not opposed to the principle of using LDP-Q in our movmem, but is this
> the best way to make that happen?

It looked like the minimal patch to achieve this. If you'd like to see explicit pair creation straight away during expand,
I think approach iii) above is the least bad option.

Thanks,
Kyrill


Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 88b14179a4cbc5357dfabe21227ff9c8a111804c..a8dcdd4c9e22a7583a197372e500c787c91fe459 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16448,6 +16448,16 @@  aarch64_expand_movmem (rtx *operands)
 	if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_limit))
 	  cur_mode = mode_iter.require ();
 
+      /* If we want to use 128-bit chunks use a vector mode to prefer the use
+	 of Q registers.  This is preferable to using load/store-pairs of X
+	 registers as we need 1 Q-register vs 2 X-registers.
+	 Also, for targets that prefer it, further passes can create
+	 LDP/STP of Q-regs to further reduce the code size.  */
+      if (TARGET_SIMD
+	  && known_eq (GET_MODE_SIZE (cur_mode), GET_MODE_SIZE (TImode)))
+	cur_mode = V4SImode;
+
+
       gcc_assert (cur_mode != BLKmode);
 
       mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
diff --git a/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c b/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..09afad59712b939e25519f02153b5156ddacbf5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#define N 8
+int src[N], dst[N];
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, N * sizeof (int));
+}
+
+/* { dg-final { scan-assembler {ld[rp]\tq[0-9]*} } } */
+/* { dg-final { scan-assembler-not {ld[rp]\tx[0-9]*} } } */
+/* { dg-final { scan-assembler {st[rp]\tq[0-9]*} } } */
+/* { dg-final { scan-assembler-not {st[rp]\tx[0-9]*} } } */
\ No newline at end of file