Allow single-element interleaving for non-power-of-2 strides

Message ID 87o9o0yl6y.fsf@linaro.org
State New

Commit Message

Richard Sandiford Nov. 17, 2017, 3:33 p.m. UTC
This allows LD3 to be used for isolated a[i * 3] accesses, in a similar
way to the current a[i * 2] and a[i * 4] for LD2 and LD4 respectively.
Given the problems with the cost model underestimating the cost of
elementwise accesses, the patch continues to reject the VMAT_ELEMENTWISE
cases that are currently rejected.
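
As an illustration only (not part of the patch), the access pattern in
question can be sketched in plain C.  The names add_stride3, model_ld3
and LANES are made up for this example, and model_ld3 is just a scalar
model of the de-interleaving that an LD3-style structure load performs,
with a fixed lane count chosen for simplicity:

```c
#include <assert.h>
#include <stddef.h>

/* Scalar form of the access pattern: only the first element of each
   group of three is read, so the group has a single member followed
   by a gap of two elements.  */
void
add_stride3 (int *restrict dest, const int *restrict src, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    dest[i] += src[i * 3];
}

/* Hypothetical lane count, fixed here purely for illustration.  */
enum { LANES = 4 };

/* Scalar model of an LD3-style load: split 3 * LANES consecutive
   elements into three vectors.  The loop above only needs V0.  */
void
model_ld3 (const int *src, int v0[LANES], int v1[LANES], int v2[LANES])
{
  assert (src && v0 && v1 && v2);
  for (int lane = 0; lane < LANES; ++lane)
    {
      v0[lane] = src[lane * 3 + 0];
      v1[lane] = src[lane * 3 + 1];
      v2[lane] = src[lane * 3 + 2];
    }
}
```

In this model the vectorized loop would add v0 to LANES elements of
dest per iteration and simply ignore v1 and v2.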

Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Richard


2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* tree-vect-data-refs.c (vect_analyze_group_access_1): Allow
	single-element interleaving even if the size is not a power of 2.
	* tree-vect-stmts.c (get_load_store_type): Disallow elementwise
	accesses for single-element interleaving if the group size is
	not a power of 2.

gcc/testsuite/
	* gcc.target/aarch64/sve_struct_vect_18.c: New test.
	* gcc.target/aarch64/sve_struct_vect_18_run.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_19.c: Likewise.
	* gcc.target/aarch64/sve_struct_vect_19_run.c: Likewise.

Comments

Jeff Law Nov. 17, 2017, 6:40 p.m. UTC | #1
On 11/17/2017 08:33 AM, Richard Sandiford wrote:
> [...]
OK.
jeff
James Greenhalgh Jan. 7, 2018, 8:55 p.m. UTC | #2
On Fri, Nov 17, 2017 at 06:40:13PM +0000, Jeff Law wrote:
> On 11/17/2017 08:33 AM, Richard Sandiford wrote:
> > [...]
> OK.
> jeff

The AArch64 tests are OK.

Thanks,
James
Christophe Lyon Jan. 15, 2018, 10:20 a.m. UTC | #3
On 7 January 2018 at 21:55, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> On Fri, Nov 17, 2017 at 06:40:13PM +0000, Jeff Law wrote:
>> On 11/17/2017 08:33 AM, Richard Sandiford wrote:
>> > [...]
>> OK.
>> jeff
>
> The AArch64 tests are OK.
>

Hi,

After this commit (r256634), I have reported regressions on armeb in:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83851

Christophe

> Thanks,
> James
>

Patch

Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c	2017-11-17 15:32:12.513242384 +0000
+++ gcc/tree-vect-data-refs.c	2017-11-17 15:32:12.696843097 +0000
@@ -2440,11 +2440,10 @@  vect_analyze_group_access_1 (struct data
 	 element of the group that is accessed in the loop.  */
 
       /* Gaps are supported only for loads. STEP must be a multiple of the type
-	 size.  The size of the group must be a power of 2.  */
+	 size.  */
       if (DR_IS_READ (dr)
 	  && (dr_step % type_size) == 0
-	  && groupsize > 0
-	  && pow2p_hwi (groupsize))
+	  && groupsize > 0)
 	{
 	  GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = stmt;
 	  GROUP_SIZE (vinfo_for_stmt (stmt)) = groupsize;
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2017-11-17 15:32:12.513242384 +0000
+++ gcc/tree-vect-stmts.c	2017-11-17 15:32:12.697756534 +0000
@@ -2208,7 +2208,10 @@  get_load_store_type (gimple *stmt, tree
      cost of using elementwise accesses.  This check preserves the
      traditional behavior until that can be fixed.  */
   if (*memory_access_type == VMAT_ELEMENTWISE
-      && !STMT_VINFO_STRIDED_P (stmt_info))
+      && !STMT_VINFO_STRIDED_P (stmt_info)
+      && !(stmt == GROUP_FIRST_ELEMENT (stmt_info)
+	   && !GROUP_NEXT_ELEMENT (stmt_info)
+	   && !pow2p_hwi (GROUP_SIZE (stmt_info))))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_18.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_18.c	2017-11-17 15:32:12.695929661 +0000
@@ -0,0 +1,44 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#define N 2000
+
+#define TEST_LOOP(NAME, TYPE)					\
+  void __attribute__ ((noinline, noclone))			\
+  NAME (TYPE *restrict dest, TYPE *restrict src)		\
+  {								\
+    for (int i = 0; i < N; ++i)					\
+      dest[i] += src[i * 3];					\
+  }
+
+#define TEST(NAME) \
+  TEST_LOOP (NAME##_i8, signed char) \
+  TEST_LOOP (NAME##_i16, unsigned short) \
+  TEST_LOOP (NAME##_f32, float) \
+  TEST_LOOP (NAME##_f64, double)
+
+TEST (test)
+
+/* Check the vectorized loop.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld3b\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1b\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld3h\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1h\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld3w\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1w\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld3d\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1d\t} 1 } } */
+
+/* Check the scalar tail.  */
+/* { dg-final { scan-assembler-times {\tldrb\tw} 2 } } */
+/* { dg-final { scan-assembler-times {\tstrb\tw} 1 } } */
+/* { dg-final { scan-assembler-times {\tldrh\tw} 2 } } */
+/* { dg-final { scan-assembler-times {\tstrh\tw} 1 } } */
+/* { dg-final { scan-assembler-times {\tldr\ts} 2 } } */
+/* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */
+/* { dg-final { scan-assembler-times {\tldr\td} 2 } } */
+/* { dg-final { scan-assembler-times {\tstr\td} 1 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_18_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_18_run.c	2017-11-17 15:32:12.695929661 +0000
@@ -0,0 +1,36 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_struct_vect_18.c"
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, TYPE)				\
+  {							\
+    TYPE out[N];					\
+    TYPE in[N * 3];					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	out[i] = i * 7 / 2;				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    for (int i = 0; i < N * 3; ++i)			\
+      {							\
+	in[i] = i * 9 / 2;				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    NAME (out, in);					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	TYPE expected = i * 7 / 2 + in[i * 3];		\
+	if (out[i] != expected)				\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_19.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_19.c	2017-11-17 15:32:12.695929661 +0000
@@ -0,0 +1,42 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, TYPE)					\
+  void __attribute__ ((noinline, noclone))			\
+  NAME (TYPE *restrict dest, TYPE *restrict src, int n)		\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      dest[i] += src[i * 3];					\
+  }
+
+#define TEST(NAME) \
+  TEST_LOOP (NAME##_i8, signed char) \
+  TEST_LOOP (NAME##_i16, unsigned short) \
+  TEST_LOOP (NAME##_f32, float) \
+  TEST_LOOP (NAME##_f64, double)
+
+TEST (test)
+
+/* Check the vectorized loop.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld3b\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1b\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld3h\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1h\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld3w\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1w\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld3d\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1d\t} 1 } } */
+
+/* Check the scalar tail.  */
+/* { dg-final { scan-assembler-times {\tldrb\tw} 2 } } */
+/* { dg-final { scan-assembler-times {\tstrb\tw} 1 } } */
+/* { dg-final { scan-assembler-times {\tldrh\tw} 2 } } */
+/* { dg-final { scan-assembler-times {\tstrh\tw} 1 } } */
+/* { dg-final { scan-assembler-times {\tldr\ts} 2 } } */
+/* { dg-final { scan-assembler-times {\tstr\ts} 1 } } */
+/* { dg-final { scan-assembler-times {\tldr\td} 2 } } */
+/* { dg-final { scan-assembler-times {\tstr\td} 1 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_struct_vect_19_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_struct_vect_19_run.c	2017-11-17 15:32:12.695929661 +0000
@@ -0,0 +1,45 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_struct_vect_19.c"
+
+#define N 1000
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, TYPE)			\
+  {						\
+    TYPE out[N];				\
+    TYPE in[N * 3];				\
+    int counts[] = { 0, 1, N - 1 };		\
+    for (int j = 0; j < 3; ++j)			\
+      {						\
+	int count = counts[j];			\
+	for (int i = 0; i < N; ++i)		\
+	  {					\
+	    out[i] = i * 7 / 2;			\
+	    asm volatile ("" ::: "memory");	\
+	  }					\
+	for (int i = 0; i < N * 3; ++i)		\
+	  {					\
+	    in[i] = i * 9 / 2;			\
+	    asm volatile ("" ::: "memory");	\
+	  }					\
+	NAME (out, in, count);			\
+	for (int i = 0; i < N; ++i)		\
+	  {					\
+	    TYPE expected = i * 7 / 2;		\
+	    if (i < count)			\
+	      expected += in[i * 3];		\
+	    if (out[i] != expected)		\
+	      __builtin_abort ();		\
+	    asm volatile ("" ::: "memory");	\
+	  }					\
+      }						\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}