PR/67682, break SLP groups up if only some elements match

Message ID 1447431522-4695-1-git-send-email-alan.lawrence@arm.com
State New

Commit Message

Alan Lawrence Nov. 13, 2015, 4:18 p.m. UTC
On 10/11/15 12:51, Richard Biener wrote:
>>
>> Just noticing this... if we have a vectorization factor of 4 and matches
>> is 1, 1, 1, 1,  1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 and
>> 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?
>>
>> Thus shouldn't we split either only if i % vectorization_factor is 0 or
>> if not, split "twice", dropping the intermediate surely non-matching
>> group of vectorization_factor size?  After all if we split like with the
>> patch then the upper half will _not_ be split again with the
>> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).
>>
>> So I expect that the statistics will be the same if we restrict splitting
>> to the i % vectorization_factor == 0 case, or rather split where we do
>> now but only re-analyze group2 if i % vectorization_factor == 0 holds?
>>
>> Ok with that change.  Improvements on that incrementally.
>
> Btw, it just occurs to me that the whole thing is equivalent to splitting
> the store-group into vector-size pieces up-front?  That way we do
>> the maximum splitting up-front and avoid any redundant work?
>
> The patch is still ok as said, just the above may be a simple thing
> to explore.

I'd refrained from splitting in vect_analyze_group_access_1, as my understanding
was that we only did that once, whereas we would retry the
vect_analyze_slp_instance path each time we decreased the
vectorization_factor. However, I did try putting code at the beginning of
vect_analyze_slp_instance to split up any groups > vf. Unfortunately this loses
us some previously-successful SLPs, as some bigger groups cannot be SLPed if we
split them, because they require 'unrolling'; so I'm not addressing that here.

However, your suggestion of splitting twice when we know the boundary is in the
middle of a vector is a nice compromise; it nets us a good number of extra
successes in SPEC2000 and SPEC2006, about 7% more than without the patch.
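
To make that concrete, here is a small standalone sketch of the split-point
arithmetic, run on Richard's example of matches = 1,1,1,1, 1,1,0,0, 0,0,0,0
with a vectorization factor of 4. This is illustrative only, not code from the
patch; "vf" abbreviates vectorization_factor, and the other names mirror
vect_analyze_slp_instance below:

#include <stdio.h>

int
main (void)
{
  unsigned vf = 4;              /* vectorization factor, a power of two */
  unsigned group_size = 12;
  int matches[12] = { 1,1,1,1, 1,1,0,0, 0,0,0,0 };

  /* Find the first non-matching element.  */
  unsigned i;
  for (i = 0; i < group_size; i++)
    if (!matches[i])
      break;                    /* here: i == 6 */

  /* First split point: the last vector boundary at or before i.  */
  unsigned group1_size = i & ~(vf - 1);         /* here: 4 */
  printf ("group1: stmts [0, %u), analyzed as its own instance\n",
          group1_size);

  if (group1_size < i)
    {
      /* The first non-match fell mid-vector, so that whole vector can
         never match: split again and drop it.  */
      i = group1_size + vf;                     /* here: 8 */
      printf ("dropped: stmts [%u, %u), known non-matching\n",
              group1_size, i);
    }
  if (i < group_size)
    printf ("group2: stmts [%u, %u), re-analyzed recursively\n",
            i, group_size);
  return 0;
}

Run on this example it prints the three ranges: [0, 4) analyzed, [4, 8)
dropped, and [8, 12) re-analyzed.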

Hence, here's the patch I've committed, as r230330, after regstrap on x86_64
and AArch64. (I dropped the previous bb-slp-subgroups-2 and renamed the others
up as we don't do that one anymore.)

Cheers, Alan

gcc/ChangeLog:

	PR tree-optimization/67682
	* tree-vect-slp.c (vect_split_slp_store_group): New.
	(vect_analyze_slp_instance): During basic block SLP, recurse on
	subgroups if vect_build_slp_tree fails after 1st vector.

gcc/testsuite/ChangeLog:

	PR tree-optimization/67682
	* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
	* gcc.dg/vect/bb-slp-subgroups-1.c: New.
	* gcc.dg/vect/bb-slp-subgroups-2.c: New.
	* gcc.dg/vect/bb-slp-subgroups-3.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 +--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 41 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 +++++++++++++
 gcc/tree-vect-slp.c                            | 85 +++++++++++++++++++++++++-
 5 files changed, 215 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c

Comments

Christophe Lyon Nov. 16, 2015, 2:42 p.m. UTC | #1
On 13 November 2015 at 17:18, Alan Lawrence <alan.lawrence@arm.com> wrote:
> [...]

Hi Alan,

I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
fails for armeb targets.
I haven't had time to look into it in more detail yet, but I guess you can
reproduce it quickly enough.



Alan Lawrence Nov. 17, 2015, 4:46 p.m. UTC | #2
On 16/11/15 14:42, Christophe Lyon wrote:
>
> Hi Alan,
>
> I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
> fails for armeb targets.
> I haven't had time to look at more details yet, but I guess you can
> reproduce it quickly enough.
>

Thanks - yes I see it now.

-fdump-tree-optimized looks sensible:

__attribute__((noinline))
test ()
{
   vector(4) int vect__14.21;
   vector(4) int vect__2.20;
   vector(4) int vect__2.19;
   vector(4) int vect__3.13;
   vector(4) int vect__2.12;

   <bb 2>:
   vect__2.12_24 = MEM[(int *)&b];
   vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
   MEM[(int *)&a] = vect__3.13_27;
   vect__2.19_31 = MEM[(int *)&b + 16B];
   vect__2.20_33 = VEC_PERM_EXPR <vect__2.12_24, vect__2.19_31, { 0, 2, 4, 6 }>;
   vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
   MEM[(int *)&a + 16B] = vect__14.21_35;
   return;
}

but while a[0..3] end up containing 5 7 9 11 as expected,
a[4..7] end up with 30 32 30 28 rather than the expected 12 24 40 70.
That is, we end up with (10 8 6 4), rather than the expected (4 6 8 10), being 
multiplied by {3,4,5,7}. Looking at the RTL, those values come from a UZP1/2 
pair that should extract elements {0,2,4,6} of b. Assembler, with my workings as 
to what's in each register:

test:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	movw	r2, #:lower16:b
	movt	r2, #:upper16:b
	vldr	d22, .L11
	vldr	d23, .L11+8
;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
	movw	r3, #:lower16:a
	movt	r3, #:upper16:a
	vld1.64	{d16-d17}, [r2:64]
;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
	vmov	q9, q8  @ v4si
;; q9 = (b[2] b[3] b[0] b[1])
	vldr	d20, [r2, #16]
	vldr	d21, [r2, #24]
;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
	vuzp.32	q10, q9
;; So  q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
	vldr	d20, .L11+16
	vldr	d21, .L11+24
;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
	vmul.i32	q9, q9, q11
;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
	vadd.i32	q8, q8, q10
;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
	vst1.64	{d16-d17}, [r3:64]
;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3]=b[3]+4 all ok
	vstr	d18, [r3, #16]
;; a[4] = b[6]*3, a[5] = b[4]*4
	vstr	d19, [r3, #24]
;; a[6] = b[2]*5, a[7] = b[0]*7
	bx	lr
.L12:
	.align	3
.L11:
	.word	3
	.word	4
	.word	5
	.word	7
	.word	1
	.word	2
	.word	3
	.word	4

Which is to say - the bit order in the q-registers is neither big- nor
little-endian, but the elements get stored back to memory in an order
consistent with how they were loaded, so we're OK as long as there are no
permutes. Unfortunately for UZP this lane ordering mixup is not idempotent and
messes everything up...
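
As a quick sanity check on the workings above (a standalone illustration, not
part of the patch): with b[i] = i + 4 as in the testcase, the selection the
permute should make, b[0],b[2],b[4],b[6], gives the expected 12 24 40 70,
while the reversed selection b[6],b[4],b[2],b[0] seen in the assembly gives
exactly the observed 30 32 30 28:

#include <stdio.h>

int
main (void)
{
  int b[8];
  for (int i = 0; i < 8; i++)
    b[i] = i + 4;
  /* What VEC_PERM_EXPR <..., { 0, 2, 4, 6 }> should select.  */
  int expected[4] = { b[0] * 3, b[2] * 4, b[4] * 5, b[6] * 7 };
  /* What the armeb uzp actually selected, per the annotations above.  */
  int observed[4] = { b[6] * 3, b[4] * 4, b[2] * 5, b[0] * 7 };
  for (int i = 0; i < 4; i++)
    printf ("a[%d]: expected %d, observed %d\n",
            i + 4, expected[i], observed[i]);
  return 0;
}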

Hmmm. I'm feeling that "properly" fixing this testcase amounts to fixing
armeb's whole register-numbering/lane-flipping scheme, and might be quite a
large task. OTOH it might also fix the significant number of failing vectorizer
tests. A simpler solution might be to disable some part of vector support on
armeb, but I'm not sure which part would be best yet.

Thoughts (CC maintainers)?

--Alan
Richard Biener Nov. 18, 2015, 9:26 a.m. UTC | #3
On Tue, Nov 17, 2015 at 5:46 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> [...]

As most of the permutation support in the arm backend is disabled
for armeb, I suppose disabling some more of it is appropriate ...

Or finally sitting down and fixing the mess.  For constant permutes,
fixing it on the constant side would be an option, I guess.

Richard.


Patch

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@  main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@  main1 (unsigned int x, unsigned int y)
 
   /* Check results.  */
   if (out[0] != (in[0] + 23) * x
-      || out[1] != (in[1] + 142) * y
+      || out[1] != (in[1] * 142) * y
       || out[2] != (in[2] + 2) * x
       || out[3] != (in[3] * 31) * y)
     abort();
@@ -47,4 +47,4 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-  
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@ 
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1];
+    a[2] = b[2];
+    a[3] = b[3];
+    a[4] = 0;
+    a[5] = 0;
+    a[6] = 0;
+    a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  for (int i = 0; i < 4; i++)
+    if (a[i] != i+4)
+      abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != 0)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..13c51f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,41 @@ 
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[2] + 1;
+    a[1] = b[0] + 2;
+    a[2] = b[1] + 3;
+    a[3] = b[1] + 4;
+    a[4] = b[3] * 3;
+    a[5] = b[0] * 4;
+    a[6] = b[2] * 5;
+    a[7] = b[1] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
+      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
new file mode 100644
index 0000000..6ae9a89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
@@ -0,0 +1,41 @@ 
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0] + 1;
+    a[1] = b[1] + 2;
+    a[2] = b[2] + 3;
+    a[3] = b[3] + 4;
+    a[4] = b[0] * 3;
+    a[5] = b[2] * 4;
+    a[6] = b[4] * 5;
+    a[7] = b[6] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
+      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index cfdfc29..65a183f 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1606,6 +1606,54 @@  vect_analyze_slp_cost (slp_instance instance, void *data)
   body_cost_vec.release ();
 }
 
+/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
+   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
+   the first GROUP1_SIZE stmts, since stores are consecutive), the second
+   containing the remainder.
+   Return the first stmt in the second group.  */
+
+static gimple *
+vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
+{
+  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
+  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
+  gcc_assert (group1_size > 0);
+  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
+  gcc_assert (group2_size > 0);
+  GROUP_SIZE (first_vinfo) = group1_size;
+
+  gimple *stmt = first_stmt;
+  for (unsigned i = group1_size; i > 1; i--)
+    {
+      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+  /* STMT is now the last element of the first group.  */
+  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
+  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+
+  /* For the second group, the GROUP_GAP is that before the original group,
+     plus skipping over the first vector.  */
+  GROUP_GAP (vinfo_for_stmt (group2)) =
+    GROUP_GAP (first_vinfo) + group1_size;
+
+  /* GROUP_GAP of the first group now has to skip over the second group too.  */
+  GROUP_GAP (first_vinfo) += group2_size;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
+		     group1_size, group2_size);
+
+  return group2;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -1621,7 +1669,7 @@  vect_analyze_slp_instance (vec_info *vinfo,
   tree vectype, scalar_type = NULL_TREE;
   gimple *next;
   unsigned int vectorization_factor = 0;
-  int i;
+  unsigned int i;
   unsigned int max_nunits = 0;
   vec<slp_tree> loads;
   struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1811,6 +1859,41 @@  vect_analyze_slp_instance (vec_info *vinfo,
   vect_free_slp_tree (node);
   loads.release ();
 
+  /* For basic block SLP, try to break the group up into multiples of the
+     vectorization factor.  */
+  if (is_a <bb_vec_info> (vinfo)
+      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
+      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
+    {
+      /* We consider breaking the group only on VF boundaries from the existing
+	 start.  */
+      for (i = 0; i < group_size; i++)
+	if (!matches[i]) break;
+
+      if (i >= vectorization_factor && i < group_size)
+	{
+	  /* Split into two groups at the first vector boundary before i.  */
+	  gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+	  unsigned group1_size = i & ~(vectorization_factor - 1);
+
+	  gimple *rest = vect_split_slp_store_group (stmt, group1_size);
+	  bool res = vect_analyze_slp_instance (vinfo, stmt, max_tree_size);
+	  /* If the first non-match was in the middle of a vector,
+	     skip the rest of that vector.  */
+	  if (group1_size < i)
+	    {
+	      i = group1_size + vectorization_factor;
+	      if (i < group_size)
+		rest = vect_split_slp_store_group (rest, vectorization_factor);
+	    }
+	  if (i < group_size)
+	    res |= vect_analyze_slp_instance (vinfo, rest, max_tree_size);
+	  return res;
+	}
+      /* Even though the first vector did not all match, we might be able to SLP
+	 (some) of the remainder.  FORNOW ignore this possibility.  */
+    }
+
   return false;
 }