diff mbox

PR/67682, break SLP groups up if only some elements match

Message ID 1447081181-803-1-git-send-email-alan.lawrence@arm.com
State New
Headers show

Commit Message

Alan Lawrence Nov. 9, 2015, 2:59 p.m. UTC
On 06/11/15 12:55, Richard Biener wrote:
>
>> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
>> +  GROUP_GAP (first_vinfo) += group2_size;
>
> Please add a MSG_NOTE debug printf stating that we split the group and
> at which element.

Done.

> I think you want to add && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))
> otherwise this could be SLP reductions where there is no way the split
> group would enable vectorization.

Ah, I had thought that the (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))) check
sufficed for that, as per e.g. the section above
  /* Create a node (a root of the SLP tree) for the packed grouped stores. */
But done.

> Note that BB vectorization now also very aggressively falls back to considering
> non-matches being "external".
>
> Not sure why that doesn't trigger for your testcases ;)

I tested against trunk r229944, on which all of my scan-tree-dump's were failing....

> I'm comfortable with the i < group_size half of the patch.  For the other piece
> I'd like to see more compile-time / success data from, say, building
> SPEC CPU 2006.

Well, as a couple of quick data points, a compile of SPEC2000 on
aarch64-none-linux-gnu (-Ofast -fomit-frame-pointer -mcpu=cortex-a57), I have:

3080 successes without patch;
+79 successes from the "i < vectorization_factor" part of the patch (on its own)
+90 successes from the (i>=vectorization_factor) && "i < group_size" part (on its own)
+(79 from first) +(90 from second) + (an additional 62) from both parts together.

And for SPEC2006, aarch64-linux-gnu (-O3 -fomit-frame-pointer -mcpu=cortex-a57):
11979 successes without patch;
+ 499 from the "i < vectorization_factor" part
+ 264 from the (i >= vectorization factor) && (i < group_size)" part
+ extra 336 if both parts combined.

I haven't done any significant measurements of compile-time yet.

(snipping this bit out-of-order)
> Hmm. This is of course pretty bad for compile-time for the non-matching
> case as that effectively will always split into N pieces if we feed it
> garbage (that is, without being sure that at least one pice _will_ vectorize).
>
> OTOH with the current way of computing "matches" we'd restrict ourselves
> to cases where the first stmt we look at (and match to) happens to be
> the operation that in the end will form a matching group.
...
> Eventually we'd want to improve the "matches" return
> to include the affected stmts (ok, that'll be not very easy) so we can do
> a cheap "if we split here will it fix that particular mismatch" check.

Yes, I think there are a bunch of things we can do here, that would be more
powerful than the simple approach I used here. The biggest limiting factor will
probably be (lack of) permutation, i.e. if we only SLP stores to consecutive
addresses.

> So, please split the patch and I suggest to leave the controversical part
> for next stage1 together with some improvement on the SLP build process
> itself?

Here's a reduced version with just the second case,
bootstrapped+check-gcc/g++ on x86_64.

gcc/ChangeLog:

	* tree-vect-slp.c (vect_split_slp_store_group): New.
	(vect_analyze_slp_instance): During basic block SLP, recurse on
	subgroups if vect_build_slp_tree fails after 1st vector.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
	* gcc.dg/vect/bb-slp-subgroups-1.c: New.
	* gcc.dg/vect/bb-slp-subgroups-2.c: New.
	* gcc.dg/vect/bb-slp-subgroups-3.c: New.
	* gcc.dg/vect/bb-slp-subgroups-4.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 ++--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++++
 gcc/tree-vect-slp.c                            | 74 +++++++++++++++++++++++++-
 6 files changed, 246 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c

Comments

Richard Biener Nov. 10, 2015, 12:45 p.m. UTC | #1
On Mon, Nov 9, 2015 at 3:59 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> On 06/11/15 12:55, Richard Biener wrote:
>>
>>> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
>>> +  GROUP_GAP (first_vinfo) += group2_size;
>>
>> Please add a MSG_NOTE debug printf stating that we split the group and
>> at which element.
>
> Done.
>
>> I think you want to add && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))
>> otherwise this could be SLP reductions where there is no way the split
>> group would enable vectorization.
>
> Ah, I had thought that the (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))) check
> sufficed for that, as per e.g. the section above
>   /* Create a node (a root of the SLP tree) for the packed grouped stores. */
> But done.
>
>> Note that BB vectorization now also very aggressively falls back to considering
>> non-matches being "external".
>>
>> Not sure why that doesn't trigger for your testcases ;)
>
> I tested against trunk r229944, on which all of my scan-tree-dump's were failing....
>
>> I'm comfortable with the i < group_size half of the patch.  For the other piece
>> I'd like to see more compile-time / success data from, say, building
>> SPEC CPU 2006.
>
> Well, as a couple of quick data points, a compile of SPEC2000 on
> aarch64-none-linux-gnu (-Ofast -fomit-frame-pointer -mcpu=cortex-a57), I have:
>
> 3080 successes without patch;
> +79 successes from the "i < vectorization_factor" part of the patch (on its own)
> +90 successes from the (i>=vectorization_factor) && "i < group_size" part (on its own)
> +(79 from first) +(90 from second) + (an additional 62) from both parts together.
>
> And for SPEC2006, aarch64-linux-gnu (-O3 -fomit-frame-pointer -mcpu=cortex-a57):
> 11979 successes without patch;
> + 499 from the "i < vectorization_factor" part
> + 264 from the (i >= vectorization factor) && (i < group_size)" part
> + extra 336 if both parts combined.

Thanks

> I haven't done any significant measurements of compile-time yet.
>
> (snipping this bit out-of-order)
>> Hmm. This is of course pretty bad for compile-time for the non-matching
>> case as that effectively will always split into N pieces if we feed it
>> garbage (that is, without being sure that at least one pice _will_ vectorize).
>>
>> OTOH with the current way of computing "matches" we'd restrict ourselves
>> to cases where the first stmt we look at (and match to) happens to be
>> the operation that in the end will form a matching group.
> ...
>> Eventually we'd want to improve the "matches" return
>> to include the affected stmts (ok, that'll be not very easy) so we can do
>> a cheap "if we split here will it fix that particular mismatch" check.
>
> Yes, I think there are a bunch of things we can do here, that would be more
> powerful than the simple approach I used here. The biggest limiting factor will
> probably be (lack of) permutation, i.e. if we only SLP stores to consecutive
> addresses.

Yeah, doing anything else requires us to use masked or strided stores though.

>> So, please split the patch and I suggest to leave the controversical part
>> for next stage1 together with some improvement on the SLP build process
>> itself?
>
> Here's a reduced version with just the second case,
> bootstrapped+check-gcc/g++ on x86_64.
>
> gcc/ChangeLog:
>
>         * tree-vect-slp.c (vect_split_slp_store_group): New.
>         (vect_analyze_slp_instance): During basic block SLP, recurse on
>         subgroups if vect_build_slp_tree fails after 1st vector.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
>         * gcc.dg/vect/bb-slp-subgroups-1.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-2.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-3.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-4.c: New.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 ++--
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++++
>  gcc/tree-vect-slp.c                            | 74 +++++++++++++++++++++++++-
>  6 files changed, 246 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> index ab54a48..b8bef8c 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
>    unsigned int *pout = &out[0];
>    unsigned int a0, a1, a2, a3;
>
> -  /* Non isomorphic.  */
> +  /* Non isomorphic, even 64-bit subgroups.  */
>    a0 = *pin++ + 23;
> -  a1 = *pin++ + 142;
> +  a1 = *pin++ * 142;
>    a2 = *pin++ + 2;
>    a3 = *pin++ * 31;
> -
> +
>    *pout++ = a0 * x;
>    *pout++ = a1 * y;
>    *pout++ = a2 * x;
> @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
>
>    /* Check results.  */
>    if (out[0] != (in[0] + 23) * x
> -      || out[1] != (in[1] + 142) * y
> +      || out[1] != (in[1] * 142) * y
>        || out[2] != (in[2] + 2) * x
>        || out[3] != (in[3] * 31) * y)
>      abort();
> @@ -47,4 +47,4 @@ int main (void)
>  }
>
>  /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
> -
> +
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> new file mode 100644
> index 0000000..39c23c3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> @@ -0,0 +1,44 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0];
> +    a[1] = b[1];
> +    a[2] = b[2];
> +    a[3] = b[3];
> +    a[4] = 0;
> +    a[5] = 0;
> +    a[6] = 0;
> +    a[7] = 0;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 4; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  for (int i = 0; i < 4; i++)
> +    if (a[i] != i+4)
> +      abort ();
> +  for (int i = 4; i < 8; i++)
> +    if (a[i] != 0)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> new file mode 100644
> index 0000000..06099fd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> @@ -0,0 +1,42 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[8];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0];
> +    a[1] = b[1] + 1;
> +    a[2] = b[2] * 2;
> +    a[3] = b[3] + 3;
> +
> +    a[4] = b[4] + 4;
> +    a[5] = b[5] + 4;
> +    a[6] = b[6] + 4;
> +    a[7] = b[7] + 4;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    b[i] = i + 1;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
> +    abort ();
> +  for (int i = 4; i < 8; i++)
> +    if (a[i] != i + 5)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> new file mode 100644
> index 0000000..13c51f3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[2] + 1;
> +    a[1] = b[0] + 2;
> +    a[2] = b[1] + 3;
> +    a[3] = b[1] + 4;
> +    a[4] = b[3] * 3;
> +    a[5] = b[0] * 4;
> +    a[6] = b[2] * 5;
> +    a[7] = b[1] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 4; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
> +      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
> +    abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
> new file mode 100644
> index 0000000..6ae9a89
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[8];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0] + 1;
> +    a[1] = b[1] + 2;
> +    a[2] = b[2] + 3;
> +    a[3] = b[3] + 4;
> +    a[4] = b[0] * 3;
> +    a[5] = b[2] * 4;
> +    a[6] = b[4] * 5;
> +    a[7] = b[6] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 8; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
> +      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
> +    abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index cfdfc29..05d4b35 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
>    body_cost_vec.release ();
>  }
>
> +/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
> +   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
> +   the first GROUP1_SIZE stmts, since stores are consecutive), the second
> +   containing the remainder.
> +   Return the first stmt in the second group.  */
> +
> +static gimple *
> +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
> +{
> +  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
> +  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
> +  gcc_assert (group1_size > 0);
> +  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
> +  gcc_assert (group2_size > 0);
> +  GROUP_SIZE (first_vinfo) = group1_size;
> +
> +  gimple *stmt = first_stmt;
> +  for (unsigned i = group1_size; i > 1; i--)
> +    {
> +      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> +    }
> +  /* STMT is now the last element of the first group.  */
> +  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
> +
> +  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
> +  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
> +    {
> +      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> +    }
> +
> +  /* For the second group, the GROUP_GAP is that before the original group,
> +     plus skipping over the first vector.  */
> +  GROUP_GAP (vinfo_for_stmt (group2)) =
> +    GROUP_GAP (first_vinfo) + group1_size;
> +
> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
> +  GROUP_GAP (first_vinfo) += group2_size;
> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
> +                    group1_size, group2_size);
> +
> +  return group2;
> +}
> +
>  /* Analyze an SLP instance starting from a group of grouped stores.  Call
>     vect_build_slp_tree to build a tree of packed stmts if possible.
>     Return FALSE if it's impossible to SLP any stmt in the loop.  */
> @@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    tree vectype, scalar_type = NULL_TREE;
>    gimple *next;
>    unsigned int vectorization_factor = 0;
> -  int i;
> +  unsigned int i;
>    unsigned int max_nunits = 0;
>    vec<slp_tree> loads;
>    struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
> @@ -1811,6 +1859,30 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    vect_free_slp_tree (node);
>    loads.release ();
>
> +  /* For basic block SLP, try to break the group up into multiples of the
> +     vectorization factor.  */
> +  if (is_a <bb_vec_info> (vinfo)
> +      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
> +      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
> +    {
> +      /* We consider breaking the group only on VF boundaries from the existing
> +        start.  */
> +      for (i = 0; i < group_size; i++)
> +       if (!matches[i]) break;
> +
> +      if (i >= vectorization_factor && i < group_size)
> +       {
> +         /* Split into two groups at the first vector boundary before i.  */
> +         gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);

Just noticing this... if we have a vectorization factor of 4 and matches
is 1, 1, 1, 1,  1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 and
1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?

Thus shouldn't we split either only if i % vectorization_factor is 0 or
if not, split "twice", dropping the intermediate surely non-matching
group of vectorization_factor size?  After all if we split like with the
patch then the upper half will _not_ be splitted again with the
simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).

So I expect that the statistics will be the same if we restrict splitting
to the i % vectorization_factor == 0 case, or rather split where we do
now but only re-analyze group2 if i % vectorization_factor == 0 holds?

Ok with that change.  Improvements on that incrementally.

Thanks,
Richard.

> +         i &= ~(vectorization_factor - 1);
> +         gimple *grp2start = vect_split_slp_store_group (stmt, i);
> +         return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
> +                | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);
> +        }
> +      /* Even though the first vector did not all match, we might be able to SLP
> +        (some) of the remainder.  FORNOW ignore this possibility.  */
> +    }
> +
>    return false;
>  }
>
> --
> 1.9.1
>
Richard Biener Nov. 10, 2015, 12:51 p.m. UTC | #2
On Tue, Nov 10, 2015 at 1:45 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Mon, Nov 9, 2015 at 3:59 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> On 06/11/15 12:55, Richard Biener wrote:
>>>
>>>> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
>>>> +  GROUP_GAP (first_vinfo) += group2_size;
>>>
>>> Please add a MSG_NOTE debug printf stating that we split the group and
>>> at which element.
>>
>> Done.
>>
>>> I think you want to add && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))
>>> otherwise this could be SLP reductions where there is no way the split
>>> group would enable vectorization.
>>
>> Ah, I had thought that the (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))) check
>> sufficed for that, as per e.g. the section above
>>   /* Create a node (a root of the SLP tree) for the packed grouped stores. */
>> But done.
>>
>>> Note that BB vectorization now also very aggressively falls back to considering
>>> non-matches being "external".
>>>
>>> Not sure why that doesn't trigger for your testcases ;)
>>
>> I tested against trunk r229944, on which all of my scan-tree-dump's were failing....
>>
>>> I'm comfortable with the i < group_size half of the patch.  For the other piece
>>> I'd like to see more compile-time / success data from, say, building
>>> SPEC CPU 2006.
>>
>> Well, as a couple of quick data points, a compile of SPEC2000 on
>> aarch64-none-linux-gnu (-Ofast -fomit-frame-pointer -mcpu=cortex-a57), I have:
>>
>> 3080 successes without patch;
>> +79 successes from the "i < vectorization_factor" part of the patch (on its own)
>> +90 successes from the (i>=vectorization_factor) && "i < group_size" part (on its own)
>> +(79 from first) +(90 from second) + (an additional 62) from both parts together.
>>
>> And for SPEC2006, aarch64-linux-gnu (-O3 -fomit-frame-pointer -mcpu=cortex-a57):
>> 11979 successes without patch;
>> + 499 from the "i < vectorization_factor" part
>> + 264 from the (i >= vectorization factor) && (i < group_size)" part
>> + extra 336 if both parts combined.
>
> Thanks
>
>> I haven't done any significant measurements of compile-time yet.
>>
>> (snipping this bit out-of-order)
>>> Hmm. This is of course pretty bad for compile-time for the non-matching
>>> case as that effectively will always split into N pieces if we feed it
>>> garbage (that is, without being sure that at least one pice _will_ vectorize).
>>>
>>> OTOH with the current way of computing "matches" we'd restrict ourselves
>>> to cases where the first stmt we look at (and match to) happens to be
>>> the operation that in the end will form a matching group.
>> ...
>>> Eventually we'd want to improve the "matches" return
>>> to include the affected stmts (ok, that'll be not very easy) so we can do
>>> a cheap "if we split here will it fix that particular mismatch" check.
>>
>> Yes, I think there are a bunch of things we can do here, that would be more
>> powerful than the simple approach I used here. The biggest limiting factor will
>> probably be (lack of) permutation, i.e. if we only SLP stores to consecutive
>> addresses.
>
> Yeah, doing anything else requires us to use masked or strided stores though.
>
>>> So, please split the patch and I suggest to leave the controversical part
>>> for next stage1 together with some improvement on the SLP build process
>>> itself?
>>
>> Here's a reduced version with just the second case,
>> bootstrapped+check-gcc/g++ on x86_64.
>>
>> gcc/ChangeLog:
>>
>>         * tree-vect-slp.c (vect_split_slp_store_group): New.
>>         (vect_analyze_slp_instance): During basic block SLP, recurse on
>>         subgroups if vect_build_slp_tree fails after 1st vector.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
>>         * gcc.dg/vect/bb-slp-subgroups-1.c: New.
>>         * gcc.dg/vect/bb-slp-subgroups-2.c: New.
>>         * gcc.dg/vect/bb-slp-subgroups-3.c: New.
>>         * gcc.dg/vect/bb-slp-subgroups-4.c: New.
>> ---
>>  gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 ++--
>>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++++
>>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++++
>>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++++
>>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++++
>>  gcc/tree-vect-slp.c                            | 74 +++++++++++++++++++++++++-
>>  6 files changed, 246 insertions(+), 6 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
>>
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
>> index ab54a48..b8bef8c 100644
>> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
>> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
>>    unsigned int *pout = &out[0];
>>    unsigned int a0, a1, a2, a3;
>>
>> -  /* Non isomorphic.  */
>> +  /* Non isomorphic, even 64-bit subgroups.  */
>>    a0 = *pin++ + 23;
>> -  a1 = *pin++ + 142;
>> +  a1 = *pin++ * 142;
>>    a2 = *pin++ + 2;
>>    a3 = *pin++ * 31;
>> -
>> +
>>    *pout++ = a0 * x;
>>    *pout++ = a1 * y;
>>    *pout++ = a2 * x;
>> @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
>>
>>    /* Check results.  */
>>    if (out[0] != (in[0] + 23) * x
>> -      || out[1] != (in[1] + 142) * y
>> +      || out[1] != (in[1] * 142) * y
>>        || out[2] != (in[2] + 2) * x
>>        || out[3] != (in[3] * 31) * y)
>>      abort();
>> @@ -47,4 +47,4 @@ int main (void)
>>  }
>>
>>  /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
>> -
>> +
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>> new file mode 100644
>> index 0000000..39c23c3
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>> @@ -0,0 +1,44 @@
>> +/* { dg-require-effective-target vect_int } */
>> +/* PR tree-optimization/67682.  */
>> +
>> +#include "tree-vect.h"
>> +
>> +int __attribute__((__aligned__(8))) a[8];
>> +int __attribute__((__aligned__(8))) b[4];
>> +
>> +__attribute__ ((noinline)) void
>> +test ()
>> +{
>> +    a[0] = b[0];
>> +    a[1] = b[1];
>> +    a[2] = b[2];
>> +    a[3] = b[3];
>> +    a[4] = 0;
>> +    a[5] = 0;
>> +    a[6] = 0;
>> +    a[7] = 0;
>> +}
>> +
>> +int
>> +main (int argc, char **argv)
>> +{
>> +  check_vect ();
>> +
>> +  for (int i = 0; i < 8; i++)
>> +    a[i] = 1;
>> +  for (int i = 0; i < 4; i++)
>> +    b[i] = i + 4;
>> +  __asm__ volatile ("" : : : "memory");
>> +  test (a, b);
>> +  __asm__ volatile ("" : : : "memory");
>> +  for (int i = 0; i < 4; i++)
>> +    if (a[i] != i+4)
>> +      abort ();
>> +  for (int i = 4; i < 8; i++)
>> +    if (a[i] != 0)
>> +      abort ();
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
>> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>> new file mode 100644
>> index 0000000..06099fd
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>> @@ -0,0 +1,42 @@
>> +/* { dg-require-effective-target vect_int } */
>> +/* PR tree-optimization/67682.  */
>> +
>> +#include "tree-vect.h"
>> +
>> +int __attribute__((__aligned__(8))) a[8];
>> +int __attribute__((__aligned__(8))) b[8];
>> +
>> +__attribute__ ((noinline)) void
>> +test ()
>> +{
>> +    a[0] = b[0];
>> +    a[1] = b[1] + 1;
>> +    a[2] = b[2] * 2;
>> +    a[3] = b[3] + 3;
>> +
>> +    a[4] = b[4] + 4;
>> +    a[5] = b[5] + 4;
>> +    a[6] = b[6] + 4;
>> +    a[7] = b[7] + 4;
>> +}
>> +
>> +int
>> +main (int argc, char **argv)
>> +{
>> +  check_vect ();
>> +
>> +  for (int i = 0; i < 8; i++)
>> +    b[i] = i + 1;
>> +  __asm__ volatile ("" : : : "memory");
>> +  test (a, b);
>> +  __asm__ volatile ("" : : : "memory");
>> +  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
>> +    abort ();
>> +  for (int i = 4; i < 8; i++)
>> +    if (a[i] != i + 5)
>> +      abort ();
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
>> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>> new file mode 100644
>> index 0000000..13c51f3
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>> @@ -0,0 +1,41 @@
>> +/* { dg-require-effective-target vect_int } */
>> +/* PR tree-optimization/67682.  */
>> +
>> +#include "tree-vect.h"
>> +
>> +int __attribute__((__aligned__(8))) a[8];
>> +int __attribute__((__aligned__(8))) b[4];
>> +
>> +__attribute__ ((noinline)) void
>> +test ()
>> +{
>> +    a[0] = b[2] + 1;
>> +    a[1] = b[0] + 2;
>> +    a[2] = b[1] + 3;
>> +    a[3] = b[1] + 4;
>> +    a[4] = b[3] * 3;
>> +    a[5] = b[0] * 4;
>> +    a[6] = b[2] * 5;
>> +    a[7] = b[1] * 7;
>> +}
>> +
>> +int
>> +main (int argc, char **argv)
>> +{
>> +  check_vect ();
>> +
>> +  for (int i = 0; i < 8; i++)
>> +    a[i] = 1;
>> +  for (int i = 0; i < 4; i++)
>> +    b[i] = i + 4;
>> +  __asm__ volatile ("" : : : "memory");
>> +  test (a, b);
>> +  __asm__ volatile ("" : : : "memory");
>> +  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
>> +      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
>> +    abort ();
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
>> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
>> new file mode 100644
>> index 0000000..6ae9a89
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
>> @@ -0,0 +1,41 @@
>> +/* { dg-require-effective-target vect_int } */
>> +/* PR tree-optimization/67682.  */
>> +
>> +#include "tree-vect.h"
>> +
>> +int __attribute__((__aligned__(8))) a[8];
>> +int __attribute__((__aligned__(8))) b[8];
>> +
>> +__attribute__ ((noinline)) void
>> +test ()
>> +{
>> +    a[0] = b[0] + 1;
>> +    a[1] = b[1] + 2;
>> +    a[2] = b[2] + 3;
>> +    a[3] = b[3] + 4;
>> +    a[4] = b[0] * 3;
>> +    a[5] = b[2] * 4;
>> +    a[6] = b[4] * 5;
>> +    a[7] = b[6] * 7;
>> +}
>> +
>> +int
>> +main (int argc, char **argv)
>> +{
>> +  check_vect ();
>> +
>> +  for (int i = 0; i < 8; i++)
>> +    a[i] = 1;
>> +  for (int i = 0; i < 8; i++)
>> +    b[i] = i + 4;
>> +  __asm__ volatile ("" : : : "memory");
>> +  test (a, b);
>> +  __asm__ volatile ("" : : : "memory");
>> +  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
>> +      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
>> +    abort ();
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
>> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
>> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
>> index cfdfc29..05d4b35 100644
>> --- a/gcc/tree-vect-slp.c
>> +++ b/gcc/tree-vect-slp.c
>> @@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
>>    body_cost_vec.release ();
>>  }
>>
>> +/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
>> +   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
>> +   the first GROUP1_SIZE stmts, since stores are consecutive), the second
>> +   containing the remainder.
>> +   Return the first stmt in the second group.  */
>> +
>> +static gimple *
>> +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
>> +{
>> +  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
>> +  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
>> +  gcc_assert (group1_size > 0);
>> +  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
>> +  gcc_assert (group2_size > 0);
>> +  GROUP_SIZE (first_vinfo) = group1_size;
>> +
>> +  gimple *stmt = first_stmt;
>> +  for (unsigned i = group1_size; i > 1; i--)
>> +    {
>> +      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
>> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
>> +    }
>> +  /* STMT is now the last element of the first group.  */
>> +  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
>> +  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
>> +
>> +  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
>> +  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
>> +    {
>> +      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
>> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
>> +    }
>> +
>> +  /* For the second group, the GROUP_GAP is that before the original group,
>> +     plus skipping over the first vector.  */
>> +  GROUP_GAP (vinfo_for_stmt (group2)) =
>> +    GROUP_GAP (first_vinfo) + group1_size;
>> +
>> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
>> +  GROUP_GAP (first_vinfo) += group2_size;
>> +
>> +  if (dump_enabled_p ())
>> +    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
>> +                    group1_size, group2_size);
>> +
>> +  return group2;
>> +}
>> +
>>  /* Analyze an SLP instance starting from a group of grouped stores.  Call
>>     vect_build_slp_tree to build a tree of packed stmts if possible.
>>     Return FALSE if it's impossible to SLP any stmt in the loop.  */
>> @@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
>>    tree vectype, scalar_type = NULL_TREE;
>>    gimple *next;
>>    unsigned int vectorization_factor = 0;
>> -  int i;
>> +  unsigned int i;
>>    unsigned int max_nunits = 0;
>>    vec<slp_tree> loads;
>>    struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
>> @@ -1811,6 +1859,30 @@ vect_analyze_slp_instance (vec_info *vinfo,
>>    vect_free_slp_tree (node);
>>    loads.release ();
>>
>> +  /* For basic block SLP, try to break the group up into multiples of the
>> +     vectorization factor.  */
>> +  if (is_a <bb_vec_info> (vinfo)
>> +      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
>> +      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
>> +    {
>> +      /* We consider breaking the group only on VF boundaries from the existing
>> +        start.  */
>> +      for (i = 0; i < group_size; i++)
>> +       if (!matches[i]) break;
>> +
>> +      if (i >= vectorization_factor && i < group_size)
>> +       {
>> +         /* Split into two groups at the first vector boundary before i.  */
>> +         gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
>
> Just noticing this... if we have a vectorization factor of 4 and matches
> is 1, 1, 1, 1,  1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 and
> 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?
>
> Thus shouldn't we split either only if i % vectorization_factor is 0 or
> if not, split "twice", dropping the intermediate surely non-matching
> group of vectorization_factor size?  After all if we split like with the
> patch then the upper half will _not_ be splitted again with the
> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).
>
> So I expect that the statistics will be the same if we restrict splitting
> to the i % vectorization_factor == 0 case, or rather split where we do
> now but only re-analyze group2 if i % vectorization_factor == 0 holds?
>
> Ok with that change.  Improvements on that incrementally.

Btw, it just occurs to me that the whole thing is equivalent to splitting
the store-group into vector-size pieces up-front?  That way we do
the maximum splitting up-frond and avoid any redundant work?

The patch is still ok as said, just the above may be a simple thing
to explore.

Thanks,
Richard.

> Thanks,
> Richard.
>
>> +         i &= ~(vectorization_factor - 1);
>> +         gimple *grp2start = vect_split_slp_store_group (stmt, i);
>> +         return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
>> +                | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);
>> +        }
>> +      /* Even though the first vector did not all match, we might be able to SLP
>> +        (some) of the remainder.  FORNOW ignore this possibility.  */
>> +    }
>> +
>>    return false;
>>  }
>>
>> --
>> 1.9.1
>>
diff mbox

Patch

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@  main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@  main1 (unsigned int x, unsigned int y)
 
   /* Check results.  */
   if (out[0] != (in[0] + 23) * x
-      || out[1] != (in[1] + 142) * y
+      || out[1] != (in[1] * 142) * y
       || out[2] != (in[2] + 2) * x
       || out[3] != (in[3] * 31) * y)
     abort();
@@ -47,4 +47,4 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-  
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@ 
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1];
+    a[2] = b[2];
+    a[3] = b[3];
+    a[4] = 0;
+    a[5] = 0;
+    a[6] = 0;
+    a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  for (int i = 0; i < 4; i++)
+    if (a[i] != i+4)
+      abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != 0)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..06099fd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,42 @@ 
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1] + 1;
+    a[2] = b[2] * 2;
+    a[3] = b[3] + 3;
+
+    a[4] = b[4] + 4;
+    a[5] = b[5] + 4;
+    a[6] = b[6] + 4;
+    a[7] = b[7] + 4;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 1;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
+    abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != i + 5)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
new file mode 100644
index 0000000..13c51f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
@@ -0,0 +1,41 @@ 
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[2] + 1;
+    a[1] = b[0] + 2;
+    a[2] = b[1] + 3;
+    a[3] = b[1] + 4;
+    a[4] = b[3] * 3;
+    a[5] = b[0] * 4;
+    a[6] = b[2] * 5;
+    a[7] = b[1] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
+      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
new file mode 100644
index 0000000..6ae9a89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
@@ -0,0 +1,41 @@ 
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0] + 1;
+    a[1] = b[1] + 2;
+    a[2] = b[2] + 3;
+    a[3] = b[3] + 4;
+    a[4] = b[0] * 3;
+    a[5] = b[2] * 4;
+    a[6] = b[4] * 5;
+    a[7] = b[6] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
+      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index cfdfc29..05d4b35 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1606,6 +1606,54 @@  vect_analyze_slp_cost (slp_instance instance, void *data)
   body_cost_vec.release ();
 }
 
+/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
+   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
+   the first GROUP1_SIZE stmts, since stores are consecutive), the second
+   containing the remainder.
+   Return the first stmt in the second group.  */
+
+static gimple *
+vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
+{
+  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
+  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
+  gcc_assert (group1_size > 0);
+  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
+  gcc_assert (group2_size > 0);
+  GROUP_SIZE (first_vinfo) = group1_size;
+
+  gimple *stmt = first_stmt;
+  for (unsigned i = group1_size; i > 1; i--)
+    {
+      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+  /* STMT is now the last element of the first group.  */
+  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
+  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+
+  /* For the second group, the GROUP_GAP is that before the original group,
+     plus skipping over the first vector.  */
+  GROUP_GAP (vinfo_for_stmt (group2)) =
+    GROUP_GAP (first_vinfo) + group1_size;
+
+  /* GROUP_GAP of the first group now has to skip over the second group too.  */
+  GROUP_GAP (first_vinfo) += group2_size;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
+		     group1_size, group2_size);
+
+  return group2;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -1621,7 +1669,7 @@  vect_analyze_slp_instance (vec_info *vinfo,
   tree vectype, scalar_type = NULL_TREE;
   gimple *next;
   unsigned int vectorization_factor = 0;
-  int i;
+  unsigned int i;
   unsigned int max_nunits = 0;
   vec<slp_tree> loads;
   struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1811,6 +1859,30 @@  vect_analyze_slp_instance (vec_info *vinfo,
   vect_free_slp_tree (node);
   loads.release ();
 
+  /* For basic block SLP, try to break the group up into multiples of the
+     vectorization factor.  */
+  if (is_a <bb_vec_info> (vinfo)
+      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
+      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
+    {
+      /* We consider breaking the group only on VF boundaries from the existing
+	 start.  */
+      for (i = 0; i < group_size; i++)
+	if (!matches[i]) break;
+
+      if (i >= vectorization_factor && i < group_size)
+	{
+	  /* Split into two groups at the first vector boundary before i.  */
+	  gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+	  i &= ~(vectorization_factor - 1);
+	  gimple *grp2start = vect_split_slp_store_group (stmt, i);
+	  return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
+		 | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);
+	 }
+      /* Even though the first vector did not all match, we might be able to SLP
+	 (some) of the remainder.  FORNOW ignore this possibility.  */
+    }
+
   return false;
 }