diff mbox

PR/67682, break SLP groups up if only some elements match

Message ID 1446730427-13428-1-git-send-email-alan.lawrence@arm.com
State New
Headers show

Commit Message

Alan Lawrence Nov. 5, 2015, 1:33 p.m. UTC
On 03/11/15 13:39, Richard Biener wrote:
> On Tue, Oct 27, 2015 at 6:38 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>>
>> Say I...P are consecutive, the input would have gaps 0 1 1 1 1 1 1 1. If we
>> split the load group, we would want subgroups with gaps 0 1 1 1 and 0 1 1 1?
>
> As said on IRC it should be 4 1 1 1 and 4 1 1 1.

Right. And so, if we have a twelve-element group (0 1 1 1 1 1 1 1 1 1 1 1), by
the time it became three subgroups, these should each be (8 1 1 1), via an
intermediate stage of (4 1 1 1 1 1 1 1) (8 1 1 1). This leads to the code in
the attached patch.

> No, I don't think we can split load groups that way.  So I think if
> splitting store
> groups works well (with having larger load groups) then that's the way to go
> (even for loop vect).

Well, slp-11a.c still fails if I enable the splitting for non-BB SLP; I was
thinking this was because I needed to split the load groups too, but maybe not
- maybe this is a separate bug/issue with hybrid SLP. Whatever the reason, I
still think splitting groups in hybrid SLP is another patch. (Do we really want
to put off handling the basic-block case until it works for hybrid SLP as well?
IMHO I would think not.) It sounds as if the approach of restricting splitting
to store groups with appropriate asserts GROUP_GAP == 1 is thus the right thing
to do in the longer term too (hence, renamed vect_split_slp_store_group to
emphasize that) - at least until we remove that restriction on SLP generally.

Bootstrapped + check-{gcc,g++,gfortran} on x86_64, AArch64, ARM.

Re. the extra skipping loop, I think it would theoretically be possible for the
recursive call to vect_slp_analyze to succeed on an element where the original
failed, because it may have more num_permutes remaining (after skipping over
the first vector). So there's a second argument (besides code complexity) for
dropping that part....?

gcc/ChangeLog:

	* tree-vect-slp.c (vect_split_slp_store_group): New.
	(vect_analyze_slp_instance): Recurse on subgroup(s) if
	vect_build_slp_tree fails during basic block SLP.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
	* gcc.dg/vect/bb-slp-subgroups-1.c: New.
	* gcc.dg/vect/bb-slp-subgroups-2.c: New.
	* gcc.dg/vect/bb-slp-subgroups-3.c: New.
	* gcc.dg/vect/bb-slp-subgroups-4.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 +--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++
 gcc/tree-vect-slp.c                            | 87 +++++++++++++++++++++++++-
 6 files changed, 259 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c

Comments

Richard Biener Nov. 6, 2015, 12:55 p.m. UTC | #1
On Thu, Nov 5, 2015 at 2:33 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> On 03/11/15 13:39, Richard Biener wrote:
>> On Tue, Oct 27, 2015 at 6:38 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>>>
>>> Say I...P are consecutive, the input would have gaps 0 1 1 1 1 1 1 1. If we
>>> split the load group, we would want subgroups with gaps 0 1 1 1 and 0 1 1 1?
>>
>> As said on IRC it should be 4 1 1 1 and 4 1 1 1.
>
> Right. And so, if we have a twelve-element group (0 1 1 1 1 1 1 1 1 1 1 1), by
> the time it became three subgroups, these should each be (8 1 1 1), via an
> intermediate stage of (4 1 1 1 1 1 1 1) (8 1 1 1). This leads to the code in
> the attached patch.
>
>> No, I don't think we can split load groups that way.  So I think if
>> splitting store
>> groups works well (with having larger load groups) then that's the way to go
>> (even for loop vect).
>
> Well, slp-11a.c still fails if I enable the splitting for non-BB SLP; I was
> thinking this was because I needed to split the load groups too, but maybe not
> - maybe this is a separate bug/issue with hybrid SLP. Whatever the reason, I
> still think splitting groups in hybrid SLP is another patch. (Do we really want
> to put off handling the basic-block case until it works for hybrid SLP as well?
> IMHO I would think not.)

No.

> It sounds as if the approach of restricting splitting
> to store groups with appropriate asserts GROUP_GAP == 1 is thus the right thing
> to do in the longer term too (hence, renamed vect_split_slp_store_group to
> emphasize that) - at least until we remove that restriction on SLP generally.
>
> Bootstrapped + check-{gcc,g++,gfortran} on x86_64, AArch64, ARM.
>
> Re. the extra skipping loop, I think it would theoretically be possible for the
> recursive call to vect_slp_analyze to succeed on an element where the original
> failed, because it may have more num_permutes remaining (after skipping over
> the first vector). So there's a second argument (besides code complexity) for
> dropping that part....?
>
> gcc/ChangeLog:
>
>         * tree-vect-slp.c (vect_split_slp_store_group): New.
>         (vect_analyze_slp_instance): Recurse on subgroup(s) if
>         vect_build_slp_tree fails during basic block SLP.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
>         * gcc.dg/vect/bb-slp-subgroups-1.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-2.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-3.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-4.c: New.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 +--
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++
>  gcc/tree-vect-slp.c                            | 87 +++++++++++++++++++++++++-
>  6 files changed, 259 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> index ab54a48..b8bef8c 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
>    unsigned int *pout = &out[0];
>    unsigned int a0, a1, a2, a3;
>
> -  /* Non isomorphic.  */
> +  /* Non isomorphic, even 64-bit subgroups.  */
>    a0 = *pin++ + 23;
> -  a1 = *pin++ + 142;
> +  a1 = *pin++ * 142;
>    a2 = *pin++ + 2;
>    a3 = *pin++ * 31;
> -
> +
>    *pout++ = a0 * x;
>    *pout++ = a1 * y;
>    *pout++ = a2 * x;
> @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
>
>    /* Check results.  */
>    if (out[0] != (in[0] + 23) * x
> -      || out[1] != (in[1] + 142) * y
> +      || out[1] != (in[1] * 142) * y
>        || out[2] != (in[2] + 2) * x
>        || out[3] != (in[3] * 31) * y)
>      abort();
> @@ -47,4 +47,4 @@ int main (void)
>  }
>
>  /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
> -
> +
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> new file mode 100644
> index 0000000..39c23c3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> @@ -0,0 +1,44 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0];
> +    a[1] = b[1];
> +    a[2] = b[2];
> +    a[3] = b[3];
> +    a[4] = 0;
> +    a[5] = 0;
> +    a[6] = 0;
> +    a[7] = 0;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 4; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  for (int i = 0; i < 4; i++)
> +    if (a[i] != i+4)
> +      abort ();
> +  for (int i = 4; i < 8; i++)
> +    if (a[i] != 0)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> new file mode 100644
> index 0000000..06099fd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> @@ -0,0 +1,42 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[8];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0];
> +    a[1] = b[1] + 1;
> +    a[2] = b[2] * 2;
> +    a[3] = b[3] + 3;
> +
> +    a[4] = b[4] + 4;
> +    a[5] = b[5] + 4;
> +    a[6] = b[6] + 4;
> +    a[7] = b[7] + 4;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    b[i] = i + 1;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
> +    abort ();
> +  for (int i = 4; i < 8; i++)
> +    if (a[i] != i + 5)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> new file mode 100644
> index 0000000..13c51f3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[2] + 1;
> +    a[1] = b[0] + 2;
> +    a[2] = b[1] + 3;
> +    a[3] = b[1] + 4;
> +    a[4] = b[3] * 3;
> +    a[5] = b[0] * 4;
> +    a[6] = b[2] * 5;
> +    a[7] = b[1] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 4; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
> +      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
> +    abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
> new file mode 100644
> index 0000000..6ae9a89
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[8];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0] + 1;
> +    a[1] = b[1] + 2;
> +    a[2] = b[2] + 3;
> +    a[3] = b[3] + 4;
> +    a[4] = b[0] * 3;
> +    a[5] = b[2] * 4;
> +    a[6] = b[4] * 5;
> +    a[7] = b[6] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 8; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
> +      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
> +    abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index 1424123..c01632a 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1641,6 +1641,50 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
>    body_cost_vec.release ();
>  }
>
> +/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
> +   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
> +   the first GROUP1_SIZE stmts, since stores are consecutive), the second
> +   containing the remainder.
> +   Return the first stmt in the second group.  */
> +
> +static gimple *
> +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
> +{
> +  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
> +  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
> +  gcc_assert (group1_size > 0);
> +  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
> +  gcc_assert (group2_size > 0);
> +  GROUP_SIZE (first_vinfo) = group1_size;
> +
> +  gimple *stmt = first_stmt;
> +  for (unsigned i = group1_size; i > 1; i--)
> +    {
> +      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> +    }
> +  /* STMT is now the last element of the first group.  */
> +  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
> +
> +  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
> +  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
> +    {
> +      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> +    }
> +
> +  /* For the second group, the GROUP_GAP is that before the original group,
> +     plus skipping over the first vector.  */
> +  GROUP_GAP (vinfo_for_stmt (group2)) =
> +    GROUP_GAP (first_vinfo) + group1_size;
> +
> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
> +  GROUP_GAP (first_vinfo) += group2_size;

Please add a MSG_NOTE debug printf stating that we split the group and
at which element.

> +  return group2;
> +}
> +
>  /* Analyze an SLP instance starting from a group of grouped stores.  Call
>     vect_build_slp_tree to build a tree of packed stmts if possible.
>     Return FALSE if it's impossible to SLP any stmt in the loop.  */
> @@ -1656,7 +1700,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    tree vectype, scalar_type = NULL_TREE;
>    gimple *next;
>    unsigned int vectorization_factor = 0;
> -  int i;
> +  unsigned int i;
>    unsigned int max_nunits = 0;
>    vec<slp_tree> loads;
>    struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
> @@ -1846,6 +1890,47 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    vect_free_slp_tree (node);
>    loads.release ();
>
> +  /* For basic block vectorization, try to break the group up into multiples
> +     of the vectorization factor.  */
> +  if (is_a <bb_vec_info> (vinfo) && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))

I think you want to add && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))
otherwise this could be SLP reductions where there is no way the split
group would
enable vectorization.

> +    {
> +      /* We consider breaking the group only on VF boundaries from the existing
> +        start.  */
> +      for (i = 0; i < group_size; i++)
> +       if (!matches[i]) break;
> +
> +      if (i < vectorization_factor)
> +       {
> +         /* First vector is a mix of (non-/)matches, or first element was
> +            impossible for another reason.  Skip the first vector, and look
> +            for a vector's worth of consecutive (non-/)matches.  */
> +         i = vectorization_factor;
> +         while (i < group_size)
> +           {
> +             for (unsigned j = i + 1; matches[j] == matches[i]; j++)
> +               if (j == (i + vectorization_factor - 1))
> +                 {
> +                   /* Either all in that vector match, or none do.
> +                      Retry SLP from that vector onwards.  */

Hmm.  This is of course pretty bad for compile-time for the non-matching
case as that effectively will always split into N pieces if we feed it
garbage (that is, without being sure that at least one pice _will_ vectorize).

OTOH with the current way of computing "matches" we'd restrict ourselves
to cases where the first stmt we look at (and match to) happens to be
the operation that in the end will form a matching group.

> +                   stmt = vect_split_slp_store_group (stmt, i);
> +                   return vect_analyze_slp_instance (vinfo,
> +                                                     stmt, max_tree_size);
> +                 }
> +             /* Vector contains both matches and non-matches.  Skip over.  */
> +             i += vectorization_factor;
> +           }
> +       }
> +      else if (i < group_size)
> +       {
> +         /* Split into two groups at the first vector boundary before i.  */
> +         gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
> +         i &= ~(vectorization_factor - 1);
> +         gimple *grp2start = vect_split_slp_store_group (stmt, i);
> +         return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
> +                | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);

I think this part is more obvious, at least one vectorization_factor size
group at the beginning "matches" and we're iteratively splitting the 2nd "half".

Note that BB vectorization now also very aggressively falls back to considering
non-matches being "external".

Not sure why that doesn't trigger for your testcases ;)

I'm comfortable with the i < group_size half of the patch.  For the other piece
I'd like to see more compile-time / success data from, say, building
SPEC CPU 2006.  Eventually we'd want to improve the "matches" return
to include the affected stmts (ok, that'll be not very easy) so we can do
a cheap "if we split here will it fix that particular mismatch" check.

Eventually a the SLP tree build can maintain a "hierarchy" of 'matches'
itself, that is, make matches[] an array of ints refering to the element it
matches to (or itself if none).  That will make the non-matching case a bit
more expensive during the SLP tree build of course.

So, please split the patch and I suggest to leave the controversical part
for next stage1 together with some improvement on the SLP build process
itself?

Richard.

> +        }
> +    }
> +
>    return false;
>  }
>
> --
> 1.9.1
>
diff mbox

Patch

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@  main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@  main1 (unsigned int x, unsigned int y)
 
   /* Check results.  */
   if (out[0] != (in[0] + 23) * x
-      || out[1] != (in[1] + 142) * y
+      || out[1] != (in[1] * 142) * y
       || out[2] != (in[2] + 2) * x
       || out[3] != (in[3] * 31) * y)
     abort();
@@ -47,4 +47,4 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-  
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@ 
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1];
+    a[2] = b[2];
+    a[3] = b[3];
+    a[4] = 0;
+    a[5] = 0;
+    a[6] = 0;
+    a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  for (int i = 0; i < 4; i++)
+    if (a[i] != i+4)
+      abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != 0)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..06099fd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,42 @@ 
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1] + 1;
+    a[2] = b[2] * 2;
+    a[3] = b[3] + 3;
+
+    a[4] = b[4] + 4;
+    a[5] = b[5] + 4;
+    a[6] = b[6] + 4;
+    a[7] = b[7] + 4;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 1;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
+    abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != i + 5)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
new file mode 100644
index 0000000..13c51f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
@@ -0,0 +1,41 @@ 
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[2] + 1;
+    a[1] = b[0] + 2;
+    a[2] = b[1] + 3;
+    a[3] = b[1] + 4;
+    a[4] = b[3] * 3;
+    a[5] = b[0] * 4;
+    a[6] = b[2] * 5;
+    a[7] = b[1] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
+      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
new file mode 100644
index 0000000..6ae9a89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
@@ -0,0 +1,41 @@ 
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0] + 1;
+    a[1] = b[1] + 2;
+    a[2] = b[2] + 3;
+    a[3] = b[3] + 4;
+    a[4] = b[0] * 3;
+    a[5] = b[2] * 4;
+    a[6] = b[4] * 5;
+    a[7] = b[6] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
+      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 1424123..c01632a 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1641,6 +1641,50 @@  vect_analyze_slp_cost (slp_instance instance, void *data)
   body_cost_vec.release ();
 }
 
+/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
+   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
+   the first GROUP1_SIZE stmts, since stores are consecutive), the second
+   containing the remainder.
+   Return the first stmt in the second group.  */
+
+static gimple *
+vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
+{
+  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
+  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
+  gcc_assert (group1_size > 0);
+  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
+  gcc_assert (group2_size > 0);
+  GROUP_SIZE (first_vinfo) = group1_size;
+
+  gimple *stmt = first_stmt;
+  for (unsigned i = group1_size; i > 1; i--)
+    {
+      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+  /* STMT is now the last element of the first group.  */
+  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
+  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+
+  /* For the second group, the GROUP_GAP is that before the original group,
+     plus skipping over the first vector.  */
+  GROUP_GAP (vinfo_for_stmt (group2)) =
+    GROUP_GAP (first_vinfo) + group1_size;
+
+  /* GROUP_GAP of the first group now has to skip over the second group too.  */
+  GROUP_GAP (first_vinfo) += group2_size;
+
+  return group2;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -1656,7 +1700,7 @@  vect_analyze_slp_instance (vec_info *vinfo,
   tree vectype, scalar_type = NULL_TREE;
   gimple *next;
   unsigned int vectorization_factor = 0;
-  int i;
+  unsigned int i;
   unsigned int max_nunits = 0;
   vec<slp_tree> loads;
   struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1846,6 +1890,47 @@  vect_analyze_slp_instance (vec_info *vinfo,
   vect_free_slp_tree (node);
   loads.release ();
 
+  /* For basic block vectorization, try to break the group up into multiples
+     of the vectorization factor.  */
+  if (is_a <bb_vec_info> (vinfo) && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      /* We consider breaking the group only on VF boundaries from the existing
+	 start.  */
+      for (i = 0; i < group_size; i++)
+	if (!matches[i]) break;
+
+      if (i < vectorization_factor)
+	{
+	  /* First vector is a mix of (non-/)matches, or first element was
+	     impossible for another reason.  Skip the first vector, and look
+	     for a vector's worth of consecutive (non-/)matches.  */
+	  i = vectorization_factor;
+	  while (i < group_size)
+	    {
+	      for (unsigned j = i + 1; matches[j] == matches[i]; j++)
+		if (j == (i + vectorization_factor - 1))
+		  {
+		    /* Either all in that vector match, or none do.
+		       Retry SLP from that vector onwards.  */
+		    stmt = vect_split_slp_store_group (stmt, i);
+		    return vect_analyze_slp_instance (vinfo,
+						      stmt, max_tree_size);
+		  }
+	      /* Vector contains both matches and non-matches.  Skip over.  */
+	      i += vectorization_factor;
+	    }
+	}
+      else if (i < group_size)
+	{
+	  /* Split into two groups at the first vector boundary before i.  */
+	  gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+	  i &= ~(vectorization_factor - 1);
+	  gimple *grp2start = vect_split_slp_store_group (stmt, i);
+	  return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
+		 | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);
+	 }
+    }
+
   return false;
 }