From patchwork Fri Nov 13 16:18:42 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alan Lawrence <alan.lawrence@arm.com>
X-Patchwork-Id: 544362
Return-Path: 
 <gcc-patches-return-414019-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 7015814141E
	for <incoming@patchwork.ozlabs.org>;
	Sat, 14 Nov 2015 03:19:11 +1100 (AEDT)
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org
	header.b=xa+MkBgO; dkim-atps=neutral
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender:from
	:to:cc:subject:date:message-id:in-reply-to:references
	:content-type:content-transfer-encoding; q=dns; s=default; b=xa1
	gpA0Sl87iouKHsRN5XsuKat9nu/5lTC/XV6zcb/pVEHaaCo4mdKl1T3ofUN1qvVp
	jukqKrmFnMLEDGtN3YjicbXR2pfFgveRqMd1onV/dBmLpOqlylb8SsYH2PBwzMrZ
	pLyHdvh9GyG4DeCkpEu918VtebGlrcczxiSlXTJc=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender:from
	:to:cc:subject:date:message-id:in-reply-to:references
	:content-type:content-transfer-encoding; s=default; bh=fYgG6EnwS
	PWhnJLTzmrsQ9oTuYU=; b=xa+MkBgOYFbmBN5xKyUyOd2/OSvBKZbP9uIU6t90/
	FhbOhGhX5X3kSCiF2DqqFJaHGZyWFhmaEzFBybeQrl+ZW+/Oe+W5wyceqDm0dVdt
	oz/HFGxeBvMTN0ngfkYRZw2fqDVq+aG9i38uKFhj1ncxw8IFe4ntb5LNzv3lUqv0
	VM=
Received: (qmail 79661 invoked by alias); 13 Nov 2015 16:19:00 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 79645 invoked by uid 89); 13 Nov 2015 16:19:00 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.7 required=5.0 tests=AWL, BAYES_00,
	SPF_PASS autolearn=ham version=3.3.2
X-HELO: eu-smtp-delivery-143.mimecast.com
Received: from eu-smtp-delivery-143.mimecast.com (HELO
	eu-smtp-delivery-143.mimecast.com) (207.82.80.143) by
	sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP;
	Fri, 13 Nov 2015 16:18:57 +0000
Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com
	[217.140.96.140]) by eu-smtp-1.mimecast.com with ESMTP id
	uk-mta-16-HNfL1Me3QWWfjfSgu8sjNQ-1; Fri, 13 Nov 2015 16:18:51 +0000
Received: from arm.com ([10.1.2.79]) by cam-owa2.Emea.Arm.com with Microsoft
	SMTPSVC(6.0.3790.3959); Fri, 13 Nov 2015 16:18:51 +0000
From: Alan Lawrence <alan.lawrence@arm.com>
To: gcc-patches@gcc.gnu.org
Cc: richard.guenther@gmail.com
Subject: Re: [PATCH] PR/67682,
	break SLP groups up if only some elements match
Date: Fri, 13 Nov 2015 16:18:42 +0000
Message-Id: <1447431522-4695-1-git-send-email-alan.lawrence@arm.com>
In-Reply-To: 
 <CAFiYyc3GXFmREtzjLP+m1LBju2okEEzDwcX6wHn2xcuSdrh4wg@mail.gmail.com>
References: 
 <CAFiYyc3GXFmREtzjLP+m1LBju2okEEzDwcX6wHn2xcuSdrh4wg@mail.gmail.com>
X-MC-Unique: HNfL1Me3QWWfjfSgu8sjNQ-1
X-IsSubscribed: yes

On 10/11/15 12:51, Richard Biener wrote:
>>
>> Just noticing this... if we have a vectorization factor of 4 and matches
>> is 1, 1, 1, 1,  1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 and
>> 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?
>>
>> Thus shouldn't we split either only if i % vectorization_factor is 0 or
>> if not, split "twice", dropping the intermediate surely non-matching
>> group of vectorization_factor size?  After all if we split like with the
>> patch then the upper half will _not_ be splitted again with the
>> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).
>>
>> So I expect that the statistics will be the same if we restrict splitting
>> to the i % vectorization_factor == 0 case, or rather split where we do
>> now but only re-analyze group2 if i % vectorization_factor == 0 holds?
>>
>> Ok with that change.  Improvements on that incrementally.
>
> Btw, it just occurs to me that the whole thing is equivalent to splitting
> the store-group into vector-size pieces up-front?  That way we do
> the maximum splitting up-frond and avoid any redundant work?
>
> The patch is still ok as said, just the above may be a simple thing
> to explore.

I'd refrained from splitting in vect_analyze_group_access_1 as my understanding
was that we only did that once, whereas we would retry the
vect_analyze_slp_instance path each time we decreased the
vectorization_factor...however, I did try putting code at the beginning of
vect_analyze_slp_instance to split up any groups > vf. Unfortunately this loses
us some previously-successful SLPs, as some bigger groups cannot be SLPed if we
split them as they require 'unrolling'...so not addressing that here.

However your suggestion of splitting twice when we know the boundary is in the
middle of a vector is a nice compromise; it nets us a good number more
successes in SPEC2000 and SPEC2006, about 7% more than without the patch.

Hence, here's the patch I've committed, as r230330, after regstrap on x86_64
and AArch64. (I dropped the previous bb-slp-subgroups-2 and renamed the others
up as we don't do that one anymore.)

Cheers, Alan

gcc/ChangeLog:

	PR tree-optimization/67682
	* tree-vect-slp.c (vect_split_slp_store_group): New.
	(vect_analyze_slp_instance): During basic block SLP, recurse on
	subgroups if vect_build_slp_tree fails after 1st vector.

gcc/testsuite/ChangeLog:

	PR tree-optimization/67682
	* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
	* gcc.dg/vect/bb-slp-subgroups-1.c: New.
	* gcc.dg/vect/bb-slp-subgroups-2.c: New.
	* gcc.dg/vect/bb-slp-subgroups-3.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 +--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 41 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 +++++++++++++
 gcc/tree-vect-slp.c                            | 85 +++++++++++++++++++++++++-
 5 files changed, 215 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
 
   /* Check results.  */
   if (out[0] != (in[0] + 23) * x
-      || out[1] != (in[1] + 142) * y
+      || out[1] != (in[1] * 142) * y
       || out[2] != (in[2] + 2) * x
       || out[3] != (in[3] * 31) * y)
     abort();
@@ -47,4 +47,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-  
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1];
+    a[2] = b[2];
+    a[3] = b[3];
+    a[4] = 0;
+    a[5] = 0;
+    a[6] = 0;
+    a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  for (int i = 0; i < 4; i++)
+    if (a[i] != i+4)
+      abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != 0)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..13c51f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[2] + 1;
+    a[1] = b[0] + 2;
+    a[2] = b[1] + 3;
+    a[3] = b[1] + 4;
+    a[4] = b[3] * 3;
+    a[5] = b[0] * 4;
+    a[6] = b[2] * 5;
+    a[7] = b[1] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
+      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
new file mode 100644
index 0000000..6ae9a89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0] + 1;
+    a[1] = b[1] + 2;
+    a[2] = b[2] + 3;
+    a[3] = b[3] + 4;
+    a[4] = b[0] * 3;
+    a[5] = b[2] * 4;
+    a[6] = b[4] * 5;
+    a[7] = b[6] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
+      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index cfdfc29..65a183f 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
   body_cost_vec.release ();
 }
 
+/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
+   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
+   the first GROUP1_SIZE stmts, since stores are consecutive), the second
+   containing the remainder.
+   Return the first stmt in the second group.  */
+
+static gimple *
+vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
+{
+  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
+  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
+  gcc_assert (group1_size > 0);
+  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
+  gcc_assert (group2_size > 0);
+  GROUP_SIZE (first_vinfo) = group1_size;
+
+  gimple *stmt = first_stmt;
+  for (unsigned i = group1_size; i > 1; i--)
+    {
+      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+  /* STMT is now the last element of the first group.  */
+  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
+  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+
+  /* For the second group, the GROUP_GAP is that before the original group,
+     plus skipping over the first vector.  */
+  GROUP_GAP (vinfo_for_stmt (group2)) =
+    GROUP_GAP (first_vinfo) + group1_size;
+
+  /* GROUP_GAP of the first group now has to skip over the second group too.  */
+  GROUP_GAP (first_vinfo) += group2_size;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
+		     group1_size, group2_size);
+
+  return group2;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
   tree vectype, scalar_type = NULL_TREE;
   gimple *next;
   unsigned int vectorization_factor = 0;
-  int i;
+  unsigned int i;
   unsigned int max_nunits = 0;
   vec<slp_tree> loads;
   struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1811,6 +1859,41 @@ vect_analyze_slp_instance (vec_info *vinfo,
   vect_free_slp_tree (node);
   loads.release ();
 
+  /* For basic block SLP, try to break the group up into multiples of the
+     vectorization factor.  */
+  if (is_a <bb_vec_info> (vinfo)
+      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
+      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
+    {
+      /* We consider breaking the group only on VF boundaries from the existing
+	 start.  */
+      for (i = 0; i < group_size; i++)
+	if (!matches[i]) break;
+
+      if (i >= vectorization_factor && i < group_size)
+	{
+	  /* Split into two groups at the first vector boundary before i.  */
+	  gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+	  unsigned group1_size = i & ~(vectorization_factor - 1);
+
+	  gimple *rest = vect_split_slp_store_group (stmt, group1_size);
+	  bool res = vect_analyze_slp_instance (vinfo, stmt, max_tree_size);
+	  /* If the first non-match was in the middle of a vector,
+	     skip the rest of that vector.  */
+	  if (group1_size < i)
+	    {
+	      i = group1_size + vectorization_factor;
+	      if (i < group_size)
+		rest = vect_split_slp_store_group (rest, vectorization_factor);
+	    }
+	  if (i < group_size)
+	    res |= vect_analyze_slp_instance (vinfo, rest, max_tree_size);
+	  return res;
+	}
+      /* Even though the first vector did not all match, we might be able to SLP
+	 (some) of the remainder.  FORNOW ignore this possibility.  */
+    }
+
   return false;
 }