From patchwork Fri Nov 13 16:18:42 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alan Lawrence X-Patchwork-Id: 544362 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 7015814141E for ; Sat, 14 Nov 2015 03:19:11 +1100 (AEDT) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b=xa+MkBgO; dkim-atps=neutral DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:date:message-id:in-reply-to:references :content-type:content-transfer-encoding; q=dns; s=default; b=xa1 gpA0Sl87iouKHsRN5XsuKat9nu/5lTC/XV6zcb/pVEHaaCo4mdKl1T3ofUN1qvVp jukqKrmFnMLEDGtN3YjicbXR2pfFgveRqMd1onV/dBmLpOqlylb8SsYH2PBwzMrZ pLyHdvh9GyG4DeCkpEu918VtebGlrcczxiSlXTJc= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:date:message-id:in-reply-to:references :content-type:content-transfer-encoding; s=default; bh=fYgG6EnwS PWhnJLTzmrsQ9oTuYU=; b=xa+MkBgOYFbmBN5xKyUyOd2/OSvBKZbP9uIU6t90/ FhbOhGhX5X3kSCiF2DqqFJaHGZyWFhmaEzFBybeQrl+ZW+/Oe+W5wyceqDm0dVdt oz/HFGxeBvMTN0ngfkYRZw2fqDVq+aG9i38uKFhj1ncxw8IFe4ntb5LNzv3lUqv0 VM= Received: (qmail 79661 invoked by alias); 13 Nov 2015 16:19:00 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 79645 invoked by uid 89); 13 Nov 2015 16:19:00 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.7 required=5.0 tests=AWL, BAYES_00, SPF_PASS autolearn=ham version=3.3.2 X-HELO: eu-smtp-delivery-143.mimecast.com Received: from eu-smtp-delivery-143.mimecast.com (HELO eu-smtp-delivery-143.mimecast.com) (207.82.80.143) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 13 Nov 2015 16:18:57 +0000 Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.140]) by eu-smtp-1.mimecast.com with ESMTP id uk-mta-16-HNfL1Me3QWWfjfSgu8sjNQ-1; Fri, 13 Nov 2015 16:18:51 +0000 Received: from arm.com ([10.1.2.79]) by cam-owa2.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959); Fri, 13 Nov 2015 16:18:51 +0000 From: Alan Lawrence To: gcc-patches@gcc.gnu.org Cc: richard.guenther@gmail.com Subject: Re: [PATCH] PR/67682, break SLP groups up if only some elements match Date: Fri, 13 Nov 2015 16:18:42 +0000 Message-Id: <1447431522-4695-1-git-send-email-alan.lawrence@arm.com> In-Reply-To: References: X-MC-Unique: HNfL1Me3QWWfjfSgu8sjNQ-1 X-IsSubscribed: yes On 10/11/15 12:51, Richard Biener wrote: >> >> Just noticing this... if we have a vectorization factor of 4 and matches >> is 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 and >> 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail? >> >> Thus shouldn't we split either only if i % vectorization_factor is 0 or >> if not, split "twice", dropping the intermediate surely non-matching >> group of vectorization_factor size? After all if we split like with the >> patch then the upper half will _not_ be splitted again with the >> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again). >> >> So I expect that the statistics will be the same if we restrict splitting >> to the i % vectorization_factor == 0 case, or rather split where we do >> now but only re-analyze group2 if i % vectorization_factor == 0 holds? >> >> Ok with that change. Improvements on that incrementally. > > Btw, it just occurs to me that the whole thing is equivalent to splitting > the store-group into vector-size pieces up-front? That way we do > the maximum splitting up-frond and avoid any redundant work? > > The patch is still ok as said, just the above may be a simple thing > to explore. I'd refrained from splitting in vect_analyze_group_access_1 as my understanding was that we only did that once, whereas we would retry the vect_analyze_slp_instance path each time we decreased the vectorization_factor...however, I did try putting code at the beginning of vect_analyze_slp_instance to split up any groups > vf. Unfortunately this loses us some previously-successful SLPs, as some bigger groups cannot be SLPed if we split them as they require 'unrolling'...so not addressing that here. However your suggestion of splitting twice when we know the boundary is in the middle of a vector is a nice compromise; it nets us a good number more successes in SPEC2000 and SPEC2006, about 7% more than without the patch. Hence, here's the patch I've committed, as r230330, after regstrap on x86_64 and AArch64. (I dropped the previous bb-slp-subgroups-2 and renamed the others up as we don't do that one anymore.) Cheers, Alan gcc/ChangeLog: PR tree-optimization/67682 * tree-vect-slp.c (vect_split_slp_store_group): New. (vect_analyze_slp_instance): During basic block SLP, recurse on subgroups if vect_build_slp_tree fails after 1st vector. gcc/testsuite/ChangeLog: PR tree-optimization/67682 * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic. * gcc.dg/vect/bb-slp-subgroups-1.c: New. * gcc.dg/vect/bb-slp-subgroups-2.c: New. * gcc.dg/vect/bb-slp-subgroups-3.c: New. --- gcc/testsuite/gcc.dg/vect/bb-slp-7.c | 10 +-- gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++ gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 41 +++++++++++++ gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 +++++++++++++ gcc/tree-vect-slp.c | 85 +++++++++++++++++++++++++- 5 files changed, 215 insertions(+), 6 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c index ab54a48..b8bef8c 100644 --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y) unsigned int *pout = &out[0]; unsigned int a0, a1, a2, a3; - /* Non isomorphic. */ + /* Non isomorphic, even 64-bit subgroups. */ a0 = *pin++ + 23; - a1 = *pin++ + 142; + a1 = *pin++ * 142; a2 = *pin++ + 2; a3 = *pin++ * 31; - + *pout++ = a0 * x; *pout++ = a1 * y; *pout++ = a2 * x; @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y) /* Check results. */ if (out[0] != (in[0] + 23) * x - || out[1] != (in[1] + 142) * y + || out[1] != (in[1] * 142) * y || out[2] != (in[2] + 2) * x || out[3] != (in[3] * 31) * y) abort(); @@ -47,4 +47,4 @@ int main (void) } /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */ - + diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c new file mode 100644 index 0000000..39c23c3 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c @@ -0,0 +1,44 @@ +/* { dg-require-effective-target vect_int } */ +/* PR tree-optimization/67682. */ + +#include "tree-vect.h" + +int __attribute__((__aligned__(8))) a[8]; +int __attribute__((__aligned__(8))) b[4]; + +__attribute__ ((noinline)) void +test () +{ + a[0] = b[0]; + a[1] = b[1]; + a[2] = b[2]; + a[3] = b[3]; + a[4] = 0; + a[5] = 0; + a[6] = 0; + a[7] = 0; +} + +int +main (int argc, char **argv) +{ + check_vect (); + + for (int i = 0; i < 8; i++) + a[i] = 1; + for (int i = 0; i < 4; i++) + b[i] = i + 4; + __asm__ volatile ("" : : : "memory"); + test (a, b); + __asm__ volatile ("" : : : "memory"); + for (int i = 0; i < 4; i++) + if (a[i] != i+4) + abort (); + for (int i = 4; i < 8; i++) + if (a[i] != 0) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c new file mode 100644 index 0000000..13c51f3 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c @@ -0,0 +1,41 @@ +/* { dg-require-effective-target vect_int } */ +/* PR tree-optimization/67682. */ + +#include "tree-vect.h" + +int __attribute__((__aligned__(8))) a[8]; +int __attribute__((__aligned__(8))) b[4]; + +__attribute__ ((noinline)) void +test () +{ + a[0] = b[2] + 1; + a[1] = b[0] + 2; + a[2] = b[1] + 3; + a[3] = b[1] + 4; + a[4] = b[3] * 3; + a[5] = b[0] * 4; + a[6] = b[2] * 5; + a[7] = b[1] * 7; +} + +int +main (int argc, char **argv) +{ + check_vect (); + + for (int i = 0; i < 8; i++) + a[i] = 1; + for (int i = 0; i < 4; i++) + b[i] = i + 4; + __asm__ volatile ("" : : : "memory"); + test (a, b); + __asm__ volatile ("" : : : "memory"); + if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9) + || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35)) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c new file mode 100644 index 0000000..6ae9a89 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c @@ -0,0 +1,41 @@ +/* { dg-require-effective-target vect_int } */ +/* PR tree-optimization/67682. */ + +#include "tree-vect.h" + +int __attribute__((__aligned__(8))) a[8]; +int __attribute__((__aligned__(8))) b[8]; + +__attribute__ ((noinline)) void +test () +{ + a[0] = b[0] + 1; + a[1] = b[1] + 2; + a[2] = b[2] + 3; + a[3] = b[3] + 4; + a[4] = b[0] * 3; + a[5] = b[2] * 4; + a[6] = b[4] * 5; + a[7] = b[6] * 7; +} + +int +main (int argc, char **argv) +{ + check_vect (); + + for (int i = 0; i < 8; i++) + a[i] = 1; + for (int i = 0; i < 8; i++) + b[i] = i + 4; + __asm__ volatile ("" : : : "memory"); + test (a, b); + __asm__ volatile ("" : : : "memory"); + if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11) + || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70)) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */ diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c index cfdfc29..65a183f 100644 --- a/gcc/tree-vect-slp.c +++ b/gcc/tree-vect-slp.c @@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data) body_cost_vec.release (); } +/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups: + one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing + the first GROUP1_SIZE stmts, since stores are consecutive), the second + containing the remainder. + Return the first stmt in the second group. */ + +static gimple * +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size) +{ + stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt); + gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt); + gcc_assert (group1_size > 0); + int group2_size = GROUP_SIZE (first_vinfo) - group1_size; + gcc_assert (group2_size > 0); + GROUP_SIZE (first_vinfo) = group1_size; + + gimple *stmt = first_stmt; + for (unsigned i = group1_size; i > 1; i--) + { + stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)); + gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1); + } + /* STMT is now the last element of the first group. */ + gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)); + GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0; + + GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size; + for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt))) + { + GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2; + gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1); + } + + /* For the second group, the GROUP_GAP is that before the original group, + plus skipping over the first vector. */ + GROUP_GAP (vinfo_for_stmt (group2)) = + GROUP_GAP (first_vinfo) + group1_size; + + /* GROUP_GAP of the first group now has to skip over the second group too. */ + GROUP_GAP (first_vinfo) += group2_size; + + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n", + group1_size, group2_size); + + return group2; +} + /* Analyze an SLP instance starting from a group of grouped stores. Call vect_build_slp_tree to build a tree of packed stmts if possible. Return FALSE if it's impossible to SLP any stmt in the loop. */ @@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo, tree vectype, scalar_type = NULL_TREE; gimple *next; unsigned int vectorization_factor = 0; - int i; + unsigned int i; unsigned int max_nunits = 0; vec loads; struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt)); @@ -1811,6 +1859,41 @@ vect_analyze_slp_instance (vec_info *vinfo, vect_free_slp_tree (node); loads.release (); + /* For basic block SLP, try to break the group up into multiples of the + vectorization factor. */ + if (is_a (vinfo) + && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) + && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))) + { + /* We consider breaking the group only on VF boundaries from the existing + start. */ + for (i = 0; i < group_size; i++) + if (!matches[i]) break; + + if (i >= vectorization_factor && i < group_size) + { + /* Split into two groups at the first vector boundary before i. */ + gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0); + unsigned group1_size = i & ~(vectorization_factor - 1); + + gimple *rest = vect_split_slp_store_group (stmt, group1_size); + bool res = vect_analyze_slp_instance (vinfo, stmt, max_tree_size); + /* If the first non-match was in the middle of a vector, + skip the rest of that vector. */ + if (group1_size < i) + { + i = group1_size + vectorization_factor; + if (i < group_size) + rest = vect_split_slp_store_group (rest, vectorization_factor); + } + if (i < group_size) + res |= vect_analyze_slp_instance (vinfo, rest, max_tree_size); + return res; + } + /* Even though the first vector did not all match, we might be able to SLP + (some) of the remainder. FORNOW ignore this possibility. */ + } + return false; }