From patchwork Mon Nov 9 14:59:41 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alan Lawrence X-Patchwork-Id: 541788 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 9523C140D69 for ; Tue, 10 Nov 2015 02:00:08 +1100 (AEDT) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b=rVzruSJA; dkim-atps=neutral DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:date:message-id:in-reply-to:references :content-type:content-transfer-encoding; q=dns; s=default; b=Hft oFDnVKv3cmfGXvSdmCAuUecoIpdyjnUvN8d3AIsfFWxrQ0xeFd6EJ4P+cBbprwxl GFSzLx9m6PXDW9VeVmiT+B+qRp8YA7MA8Xbp0zfaR+PxXEwrJsU4wOykIvZsPm/E +jpnxfXiuM/FdeMJfUiopStKYqrZjeICrBFdnapM= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:date:message-id:in-reply-to:references :content-type:content-transfer-encoding; s=default; bh=sPLZAYy4S CwknGoZlfLMSw0hmPQ=; b=rVzruSJADAGORs71TSCq4vHfNpA9My4zfF7SUi/oQ 6LL9tA9Ja+v2WO4rccZF0r+hXHoZCuQbF0nD2rb0/62Zeb9kVxzxc/zPoIo06HUD vkIDBEoq8zVgTegsGrrd8e1YCQ5BnkoBKJE0IG7DDb0KmFhBD21FXJpmJnztWj2f MM= Received: (qmail 104037 invoked by alias); 9 Nov 2015 14:59:59 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 104026 invoked by uid 89); 9 Nov 2015 14:59:59 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.7 required=5.0 tests=AWL, BAYES_00, SPF_PASS autolearn=ham version=3.3.2 X-HELO: eu-smtp-delivery-143.mimecast.com Received: from eu-smtp-delivery-143.mimecast.com (HELO eu-smtp-delivery-143.mimecast.com) (146.101.78.143) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 09 Nov 2015 14:59:56 +0000 Received: from cam-owa1.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.140]) by eu-smtp-1.mimecast.com with ESMTP id uk-mta-1-Kocj2bS6QWyJoA5kWk1bFQ-1; Mon, 09 Nov 2015 14:59:52 +0000 Received: from arm.com ([10.1.2.79]) by cam-owa1.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959); Mon, 9 Nov 2015 14:59:51 +0000 From: Alan Lawrence To: gcc-patches@gcc.gnu.org Cc: richard.guenther@gmail.com Subject: Re: [PATCH] PR/67682, break SLP groups up if only some elements match Date: Mon, 9 Nov 2015 14:59:41 +0000 Message-Id: <1447081181-803-1-git-send-email-alan.lawrence@arm.com> In-Reply-To: References: X-MC-Unique: Kocj2bS6QWyJoA5kWk1bFQ-1 X-IsSubscribed: yes On 06/11/15 12:55, Richard Biener wrote: > >> + /* GROUP_GAP of the first group now has to skip over the second group too. */ >> + GROUP_GAP (first_vinfo) += group2_size; > > Please add a MSG_NOTE debug printf stating that we split the group and > at which element. Done. > I think you want to add && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)) > otherwise this could be SLP reductions where there is no way the split > group would enable vectorization. Ah, I had thought that the (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))) check sufficed for that, as per e.g. the section above /* Create a node (a root of the SLP tree) for the packed grouped stores. */ But done. > Note that BB vectorization now also very aggressively falls back to considering > non-matches being "external". > > Not sure why that doesn't trigger for your testcases ;) I tested against trunk r229944, on which all of my scan-tree-dump's were failing.... > I'm comfortable with the i < group_size half of the patch. For the other piece > I'd like to see more compile-time / success data from, say, building > SPEC CPU 2006. Well, as a couple of quick data points, a compile of SPEC2000 on aarch64-none-linux-gnu (-Ofast -fomit-frame-pointer -mcpu=cortex-a57), I have: 3080 successes without patch; +79 successes from the "i < vectorization_factor" part of the patch (on its own) +90 successes from the (i>=vectorization_factor) && "i < group_size" part (on its own) +(79 from first) +(90 from second) + (an additional 62) from both parts together. And for SPEC2006, aarch64-linux-gnu (-O3 -fomit-frame-pointer -mcpu=cortex-a57): 11979 successes without patch; + 499 from the "i < vectorization_factor" part + 264 from the (i >= vectorization factor) && (i < group_size)" part + extra 336 if both parts combined. I haven't done any significant measurements of compile-time yet. (snipping this bit out-of-order) > Hmm. This is of course pretty bad for compile-time for the non-matching > case as that effectively will always split into N pieces if we feed it > garbage (that is, without being sure that at least one pice _will_ vectorize). > > OTOH with the current way of computing "matches" we'd restrict ourselves > to cases where the first stmt we look at (and match to) happens to be > the operation that in the end will form a matching group. ... > Eventually we'd want to improve the "matches" return > to include the affected stmts (ok, that'll be not very easy) so we can do > a cheap "if we split here will it fix that particular mismatch" check. Yes, I think there are a bunch of things we can do here, that would be more powerful than the simple approach I used here. The biggest limiting factor will probably be (lack of) permutation, i.e. if we only SLP stores to consecutive addresses. > So, please split the patch and I suggest to leave the controversical part > for next stage1 together with some improvement on the SLP build process > itself? Here's a reduced version with just the second case, bootstrapped+check-gcc/g++ on x86_64. gcc/ChangeLog: * tree-vect-slp.c (vect_split_slp_store_group): New. (vect_analyze_slp_instance): During basic block SLP, recurse on subgroups if vect_build_slp_tree fails after 1st vector. gcc/testsuite/ChangeLog: * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic. * gcc.dg/vect/bb-slp-subgroups-1.c: New. * gcc.dg/vect/bb-slp-subgroups-2.c: New. * gcc.dg/vect/bb-slp-subgroups-3.c: New. * gcc.dg/vect/bb-slp-subgroups-4.c: New. --- gcc/testsuite/gcc.dg/vect/bb-slp-7.c | 10 ++-- gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++++ gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++++ gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++++ gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++++ gcc/tree-vect-slp.c | 74 +++++++++++++++++++++++++- 6 files changed, 246 insertions(+), 6 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c index ab54a48..b8bef8c 100644 --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y) unsigned int *pout = &out[0]; unsigned int a0, a1, a2, a3; - /* Non isomorphic. */ + /* Non isomorphic, even 64-bit subgroups. */ a0 = *pin++ + 23; - a1 = *pin++ + 142; + a1 = *pin++ * 142; a2 = *pin++ + 2; a3 = *pin++ * 31; - + *pout++ = a0 * x; *pout++ = a1 * y; *pout++ = a2 * x; @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y) /* Check results. */ if (out[0] != (in[0] + 23) * x - || out[1] != (in[1] + 142) * y + || out[1] != (in[1] * 142) * y || out[2] != (in[2] + 2) * x || out[3] != (in[3] * 31) * y) abort(); @@ -47,4 +47,4 @@ int main (void) } /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */ - + diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c new file mode 100644 index 0000000..39c23c3 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c @@ -0,0 +1,44 @@ +/* { dg-require-effective-target vect_int } */ +/* PR tree-optimization/67682. */ + +#include "tree-vect.h" + +int __attribute__((__aligned__(8))) a[8]; +int __attribute__((__aligned__(8))) b[4]; + +__attribute__ ((noinline)) void +test () +{ + a[0] = b[0]; + a[1] = b[1]; + a[2] = b[2]; + a[3] = b[3]; + a[4] = 0; + a[5] = 0; + a[6] = 0; + a[7] = 0; +} + +int +main (int argc, char **argv) +{ + check_vect (); + + for (int i = 0; i < 8; i++) + a[i] = 1; + for (int i = 0; i < 4; i++) + b[i] = i + 4; + __asm__ volatile ("" : : : "memory"); + test (a, b); + __asm__ volatile ("" : : : "memory"); + for (int i = 0; i < 4; i++) + if (a[i] != i+4) + abort (); + for (int i = 4; i < 8; i++) + if (a[i] != 0) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c new file mode 100644 index 0000000..06099fd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c @@ -0,0 +1,42 @@ +/* { dg-require-effective-target vect_int } */ +/* PR tree-optimization/67682. */ + +#include "tree-vect.h" + +int __attribute__((__aligned__(8))) a[8]; +int __attribute__((__aligned__(8))) b[8]; + +__attribute__ ((noinline)) void +test () +{ + a[0] = b[0]; + a[1] = b[1] + 1; + a[2] = b[2] * 2; + a[3] = b[3] + 3; + + a[4] = b[4] + 4; + a[5] = b[5] + 4; + a[6] = b[6] + 4; + a[7] = b[7] + 4; +} + +int +main (int argc, char **argv) +{ + check_vect (); + + for (int i = 0; i < 8; i++) + b[i] = i + 1; + __asm__ volatile ("" : : : "memory"); + test (a, b); + __asm__ volatile ("" : : : "memory"); + if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7)) + abort (); + for (int i = 4; i < 8; i++) + if (a[i] != i + 5) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c new file mode 100644 index 0000000..13c51f3 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c @@ -0,0 +1,41 @@ +/* { dg-require-effective-target vect_int } */ +/* PR tree-optimization/67682. */ + +#include "tree-vect.h" + +int __attribute__((__aligned__(8))) a[8]; +int __attribute__((__aligned__(8))) b[4]; + +__attribute__ ((noinline)) void +test () +{ + a[0] = b[2] + 1; + a[1] = b[0] + 2; + a[2] = b[1] + 3; + a[3] = b[1] + 4; + a[4] = b[3] * 3; + a[5] = b[0] * 4; + a[6] = b[2] * 5; + a[7] = b[1] * 7; +} + +int +main (int argc, char **argv) +{ + check_vect (); + + for (int i = 0; i < 8; i++) + a[i] = 1; + for (int i = 0; i < 4; i++) + b[i] = i + 4; + __asm__ volatile ("" : : : "memory"); + test (a, b); + __asm__ volatile ("" : : : "memory"); + if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9) + || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35)) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c new file mode 100644 index 0000000..6ae9a89 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c @@ -0,0 +1,41 @@ +/* { dg-require-effective-target vect_int } */ +/* PR tree-optimization/67682. */ + +#include "tree-vect.h" + +int __attribute__((__aligned__(8))) a[8]; +int __attribute__((__aligned__(8))) b[8]; + +__attribute__ ((noinline)) void +test () +{ + a[0] = b[0] + 1; + a[1] = b[1] + 2; + a[2] = b[2] + 3; + a[3] = b[3] + 4; + a[4] = b[0] * 3; + a[5] = b[2] * 4; + a[6] = b[4] * 5; + a[7] = b[6] * 7; +} + +int +main (int argc, char **argv) +{ + check_vect (); + + for (int i = 0; i < 8; i++) + a[i] = 1; + for (int i = 0; i < 8; i++) + b[i] = i + 4; + __asm__ volatile ("" : : : "memory"); + test (a, b); + __asm__ volatile ("" : : : "memory"); + if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11) + || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70)) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */ diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c index cfdfc29..05d4b35 100644 --- a/gcc/tree-vect-slp.c +++ b/gcc/tree-vect-slp.c @@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data) body_cost_vec.release (); } +/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups: + one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing + the first GROUP1_SIZE stmts, since stores are consecutive), the second + containing the remainder. + Return the first stmt in the second group. */ + +static gimple * +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size) +{ + stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt); + gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt); + gcc_assert (group1_size > 0); + int group2_size = GROUP_SIZE (first_vinfo) - group1_size; + gcc_assert (group2_size > 0); + GROUP_SIZE (first_vinfo) = group1_size; + + gimple *stmt = first_stmt; + for (unsigned i = group1_size; i > 1; i--) + { + stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)); + gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1); + } + /* STMT is now the last element of the first group. */ + gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)); + GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0; + + GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size; + for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt))) + { + GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2; + gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1); + } + + /* For the second group, the GROUP_GAP is that before the original group, + plus skipping over the first vector. */ + GROUP_GAP (vinfo_for_stmt (group2)) = + GROUP_GAP (first_vinfo) + group1_size; + + /* GROUP_GAP of the first group now has to skip over the second group too. */ + GROUP_GAP (first_vinfo) += group2_size; + + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n", + group1_size, group2_size); + + return group2; +} + /* Analyze an SLP instance starting from a group of grouped stores. Call vect_build_slp_tree to build a tree of packed stmts if possible. Return FALSE if it's impossible to SLP any stmt in the loop. */ @@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo, tree vectype, scalar_type = NULL_TREE; gimple *next; unsigned int vectorization_factor = 0; - int i; + unsigned int i; unsigned int max_nunits = 0; vec loads; struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt)); @@ -1811,6 +1859,30 @@ vect_analyze_slp_instance (vec_info *vinfo, vect_free_slp_tree (node); loads.release (); + /* For basic block SLP, try to break the group up into multiples of the + vectorization factor. */ + if (is_a (vinfo) + && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) + && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))) + { + /* We consider breaking the group only on VF boundaries from the existing + start. */ + for (i = 0; i < group_size; i++) + if (!matches[i]) break; + + if (i >= vectorization_factor && i < group_size) + { + /* Split into two groups at the first vector boundary before i. */ + gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0); + i &= ~(vectorization_factor - 1); + gimple *grp2start = vect_split_slp_store_group (stmt, i); + return vect_analyze_slp_instance (vinfo, stmt, max_tree_size) + | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size); + } + /* Even though the first vector did not all match, we might be able to SLP + (some) of the remainder. FORNOW ignore this possibility. */ + } + return false; }