From: Nathan Sidwell
To: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: [ptx] partitioning optimization
Date: Tue, 10 Nov 2015 17:33:58 -0500
Message-ID: <564270D6.6090303@acm.org>

I've committed this patch to trunk.  It implements a partitioning
optimization for a loop partitioned over both the vector and worker axes.
We can elide the inner vector partitioning state propagation if there are
no intervening instructions in the worker-partitioned outer loop other
than the forking and joining; we simply execute the worker propagation on
all vectors.
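To illustrate the shape of loop this targets, here is a minimal OpenACC
sketch (hypothetical, not taken from the patch or the testsuite):

  #include <stdlib.h>

  #define N 1024

  int
  main (void)
  {
    float *a = malloc (N * sizeof (float));

    /* One loop spread over both the worker and vector axes.  The
       worker-partitioned region generated around it contains nothing
       but the vector region's fork and join, which is the case the new
       code merges.  */
  #pragma acc parallel loop worker vector copyout (a[0:N])
    for (int i = 0; i < N; i++)
      a[i] = 2.0f * i;

    free (a);
    return 0;
  }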
I've been unable to introduce a testcase for this.  The difficulty is that
we want to check an RTL dump from the accelerator compiler, and there
doesn't appear to be existing machinery for that in the testsuite.
Perhaps something to be added later?

nathan

2015-11-10  Nathan Sidwell

	* config/nvptx/nvptx.opt (moptimize): New flag.
	* config/nvptx/nvptx.c (nvptx_option_override): Set nvptx_optimize
	default.
	(nvptx_optimize_inner): New.
	(nvptx_process_pars): Call it when optimizing.
	* doc/invoke.texi (Nvidia PTX Options): Document -moptimize.

Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 230112)
+++ config/nvptx/nvptx.c	(working copy)
@@ -137,6 +137,9 @@ nvptx_option_override (void)
   write_symbols = NO_DEBUG;
   debug_info_level = DINFO_LEVEL_NONE;
 
+  if (nvptx_optimize < 0)
+    nvptx_optimize = optimize > 0;
+
   declared_fndecls_htab = hash_table::create_ggc (17);
   needed_fndecls_htab = hash_table::create_ggc (17);
   declared_libfuncs_htab
@@ -2942,6 +2945,69 @@ nvptx_skip_par (unsigned mask, parallel
   nvptx_single (mask, par->forked_block, pre_tail);
 }
 
+/* If PAR has a single inner parallel and PAR itself only contains
+   empty entry and exit blocks, swallow the inner PAR.  */
+
+static void
+nvptx_optimize_inner (parallel *par)
+{
+  parallel *inner = par->inner;
+
+  /* We mustn't be the outer dummy par.  */
+  if (!par->mask)
+    return;
+
+  /* We must have a single inner par.  */
+  if (!inner || inner->next)
+    return;
+
+  /* We must only contain 2 blocks ourselves -- the head and tail of
+     the inner par.  */
+  if (par->blocks.length () != 2)
+    return;
+
+  /* We must be disjoint partitioning.  As we only have vector and
+     worker partitioning, this is sufficient to guarantee the pars
+     have adjacent partitioning.  */
+  if ((par->mask & inner->mask) & (GOMP_DIM_MASK (GOMP_DIM_MAX) - 1))
+    /* This indicates malformed code generation.  */
+    return;
+
+  /* The outer forked insn should be immediately followed by the inner
+     fork insn.  */
+  rtx_insn *forked = par->forked_insn;
+  rtx_insn *fork = BB_END (par->forked_block);
+
+  if (NEXT_INSN (forked) != fork)
+    return;
+  gcc_checking_assert (recog_memoized (fork) == CODE_FOR_nvptx_fork);
+
+  /* The outer joining insn must immediately follow the inner join
+     insn.  */
+  rtx_insn *joining = par->joining_insn;
+  rtx_insn *join = inner->join_insn;
+  if (NEXT_INSN (join) != joining)
+    return;
+
+  /* Preconditions met.  Swallow the inner par.  */
+  if (dump_file)
+    fprintf (dump_file, "Merging loop %x [%d,%d] into %x [%d,%d]\n",
+	     inner->mask, inner->forked_block->index,
+	     inner->join_block->index,
+	     par->mask, par->forked_block->index, par->join_block->index);
+
+  par->mask |= inner->mask & (GOMP_DIM_MASK (GOMP_DIM_MAX) - 1);
+
+  par->blocks.reserve (inner->blocks.length ());
+  while (inner->blocks.length ())
+    par->blocks.quick_push (inner->blocks.pop ());
+
+  par->inner = inner->inner;
+  inner->inner = NULL;
+
+  delete inner;
+}
+
 /* Process the parallel PAR and all its contained parallels.  We do
    everything but the neutering.  Return mask of partitioned modes
    used within this parallel.  */
@@ -2949,6 +3015,9 @@ nvptx_skip_par (unsigned mask, parallel
 static unsigned
 nvptx_process_pars (parallel *par)
 {
+  if (nvptx_optimize)
+    nvptx_optimize_inner (par);
+
   unsigned inner_mask = par->mask;
 
   /* Do the inner parallels first.  */
Index: config/nvptx/nvptx.opt
===================================================================
--- config/nvptx/nvptx.opt	(revision 230112)
+++ config/nvptx/nvptx.opt	(working copy)
@@ -28,3 +28,7 @@ Generate code for a 64-bit ABI.
 mmainkernel
 Target Report RejectNegative
 Link in code for a __main kernel.
+
+moptimize
+Target Report Var(nvptx_optimize) Init(-1)
+Optimize partition neutering
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 230112)
+++ doc/invoke.texi	(working copy)
@@ -873,7 +873,7 @@ Objective-C and Objective-C++ Dialects}.
 -march=@var{arch} -mbmx -mno-bmx -mcdx -mno-cdx}
 
 @emph{Nvidia PTX Options}
-@gccoptlist{-m32 -m64 -mmainkernel}
+@gccoptlist{-m32 -m64 -mmainkernel -moptimize}
 
 @emph{PDP-11 Options}
 @gccoptlist{-mfpu -msoft-float -mac0 -mno-ac0 -m40 -m45 -m10 @gol
@@ -18960,6 +18960,11 @@ Generate code for 32-bit or 64-bit ABI.
 Link in code for a __main kernel.  This is for stand-alone instead of
 offloading execution.
 
+@item -moptimize
+@opindex moptimize
+Apply partitioned execution optimizations.  This is the default when any
+level of optimization is selected.
+
 @end table
 
 @node PDP-11 Options
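A usage note, not part of the patch: -moptimize is an option of the nvptx
back end, so from a host compilation it would be reached through the
offloading compiler, for example something like

  gcc -fopenacc -O2 -foffload=nvptx-none=-mno-optimize test.c

to turn the merging off while still optimizing; otherwise the Init(-1)
default simply follows the -O level.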