From patchwork Sat Apr 21 09:59:22 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tom de Vries X-Patchwork-Id: 902359 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=gcc-patches-return-476696-incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=mentor.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b="hFNmuZN/"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 40Sp9V2Rksz9s1s for ; Sat, 21 Apr 2018 19:59:40 +1000 (AEST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:to :from:subject:message-id:date:mime-version:content-type; q=dns; s=default; b=LMpkMyhe6jZFCLWQ+uukOqNRRTO5yAEppa0oHUttR8BMgjOYj+ +AyIhCXxhWtMlzO4yMK+IQgmlpyAYYVdiPlfqJ04NWVbNsPHzpCn9MfMuQG9Bx1s D4vkl6dS1Dd/Uw4ZHoRsfZzgZxbpQ6YDta/AsINm6Ecz2BpXXZJozeOII= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:to :from:subject:message-id:date:mime-version:content-type; s= default; bh=Feekx3wypiDrUmQVGZ06V9UTydU=; b=hFNmuZN/9eXW5dYt4Yts Z/68DjlFrbr9x41RAj923owGLG6IYZk23JgLzO1zxn18n/PceiDILPCwOTEdeHEa ++CI6hsvUDr4yS03EWZhxy27CJylp7LuECCBtw/s8ooopHXyw5pZbkoNFjniCR6o INoBlZDgHAzssTMHd1bzK28= Received: (qmail 69708 invoked by alias); 21 Apr 2018 09:59:32 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 69283 invoked by uid 89); 21 Apr 2018 09:59:31 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-24.9 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, RCVD_IN_DNSWL_NONE, SPF_PASS, URIBL_RED autolearn=ham version=3.3.2 spammy=Todo, barriers X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Sat, 21 Apr 2018 09:59:29 +0000 Received: from nat-ies.mentorg.com ([192.94.31.2] helo=svr-ies-mbx-01.mgc.mentorg.com) by relay1.mentorg.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-SHA384:256) id 1f9pIk-0002A6-Ql from Tom_deVries@mentor.com for gcc-patches@gcc.gnu.org; Sat, 21 Apr 2018 02:59:27 -0700 Received: from [172.30.72.29] (137.202.0.87) by svr-ies-mbx-01.mgc.mentorg.com (139.181.222.1) with Microsoft SMTP Server (TLS) id 15.0.1320.4; Sat, 21 Apr 2018 10:59:23 +0100 To: GCC Patches From: Tom de Vries Subject: [og7, nvptx, openacc, PR85381, committed] Don't emit barriers for empty loops Message-ID: Date: Sat, 21 Apr 2018 11:59:22 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0 MIME-Version: 1.0 X-ClientProxiedBy: svr-ies-mbx-01.mgc.mentorg.com (139.181.222.1) To svr-ies-mbx-01.mgc.mentorg.com (139.181.222.1) Hi, when compiling this testcase with the og7 branch: ... int main (void) { long long v1; #pragma acc parallel num_gangs (640) num_workers(1) vector_length (128) #pragma acc loop for (v1 = 0; v1 < 20; v1 += 2) ; return 0; } ... this ptx is generated: ... { // fork 4; bar.sync 0; // forked 4; // joining 4; bar.sync 0; // join 4; ret; } ... This triggers some bug on my quadro m1200 (I'm assuming in the ptxas/JIT compiler) that hangs the testcase. I can work around this by adding a membar.cta before the bar.syc, or two membar.ctas inbetween, but I'm not really sure what a minimal workaround should look like (I reported the bug to nvidia, I'm hoping for them to answer that question). This patch works around the bug by doing an optimization: we detect that this is an empty loop (a forked immediately followed by a joining), and don't emit the barriers. Build x86_64 with nvptx accelerator and tested libgomp. Committed to og7 branch. Thanks, - Tom [nvptx, openacc] Don't emit barriers for empty loops 2018-04-21 Tom de Vries PR target/85381 * config/nvptx/nvptx.c (nvptx_process_pars): Don't emit barriers for empty loops. * testsuite/libgomp.oacc-c-c++-common/pr85381-2.c: New test. * testsuite/libgomp.oacc-c-c++-common/pr85381-3.c: New test. * testsuite/libgomp.oacc-c-c++-common/pr85381-4.c: New test. * testsuite/libgomp.oacc-c-c++-common/pr85381-5.c: New test. * testsuite/libgomp.oacc-c-c++-common/pr85381.c: New test. --- gcc/config/nvptx/nvptx.c | 15 +++++++--- .../libgomp.oacc-c-c++-common/pr85381-2.c | 35 ++++++++++++++++++++++ .../libgomp.oacc-c-c++-common/pr85381-3.c | 34 +++++++++++++++++++++ .../libgomp.oacc-c-c++-common/pr85381-4.c | 26 ++++++++++++++++ .../libgomp.oacc-c-c++-common/pr85381-5.c | 23 ++++++++++++++ .../testsuite/libgomp.oacc-c-c++-common/pr85381.c | 17 +++++++++++ 6 files changed, 146 insertions(+), 4 deletions(-) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 8c478c8..3aee9cc 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -4467,9 +4467,12 @@ nvptx_process_pars (parallel *par) { nvptx_shared_propagate (false, is_call, par->forked_block, par->forked_insn, !worker); - bool empty = nvptx_shared_propagate (true, is_call, - par->forked_block, par->fork_insn, - !worker); + bool no_prop_p + = nvptx_shared_propagate (true, is_call, par->forked_block, + par->fork_insn, !worker); + bool empty_loop_p + = !is_call && (NEXT_INSN (par->forked_insn) + && NEXT_INSN (par->forked_insn) == par->joining_insn); rtx barrier = GEN_INT (0); int threads = 0; @@ -4479,7 +4482,11 @@ nvptx_process_pars (parallel *par) threads = nvptx_mach_vector_length (); } - if (!empty || !is_call) + if (no_prop_p && empty_loop_p) + ; + else if (no_prop_p && is_call) + ; + else { /* Insert begin and end synchronizations. */ emit_insn_before (nvptx_cta_sync (barrier, threads), diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-2.c new file mode 100644 index 0000000..e5d02cf --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-2.c @@ -0,0 +1,35 @@ +/* { dg-additional-options "-save-temps" } */ +/* { dg-skip-if "" { *-*-* } { "*" } { "-DACC_DEVICE_TYPE_nvidia=1 -O2" } } */ + +int +main (void) +{ + int v1; + + #pragma acc parallel + #pragma acc loop worker + for (v1 = 0; v1 < 20; v1 += 2) + ; + + return 0; +} + +/* Todo: Boths bar.syncs can be removed. + Atm we generate this dead code inbetween forked and joining: + + mov.u32 %r28, %ntid.y; + mov.u32 %r29, %tid.y; + add.u32 %r30, %r29, %r29; + setp.gt.s32 %r31, %r30, 19; + @%r31 bra $L2; + add.u32 %r25, %r28, %r28; + mov.u32 %r24, %r30; + $L3: + add.u32 %r24, %r24, %r25; + setp.le.s32 %r33, %r24, 19; + @%r33 bra $L3; + $L2: + + so the loop is not recognized as empty loop (which we detect by seeing if + joining immediately follows forked). */ +/* { dg-final { scan-assembler-times "bar.sync" 2 } } */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-3.c new file mode 100644 index 0000000..7d9ba1b --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-3.c @@ -0,0 +1,34 @@ +/* { dg-additional-options "-save-temps -w" } */ +/* { dg-skip-if "" { *-*-* } { "*" } { "-DACC_DEVICE_TYPE_nvidia=1 -O2" } } */ + +int a; +#pragma acc declare create(a) + +#pragma acc routine vector +void __attribute__((noinline, noclone)) +foo_v (void) +{ + a = 1; +} + +#pragma acc routine worker +void __attribute__((noinline, noclone)) +foo_w (void) +{ + a = 2; +} + +int +main (void) +{ + + #pragma acc parallel + foo_v (); + + #pragma acc parallel + foo_w (); + + return 0; +} + +/* { dg-final { scan-assembler-not "bar.sync" } } */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-4.c new file mode 100644 index 0000000..477297d --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-4.c @@ -0,0 +1,26 @@ +/* { dg-additional-options "-save-temps -w" } */ +/* { dg-skip-if "" { *-*-* } { "*" } { "-DACC_DEVICE_TYPE_nvidia=1 -O2" } } */ + +#define n 1024 + +int +main (void) +{ + #pragma acc parallel + { + #pragma acc loop worker + for (int i = 0; i < n; i++) + ; + + #pragma acc loop worker + for (int i = 0; i < n; i++) + ; + } + + return 0; +} + +/* Atm, %ntid.y is broadcast from one loop to the next, so there are 2 bar.syncs + for that (the other two are there for the same reason as in pr85381-2.c). + Todo: Recompute %ntid.y instead of broadcasting it. */ +/* { dg-final { scan-assembler-times "bar.sync" 4 } } */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-5.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-5.c new file mode 100644 index 0000000..4653009 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-5.c @@ -0,0 +1,23 @@ +/* { dg-additional-options "-save-temps" } */ +/* { dg-skip-if "" { *-*-* } { "*" } { "-DACC_DEVICE_TYPE_nvidia=1 -O2" } } */ + +#define n 1024 + +int +main (void) +{ + #pragma acc parallel vector_length(128) + { + #pragma acc loop vector + for (int i = 0; i < n; i++) + ; + + #pragma acc loop vector + for (int i = 0; i < n; i++) + ; + } + + return 0; +} + +/* { dg-final { scan-assembler-not "bar.sync" } } */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381.c new file mode 100644 index 0000000..f585ae5 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381.c @@ -0,0 +1,17 @@ +/* { dg-additional-options "-save-temps" } */ +/* { dg-skip-if "" { *-*-* } { "*" } { "-DACC_DEVICE_TYPE_nvidia=1 -O2" } } */ + +int +main (void) +{ + int v1; + + #pragma acc parallel vector_length (128) + #pragma acc loop vector + for (v1 = 0; v1 < 20; v1 += 2) + ; + + return 0; +} + +/* { dg-final { scan-assembler-not "bar.sync" } } */