From patchwork Tue Mar 20 10:39:05 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tom de Vries X-Patchwork-Id: 888150 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=gcc-patches-return-474999-incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=mentor.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b="EHHB47hx"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4058ZF6TJZz9sX0 for ; Tue, 20 Mar 2018 21:39:33 +1100 (AEDT) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :subject:from:to:cc:references:message-id:date:mime-version :in-reply-to:content-type; q=dns; s=default; b=AaJ+d0ak234ZbdS8K w/pWFY6LRoXGklKA6Y2uoQ71slblyDZ800/MDT89tlWyaUeCjXIK1881F2SbGmIz e2gC+db3/RY2iN8m+e/XaFHb+E+Y+N+st0AJRkMFd042yLa/lDCnrOucFhTxYDR2 r4FD3buXu5SOBLe9vjKdI1Ec0g= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :subject:from:to:cc:references:message-id:date:mime-version :in-reply-to:content-type; s=default; bh=SCLUvNb3MSzRE6oj+bhkt4v A6ss=; b=EHHB47hxOcxKYeAzlUrIaLQ1oNaDE50gnoIMYTcSK/5LhVFINIHOgli 2o6I84I4z0vLbBSCoWre02foHqMmQBrFgrGKgttWM3s8gav8+gozzVrzadZaSsN4 AKB9Qdu5GOyiK2yfN52uTWi3eDlTZtO3yMOxGsrsZxwOP2Eh0xGo= Received: (qmail 107621 invoked by alias); 20 Mar 2018 10:39:26 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 107607 invoked by uid 89); 20 Mar 2018 10:39:25 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-24.8 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_PASS, URIBL_RED autolearn=ham version=3.3.2 spammy= X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 20 Mar 2018 10:39:23 +0000 Received: from nat-ies.mentorg.com ([192.94.31.2] helo=SVR-IES-MBX-04.mgc.mentorg.com) by relay1.mentorg.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-SHA384:256) id 1eyEfp-0003Hh-RG from Tom_deVries@mentor.com ; Tue, 20 Mar 2018 03:39:21 -0700 Received: from [172.30.73.228] (137.202.0.87) by SVR-IES-MBX-04.mgc.mentorg.com (139.181.222.4) with Microsoft SMTP Server (TLS) id 15.0.1320.4; Tue, 20 Mar 2018 10:39:09 +0000 Subject: [nvptx, PR84952, committed] Fix bar.sync position From: Tom de Vries To: "gcc-patches@gcc.gnu.org" CC: Cesar Philippidis , Thomas Schwinge , Richard Biener References: <600a90eb-fbc6-1b35-a3d3-f34915473951@codesourcery.com> <1cd131fc-32f3-d453-35ec-d23c20cfa3ee@codesourcery.com> Message-ID: <59a758ce-b4df-fd68-6855-1236582e454f@mentor.com> Date: Tue, 20 Mar 2018 11:39:05 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 MIME-Version: 1.0 In-Reply-To: X-ClientProxiedBy: svr-ies-mbx-01.mgc.mentorg.com (139.181.222.1) To SVR-IES-MBX-04.mgc.mentorg.com (139.181.222.4) [ was: Re: [og7] Update nvptx_fork/join barrier placement ] On 03/19/2018 06:02 PM, Tom de Vries wrote: > I've got a tentative patch at > https://gcc.gnu.org/bugzilla/attachment.cgi?id=43707 ( PR84952 - > "[nvptx] bar.sync generated in divergent code" ). Tested on x86_64 with nvptx accelerator (in combination with a patch that verifies the positioning of bar.sync). Committed to stage4 trunk. [ Recap: Consider testcase workers.c: ... int main (void) { int a[10]; #pragma acc parallel loop worker for (int i = 0; i < 10; i++) a[i] = i; return 0; } ... At -O2, we generate (edited for readability): ... // BEGIN PREAMBLE .version 3.1 .target sm_30 .address_size 64 // END PREAMBLE // BEGIN FUNCTION DECL: main$_omp_fn$0 .entry main$_omp_fn$0 (.param .u64 %in_ar0); //:FUNC_MAP "main$_omp_fn$0", 0x1, 0x20, 0x20 // BEGIN VAR DEF: __worker_bcast .shared .align 8 .u8 __worker_bcast[8]; // BEGIN FUNCTION DEF: main$_omp_fn$0 .entry main$_omp_fn$0 (.param .u64 %in_ar0) { .reg .u64 %ar0; ld.param.u64 %ar0,[%in_ar0]; .reg .u32 %r24; .reg .u64 %r25; .reg .pred %r26; .reg .u64 %r27; .reg .u64 %r28; .reg .u64 %r29; .reg .u64 %r30; .reg .u64 %r31; .reg .u64 %r32; .reg .pred %r33; .reg .pred %r34; { .reg .u32 %y; mov.u32 %y,%tid.y; setp.ne.u32 %r34,%y,0; } { .reg .u32 %x; mov.u32 %x,%tid.x; setp.ne.u32 %r33,%x,0; } @ %r34 bra.uni $L6; @ %r33 bra $L7; mov.u64 %r25,%ar0; // fork 2; cvta.shared.u64 %r32,__worker_bcast; st.u64 [%r32],%r25; $L7: $L6: @ %r33 bra $L5; // forked 2; bar.sync 0; cvta.shared.u64 %r31,__worker_bcast; ld.u64 %r25,[%r31]; mov.u32 %r24,%tid.y; setp.le.s32 %r26,%r24,9; @ %r26 bra $L2; bra $L3; $L2: ld.u64 %r27,[%r25]; cvt.s64.s32 %r28,%r24; shl.b64 %r29,%r28,2; add.u64 %r30,%r27,%r29; st.u32 [%r30],%r24; $L3: bar.sync 1; // joining 2; $L5: @ %r34 bra.uni $L8; @ %r33 bra $L9; // join 2; $L9: $L8: ret; } ... The problem is the positioning of bar.sync, inside the vector-neutering branch "@ %r33 bra $L5". The documentation for bar.sync says: ... Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp). In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge). ... The documentation is somewhat contradictory, in that it first explains that that it is executed on a per-warp basis (implying that only one thread executing it should be fine), but then goes on to state that it should not be executed in divergent mode (implying that all threads should execute it). Either way, the safest form of usage is: don't execute in divergent mode. As is evident from the example above, we do generate bar.sync in divergent mode, and patch below fixes that. With the patch, the difference in positioning of bar.sync is in the example above is: ... @@ -42,18 +42,18 @@ st.u64 [%r32], %r25; $L7: $L6: + bar.sync 0; @%r33 bra $L5; // forked 2; - bar.sync 0; cvta.shared.u64 %r31, __worker_bcast; ld.u64 %r25, [%r31]; mov.u32 %r24, %tid.y; setp.le.s32 %r26, %r24, 9; @%r26 bra $L2; $L3: - bar.sync 1; // joining 2; $L5: + bar.sync 1; @%r34 bra.uni $L8; @%r33 bra $L9; // join 2; ... ] Thanks, - Tom [nvptx] Fix bar.sync position 2018-03-20 Tom de Vries PR target/84952 * config/nvptx/nvptx.c (nvptx_single): Don't neuter bar.sync. (nvptx_process_pars): Emit bar.sync asap and alap. --- gcc/config/nvptx/nvptx.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index a6f4443..a839988 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -3969,7 +3969,9 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) while (true) { /* Find first insn of from block. */ - while (head != BB_END (from) && !INSN_P (head)) + while (head != BB_END (from) + && (!INSN_P (head) + || recog_memoized (head) == CODE_FOR_nvptx_barsync)) head = NEXT_INSN (head); if (from == to) @@ -4018,6 +4020,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) { default: break; + case CODE_FOR_nvptx_barsync: case CODE_FOR_nvptx_fork: case CODE_FOR_nvptx_forked: case CODE_FOR_nvptx_joining: @@ -4275,8 +4278,8 @@ nvptx_process_pars (parallel *par) nvptx_wpropagate (false, par->forked_block, par->forked_insn); nvptx_wpropagate (true, par->forked_block, par->fork_insn); /* Insert begin and end synchronizations. */ - emit_insn_after (nvptx_wsync (false), par->forked_insn); - emit_insn_before (nvptx_wsync (true), par->joining_insn); + emit_insn_before (nvptx_wsync (false), par->forked_insn); + emit_insn_before (nvptx_wsync (true), par->join_insn); } else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) nvptx_vpropagate (par->forked_block, par->forked_insn);