From patchwork Thu Mar 10 11:27:30 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tom de Vries X-Patchwork-Id: 1603754 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: bilbo.ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.a=rsa-sha256 header.s=default header.b=ViMg3Zcz; dkim-atps=neutral Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=8.43.85.97; helo=sourceware.org; envelope-from=gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Received: from sourceware.org (server2.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by bilbo.ozlabs.org (Postfix) with ESMTPS id 4KDn1C0B3sz9sGF for ; Thu, 10 Mar 2022 22:30:23 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id EC3F93857C5A for ; Thu, 10 Mar 2022 11:30:20 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org EC3F93857C5A DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1646911821; bh=KOtvZTsiFKVsUm0vP8B6Mb9Eqof0bO/h4CeuETExsDU=; h=Date:To:Subject:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=ViMg3ZczA30xTDndBCUG9grlaZQT4Gdg1M1zm3WJzP9aXzGRHAA7F3vj8gxEld40Z s8Q4YUj69bMBvrcmAzNnwtDMsBioYs0NrsVYun9t4GNquKHiWzZQPiB02vrJuZEVdC el9GrHHiWnHf2ZH59yFGBj5Aq6W6l/1t0ywREhUc= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by sourceware.org (Postfix) with ESMTPS id 290F43857C44 for ; Thu, 10 Mar 2022 11:27:33 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 290F43857C44 Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id 27C27210E1 for ; Thu, 10 Mar 2022 11:27:32 +0000 (UTC) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 158D713FA3 for ; Thu, 10 Mar 2022 11:27:32 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id hLMpBKTgKWJFDQAAMHmgww (envelope-from ) for ; Thu, 10 Mar 2022 11:27:32 +0000 Date: Thu, 10 Mar 2022 12:27:30 +0100 To: gcc-patches@gcc.gnu.org Subject: [committed][nvptx] Disable warp sync in simt region Message-ID: <20220310112728.GA29486@delia.home> MIME-Version: 1.0 Content-Disposition: inline User-Agent: Mutt/1.10.1 (2018-07-13) X-Spam-Status: No, score=-12.6 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Tom de Vries via Gcc-patches From: Tom de Vries Reply-To: Tom de Vries Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org Sender: "Gcc-patches" Hi, I ran into a hang for this code: ... #pragma omp target map(tofrom: counter_N0) #pragma omp simd for (int i = 0 ; i < 1 ; i++ ) { #pragma omp atomic update counter_N0 = counter_N0 + 1 ; } ... This has to do with the nature of -muniform-simt. It has two modes of operation: inside and outside an SIMT region. Outside an SIMT region, a warp pretends to execute a single thread, but actually executes in all threads, to keep the local registers in all threads consistent. This approach works unless the insn that is executed is a syscall or an atomic insn. In that case, the insn is predicated, such that it executes in only one thread. If the predicated insn writes a result to a register, then that register is propagated to the other threads, after which the local registers in all threads are consistent again. Inside an SIMT region, a warp executes in all threads. However, the predication and propagation for syscalls and atomic insns is also present here, because nvptx_reorg_uniform_simt works on all code. Care has been taken though to ensure that the predication and propagation is a nop. That is, inside an SIMT region: - the predicate evalutes to true for each thread, and - the propagation insn copies a register from each thread to the same thread. That works fine, until we use -mptx=6.0, and instead of using the deprecated warp propagation insn shfl, we start using shfl.sync: ... @%r33 atom.add.u32 _, [%r29], 1; shfl.sync.idx.b32 %r30, %r30, %r32, 31, 0xffffffff; ... The shfl.sync specifies a member mask indicating all threads, but given that the loop only has a single iteration, only thread 0 will execute the insn, where it will hang waiting for the other threads. Fix this by predicating the shfl.sync (and likewise, bar.warp.sync and the uniform warp check) such that it only executes outside the SIMT region. Tested on x86_64 with nvptx accelerator. Committed to trunk. Thanks, - Tom [nvptx] Disable warp sync in simt region gcc/ChangeLog: 2022-03-08 Tom de Vries PR target/104783 * config/nvptx/nvptx.cc (nvptx_init_unisimt_predicate) (nvptx_output_unisimt_switch): Handle unisimt_outside_simt_predicate. (nvptx_get_unisimt_outside_simt_predicate): New function. (predicate_insn): New function, factored out of ... (nvptx_reorg_uniform_simt): ... here. Predicate all emitted insns. * config/nvptx/nvptx.h (struct machine_function): Add unisimt_outside_simt_predicate field. * config/nvptx/nvptx.md (define_insn "nvptx_warpsync") (define_insn "nvptx_uniform_warp_check"): Make predicable. libgomp/ChangeLog: 2022-03-10 Tom de Vries * testsuite/libgomp.c/pr104783.c: New test. --- gcc/config/nvptx/nvptx.cc | 45 +++++++++++++++++++++++++++++++--- gcc/config/nvptx/nvptx.h | 1 + gcc/config/nvptx/nvptx.md | 29 ++++++++++++---------- libgomp/testsuite/libgomp.c/pr104783.c | 18 ++++++++++++++ 4 files changed, 76 insertions(+), 17 deletions(-) diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc index c41e305a34f..3a7be63c290 100644 --- a/gcc/config/nvptx/nvptx.cc +++ b/gcc/config/nvptx/nvptx.cc @@ -1364,6 +1364,13 @@ nvptx_init_unisimt_predicate (FILE *file) int master = REGNO (cfun->machine->unisimt_master); int pred = REGNO (cfun->machine->unisimt_predicate); fprintf (file, "\t\tld.shared.u32 %%r%d, [%%r%d];\n", master, loc); + if (cfun->machine->unisimt_outside_simt_predicate) + { + int pred_outside_simt + = REGNO (cfun->machine->unisimt_outside_simt_predicate); + fprintf (file, "\t\tsetp.eq.u32 %%r%d, %%r%d, 0;\n", + pred_outside_simt, master); + } fprintf (file, "\t\tmov.u32 %%ustmp0, %%laneid;\n"); /* Compute 'master lane index' as 'laneid & __nvptx_uni[tid.y]'. */ fprintf (file, "\t\tand.b32 %%r%d, %%r%d, %%ustmp0;\n", master, master); @@ -1589,6 +1596,13 @@ nvptx_output_unisimt_switch (FILE *file, bool entering) fprintf (file, "\t{\n"); fprintf (file, "\t\t.reg.u32 %%ustmp2;\n"); fprintf (file, "\t\tmov.u32 %%ustmp2, %d;\n", entering ? -1 : 0); + if (cfun->machine->unisimt_outside_simt_predicate) + { + int pred_outside_simt + = REGNO (cfun->machine->unisimt_outside_simt_predicate); + fprintf (file, "\t\tmov.pred %%r%d, %d;\n", pred_outside_simt, + entering ? 0 : 1); + } if (!crtl->is_leaf) { int loc = REGNO (cfun->machine->unisimt_location); @@ -3242,6 +3256,13 @@ nvptx_get_unisimt_predicate () return pred ? pred : pred = gen_reg_rtx (BImode); } +static rtx +nvptx_get_unisimt_outside_simt_predicate () +{ + rtx &pred = cfun->machine->unisimt_outside_simt_predicate; + return pred ? pred : pred = gen_reg_rtx (BImode); +} + /* Return true if given call insn references one of the functions provided by the CUDA runtime: malloc, free, vprintf. */ @@ -3286,6 +3307,16 @@ nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master) return false; } +static void +predicate_insn (rtx_insn *insn, rtx pred) +{ + rtx pat = PATTERN (insn); + pred = gen_rtx_NE (BImode, pred, const0_rtx); + pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat); + bool changed_p = validate_change (insn, &PATTERN (insn), pat, false); + gcc_assert (changed_p); +} + /* Adjust code for uniform-simt code generation variant by making atomics and "syscalls" conditionally executed, and inserting shuffle-based propagation for registers being set. */ @@ -3352,10 +3383,16 @@ nvptx_reorg_uniform_simt () } rtx pred = nvptx_get_unisimt_predicate (); - pred = gen_rtx_NE (BImode, pred, const0_rtx); - pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat); - bool changed_p = validate_change (insn, &PATTERN (insn), pat, false); - gcc_assert (changed_p); + predicate_insn (insn, pred); + + pred = NULL_RTX; + for (rtx_insn *post = NEXT_INSN (insn); post != next; + post = NEXT_INSN (post)) + { + if (pred == NULL_RTX) + pred = nvptx_get_unisimt_outside_simt_predicate (); + predicate_insn (post, pred); + } } } diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h index 3ca22a595d2..b55ade65cc5 100644 --- a/gcc/config/nvptx/nvptx.h +++ b/gcc/config/nvptx/nvptx.h @@ -226,6 +226,7 @@ struct GTY(()) machine_function rtx sync_bar; /* Synchronization barrier ID for vectors. */ rtx unisimt_master; /* 'Master lane index' for -muniform-simt. */ rtx unisimt_predicate; /* Predicate for -muniform-simt. */ + rtx unisimt_outside_simt_predicate; /* Predicate for -muniform-simt. */ rtx unisimt_location; /* Mask location for -muniform-simt. */ /* The following two fields hold the maximum size resp. alignment required for per-lane storage in OpenMP SIMD regions. */ diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md index 1cbf197065f..1ccb0f11e4c 100644 --- a/gcc/config/nvptx/nvptx.md +++ b/gcc/config/nvptx/nvptx.md @@ -2268,25 +2268,28 @@ (define_insn "nvptx_barsync" (define_insn "nvptx_warpsync" [(unspec_volatile [(const_int 0)] UNSPECV_WARPSYNC)] "TARGET_PTX_6_0" - "\\tbar.warp.sync\\t0xffffffff;" - [(set_attr "predicable" "false")]) + "%.\\tbar.warp.sync\\t0xffffffff;") (define_insn "nvptx_uniform_warp_check" [(unspec_volatile [(const_int 0)] UNSPECV_UNIFORM_WARP_CHECK)] "" { - output_asm_insn ("{", NULL); - output_asm_insn ("\\t" ".reg.b32" "\\t" "act;", NULL); - output_asm_insn ("\\t" "vote.ballot.b32" "\\t" "act,1;", NULL); - output_asm_insn ("\\t" ".reg.pred" "\\t" "uni;", NULL); - output_asm_insn ("\\t" "setp.eq.b32" "\\t" "uni,act,0xffffffff;", - NULL); - output_asm_insn ("@ !uni\\t" "trap;", NULL); - output_asm_insn ("@ !uni\\t" "exit;", NULL); - output_asm_insn ("}", NULL); + const char *insns[] = { + "{", + "\\t" ".reg.b32" "\\t" "act;", + "%.\\t" "vote.ballot.b32" "\\t" "act,1;", + "\\t" ".reg.pred" "\\t" "do_abort;", + "\\t" "mov.pred" "\\t" "do_abort,0;", + "%.\\t" "setp.ne.b32" "\\t" "do_abort,act,0xffffffff;", + "@ do_abort\\t" "trap;", + "@ do_abort\\t" "exit;", + "}", + NULL + }; + for (const char **p = &insns[0]; *p != NULL; p++) + output_asm_insn (*p, NULL); return ""; - } - [(set_attr "predicable" "false")]) + }) (define_expand "memory_barrier" [(set (match_dup 0) diff --git a/libgomp/testsuite/libgomp.c/pr104783.c b/libgomp/testsuite/libgomp.c/pr104783.c new file mode 100644 index 00000000000..05a93cd6bc1 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/pr104783.c @@ -0,0 +1,18 @@ +int +main (void) +{ + unsigned val = 0; + +#pragma omp target map(tofrom: val) +#pragma omp simd + for (int i = 0 ; i < 1 ; i++) + { +#pragma omp atomic update + val = val + 1; + } + + if (val != 1) + __builtin_abort (); + + return 0; +}