From patchwork Fri Nov 22 17:02:20 2019
X-Patchwork-Submitter: Andrew Stubbs
X-Patchwork-Id: 1199554
From: Andrew Stubbs
Subject: [committed, amdgcn] Limit LDS usage
To: "gcc-patches@gcc.gnu.org"
Message-ID: <3873d05a-c08b-70e9-8112-5beb0f6dbddb@codesourcery.com>
Date: Fri, 22 Nov 2019 17:02:20 +0000

This patch changes the amount of LDS (Local Data Store) memory requested
for offload kernels. This allows more teams/gangs to run on the same
compute unit, increasing potential data throughput.

For OpenMP we can reduce the allocation to almost nothing. This means we
can have up to 40 single-thread teams per CU.

For OpenACC we need enough LDS to broadcast data between workers, and
the algorithm is not particularly memory efficient. This means we cannot
yet achieve the maximum thread count, but we can at least double the
current thread count -- to 32 -- by halving the LDS usage and relying on
having 16 workers.

(Note that I'm assuming Julian's multi-worker support patches will be
committed soon. Without those we can allocate no LDS and have 40
single-worker teams. With the patches the same can also be true, but
that's still on the to-do list.)

LDS allocation remains unchanged for non-offload compiles (this is only
really used for running the testsuite).

Limit LDS usage.

2019-11-22  Andrew Stubbs

	gcc/
	* config/gcn/gcn.c (OMP_LDS_SIZE): Define.
	(ACC_LDS_SIZE): Define.
	(OTHER_LDS_SIZE): Define.
	(LDS_SIZE): Redefine using above.
	(gcn_expand_prologue): Initialize m0 with LDS_SIZE-1.

diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index 3a8c10ed8b4..f85d84bbe95 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -70,10 +70,15 @@ int gcn_isa = 3;		/* Default to GCN3.  */
    worker-single mode to worker-partitioned mode), per workgroup.  Global
    analysis could calculate an exact bound, but we don't do that yet.
 
-   We reserve the whole LDS, which also prevents any other workgroup
-   sharing the Compute Unit.  */
+   We want to permit full occupancy, so size accordingly.  */
 
-#define LDS_SIZE 65536
+#define OMP_LDS_SIZE 0x600    /* 0x600 is 1/40 total, rounded down.  */
+#define ACC_LDS_SIZE 32768    /* Half of the total should be fine.  */
+#define OTHER_LDS_SIZE 65536  /* If in doubt, reserve all of it.  */
+
+#define LDS_SIZE (flag_openacc ? ACC_LDS_SIZE \
+                  : flag_openmp ? OMP_LDS_SIZE \
+                  : OTHER_LDS_SIZE)
 
 /* The number of registers usable by normal non-kernel functions.
    The SGPR count includes any special extra registers such as VCC.  */
@@ -2876,8 +2881,11 @@ gcn_expand_prologue ()
       /* Ensure that the scheduler doesn't do anything unexpected.  */
       emit_insn (gen_blockage ());
 
+      /* m0 is initialized for the usual LDS DS and FLAT memory case.
+         The low-part is the address of the topmost addressable byte, which is
+         size-1.  The high-part is an offset and should be zero.  */
       emit_move_insn (gen_rtx_REG (SImode, M0_REG),
-                      gen_int_mode (LDS_SIZE, SImode));
+                      gen_int_mode (LDS_SIZE-1, SImode));
 
       emit_insn (gen_prologue_use (gen_rtx_REG (SImode, M0_REG)));
 