From patchwork Tue Nov 12 13:29:14 2019
X-Patchwork-Submitter: Andrew Stubbs
X-Patchwork-Id: 1193542
From: Andrew Stubbs
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 5/7 libgomp,amdgcn] Optimize GCN OpenMP malloc performance
Date: Tue, 12 Nov 2019 13:29:14 +0000
Message-ID: <31005bba27788173688c0cb8f332c58d27cacbbb.1573560401.git.ams@codesourcery.com>
MIME-Version: 1.0

This patch implements a malloc optimization to reduce the startup and
shutdown overhead of each OpenMP team.  New allocation functions,
"team_malloc" and "team_free", take memory from a per-team memory arena
provided by the plugin, rather than from the shared heap, which is slow
and gets slower the more teams try to allocate at once.

The new functions are used both in config/gcn/team.c and in selected
places elsewhere in libgomp.  Arena space is limited (and larger arenas
cost more at launch time), so this should not become a global
search-and-replace.

Dummy pass-through definitions are provided for other targets.
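In case it helps review: the fast path is nothing more than an atomic bump
of a per-team free pointer, with a fallback to the ordinary heap on
exhaustion.  Below is a minimal host-side sketch of the scheme, not the GCN
code itself; "arena", "arena_malloc" and friends are illustrative names
only, and a plain __atomic_fetch_add stands in for the ds_add_rtn_u64
instruction the patch emits.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ARENA_SIZE (64 * 1024)

static char arena[ARENA_SIZE];
static uintptr_t arena_free;	/* Next free byte (cf. TEAM_ARENA_FREE).  */
static uintptr_t arena_end;	/* One past the arena (cf. TEAM_ARENA_END).  */

static void *
arena_malloc (size_t size)
{
  /* 4-byte align the size, as team_malloc does.  */
  size = (size + 3) & ~(size_t) 3;

  /* Atomically bump the free pointer; every concurrent caller receives a
     unique slice, so no lock is needed.  */
  uintptr_t result = __atomic_fetch_add (&arena_free, size,
					 __ATOMIC_RELAXED);

  /* On exhaustion, fall back to the (slow) shared heap.  */
  if (result + size > arena_end)
    return malloc (size);

  return (void *) result;
}

int
main (void)
{
  /* The kernel entry code does the equivalent of this setup.  */
  arena_free = (uintptr_t) arena;
  arena_end = arena_free + ARENA_SIZE;

  void *p = arena_malloc (100);
  void *q = arena_malloc (100);
  printf ("%p %p\n", p, q);	/* q is 104 bytes above p (aligned).  */
  return 0;
}

Freeing individual objects is then unnecessary on the device: the whole
arena is discarded when the team exits, at the cost (flagged in a comment
below) that heap-fallback allocations are never freed.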
OK to commit?

Thanks

Andrew

2019-11-12  Andrew Stubbs

	libgomp/
	* config/gcn/team.c (gomp_gcn_enter_kernel): Set up the team arena
	and use team_malloc variants.
	(gomp_gcn_exit_kernel): Use team_free.
	* libgomp.h (TEAM_ARENA_SIZE): Define.
	(TEAM_ARENA_FREE): Define.
	(TEAM_ARENA_END): Define.
	(team_malloc): New function.
	(team_malloc_cleared): New function.
	(team_free): New function.
	* team.c (gomp_new_team): Use team_malloc.
	(free_team): Use team_free.
	(gomp_free_thread): Use team_free.
	(gomp_pause_host): Use team_free.
	* work.c (gomp_init_work_share): Use team_malloc.
	(gomp_fini_work_share): Use team_free.
---
 libgomp/config/gcn/team.c | 18 ++++++++++---
 libgomp/libgomp.h         | 56 +++++++++++++++++++++++++++++++++++++++
 libgomp/team.c            | 12 ++++-----
 libgomp/work.c            |  4 +--
 4 files changed, 78 insertions(+), 12 deletions(-)

diff --git a/libgomp/config/gcn/team.c b/libgomp/config/gcn/team.c
index c566482bda2..063571fc751 100644
--- a/libgomp/config/gcn/team.c
+++ b/libgomp/config/gcn/team.c
@@ -57,16 +57,26 @@ gomp_gcn_enter_kernel (void)
       /* Starting additional threads is not supported.  */
       gomp_global_icv.dyn_var = true;
 
+      /* Initialize the team arena for optimized memory allocation.
+         The arena has been allocated on the host side, and the address
+         passed in via the kernargs.  Each team takes a small slice of it.  */
+      register void **kernargs asm("s8");
+      void *team_arena = (kernargs[4] + TEAM_ARENA_SIZE*teamid);
+      void * __lds *arena_free = (void * __lds *)TEAM_ARENA_FREE;
+      void * __lds *arena_end = (void * __lds *)TEAM_ARENA_END;
+      *arena_free = team_arena;
+      *arena_end = team_arena + TEAM_ARENA_SIZE;
+
       /* Allocate and initialize the team-local-storage data.  */
-      struct gomp_thread *thrs = gomp_malloc_cleared (sizeof (*thrs)
+      struct gomp_thread *thrs = team_malloc_cleared (sizeof (*thrs)
                                                       * numthreads);
       set_gcn_thrs (thrs);
 
       /* Allocate and initialize a pool of threads in the team.
         The threads are already running, of course, we just need to manage
         the communication between them.  */
-      struct gomp_thread_pool *pool = gomp_malloc (sizeof (*pool));
-      pool->threads = gomp_malloc (sizeof (void *) * numthreads);
+      struct gomp_thread_pool *pool = team_malloc (sizeof (*pool));
+      pool->threads = team_malloc (sizeof (void *) * numthreads);
       for (int tid = 0; tid < numthreads; tid++)
         pool->threads[tid] = &thrs[tid];
       pool->threads_size = numthreads;
@@ -91,7 +101,7 @@ void
 gomp_gcn_exit_kernel (void)
 {
   gomp_free_thread (gcn_thrs ());
-  free (gcn_thrs ());
+  team_free (gcn_thrs ());
 }
 
 /* This function contains the idle loop in which a thread waits
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 19e1241ee4c..659aeb95ffe 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -106,6 +106,62 @@ extern void gomp_aligned_free (void *);
    GCC's builtin alloca().  */
 #define gomp_alloca(x)  __builtin_alloca(x)
 
+/* Optimized allocators for team-specific data that will die with the team.  */
+
+#ifdef __AMDGCN__
+/* The arena is initialized in config/gcn/team.c.  */
+#define TEAM_ARENA_SIZE 64*1024  /* Must match the value in plugin-gcn.c.  */
+#define TEAM_ARENA_FREE 16  /* LDS offset of free pointer.  */
+#define TEAM_ARENA_END 24   /* LDS offset of end pointer.  */
+
+static inline void * __attribute__((malloc))
+team_malloc (size_t size)
+{
+  /* 4-byte align the size.  */
+  size = (size + 3) & ~3;
+
+  /* Allocate directly from the arena.
+     The compiler does not support DS atomics, yet.  */
+  void *result;
+  asm ("ds_add_rtn_u64 %0, %1, %2\n\ts_waitcnt 0"
+       : "=v"(result) : "v"(TEAM_ARENA_FREE), "v"(size), "e"(1L) : "memory");
+
+  /* Handle OOM.  */
+  if (result + size > *(void * __lds *)TEAM_ARENA_END)
+    {
+      const char msg[] = "GCN team arena exhausted\n";
+      write (2, msg, sizeof(msg)-1);
+      /* It's better to continue with reduced performance than abort.
+         Beware that this won't get freed, which might cause more problems.  */
+      result = gomp_malloc (size);
+    }
+  return result;
+}
+
+static inline void * __attribute__((malloc)) __attribute__((optimize("-O3")))
+team_malloc_cleared (size_t size)
+{
+  char *result = team_malloc (size);
+
+  /* Clear the allocated memory.
+     This should vectorize.  The allocation has been rounded up to the next
+     4-byte boundary, so this is safe.  */
+  for (int i = 0; i < size / 4; i++)
+    ((int *) result)[i] = 0;
+
+  return result;
+}
+
+static inline void
+team_free (void *ptr)
+{
+  /* The whole arena is freed when the kernel exits; see team_malloc.  */
+}
+#else
+#define team_malloc(...) gomp_malloc (__VA_ARGS__)
+#define team_malloc_cleared(...) gomp_malloc_cleared (__VA_ARGS__)
+#define team_free(...) free (__VA_ARGS__)
+#endif
+
 /* Error routines.  */
 extern void gomp_vfatal (const char *, va_list)
         __attribute__ ((noreturn, format (printf, 1, 0)));
diff --git a/libgomp/team.c b/libgomp/team.c
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -171,7 +171,7 @@ gomp_new_team (unsigned nthreads)
     {
       size_t extra = sizeof (team->ordered_release[0])
                      + sizeof (team->implicit_task[0]);
-      team = gomp_malloc (sizeof (*team) + nthreads * extra);
+      team = team_malloc (sizeof (*team) + nthreads * extra);
 
 #ifndef HAVE_SYNC_BUILTINS
       gomp_mutex_init (&team->work_share_list_free_lock);
@@ -221,7 +221,7 @@ free_team (struct gomp_team *team)
   gomp_barrier_destroy (&team->barrier);
   gomp_mutex_destroy (&team->task_lock);
   priority_queue_free (&team->task_queue);
-  free (team);
+  team_free (team);
 }
 
 static void
@@ -285,8 +285,8 @@ gomp_free_thread (void *arg __attribute__((unused)))
       if (pool->last_team)
         free_team (pool->last_team);
 #ifndef __nvptx__
-      free (pool->threads);
-      free (pool);
+      team_free (pool->threads);
+      team_free (pool);
 #endif
       thr->thread_pool = NULL;
     }
@@ -1082,8 +1082,8 @@ gomp_pause_host (void)
       if (pool->last_team)
         free_team (pool->last_team);
 #ifndef __nvptx__
-      free (pool->threads);
-      free (pool);
+      team_free (pool->threads);
+      team_free (pool);
 #endif
       thr->thread_pool = NULL;
     }
diff --git a/libgomp/work.c b/libgomp/work.c
index a589b8b5231..28bb0c11255 100644
--- a/libgomp/work.c
+++ b/libgomp/work.c
@@ -120,7 +120,7 @@ gomp_init_work_share (struct gomp_work_share *ws, size_t ordered,
   else
     ordered = nthreads * sizeof (*ws->ordered_team_ids);
   if (ordered > INLINE_ORDERED_TEAM_IDS_SIZE)
-    ws->ordered_team_ids = gomp_malloc (ordered);
+    ws->ordered_team_ids = team_malloc (ordered);
   else
     ws->ordered_team_ids = ws->inline_ordered_team_ids;
   memset (ws->ordered_team_ids, '\0', ordered);
@@ -142,7 +142,7 @@ gomp_fini_work_share (struct gomp_work_share *ws)
 {
   gomp_mutex_destroy (&ws->lock);
   if (ws->ordered_team_ids != ws->inline_ordered_team_ids)
-    free (ws->ordered_team_ids);
+    team_free (ws->ordered_team_ids);
   gomp_ptrlock_destroy (&ws->next_ws);
 }
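Returning to the arena setup in gomp_gcn_enter_kernel above: the host
plugin allocates one contiguous block and passes its base address through
the kernargs, and each team takes the slice at TEAM_ARENA_SIZE * teamid.
Here is a trivial host-side model of that address arithmetic; num_teams
and arena_base are hypothetical names, not part of the patch.

#include <stdio.h>
#include <stdlib.h>

#define TEAM_ARENA_SIZE (64 * 1024)	/* Matches the value in the patch.  */

int
main (void)
{
  int num_teams = 4;	/* Hypothetical; the real count comes from the launch.  */
  char *arena_base = malloc ((size_t) TEAM_ARENA_SIZE * num_teams);

  for (int teamid = 0; teamid < num_teams; teamid++)
    {
      /* Mirrors: team_arena = kernargs[4] + TEAM_ARENA_SIZE * teamid.  */
      char *team_arena = arena_base + (size_t) TEAM_ARENA_SIZE * teamid;
      printf ("team %d: [%p, %p)\n", teamid, (void *) team_arena,
	      (void *) (team_arena + TEAM_ARENA_SIZE));
    }

  free (arena_base);
  return 0;
}

This is also why larger arena sizes add launch-time overhead: the whole
block has to be allocated up front, before any team runs.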