From patchwork Mon Mar 21 10:21:51 2016
X-Patchwork-Submitter: Chung-Lin Tang
X-Patchwork-Id: 600044
From: Chung-Lin Tang
Subject: [PATCH 2/4, libgomp] Resolve deadlock on plugin exit, nvptx plugin parts
To: Nathan Sidwell, Thomas Schwinge, gcc-patches
Message-ID: <56EFCB3F.6000409@codesourcery.com>
Date: Mon, 21 Mar 2016 18:21:51 +0800

As attached.

Thanks,
Chung-Lin

        * plugin/plugin-nvptx.c (CUDA_CALL_ERET): New convenience macro.
        (CUDA_CALL): Likewise.
        (CUDA_CALL_ASSERT): Likewise.
        (map_init): Change return type to bool, use CUDA_CALL* macros.
        (map_fini): Likewise.
        (init_streams_for_device): Change return type to bool, adjust
        call to map_init.
        (fini_streams_for_device): Change return type to bool, adjust
        call to map_fini.
        (select_stream_for_async): Release stream_lock before calls to
        GOMP_PLUGIN_fatal, adjust call to map_init.
        (nvptx_init): Use CUDA_CALL* macros.
        (nvptx_attach_host_thread_to_device): Change return type to bool,
        use CUDA_CALL* macros.
        (nvptx_open_device): Use CUDA_CALL* macros.
        (nvptx_close_device): Change return type to bool, use CUDA_CALL*
        macros.
        (nvptx_get_num_devices): Use CUDA_CALL* macros.
        (link_ptx): Change return type to bool, use CUDA_CALL* macros.
        (nvptx_exec): Use CUDA_CALL* macros.
        (nvptx_alloc): Use CUDA_CALL* macros.
        (nvptx_free): Change return type to bool, use CUDA_CALL* macros.
        (nvptx_host2dev): Likewise.
        (nvptx_dev2host): Likewise.
        (nvptx_wait): Use CUDA_CALL* macros.
        (nvptx_wait_async): Likewise.
        (nvptx_wait_all): Likewise.
        (nvptx_wait_all_async): Likewise.
        (nvptx_set_cuda_stream): Adjust order of stream_lock acquire,
        use CUDA_CALL* macros, adjust call to map_fini.
        (GOMP_OFFLOAD_init_device): Change return type to bool, adjust
        code accordingly.
        (GOMP_OFFLOAD_fini_device): Likewise.
        (GOMP_OFFLOAD_load_image): Adjust calls to
        nvptx_attach_host_thread_to_device/link_ptx to handle errors,
        use CUDA_CALL* macros.
        (GOMP_OFFLOAD_unload_image): Change return type to bool, adjust
        return code.
        (GOMP_OFFLOAD_alloc): Adjust code to handle error return.
        (GOMP_OFFLOAD_free): Change return type to bool, adjust calls to
        handle error return.
        (GOMP_OFFLOAD_dev2host): Likewise.
        (GOMP_OFFLOAD_host2dev): Likewise.
        (GOMP_OFFLOAD_openacc_register_async_cleanup): Use CUDA_CALL*
        macros.
        (GOMP_OFFLOAD_openacc_create_thread_data): Likewise.

Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 234358)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -63,6 +63,34 @@ cuda_error (CUresult r)
   return desc;
 }
 
+/* Convenience macros for the frequently used CUDA library call and
+   error handling sequence.  This does not capture all the cases we
+   use in this file, but is common enough.  */
+
+#define CUDA_CALL_ERET(ERET, FN, ...)		\
+  do {						\
+    unsigned __r = FN (__VA_ARGS__);		\
+    if (__r != CUDA_SUCCESS)			\
+      {						\
+	GOMP_PLUGIN_error (#FN " error: %s",	\
+			   cuda_error (__r));	\
+	return ERET;				\
+      }						\
+  } while (0)
+
+#define CUDA_CALL(FN, ...)			\
+  CUDA_CALL_ERET (false, (FN), __VA_ARGS__)
+
+#define CUDA_CALL_ASSERT(FN, ...)		\
+  do {						\
+    unsigned __r = FN (__VA_ARGS__);		\
+    if (__r != CUDA_SUCCESS)			\
+      {						\
+	GOMP_PLUGIN_fatal (#FN " error: %s",	\
+			   cuda_error (__r));	\
+      }						\
+  } while (0)
+
 static unsigned int instantiated_devices = 0;
 static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
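To illustrate what the conversion looks like at a call site, here is a
hand-expanded sketch of CUDA_CALL (illustration only, not part of the patch;
the real expansion stringizes the function name via #FN):

    /* A call site inside a bool-returning function:  */
    CUDA_CALL (cuMemAllocHost, &s->h, size);

    /* ...expands to roughly:  */
    do {
      unsigned __r = cuMemAllocHost (&s->h, size);
      if (__r != CUDA_SUCCESS)
        {
          GOMP_PLUGIN_error ("cuMemAllocHost error: %s", cuda_error (__r));
          return false;
        }
    } while (0);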
@@ -98,25 +126,18 @@ struct map
   char mappings[0];
 };
 
-static void
+static bool
 map_init (struct ptx_stream *s)
 {
-  CUresult r;
-
   int size = getpagesize ();
 
   assert (s);
   assert (!s->d);
   assert (!s->h);
 
-  r = cuMemAllocHost (&s->h, size);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemAllocHost error: %s", cuda_error (r));
+  CUDA_CALL (cuMemAllocHost, &s->h, size);
+  CUDA_CALL (cuMemHostGetDevicePointer, &s->d, s->h, 0);
 
-  r = cuMemHostGetDevicePointer (&s->d, s->h, 0);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemHostGetDevicePointer error: %s", cuda_error (r));
-
   assert (s->h);
 
   s->h_begin = s->h;
@@ -125,16 +146,14 @@ map_init (struct ptx_stream *s)
   assert (s->h_next);
   assert (s->h_end);
+
+  return true;
 }
 
-static void
+static bool
 map_fini (struct ptx_stream *s)
 {
-  CUresult r;
-
-  r = cuMemFreeHost (s->h);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemFreeHost error: %s", cuda_error (r));
+  CUDA_CALL (cuMemFreeHost, s->h);
+  return true;
 }
 
 static void
@@ -325,7 +344,7 @@ nvptx_thread (void)
   return (struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
 }
 
-static void
+static bool
 init_streams_for_device (struct ptx_device *ptx_dev, int concurrency)
 {
   int i;
@@ -337,9 +356,10 @@ init_streams_for_device (struct ptx_device *ptx_de
   null_stream->multithreaded = true;
   null_stream->d = (CUdeviceptr) NULL;
   null_stream->h = NULL;
-  map_init (null_stream);
-  ptx_dev->null_stream = null_stream;
+  if (!map_init (null_stream))
+    return false;
 
+  ptx_dev->null_stream = null_stream;
   ptx_dev->active_streams = NULL;
   pthread_mutex_init (&ptx_dev->stream_lock, NULL);
@@ -355,25 +375,35 @@ init_streams_for_device (struct ptx_device *ptx_de
 
   for (i = 0; i < concurrency; i++)
     ptx_dev->async_streams.arr[i] = NULL;
+
+  return true;
 }
 
-static void
+static bool
 fini_streams_for_device (struct ptx_device *ptx_dev)
 {
   free (ptx_dev->async_streams.arr);
 
+  bool ret = true;
   while (ptx_dev->active_streams != NULL)
     {
      struct ptx_stream *s = ptx_dev->active_streams;
      ptx_dev->active_streams = ptx_dev->active_streams->next;
 
-      map_fini (s);
-      cuStreamDestroy (s->stream);
+      ret &= map_fini (s);
+
+      CUresult r = cuStreamDestroy (s->stream);
+      if (r != CUDA_SUCCESS)
+	{
+	  GOMP_PLUGIN_error ("cuStreamDestroy error: %s", cuda_error (r));
+	  ret = false;
+	}
      free (s);
     }
 
-  map_fini (ptx_dev->null_stream);
+  ret &= map_fini (ptx_dev->null_stream);
   free (ptx_dev->null_stream);
+  return ret;
 }
 
 /* Select a stream for (OpenACC-semantics) ASYNC argument for the current
@@ -447,7 +477,11 @@ select_stream_for_async (int async, pthread_t thre
 	{
 	  r = cuStreamCreate (&s->stream, CU_STREAM_DEFAULT);
 	  if (r != CUDA_SUCCESS)
-	    GOMP_PLUGIN_fatal ("cuStreamCreate error: %s", cuda_error (r));
+	    {
+	      pthread_mutex_unlock (&ptx_dev->stream_lock);
+	      GOMP_PLUGIN_fatal ("cuStreamCreate error: %s",
+				 cuda_error (r));
+	    }
 	}
 
       /* If CREATE is true, we're going to be queueing some work on this
@@ -457,7 +491,11 @@ select_stream_for_async (int async, pthread_t thre
       s->d = (CUdeviceptr) NULL;
       s->h = NULL;
-      map_init (s);
+      if (!map_init (s))
+	{
+	  pthread_mutex_unlock (&ptx_dev->stream_lock);
+	  GOMP_PLUGIN_fatal ("map_init fail");
+	}
 
       s->next = ptx_dev->active_streams;
       ptx_dev->active_streams = s;
@@ -467,7 +505,11 @@ select_stream_for_async (int async, pthread_t thre
       stream = ptx_dev->async_streams.arr[async];
     }
   else if (async < 0)
-    GOMP_PLUGIN_fatal ("bad async %d", async);
+    {
+      if (create)
+	pthread_mutex_unlock (&ptx_dev->stream_lock);
+      GOMP_PLUGIN_fatal ("bad async %d", async);
+    }
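The unlock-before-fatal pattern above is the heart of the deadlock fix:
GOMP_PLUGIN_fatal () can end up running the shutdown path, which wants to
take stream_lock again, so a fatal exit taken while still holding the lock
would self-deadlock.  As a general sketch (hypothetical function and failure
condition, for illustration only):

    static void
    locked_operation (struct ptx_device *ptx_dev)
    {
      pthread_mutex_lock (&ptx_dev->stream_lock);
      if (operation_failed)   /* hypothetical failure condition */
        {
          /* Drop the lock first: GOMP_PLUGIN_fatal may re-enter the
             plugin's cleanup code, which acquires stream_lock.  */
          pthread_mutex_unlock (&ptx_dev->stream_lock);
          GOMP_PLUGIN_fatal ("operation failed");
        }
      pthread_mutex_unlock (&ptx_dev->stream_lock);
    }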
 
   if (create)
     {
@@ -498,34 +540,25 @@ select_stream_for_async (int async, pthread_t thre
 static bool
 nvptx_init (void)
 {
-  CUresult r;
   int ndevs;
 
   if (instantiated_devices != 0)
     return true;
 
-  r = cuInit (0);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuInit error: %s", cuda_error (r));
-
+  CUDA_CALL (cuInit, 0);
   ptx_events = NULL;
-
   pthread_mutex_init (&ptx_event_lock, NULL);
 
-  r = cuDeviceGetCount (&ndevs);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuDeviceGetCount error: %s", cuda_error (r));
-
+  CUDA_CALL (cuDeviceGetCount, &ndevs);
   ptx_devices = GOMP_PLUGIN_malloc_cleared (sizeof (struct ptx_device *)
					     * ndevs);
-
   return true;
 }
 
 /* Select the N'th PTX device for the current host thread.  The device must
    have been previously opened before calling this function.  */
 
-static void
+static bool
 nvptx_attach_host_thread_to_device (int n)
 {
   CUdevice dev;
@@ -535,34 +568,34 @@ nvptx_attach_host_thread_to_device (int n)
 
   r = cuCtxGetDevice (&dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
-    GOMP_PLUGIN_fatal ("cuCtxGetDevice error: %s", cuda_error (r));
+    {
+      GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
+      return false;
+    }
 
   if (r != CUDA_ERROR_INVALID_CONTEXT && dev == n)
-    return;
+    return true;
   else
     {
       CUcontext old_ctx;
 
       ptx_dev = ptx_devices[n];
-      assert (ptx_dev);
+      if (!ptx_dev)
+	{
+	  GOMP_PLUGIN_error ("device %d not found", n);
+	  return false;
+	}
 
-      r = cuCtxGetCurrent (&thd_ctx);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
+      CUDA_CALL (cuCtxGetCurrent, &thd_ctx);
 
       /* We don't necessarily have a current context (e.g. if it has been
          destroyed.  Pop it if we do though.  */
       if (thd_ctx != NULL)
-	{
-	  r = cuCtxPopCurrent (&old_ctx);
-	  if (r != CUDA_SUCCESS)
-	    GOMP_PLUGIN_fatal ("cuCtxPopCurrent error: %s", cuda_error (r));
-	}
+	CUDA_CALL (cuCtxPopCurrent, &old_ctx);
 
-      r = cuCtxPushCurrent (ptx_dev->ctx);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuCtxPushCurrent error: %s", cuda_error (r));
+      CUDA_CALL (cuCtxPushCurrent, ptx_dev->ctx);
     }
+  return true;
 }
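As the comment above the macro definitions says, they do not capture every
case: cuCtxGetDevice here must tolerate CUDA_ERROR_INVALID_CONTEXT as well
as CUDA_SUCCESS, so it keeps an open-coded check.  The distinction, in
sketch form:

    /* Only CUDA_SUCCESS is acceptable: the macro applies.  */
    CUDA_CALL (cuCtxGetCurrent, &thd_ctx);

    /* A second status is also acceptable: open-code it, since the
       macro would treat CUDA_ERROR_INVALID_CONTEXT as an error.  */
    r = cuCtxGetDevice (&dev);
    if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
      {
        GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
        return false;
      }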
 
 static struct ptx_device *
@@ -573,9 +606,7 @@ nvptx_open_device (int n)
   CUresult r;
   int async_engines, pi;
 
-  r = cuDeviceGet (&dev, n);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuDeviceGet error: %s", cuda_error (r));
+  CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
 
   ptx_dev = GOMP_PLUGIN_malloc (sizeof (struct ptx_device));
@@ -585,60 +616,44 @@ nvptx_open_device (int n)
 
   r = cuCtxGetDevice (&ctx_dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
-    GOMP_PLUGIN_fatal ("cuCtxGetDevice error: %s", cuda_error (r));
+    {
+      GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
+      return NULL;
+    }
 
   if (r != CUDA_ERROR_INVALID_CONTEXT && ctx_dev != dev)
     {
       /* The current host thread has an active context for a different device.
          Detach it.  */
       CUcontext old_ctx;
-
-      r = cuCtxPopCurrent (&old_ctx);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuCtxPopCurrent error: %s", cuda_error (r));
+      CUDA_CALL_ERET (NULL, cuCtxPopCurrent, &old_ctx);
     }
 
-  r = cuCtxGetCurrent (&ptx_dev->ctx);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
+  CUDA_CALL_ERET (NULL, cuCtxGetCurrent, &ptx_dev->ctx);
 
   if (!ptx_dev->ctx)
-    {
-      r = cuCtxCreate (&ptx_dev->ctx, CU_CTX_SCHED_AUTO, dev);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuCtxCreate error: %s", cuda_error (r));
-    }
+    CUDA_CALL_ERET (NULL, cuCtxCreate, &ptx_dev->ctx, CU_CTX_SCHED_AUTO, dev);
   else
     ptx_dev->ctx_shared = true;
 
-  r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, dev);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r));
-
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+		  &pi, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, dev);
   ptx_dev->overlap = pi;
 
-  r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY, dev);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r));
-
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
		  &pi, CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY, dev);
   ptx_dev->map = pi;
 
-  r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS, dev);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r));
-
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
		  &pi, CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS, dev);
   ptx_dev->concur = pi;
 
-  r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, dev);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r));
-
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
		  &pi, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, dev);
   ptx_dev->mode = pi;
 
-  r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_INTEGRATED, dev);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r));
-
+  CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
		  &pi, CU_DEVICE_ATTRIBUTE_INTEGRATED, dev);
   ptx_dev->mkern = pi;
 
   r = cuDeviceGetAttribute (&async_engines,
@@ -649,38 +664,34 @@ nvptx_open_device (int n)
   ptx_dev->images = NULL;
   pthread_mutex_init (&ptx_dev->image_lock, NULL);
 
-  init_streams_for_device (ptx_dev, async_engines);
+  if (!init_streams_for_device (ptx_dev, async_engines))
+    return NULL;
 
   return ptx_dev;
 }
 
-static void
+static bool
 nvptx_close_device (struct ptx_device *ptx_dev)
 {
-  CUresult r;
-
   if (!ptx_dev)
-    return;
+    return true;
 
-  fini_streams_for_device (ptx_dev);
+  if (!fini_streams_for_device (ptx_dev))
+    return false;
 
   pthread_mutex_destroy (&ptx_dev->image_lock);
 
   if (!ptx_dev->ctx_shared)
-    {
-      r = cuCtxDestroy (ptx_dev->ctx);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuCtxDestroy error: %s", cuda_error (r));
-    }
+    CUDA_CALL (cuCtxDestroy, ptx_dev->ctx);
 
   free (ptx_dev);
+  return true;
 }
 
 static int
 nvptx_get_num_devices (void)
 {
   int n;
-  CUresult r;
 
   /* PR libgomp/65099: Currently, we only support offloading in 64-bit
      configurations.  */
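CUDA_CALL_ERET exists for functions whose failure value is not 'false'; the
error-return value simply follows the enclosing function's return type.  All
three forms used in this patch, side by side:

    CUDA_CALL (cuMemFreeHost, s->h);               /* bool: returns false */
    CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);   /* pointer: returns NULL */
    CUDA_CALL_ERET (-1, cuDeviceGetCount, &n);     /* int: returns -1 */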
@@ -693,22 +704,19 @@ nvptx_get_num_devices (void)
      further initialization).  */
   if (instantiated_devices == 0)
     {
-      r = cuInit (0);
+      CUresult r = cuInit (0);
       /* This is not an error: e.g. we may have CUDA libraries installed but
          no devices available.  */
       if (r != CUDA_SUCCESS)
         return 0;
     }
 
-  r = cuDeviceGetCount (&n);
-  if (r!= CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuDeviceGetCount error: %s", cuda_error (r));
-
+  CUDA_CALL_ERET (-1, cuDeviceGetCount, &n);
   return n;
 }
 
-static void
+static bool
 link_ptx (CUmodule *module, const struct targ_ptx_obj *ptx_objs,
	   unsigned num_objs)
 {
@@ -742,9 +750,7 @@ link_ptx (CUmodule *module, const struct targ_ptx_
   opts[5] = CU_JIT_LOG_VERBOSE;
   optvals[5] = (void *) 1;
 
-  r = cuLinkCreate (6, opts, optvals, &linkstate);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuLinkCreate error: %s", cuda_error (r));
+  CUDA_CALL (cuLinkCreate, 6, opts, optvals, &linkstate);
 
   for (; num_objs--; ptx_objs++)
     {
@@ -756,8 +762,9 @@ link_ptx (CUmodule *module, const struct targ_ptx_
       if (r != CUDA_SUCCESS)
	 {
	   GOMP_PLUGIN_error ("Link error log %s\n", &elog[0]);
-	   GOMP_PLUGIN_fatal ("cuLinkAddData (ptx_code) error: %s",
+	   GOMP_PLUGIN_error ("cuLinkAddData (ptx_code) error: %s",
			      cuda_error (r));
+	   return false;
	 }
     }
@@ -768,15 +775,14 @@ link_ptx (CUmodule *module, const struct targ_ptx_
   GOMP_PLUGIN_debug (0, "Link log %s\n", &ilog[0]);
 
   if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuLinkComplete error: %s", cuda_error (r));
+    {
+      GOMP_PLUGIN_error ("cuLinkComplete error: %s", cuda_error (r));
+      return false;
+    }
 
-  r = cuModuleLoadData (module, linkout);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuModuleLoadData error: %s", cuda_error (r));
-
-  r = cuLinkDestroy (linkstate);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuLinkDestory error: %s", cuda_error (r));
+  CUDA_CALL (cuModuleLoadData, module, linkout);
+  CUDA_CALL (cuLinkDestroy, linkstate);
+  return true;
 }
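With link_ptx returning bool, its caller propagates the failure instead of
dying inside the plugin; GOMP_OFFLOAD_load_image further below does exactly
this:

    if (!nvptx_attach_host_thread_to_device (ord)
        || !link_ptx (&module, img_header->ptx_objs, img_header->ptx_num))
      return -1;  /* Report load failure to libgomp.  */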
 
 static void
@@ -923,10 +929,8 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
 
   /* Copy the (device) pointers to arguments to the device (dp and hp might in
      fact have the same value on a unified-memory system).  */
-  r = cuMemcpy ((CUdeviceptr)dp, (CUdeviceptr)hp, mapnum * sizeof (void *));
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemcpy failed: %s", cuda_error (r));
-
+  CUDA_CALL_ASSERT (cuMemcpy, (CUdeviceptr) dp, (CUdeviceptr) hp,
+		    mapnum * sizeof (void *));
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
		      " gangs=%u, workers=%u, vectors=%u\n",
		      __FUNCTION__, targ_fn->launch->fn,
@@ -939,12 +943,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
   // vector length  ntid.x
 
   kargs[0] = &dp;
-  r = cuLaunchKernel (function,
-		      dims[GOMP_DIM_GANG], 1, 1,
-		      dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
-		      0, dev_str->stream, kargs, 0);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
+  CUDA_CALL_ASSERT (cuLaunchKernel, function,
+		    dims[GOMP_DIM_GANG], 1, 1,
+		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
+		    0, dev_str->stream, kargs, 0);
 
 #ifndef DISABLE_ASYNC
   if (async < acc_async_noval)
@@ -971,9 +973,7 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
 
       event_gc (true);
 
-      r = cuEventRecord (*e, dev_str->stream);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r));
+      CUDA_CALL_ASSERT (cuEventRecord, *e, dev_str->stream);
 
       event_add (PTX_EVT_KNL, e, (void *)dev_str);
     }
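nvptx_exec stays on CUDA_CALL_ASSERT: the launch path has no error-return
channel to its caller, so the macro just condenses the existing fatal
behavior.  Roughly (arguments elided for the sketch):

    CUDA_CALL_ASSERT (cuLaunchKernel, function, /* ... */);
    /* is shorthand for:  */
    unsigned __r = cuLaunchKernel (function, /* ... */);
    if (__r != CUDA_SUCCESS)
      GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (__r));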
@@ -1001,163 +1001,139 @@ static void *
 nvptx_alloc (size_t s)
 {
   CUdeviceptr d;
-  CUresult r;
 
-  r = cuMemAlloc (&d, s);
-  if (r == CUDA_ERROR_OUT_OF_MEMORY)
-    return 0;
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuda_error (r));
-  return (void *)d;
+  CUDA_CALL_ERET (NULL, cuMemAlloc, &d, s);
+  return (void *) d;
 }
 
-static void
+static bool
 nvptx_free (void *p)
 {
-  CUresult r;
   CUdeviceptr pb;
   size_t ps;
 
-  r = cuMemGetAddressRange (&pb, &ps, (CUdeviceptr)p);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemGetAddressRange error: %s", cuda_error (r));
+  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) p);
+  if ((CUdeviceptr) p != pb)
+    {
+      GOMP_PLUGIN_error ("invalid device address");
+      return false;
+    }
 
-  if ((CUdeviceptr)p != pb)
-    GOMP_PLUGIN_fatal ("invalid device address");
-
-  r = cuMemFree ((CUdeviceptr)p);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+  CUDA_CALL (cuMemFree, (CUdeviceptr) p);
+  return true;
 }
 
-static void *
+
+static bool
 nvptx_host2dev (void *d, const void *h, size_t s)
 {
-  CUresult r;
   CUdeviceptr pb;
   size_t ps;
   struct nvptx_thread *nvthd = nvptx_thread ();
 
   if (!s)
-    return 0;
-
+    return true;
   if (!d)
-    GOMP_PLUGIN_fatal ("invalid device address");
+    {
+      GOMP_PLUGIN_error ("invalid device address");
+      return false;
+    }
 
-  r = cuMemGetAddressRange (&pb, &ps, (CUdeviceptr)d);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemGetAddressRange error: %s", cuda_error (r));
+  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
 
   if (!pb)
-    GOMP_PLUGIN_fatal ("invalid device address");
-
+    {
+      GOMP_PLUGIN_error ("invalid device address");
+      return false;
+    }
   if (!h)
-    GOMP_PLUGIN_fatal ("invalid host address");
-
+    {
+      GOMP_PLUGIN_error ("invalid host address");
+      return false;
+    }
   if (d == h)
-    GOMP_PLUGIN_fatal ("invalid host or device address");
-
+    {
+      GOMP_PLUGIN_error ("invalid host or device address");
+      return false;
+    }
   if ((void *)(d + s) > (void *)(pb + ps))
-    GOMP_PLUGIN_fatal ("invalid size");
+    {
+      GOMP_PLUGIN_error ("invalid size");
+      return false;
+    }
 
 #ifndef DISABLE_ASYNC
   if (nvthd->current_stream != nvthd->ptx_dev->null_stream)
     {
-      CUevent *e;
-
-      e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
-
-      r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r));
-
+      CUevent *e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
+      CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
       event_gc (false);
-
-      r = cuMemcpyHtoDAsync ((CUdeviceptr)d, h, s,
-			     nvthd->current_stream->stream);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuMemcpyHtoDAsync error: %s", cuda_error (r));
-
-      r = cuEventRecord (*e, nvthd->current_stream->stream);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r));
-
+      CUDA_CALL (cuMemcpyHtoDAsync,
+		 (CUdeviceptr) d, h, s, nvthd->current_stream->stream);
+      CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream);
       event_add (PTX_EVT_MEM, e, (void *)h);
     }
   else
 #endif
-    {
-      r = cuMemcpyHtoD ((CUdeviceptr)d, h, s);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
-    }
+    CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) d, h, s);
 
-  return 0;
+  return true;
 }
 
-static void *
+static bool
 nvptx_dev2host (void *h, const void *d, size_t s)
 {
-  CUresult r;
   CUdeviceptr pb;
   size_t ps;
   struct nvptx_thread *nvthd = nvptx_thread ();
 
   if (!s)
-    return 0;
-
+    return true;
   if (!d)
-    GOMP_PLUGIN_fatal ("invalid device address");
+    {
+      GOMP_PLUGIN_error ("invalid device address");
+      return false;
+    }
 
-  r = cuMemGetAddressRange (&pb, &ps, (CUdeviceptr)d);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemGetAddressRange error: %s", cuda_error (r));
+  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
 
   if (!pb)
-    GOMP_PLUGIN_fatal ("invalid device address");
-
+    {
+      GOMP_PLUGIN_error ("invalid device address");
+      return false;
+    }
   if (!h)
-    GOMP_PLUGIN_fatal ("invalid host address");
-
+    {
+      GOMP_PLUGIN_error ("invalid host address");
+      return false;
+    }
   if (d == h)
-    GOMP_PLUGIN_fatal ("invalid host or device address");
-
+    {
+      GOMP_PLUGIN_error ("invalid host or device address");
+      return false;
+    }
   if ((void *)(d + s) > (void *)(pb + ps))
-    GOMP_PLUGIN_fatal ("invalid size");
+    {
+      GOMP_PLUGIN_error ("invalid size");
+      return false;
+    }
 
 #ifndef DISABLE_ASYNC
   if (nvthd->current_stream != nvthd->ptx_dev->null_stream)
     {
-      CUevent *e;
-
-      e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
-
-      r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuEventCreate error: %s\n", cuda_error (r));
-
+      CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
+      CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
       event_gc (false);
-
-      r = cuMemcpyDtoHAsync (h, (CUdeviceptr)d, s,
-			     nvthd->current_stream->stream);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuMemcpyDtoHAsync error: %s", cuda_error (r));
-
-      r = cuEventRecord (*e, nvthd->current_stream->stream);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r));
-
+      CUDA_CALL (cuMemcpyDtoHAsync,
+		 h, (CUdeviceptr) d, s, nvthd->current_stream->stream);
+      CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream);
       event_add (PTX_EVT_MEM, e, (void *)h);
     }
   else
 #endif
-    {
-      r = cuMemcpyDtoH (h, (CUdeviceptr)d, s);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
-    }
+    CUDA_CALL (cuMemcpyDtoH, h, (CUdeviceptr) d, s);
 
-  return 0;
+  return true;
 }
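For reference, the validation being converted in nvptx_host2dev and
nvptx_dev2host bound-checks the transfer against the allocation containing
the device address; condensed from the code above:

    CUdeviceptr pb;   /* base of the containing allocation */
    size_t ps;        /* its size */
    CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
    if ((void *)(d + s) > (void *)(pb + ps))
      {
        GOMP_PLUGIN_error ("invalid size");
        return false;
      }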
 
 static void
@@ -1227,17 +1203,13 @@ nvptx_async_test_all (void)
 static void
 nvptx_wait (int async)
 {
-  CUresult r;
   struct ptx_stream *s;
 
   s = select_stream_for_async (async, pthread_self (), false, NULL);
-
   if (!s)
     GOMP_PLUGIN_fatal ("unknown async %d", async);
 
-  r = cuStreamSynchronize (s->stream);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
+  CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
 
   event_gc (true);
 }
@@ -1245,7 +1217,6 @@ nvptx_wait (int async)
 static void
 nvptx_wait_async (int async1, int async2)
 {
-  CUresult r;
   CUevent *e;
   struct ptx_stream *s1, *s2;
   pthread_t self = pthread_self ();
@@ -1261,23 +1232,17 @@ nvptx_wait_async (int async1, int async2)
   if (s1 == s2)
     GOMP_PLUGIN_fatal ("identical parameters");
 
-  e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
+  e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
 
-  r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r));
+  CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
 
   event_gc (true);
 
-  r = cuEventRecord (*e, s1->stream);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r));
+  CUDA_CALL_ASSERT (cuEventRecord, *e, s1->stream);
 
   event_add (PTX_EVT_SYNC, e, NULL);
 
-  r = cuStreamWaitEvent (s2->stream, *e, 0);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuStreamWaitEvent error: %s", cuda_error (r));
+  CUDA_CALL_ASSERT (cuStreamWaitEvent, s2->stream, *e, 0);
 }
 
 static void
@@ -1302,9 +1267,7 @@ nvptx_wait_all (void)
	   else if (r != CUDA_ERROR_NOT_READY)
	     GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
 
-	  r = cuStreamSynchronize (s->stream);
-	  if (r != CUDA_SUCCESS)
-	    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
+	  CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
	 }
     }
@@ -1316,7 +1279,6 @@ nvptx_wait_all (void)
 static void
 nvptx_wait_all_async (int async)
 {
-  CUresult r;
   struct ptx_stream *waiting_stream, *other_stream;
   CUevent *e;
   struct nvptx_thread *nvthd = nvptx_thread ();
@@ -1346,20 +1308,14 @@ nvptx_wait_all_async (int async)
 
	   e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
 
-	  r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING);
-	  if (r != CUDA_SUCCESS)
-	    GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r));
+	  CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
 
	   /* Record an event on the waited-for stream.  */
-	  r = cuEventRecord (*e, other_stream->stream);
-	  if (r != CUDA_SUCCESS)
-	    GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r));
+	  CUDA_CALL_ASSERT (cuEventRecord, *e, other_stream->stream);
 
	   event_add (PTX_EVT_SYNC, e, NULL);
 
-	  r = cuStreamWaitEvent (waiting_stream->stream, *e, 0);
-	  if (r != CUDA_SUCCESS)
-	    GOMP_PLUGIN_fatal ("cuStreamWaitEvent error: %s", cuda_error (r));
+	  CUDA_CALL_ASSERT (cuStreamWaitEvent, waiting_stream->stream, *e, 0);
	 }
 
   pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
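The wait_async conversions above keep the usual CUDA idiom for making one
stream wait on another: record an event on the waited-for stream, then have
the waiting stream wait on that event.  Condensed from the code above:

    CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
    CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
    CUDA_CALL_ASSERT (cuEventRecord, *e, s1->stream);         /* waited-for */
    CUDA_CALL_ASSERT (cuStreamWaitEvent, s2->stream, *e, 0);  /* waiter */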
@@ -1408,11 +1364,11 @@ nvptx_set_cuda_stream (int async, void *stream)
   pthread_t self = pthread_self ();
   struct nvptx_thread *nvthd = nvptx_thread ();
 
-  pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
   if (async < 0)
     GOMP_PLUGIN_fatal ("bad async %d", async);
 
+  pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
+
   /* We have a list of active streams and an array mapping async values to
      entries of that list.  We need to take "ownership" of the passed-in
      stream, and add it to our list, removing the previous entry also (if
      there was one)
@@ -1435,8 +1391,11 @@ nvptx_set_cuda_stream (int async, void *stream)
	       s->next = s->next->next;
	     }
 
-	  cuStreamDestroy (oldstream->stream);
-	  map_fini (oldstream);
+	  CUDA_CALL_ASSERT (cuStreamDestroy, oldstream->stream);
+
+	  if (!map_fini (oldstream))
+	    GOMP_PLUGIN_fatal ("error when freeing host memory");
+
	   free (oldstream);
	 }
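Note the reordering in nvptx_set_cuda_stream: the bad-async check now runs
before pthread_mutex_lock, so that fatal path no longer exits while holding
stream_lock.  Schematically:

    if (async < 0)
      GOMP_PLUGIN_fatal ("bad async %d", async);  /* no lock held here */

    pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
    /* ... manipulate the stream list ... */
    pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);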
@@ -1473,37 +1432,50 @@ GOMP_OFFLOAD_get_num_devices (void)
   return nvptx_get_num_devices ();
 }
 
-void
+bool
 GOMP_OFFLOAD_init_device (int n)
 {
+  struct ptx_device *dev;
+
   pthread_mutex_lock (&ptx_dev_lock);
 
   if (!nvptx_init () || ptx_devices[n] != NULL)
     {
       pthread_mutex_unlock (&ptx_dev_lock);
-      return;
+      return false;
     }
 
-  ptx_devices[n] = nvptx_open_device (n);
-  instantiated_devices++;
+  dev = nvptx_open_device (n);
+  if (dev)
+    {
+      ptx_devices[n] = dev;
+      instantiated_devices++;
+    }
 
   pthread_mutex_unlock (&ptx_dev_lock);
+
+  return dev != NULL;
 }
 
-void
+bool
 GOMP_OFFLOAD_fini_device (int n)
 {
   pthread_mutex_lock (&ptx_dev_lock);
 
   if (ptx_devices[n] != NULL)
     {
-      nvptx_attach_host_thread_to_device (n);
-      nvptx_close_device (ptx_devices[n]);
+      if (!nvptx_attach_host_thread_to_device (n)
+	  || !nvptx_close_device (ptx_devices[n]))
+	{
+	  pthread_mutex_unlock (&ptx_dev_lock);
+	  return false;
+	}
       ptx_devices[n] = NULL;
       instantiated_devices--;
     }
 
   pthread_mutex_unlock (&ptx_dev_lock);
+  return true;
 }
 
 /* Return the libgomp version number we're compatible with.  There is
@@ -1526,7 +1498,6 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version
   const char *const *var_names;
   const struct targ_fn_launch *fn_descs;
   unsigned int fn_entries, var_entries, i, j;
-  CUresult r;
   struct targ_fn_descriptor *targ_fns;
   struct addr_pair *targ_tbl;
   const nvptx_tdata_t *img_header = (const nvptx_tdata_t *) target_data;
@@ -1534,18 +1505,19 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version
   struct ptx_device *dev;
 
   if (GOMP_VERSION_DEV (version) > GOMP_VERSION_NVIDIA_PTX)
-    GOMP_PLUGIN_fatal ("Offload data incompatible with PTX plugin"
-		       " (expected %u, received %u)",
-		       GOMP_VERSION_NVIDIA_PTX, GOMP_VERSION_DEV (version));
-
-  GOMP_OFFLOAD_init_device (ord);
+    {
+      GOMP_PLUGIN_error ("Offload data incompatible with PTX plugin"
+			 " (expected %u, received %u)",
+			 GOMP_VERSION_NVIDIA_PTX, GOMP_VERSION_DEV (version));
+      return -1;
+    }
 
+  if (!nvptx_attach_host_thread_to_device (ord)
+      || !link_ptx (&module, img_header->ptx_objs, img_header->ptx_num))
+    return -1;
+
   dev = ptx_devices[ord];
-
-  nvptx_attach_host_thread_to_device (ord);
-  link_ptx (&module, img_header->ptx_objs, img_header->ptx_num);
-
   /* The mkoffload utility emits a struct of pointers/integers at the
      start of each offload image.  The array of kernel names and the
      functions addresses form a one-to-one correspondence.  */
@@ -1576,9 +1548,8 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version
     {
       CUfunction function;
 
-      r = cuModuleGetFunction (&function, module, fn_descs[i].fn);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuModuleGetFunction error: %s", cuda_error (r));
+      CUDA_CALL_ERET (-1, cuModuleGetFunction, &function, module,
+		      fn_descs[i].fn);
 
       targ_fns->fn = function;
       targ_fns->launch = &fn_descs[i];
@@ -1592,9 +1563,8 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version
       CUdeviceptr var;
       size_t bytes;
 
-      r = cuModuleGetGlobal (&var, &bytes, module, var_names[j]);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+      CUDA_CALL_ERET (-1, cuModuleGetGlobal,
+		      &var, &bytes, module, var_names[j]);
 
       targ_tbl->start = (uintptr_t) var;
       targ_tbl->end = targ_tbl->start + bytes;
@@ -1606,54 +1576,58 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version
 
 /* Unload the program described by TARGET_DATA.  DEV_DATA is the
    function descriptors allocated by G_O_load_image.  */
 
-void
+bool
 GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
 {
   struct ptx_image_data *image, **prev_p;
   struct ptx_device *dev = ptx_devices[ord];
 
   if (GOMP_VERSION_DEV (version) > GOMP_VERSION_NVIDIA_PTX)
-    return;
-
+    return true;
+
+  bool ret = true;
   pthread_mutex_lock (&dev->image_lock);
   for (prev_p = &dev->images; (image = *prev_p) != 0; prev_p = &image->next)
     if (image->target_data == target_data)
       {
	 *prev_p = image->next;
-	 cuModuleUnload (image->module);
+	 if (cuModuleUnload (image->module) != CUDA_SUCCESS)
+	   ret = false;
	 free (image->fns);
	 free (image);
	 break;
       }
 
   pthread_mutex_unlock (&dev->image_lock);
+  return ret;
 }
 
 void *
 GOMP_OFFLOAD_alloc (int ord, size_t size)
 {
-  nvptx_attach_host_thread_to_device (ord);
+  if (!nvptx_attach_host_thread_to_device (ord))
+    return NULL;
   return nvptx_alloc (size);
 }
 
-void
+bool
 GOMP_OFFLOAD_free (int ord, void *ptr)
 {
-  nvptx_attach_host_thread_to_device (ord);
-  nvptx_free (ptr);
+  return (nvptx_attach_host_thread_to_device (ord)
	   && nvptx_free (ptr));
 }
 
-void *
+bool
 GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
 {
-  nvptx_attach_host_thread_to_device (ord);
-  return nvptx_dev2host (dst, src, n);
+  return (nvptx_attach_host_thread_to_device (ord)
	   && nvptx_dev2host (dst, src, n));
 }
 
-void *
+bool
 GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 {
-  nvptx_attach_host_thread_to_device (ord);
-  return nvptx_host2dev (dst, src, n);
+  return (nvptx_attach_host_thread_to_device (ord)
	   && nvptx_host2dev (dst, src, n));
 }
 
 void (*device_run) (int n, void *fn_ptr, void *vars) = NULL;
@@ -1669,20 +1643,11 @@ GOMP_OFFLOAD_openacc_parallel (void (*fn) (void *)
 void
 GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc)
 {
-  CUevent *e;
-  CUresult r;
   struct nvptx_thread *nvthd = nvptx_thread ();
+  CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
 
-  e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
-
-  r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r));
-
-  r = cuEventRecord (*e, nvthd->current_stream->stream);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r));
-
+  CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
+  CUDA_CALL_ASSERT (cuEventRecord, *e, nvthd->current_stream->stream);
   event_add (PTX_EVT_ASYNC_CLEANUP, e, targ_mem_desc);
 }
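The GOMP_OFFLOAD_* entry points now report failure through their bool/NULL
results rather than exiting inside the plugin; it is then up to libgomp
(adjusted elsewhere in this patch series) to decide how to react.  A
hypothetical caller-side sketch, not part of this patch:

    /* Hypothetical: libgomp-side shutdown checking the new result
       instead of the plugin calling exit () itself.  */
    if (!devicep->fini_device_func (devicep->target_id))
      gomp_error ("device finalization failed");  /* report and continue */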
@@ -1734,25 +1699,18 @@ GOMP_OFFLOAD_openacc_create_thread_data (int ord)
   struct ptx_device *ptx_dev;
   struct nvptx_thread *nvthd
     = GOMP_PLUGIN_malloc (sizeof (struct nvptx_thread));
-  CUresult r;
   CUcontext thd_ctx;
 
   ptx_dev = ptx_devices[ord];
 
   assert (ptx_dev);
 
-  r = cuCtxGetCurrent (&thd_ctx);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r));
+  CUDA_CALL_ASSERT (cuCtxGetCurrent, &thd_ctx);
 
   assert (ptx_dev->ctx);
 
   if (!thd_ctx)
-    {
-      r = cuCtxPushCurrent (ptx_dev->ctx);
-      if (r != CUDA_SUCCESS)
-	GOMP_PLUGIN_fatal ("cuCtxPushCurrent error: %s", cuda_error (r));
-    }
+    CUDA_CALL_ASSERT (cuCtxPushCurrent, ptx_dev->ctx);
 
   nvthd->current_stream = ptx_dev->null_stream;
   nvthd->ptx_dev = ptx_dev;
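The OpenACC thread-data hook above keeps CUDA_CALL_ASSERT as well: at this
point in the series it has no error-return protocol back to the library, so
a failure there still has to be fatal, just with less boilerplate than the
open-coded checks it replaces:

    /* No bool channel to the caller here, so assert-style it is.  */
    CUDA_CALL_ASSERT (cuCtxPushCurrent, ptx_dev->ctx);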