From patchwork Mon Jan 22 19:45:17 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Tobias Burnus
X-Patchwork-Id: 1889368
Message-ID: <30b08783-4f6d-4ae1-9459-9391fc8e6262@baylibre.com>
Date: Mon, 22 Jan 2024 20:45:17 +0100
To: gcc-patches, Thomas Schwinge, Jakub Jelinek
From: Tobias Burnus
Subject: [patch] plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]

Testing showed that libgomp.c/target-52.c failed with:

  libgomp: cuCtxGetDevice error: unknown cuda error
  libgomp: device finalization failed

This testcase uses OMP_DISPLAY_ENV=true and OMP_TARGET_OFFLOAD=mandatory,
and those env vars matter, i.e. it only fails if dg-set-target-env-var
is honored.

If both env vars are set, the device initialization occurs earlier,
because OMP_DISPLAY_ENV causes OMP_DEFAULT_DEVICE to be shown and its
value (when target-offload-var is 'mandatory') might be either
'omp_invalid_device' or '0'.

It turned out that this had an effect on device finalization, which
caused CUDA to shut down earlier than expected. This patch now handles
this case gracefully. For details, see the commit log message in the
attached patch and/or the PR.

Comments, remarks, suggestions?

Does this look sensible? (I would like to see some acknowledgement by
someone who feels more comfortable with CUDA than me.)

Tobias

plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]

The following issue was found when running libgomp.c/target-52.c with
nvptx offloading when the dg-set-target-env-var was honored. The issue
occurred both for -foffload=disable and with offloading configured when
an nvidia device is available.

At the end of the program, the offloading parts are shut down via two
means: the callback registered via 'atexit (gomp_target_fini)' and - via
code generated in mkoffload - the '__attribute__((destructor)) fini'
function that calls GOMP_offload_unregister_ver.

In normal processing, first gomp_target_fini is called - which then sets
GOMP_DEVICE_FINALIZED for the device - and later
GOMP_offload_unregister_ver, but that is then a no-op because the state
is GOMP_DEVICE_FINALIZED.

If both OMP_DISPLAY_ENV=true and OMP_TARGET_OFFLOAD="mandatory" are set,
the call to omp_display_env already invokes gomp_init_targets_once, i.e.
it occurs earlier than usual and is invoked via
__attribute__((constructor)) initialize_env.
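
To illustrate why the order of the two shutdown paths matters, here is a
minimal, self-contained C model of the state check. It is only a sketch:
the struct, the enum and the model_* functions are hypothetical
stand-ins, not libgomp code; only the two state names and the roles of
gomp_target_fini and GOMP_offload_unregister_ver are taken from the
description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for GOMP_DEVICE_INITIALIZED and
   GOMP_DEVICE_FINALIZED; not the libgomp definitions.  */
enum model_state { MODEL_INITIALIZED, MODEL_FINALIZED };

struct model_device
{
  enum model_state state;
  int images_unloaded;  /* Image unloads that reached the plugin.  */
};

/* Models the role of gomp_target_fini: finalize an initialized device
   and record the new state.  */
static void
model_target_fini (struct model_device *dev)
{
  if (dev->state == MODEL_INITIALIZED)
    dev->state = MODEL_FINALIZED;
}

/* Models the role of GOMP_offload_unregister_ver: unload the image only
   while the device is still in the initialized state.  */
static void
model_unregister (struct model_device *dev)
{
  if (dev->state == MODEL_INITIALIZED)
    dev->images_unloaded++;
}

/* Runs both shutdown paths in the requested order and returns how many
   image unloads reached the plugin.  */
static int
shutdown_unloads (bool unregister_first)
{
  struct model_device dev = { MODEL_INITIALIZED, 0 };

  if (unregister_first)
    {
      model_unregister (&dev);
      model_target_fini (&dev);
    }
  else
    {
      model_target_fini (&dev);
      model_unregister (&dev);
    }
  return dev.images_unloaded;
}
```

With the usual order, shutdown_unloads (false) yields 0 because the
unregister call sees a finalized device; with the other order,
shutdown_unloads (true) yields 1, i.e. the image unload reaches the
plugin before the device is finalized.
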

For unknown reasons, while this does not have an effect on the order of
the called plugin functions for initialization, it changes the order of
function calls for shutting down. Namely, when the two environment
variables are set, GOMP_offload_unregister_ver is now called before
gomp_target_fini.

And it seems as if CUDA regards a call to cuModuleUnload (or unloading
the last module?) as an indication that the device context should be
destroyed - or, at least, calling cuCtxGetDevice afterwards will return
CUDA_ERROR_DEINITIALIZED.

As the previous code in nvptx_attach_host_thread_to_device wasn't
expecting that result, it called

  GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));

causing a fatal error of the program.

This commit now handles CUDA_ERROR_DEINITIALIZED in a special way such
that GOMP_OFFLOAD_fini_device just works.

When reading the code, the following was observed in addition: When
gomp_fini_device is called, it invokes goacc_fini_asyncqueues to ensure
that the queue is emptied. It seems to make sense to do likewise for
GOMP_offload_unregister_ver, which this commit does in addition.

libgomp/ChangeLog:

        PR libgomp/113513
        * target.c (GOMP_offload_unregister_ver): Call
        goacc_fini_asyncqueues before invoking
        gomp_unload_image_from_device.
        * plugin/plugin-nvptx.c (nvptx_attach_host_thread_to_device):
        Change return type to int and return -1 for
        CUDA_ERROR_DEINITIALIZED.
        (GOMP_OFFLOAD_fini_device): Handle the latter gracefully.
        (nvptx_init, GOMP_OFFLOAD_load_image, GOMP_OFFLOAD_alloc,
        GOMP_OFFLOAD_host2dev, GOMP_OFFLOAD_dev2host,
        GOMP_OFFLOAD_memcpy2d, GOMP_OFFLOAD_memcpy3d,
        GOMP_OFFLOAD_openacc_async_host2dev,
        GOMP_OFFLOAD_openacc_async_dev2host): Update for return-type
        change.

Signed-off-by: Tobias Burnus

 libgomp/plugin/plugin-nvptx.c | 41 +++++++++++++++++++++++++----------------
 libgomp/target.c              |  7 +++++--
 2 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index c04c3acd679..dccbae44abd 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -382,9 +382,11 @@ nvptx_init (void)
 }
 
 /* Select the N'th PTX device for the current host thread.  The device must
-   have been previously opened before calling this function.  */
+   have been previously opened before calling this function.
+   Returns 1 if successful, 0 if an error occurred, and -1 for
+   CUDA_ERROR_DEINITIALIZED.  */
 
-static bool
+static int
 nvptx_attach_host_thread_to_device (int n)
 {
   CUdevice dev;
@@ -393,15 +395,17 @@ nvptx_attach_host_thread_to_device (int n)
   CUcontext thd_ctx;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev);
+  if (r == CUDA_ERROR_DEINITIALIZED)
+    return -1;
   if (r == CUDA_ERROR_NOT_PERMITTED)
     {
       /* Assume we're in a CUDA callback, just return true.  */
-      return true;
+      return 1;
     }
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
     {
       GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
-      return false;
+      return 0;
     }
 
   if (r != CUDA_ERROR_INVALID_CONTEXT && dev == n)
@@ -414,7 +418,7 @@ nvptx_attach_host_thread_to_device (int n)
       if (!ptx_dev)
         {
           GOMP_PLUGIN_error ("device %d not found", n);
-          return false;
+          return 0;
         }
 
       CUDA_CALL (cuCtxGetCurrent, &thd_ctx);
@@ -426,7 +430,7 @@ nvptx_attach_host_thread_to_device (int n)
 
       CUDA_CALL (cuCtxPushCurrent, ptx_dev->ctx);
     }
-  return true;
+  return 1;
 }
 
 static struct ptx_device *
@@ -1252,8 +1256,11 @@ GOMP_OFFLOAD_fini_device (int n)
 
   if (ptx_devices[n] != NULL)
     {
-      if (!nvptx_attach_host_thread_to_device (n)
-          || !nvptx_close_device (ptx_devices[n]))
+      /* Returns 1 if successful, 0 if an error occurred, and -1 for
+         CUDA_ERROR_DEINITIALIZED.  */
+      int r = nvptx_attach_host_thread_to_device (n);
+      if (r == 0
+          || (r == 1 && !nvptx_close_device (ptx_devices[n])))
         {
           pthread_mutex_unlock (&ptx_dev_lock);
           return false;
@@ -1329,7 +1336,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       return -1;
     }
 
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (nvptx_attach_host_thread_to_device (ord) != 1
       || !link_ptx (&module, img_header->ptx_objs, img_header->ptx_num))
     return -1;
 
@@ -1568,7 +1575,7 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
 void *
 GOMP_OFFLOAD_alloc (int ord, size_t size)
 {
-  if (!nvptx_attach_host_thread_to_device (ord))
+  if (nvptx_attach_host_thread_to_device (ord) != 1)
     return NULL;
 
   struct ptx_device *ptx_dev = ptx_devices[ord];
@@ -1837,7 +1844,7 @@ cuda_memcpy_sanity_check (const void *h, const void *d, size_t s)
 bool
 GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (nvptx_attach_host_thread_to_device (ord) != 1
       || !cuda_memcpy_sanity_check (src, dst, n))
     return false;
   CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) dst, src, n);
@@ -1847,7 +1854,7 @@ GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 bool
 GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (nvptx_attach_host_thread_to_device (ord) != 1
       || !cuda_memcpy_sanity_check (dst, src, n))
     return false;
   CUDA_CALL (cuMemcpyDtoH, dst, (CUdeviceptr) src, n);
@@ -1868,7 +1875,8 @@ GOMP_OFFLOAD_memcpy2d (int dst_ord, int src_ord, size_t dim1_size,
                        const void *src, size_t src_offset1_size,
                        size_t src_offset0_len, size_t src_dim1_size)
 {
-  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord))
+  if (nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord)
+      != 1)
     return false;
 
   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
@@ -1960,7 +1968,8 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
                        size_t src_offset0_len, size_t src_dim2_size,
                        size_t src_dim1_len)
 {
-  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord))
+  if (nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord)
+      != 1)
     return false;
 
   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
@@ -2050,7 +2059,7 @@ bool
 GOMP_OFFLOAD_openacc_async_host2dev (int ord, void *dst, const void *src,
                                      size_t n, struct goacc_asyncqueue *aq)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (nvptx_attach_host_thread_to_device (ord) != 1
       || !cuda_memcpy_sanity_check (src, dst, n))
     return false;
   CUDA_CALL (cuMemcpyHtoDAsync, (CUdeviceptr) dst, src, n, aq->cuda_stream);
@@ -2061,7 +2070,7 @@ bool
 GOMP_OFFLOAD_openacc_async_dev2host (int ord, void *dst, const void *src,
                                      size_t n, struct goacc_asyncqueue *aq)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (nvptx_attach_host_thread_to_device (ord) != 1
       || !cuda_memcpy_sanity_check (dst, src, n))
     return false;
   CUDA_CALL (cuMemcpyDtoHAsync, dst, (CUdeviceptr) src, n, aq->cuda_stream);
diff --git a/libgomp/target.c b/libgomp/target.c
index 1367e9cce6c..8d05877deb7 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2706,8 +2706,11 @@ GOMP_offload_unregister_ver (unsigned version, const void *host_table,
       gomp_mutex_lock (&devicep->lock);
       if (devicep->type == target_type
           && devicep->state == GOMP_DEVICE_INITIALIZED)
-        gomp_unload_image_from_device (devicep, version,
-                                       host_table, target_data);
+        {
+          goacc_fini_asyncqueues (devicep);
+          gomp_unload_image_from_device (devicep, version,
+                                         host_table, target_data);
+        }
       gomp_mutex_unlock (&devicep->lock);
     }
 
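
As a rough illustration of the return-value convention used in the patch
(1 for success, 0 for an error, -1 for CUDA_ERROR_DEINITIALIZED), the
following self-contained C sketch models it without CUDA. The
mock_cu_result enum and all function names are hypothetical; they only
mimic the roles of CUresult, nvptx_attach_host_thread_to_device,
nvptx_close_device and GOMP_OFFLOAD_fini_device and are not the plugin
code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for CUresult values; the real ones come from
   cuda.h.  */
enum mock_cu_result { MOCK_SUCCESS, MOCK_ERROR_DEINITIALIZED, MOCK_ERROR_OTHER };

/* Mimics the tri-state convention: 1 if successful, 0 if an error
   occurred, and -1 if CUDA has already been shut down.  */
static int
mock_attach_host_thread (enum mock_cu_result r)
{
  if (r == MOCK_ERROR_DEINITIALIZED)
    return -1;
  if (r != MOCK_SUCCESS)
    return 0;
  return 1;
}

/* Mimics the finalization path.  CLOSE_OK simulates the result of
   closing the device; *CLOSE_ATTEMPTED records whether closing was
   tried at all.  */
static bool
mock_fini_device (enum mock_cu_result r, bool close_ok, bool *close_attempted)
{
  int attached = mock_attach_host_thread (r);

  *close_attempted = false;
  if (attached == 0)
    return false;
  if (attached == 1)
    {
      *close_attempted = true;
      if (!close_ok)
        return false;
    }
  /* attached == -1: already shut down, so there is nothing to close.  */
  return true;
}

/* Mimics every other caller: anything but 1 counts as failure.  */
static bool
mock_alloc_allowed (enum mock_cu_result r)
{
  return mock_attach_host_thread (r) == 1;
}
```

In this model, only the finalization path treats -1 as success (and
skips closing the device), while a caller such as mock_alloc_allowed
still fails for it, which matches the '!= 1' checks in the diff.
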