From patchwork Fri Mar 8 10:34:33 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Thomas Schwinge
X-Patchwork-Id: 1909583
From: Thomas Schwinge
To: Andrew Stubbs, gcc-patches@gcc.gnu.org, Jakub Jelinek
Cc: Richard Biener, Tobias Burnus
Subject: GCN, nvptx: Errors during device probing are fatal (was: Stabilizing flaky libgomp GCN target/offloading testing)
In-Reply-To: <87il2ij8sm.fsf@euler.schwinge.ddns.net>
References: <20240124124304.1780645-1-ams@baylibre.com> <78875q15-qq2n-45o2-nooo-59r0s0ss9031@fhfr.qr> <56rn2n3n-n340-n6on-6prr-soqpr9r7083q@fhfr.qr> <878r44l00i.fsf@euler.schwinge.ddns.net> <7sn70594-70r4-q5pp-7q5p-qr865r9q53qn@fhfr.qr> <87il2ij8sm.fsf@euler.schwinge.ddns.net>
User-Agent: Notmuch/0.29.3+94~g74c3f1b (https://notmuchmail.org) Emacs/29.1 (x86_64-pc-linux-gnu)
Date: Fri, 08 Mar 2024 11:34:33 +0100
Message-ID: <871q8l6mh2.fsf@euler.schwinge.ddns.net>
List-Id: Gcc-patches mailing list

Hi!

On 2024-02-21T13:34:01+0100, I wrote:
> On 2024-02-01T15:49:02+0100, Richard Biener wrote:
>> On Thu, 1 Feb 2024, Thomas Schwinge wrote:
>>> [...] what I
>>> got with '-march=gfx1100' for AMD Radeon RX 7900 XTX. [...]
>
>>> [...] execution test FAILs. Not all FAILs appear all the time [...]
>
> What disturbs the testing a lot is, that the GPU may get into a bad
> state, upon which any use either fails with a
> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
> 'libhsa-runtime64.so.1'...

So, there's a "fun" aspect: if we run into
'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (or other errors; and similar in the
libgomp nvptx plugin) during libgomp GCN plugin device probing, then it's
not fatal, but instead silently disables the libgomp plugin/device, thus
typically silently resorting to host-fallback execution. That's not
helpful behavior in my opinion, so I propose the attached "GCN, nvptx:
Errors during device probing are fatal". OK to push?

(That's also the behavior that's implemented in both the GCN and nvptx
target 'run' tools.)


Grüße
 Thomas


From 0dc72089dccc10d3b55096ade5fc4d72de6cb96f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge
Date: Thu, 7 Mar 2024 14:42:07 +0100
Subject: [PATCH] GCN, nvptx: Errors during device probing are fatal

Currently, we silently disable libgomp GCN and nvptx plugins/devices in
presence of certain error conditions during device probing, thus typically
silently resorting to host-fallback execution.
Make such errors fatal, as is already the case for any later device
access, so that we notice early and reliably when things go wrong. (Keep
just two cases non-fatal: (a) the libgomp GCN or nvptx plugin is available
but 'libhsa-runtime64.so.1' or 'libcuda.so.1' is not, and (b) those
libraries are available, but the corresponding devices are not.)

This resolves the issue that we've got execution test cases unexpectedly
PASSing, despite:

    libgomp: GCN fatal error: Run-time could not be initialized
    Runtime message: HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

..., and therefore they were not offloaded to the GCN device, but ran in
host-fallback execution mode. What happened in that scenario is that in
'init_hsa_context' during the initial 'GOMP_OFFLOAD_get_num_devices' we
ran into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', which was not fatal but just
silently disabled the libgomp plugin/device.

Especially "entertaining" were cases where such unintended host-fallback
execution happened during effective-target checks like
'offload_device_available' (host-fallback execution there meaning: no
offload device available), but actual test cases then were running with an
offload device available, and therefore mis-configured.

	include/
	* cuda/cuda.h (CUresult): Add 'CUDA_ERROR_NO_DEVICE'.
	libgomp/
	* plugin/plugin-gcn.c (init_hsa_context): Add and handle
	'bool probe' parameter. Adjust all users; errors during device
	probing are fatal.
	* plugin/plugin-nvptx.c (nvptx_get_num_devices): Aside from
	'CUDA_ERROR_NO_DEVICE', errors during device probing are fatal.
---
 include/cuda/cuda.h           |  1 +
 libgomp/plugin/plugin-gcn.c   | 14 ++++++++------
 libgomp/plugin/plugin-nvptx.c |  4 +++-
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 114aba4e074..0dca4b3a5c0 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -57,6 +57,7 @@ typedef enum {
   CUDA_ERROR_OUT_OF_MEMORY = 2,
   CUDA_ERROR_NOT_INITIALIZED = 3,
   CUDA_ERROR_DEINITIALIZED = 4,
+  CUDA_ERROR_NO_DEVICE = 100,
   CUDA_ERROR_INVALID_CONTEXT = 201,
   CUDA_ERROR_INVALID_HANDLE = 400,
   CUDA_ERROR_NOT_FOUND = 500,
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 7e141a85f31..2bea9157e9d 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1511,10 +1511,12 @@ assign_agent_ids (hsa_agent_t agent, void *data)
 }
 
 /* Initialize hsa_context if it has not already been done.
-   Return TRUE on success. */
+   If !PROBE: returns TRUE on success.
+   If PROBE: returns TRUE on success or if the plugin/device shall be silently
+   ignored, and otherwise emits an error and returns FALSE. */
 
 static bool
-init_hsa_context (void)
+init_hsa_context (bool probe)
 {
   hsa_status_t status;
   int agent_index = 0;
@@ -1529,7 +1531,7 @@ init_hsa_context (void)
         GOMP_PLUGIN_fatal ("%s\n", msg);
       else
         GCN_WARNING ("%s\n", msg);
-      return false;
+      return probe ? true : false;
     }
   status = hsa_fns.hsa_init_fn ();
   if (status != HSA_STATUS_SUCCESS)
@@ -3321,8 +3323,8 @@ GOMP_OFFLOAD_version (void)
 int
 GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 {
-  if (!init_hsa_context ())
-    return 0;
+  if (!init_hsa_context (true))
+    exit (EXIT_FAILURE);
   /* Return -1 if no omp_requires_mask cannot be fulfilled but
      devices were present. */
   if (hsa_context.agent_count > 0
@@ -3339,7 +3341,7 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 bool
 GOMP_OFFLOAD_init_device (int n)
 {
-  if (!init_hsa_context ())
+  if (!init_hsa_context (false))
     return false;
   if (n >= hsa_context.agent_count)
     {
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 81b4a7f499a..ba92a3a48cb 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -622,12 +622,14 @@ nvptx_get_num_devices (void)
       CUresult r = CUDA_CALL_NOCHECK (cuInit, 0);
       /* This is not an error: e.g. we may have CUDA libraries installed but
          no devices available. */
-      if (r != CUDA_SUCCESS)
+      if (r == CUDA_ERROR_NO_DEVICE)
         {
           GOMP_PLUGIN_debug (0, "Disabling nvptx offloading; cuInit: %s\n",
                              cuda_error (r));
           return 0;
         }
+      else if (r != CUDA_SUCCESS)
+        GOMP_PLUGIN_fatal ("cuInit error: %s", cuda_error (r));
     }
 
   CUDA_CALL_ASSERT (cuDeviceGetCount, &n);
-- 
2.34.1
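
For illustration only, here is a minimal standalone sketch of the probing
policy described in the commit message: a missing runtime library or a
missing device silently yields zero devices, and any other probing error is
treated as fatal. It is not libgomp code. All names ('probe_status',
'probe_action', 'classify_probe_result', 'probe_num_devices') are
hypothetical stand-ins, and the status values only model the real HSA/CUDA
codes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for what a runtime may report during probing;
   these are NOT the real hsa_status_t or CUresult values.  */
enum probe_status
{
  PROBE_OK,
  PROBE_NO_RUNTIME,       /* Runtime shared library is not installed.  */
  PROBE_NO_DEVICE,        /* Runtime is present, but no device is.  */
  PROBE_OUT_OF_RESOURCES  /* Any "real" error, for example a wedged GPU.  */
};

enum probe_action
{
  ACTION_USE_DEVICES,
  ACTION_IGNORE_SILENTLY,
  ACTION_FATAL
};

/* Map a probing result to the policy from the commit message: only the two
   "nothing to offload to" cases are ignored, everything else is fatal.  */
static enum probe_action
classify_probe_result (enum probe_status status)
{
  switch (status)
    {
    case PROBE_OK:
      return ACTION_USE_DEVICES;
    case PROBE_NO_RUNTIME:
    case PROBE_NO_DEVICE:
      return ACTION_IGNORE_SILENTLY;
    default:
      return ACTION_FATAL;
    }
}

/* Store the device count to report in *NUM_DEVICES and return true, or
   return false if probing hit a fatal error and the caller should abort
   instead of falling back to host execution.  */
static bool
probe_num_devices (enum probe_status status, int devices_found,
                   int *num_devices)
{
  assert (num_devices != NULL);
  switch (classify_probe_result (status))
    {
    case ACTION_USE_DEVICES:
      *num_devices = devices_found;
      return true;
    case ACTION_IGNORE_SILENTLY:
      *num_devices = 0;
      return true;
    case ACTION_FATAL:
      fprintf (stderr, "probe error: status %d\n", (int) status);
      return false;
    }
  return false;
}
```

In this sketch a caller would exit with a failure status when
'probe_num_devices' returns false, which mirrors the 'exit (EXIT_FAILURE)'
in the GCN hunk above; how the real plugins report the error is as shown in
the patch, not here.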