diff mbox series

GCN: Even with 'GCN_SUPPRESS_HOST_FALLBACK' set, failure to 'init_hsa_runtime_functions' is not fatal (was: [PATCH 7/7 libgomp,amdgcn] GCN Libgomp Plugin)

Message ID 87y1aue0vd.fsf@euler.schwinge.ddns.net
State New
Headers show
Series GCN: Even with 'GCN_SUPPRESS_HOST_FALLBACK' set, failure to 'init_hsa_runtime_functions' is not fatal (was: [PATCH 7/7 libgomp,amdgcn] GCN Libgomp Plugin) | expand

Commit Message

Thomas Schwinge March 7, 2024, 11:29 a.m. UTC
Hi!

On 2019-11-12T13:29:16+0000, Andrew Stubbs <ams@codesourcery.com> wrote:
> This patch contributes the GCN libgomp plugin, with the various
> configure and make bits to go with it.

An issue with libgomp GCN plugin 'GCN_SUPPRESS_HOST_FALLBACK' (which is
different from the libgomp-level host-fallback execution):

> --- /dev/null
> +++ b/libgomp/plugin/plugin-gcn.c

> +/* Flag to decide if the runtime should suppress a possible fallback to host
> +   execution.  */
> +
> +static bool suppress_host_fallback;

> +static void
> +init_environment_variables (void)
> +{
> +  [...]
> +  if (secure_getenv ("GCN_SUPPRESS_HOST_FALLBACK"))
> +    suppress_host_fallback = true;
> +  else
> +    suppress_host_fallback = false;

> +/* Return true if the HSA runtime can run function FN_PTR.  */
> +
> +bool
> +GOMP_OFFLOAD_can_run (void *fn_ptr)
> +{
> +  struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
> +
> +  init_kernel (kernel);
> +  if (kernel->initialization_failed)
> +    goto failure;
> +
> +  return true;
> +
> +failure:
> +  if (suppress_host_fallback)
> +    GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
> +  GCN_WARNING ("GCN target cannot be launched, doing a host fallback\n");
> +  return false;
> +}

This originates in the libgomp HSA plugin, where the idea was -- in my
understanding -- that you wouldn't have device code available for all
'fn_ptr's, and in that case transparently (shared-memory system!) do
host-fallback execution.  Or, with 'GCN_SUPPRESS_HOST_FALLBACK' set,
you'd get those diagnosed.

This has then been copied into the libgomp GCN plugin (see above).
However, is it really still applicable there; don't we assume that we're
generating device code for all relevant functions?  (I suppose everyone
really is testing with 'GCN_SUPPRESS_HOST_FALLBACK' set?)  Should we thus
actually remove 'suppress_host_fallback' (that is, make it
always-'true'), including removal of the 'can_run' hook?  (I suppose that
even in a future shared-memory "GCN" configuration, we're not expecting
to use this again; expecting always-'true' for 'can_run'?)


Now my actual issue: the libgomp GCN plugin then invented an additional
use of 'GCN_SUPPRESS_HOST_FALLBACK':

> +/* Initialize hsa_context if it has not already been done.
> +   Return TRUE on success.  */
> +
> +static bool
> +init_hsa_context (void)
> +{
> +  hsa_status_t status;
> +  int agent_index = 0;
> +
> +  if (hsa_context.initialized)
> +    return true;
> +  init_environment_variables ();
> +  if (!init_hsa_runtime_functions ())
> +    {
> +      GCN_WARNING ("Run-time could not be dynamically opened\n");
> +      if (suppress_host_fallback)
> +	GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
> +      return false;
> +    }

That is, if 'GCN_SUPPRESS_HOST_FALLBACK' is (routinely) set (for its
original purpose), and you have the libgomp GCN plugin configured, but
don't have 'libhsa-runtime64.so.1' available, you run into a fatal error.

The libgomp nvptx plugin in such cases silently disables the
plugin/device (and thus lets libgomp proper do its thing), and I propose
we do the same here.  OK to push the attached
"GCN: Even with 'GCN_SUPPRESS_HOST_FALLBACK' set, failure to 'init_hsa_runtime_functions' is not fatal"?


Grüße
 Thomas

Comments

Andrew Stubbs March 7, 2024, 11:38 a.m. UTC | #1
On 07/03/2024 11:29, Thomas Schwinge wrote:
> Hi!
> 
> On 2019-11-12T13:29:16+0000, Andrew Stubbs <ams@codesourcery.com> wrote:
>> This patch contributes the GCN libgomp plugin, with the various
>> configure and make bits to go with it.
> 
> An issue with libgomp GCN plugin 'GCN_SUPPRESS_HOST_FALLBACK' (which is
> different from the libgomp-level host-fallback execution):
> 
>> --- /dev/null
>> +++ b/libgomp/plugin/plugin-gcn.c
> 
>> +/* Flag to decide if the runtime should suppress a possible fallback to host
>> +   execution.  */
>> +
>> +static bool suppress_host_fallback;
> 
>> +static void
>> +init_environment_variables (void)
>> +{
>> +  [...]
>> +  if (secure_getenv ("GCN_SUPPRESS_HOST_FALLBACK"))
>> +    suppress_host_fallback = true;
>> +  else
>> +    suppress_host_fallback = false;
> 
>> +/* Return true if the HSA runtime can run function FN_PTR.  */
>> +
>> +bool
>> +GOMP_OFFLOAD_can_run (void *fn_ptr)
>> +{
>> +  struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
>> +
>> +  init_kernel (kernel);
>> +  if (kernel->initialization_failed)
>> +    goto failure;
>> +
>> +  return true;
>> +
>> +failure:
>> +  if (suppress_host_fallback)
>> +    GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
>> +  GCN_WARNING ("GCN target cannot be launched, doing a host fallback\n");
>> +  return false;
>> +}
> 
> This originates in the libgomp HSA plugin, where the idea was -- in my
> understanding -- that you wouldn't have device code available for all
> 'fn_ptr's, and in that case transparently (shared-memory system!) do
> host-fallback execution.  Or, with 'GCN_SUPPRESS_HOST_FALLBACK' set,
> you'd get those diagnosed.
> 
> This has then been copied into the libgomp GCN plugin (see above).
> However, is it really still applicable there; don't we assume that we're
> generating device code for all relevant functions?  (I suppose everyone
> really is testing with 'GCN_SUPPRESS_HOST_FALLBACK' set?)  Should we thus
> actually remove 'suppress_host_fallback' (that is, make it
> always-'true'), including removal of the 'can_run' hook?  (I suppose that
> even in a future shared-memory "GCN" configuration, we're not expecting
> to use this again; expecting always-'true' for 'can_run'?)
> 
> 
> Now my actual issue: the libgomp GCN plugin then invented an additional
> use of 'GCN_SUPPRESS_HOST_FALLBACK':
> 
>> +/* Initialize hsa_context if it has not already been done.
>> +   Return TRUE on success.  */
>> +
>> +static bool
>> +init_hsa_context (void)
>> +{
>> +  hsa_status_t status;
>> +  int agent_index = 0;
>> +
>> +  if (hsa_context.initialized)
>> +    return true;
>> +  init_environment_variables ();
>> +  if (!init_hsa_runtime_functions ())
>> +    {
>> +      GCN_WARNING ("Run-time could not be dynamically opened\n");
>> +      if (suppress_host_fallback)
>> +	GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
>> +      return false;
>> +    }
> 
> That is, if 'GCN_SUPPRESS_HOST_FALLBACK' is (routinely) set (for its
> original purpose), and you have the libgomp GCN plugin configured, but
> don't have 'libhsa-runtime64.so.1' available, you run into a fatal error.
> 
> The libgomp nvptx plugin in such cases silently disables the
> plugin/device (and thus lets libgomp proper do its thing), and I propose
> we do the same here.  OK to push the attached
> "GCN: Even with 'GCN_SUPPRESS_HOST_FALLBACK' set, failure to 'init_hsa_runtime_functions' is not fatal"?

If you try to run the offload testsuite on a device that is not properly 
configured then we want FAIL, not pass-via-fallback. You're breaking that.

Andrew
Tobias Burnus March 7, 2024, 11:43 a.m. UTC | #2
Hi,

Thomas Schwinge wrote:
> An issue with libgomp GCN plugin 'GCN_SUPPRESS_HOST_FALLBACK' (which is
> different from the libgomp-level host-fallback execution):
>> +failure:
>> +  if (suppress_host_fallback)
>> +    GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
>> +  GCN_WARNING ("GCN target cannot be launched, doing a host fallback\n");
>> +  return false;
>> +}
>
> This originates in the libgomp HSA plugin, where the idea was -- in my
> understanding -- that you wouldn't have device code available for all
> 'fn_ptr's, and in that case transparently (shared-memory system!) do
> host-fallback execution.  Or, with 'GCN_SUPPRESS_HOST_FALLBACK' set,
> you'd get those diagnosed.
>
> This has then been copied into the libgomp GCN plugin (see above).
> However, is it really still applicable there; don't we assume that we're
> generating device code for all relevant functions?  (I suppose everyone
> really is testing with 'GCN_SUPPRESS_HOST_FALLBACK' set?)

First, I think most users do not set GCN_SUPPRESS_HOST_FALLBACK – and it 
is also not really desirable.

If I run on my Linux system the system compiler with nvptx + gcn suppost 
installed, I get (with a nvptx permission problem):

$ GCN_SUPPRESS_HOST_FALLBACK=1 ./a.out

libgomp: GCN host fallback has been suppressed

And exit code = 1. The same result with '-foffload=disable' or with 
'-foffload=nvptx-none'.

> Should we thus
> actually remove 'suppress_host_fallback' (that is, make it
> always-'true'),

If we want to remove it, we can make it always false - but I am strongly 
against making it always true.

Use OMP_TARGET_OFFLOAD=mandatory (or that GCN env) if you want to 
prevent the host fallback, but don't break somewhat common systems.

Tobias
Thomas Schwinge March 7, 2024, 1:37 p.m. UTC | #3
Hi Andrew!

On 2024-03-07T11:38:27+0000, Andrew Stubbs <ams@baylibre.com> wrote:
> On 07/03/2024 11:29, Thomas Schwinge wrote:
>> On 2019-11-12T13:29:16+0000, Andrew Stubbs <ams@codesourcery.com> wrote:
>>> This patch contributes the GCN libgomp plugin, with the various
>>> configure and make bits to go with it.
>> 
>> An issue with libgomp GCN plugin 'GCN_SUPPRESS_HOST_FALLBACK' (which is
>> different from the libgomp-level host-fallback execution):
>> 
>>> --- /dev/null
>>> +++ b/libgomp/plugin/plugin-gcn.c
>> 
>>> +/* Flag to decide if the runtime should suppress a possible fallback to host
>>> +   execution.  */
>>> +
>>> +static bool suppress_host_fallback;
>> 
>>> +static void
>>> +init_environment_variables (void)
>>> +{
>>> +  [...]
>>> +  if (secure_getenv ("GCN_SUPPRESS_HOST_FALLBACK"))
>>> +    suppress_host_fallback = true;
>>> +  else
>>> +    suppress_host_fallback = false;
>> 
>>> +/* Return true if the HSA runtime can run function FN_PTR.  */
>>> +
>>> +bool
>>> +GOMP_OFFLOAD_can_run (void *fn_ptr)
>>> +{
>>> +  struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
>>> +
>>> +  init_kernel (kernel);
>>> +  if (kernel->initialization_failed)
>>> +    goto failure;
>>> +
>>> +  return true;
>>> +
>>> +failure:
>>> +  if (suppress_host_fallback)
>>> +    GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
>>> +  GCN_WARNING ("GCN target cannot be launched, doing a host fallback\n");
>>> +  return false;
>>> +}
>> 
>> This originates in the libgomp HSA plugin, where the idea was -- in my
>> understanding -- that you wouldn't have device code available for all
>> 'fn_ptr's, and in that case transparently (shared-memory system!) do
>> host-fallback execution.  Or, with 'GCN_SUPPRESS_HOST_FALLBACK' set,
>> you'd get those diagnosed.
>> 
>> This has then been copied into the libgomp GCN plugin (see above).
>> However, is it really still applicable there; don't we assume that we're
>> generating device code for all relevant functions?  (I suppose everyone
>> really is testing with 'GCN_SUPPRESS_HOST_FALLBACK' set?)  Should we thus
>> actually remove 'suppress_host_fallback' (that is, make it
>> always-'true'), including removal of the 'can_run' hook?  (I suppose that
>> even in a future shared-memory "GCN" configuration, we're not expecting
>> to use this again; expecting always-'true' for 'can_run'?)
>> 
>> 
>> Now my actual issue: the libgomp GCN plugin then invented an additional
>> use of 'GCN_SUPPRESS_HOST_FALLBACK':
>> 
>>> +/* Initialize hsa_context if it has not already been done.
>>> +   Return TRUE on success.  */
>>> +
>>> +static bool
>>> +init_hsa_context (void)
>>> +{
>>> +  hsa_status_t status;
>>> +  int agent_index = 0;
>>> +
>>> +  if (hsa_context.initialized)
>>> +    return true;
>>> +  init_environment_variables ();
>>> +  if (!init_hsa_runtime_functions ())
>>> +    {
>>> +      GCN_WARNING ("Run-time could not be dynamically opened\n");
>>> +      if (suppress_host_fallback)
>>> +	GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
>>> +      return false;
>>> +    }
>> 
>> That is, if 'GCN_SUPPRESS_HOST_FALLBACK' is (routinely) set (for its
>> original purpose), and you have the libgomp GCN plugin configured, but
>> don't have 'libhsa-runtime64.so.1' available, you run into a fatal error.
>> 
>> The libgomp nvptx plugin in such cases silently disables the
>> plugin/device (and thus lets libgomp proper do its thing), and I propose
>> we do the same here.  OK to push the attached
>> "GCN: Even with 'GCN_SUPPRESS_HOST_FALLBACK' set, failure to 'init_hsa_runtime_functions' is not fatal"?
>
> If you try to run the offload testsuite on a device that is not properly 
> configured then we want FAIL

Exactly, and that's what I'm working towards.  (Currently we're not
implementing that properly.)

But why is 'GCN_SUPPRESS_HOST_FALLBACK' controlling
'init_hsa_runtime_functions' relevant for that?  As you know, that
function only deals with dynamically loading 'libhsa-runtime64.so.1', and
Failure to load that one (because it doesn't exist) should have the
agreed-upon behavior of *not* raising an error.  (Any other, later errors
should be fatal, I certainly agree.)

> not pass-via-fallback. You're breaking that.

Sorry, I don't follow, please explain?


Grüße
 Thomas
Thomas Schwinge March 7, 2024, 1:37 p.m. UTC | #4
Hi Tobias!

On 2024-03-07T12:43:07+0100, Tobias Burnus <tburnus@baylibre.com> wrote:
> Thomas Schwinge wrote:
>> An issue with libgomp GCN plugin 'GCN_SUPPRESS_HOST_FALLBACK' (which is
>> different from the libgomp-level host-fallback execution):
>>> +failure:
>>> +  if (suppress_host_fallback)
>>> +    GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
>>> +  GCN_WARNING ("GCN target cannot be launched, doing a host fallback\n");
>>> +  return false;
>>> +}
>>
>> This originates in the libgomp HSA plugin, where the idea was -- in my
>> understanding -- that you wouldn't have device code available for all
>> 'fn_ptr's, and in that case transparently (shared-memory system!) do
>> host-fallback execution.  Or, with 'GCN_SUPPRESS_HOST_FALLBACK' set,
>> you'd get those diagnosed.
>>
>> This has then been copied into the libgomp GCN plugin (see above).
>> However, is it really still applicable there; don't we assume that we're
>> generating device code for all relevant functions?  (I suppose everyone
>> really is testing with 'GCN_SUPPRESS_HOST_FALLBACK' set?)
>
> First, I think most users do not set GCN_SUPPRESS_HOST_FALLBACK – and it 
> is also not really desirable.

External users probably don't, but certainly all our internal testing is
setting it, and also implicitly all nvptx offloading testing: simply by
means of having such knob in the libgomp nvptx plugin.  That is, the
libgomp nvptx plugin has an implicit 'suppress_host_fallback = true' for
(the original meaning of) that flag (and does not have the "init"-error
behavior that I consider bogus, and try to remove from the libgomp GCN
plugin).

And, one step back: how is (the original meaning of)
'suppress_host_fallback = false' even supposed to work on non-shared
memory systems as currently implemented by the libgomp GCN plugin?

> If I run on my Linux system the system compiler with nvptx + gcn suppost 
> installed, I get (with a nvptx permission problem):
>
> $ GCN_SUPPRESS_HOST_FALLBACK=1 ./a.out
>
> libgomp: GCN host fallback has been suppressed
>
> And exit code = 1. The same result with '-foffload=disable' or with 
> '-foffload=nvptx-none'.

I can't tell if that's what you expect to see there, or not?

(For avoidance of doubt: I'm expecting silent host-fallback execution in
case that libgomp GCN and/or nvptx plugins are available, but no
corresponding devices.  That's what my patch achieves.)

>> Should we thus
>> actually remove 'suppress_host_fallback' (that is, make it
>> always-'true'),
>
> If we want to remove it, we can make it always false - but I am strongly 
> against making it always true.

I'm confused.  So you want the GCN and nvptx plugins to behave
differently in that regard?  What is the rationale for that?  In
particular also regarding this whole concept of dynamic plugin-level
host-fallback execution being in conflict with our current non-shared
memory system configurations?


> Use OMP_TARGET_OFFLOAD=mandatory (or that GCN env) if you want to 
> prevent the host fallback, but don't break somewhat common systems.

That's an orthogonal concept?


Grüße
 Thomas
Andrew Stubbs March 7, 2024, 1:47 p.m. UTC | #5
On 07/03/2024 13:37, Thomas Schwinge wrote:
> Hi Andrew!
> 
> On 2024-03-07T11:38:27+0000, Andrew Stubbs <ams@baylibre.com> wrote:
>> On 07/03/2024 11:29, Thomas Schwinge wrote:
>>> On 2019-11-12T13:29:16+0000, Andrew Stubbs <ams@codesourcery.com> wrote:
>>>> This patch contributes the GCN libgomp plugin, with the various
>>>> configure and make bits to go with it.
>>>
>>> An issue with libgomp GCN plugin 'GCN_SUPPRESS_HOST_FALLBACK' (which is
>>> different from the libgomp-level host-fallback execution):
>>>
>>>> --- /dev/null
>>>> +++ b/libgomp/plugin/plugin-gcn.c
>>>
>>>> +/* Flag to decide if the runtime should suppress a possible fallback to host
>>>> +   execution.  */
>>>> +
>>>> +static bool suppress_host_fallback;
>>>
>>>> +static void
>>>> +init_environment_variables (void)
>>>> +{
>>>> +  [...]
>>>> +  if (secure_getenv ("GCN_SUPPRESS_HOST_FALLBACK"))
>>>> +    suppress_host_fallback = true;
>>>> +  else
>>>> +    suppress_host_fallback = false;
>>>
>>>> +/* Return true if the HSA runtime can run function FN_PTR.  */
>>>> +
>>>> +bool
>>>> +GOMP_OFFLOAD_can_run (void *fn_ptr)
>>>> +{
>>>> +  struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
>>>> +
>>>> +  init_kernel (kernel);
>>>> +  if (kernel->initialization_failed)
>>>> +    goto failure;
>>>> +
>>>> +  return true;
>>>> +
>>>> +failure:
>>>> +  if (suppress_host_fallback)
>>>> +    GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
>>>> +  GCN_WARNING ("GCN target cannot be launched, doing a host fallback\n");
>>>> +  return false;
>>>> +}
>>>
>>> This originates in the libgomp HSA plugin, where the idea was -- in my
>>> understanding -- that you wouldn't have device code available for all
>>> 'fn_ptr's, and in that case transparently (shared-memory system!) do
>>> host-fallback execution.  Or, with 'GCN_SUPPRESS_HOST_FALLBACK' set,
>>> you'd get those diagnosed.
>>>
>>> This has then been copied into the libgomp GCN plugin (see above).
>>> However, is it really still applicable there; don't we assume that we're
>>> generating device code for all relevant functions?  (I suppose everyone
>>> really is testing with 'GCN_SUPPRESS_HOST_FALLBACK' set?)  Should we thus
>>> actually remove 'suppress_host_fallback' (that is, make it
>>> always-'true'), including removal of the 'can_run' hook?  (I suppose that
>>> even in a future shared-memory "GCN" configuration, we're not expecting
>>> to use this again; expecting always-'true' for 'can_run'?)
>>>
>>>
>>> Now my actual issue: the libgomp GCN plugin then invented an additional
>>> use of 'GCN_SUPPRESS_HOST_FALLBACK':
>>>
>>>> +/* Initialize hsa_context if it has not already been done.
>>>> +   Return TRUE on success.  */
>>>> +
>>>> +static bool
>>>> +init_hsa_context (void)
>>>> +{
>>>> +  hsa_status_t status;
>>>> +  int agent_index = 0;
>>>> +
>>>> +  if (hsa_context.initialized)
>>>> +    return true;
>>>> +  init_environment_variables ();
>>>> +  if (!init_hsa_runtime_functions ())
>>>> +    {
>>>> +      GCN_WARNING ("Run-time could not be dynamically opened\n");
>>>> +      if (suppress_host_fallback)
>>>> +	GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
>>>> +      return false;
>>>> +    }
>>>
>>> That is, if 'GCN_SUPPRESS_HOST_FALLBACK' is (routinely) set (for its
>>> original purpose), and you have the libgomp GCN plugin configured, but
>>> don't have 'libhsa-runtime64.so.1' available, you run into a fatal error.
>>>
>>> The libgomp nvptx plugin in such cases silently disables the
>>> plugin/device (and thus lets libgomp proper do its thing), and I propose
>>> we do the same here.  OK to push the attached
>>> "GCN: Even with 'GCN_SUPPRESS_HOST_FALLBACK' set, failure to 'init_hsa_runtime_functions' is not fatal"?
>>
>> If you try to run the offload testsuite on a device that is not properly
>> configured then we want FAIL
> 
> Exactly, and that's what I'm working towards.  (Currently we're not
> implementing that properly.)
> 
> But why is 'GCN_SUPPRESS_HOST_FALLBACK' controlling
> 'init_hsa_runtime_functions' relevant for that?  As you know, that
> function only deals with dynamically loading 'libhsa-runtime64.so.1', and
> Failure to load that one (because it doesn't exist) should have the
> agreed-upon behavior of *not* raising an error.  (Any other, later errors
> should be fatal, I certainly agree.)
> 
>> not pass-via-fallback. You're breaking that.
> 
> Sorry, I don't follow, please explain?

If the plugin load fails then libgomp will run in host-fallback. In that 
case, IIRC, this is the *only* opportunity we get to enforce 
GCN_SUPPRESS_HOST_FALLBACK. As far as I'm aware, that variable is 
internal, undocumented, meant for dev testing only. It says "I'm testing 
GCN features and if they're not working then I want to know about it."

Users should be using official OMP features.

Andrew
Tobias Burnus March 7, 2024, 2:07 p.m. UTC | #6
Hi Thomas,

first, I have the feeling we talk about (more or less) the same code 
region and use the same words – but we talk about rather different 
things. Thus, you confuse me (and possibly Andrew) – and my reply 
confuses you.

Thomas Schwinge wrote:
> On 2024-03-07T12:43:07+0100, Tobias Burnus<tburnus@baylibre.com>  wrote:
>> Thomas Schwinge wrote:
>>> First, I think most users do not set GCN_SUPPRESS_HOST_FALLBACK – and it
>>> is also not really desirable.
> External users probably don't, but certainly all our internal testing is
> setting it,

First, I doubt it – secondly, if it were true, it was broken for the 
last 5 years or so as we definitely did not notice fails due to not 
working offload devices. – Neither for AMD GCN nor ...

> and also implicitly all nvptx offloading testing: simply by
> means of having such knob in the libgomp nvptx plugin.

I did see it at some places set for AMD but I do not see any 
nvptx-specific environment variable which permits to do the same.

However:
>   That is, the
> libgomp nvptx plugin has an implicit 'suppress_host_fallback = true' for
> (the original meaning of) that flag

I think that's one of the problems here – you talk about 
suppress_host_fallback (implicit, original meaning), while I talk about 
the GCN_SUPPRESS_HOST_FALLBACK environment variable.

Besides all the talk about suppress_host_fallback, 
'init_hsa_runtime_functions' is not fatal' of the subject line seems to 
be something to be considered (beyond the patches you already suggested).


>> If I run on my Linux system the system compiler with nvptx + gcn suppost
>> installed, I get (with a nvptx permission problem):
>>
>> $ GCN_SUPPRESS_HOST_FALLBACK=1 ./a.out
>>
>> libgomp: GCN host fallback has been suppressed
>>
>> And exit code = 1. The same result with '-foffload=disable' or with
>> '-foffload=nvptx-none'.
> I can't tell if that's what you expect to see there, or not?

Well, obviously not that I get this error by default – and as your 
wording indicated that the internal variable will be always true – and 
not only when the env var GCN_SUPPRESS_HOST_FALLBACK is explicit set, I 
worry that I would get the error any time.

> (For avoidance of doubt: I'm expecting silent host-fallback execution in
> case that libgomp GCN and/or nvptx plugins are available, but no
> corresponding devices.  That's what my patch achieves.)

I concur that the silent host fallback should happen by default (unless 
env vars tell otherwise) - at least when either no code was generated 
for the device (e.g. -foffload=disable) or when the vendor runtime 
library is not available or no device (be it no hardware or no permission).

That's the current behavior and if that remains, my main concern evaporates.

* * *

>> If we want to remove it, we can make it always false - but I am strongly
>> against making it always true.
> I'm confused.  So you want the GCN and nvptx plugins to behave
> differently in that regard?
No – or at least: not unless GCN_SUPPRESS_HOST_FALLBACK is set.
>> Use OMP_TARGET_OFFLOAD=mandatory (or that GCN env) if you want to
>> prevent the host fallback, but don't break somewhat common systems.
> That's an orthogonal concept?

No – It's the same concept of the main use of the 
GCN_SUPPRESS_HOST_FALLBACK environment variable: You get a run-time 
error instead of a silent host fallback.

But I have in the whole thread the feeling that – while talking about 
the same code region and throwing in the same words – we actually talk 
about completely different things.

Tobias
Thomas Schwinge March 8, 2024, 10:21 a.m. UTC | #7
Hi!

On 2024-03-07T15:07:32+0100, Tobias Burnus <tburnus@baylibre.com> wrote:
> first, I have the feeling we talk about (more or less) the same code 
> region and use the same words – but we talk about rather different 
> things. Thus, you confuse me (and possibly Andrew) – and my reply 
> confuses you.

That, indeed, is my impression, too.  :-/

And actually the biggest confusion seems to be that both you would like
'GCN_SUPPRESS_HOST_FALLBACK' to mean something else than
'HSA_SUPPRESS_HOST_FALLBACK' originally meant.

Hopefully the
"GCN: The original meaning of 'GCN_SUPPRESS_HOST_FALLBACK' isn't applicable (non-shared memory system)"
does clarify that.


Just to close this out, let's try again for the other discussion items:

> Thomas Schwinge wrote:
>> On 2024-03-07T12:43:07+0100, Tobias Burnus<tburnus@baylibre.com>  wrote:
>>> Thomas Schwinge wrote:
>>>> First, I think most users do not set GCN_SUPPRESS_HOST_FALLBACK – and it
>>>> is also not really desirable.
>> External users probably don't, but certainly all our internal testing is
>> setting it,
>
> First, I doubt it

'git grep --cached GCN_SUPPRESS_HOST_FALLBACK' in our internal scripts is
your friend.

> secondly, if it were true, it was broken for the 
> last 5 years or so as we definitely did not notice fails due to not 
> working offload devices. – Neither for AMD GCN nor ...

You're saying that 'GCN_SUPPRESS_HOST_FALLBACK=1' doesn't report as fatal
certain errors during device probing?  That's not what the code as well
as my experience says.

>> and also implicitly all nvptx offloading testing: simply by
>> means of having ["no" missing here -- sorry!] such knob in the libgomp nvptx plugin.
>
> I did see it at some places set for AMD but I do not see any 
> nvptx-specific environment variable which permits to do the same.

Right, that was confusing: there was a "no" missing in that sentence --
sorry!

> However:
>>   That is, the
>> libgomp nvptx plugin has an implicit 'suppress_host_fallback = true' for
>> (the original meaning of) that flag
>
> I think that's one of the problems here – you talk about 
> suppress_host_fallback (implicit, original meaning), while I talk about 
> the GCN_SUPPRESS_HOST_FALLBACK environment variable.

The 'suppress_host_fallback' internal variable directly corresponds to
the 'GCN_SUPPRESS_HOST_FALLBACK' environment variable.

> Besides all the talk about suppress_host_fallback, 
> 'init_hsa_runtime_functions' is not fatal' of the subject line seems to 
> be something to be considered (beyond the patches you already suggested).

I'll next submit "GCN, nvptx: Errors during device probing are fatal".

>>> If I run on my Linux system the system compiler with nvptx + gcn suppost
>>> installed, I get (with a nvptx permission problem):
>>>
>>> $ GCN_SUPPRESS_HOST_FALLBACK=1 ./a.out
>>>
>>> libgomp: GCN host fallback has been suppressed
>>>
>>> And exit code = 1. The same result with '-foffload=disable' or with
>>> '-foffload=nvptx-none'.
>> I can't tell if that's what you expect to see there, or not?
>
> Well, obviously

In this discussion thread here, nothing was obvious to my anymore...  ;-|

> not that I get this error by default – and as your 
> wording indicated that the internal variable will be always true

That always-'true' suggestion was only for the *original* meaning of the
variable: the use in 'GOMP_OFFLOAD_can_run'.

> – and 
> not only when the env var GCN_SUPPRESS_HOST_FALLBACK is explicit set, I 
> worry that I would get the error any time.

That was exactly the point of my patch in this thread: to get rid of the
*additional*/*new* behavior that the libgomp GCN plugin derives from
'GCN_SUPPRESS_HOST_FALLBACK', different from what
'HSA_SUPPRESS_HOST_FALLBACK' originally meant.

However, I now understand that Andrew would like to keep that *new*
behavior.

>> (For avoidance of doubt: I'm expecting silent host-fallback execution in
>> case that libgomp GCN and/or nvptx plugins are available, but no
>> corresponding devices.  That's what my patch achieves.)
>
> I concur that the silent host fallback should happen by default (unless 
> env vars tell otherwise) - at least when either no code was generated 
> for the device (e.g. -foffload=disable) or when the vendor runtime 
> library is not available or no device (be it no hardware or no permission).
>
> That's the current behavior and if that remains, my main concern evaporates.

ACK, thanks.


Grüße
 Thomas


>>> If we want to remove it, we can make it always false - but I am strongly
>>> against making it always true.
>> I'm confused.  So you want the GCN and nvptx plugins to behave
>> differently in that regard?
> No – or at least: not unless GCN_SUPPRESS_HOST_FALLBACK is set.
>>> Use OMP_TARGET_OFFLOAD=mandatory (or that GCN env) if you want to
>>> prevent the host fallback, but don't break somewhat common systems.
>> That's an orthogonal concept?
>
> No – It's the same concept of the main use of the 
> GCN_SUPPRESS_HOST_FALLBACK environment variable: You get a run-time 
> error instead of a silent host fallback.
>
> But I have in the whole thread the feeling that – while talking about 
> the same code region and throwing in the same words – we actually talk 
> about completely different things.
>
> Tobias
diff mbox series

Patch

From f037d2d8274940f042633a0ecb18a53942c075f5 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <tschwinge@baylibre.com>
Date: Thu, 7 Mar 2024 10:43:15 +0100
Subject: [PATCH] GCN: Even with 'GCN_SUPPRESS_HOST_FALLBACK' set, failure to
 'init_hsa_runtime_functions' is not fatal

'GCN_SUPPRESS_HOST_FALLBACK' controls the libgomp GCN plugin's capability to
transparently use host-fallback execution for certain device functions; it
shouldn't control failure of libgomp GCN plugin initialization (which libgomp
handles fine: triggering use of a different plugin/device, or general
host-fallback execution, or fatal error, as applicable).

	libgomp/
	* plugin/plugin-gcn.c (init_hsa_context): Even with
	'GCN_SUPPRESS_HOST_FALLBACK' set, failure to
	'init_hsa_runtime_functions' is not fatal.
---
 libgomp/plugin/plugin-gcn.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 2771123252a..464164afb03 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1524,8 +1524,6 @@  init_hsa_context (void)
   if (!init_hsa_runtime_functions ())
     {
       GCN_WARNING ("Run-time could not be dynamically opened\n");
-      if (suppress_host_fallback)
-	GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
       return false;
     }
   status = hsa_fns.hsa_init_fn ();
-- 
2.34.1