[v2,1/6] occ: Wait if OCC GPU presence status not immediately available

Message ID 543316e7a0efa5d60fe6196d4aa1ed6a5cbef9e5.1535359753.git-series.andrew.donnellan@au1.ibm.com
State Superseded
Series OpenCAPI support for Witherspoon

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success master/apply_patch Successfully applied

Commit Message

Andrew Donnellan Aug. 27, 2018, 8:55 a.m. UTC
It takes a few seconds for the OCC to set everything up in order to read
GPU presence. At present, we try to kick off OCC initialisation as early as
possible to maximise the time it has to read GPU presence.

Unfortunately sometimes that's not enough, so add a loop in
occ_get_gpu_presence() so that on the first time we try to get GPU presence
we keep trying for up to 2 seconds. Experimentally this seems to be
adequate.

Fixes: 9b394a32c8ea ("occ: Add support for GPU presence detection")
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
---
 hw/occ.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)
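
The retry budget described above (20 attempts, 100 ms apart, so about 2 seconds) can be illustrated outside skiboot with a small simulation. This is only a sketch under stated assumptions: `fake_occ`, `fake_occ_data`, `fake_get_data()`, `fake_wait_ms()` and `wait_for_presence_data()` are hypothetical stand-ins invented for illustration, not skiboot APIs. Only the version test (major 0, minor at least 1) and the check, wait, decrement, re-read ordering are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the version fields of the OCC dynamic data. */
struct fake_occ_data {
	int major_version;
	int minor_version;
};

/* Hypothetical simulated OCC: data becomes valid after `ready_after` reads. */
struct fake_occ {
	int polls;       /* number of reads performed so far */
	int ready_after; /* reads that return "not ready" before data is valid */
	int waited_ms;   /* total simulated wait time */
};

/* Simulated read: reports version 0.1 once enough reads have happened. */
static struct fake_occ_data fake_get_data(struct fake_occ *occ)
{
	struct fake_occ_data d = { 0xff, 0 };

	occ->polls++;
	if (occ->polls > occ->ready_after) {
		d.major_version = 0;
		d.minor_version = 1;
	}
	return d;
}

/* Simulated delay: records the time instead of sleeping. */
static void fake_wait_ms(struct fake_occ *occ, int ms)
{
	occ->waited_ms += ms;
}

/*
 * Returns true if presence data became available within the budget.
 * Same ordering as the patch: check, wait, decrement, re-read. The
 * data read after the final wait is not checked again.
 */
static bool wait_for_presence_data(struct fake_occ *occ, int max_retries,
				   int delay_ms)
{
	struct fake_occ_data d;

	assert(max_retries >= 0 && delay_ms >= 0);

	d = fake_get_data(occ);
	while (max_retries) {
		if (d.major_version == 0 && d.minor_version >= 1)
			return true;
		fake_wait_ms(occ, delay_ms);
		max_retries--;
		d = fake_get_data(occ);
	}
	return false;
}
```

With `max_retries = 20` and `delay_ms = 100`, a simulated OCC that never becomes ready accumulates 2000 ms of waiting before the function gives up, which matches the 2 second figure in the commit message.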

Comments

Frederic Barrat Aug. 29, 2018, 12:44 p.m. UTC | #1
On 27/08/2018 at 10:55, Andrew Donnellan wrote:
> It takes a few seconds for the OCC to set everything up in order to read
> GPU presence. At present, we try to kick off OCC initialisation as early as
> possible to maximise the time it has to read GPU presence.
> 
> Unfortunately sometimes that's not enough, so add a loop in
> occ_get_gpu_presence() so that on the first time we try to get GPU presence
> we keep trying for up to 2 seconds. Experimentally this seems to be
> adequate.
> 
> Fixes: 9b394a32c8ea ("occ: Add support for GPU presence detection")
> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> ---
>   hw/occ.c | 18 +++++++++++++++---
>   1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/occ.c b/hw/occ.c
> index a55bf8ed4f54..9fcac3f9581c 100644
> --- a/hw/occ.c
> +++ b/hw/occ.c
> @@ -1238,14 +1238,26 @@ exit:
>   bool occ_get_gpu_presence(struct proc_chip *chip, int gpu_num)
>   {
>   	struct occ_dynamic_data *ddata;
> +	static int max_retries = 20;
> +	static bool found = false;
> 
>   	assert(gpu_num <= 2);
> 
>   	ddata = get_occ_dynamic_data(chip);
> -
> -	if (ddata->major_version != 0 || ddata->minor_version < 1) {
> +	while (!found && max_retries) {
> +		if (ddata->major_version == 0 && ddata->minor_version >= 1) {
> +			found = true;
> +			break;
> +		}
>   		prlog(PR_INFO, "OCC: OCC not reporting GPU slot presence, "
> -		      "assuming device is present\n");
> +		      "waiting\n");

Do we really want to print up to 20 times the same message?
Other than that:
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>


> +		time_wait_ms(100);
> +		max_retries--;
> +		ddata = get_occ_dynamic_data(chip);
> +	}
> +
> +	if (!found) {
> +		prlog(PR_INFO, "OCC: No GPU slot presence, assuming GPU present\n");
>   		return true;
>   	}
>
Andrew Donnellan Aug. 30, 2018, 3:17 a.m. UTC | #2
On 29/08/18 22:44, Frederic Barrat wrote:
>>           prlog(PR_INFO, "OCC: OCC not reporting GPU slot presence, "
>> -              "assuming device is present\n");
>> +              "waiting\n");
> 
> Do we really want to print up to 20 times the same message?

Argh, Rashmica had pointed that out to me even before I sent v1 and I 
forgot to fix it :)
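
One possible way to address the repeated message is to log it only on the first failed poll. This is a sketch under assumptions and not the fix that was actually submitted in a later revision: `log_info()`, `ready_after_n()`, `wait_logging_once()` and the `messages_logged` counter are hypothetical names introduced for illustration, and the real code would call `prlog()` and `time_wait_ms()` instead.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter so the number of log lines can be observed. */
static int messages_logged;

/* Hypothetical stand-in for prlog(PR_INFO, ...). */
static void log_info(const char *msg)
{
	messages_logged++;
	fputs(msg, stderr);
}

/* Hypothetical poll callback: reports "ready" once *ctx has counted down to 0. */
static bool ready_after_n(void *ctx)
{
	int *remaining = ctx;

	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}

/*
 * Polls up to max_retries times and returns true as soon as poll() succeeds.
 * The "waiting" message is emitted at most once, however many polls fail.
 */
static bool wait_logging_once(bool (*poll)(void *), void *ctx, int max_retries)
{
	bool logged = false;

	while (max_retries-- > 0) {
		if (poll(ctx))
			return true;
		if (!logged) {
			log_info("OCC: OCC not reporting GPU slot presence, waiting\n");
			logged = true;
		}
		/* A real caller would delay here, e.g. 100 ms per attempt. */
	}
	return false;
}
```

A local flag is enough here because the message only needs to be suppressed within a single waiting period. If the data is ready on the first poll, nothing is logged at all.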

Patch

diff --git a/hw/occ.c b/hw/occ.c
index a55bf8ed4f54..9fcac3f9581c 100644
--- a/hw/occ.c
+++ b/hw/occ.c
@@ -1238,14 +1238,26 @@  exit:
 bool occ_get_gpu_presence(struct proc_chip *chip, int gpu_num)
 {
 	struct occ_dynamic_data *ddata;
+	static int max_retries = 20;
+	static bool found = false;
 
 	assert(gpu_num <= 2);
 
 	ddata = get_occ_dynamic_data(chip);
-
-	if (ddata->major_version != 0 || ddata->minor_version < 1) {
+	while (!found && max_retries) {
+		if (ddata->major_version == 0 && ddata->minor_version >= 1) {
+			found = true;
+			break;
+		}
 		prlog(PR_INFO, "OCC: OCC not reporting GPU slot presence, "
-		      "assuming device is present\n");
+		      "waiting\n");
+		time_wait_ms(100);
+		max_retries--;
+		ddata = get_occ_dynamic_data(chip);
+	}
+
+	if (!found) {
+		prlog(PR_INFO, "OCC: No GPU slot presence, assuming GPU present\n");
 		return true;
 	}