[v2,1/2] powerpc/eeh: fix pseries_eeh_configure_bridge()

Message ID	074529df859e2aae5ee1683e567f708b65e3558d.1587361657.git.sbobroff@linux.ibm.com (mailing list archive)
State	Changes Requested

Headers

From: Sam Bobroff <sbobroff@linux.ibm.com>
To: linuxppc-dev@lists.ozlabs.org
Subject: [PATCH v2 1/2] powerpc/eeh: fix pseries_eeh_configure_bridge()
Date: Mon, 20 Apr 2020 15:47:39 +1000
In-Reply-To: <cover.1587361657.git.sbobroff@linux.ibm.com>
References: <cover.1587361657.git.sbobroff@linux.ibm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Message-Id: 
 <074529df859e2aae5ee1683e567f708b65e3558d.1587361657.git.sbobroff@linux.ibm.com>
Precedence: list
Cc: Oliver O'Halloran <oohall@gmail.com>
Errors-To: linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org
Sender: "Linuxppc-dev"
 <linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org>

Series

powerpc/eeh: Release EEH device state synchronously

Checks

Context	Check	Description
snowpatch_ozlabs/apply_patch	success	Successfully applied on branch powerpc/merge (a9aa21d05c33c556e48c5062b6632a9b94906570)
snowpatch_ozlabs/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 8 lines checked
snowpatch_ozlabs/needsstable	success	Patch has no Fixes tags

Commit Message

Sam Bobroff April 20, 2020, 5:47 a.m. UTC

If a device is hot unplugged during EEH recovery, it's possible for the
RTAS call to ibm,configure-pe in pseries_eeh_configure_bridge() to
return parameter error (-3). However, negative return values are not
checked for, and this leads to an infinite loop.

Fix this by correctly bailing out on negative values.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/eeh_pseries.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
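
The failure mode described above can be illustrated with a small user-space model of the retry loop. This is only a sketch: rtas_call() is replaced by a fixed status, the constants are assumed to match the kernel's RTAS definitions, RTAS_PARAM_ERROR is a hypothetical name for the -3 status, and model_busy_delay_time() is a simplified stand-in for rtas_busy_delay_time() that is assumed to return 0 for any status that is not a busy or extended-delay code. The iteration cap exists only so the model terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Status values assumed to match the kernel's RTAS definitions. */
#define RTAS_BUSY               (-2)
#define RTAS_EXTENDED_DELAY_MIN 9900
#define RTAS_EXTENDED_DELAY_MAX 9905
#define RTAS_PARAM_ERROR        (-3)	/* hypothetical name for the -3 status */

/* Simplified stand-in for rtas_busy_delay_time(): ms to wait, 0 if not busy. */
static unsigned int model_busy_delay_time(int status)
{
	unsigned int ms = 0;
	int order;

	if (status == RTAS_BUSY) {
		ms = 1;
	} else if (status >= RTAS_EXTENDED_DELAY_MIN &&
		   status <= RTAS_EXTENDED_DELAY_MAX) {
		order = status - RTAS_EXTENDED_DELAY_MIN;
		for (ms = 1; order > 0; order--)
			ms *= 10;
	}
	return ms;
}

/*
 * Run the retry loop against firmware that always answers `status`.
 * Returns the number of calls made; iter_cap bounds the model so that
 * a loop which would never terminate can still be observed.
 */
static int model_retry_loop(int status, bool bail_on_negative, int iter_cap)
{
	int max_wait = 200;
	int calls = 0;

	assert(iter_cap > 0);

	while (max_wait > 0 && calls < iter_cap) {
		int ret = status;	/* stands in for rtas_call() */

		calls++;

		if (bail_on_negative) {
			if (ret <= 0)	/* check as changed by this patch */
				return calls;
		} else {
			if (!ret)	/* original check */
				return calls;
		}

		if (ret > RTAS_EXTENDED_DELAY_MIN + 2 &&
		    ret <= RTAS_EXTENDED_DELAY_MAX)
			ret = RTAS_EXTENDED_DELAY_MIN + 2;

		/* For -3 the delay is 0 ms, so max_wait never shrinks. */
		max_wait -= (int)model_busy_delay_time(ret);
		if (max_wait < 0)
			break;
	}
	return calls;
}
```

With the original check, model_retry_loop(RTAS_PARAM_ERROR, false, cap) runs until the cap is reached because the wait budget is never reduced; with the changed check it returns after the first call.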

Comments

Nathan Lynch April 21, 2020, 11:33 p.m. UTC | #1

Sam Bobroff <sbobroff@linux.ibm.com> writes:
> If a device is hot unplugged during EEH recovery, it's possible for the
> RTAS call to ibm,configure-pe in pseries_eeh_configure_bridge() to
> return parameter error (-3). However, negative return values are not
> checked for, and this leads to an infinite loop.
>
> Fix this by correctly bailing out on negative values.
>
> Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/eeh_pseries.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
> index 893ba3f562c4..c4ef03bec0de 100644
> --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
> +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
> @@ -605,7 +605,7 @@ static int pseries_eeh_configure_bridge(struct eeh_pe *pe)
>  				config_addr, BUID_HI(pe->phb->buid),
>  				BUID_LO(pe->phb->buid));
>  
> -		if (!ret)
> +		if (ret <= 0)
>  			return ret;

Note that this returns the firmware error value (e.g. -3 parameter
error) without converting it to a Linux errno. Nothing checks the error
value of this function as best I can tell, but -EINVAL would be better
than an implicit -ESRCH here.
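
One way to picture the conversion being suggested is a small mapping helper. This is a sketch only: rtas_status_to_errno() is a hypothetical name, not an existing kernel function, and every mapping other than -3 to -EINVAL is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: convert a final RTAS status into a Linux errno.
 * Callers are expected to handle positive delay codes before calling.
 */
static int rtas_status_to_errno(int status)
{
	assert(status <= 0);

	switch (status) {
	case 0:
		return 0;
	case -3:		/* parameter error, as in the commit message */
		return -EINVAL;
	case -2:		/* busy: mapping assumed for illustration */
		return -EBUSY;
	default:		/* any other failure: mapping assumed */
		return -EIO;
	}
}
```

Returning the raw -3 instead would be read by Linux callers as -ESRCH, since ESRCH has the value 3.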

And while this will behave correctly, the pr_warn() at the end of
pseries_eeh_configure_bridge() hints that someone had the intention
that this code should log a message on such an error:

static int pseries_eeh_configure_bridge(struct eeh_pe *pe)
{
	int config_addr;
	int ret;
	/* Waiting 0.2s maximum before skipping configuration */
	int max_wait = 200;

	/* Figure out the PE address */
	config_addr = pe->config_addr;
	if (pe->addr)
		config_addr = pe->addr;

	while (max_wait > 0) {
		ret = rtas_call(ibm_configure_pe, 3, 1, NULL,
				config_addr, BUID_HI(pe->phb->buid),
				BUID_LO(pe->phb->buid));

		if (!ret)
			return ret;

		/*
		 * If RTAS returns a delay value that's above 100ms, cut it
		 * down to 100ms in case firmware made a mistake.  For more
		 * on how these delay values work see rtas_busy_delay_time
		 */
		if (ret > RTAS_EXTENDED_DELAY_MIN+2 &&
		    ret <= RTAS_EXTENDED_DELAY_MAX)
			ret = RTAS_EXTENDED_DELAY_MIN+2;

		max_wait -= rtas_busy_delay_time(ret);

		if (max_wait < 0)
			break;

		rtas_busy_delay(ret);
	}

	pr_warn("%s: Unable to configure bridge PHB#%x-PE#%x (%d)\n",
		__func__, pe->phb->global_number, pe->addr, ret);
	return ret;
}

So perhaps the error path should be made to break out of the loop
instead of returning. Or is the parameter error result simply
uninteresting in this scenario?
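
The "break instead of return" alternative can be sketched in user space as follows. Everything here is a model under stated assumptions: the statuses[] array is a hypothetical hook that scripts the firmware's successive answers in place of rtas_call(), only extended-delay codes are modeled as retryable, fprintf() stands in for pr_warn(), and the errno values returned on failure (-EINVAL for a firmware error, -EBUSY when the wait budget or the script runs out) are illustrative choices, not something the thread has settled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Values assumed to match the kernel's RTAS definitions. */
#define RTAS_EXTENDED_DELAY_MIN 9900
#define RTAS_EXTENDED_DELAY_MAX 9905

/* Delay in ms for an extended-delay status, 0 otherwise (simplified). */
static int model_delay_ms(int status)
{
	int ms = 0;
	int order;

	if (status >= RTAS_EXTENDED_DELAY_MIN &&
	    status <= RTAS_EXTENDED_DELAY_MAX) {
		order = status - RTAS_EXTENDED_DELAY_MIN;
		for (ms = 1; order > 0; order--)
			ms *= 10;
	}
	return ms;
}

/*
 * Model of the loop with the error path breaking out to the warning.
 * *warned is incremented each time the warning is emitted.
 */
static int model_configure_bridge(const int *statuses, size_t n, int *warned)
{
	int max_wait = 200;
	int ret = -1;
	size_t i = 0;

	assert(statuses && n > 0 && warned);

	while (max_wait > 0 && i < n) {
		ret = statuses[i++];	/* stands in for rtas_call() */

		if (!ret)
			return 0;
		if (ret < 0)
			break;		/* firmware error: go log it */

		if (ret > RTAS_EXTENDED_DELAY_MIN + 2 &&
		    ret <= RTAS_EXTENDED_DELAY_MAX)
			ret = RTAS_EXTENDED_DELAY_MIN + 2;

		max_wait -= model_delay_ms(ret);
		if (max_wait < 0)
			break;
	}

	/* Stand-in for the pr_warn() at the end of the real function. */
	fprintf(stderr, "%s: Unable to configure bridge (%d)\n", __func__, ret);
	(*warned)++;

	/* Errno choices are assumptions made for this sketch. */
	return ret < 0 ? -EINVAL : -EBUSY;
}
```

In this shape a parameter error both terminates the loop and reaches the warning, so the failure leaves a trace in the log.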

Sam Bobroff April 22, 2020, 3:30 a.m. UTC | #2

On Tue, Apr 21, 2020 at 06:33:36PM -0500, Nathan Lynch wrote:
> Sam Bobroff <sbobroff@linux.ibm.com> writes:
> > If a device is hot unplugged during EEH recovery, it's possible for the
> > RTAS call to ibm,configure-pe in pseries_eeh_configure_bridge() to
> > return parameter error (-3). However, negative return values are not
> > checked for, and this leads to an infinite loop.
> >
> > Fix this by correctly bailing out on negative values.
> >
> > Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
> > ---
> >  arch/powerpc/platforms/pseries/eeh_pseries.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
> > index 893ba3f562c4..c4ef03bec0de 100644
> > --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
> > +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
> > @@ -605,7 +605,7 @@ static int pseries_eeh_configure_bridge(struct eeh_pe *pe)
> >  				config_addr, BUID_HI(pe->phb->buid),
> >  				BUID_LO(pe->phb->buid));
> >  
> > -		if (!ret)
> > +		if (ret <= 0)
> >  			return ret;
> 
> Note that this returns the firmware error value (e.g. -3 parameter
> error) without converting it to a Linux errno. Nothing checks the error
> value of this function as best I can tell, but -EINVAL would be better
> than an implicit -ESRCH here.

Right, it's never used but I agree. I'll change it for v3.

> And while this will behave correctly, the pr_warn() at the end of
> pseries_eeh_configure_bridge() hints that someone had the intention
> that this code should log a message on such an error:
> 
> static int pseries_eeh_configure_bridge(struct eeh_pe *pe)
> {
> 	int config_addr;
> 	int ret;
> 	/* Waiting 0.2s maximum before skipping configuration */
> 	int max_wait = 200;
> 
> 	/* Figure out the PE address */
> 	config_addr = pe->config_addr;
> 	if (pe->addr)
> 		config_addr = pe->addr;
> 
> 	while (max_wait > 0) {
> 		ret = rtas_call(ibm_configure_pe, 3, 1, NULL,
> 				config_addr, BUID_HI(pe->phb->buid),
> 				BUID_LO(pe->phb->buid));
> 
> 		if (!ret)
> 			return ret;
> 
> 		/*
> 		 * If RTAS returns a delay value that's above 100ms, cut it
> 		 * down to 100ms in case firmware made a mistake.  For more
> 		 * on how these delay values work see rtas_busy_delay_time
> 		 */
> 		if (ret > RTAS_EXTENDED_DELAY_MIN+2 &&
> 		    ret <= RTAS_EXTENDED_DELAY_MAX)
> 			ret = RTAS_EXTENDED_DELAY_MIN+2;
> 
> 		max_wait -= rtas_busy_delay_time(ret);
> 
> 		if (max_wait < 0)
> 			break;
> 
> 		rtas_busy_delay(ret);
> 	}
> 
> 	pr_warn("%s: Unable to configure bridge PHB#%x-PE#%x (%d)\n",
> 		__func__, pe->phb->global_number, pe->addr, ret);
> 	return ret;
> }
> 
> So perhaps the error path should be made to break out of the loop
> instead of returning. Or is the parameter error result simply
> uninteresting in this scenario?

Sounds reasonable to me, and given that the only way I know to trigger
the error path (see the commit message) is not going to be common, I
think a message is a good idea. (And, as one of the people likely to
debug a future issue here, I'll probably appreciate it.)

Cheers,
Sam.

Patch

diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
index 893ba3f562c4..c4ef03bec0de 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -605,7 +605,7 @@  static int pseries_eeh_configure_bridge(struct eeh_pe *pe)
 				config_addr, BUID_HI(pe->phb->buid),
 				BUID_LO(pe->phb->buid));
 
-		if (!ret)
+		if (ret <= 0)
 			return ret;
 
 		/*