diff mbox series

[2/8] powerpc/eeh: Clear stale EEH_DEV_NO_HANDLER flag

Message ID 28b287eb3939b1941bd46b2ed9a6981a577568c4.1553050609.git.sbobroff@linux.ibm.com (mailing list archive)
State Changes Requested
Headers show
Series | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch next (de3c83c2fd2b87cf68214eda76dfa66989d78cb6)
snowpatch_ozlabs/checkpatch success total: 0 errors, 0 warnings, 0 checks, 10 lines checked

Commit Message

Sam Bobroff March 20, 2019, 2:58 a.m. UTC
The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the
use of driver callbacks in drivers that have been bound part way
through the recovery process. This is necessary to prevent later stage
handlers from being called when the earlier stage handlers haven't,
which can be confusing for drivers.

However, the flag is set for all devices that are added after boot
time and only cleared at the end of the EEH recovery process. This
results in hot plugged devices erroneously having the flag set during
the first recovery after they are added (causing their driver's
handlers to be incorrectly ignored).

To remedy this, clear the flag at the beginning of recovery
processing. The flag is still cleared at the end of recovery
processing, although it is no longer really necessary.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
 arch/powerpc/kernel/eeh_driver.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Alexey Kardashevskiy March 20, 2019, 6:02 a.m. UTC | #1
On 20/03/2019 13:58, Sam Bobroff wrote:
> The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the
> use of driver callbacks in drivers that have been bound part way
> through the recovery process. This is necessary to prevent later stage
> handlers from being called when the earlier stage handlers haven't,
> which can be confusing for drivers.

The flag is used from eeh_pe_report()->eeh_pe_report_edev which is
called many times from eeh_handle_normal_event() (and you clear the flag
here unconditionally) and once from eeh_handle_special_event() - so this
is actually the only case now when the flag matters. Is my understanding
correct? Also is not clearing the flag correct in that case? I do not
quite understand eeh_handle_normal_event vs. eeh_handle_special_event
business though.


> 
> However, the flag is set for all devices that are added after boot
> time and only cleared at the end of the EEH recovery process. This
> results in hot plugged devices erroneously having the flag set during
> the first recovery after they are added (causing their driver's
> handlers to be incorrectly ignored).
> 
> To remedy this, clear the flag at the beginning of recovery
> processing. The flag is still cleared at the end of recovery
> processing, although it is no longer really necessary.

Then may be remove that redundant clearing?

> 
> Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
> ---
>  arch/powerpc/kernel/eeh_driver.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 6f3ee30565dd..4c34b9901f15 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -819,6 +819,10 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  		result = PCI_ERS_RESULT_DISCONNECT;
>  	}
>  
> +	eeh_for_each_pe(pe, tmp_pe)
> +		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
> +			edev->mode &= ~EEH_DEV_NO_HANDLER;
> +
>  	/* Walk the various device drivers attached to this slot through
>  	 * a reset sequence, giving each an opportunity to do what it needs
>  	 * to accomplish the reset.  Each child gets a report of the
>
Sam Bobroff April 8, 2019, 6:50 a.m. UTC | #2
On Wed, Mar 20, 2019 at 05:02:57PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 20/03/2019 13:58, Sam Bobroff wrote:
> > The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the
> > use of driver callbacks in drivers that have been bound part way
> > through the recovery process. This is necessary to prevent later stage
> > handlers from being called when the earlier stage handlers haven't,
> > which can be confusing for drivers.
> 
> The flag is used from eeh_pe_report()->eeh_pe_report_edev which is
> called many times from eeh_handle_normal_event() (and you clear the flag
> here unconditionally) and once from eeh_handle_special_event() - so this
> is actually the only case now when the flag matters. Is my understanding
> correct? Also is not clearing the flag correct in that case? I do not
> quite understand eeh_handle_normal_event vs. eeh_handle_special_event
> business though.

I'm not sure I fully understand your question, but here's the situation:

* EEH is detected on a PCI device that has no driver bound but there is
  a driver that COULD bind.
* eeh_handle_normal_event() follows the "EEH: Reset with hotplug
  activity" path because the device doesn't (currently) have a driver
  that supports EEH.
* eeh_reset_device() removes the device (pci_hp_remove_devices()).
* eeh_reset_device() re-discovers the device with pci_hp_add_devices().
* As part of re-discovery the PCI subsystem will bind the available driver.
* eeh_handle_normal_event() calls eeh_report_resume() (via eeh_pe_report()).

If the (newly bound) driver has a resume() handler, then
eeh_report_resume() will call it and AFAIK this will cause a problem for
some drivers because their error_detected() handler wasn't called first.

The fix for this is to have eeh_add_device_late() set EEH_DEV_NO_HANDLER
so that we can detect that the device has been added DURING recovery,
and avoid calling it's handlers later.

I see what you mean about the eeh_handle_special_event() case, where
EEH_DEV_NO_HANDLER isn't cleared before calling eeh_pe_report(), and I
think it's a bug! I'll fix it in the next version.

(Cleaning up that flag is on my list. I don't think it's a very good
solution.)

> > 
> > However, the flag is set for all devices that are added after boot
> > time and only cleared at the end of the EEH recovery process. This
> > results in hot plugged devices erroneously having the flag set during
> > the first recovery after they are added (causing their driver's
> > handlers to be incorrectly ignored).
> > 
> > To remedy this, clear the flag at the beginning of recovery
> > processing. The flag is still cleared at the end of recovery
> > processing, although it is no longer really necessary.
> 
> Then may be remove that redundant clearing?

I don't really mind either way; clearing it when we are finished with
recovery seems "cleaner" to me but it doesn't have any function. (In
any case, I think I will eventually want to remove it.)

> > 
> > Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
> > ---
> >  arch/powerpc/kernel/eeh_driver.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> > index 6f3ee30565dd..4c34b9901f15 100644
> > --- a/arch/powerpc/kernel/eeh_driver.c
> > +++ b/arch/powerpc/kernel/eeh_driver.c
> > @@ -819,6 +819,10 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
> >  		result = PCI_ERS_RESULT_DISCONNECT;
> >  	}
> >  
> > +	eeh_for_each_pe(pe, tmp_pe)
> > +		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
> > +			edev->mode &= ~EEH_DEV_NO_HANDLER;
> > +
> >  	/* Walk the various device drivers attached to this slot through
> >  	 * a reset sequence, giving each an opportunity to do what it needs
> >  	 * to accomplish the reset.  Each child gets a report of the
> > 
> 
> -- 
> Alexey
>
diff mbox series

Patch

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 6f3ee30565dd..4c34b9901f15 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -819,6 +819,10 @@  void eeh_handle_normal_event(struct eeh_pe *pe)
 		result = PCI_ERS_RESULT_DISCONNECT;
 	}
 
+	eeh_for_each_pe(pe, tmp_pe)
+		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
+			edev->mode &= ~EEH_DEV_NO_HANDLER;
+
 	/* Walk the various device drivers attached to this slot through
 	 * a reset sequence, giving each an opportunity to do what it needs
 	 * to accomplish the reset.  Each child gets a report of the