diff mbox

[4/4] powerpc/eeh: Avoid event on passed PE

Message ID 1400574612-19411-5-git-send-email-gwshan@linux.vnet.ibm.com
State New, archived
Commit Message

Gavin Shan May 20, 2014, 8:30 a.m. UTC
If we detect a frozen state on a PE that has been passed to a guest, we
needn't handle it. Instead, we rely on the guest to detect and recover
it. The patch avoids raising an EEH event on a frozen passed PE so that
the guest has a chance to handle it.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/eeh.c                 | 8 ++++++++
 arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
 2 files changed, 10 insertions(+), 1 deletion(-)

Comments

Alexander Graf May 20, 2014, 11:25 a.m. UTC | #1
On 20.05.14 10:30, Gavin Shan wrote:
> If we detects frozen state on PE that has been passed to guest, we
> needn't handle it. Instead, we rely on the guest to detect and recover
> it. The patch avoid EEH event on the frozen passed PE so that the guest
> can have chance to handle that.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>

How does the guest learn about this failure? We'd need to inject an 
error into it, no?

I think what you want is an irqfd that the in-kernel eeh code notifies 
when it sees a failure. When such an fd exists, the kernel skips its own 
error handling.


Alex

> ---
>   arch/powerpc/kernel/eeh.c                 | 8 ++++++++
>   arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
>   2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 9c6b899..6543f05 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>   	if (ret > 0)
>   		return ret;
>   
> +	/*
> +	 * If the PE has been passed to guest, we won't check the
> +	 * state. Instead, let the guest handle it if the PE has
> +	 * been frozen.
> +	 */
> +	if (eeh_pe_passed(pe))
> +		return 0;
> +
>   	/* If we already have a pending isolation event for this
>   	 * slot, we know it's bad already, we don't need to check.
>   	 * Do this checking under a lock; as multiple PCI devices
> diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
> index 1b5982f..03a3ed2 100644
> --- a/arch/powerpc/platforms/powernv/eeh-ioda.c
> +++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
> @@ -890,7 +890,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
>   				opal_pci_eeh_freeze_clear(phb->opal_id, frozen_pe_no,
>   					OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>   				ret = EEH_NEXT_ERR_NONE;
> -			} else if ((*pe)->state & EEH_PE_ISOLATED) {
> +			} else if ((*pe)->state & EEH_PE_ISOLATED ||
> +				   eeh_pe_passed(*pe)) {
>   				ret = EEH_NEXT_ERR_NONE;
>   			} else {
>   				pr_err("EEH: Frozen PHB#%x-PE#%x (%s) detected\n",

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gavin Shan May 20, 2014, 11:56 a.m. UTC | #2
On Tue, May 20, 2014 at 01:25:11PM +0200, Alexander Graf wrote:
>
>On 20.05.14 10:30, Gavin Shan wrote:
>>If we detects frozen state on PE that has been passed to guest, we
>>needn't handle it. Instead, we rely on the guest to detect and recover
>>it. The patch avoid EEH event on the frozen passed PE so that the guest
>>can have chance to handle that.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>
>How does the guest learn about this failure? We'd need to inject an
>error into it, no?
>

When an error exists at the HW level, reads from PCI config space or
memory BARs return 0xFF's. On seeing 0xFF's, the guest retrieves the
failure state, which is captured by HW automatically, via the RTAS call
"ibm,read-slot-reset-state2". If "ibm,read-slot-reset-state2" reports
errors in HW, the guest kernel starts recovery.

This can be called "passive" reporting. There is possibly one case in
which the error is never reported: no device driver is bound to the VFIO
PCI device and nothing accesses the device's config space or memory BARs.
However, that doesn't matter. As we don't use the device, we needn't
detect and recover from the error at all.
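
The polling flow described above can be sketched in C as follows. This is
a minimal illustration, not the guest kernel's code: guest_check_failure,
query_state_fn, enum pe_state and the two stubs are hypothetical names,
and the RTAS call is replaced by a callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PE states. The real values would come from the RTAS call
 * "ibm,read-slot-reset-state2", which is not modelled here. */
enum pe_state { PE_STATE_NORMAL, PE_STATE_FROZEN };

/* Hypothetical callback standing in for the RTAS query. */
typedef enum pe_state (*query_state_fn)(void *ctx);

/* Stubs for illustration only: a PE that is frozen, and one that is not. */
enum pe_state stub_frozen(void *ctx) { (void)ctx; return PE_STATE_FROZEN; }
enum pe_state stub_normal(void *ctx) { (void)ctx; return PE_STATE_NORMAL; }

/* A read of all 1s is only a hint: a register may legitimately hold
 * 0xFFFFFFFF, so the state still has to be confirmed. */
static bool looks_like_error(uint32_t val)
{
	return val == 0xFFFFFFFFu;
}

/* Returns true when the caller should start recovery. The state query
 * is issued only after a suspicious read, which is what makes the
 * reporting "passive". */
bool guest_check_failure(uint32_t val, query_state_fn query, void *ctx)
{
	assert(query != NULL);
	if (!looks_like_error(val))
		return false;
	return query(ctx) == PE_STATE_FROZEN;
}
```

A caller would pass the value it just read from config space or a BAR.
If nothing ever reads from the device, guest_check_failure is never
reached, which is the unreported case mentioned above.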

>I think what you want is an irqfd that the in-kernel eeh code
>notifies when it sees a failure. When such an fd exists, the kernel
>skips its own error handling.
>

Yeah, it's a good idea and something for me to improve in phase II. We
can discuss it more later. For now, what I have in my head is something
like this:

      [ Host ] -> Error detected -> irqfd (or eventfd) -> QEMU 
                                                           |
                                   -------------(A)---------
                                   |
                        Send one EEH event to guest kernel
                                   |
                        Guest kernel starts the recovery

(A): I haven't figured out a convenient way to do the EEH event injection yet.

Thanks,
Gavin

>>---
>>  arch/powerpc/kernel/eeh.c                 | 8 ++++++++
>>  arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
>>  2 files changed, 10 insertions(+), 1 deletion(-)
>>
>>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>>index 9c6b899..6543f05 100644
>>--- a/arch/powerpc/kernel/eeh.c
>>+++ b/arch/powerpc/kernel/eeh.c
>>@@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>>  	if (ret > 0)
>>  		return ret;
>>+	/*
>>+	 * If the PE has been passed to guest, we won't check the
>>+	 * state. Instead, let the guest handle it if the PE has
>>+	 * been frozen.
>>+	 */
>>+	if (eeh_pe_passed(pe))
>>+		return 0;
>>+
>>  	/* If we already have a pending isolation event for this
>>  	 * slot, we know it's bad already, we don't need to check.
>>  	 * Do this checking under a lock; as multiple PCI devices
>>diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
>>index 1b5982f..03a3ed2 100644
>>--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
>>+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
>>@@ -890,7 +890,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
>>  				opal_pci_eeh_freeze_clear(phb->opal_id, frozen_pe_no,
>>  					OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>  				ret = EEH_NEXT_ERR_NONE;
>>-			} else if ((*pe)->state & EEH_PE_ISOLATED) {
>>+			} else if ((*pe)->state & EEH_PE_ISOLATED ||
>>+				   eeh_pe_passed(*pe)) {
>>  				ret = EEH_NEXT_ERR_NONE;
>>  			} else {
>>  				pr_err("EEH: Frozen PHB#%x-PE#%x (%s) detected\n",
>

Alexander Graf May 20, 2014, 12:14 p.m. UTC | #3
On 20.05.14 13:56, Gavin Shan wrote:
> On Tue, May 20, 2014 at 01:25:11PM +0200, Alexander Graf wrote:
>> On 20.05.14 10:30, Gavin Shan wrote:
>>> If we detects frozen state on PE that has been passed to guest, we
>>> needn't handle it. Instead, we rely on the guest to detect and recover
>>> it. The patch avoid EEH event on the frozen passed PE so that the guest
>>> can have chance to handle that.
>>>
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> How does the guest learn about this failure? We'd need to inject an
>> error into it, no?
>>
> When error is existing in HW level, 0xFF's will be turned on reading
> PCI config space or memory BARs. Guest retrieves the failure state,
> which is captured by HW automatically, via RTAS call
> "ibm,read-slot-reset-state2" when seeing 0xFF's on reading PCI config
> space or memory BARs. If "ibm,read-slot-reset-state2" reports errors in HW,
> the guest kernel starts to recovery.
>
> It can be called as "passive" reporting. There possible has one case that
> the error can't be reported for ever: No device driver binding to the VFIO
> PCI device and no access to device's config space and memory BARs. However,
> it doesn't matter. As we don't use the device, we needn't detect and recover
> the error at all.

So if the guest is waiting for an interrupt to happen it will wait 
forever? Not really nice.

>> I think what you want is an irqfd that the in-kernel eeh code
>> notifies when it sees a failure. When such an fd exists, the kernel
>> skips its own error handling.
>>
> Yeah, it's a good idea and something for me to improve in phase II. We
> can discuss for more later.

I think it makes sense to at least walk into that direction immediately. 
The reason I brought it up in the context of this patch is that with an 
irqfd you wouldn't need the passed flag at all.

>   For now, what I have in my head is something
> like this:
>
>        [ Host ] -> Error detected -> irqfd (or eventfd) -> QEMU
>                                                             |
>                                     -------------(A)---------
>                                     |
>                          Send one EEH event to guest kernel
>                                     |
>                          Guest kernel starts the recovery
>
> (A): I didn't figure out one convienent way to do the EEH event injection yet.

How does the guest learn about errors in pHyp?


Alex

Gavin Shan May 20, 2014, 12:45 p.m. UTC | #4
On Tue, May 20, 2014 at 02:14:56PM +0200, Alexander Graf wrote:
>
>On 20.05.14 13:56, Gavin Shan wrote:
>>On Tue, May 20, 2014 at 01:25:11PM +0200, Alexander Graf wrote:
>>>On 20.05.14 10:30, Gavin Shan wrote:
>>>>If we detects frozen state on PE that has been passed to guest, we
>>>>needn't handle it. Instead, we rely on the guest to detect and recover
>>>>it. The patch avoid EEH event on the frozen passed PE so that the guest
>>>>can have chance to handle that.
>>>>
>>>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>How does the guest learn about this failure? We'd need to inject an
>>>error into it, no?
>>>
>>When error is existing in HW level, 0xFF's will be turned on reading
>>PCI config space or memory BARs. Guest retrieves the failure state,
>>which is captured by HW automatically, via RTAS call
>>"ibm,read-slot-reset-state2" when seeing 0xFF's on reading PCI config
>>space or memory BARs. If "ibm,read-slot-reset-state2" reports errors in HW,
>>the guest kernel starts to recovery.
>>
>>It can be called as "passive" reporting. There possible has one case that
>>the error can't be reported for ever: No device driver binding to the VFIO
>>PCI device and no access to device's config space and memory BARs. However,
>>it doesn't matter. As we don't use the device, we needn't detect and recover
>>the error at all.
>
>So if the guest is waiting for an interrupt to happen it will wait
>forever? Not really nice.
>

Nope, the error reporting in guest isn't interrupt-driven. It's always
"polling" :-)

>>>I think what you want is an irqfd that the in-kernel eeh code
>>>notifies when it sees a failure. When such an fd exists, the kernel
>>>skips its own error handling.
>>>
>>Yeah, it's a good idea and something for me to improve in phase II. We
>>can discuss for more later.
>
>I think it makes sense to at least walk into that direction
>immediately. The reason I brought it up in the context of this patch
>is that with an irqfd you wouldn't need the passed flag at all.
>

I don't see how it can avoid the "passed" flag. Without the flag, any
PCI config and memory BAR access on host side could trigger EEH recovery
for those PCI devices passed to guest. That's unexpected behaviour. 

For host, we have 2 ways to report errors: interrupt driven and polling.
For the guest, we only have "polling" :-)

>>  For now, what I have in my head is something
>>like this:
>>
>>       [ Host ] -> Error detected -> irqfd (or eventfd) -> QEMU
>>                                                            |
>>                                    -------------(A)---------
>>                                    |
>>                         Send one EEH event to guest kernel
>>                                    |
>>                         Guest kernel starts the recovery
>>
>>(A): I didn't figure out one convienent way to do the EEH event injection yet.
>
>How does the guest learn about errors in pHyp?
>

It relies on "polling".

Thanks,
Gavin

Alexander Graf May 20, 2014, 1:49 p.m. UTC | #5
On 20.05.14 14:45, Gavin Shan wrote:
> On Tue, May 20, 2014 at 02:14:56PM +0200, Alexander Graf wrote:
>> On 20.05.14 13:56, Gavin Shan wrote:
>>> On Tue, May 20, 2014 at 01:25:11PM +0200, Alexander Graf wrote:
>>>> On 20.05.14 10:30, Gavin Shan wrote:
>>>>> If we detects frozen state on PE that has been passed to guest, we
>>>>> needn't handle it. Instead, we rely on the guest to detect and recover
>>>>> it. The patch avoid EEH event on the frozen passed PE so that the guest
>>>>> can have chance to handle that.
>>>>>
>>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>> How does the guest learn about this failure? We'd need to inject an
>>>> error into it, no?
>>>>
>>> When error is existing in HW level, 0xFF's will be turned on reading
>>> PCI config space or memory BARs. Guest retrieves the failure state,
>>> which is captured by HW automatically, via RTAS call
>>> "ibm,read-slot-reset-state2" when seeing 0xFF's on reading PCI config
>>> space or memory BARs. If "ibm,read-slot-reset-state2" reports errors in HW,
>>> the guest kernel starts to recovery.
>>>
>>> It can be called as "passive" reporting. There possible has one case that
>>> the error can't be reported for ever: No device driver binding to the VFIO
>>> PCI device and no access to device's config space and memory BARs. However,
>>> it doesn't matter. As we don't use the device, we needn't detect and recover
>>> the error at all.
>> So if the guest is waiting for an interrupt to happen it will wait
>> forever? Not really nice.
>>
> Nope, the error reporting in guest isn't interrupt-driven. It's always
> "polling" :-)

That sucks :).

>
>>>> I think what you want is an irqfd that the in-kernel eeh code
>>>> notifies when it sees a failure. When such an fd exists, the kernel
>>>> skips its own error handling.
>>>>
>>> Yeah, it's a good idea and something for me to improve in phase II. We
>>> can discuss for more later.
>> I think it makes sense to at least walk into that direction
>> immediately. The reason I brought it up in the context of this patch
>> is that with an irqfd you wouldn't need the passed flag at all.
>>
> I don't see how it can avoid the "passed" flag. Without the flag, any
> PCI config and memory BAR access on host side could trigger EEH recovery
> for those PCI devices passed to guest. That's unexpected behaviour.

Instead of

   if (passed_flag)
     return;

you would do

   if (trigger_irqfd) {
     trigger_irqfd();
     return;
   }

which would be a much nicer, generic interface.
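
A slightly fuller, self-contained model of the "handle it or notify"
idea is sketched below. It is only an illustration under assumptions:
struct pe_model, pe_report_frozen, count_notify and the counters are
hypothetical names, not kernel interfaces, and the notification is a
plain function pointer rather than a real irqfd.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a PE. A non-NULL notify hook means someone
 * else (for example a VFIO user) wants to hear about the freeze. */
struct pe_model {
	void (*notify)(struct pe_model *pe);
	int notified;         /* how often the hook was called */
	int host_recoveries;  /* how often the host handled it itself */
};

enum handled_by { HANDLED_BY_HOST, HANDLED_BY_LISTENER };

/* Example hook for illustration: just count the notification. */
void count_notify(struct pe_model *pe)
{
	pe->notified++;
}

/* Either notify the registered listener and skip the host's own
 * error handling, or fall back to handling the error in the host. */
enum handled_by pe_report_frozen(struct pe_model *pe)
{
	assert(pe != NULL);
	if (pe->notify) {
		pe->notify(pe);
		return HANDLED_BY_LISTENER;
	}
	pe->host_recoveries++;
	return HANDLED_BY_HOST;
}
```

In this model the presence of the hook plays the role of the flag: a PE
without a hook is handled by the host as before.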

> For host, we have 2 ways to report errors: interrupt driven and polling.
> For the guest, we only have "polling" :-)

And the interrupt path is powernv specific? Does sPAPR specify anything 
here?

>
>>>   For now, what I have in my head is something
>>> like this:
>>>
>>>        [ Host ] -> Error detected -> irqfd (or eventfd) -> QEMU
>>>                                                             |
>>>                                     -------------(A)---------
>>>                                     |
>>>                          Send one EEH event to guest kernel
>>>                                     |
>>>                          Guest kernel starts the recovery
>>>
>>> (A): I didn't figure out one convienent way to do the EEH event injection yet.
>> How does the guest learn about errors in pHyp?
>>
> It relies on "polling".

Sigh ;).

So how about we just implement this whole thing properly as irqfd? 
Whether QEMU can actually do anything with the interrupt is a different 
question - we can leave it be for now. But we could model all the code 
with the assumption that it should either handle the error itself or 
trigger an irqfd write.


Alex

Benjamin Herrenschmidt May 21, 2014, 12:12 a.m. UTC | #6
On Tue, 2014-05-20 at 21:56 +1000, Gavin Shan wrote:

 .../...

> >I think what you want is an irqfd that the in-kernel eeh code
> >notifies when it sees a failure. When such an fd exists, the kernel
> >skips its own error handling.
> >
> 
> Yeah, it's a good idea and something for me to improve in phase II. We
> can discuss for more later. For now, what I have in my head is something
> like this:

However, this would be a deviation from (or extension of) PAPR. At the
moment, the way things work in PAPR is that the guest is responsible for
querying the EEH state when something "looks" like an error (ie, getting
ff's back). This is also how it works in pHyp.

We have an interrupt path in the host when doing "native" EEH, and it
would be nice to extend PAPR to also be able to shoot an event to the
guest possibly using RTAS events, but let's get the basics working and
upstream first.

Cheers,
Ben.


Benjamin Herrenschmidt May 21, 2014, 12:13 a.m. UTC | #7
On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote:
> Instead of
> 
>    if (passed_flag)
>      return;
> 
> you would do
> 
>    if (trigger_irqfd) {
>      trigger_irqfd();
>      return;
>    }
> 
> which would be a much nicer, generic interface.

But that's not how PAPR works.

Cheers,
Ben.


Benjamin Herrenschmidt May 21, 2014, 12:19 a.m. UTC | #8
On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote:
> So how about we just implement this whole thing properly as irqfd? 
> Whether QEMU can actually do anything with the interrupt is a different 
> question - we can leave it be for now. But we could model all the code 
> with the assumption that it should either handle the error itself or 
> trigger and irqfd write.

I don't object to the idea... however this smells of Deja Vu...

You often tend to want to turn something submitted that fills a specific
gap and implements a specific spec/function into some kind of idealized
grand design :-) And that means nothing gets upstream for weeks or months
as we churn and churn...

Sometimes it's probably worth it. Here I would argue against it and would
advocate for doing the basic functionality first, as it is used by guests,
and later add the irqfd option. I don't see any emergency here and adding
the irqfd will not cause fundamental design changes:

The "passed" flag (though I'm not a fan of the name) is really something
we want in the low level handlers to avoid triggering host side EEH in
various places, regardless of whether we use irqfd or not.

This is totally orthogonal from the mechanism used for notifications.

Even in host, the detection path doesn't always involve interrupts, and
we can detect some things as a result of a host side config space access
for example etc...

So let's keep things nice and separate here. The interrupt notification
is just an "optimization" which will speed up discovery of the error in
*some* cases later on (but adds its own complexity since we have multiple
discovery paths in host, so we need to keep track whether we have notified
yet or not etc...) so let's keep it for later.
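
The bookkeeping mentioned above can be sketched roughly as follows. This
is a toy model under assumptions, not the EEH code: struct pe_track,
pe_frozen_detected and pe_recovered are hypothetical names. It shows the
"passed" check suppressing host-side handling on every discovery path,
with the optional notification sent at most once per freeze.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-PE tracking state. */
struct pe_track {
	bool passed;       /* PE is owned by a guest */
	bool notified;     /* a notification was already sent for this freeze */
	int  notify_count; /* notifications sent so far */
	int  host_events;  /* host-side EEH events raised so far */
};

/* Called from any discovery path (interrupt, config read, MMIO read).
 * The passed check applies whether or not a notifier exists. */
void pe_frozen_detected(struct pe_track *pe, bool have_notifier)
{
	assert(pe != NULL);
	if (!pe->passed) {
		pe->host_events++;  /* host owns the PE: handle it here */
		return;
	}
	if (have_notifier && !pe->notified) {
		pe->notified = true; /* several paths may see the same freeze */
		pe->notify_count++;
	}
}

/* Re-arm the notification once the guest has recovered the PE. */
void pe_recovered(struct pe_track *pe)
{
	assert(pe != NULL);
	pe->notified = false;
}
```

With have_notifier set to false the passed PE is simply left alone,
which corresponds to the basic functionality without any notification.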

Cheers,
Ben.


Gavin Shan May 21, 2014, 4:41 a.m. UTC | #9
On Wed, May 21, 2014 at 10:12:11AM +1000, Benjamin Herrenschmidt wrote:
>On Tue, 2014-05-20 at 21:56 +1000, Gavin Shan wrote:
>
> .../...
>
>> >I think what you want is an irqfd that the in-kernel eeh code
>> >notifies when it sees a failure. When such an fd exists, the kernel
>> >skips its own error handling.
>> >
>> 
>> Yeah, it's a good idea and something for me to improve in phase II. We
>> can discuss for more later. For now, what I have in my head is something
>> like this:
>
>However, this would be a deviation from (or extension of) PAPR. At the
>moment, the way things work in PAPR is that the guest is responsible for
>querying the EEH state when something "looks" like an error (ie, getting
>ff's back). This is also how it works in pHyp.
>
>We have an interrupt path in the host when doing "native" EEH, and it
>would be nice to extend PAPR to also be able to shoot an event to the
>guest possibly using RTAS events, but let's get the basics working and
>upstream first.
>

Got it. Thanks, Ben :-)

Thanks,
Gavin

Alexander Graf May 21, 2014, 6:16 a.m. UTC | #10
> Am 21.05.2014 um 02:13 schrieb Benjamin Herrenschmidt <benh@kernel.crashing.org>:
> 
>> On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote:
>> Instead of
>> 
>>   if (passed_flag)
>>     return;
>> 
>> you would do
>> 
>>   if (trigger_irqfd) {
>>     trigger_irqfd();
>>     return;
>>   }
>> 
>> which would be a much nicer, generic interface.
> 
> But that's not how PAPR works.

But it's what a non-QEMU VFIO user would want, and it should be easy to implement.


Alex

Alexander Graf May 21, 2014, 6:20 a.m. UTC | #11
> Am 21.05.2014 um 02:19 schrieb Benjamin Herrenschmidt <benh@kernel.crashing.org>:
> 
>> On Tue, 2014-05-20 at 15:49 +0200, Alexander Graf wrote:
>> So how about we just implement this whole thing properly as irqfd? 
>> Whether QEMU can actually do anything with the interrupt is a different 
>> question - we can leave it be for now. But we could model all the code 
>> with the assumption that it should either handle the error itself or 
>> trigger and irqfd write.
> 
> I don't object to the idea... however this smells of Deja Vu...
> 
> You often tend to want to turn something submitted that fills a specific
> gap and implements a specific spec/function into some kind of idealized
> grand design :-) And that means nothing gets upstream for weeks or monthes
> as we churn and churn...
> 
> Sometimes it's probably worth it. Here I would argue against it and would
> advocate for doing the basic functionality first, as it is used by guests,
> and later add the irqfd option. I don't see any emergency here and adding
> the irqfd will not cause fundamental design changes:
> 
> The "passed" flag (though I'm not fan of the name) is really something
> we want in the low level handlers to avoid triggering host side EEH in
> various places, regardless of whether we use irqfd or not.
> 
> This is totally orthogonal from the mechanism used for notifications.
> 
> Even in host, the detection path doesn't always involve interrupts, and
> we can detect some things as a result of a host side config space access
> for example etc...
> 
> So let's keep things nice and separate here. The interrupt notification
> is just an "optimization" which will speed up discovery of the error in
> *some* cases later on (but adds its own complexity since we have multiple
> discovery path in host, so we need to keep track whether we have notified
> yet or not etc...) so let's keep it for later.

EEH handling is your call, but I only see reduced complexity here. I won't nak the current approach though.


Alex

Paul Mackerras June 3, 2014, 5:54 a.m. UTC | #12
On Tue, May 20, 2014 at 01:25:11PM +0200, Alexander Graf wrote:
> 
> On 20.05.14 10:30, Gavin Shan wrote:
> >If we detects frozen state on PE that has been passed to guest, we
> >needn't handle it. Instead, we rely on the guest to detect and recover
> >it. The patch avoid EEH event on the frozen passed PE so that the guest
> >can have chance to handle that.
> >
> >Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> 
> How does the guest learn about this failure? We'd need to inject an error
> into it, no?
> 
> I think what you want is an irqfd that the in-kernel eeh code notifies when
> it sees a failure. When such an fd exists, the kernel skips its own error
> handling.

Well... we don't have irqfd support for book3s HV upstream yet.  The
way the current code is, we have to turn on GSI routing, which puts a
hard and relatively small limit on the hardware IRQ numbers we can use
as it uses a flat array indexed by hardware IRQ number.  Which is a
problem that I need to solve somehow, but it makes using an irqfd
unattractive in the short term.

Paul.
Alexander Graf June 3, 2014, 7:45 a.m. UTC | #13
> Am 03.06.2014 um 07:54 schrieb Paul Mackerras <paulus@samba.org>:
> 
>> On Tue, May 20, 2014 at 01:25:11PM +0200, Alexander Graf wrote:
>> 
>>> On 20.05.14 10:30, Gavin Shan wrote:
>>> If we detects frozen state on PE that has been passed to guest, we
>>> needn't handle it. Instead, we rely on the guest to detect and recover
>>> it. The patch avoid EEH event on the frozen passed PE so that the guest
>>> can have chance to handle that.
>>> 
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> 
>> How does the guest learn about this failure? We'd need to inject an error
>> into it, no?
>> 
>> I think what you want is an irqfd that the in-kernel eeh code notifies when
>> it sees a failure. When such an fd exists, the kernel skips its own error
>> handling.
> 
> Well... we don't have irqfd support for book3s HV upstream yet.  The
> way the current code is, we have to turn on GSI routing, which puts a
> hard and relatively small limit on the hardware IRQ numbers we can use
> as it uses a flat array indexed by hardware IRQ number.  Which is a
> problem that I need to solve somehow,

Please sync up with the ARM folks on this - they were also unhappy about the routing requirements for irqfd ;).

> but it makes using an irqfd
> unattractive in the short term.

For EEH it could as well be a dumb eventfd - really just a side channel that can tell user space that something happened asynchronously :).
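
For reference, a plain eventfd used as such a side channel looks roughly
like this from user space. The sketch uses the standard Linux eventfd(2)
interface; the eeh_channel_* names are hypothetical, and in the scenario
above the signalling side would be the kernel rather than the same
process.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create the side channel. EFD_NONBLOCK makes a read on an empty
 * counter fail with EAGAIN instead of blocking.
 * Returns the fd, or -1 on error. */
int eeh_channel_open(void)
{
	return eventfd(0, EFD_NONBLOCK);
}

/* Producer side: add 1 to the eventfd counter. Returns 0 on success. */
int eeh_channel_signal(int fd)
{
	uint64_t one = 1;

	assert(fd >= 0);
	return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Consumer side: return how many signals arrived since the last drain
 * (0 if none), or -1 on a real error. Reading resets the counter. */
long eeh_channel_drain(int fd)
{
	uint64_t count = 0;
	ssize_t n;

	assert(fd >= 0);
	n = read(fd, &count, sizeof(count));
	if (n == (ssize_t)sizeof(count))
		return (long)count;
	return (n < 0 && errno == EAGAIN) ? 0 : -1;
}
```

The channel only says that something happened. The consumer still has
to query the actual PE state afterwards.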


Alex

Benjamin Herrenschmidt June 3, 2014, 7:52 a.m. UTC | #14
On Tue, 2014-06-03 at 09:45 +0200, Alexander Graf wrote:
> For EEH it could as well be a dumb eventfd - really just a side
> channel that can tell user space that something happened
> asynchronously :).

Which the host kernel may have no way to detect without actively poking
at the device (fences in powernv or anything in PAPR host) and the only
user of this for now has no use for.

I insist: don't bother.

Cheers,
Ben.


Patch

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 9c6b899..6543f05 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -400,6 +400,14 @@  int eeh_dev_check_failure(struct eeh_dev *edev)
 	if (ret > 0)
 		return ret;
 
+	/*
+	 * If the PE has been passed to guest, we won't check the
+	 * state. Instead, let the guest handle it if the PE has
+	 * been frozen.
+	 */
+	if (eeh_pe_passed(pe))
+		return 0;
+
 	/* If we already have a pending isolation event for this
 	 * slot, we know it's bad already, we don't need to check.
 	 * Do this checking under a lock; as multiple PCI devices
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 1b5982f..03a3ed2 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -890,7 +890,8 @@  static int ioda_eeh_next_error(struct eeh_pe **pe)
 				opal_pci_eeh_freeze_clear(phb->opal_id, frozen_pe_no,
 					OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
 				ret = EEH_NEXT_ERR_NONE;
-			} else if ((*pe)->state & EEH_PE_ISOLATED) {
+			} else if ((*pe)->state & EEH_PE_ISOLATED ||
+				   eeh_pe_passed(*pe)) {
 				ret = EEH_NEXT_ERR_NONE;
 			} else {
 				pr_err("EEH: Frozen PHB#%x-PE#%x (%s) detected\n",