diff mbox series

[RFC,kernel] powerpc/xive: Drop deregistered irqs

Message ID 20190712082036.40440-1-aik@ozlabs.ru (mailing list archive)
State Superseded
Headers show
Series [RFC,kernel] powerpc/xive: Drop deregistered irqs | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch next (c06dc11e93d41bae3f06a812b33ba422839936d0)
snowpatch_ozlabs/build-ppc64le success Build succeeded
snowpatch_ozlabs/build-ppc64be success Build succeeded
snowpatch_ozlabs/build-ppc64e success Build succeeded
snowpatch_ozlabs/build-pmac32 success Build succeeded
snowpatch_ozlabs/checkpatch success total: 0 errors, 0 warnings, 0 checks, 14 lines checked

Commit Message

Alexey Kardashevskiy July 12, 2019, 8:20 a.m. UTC
There is a race between releasing an irq on one cpu and fetching it
from XIVE on another cpu as there does not seem to be any locking between
these, probably because xive_irq_chip::irq_shutdown() is supposed to
remove the irq from all queues in the system which it does not do.

As a result, when such released irq appears in a queue, we take it
from the queue but we do not change the current priority on that cpu and
since there is no handler for the irq, EOI is never called and the cpu
current priority remains elevated (7 vs. 0xff==unmasked). If another irq
is assigned to the same cpu, then that device stops working until irq
is moved to another cpu or the device is reset.

This checks if irq is still registered, if not, it assumes no valid irq
was fetched from the loop and if there is none left, it continues to
the irq==0 case (not visible in this patch) and sets priority to 0xff
which is basically unmasking. This calls irq_to_desc() on a hot path now
which is a radix tree lookup; hopefully this won't be noticeable as
that tree is quite small.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

Found it on P9 system with:
- a host with 8 cpus online
- a boot disk on ahci with its msix on cpu#0
- a guest with 2xGPUs + 6xNVLink + 4 cpus
- GPU#0 from the guest is bound to the same cpu#0.

Killing a guest killed ahci and therefore the host because of the race.
Note that VFIO masks interrupts first and only then resets the device.

Alternatives:

1. Fix xive_irq_chip::irq_shutdown() to walk through all cpu queues and
drop deregistered irqs.

2. Exploit chip->irq_get_irqchip_state function from
62e0468650c30f0298 "genirq: Add optional hardware synchronization for shutdown".

Both require deep XIVE knowledge which I do not have.
---
 arch/powerpc/sysdev/xive/common.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Benjamin Herrenschmidt July 12, 2019, 8:29 a.m. UTC | #1
On Fri, 2019-07-12 at 18:20 +1000, Alexey Kardashevskiy wrote:
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index 082c7e1c20f0..65742e280337 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -148,8 +148,12 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
>  		irq = xive_read_eq(&xc->queue[prio], just_peek);
>  
>  		/* Found something ? That's it */
> -		if (irq)
> -			break;
> +		if (irq) {
> +			/* Another CPU may have shut this irq down, check it */
> +			if (irq_to_desc(irq))

What if it gets deregistered here .... ?

> +				break;
> +			irq = 0;
> +		}
>  
>  		/* Clear pending bits */
>  		xc->pending_prio &= ~(1 << prio);

Wouldn't it be better to check the return value from generic_handle_irq
instead ?

Cheers,
Ben.
Alexey Kardashevskiy July 12, 2019, 9:37 a.m. UTC | #2
On 12/07/2019 18:29, Benjamin Herrenschmidt wrote:
> On Fri, 2019-07-12 at 18:20 +1000, Alexey Kardashevskiy wrote:
>>
>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>> index 082c7e1c20f0..65742e280337 100644
>> --- a/arch/powerpc/sysdev/xive/common.c
>> +++ b/arch/powerpc/sysdev/xive/common.c
>> @@ -148,8 +148,12 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
>>   		irq = xive_read_eq(&xc->queue[prio], just_peek);
>>   
>>   		/* Found something ? That's it */
>> -		if (irq)
>> -			break;
>> +		if (irq) {
>> +			/* Another CPU may have shut this irq down, check it */
>> +			if (irq_to_desc(irq))
> 
> What if it gets deregistered here .... ?

Yeah that is the problem.

> 
>> +				break;
>> +			irq = 0;
>> +		}
>>   
>>   		/* Clear pending bits */
>>   		xc->pending_prio &= ~(1 << prio);
> 
> Wouldn't it be better to check the return value from generic_handle_irq
> instead ?

Where exactly, here?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/irq.c#n614

If so, then in order to do EOI, I'll need the desc which is gone, or I 
am missing the point?
Benjamin Herrenschmidt July 12, 2019, 11:47 p.m. UTC | #3
On Fri, 2019-07-12 at 19:37 +1000, Alexey Kardashevskiy wrote:
> 

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/irq.c#n614
> 
> If so, then in order to do EOI, I'll need the desc which is gone, or
> I am missing the point?

All you need is drop the local CPU priority.

Cheers,
Ben.
Alexey Kardashevskiy July 13, 2019, 8:53 a.m. UTC | #4
On 13/07/2019 09:47, Benjamin Herrenschmidt wrote:
> On Fri, 2019-07-12 at 19:37 +1000, Alexey Kardashevskiy wrote:
>>
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/irq.c#n614
>>
>> If so, then in order to do EOI, I'll need the desc which is gone, or
>> I am missing the point?
> 
> All you need is drop the local CPU priority.

I know that, the question was how. I cannot use irq_chip in
arch/powerpc/kernel/irq.c and I do not want to add ppc_md.enable_irqs().
Benjamin Herrenschmidt July 14, 2019, 1:31 a.m. UTC | #5
On Sat, 2019-07-13 at 18:53 +1000, Alexey Kardashevskiy wrote:
> 
> On 13/07/2019 09:47, Benjamin Herrenschmidt wrote:
> > On Fri, 2019-07-12 at 19:37 +1000, Alexey Kardashevskiy wrote:
> > > 
> > 
> > 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/irq.c#n614
> > > 
> > > If so, then in order to do EOI, I'll need the desc which is gone,
> > > or
> > > I am missing the point?
> > 
> > All you need is drop the local CPU priority.
> 
> I know that, the question was how. I cannot use irq_chip in
> arch/powerpc/kernel/irq.c and I do not want to add
> ppc_md.enable_irqs().

Well, best is probably to do just that though, but call it something
like ppc_md.orphan_irq() or something like that instead. Another option
as you mention is to try to scrub queues, but that's trickier to do due
to the lockless nature of the queue handling.

Cheers,
Ben.
Daniel Henrique Barboza July 14, 2019, 7:14 p.m. UTC | #6
This patch fixed an issue I was experiencing with virsh start/destroy
of guests with mlx5 and GPU passthrough in a Power 9 server. I
believe it's a similar situation which Alexey described in the post
commit msg.


Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com>


On 7/12/19 5:20 AM, Alexey Kardashevskiy wrote:
> There is a race between releasing an irq on one cpu and fetching it
> from XIVE on another cpu as there does not seem to be any locking between
> these, probably because xive_irq_chip::irq_shutdown() is supposed to
> remove the irq from all queues in the system which it does not do.
>
> As a result, when such released irq appears in a queue, we take it
> from the queue but we do not change the current priority on that cpu and
> since there is no handler for the irq, EOI is never called and the cpu
> current priority remains elevated (7 vs. 0xff==unmasked). If another irq
> is assigned to the same cpu, then that device stops working until irq
> is moved to another cpu or the device is reset.
>
> This checks if irq is still registered, if not, it assumes no valid irq
> was fetched from the loop and if there is none left, it continues to
> the irq==0 case (not visible in this patch) and sets priority to 0xff
> which is basically unmasking. This calls irq_to_desc() on a hot path now
> which is a radix tree lookup; hopefully this won't be noticeable as
> that tree is quite small.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>
> Found it on P9 system with:
> - a host with 8 cpus online
> - a boot disk on ahci with its msix on cpu#0
> - a guest with 2xGPUs + 6xNVLink + 4 cpus
> - GPU#0 from the guest is bound to the same cpu#0.
>
> Killing a guest killed ahci and therefore the host because of the race.
> Note that VFIO masks interrupts first and only then resets the device.
>
> Alternatives:
>
> 1. Fix xive_irq_chip::irq_shutdown() to walk through all cpu queues and
> drop deregistered irqs.
>
> 2. Exploit chip->irq_get_irqchip_state function from
> 62e0468650c30f0298 "genirq: Add optional hardware synchronization for shutdown".
>
> Both require deep XIVE knowledge which I do not have.
> ---
>   arch/powerpc/sysdev/xive/common.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index 082c7e1c20f0..65742e280337 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -148,8 +148,12 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
>   		irq = xive_read_eq(&xc->queue[prio], just_peek);
>   
>   		/* Found something ? That's it */
> -		if (irq)
> -			break;
> +		if (irq) {
> +			/* Another CPU may have shut this irq down, check it */
> +			if (irq_to_desc(irq))
> +				break;
> +			irq = 0;
> +		}
>   
>   		/* Clear pending bits */
>   		xc->pending_prio &= ~(1 << prio);
Cédric Le Goater July 14, 2019, 7:44 p.m. UTC | #7
On 14/07/2019 03:31, Benjamin Herrenschmidt wrote:
> On Sat, 2019-07-13 at 18:53 +1000, Alexey Kardashevskiy wrote:
>>
>> On 13/07/2019 09:47, Benjamin Herrenschmidt wrote:
>>> On Fri, 2019-07-12 at 19:37 +1000, Alexey Kardashevskiy wrote:
>>>>
>>>
>>>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/irq.c#n614
>>>>
>>>> If so, then in order to do EOI, I'll need the desc which is gone,
>>>> or
>>>> I am missing the point?
>>>
>>> All you need is drop the local CPU priority.
>>
>> I know that, the question was how. I cannot use irq_chip in
>> arch/powerpc/kernel/irq.c and I do not want to add
>> ppc_md.enable_irqs().
> 
> Well, best is probably to do just that though, but call it something
> like ppc_md.orphan_irq() or something like that instead. Another option
> as you mention is to try to scrub queues, but that's trickier to do due
> to the lockless nature of the queue handling.

When the IRQ is shutdown, couldn't we cleanup the CPU EQ by filtering 
all the dangling entries, and replacing them with zeroes ? That would
be alternative 1, but I don't think we need to scan all cpus. The last
target should be enough.

C.
Benjamin Herrenschmidt July 14, 2019, 10:48 p.m. UTC | #8
On Sun, 2019-07-14 at 21:44 +0200, Cédric Le Goater wrote:
> > Well, best is probably to do just that though, but call it something
> > like ppc_md.orphan_irq() or something like that instead. Another option
> > as you mention is to try to scrub queues, but that's trickier to do due
> > to the lockless nature of the queue handling.
> 
> When the IRQ is shutdown, couldn't we cleanup the CPU EQ by filtering 
> all the dangling entries, and replacing them with zeroes ? That would
> be alternative 1, but I don't think we need to scan all cpus. The last
> target should be enough.

It's a bit tricky due to the lockless nature of the queues...

Cheers,
Ben.
Alexey Kardashevskiy July 15, 2019, 2:27 a.m. UTC | #9
On 15/07/2019 05:14, Daniel Henrique Barboza wrote:
> This patch fixed an issue I was experiencing with virsh start/destroy
> of guests with mlx5 and GPU passthrough in a Power 9 server. I
> believe it's a similar situation which Alexey described in the post
> commit msg.
> 
> 
> Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com>


Thanks for testing! Unfortunately, this is not the right fix as it only 
reduces the race window but does not eliminate it, we need something 
better than this, i.e. do not backport this anywhere please :) Thanks,


> 
> 
> On 7/12/19 5:20 AM, Alexey Kardashevskiy wrote:
>> There is a race between releasing an irq on one cpu and fetching it
>> from XIVE on another cpu as there does not seem to be any locking between
>> these, probably because xive_irq_chip::irq_shutdown() is supposed to
>> remove the irq from all queues in the system which it does not do.
>>
>> As a result, when such released irq appears in a queue, we take it
>> from the queue but we do not change the current priority on that cpu and
>> since there is no handler for the irq, EOI is never called and the cpu
>> current priority remains elevated (7 vs. 0xff==unmasked). If another irq
>> is assigned to the same cpu, then that device stops working until irq
>> is moved to another cpu or the device is reset.
>>
>> This checks if irq is still registered, if not, it assumes no valid irq
>> was fetched from the loop and if there is none left, it continues to
>> the irq==0 case (not visible in this patch) and sets priority to 0xff
>> which is basically unmasking. This calls irq_to_desc() on a hot path now
>> which is a radix tree lookup; hopefully this won't be noticeable as
>> that tree is quite small.
>>
>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>> ---
>>
>> Found it on P9 system with:
>> - a host with 8 cpus online
>> - a boot disk on ahci with its msix on cpu#0
>> - a guest with 2xGPUs + 6xNVLink + 4 cpus
>> - GPU#0 from the guest is bound to the same cpu#0.
>>
>> Killing a guest killed ahci and therefore the host because of the race.
>> Note that VFIO masks interrupts first and only then resets the device.
>>
>> Alternatives:
>>
>> 1. Fix xive_irq_chip::irq_shutdown() to walk through all cpu queues and
>> drop deregistered irqs.
>>
>> 2. Exploit chip->irq_get_irqchip_state function from
>> 62e0468650c30f0298 "genirq: Add optional hardware synchronization for shutdown".
>>
>> Both require deep XIVE knowledge which I do not have.
>> ---
>>   arch/powerpc/sysdev/xive/common.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>> index 082c7e1c20f0..65742e280337 100644
>> --- a/arch/powerpc/sysdev/xive/common.c
>> +++ b/arch/powerpc/sysdev/xive/common.c
>> @@ -148,8 +148,12 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
>>   		irq = xive_read_eq(&xc->queue[prio], just_peek);
>>   
>>   		/* Found something ? That's it */
>> -		if (irq)
>> -			break;
>> +		if (irq) {
>> +			/* Another CPU may have shut this irq down, check it */
>> +			if (irq_to_desc(irq))
>> +				break;
>> +			irq = 0;
>> +		}
>>   
>>   		/* Clear pending bits */
>>   		xc->pending_prio &= ~(1 << prio);
>
diff mbox series

Patch

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 082c7e1c20f0..65742e280337 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -148,8 +148,12 @@  static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
 		irq = xive_read_eq(&xc->queue[prio], just_peek);
 
 		/* Found something ? That's it */
-		if (irq)
-			break;
+		if (irq) {
+			/* Another CPU may have shut this irq down, check it */
+			if (irq_to_desc(irq))
+				break;
+			irq = 0;
+		}
 
 		/* Clear pending bits */
 		xc->pending_prio &= ~(1 << prio);