diff mbox series

[2/4] powerpc/64: do not trace irqs-off at interrupt return to soft-disabled context

Message ID 20171116160052.18672-3-npiggin@gmail.com (mailing list archive)
State Accepted
Commit acb1feab320e38588fccc568e3767761f494976f
Headers show
Series interrupt tracing fixes | expand

Commit Message

Nicholas Piggin Nov. 16, 2017, 4 p.m. UTC
When an interrupt is returning to a soft-disabled context (which can
happen for non-maskable interrupts or synchronous interrupts), it goes
through the motions of soft-disabling again, including calling
TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).

This is not necessary, because we must already be soft-disabled in the
interrupt context, it also may be causing crashes in the irq tracing
code to re-enter as an nmi. Replace it with a warning to ensure that
soft-interrupts are still disabled.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/entry_64.S | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Comments

Michael Ellerman Dec. 4, 2017, 5:09 a.m. UTC | #1
Nicholas Piggin <npiggin@gmail.com> writes:

> When an interrupt is returning to a soft-disabled context (which can
> happen for non-maskable interrupts or synchronous interrupts), it goes
> through the motions of soft-disabling again, including calling
> TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).
>
> This is not necessary, because we must already be soft-disabled in the
> interrupt context, it also may be causing crashes in the irq tracing
> code to re-enter as an nmi. Replace it with a warning to ensure that
> soft-interrupts are still disabled.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/kernel/entry_64.S | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)

So this patch is the core of the bug fix I gather.

Git blames says:

  Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state getting out of sync")
  Cc: stable@vger.kernel.org # v3.4+

But I'm wondering how this has been broken that long without us
noticing? You hit it doing some sort of perf stress test I think - so is
it just that we've never pushed hard enough? Or did something change to
expose this? Or we're just not sure?

cheers

> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 3320bcac7192..36878b6ee8b8 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -911,9 +911,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
>  	beq	1f
>  	rlwinm	r7,r7,0,~PACA_IRQ_HARD_DIS
>  	stb	r7,PACAIRQHAPPENED(r13)
> -1:	li	r0,0
> -	stb	r0,PACASOFTIRQEN(r13);
> -	TRACE_DISABLE_INTS
> +1:
> +#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_BUG)
> +	/* The interrupt should not have soft enabled. */
> +	lbz	r7,PACASOFTIRQEN(r13)
> +1:	tdnei	r7,0
> +	EMIT_BUG_ENTRY 1b,__FILE__,__LINE__,BUGFLAG_WARNING
> +#endif
>  	b	.Ldo_restore
>  
>  	/*
> -- 
> 2.15.0
Nicholas Piggin Dec. 4, 2017, 6:07 a.m. UTC | #2
On Mon, 04 Dec 2017 16:09:57 +1100
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > When an interrupt is returning to a soft-disabled context (which can
> > happen for non-maskable interrupts or synchronous interrupts), it goes
> > through the motions of soft-disabling again, including calling
> > TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).
> >
> > This is not necessary, because we must already be soft-disabled in the
> > interrupt context, it also may be causing crashes in the irq tracing
> > code to re-enter as an nmi. Replace it with a warning to ensure that
> > soft-interrupts are still disabled.
> >
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
> >  arch/powerpc/kernel/entry_64.S | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)  
> 
> So this patch is the core of the bug fix I gather.
> 
> Git blames says:
> 
>   Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state getting out of sync")
>   Cc: stable@vger.kernel.org # v3.4+
> 
> But I'm wondering how this has been broken that long without us
> noticing? You hit it doing some sort of perf stress test I think - so is
> it just that we've never pushed hard enough? Or did something change to
> expose this? Or we're just not sure?

I'm not really sure. A customer hit it, during either a stress test or
long running workload with lockdep irq tracing and perf running at the
same time. I don't have a lot more details but we might be able to get
some offline if necessary.

Thanks,
Nick
Michael Ellerman Dec. 4, 2017, 12:55 p.m. UTC | #3
Nicholas Piggin <npiggin@gmail.com> writes:

> On Mon, 04 Dec 2017 16:09:57 +1100
> Michael Ellerman <mpe@ellerman.id.au> wrote:
>
>> Nicholas Piggin <npiggin@gmail.com> writes:
>> 
>> > When an interrupt is returning to a soft-disabled context (which can
>> > happen for non-maskable interrupts or synchronous interrupts), it goes
>> > through the motions of soft-disabling again, including calling
>> > TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).
>> >
>> > This is not necessary, because we must already be soft-disabled in the
>> > interrupt context, it also may be causing crashes in the irq tracing
>> > code to re-enter as an nmi. Replace it with a warning to ensure that
>> > soft-interrupts are still disabled.
>> >
>> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> > ---
>> >  arch/powerpc/kernel/entry_64.S | 10 +++++++---
>> >  1 file changed, 7 insertions(+), 3 deletions(-)  
>> 
>> So this patch is the core of the bug fix I gather.
>> 
>> Git blames says:
>> 
>>   Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state getting out of sync")
>>   Cc: stable@vger.kernel.org # v3.4+
>> 
>> But I'm wondering how this has been broken that long without us
>> noticing? You hit it doing some sort of perf stress test I think - so is
>> it just that we've never pushed hard enough? Or did something change to
>> expose this? Or we're just not sure?
>
> I'm not really sure. A customer hit it, during either a stress test or
> long running workload with lockdep irq tracing and perf running at the
> same time. I don't have a lot more details but we might be able to get
> some offline if necessary.

No worries.

I'll put it in and add it to my maybe-for-stable list. We can let it
soak a bit and then send it to stable if we think it's necessary and has
no issues.

cheers
Benjamin Herrenschmidt Dec. 4, 2017, 4:12 p.m. UTC | #4
On Mon, 2017-12-04 at 16:09 +1100, Michael Ellerman wrote:
> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > When an interrupt is returning to a soft-disabled context (which can
> > happen for non-maskable interrupts or synchronous interrupts), it goes
> > through the motions of soft-disabling again, including calling
> > TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).
> > 
> > This is not necessary, because we must already be soft-disabled in the
> > interrupt context, it also may be causing crashes in the irq tracing
> > code to re-enter as an nmi. Replace it with a warning to ensure that
> > soft-interrupts are still disabled.
> > 
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
> >  arch/powerpc/kernel/entry_64.S | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> So this patch is the core of the bug fix I gather.
> 
> Git blames says:
> 
>   Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state getting out of sync")
>   Cc: stable@vger.kernel.org # v3.4+
> 
> But I'm wondering how this has been broken that long without us
> noticing? You hit it doing some sort of perf stress test I think - so is
> it just that we've never pushed hard enough? Or did something change to
> expose this? Or we're just not sure?

We have some traps that do local_irq_enable ... you may want to double
check instruction emu, page faults, alignment etc... I wouldn't be
surprised if we have case where an interrupt "returns" soft enabled.

> cheers
> 
> > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> > index 3320bcac7192..36878b6ee8b8 100644
> > --- a/arch/powerpc/kernel/entry_64.S
> > +++ b/arch/powerpc/kernel/entry_64.S
> > @@ -911,9 +911,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
> >  	beq	1f
> >  	rlwinm	r7,r7,0,~PACA_IRQ_HARD_DIS
> >  	stb	r7,PACAIRQHAPPENED(r13)
> > -1:	li	r0,0
> > -	stb	r0,PACASOFTIRQEN(r13);
> > -	TRACE_DISABLE_INTS
> > +1:
> > +#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_BUG)
> > +	/* The interrupt should not have soft enabled. */
> > +	lbz	r7,PACASOFTIRQEN(r13)
> > +1:	tdnei	r7,0
> > +	EMIT_BUG_ENTRY 1b,__FILE__,__LINE__,BUGFLAG_WARNING
> > +#endif
> >  	b	.Ldo_restore
> >  
> >  	/*
> > -- 
> > 2.15.0
Michael Ellerman Dec. 5, 2017, 5:52 a.m. UTC | #5
Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> On Mon, 2017-12-04 at 16:09 +1100, Michael Ellerman wrote:
>> Nicholas Piggin <npiggin@gmail.com> writes:
>> 
>> > When an interrupt is returning to a soft-disabled context (which can
>> > happen for non-maskable interrupts or synchronous interrupts), it goes
>> > through the motions of soft-disabling again, including calling
>> > TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).
>> > 
>> > This is not necessary, because we must already be soft-disabled in the
>> > interrupt context, it also may be causing crashes in the irq tracing
>> > code to re-enter as an nmi. Replace it with a warning to ensure that
>> > soft-interrupts are still disabled.
>> > 
>> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> > ---
>> >  arch/powerpc/kernel/entry_64.S | 10 +++++++---
>> >  1 file changed, 7 insertions(+), 3 deletions(-)
>> 
>> So this patch is the core of the bug fix I gather.
>> 
>> Git blames says:
>> 
>>   Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state getting out of sync")
>>   Cc: stable@vger.kernel.org # v3.4+
>> 
>> But I'm wondering how this has been broken that long without us
>> noticing? You hit it doing some sort of perf stress test I think - so is
>> it just that we've never pushed hard enough? Or did something change to
>> expose this? Or we're just not sure?
>
> We have some traps that do local_irq_enable ... you may want to double
> check instruction emu, page faults, alignment etc... I wouldn't be
> surprised if we have case where an interrupt "returns" soft enabled.

AFAICT those all check that they're coming from a soft-enabled context:

	if (!arch_irq_disabled_regs(regs))
		local_irq_enable();


And the code Nick is patching is the case where we're returning to a
soft-disabled context. So I think the patch is good.

cheers
Michael Ellerman Dec. 12, 2017, 11:39 a.m. UTC | #6
On Thu, 2017-11-16 at 16:00:50 UTC, Nicholas Piggin wrote:
> When an interrupt is returning to a soft-disabled context (which can
> happen for non-maskable interrupts or synchronous interrupts), it goes
> through the motions of soft-disabling again, including calling
> TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).
> 
> This is not necessary, because we must already be soft-disabled in the
> interrupt context, it also may be causing crashes in the irq tracing
> code to re-enter as an nmi. Replace it with a warning to ensure that
> soft-interrupts are still disabled.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/acb1feab320e38588fccc568e37677

cheers
diff mbox series

Patch

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 3320bcac7192..36878b6ee8b8 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -911,9 +911,13 @@  END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	beq	1f
 	rlwinm	r7,r7,0,~PACA_IRQ_HARD_DIS
 	stb	r7,PACAIRQHAPPENED(r13)
-1:	li	r0,0
-	stb	r0,PACASOFTIRQEN(r13);
-	TRACE_DISABLE_INTS
+1:
+#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_BUG)
+	/* The interrupt should not have soft enabled. */
+	lbz	r7,PACASOFTIRQEN(r13)
+1:	tdnei	r7,0
+	EMIT_BUG_ENTRY 1b,__FILE__,__LINE__,BUGFLAG_WARNING
+#endif
 	b	.Ldo_restore
 
 	/*