
powerpc/perf: Fix PMU callbacks to clear pending PMI before resetting an overflown PMC

Message ID 1617720464-1651-2-git-send-email-atrajeev@linux.vnet.ibm.com (mailing list archive)
State Superseded
Series powerpc/perf: Fix PMU callbacks to clear pending PMI before resetting an overflown PMC

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch powerpc/merge (87d76f542a24ecfa797e9bd3bb56c0f19aabff57)
snowpatch_ozlabs/build-ppc64le success Build succeeded
snowpatch_ozlabs/build-ppc64be success Build succeeded
snowpatch_ozlabs/build-ppc64e success Build succeeded
snowpatch_ozlabs/build-pmac32 success Build succeeded
snowpatch_ozlabs/checkpatch warning total: 0 errors, 2 warnings, 1 checks, 119 lines checked
snowpatch_ozlabs/needsstable warning Please consider tagging this patch for stable!

Commit Message

Athira Rajeev April 6, 2021, 2:47 p.m. UTC
Running the perf fuzzer showed the below message in dmesg logs:
"Can't find PMC that caused IRQ"

This means a PMU exception happened, but none of the PMCs (Performance
Monitor Counters) were found to be overflown. There are some corner cases
that clear the PMCs after the PMI gets masked. In such cases, the perf
interrupt handler will not find the active PMC values that had caused
the overflow, and this leads to the above message while replaying the
interrupt.

Case 1: A PMU interrupt happens during replay of other interrupts and
counter values get cleared by PMU callbacks before the replay:

During replay of interrupts like timer, __do_irq and doorbell exception, we
conditionally enable interrupts via may_hard_irq_enable(). This could
potentially create a window in which a PMI is generated. Since the irq soft
mask is set to ALL_DISABLED, the PMI will get masked here. IPIs could then
run before the perf interrupt is replayed, and the PMU events could be
deleted or stopped. This changes the PMU SPR values and resets the counters.
Snippet of the ftrace log showing PMU callbacks invoked in "__do_irq":

<idle>-0 [051] dns. 132025441306354: __do_irq <-call_do_irq
<idle>-0 [051] dns. 132025441306430: irq_enter <-__do_irq
<idle>-0 [051] dns. 132025441306503: irq_enter_rcu <-__do_irq
<idle>-0 [051] dnH. 132025441306599: xive_get_irq <-__do_irq
<<>>
<idle>-0 [051] dnH. 132025441307770: generic_smp_call_function_single_interrupt <-smp_ipi_demux_relaxed
<idle>-0 [051] dnH. 132025441307839: flush_smp_call_function_queue <-smp_ipi_demux_relaxed
<idle>-0 [051] dnH. 132025441308057: _raw_spin_lock <-event_function
<idle>-0 [051] dnH. 132025441308206: power_pmu_disable <-perf_pmu_disable
<idle>-0 [051] dnH. 132025441308337: power_pmu_del <-event_sched_out
<idle>-0 [051] dnH. 132025441308407: power_pmu_read <-power_pmu_del
<idle>-0 [051] dnH. 132025441308477: read_pmc <-power_pmu_read
<idle>-0 [051] dnH. 132025441308590: isa207_disable_pmc <-power_pmu_del
<idle>-0 [051] dnH. 132025441308663: write_pmc <-power_pmu_del
<idle>-0 [051] dnH. 132025441308787: power_pmu_event_idx <-perf_event_update_userpage
<idle>-0 [051] dnH. 132025441308859: rcu_read_unlock_strict <-perf_event_update_userpage
<idle>-0 [051] dnH. 132025441308975: power_pmu_enable <-perf_pmu_enable
<<>>
<idle>-0 [051] dnH. 132025441311108: irq_exit <-__do_irq
<idle>-0 [051] dns. 132025441311319: performance_monitor_exception <-replay_soft_interrupts

Case 2: PMIs masked during local_* operations, for example local_add.
If the local_add operation happens within a local_irq_save, replay of
the PMI will happen during local_irq_restore. Similar to case 1, this
could also create a window before the replay in which PMU events get
deleted or stopped.
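
For illustration, a minimal sketch of this case 2 window (generic names,
not taken from the actual workload):

	local_irq_save(flags);		/* interrupts soft-disabled */
	local_add(1, &cnt);		/* PMC overflows here; the PMI is masked */
	/*
	 * Before the masked PMI is replayed, an IPI can run PMU callbacks
	 * that delete or stop the event and reset the overflown PMC.
	 */
	local_irq_restore(flags);	/* PMI replayed, but the PMC is already clear */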

The patch updates the PMU callback functions (del, stop, enable) to
check for a pending perf interrupt. If there is an overflown PMC and a
pending perf interrupt indicated in the PACA, clear the PMI bit in the
PACA to drop that sample. In the case of power_pmu_del, also clear the
MMCR0 PMAO bit, which otherwise could lead to spurious interrupts in some
corner cases. For example, a timer interrupt after power_pmu_del will
re-enable interrupts (since the PMI is cleared) and trigger a PMI again
since the PMAO bit is still set.

We can't just replay the PMI at any time, hence this approach is preferred
over replaying the PMI before resetting the overflown PMC. The patch also
adds a comment in core-book3s documenting a race condition which can
trigger these "Can't find PMC" messages in the idle path on PowerNV.

Fixes: f442d004806e ("powerpc/64s: Add support to mask perf interrupts and replay them")
Reported-by: Nageswara R Sastry <nasastry@in.ibm.com>
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Suggested-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pmc.h  | 11 +++++++++
 arch/powerpc/perf/core-book3s.c | 55 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

Comments

Nicholas Piggin April 9, 2021, 1:08 a.m. UTC | #1
I was going to nitpick "overflown" here as something birds do, but some
sources say overflown is okay for the past tense.

You could use "overflowed" for that, but I understand the issue with the
word: you are talking about counters that are currently in an "overflow"
state, but the overflow occurred in the past and is not still happening,
so "overflowing" doesn't exactly fit either.

"overflown" kind of works; for some reason you can kind of use it for the
present tense!

Excerpts from Athira Rajeev's message of April 7, 2021 12:47 am:
> Running perf fuzzer showed below in dmesg logs:
> "Can't find PMC that caused IRQ"
> 
> This means a PMU exception happened, but none of the PMC's (Performance
> Monitor Counter) were found to be overflown. There are some corner cases
> that clears the PMCs after PMI gets masked. In such cases, the perf
> interrupt handler will not find the active PMC values that had caused
> the overflow and thus leads to this message while replaying.
> 
> Case 1: PMU Interrupt happens during replay of other interrupts and
> counter values gets cleared by PMU callbacks before replay:
> 
> During replay of interrupts like timer, __do_irq and doorbell exception, we
> conditionally enable interrupts via may_hard_irq_enable(). This could
> potentially create a window to generate a PMI. Since irq soft mask is set
> to ALL_DISABLED, the PMI will get masked here.

I wonder if may_hard_irq_enable shouldn't hard-enable if a PMI is soft
disabled. And also maybe replay should not set ALL_DISABLED if
there are no PMI interrupts pending.
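
i.e. something along these lines in the replay path, as an untested sketch
just to show the idea:

	/* don't hard-enable MSR[EE] while a masked PMI is still pending */
	if (!(get_paca()->irq_happened & PACA_IRQ_PMI))
		__hard_irq_enable();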

Still, I think those are a bit more tricky and might take a while
to get right or just not be worthwhile, so I think your patch is
fine.

> We could get IPIs run before
> perf interrupt is replayed and the PMU events could deleted or stopped.
> This will change the PMU SPR values and resets the counters. Snippet of
> ftrace log showing PMU callbacks invoked in "__do_irq":
> 
> <idle>-0 [051] dns. 132025441306354: __do_irq <-call_do_irq
> <idle>-0 [051] dns. 132025441306430: irq_enter <-__do_irq
> <idle>-0 [051] dns. 132025441306503: irq_enter_rcu <-__do_irq
> <idle>-0 [051] dnH. 132025441306599: xive_get_irq <-__do_irq
> <<>>
> <idle>-0 [051] dnH. 132025441307770: generic_smp_call_function_single_interrupt <-smp_ipi_demux_relaxed
> <idle>-0 [051] dnH. 132025441307839: flush_smp_call_function_queue <-smp_ipi_demux_relaxed
> <idle>-0 [051] dnH. 132025441308057: _raw_spin_lock <-event_function
> <idle>-0 [051] dnH. 132025441308206: power_pmu_disable <-perf_pmu_disable
> <idle>-0 [051] dnH. 132025441308337: power_pmu_del <-event_sched_out
> <idle>-0 [051] dnH. 132025441308407: power_pmu_read <-power_pmu_del
> <idle>-0 [051] dnH. 132025441308477: read_pmc <-power_pmu_read
> <idle>-0 [051] dnH. 132025441308590: isa207_disable_pmc <-power_pmu_del
> <idle>-0 [051] dnH. 132025441308663: write_pmc <-power_pmu_del
> <idle>-0 [051] dnH. 132025441308787: power_pmu_event_idx <-perf_event_update_userpage
> <idle>-0 [051] dnH. 132025441308859: rcu_read_unlock_strict <-perf_event_update_userpage
> <idle>-0 [051] dnH. 132025441308975: power_pmu_enable <-perf_pmu_enable
> <<>>
> <idle>-0 [051] dnH. 132025441311108: irq_exit <-__do_irq
> <idle>-0 [051] dns. 132025441311319: performance_monitor_exception <-replay_soft_interrupts
> 
> Case 2: PMI's masked during local_* operations, example local_add.
> If the local_add operation happens within a local_irq_save, replay of
> PMI will be during local_irq_restore. Similar to case 1, this could
> also create a window before replay where PMU events gets deleted or
> stopped.

Here as well perhaps PMIs should be replayed if they are unmasked
even if other interrupts are still masked. Again that might be more
complexity than it's worth.

> 
> Patch adds a fix to update the PMU callback functions (del,stop,enable) to
> check for pending perf interrupt. If there is an overflown PMC and pending
> perf interrupt indicated in Paca, clear the PMI bit in paca to drop that
> sample. In case of power_pmu_del, also clear the MMCR0 PMAO bit which
> otherwise could lead to spurious interrupts in some corner cases. Example,
> a timer after power_pmu_del which will re-enable interrupts since PMI is
> cleared and triggers a PMI again since PMAO bit is still set.
> 
> We can't just replay PMI any time. Hence this approach is preferred rather
> than replaying PMI before resetting overflown PMC. Patch also documents
> core-book3s on a race condition which can trigger these PMC messages during
> idle path in PowerNV.
> 
> Fixes: f442d004806e ("powerpc/64s: Add support to mask perf interrupts and replay them")
> Reported-by: Nageswara R Sastry <nasastry@in.ibm.com>
> Suggested-by: Nicholas Piggin <npiggin@gmail.com>
> Suggested-by: Madhavan Srinivasan <maddy@linux.ibm.com>
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/pmc.h  | 11 +++++++++
>  arch/powerpc/perf/core-book3s.c | 55 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 66 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/pmc.h b/arch/powerpc/include/asm/pmc.h
> index c6bbe9778d3c..97b4bd8de25b 100644
> --- a/arch/powerpc/include/asm/pmc.h
> +++ b/arch/powerpc/include/asm/pmc.h
> @@ -34,11 +34,22 @@ static inline void ppc_set_pmu_inuse(int inuse)
>  #endif
>  }
>  
> +static inline int clear_paca_irq_pmi(void)
> +{
> +	if (get_paca()->irq_happened & PACA_IRQ_PMI) {
> +		WARN_ON_ONCE(mfmsr() & MSR_EE);
> +		get_paca()->irq_happened &= ~PACA_IRQ_PMI;
> +		return 1;
> +	}
> +	return 0;
> +}

Could you put this in arch/powerpc/include/asm/hw_irq.h and,
rather than paca_irq, call it irq_pending perhaps:

  clear_pmi_irq_pending()

  get_clear_pmi_irq_pending() if you're also testing it.

Could you add a little comment about the corner cases above it too?
The root cause seems to be that interrupt replay while a masked PMI is
pending can result in other interrupts arriving which clear the PMU
overflow, so the pending PMI must be cleared.
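
A rough sketch of what that could look like in hw_irq.h (same body as your
clear_paca_irq_pmi() above, just renamed and with a comment):

	/*
	 * Interrupt replay while a masked PMI is pending can run PMU
	 * callbacks that clear the overflown PMC, so the stale pending
	 * PMI has to be dropped as well.
	 */
	static inline bool get_clear_pmi_irq_pending(void)
	{
		if (get_paca()->irq_happened & PACA_IRQ_PMI) {
			WARN_ON_ONCE(mfmsr() & MSR_EE);
			get_paca()->irq_happened &= ~PACA_IRQ_PMI;
			return true;
		}
		return false;
	}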

> +
>  extern void power4_enable_pmcs(void);
>  
>  #else /* CONFIG_PPC64 */
>  
>  static inline void ppc_set_pmu_inuse(int inuse) { }
> +static inline int clear_paca_irq_pmi(void) { return 0; }
>  
>  #endif
>  
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 766f064f00fb..18ca3c90f866 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -847,6 +847,20 @@ static void write_pmc(int idx, unsigned long val)
>  	}
>  }
>  
> +static int pmc_overflown(int idx)
> +{
> +	unsigned long val[8];
> +	int i;
> +
> +	for (i = 0; i < ppmu->n_counter; i++)
> +		val[i] = read_pmc(i + 1);
> +
> +	if ((int)val[idx-1] < 0)
> +		return 1;
> +
> +	return 0;
> +}
> +
>  /* Called from sysrq_handle_showregs() */
>  void perf_event_print_debug(void)
>  {
> @@ -1438,6 +1452,15 @@ static void power_pmu_enable(struct pmu *pmu)
>  		event = cpuhw->event[i];
>  		if (event->hw.idx && event->hw.idx != hwc_index[i] + 1) {
>  			power_pmu_read(event);
> +			/*
> +			 * if the PMC corresponding to event->hw.idx is
> +			 * overflown, check if there is any pending perf
> +			 * interrupt set in paca. If so, disable the interrupt
> +			 * by clearing the paca bit for PMI since we are going
> +			 * to reset the PMC.
> +			 */
> +			if (pmc_overflown(event->hw.idx))
> +				clear_paca_irq_pmi();

If the PMC is not overflown, could there still be a PMI pending?

>  			write_pmc(event->hw.idx, 0);
>  			event->hw.idx = 0;
>  		}
> @@ -1474,6 +1497,10 @@ static void power_pmu_enable(struct pmu *pmu)
>  		event->hw.idx = idx;
>  		if (event->hw.state & PERF_HES_STOPPED)
>  			val = 0;
> +
> +		/* See above for clear_paca_irq_pmi */
> +		if (pmc_overflown(event->hw.idx))
> +			clear_paca_irq_pmi();
>  		write_pmc(idx, val);
>  
>  		perf_event_update_userpage(event);
> @@ -1619,6 +1646,7 @@ static void power_pmu_del(struct perf_event *event, int ef_flags)
>  	struct cpu_hw_events *cpuhw;
>  	long i;
>  	unsigned long flags;
> +	unsigned long val_mmcr0;
>  
>  	local_irq_save(flags);
>  	perf_pmu_disable(event->pmu);
> @@ -1636,6 +1664,22 @@ static void power_pmu_del(struct perf_event *event, int ef_flags)
>  			--cpuhw->n_events;
>  			ppmu->disable_pmc(event->hw.idx - 1, &cpuhw->mmcr);
>  			if (event->hw.idx) {
> +				/*
> +				 * if the PMC corresponding to event->hw.idx is
> +				 * overflown, check if there is any pending perf
> +				 * interrupt set in paca. If so, disable the interrupt
> +				 * and clear the MMCR0 PMAO bit since we are going
> +				 * to reset the PMC and delete the event.
> +				 */
> +				if (pmc_overflown(event->hw.idx)) {
> +					if (clear_paca_irq_pmi()) {
> +						val_mmcr0 = mfspr(SPRN_MMCR0);
> +						val_mmcr0 &= ~MMCR0_PMAO;
> +						write_mmcr0(cpuhw, val_mmcr0);
> +						mb();
> +						isync();

I don't know the perf subsystem, but just out of curiosity, why does
MMCR0 need to be cleared only in this case? What if we disabled MSR[EE]
right before a perf interrupt came in, so we don't get a pending PMI
but the condition is still close to the same?

> +				}
>  				write_pmc(event->hw.idx, 0);
>  				event->hw.idx = 0;
>  			}
> @@ -1714,6 +1758,8 @@ static void power_pmu_stop(struct perf_event *event, int ef_flags)
>  
>  	local_irq_save(flags);
>  	perf_pmu_disable(event->pmu);
> +	if (pmc_overflown(event->hw.idx))
> +		clear_paca_irq_pmi();
>  
>  	power_pmu_read(event);
>  	event->hw.state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
> @@ -2343,6 +2389,15 @@ static void __perf_event_interrupt(struct pt_regs *regs)
>  			}
>  		}
>  	}
> +
> +	/*
> +	 * During system wide profling or while specific CPU
> +	 * is monitored for an event, some corner cases could
> +	 * cause PMC to overflow in idle path. This will trigger
> +	 * a PMI after waking up from idle. Since counter values
> +	 * are _not_ saved/restored in idle path, can lead to
> +	 * below "Can't find PMC" message.
> +	 */
>  	if (unlikely(!found) && !arch_irq_disabled_regs(regs))
>  		printk_ratelimited(KERN_WARNING "Can't find PMC that caused IRQ\n");
>  
> -- 
> 1.8.3.1
> 
>
Athira Rajeev April 9, 2021, 12:53 p.m. UTC | #2
> On 09-Apr-2021, at 6:38 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
Hi Nick,

Thanks for checking the patch and sharing review comments.

> I was going to nitpick "overflown" here as something birds do, but some
> sources says overflown is okay for past tense.
> 
> You could use "overflowed" for that, but I understand the issue with the 
> word: you are talking about counters that are currently in an "overflow" 
> state, but the overflow occurred in the past and is not still happening
> so you "overflowing" doesn't exactly fit either.
> 
> overflown kind of works for some reason you can kind of use it for
> present tense!

Ok sure. Yes, the counter is currently in an “overflow” state.

> 
> Excerpts from Athira Rajeev's message of April 7, 2021 12:47 am:
>> Running perf fuzzer showed below in dmesg logs:
>> "Can't find PMC that caused IRQ"
>> 
>> This means a PMU exception happened, but none of the PMC's (Performance
>> Monitor Counter) were found to be overflown. There are some corner cases
>> that clears the PMCs after PMI gets masked. In such cases, the perf
>> interrupt handler will not find the active PMC values that had caused
>> the overflow and thus leads to this message while replaying.
>> 
>> Case 1: PMU Interrupt happens during replay of other interrupts and
>> counter values gets cleared by PMU callbacks before replay:
>> 
>> During replay of interrupts like timer, __do_irq and doorbell exception, we
>> conditionally enable interrupts via may_hard_irq_enable(). This could
>> potentially create a window to generate a PMI. Since irq soft mask is set
>> to ALL_DISABLED, the PMI will get masked here.
> 
> I wonder if may_hard_irq_enable shouldn't enable if PMI is soft
> disabled. And also maybe replay should not set ALL_DISABLED if
> there are no PMI interrupts pending.
> 
> Still, I think those are a bit more tricky and might take a while
> to get right or just not be worth while, so I think your patch is
> fine.

Ok Nick.
> 
>> We could get IPIs run before
>> perf interrupt is replayed and the PMU events could deleted or stopped.
>> This will change the PMU SPR values and resets the counters. Snippet of
>> ftrace log showing PMU callbacks invoked in "__do_irq":
>> 
>> <idle>-0 [051] dns. 132025441306354: __do_irq <-call_do_irq
>> <idle>-0 [051] dns. 132025441306430: irq_enter <-__do_irq
>> <idle>-0 [051] dns. 132025441306503: irq_enter_rcu <-__do_irq
>> <idle>-0 [051] dnH. 132025441306599: xive_get_irq <-__do_irq
>> <<>>
>> <idle>-0 [051] dnH. 132025441307770: generic_smp_call_function_single_interrupt <-smp_ipi_demux_relaxed
>> <idle>-0 [051] dnH. 132025441307839: flush_smp_call_function_queue <-smp_ipi_demux_relaxed
>> <idle>-0 [051] dnH. 132025441308057: _raw_spin_lock <-event_function
>> <idle>-0 [051] dnH. 132025441308206: power_pmu_disable <-perf_pmu_disable
>> <idle>-0 [051] dnH. 132025441308337: power_pmu_del <-event_sched_out
>> <idle>-0 [051] dnH. 132025441308407: power_pmu_read <-power_pmu_del
>> <idle>-0 [051] dnH. 132025441308477: read_pmc <-power_pmu_read
>> <idle>-0 [051] dnH. 132025441308590: isa207_disable_pmc <-power_pmu_del
>> <idle>-0 [051] dnH. 132025441308663: write_pmc <-power_pmu_del
>> <idle>-0 [051] dnH. 132025441308787: power_pmu_event_idx <-perf_event_update_userpage
>> <idle>-0 [051] dnH. 132025441308859: rcu_read_unlock_strict <-perf_event_update_userpage
>> <idle>-0 [051] dnH. 132025441308975: power_pmu_enable <-perf_pmu_enable
>> <<>>
>> <idle>-0 [051] dnH. 132025441311108: irq_exit <-__do_irq
>> <idle>-0 [051] dns. 132025441311319: performance_monitor_exception <-replay_soft_interrupts
>> 
>> Case 2: PMI's masked during local_* operations, example local_add.
>> If the local_add operation happens within a local_irq_save, replay of
>> PMI will be during local_irq_restore. Similar to case 1, this could
>> also create a window before replay where PMU events gets deleted or
>> stopped.
> 
> Here as well perhaps PMIs should be replayed if they are unmasked
> even if other interrupts are still masked. Again that might be more
> complexity than it's worth.
Ok..

> 
>> 
>> Patch adds a fix to update the PMU callback functions (del,stop,enable) to
>> check for pending perf interrupt. If there is an overflown PMC and pending
>> perf interrupt indicated in Paca, clear the PMI bit in paca to drop that
>> sample. In case of power_pmu_del, also clear the MMCR0 PMAO bit which
>> otherwise could lead to spurious interrupts in some corner cases. Example,
>> a timer after power_pmu_del which will re-enable interrupts since PMI is
>> cleared and triggers a PMI again since PMAO bit is still set.
>> 
>> We can't just replay PMI any time. Hence this approach is preferred rather
>> than replaying PMI before resetting overflown PMC. Patch also documents
>> core-book3s on a race condition which can trigger these PMC messages during
>> idle path in PowerNV.
>> 
>> Fixes: f442d004806e ("powerpc/64s: Add support to mask perf interrupts and replay them")
>> Reported-by: Nageswara R Sastry <nasastry@in.ibm.com>
>> Suggested-by: Nicholas Piggin <npiggin@gmail.com>
>> Suggested-by: Madhavan Srinivasan <maddy@linux.ibm.com>
>> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/include/asm/pmc.h  | 11 +++++++++
>> arch/powerpc/perf/core-book3s.c | 55 +++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 66 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/pmc.h b/arch/powerpc/include/asm/pmc.h
>> index c6bbe9778d3c..97b4bd8de25b 100644
>> --- a/arch/powerpc/include/asm/pmc.h
>> +++ b/arch/powerpc/include/asm/pmc.h
>> @@ -34,11 +34,22 @@ static inline void ppc_set_pmu_inuse(int inuse)
>> #endif
>> }
>> 
>> +static inline int clear_paca_irq_pmi(void)
>> +{
>> +	if (get_paca()->irq_happened & PACA_IRQ_PMI) {
>> +		WARN_ON_ONCE(mfmsr() & MSR_EE);
>> +		get_paca()->irq_happened &= ~PACA_IRQ_PMI;
>> +		return 1;
>> +	}
>> +	return 0;
>> +}
> 
> Could you put this in arch/powerpc/include/asm/hw_irq.h and
> rather than paca_irq, call it irq_pending perhaps
> 
>  clear_pmi_irq_pending()
> 
>  get_clear_pmi_irq_pending() if you're also testing it.

Sure, I will use “get_clear_pmi_irq_pending()” and try moving this to arch/powerpc/include/asm/hw_irq.h.

> 
> Could you add a little comment about the corner cases above it too?
> The root cause seem to be interrupt replay while a masked PMI is
> pending can result in other interrupts arriving which clear the PMU
> overflow so the pending PMI must be cleared.

Ok, I will add a comment and fix this in the next version.

> 
>> +
>> extern void power4_enable_pmcs(void);
>> 
>> #else /* CONFIG_PPC64 */
>> 
>> static inline void ppc_set_pmu_inuse(int inuse) { }
>> +static inline int clear_paca_irq_pmi(void) { return 0; }
>> 
>> #endif
>> 
>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>> index 766f064f00fb..18ca3c90f866 100644
>> --- a/arch/powerpc/perf/core-book3s.c
>> +++ b/arch/powerpc/perf/core-book3s.c
>> @@ -847,6 +847,20 @@ static void write_pmc(int idx, unsigned long val)
>> 	}
>> }
>> 
>> +static int pmc_overflown(int idx)
>> +{
>> +	unsigned long val[8];
>> +	int i;
>> +
>> +	for (i = 0; i < ppmu->n_counter; i++)
>> +		val[i] = read_pmc(i + 1);
>> +
>> +	if ((int)val[idx-1] < 0)
>> +		return 1;
>> +
>> +	return 0;
>> +}
>> +
>> /* Called from sysrq_handle_showregs() */
>> void perf_event_print_debug(void)
>> {
>> @@ -1438,6 +1452,15 @@ static void power_pmu_enable(struct pmu *pmu)
>> 		event = cpuhw->event[i];
>> 		if (event->hw.idx && event->hw.idx != hwc_index[i] + 1) {
>> 			power_pmu_read(event);
>> +			/*
>> +			 * if the PMC corresponding to event->hw.idx is
>> +			 * overflown, check if there is any pending perf
>> +			 * interrupt set in paca. If so, disable the interrupt
>> +			 * by clearing the paca bit for PMI since we are going
>> +			 * to reset the PMC.
>> +			 */
>> +			if (pmc_overflown(event->hw.idx))
>> +				clear_paca_irq_pmi();
> 
> If the pmc is not overflown, could there still be a PMI pending?

I didn’t hit the scenario where a PMI is pending without an overflown PMC.
Also, I believe if such a case happens, it will need investigation; it could be a different case to be handled.

I actually considered the below two points for adding this PMC check instead of just clearing the PMI:

1. Make sure we are not masking any bug here by just clearing PACA_IRQ_PMI.
Ideally, if PMI is set in irq_happened, it means there was a counter overflow.
2. If there is more than one PMU event, say two events, make sure we are clearing the PMI only for the
event whose counter is overflown.
 
> 
>> 			write_pmc(event->hw.idx, 0);
>> 			event->hw.idx = 0;
>> 		}
>> @@ -1474,6 +1497,10 @@ static void power_pmu_enable(struct pmu *pmu)
>> 		event->hw.idx = idx;
>> 		if (event->hw.state & PERF_HES_STOPPED)
>> 			val = 0;
>> +
>> +		/* See above for clear_paca_irq_pmi */
>> +		if (pmc_overflown(event->hw.idx))
>> +			clear_paca_irq_pmi();
>> 		write_pmc(idx, val);
>> 
>> 		perf_event_update_userpage(event);
>> @@ -1619,6 +1646,7 @@ static void power_pmu_del(struct perf_event *event, int ef_flags)
>> 	struct cpu_hw_events *cpuhw;
>> 	long i;
>> 	unsigned long flags;
>> +	unsigned long val_mmcr0;
>> 
>> 	local_irq_save(flags);
>> 	perf_pmu_disable(event->pmu);
>> @@ -1636,6 +1664,22 @@ static void power_pmu_del(struct perf_event *event, int ef_flags)
>> 			--cpuhw->n_events;
>> 			ppmu->disable_pmc(event->hw.idx - 1, &cpuhw->mmcr);
>> 			if (event->hw.idx) {
>> +				/*
>> +				 * if the PMC corresponding to event->hw.idx is
>> +				 * overflown, check if there is any pending perf
>> +				 * interrupt set in paca. If so, disable the interrupt
>> +				 * and clear the MMCR0 PMAO bit since we are going
>> +				 * to reset the PMC and delete the event.
>> +				 */
>> +				if (pmc_overflown(event->hw.idx)) {
>> +					if (clear_paca_irq_pmi()) {
>> +						val_mmcr0 = mfspr(SPRN_MMCR0);
>> +						val_mmcr0 &= ~MMCR0_PMAO;
>> +						write_mmcr0(cpuhw, val_mmcr0);
>> +						mb();
>> +						isync();
> 
> I don't know the perf subsystem, but just out of curiosity why does
> MMCR0 need to be cleared only in this case?

I hit a corner case in power_pmu_del when only clearing PACA_IRQ_PMI and not resetting the MMCR0 PMAO bit.
Here is the flow:

1. We clear the PMI bit in the PACA, but MMCR0 still has the PMAO bit set. The PMAO bit indicates a PMI has occurred.
2. A timer interrupt is replayed after power_pmu_del, which does a “may_hard_irq_enable”.
This re-enables interrupts and triggers a PMI again since the PMAO bit is still set.

So we clear the PMAO bit to avoid such spurious interrupts.
Ftrace logs showing the same, with some debug trace_printks:

     <idle>-0    [134] d.h. 327287888478: power_pmu_del <-event_sched_out.isra.126
     <<>>    Here we cleared the PMI
     <idle>-0    [134] d.h. 327287889272: write_pmc <-power_pmu_del
     <idle>-0    [134] d.h. 327287889346: rcu_read_unlock_strict <-perf_event_update_userpage
     <idle>-0    [134] d.h. 327287889711: power_pmu_del: In power_pmu_del MMCR0 is 82004090, local_paca->irq_happened is 9
     <idle>-0    [134] d.h. 327287889811: power_pmu_enable <-perf_pmu_enable
     <idle>-0    [134] d.h. 327287889982: irq_exit <-doorbell_exception
     <idle>-0    [134] d... 327287890053: idle_cpu <-irq_exit
     <idle>-0    [134] d... 327287890158: tick_nohz_irq_exit <-irq_exit
     <idle>-0    [134] d... 327287890219: ktime_get <-tick_nohz_irq_exit
     <idle>-0    [134] d... 327287890328: replay_soft_interrupts <-interrupt_exit_kernel_prepare
     <idle>-0    [134] d... 327287890399: irq_enter <-timer_interrupt
     <<>>
     <idle>-0    [134] d.h. 327287891163: timer_interrupt: Before may_hard_irq_enable MMCR0 is 82004090, local_paca->irq_happened is 1
     <<>>
     <idle>-0    [134] d.h. 327287894310: timer_interrupt: After may_hard_irq_enable MMCR0 is 82004090, local_paca->irq_happened is 21

In the case of other callbacks like pmu enable, we are programming MMCR0. But in the case of an event getting deleted, there is
no way to clear PMAO unless an event gets scheduled again on that CPU. Hence this check was added only in the pmu_del callback.
 

> What if we disabled MSR[EE]
> right before a perf interrupt came in, so we don't get a pending PMI
> but the condition is still close to the same.

Nick, I didn’t get this question exactly. Can you please help explain a bit?
From my understanding, consider that we disabled MSR[EE] before the perf interrupt came in.
Then, once interrupts are re-enabled:

1. If the soft mask is set to IRQS_DISABLED, the perf interrupt will be triggered as an NMI.
2. In the case of ALL_DISABLED, it will be masked for replay. If PMU callbacks are invoked before the replay,
the present patch will take care of clearing the PMI in those corner cases.

Thanks
Athira.
> 
>> +				}
>> 				write_pmc(event->hw.idx, 0);
>> 				event->hw.idx = 0;
>> 			}
>> @@ -1714,6 +1758,8 @@ static void power_pmu_stop(struct perf_event *event, int ef_flags)
>> 
>> 	local_irq_save(flags);
>> 	perf_pmu_disable(event->pmu);
>> +	if (pmc_overflown(event->hw.idx))
>> +		clear_paca_irq_pmi();
>> 
>> 	power_pmu_read(event);
>> 	event->hw.state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
>> @@ -2343,6 +2389,15 @@ static void __perf_event_interrupt(struct pt_regs *regs)
>> 			}
>> 		}
>> 	}
>> +
>> +	/*
>> +	 * During system wide profling or while specific CPU
>> +	 * is monitored for an event, some corner cases could
>> +	 * cause PMC to overflow in idle path. This will trigger
>> +	 * a PMI after waking up from idle. Since counter values
>> +	 * are _not_ saved/restored in idle path, can lead to
>> +	 * below "Can't find PMC" message.
>> +	 */
>> 	if (unlikely(!found) && !arch_irq_disabled_regs(regs))
>> 		printk_ratelimited(KERN_WARNING "Can't find PMC that caused IRQ\n");
>> 
>> -- 
>> 1.8.3.1
Nicholas Piggin April 12, 2021, 3:08 a.m. UTC | #3
Excerpts from Athira Rajeev's message of April 9, 2021 10:53 pm:
> 
> 
>> On 09-Apr-2021, at 6:38 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>> 
> Hi Nick,
> 
> Thanks for checking the patch and sharing review comments.
> 
>> I was going to nitpick "overflown" here as something birds do, but some
>> sources says overflown is okay for past tense.
>> 
>> You could use "overflowed" for that, but I understand the issue with the 
>> word: you are talking about counters that are currently in an "overflow" 
>> state, but the overflow occurred in the past and is not still happening
>> so you "overflowing" doesn't exactly fit either.
>> 
>> overflown kind of works for some reason you can kind of use it for
>> present tense!
> 
> Ok sure, Yes counter is currently in an “overflow” state.
> 
>> 
>> Excerpts from Athira Rajeev's message of April 7, 2021 12:47 am:
>>> Running perf fuzzer showed below in dmesg logs:
>>> "Can't find PMC that caused IRQ"
>>> 
>>> This means a PMU exception happened, but none of the PMC's (Performance
>>> Monitor Counter) were found to be overflown. There are some corner cases
>>> that clears the PMCs after PMI gets masked. In such cases, the perf
>>> interrupt handler will not find the active PMC values that had caused
>>> the overflow and thus leads to this message while replaying.
>>> 
>>> Case 1: PMU Interrupt happens during replay of other interrupts and
>>> counter values gets cleared by PMU callbacks before replay:
>>> 
>>> During replay of interrupts like timer, __do_irq and doorbell exception, we
>>> conditionally enable interrupts via may_hard_irq_enable(). This could
>>> potentially create a window to generate a PMI. Since irq soft mask is set
>>> to ALL_DISABLED, the PMI will get masked here.
>> 
>> I wonder if may_hard_irq_enable shouldn't enable if PMI is soft
>> disabled. And also maybe replay should not set ALL_DISABLED if
>> there are no PMI interrupts pending.
>> 
>> Still, I think those are a bit more tricky and might take a while
>> to get right or just not be worth while, so I think your patch is
>> fine.
> 
> Ok Nick.
>> 
>>> We could get IPIs run before
>>> perf interrupt is replayed and the PMU events could deleted or stopped.
>>> This will change the PMU SPR values and resets the counters. Snippet of
>>> ftrace log showing PMU callbacks invoked in "__do_irq":
>>> 
>>> <idle>-0 [051] dns. 132025441306354: __do_irq <-call_do_irq
>>> <idle>-0 [051] dns. 132025441306430: irq_enter <-__do_irq
>>> <idle>-0 [051] dns. 132025441306503: irq_enter_rcu <-__do_irq
>>> <idle>-0 [051] dnH. 132025441306599: xive_get_irq <-__do_irq
>>> <<>>
>>> <idle>-0 [051] dnH. 132025441307770: generic_smp_call_function_single_interrupt <-smp_ipi_demux_relaxed
>>> <idle>-0 [051] dnH. 132025441307839: flush_smp_call_function_queue <-smp_ipi_demux_relaxed
>>> <idle>-0 [051] dnH. 132025441308057: _raw_spin_lock <-event_function
>>> <idle>-0 [051] dnH. 132025441308206: power_pmu_disable <-perf_pmu_disable
>>> <idle>-0 [051] dnH. 132025441308337: power_pmu_del <-event_sched_out
>>> <idle>-0 [051] dnH. 132025441308407: power_pmu_read <-power_pmu_del
>>> <idle>-0 [051] dnH. 132025441308477: read_pmc <-power_pmu_read
>>> <idle>-0 [051] dnH. 132025441308590: isa207_disable_pmc <-power_pmu_del
>>> <idle>-0 [051] dnH. 132025441308663: write_pmc <-power_pmu_del
>>> <idle>-0 [051] dnH. 132025441308787: power_pmu_event_idx <-perf_event_update_userpage
>>> <idle>-0 [051] dnH. 132025441308859: rcu_read_unlock_strict <-perf_event_update_userpage
>>> <idle>-0 [051] dnH. 132025441308975: power_pmu_enable <-perf_pmu_enable
>>> <<>>
>>> <idle>-0 [051] dnH. 132025441311108: irq_exit <-__do_irq
>>> <idle>-0 [051] dns. 132025441311319: performance_monitor_exception <-replay_soft_interrupts
>>> 
>>> Case 2: PMI's masked during local_* operations, example local_add.
>>> If the local_add operation happens within a local_irq_save, replay of
>>> PMI will be during local_irq_restore. Similar to case 1, this could
>>> also create a window before replay where PMU events gets deleted or
>>> stopped.
>> 
>> Here as well perhaps PMIs should be replayed if they are unmasked
>> even if other interrupts are still masked. Again that might be more
>> complexity than it's worth.
> Ok..
> 
>> 
>>> 
>>> Patch adds a fix to update the PMU callback functions (del,stop,enable) to
>>> check for pending perf interrupt. If there is an overflown PMC and pending
>>> perf interrupt indicated in Paca, clear the PMI bit in paca to drop that
>>> sample. In case of power_pmu_del, also clear the MMCR0 PMAO bit which
>>> otherwise could lead to spurious interrupts in some corner cases. Example,
>>> a timer after power_pmu_del which will re-enable interrupts since PMI is
>>> cleared and triggers a PMI again since PMAO bit is still set.
>>> 
>>> We can't just replay PMI any time. Hence this approach is preferred rather
>>> than replaying PMI before resetting overflown PMC. Patch also documents
>>> core-book3s on a race condition which can trigger these PMC messages during
>>> idle path in PowerNV.
>>> 
>>> Fixes: f442d004806e ("powerpc/64s: Add support to mask perf interrupts and replay them")
>>> Reported-by: Nageswara R Sastry <nasastry@in.ibm.com>
>>> Suggested-by: Nicholas Piggin <npiggin@gmail.com>
>>> Suggested-by: Madhavan Srinivasan <maddy@linux.ibm.com>
>>> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
>>> ---
>>> arch/powerpc/include/asm/pmc.h  | 11 +++++++++
>>> arch/powerpc/perf/core-book3s.c | 55 +++++++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 66 insertions(+)
>>> 
>>> diff --git a/arch/powerpc/include/asm/pmc.h b/arch/powerpc/include/asm/pmc.h
>>> index c6bbe9778d3c..97b4bd8de25b 100644
>>> --- a/arch/powerpc/include/asm/pmc.h
>>> +++ b/arch/powerpc/include/asm/pmc.h
>>> @@ -34,11 +34,22 @@ static inline void ppc_set_pmu_inuse(int inuse)
>>> #endif
>>> }
>>> 
>>> +static inline int clear_paca_irq_pmi(void)
>>> +{
>>> +	if (get_paca()->irq_happened & PACA_IRQ_PMI) {
>>> +		WARN_ON_ONCE(mfmsr() & MSR_EE);
>>> +		get_paca()->irq_happened &= ~PACA_IRQ_PMI;
>>> +		return 1;
>>> +	}
>>> +	return 0;
>>> +}
>> 
>> Could you put this in arch/powerpc/include/asm/hw_irq.h and
>> rather than paca_irq, call it irq_pending perhaps
>> 
>>  clear_pmi_irq_pending()
>> 
>>  get_clear_pmi_irq_pending() if you're also testing it.
> 
> Sure,  I will use “get_clear_pmi_irq_pending()” and try with moving this to arch/powerpc/include/asm/hw_irq.h
> 
>> 
>> Could you add a little comment about the corner cases above it too?
>> The root cause seem to be interrupt replay while a masked PMI is
>> pending can result in other interrupts arriving which clear the PMU
>> overflow so the pending PMI must be cleared.
> 
> Ok, I will add comment and fix this in next version.
> 
>> 
>>> +
>>> extern void power4_enable_pmcs(void);
>>> 
>>> #else /* CONFIG_PPC64 */
>>> 
>>> static inline void ppc_set_pmu_inuse(int inuse) { }
>>> +static inline int clear_paca_irq_pmi(void) { return 0; }
>>> 
>>> #endif
>>> 
>>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>>> index 766f064f00fb..18ca3c90f866 100644
>>> --- a/arch/powerpc/perf/core-book3s.c
>>> +++ b/arch/powerpc/perf/core-book3s.c
>>> @@ -847,6 +847,20 @@ static void write_pmc(int idx, unsigned long val)
>>> 	}
>>> }
>>> 
>>> +static int pmc_overflown(int idx)
>>> +{
>>> +	unsigned long val[8];
>>> +	int i;
>>> +
>>> +	for (i = 0; i < ppmu->n_counter; i++)
>>> +		val[i] = read_pmc(i + 1);
>>> +
>>> +	if ((int)val[idx-1] < 0)
>>> +		return 1;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> /* Called from sysrq_handle_showregs() */
>>> void perf_event_print_debug(void)
>>> {
>>> @@ -1438,6 +1452,15 @@ static void power_pmu_enable(struct pmu *pmu)
>>> 		event = cpuhw->event[i];
>>> 		if (event->hw.idx && event->hw.idx != hwc_index[i] + 1) {
>>> 			power_pmu_read(event);
>>> +			/*
>>> +			 * if the PMC corresponding to event->hw.idx is
>>> +			 * overflown, check if there is any pending perf
>>> +			 * interrupt set in paca. If so, disable the interrupt
>>> +			 * by clearing the paca bit for PMI since we are going
>>> +			 * to reset the PMC.
>>> +			 */
>>> +			if (pmc_overflown(event->hw.idx))
>>> +				clear_paca_irq_pmi();
>> 
>> If the pmc is not overflown, could there still be a PMI pending?
> 
> I didn’t hit that scenario where PMI is pending without an overflown PMC.
> Also I believe if such a case happens, we will need an investigation there. It could be a different case to be handled.

Okay, so a PMI will not occur without an overflown PMC, and the 
overflown PMC will only be cleared in places where you also clear a 
possible pending PMI?

> 
> I actually considered below two points for adding this PMC check instead of just clearing the PMI.
> 
> 1. Make sure we are not masking any bug here by just clearing PACA_IRQ_PMI.
> Ideally if PMI is set in irq_happened, it means there was a counter overflow.
> 2. If there is more than one PMU event, say two events. Make sure we are clearing PMI only for the
> event whose counter is overflown.

Those are good points. Would you consider also adding a warning for the
case where no PMC is overflown but a PMI is pending? That way you might have more
information about such a problem if it ever happens.

We try to add a good deal of warnings around the soft-mask code because 
it's very tricky to change without causing more bugs, so even for future
changes to the code this would probably be useful.
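
Something like this, as a rough sketch of what I mean (placement and naming
up to you):

	if (pmc_overflown(event->hw.idx)) {
		clear_paca_irq_pmi();
	} else {
		/* a pending PMI without an overflown PMC would be unexpected */
		WARN_ON_ONCE(get_paca()->irq_happened & PACA_IRQ_PMI);
	}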

>>> @@ -1636,6 +1664,22 @@ static void power_pmu_del(struct perf_event *event, int ef_flags)
>>> 			--cpuhw->n_events;
>>> 			ppmu->disable_pmc(event->hw.idx - 1, &cpuhw->mmcr);
>>> 			if (event->hw.idx) {
>>> +				/*
>>> +				 * if the PMC corresponding to event->hw.idx is
>>> +				 * overflown, check if there is any pending perf
>>> +				 * interrupt set in paca. If so, disable the interrupt
>>> +				 * and clear the MMCR0 PMAO bit since we are going
>>> +				 * to reset the PMC and delete the event.
>>> +				 */
>>> +				if (pmc_overflown(event->hw.idx)) {
>>> +					if (clear_paca_irq_pmi()) {
>>> +						val_mmcr0 = mfspr(SPRN_MMCR0);
>>> +						val_mmcr0 &= ~MMCR0_PMAO;
>>> +						write_mmcr0(cpuhw, val_mmcr0);
>>> +						mb();
>>> +						isync();
>> 
>> I don't know the perf subsystem, but just out of curiosity why does
>> MMCR0 need to be cleared only in this case?
> 
> I got a corner case in power_pmu_del, with only clearing PACA_IRQ_PMI and without resetting MMCR0 PMAO bit.
> Here is the flow:
> 
> 1. We clear the PMI bit Paca, but MMCR0 has the PMAO bit still set. PMAO bit indicates a PMI has occurred.
> 2. A timer interrupt is replayed after power_pmu_del which does a “may_hard_irq_enable”.
> This will re-enable interrupts and triggers a PMI again since PMAO bit is still set.
> 
> So clear PMAO bit to avoid such spurious interrupts.
> Ftrace logs showing the same with some debug trace_printks :
> 
>      <idle>-0    [134] d.h. 327287888478: power_pmu_del <-event_sched_out.isra.126
>      <<>>    Here we cleared the PMI
>      <idle>-0    [134] d.h. 327287889272: write_pmc <-power_pmu_del
>      <idle>-0    [134] d.h. 327287889346: rcu_read_unlock_strict <-perf_event_update_userpage
>      <idle>-0    [134] d.h. 327287889711: power_pmu_del: In power_pmu_del MMCR0 is 82004090, local_paca->irq_happened is 9
>      <idle>-0    [134] d.h. 327287889811: power_pmu_enable <-perf_pmu_enable
>      <idle>-0    [134] d.h. 327287889982: irq_exit <-doorbell_exception
>      <idle>-0    [134] d... 327287890053: idle_cpu <-irq_exit
>      <idle>-0    [134] d... 327287890158: tick_nohz_irq_exit <-irq_exit
>      <idle>-0    [134] d... 327287890219: ktime_get <-tick_nohz_irq_exit
>      <idle>-0    [134] d... 327287890328: replay_soft_interrupts <-interrupt_exit_kernel_prepare
>      <idle>-0    [134] d... 327287890399: irq_enter <-timer_interrupt
>      <<>>
>      <idle>-0    [134] d.h. 327287891163: timer_interrupt: Before may_hard_irq_enable MMCR0 is 82004090, local_paca->irq_happened is 1
>      <<>>
>      <idle>-0    [134] d.h. 327287894310: timer_interrupt: After may_hard_irq_enable MMCR0 is 82004090, local_paca->irq_happened is 21
> 
> In case of other callbacks like pmu enable, we are programming MMCR0. But in case of event getting deleted, there is no
> way we clear PMAO unless an event gets scheduled again in that cpu. Hence added this check only in pmu_del callback.
>  
> 
>> What if we disabled MSR[EE]
>> right before a perf interrupt came in, so we don't get a pending PMI
>> but the condition is still close to the same.
> 
> Nick, I didn’t get this question exactly. Can you please help explain a bit ?
> From my understanding, consider that we disabled MSR[EE] before perf interrupt came in.
> So once the interrupts are re-enabled:
> 
> 1. If soft mask is set to IRQS_DISABLED, perf interrupt will be triggered as NMI.
> 2. In case of ALL_DISABLED, it will be masked for replay. If PMU callbacks are invoked before replay,
> our present patch will take care of clearing PMI in corner cases.

Well I'm wondering about the same PMAO bug. Above you said:

  1. We clear the PMI bit Paca, but MMCR0 has the PMAO bit still set. PMAO bit indicates a PMI has occurred.
  2. A timer interrupt is replayed after power_pmu_del which does a “may_hard_irq_enable”.
  This will re-enable interrupts and triggers a PMI again since PMAO bit is still set.

So in this situation, what if we had disabled interrupts and that had
caused MSR[EE] to be cleared (let's say due to a PCI interrupt
arriving), and then a PMC overflowed and caused PMAO to be set?

Then you run this code:

+				/*
+				 * if the PMC corresponding to event->hw.idx is
+				 * overflown, check if there is any pending perf
+				 * interrupt set in paca. If so, disable the interrupt
+				 * and clear the MMCR0 PMAO bit since we are going
+				 * to reset the PMC and delete the event.
+				 */
+				if (pmc_overflown(event->hw.idx)) {
+					if (clear_paca_irq_pmi()) {
+						val_mmcr0 = mfspr(SPRN_MMCR0);
+						val_mmcr0 &= ~MMCR0_PMAO;
+						write_mmcr0(cpuhw, val_mmcr0);
+						mb();
+						isync();

And this does not clear PMAO because we had no pending PMI, but we still 
have the pending PMAO exception.

The only difference was that MSR[EE] happened to be disabled when the
PMC overflowed, so no pending PMI was recorded, but otherwise everything
is the same, so I wonder why it's not subject to the same problem?

Thanks,
Nick

Patch

diff --git a/arch/powerpc/include/asm/pmc.h b/arch/powerpc/include/asm/pmc.h
index c6bbe9778d3c..97b4bd8de25b 100644
--- a/arch/powerpc/include/asm/pmc.h
+++ b/arch/powerpc/include/asm/pmc.h
@@ -34,11 +34,22 @@  static inline void ppc_set_pmu_inuse(int inuse)
 #endif
 }
 
+static inline int clear_paca_irq_pmi(void)
+{
+	if (get_paca()->irq_happened & PACA_IRQ_PMI) {
+		WARN_ON_ONCE(mfmsr() & MSR_EE);
+		get_paca()->irq_happened &= ~PACA_IRQ_PMI;
+		return 1;
+	}
+	return 0;
+}
+
 extern void power4_enable_pmcs(void);
 
 #else /* CONFIG_PPC64 */
 
 static inline void ppc_set_pmu_inuse(int inuse) { }
+static inline int clear_paca_irq_pmi(void) { return 0; }
 
 #endif
 
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 766f064f00fb..18ca3c90f866 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -847,6 +847,20 @@  static void write_pmc(int idx, unsigned long val)
 	}
 }
 
+static int pmc_overflown(int idx)
+{
+	unsigned long val[8];
+	int i;
+
+	for (i = 0; i < ppmu->n_counter; i++)
+		val[i] = read_pmc(i + 1);
+
+	if ((int)val[idx-1] < 0)
+		return 1;
+
+	return 0;
+}
+
 /* Called from sysrq_handle_showregs() */
 void perf_event_print_debug(void)
 {
@@ -1438,6 +1452,15 @@  static void power_pmu_enable(struct pmu *pmu)
 		event = cpuhw->event[i];
 		if (event->hw.idx && event->hw.idx != hwc_index[i] + 1) {
 			power_pmu_read(event);
+			/*
+			 * if the PMC corresponding to event->hw.idx is
+			 * overflown, check if there is any pending perf
+			 * interrupt set in paca. If so, disable the interrupt
+			 * by clearing the paca bit for PMI since we are going
+			 * to reset the PMC.
+			 */
+			if (pmc_overflown(event->hw.idx))
+				clear_paca_irq_pmi();
 			write_pmc(event->hw.idx, 0);
 			event->hw.idx = 0;
 		}
@@ -1474,6 +1497,10 @@  static void power_pmu_enable(struct pmu *pmu)
 		event->hw.idx = idx;
 		if (event->hw.state & PERF_HES_STOPPED)
 			val = 0;
+
+		/* See above for clear_paca_irq_pmi */
+		if (pmc_overflown(event->hw.idx))
+			clear_paca_irq_pmi();
 		write_pmc(idx, val);
 
 		perf_event_update_userpage(event);
@@ -1619,6 +1646,7 @@  static void power_pmu_del(struct perf_event *event, int ef_flags)
 	struct cpu_hw_events *cpuhw;
 	long i;
 	unsigned long flags;
+	unsigned long val_mmcr0;
 
 	local_irq_save(flags);
 	perf_pmu_disable(event->pmu);
@@ -1636,6 +1664,22 @@  static void power_pmu_del(struct perf_event *event, int ef_flags)
 			--cpuhw->n_events;
 			ppmu->disable_pmc(event->hw.idx - 1, &cpuhw->mmcr);
 			if (event->hw.idx) {
+				/*
+				 * if the PMC corresponding to event->hw.idx is
+				 * overflown, check if there is any pending perf
+				 * interrupt set in paca. If so, disable the interrupt
+				 * and clear the MMCR0 PMAO bit since we are going
+				 * to reset the PMC and delete the event.
+				 */
+				if (pmc_overflown(event->hw.idx)) {
+					if (clear_paca_irq_pmi()) {
+						val_mmcr0 = mfspr(SPRN_MMCR0);
+						val_mmcr0 &= ~MMCR0_PMAO;
+						write_mmcr0(cpuhw, val_mmcr0);
+						mb();
+						isync();
+					}
+				}
 				write_pmc(event->hw.idx, 0);
 				event->hw.idx = 0;
 			}
@@ -1714,6 +1758,8 @@  static void power_pmu_stop(struct perf_event *event, int ef_flags)
 
 	local_irq_save(flags);
 	perf_pmu_disable(event->pmu);
+	if (pmc_overflown(event->hw.idx))
+		clear_paca_irq_pmi();
 
 	power_pmu_read(event);
 	event->hw.state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
@@ -2343,6 +2389,15 @@  static void __perf_event_interrupt(struct pt_regs *regs)
 			}
 		}
 	}
+
+	/*
+	 * During system wide profling or while specific CPU
+	 * is monitored for an event, some corner cases could
+	 * cause PMC to overflow in idle path. This will trigger
+	 * a PMI after waking up from idle. Since counter values
+	 * are _not_ saved/restored in idle path, can lead to
+	 * below "Can't find PMC" message.
+	 */
 	if (unlikely(!found) && !arch_irq_disabled_regs(regs))
 		printk_ratelimited(KERN_WARNING "Can't find PMC that caused IRQ\n");