[V2] powerpc/perf: Enable PMU counters post partition migration if PMU is active

Message ID: 1626006357-1611-1-git-send-email-atrajeev@linux.vnet.ibm.com
State: Changes Requested
Series: [V2] powerpc/perf: Enable PMU counters post partition migration if PMU is active

Commit Message

Athira Rajeev July 11, 2021, 12:25 p.m. UTC
During Live Partition Migration (LPM), perf counter values are
observed to report zero after the migration completes. 'perf stat'
with a workload continues to show counts across the migration, since
the PMU gets disabled/enabled during sched switches. But in the case
of system/CPU-wide monitoring, zero counts are reported with
'perf stat' after the migration completes.

Example:
 ./perf stat -e r1001e -I 1000
           time             counts unit events
     1.001010437         22,137,414      r1001e
     2.002495447         15,455,821      r1001e
<<>> As seen in the logs below, the counter values show zero
        after the migration is completed.
<<>>
    86.142535370    129,392,333,440      r1001e
    87.144714617                  0      r1001e
    88.146526636                  0      r1001e
    89.148085029                  0      r1001e

Here the PMU is enabled at the start of the perf session and counter
values are read at intervals; counters are only disabled at the end
of the session. The powerpc mobility code presently does not disable
and re-enable the PMU counters around a partition migration. Also,
since the PMU register values are not saved/restored during migration,
PMU registers like Monitor Mode Control Register 0 (MMCR0) and Monitor
Mode Control Register 1 (MMCR1) will not contain the values they were
programmed with. Hence the PMU counters will not be enabled correctly
post migration.

Fix this in the mobility code by disabling and re-enabling the PMU
on all CPUs before and after the migration. The patch introduces two
functions, 'mobility_pmu_disable' and 'mobility_pmu_enable'.
mobility_pmu_disable() is called before the processor threads go
into suspend state, so as to disable the PMU counters; the disable is
done only if there are active events running on that CPU.
mobility_pmu_enable() is called after the processor threads are
back online, to re-enable the PMU counters.

Since the Performance Monitor Counters (PMCs) are not saved/restored
during LPM, the PMC value reads as zero after migration while
'event->hw.prev_count' still holds a non-zero pre-migration value.
This causes a problem when updating event->count, since we always
accumulate the delta between the current PMC value and
event->hw.prev_count into event->count. If event->hw.prev_count is
greater than the PMC value, the delta goes negative. Fix this by
re-initialising 'prev_count' for all events while enabling them back.
A new field 'migrate' is introduced in 'struct cpu_hw_events' to flag
the LPM case in power_pmu_enable(). Use the 'migrate' value to clear
the PMC index (stored in event->hw.idx) for all events so that the
event count settings are re-initialised correctly.
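
For illustration, the core of the event-count update is roughly the
following (a simplified sketch, not the exact in-tree code -- the real
update in power_pmu_read() also masks the delta and guards against
interrupt races):

	/* Sketch: how a read folds the PMC value into event->count. */
	static void sketch_event_update(struct perf_event *event,
					unsigned long val /* current PMC */)
	{
		s64 prev = local64_read(&event->hw.prev_count);
		s64 delta = val - prev;	/* bogus if LPM reset the PMC to 0 */

		local64_add(delta, &event->count);
		local64_set(&event->hw.prev_count, val);
	}

With the PMC reading zero after migration and prev_count still holding
the pre-migration value, the delta is meaningless, which is why the
count settings must be re-initialised rather than accumulated across
the migration.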

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
[ Fixed compilation error reported by kernel test robot ]
Reported-by: kernel test robot <lkp@intel.com>
---
Changes from v1 -> v2:
 - Moved the mobility_pmu_enable and mobility_pmu_disable
   declarations under CONFIG_PPC_PERF_CTRS in rtas.h.
   Also included 'asm/rtas.h' in core-book3s to fix the
   compilation warning reported by kernel test robot.

 arch/powerpc/include/asm/rtas.h           |  8 ++++++
 arch/powerpc/perf/core-book3s.c           | 44 ++++++++++++++++++++++++++++---
 arch/powerpc/platforms/pseries/mobility.c |  4 +++
 3 files changed, 53 insertions(+), 3 deletions(-)

Comments

Nicholas Piggin Oct. 21, 2021, 10:54 a.m. UTC | #1
Excerpts from Athira Rajeev's message of July 11, 2021 10:25 pm:
> [...]
> Since the Performance Monitor Counters (PMCs) are not saved/restored
> during LPM, the PMC value reads as zero after migration while
> 'event->hw.prev_count' still holds a non-zero pre-migration value.

Interesting. Are they defined to not be migrated, or may not be 
migrated?

I wonder what QEMU migration does with PMU registers.

> [...]
> 
> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
> index 9dc97d2..cea72d7 100644
> --- a/arch/powerpc/include/asm/rtas.h
> +++ b/arch/powerpc/include/asm/rtas.h
> @@ -380,5 +380,13 @@ static inline void rtas_initialize(void) { }
>  static inline void read_24x7_sys_info(void) { }
>  #endif
>  
> +#ifdef CONFIG_PPC_PERF_CTRS
> +void mobility_pmu_disable(void);
> +void mobility_pmu_enable(void);
> +#else
> +static inline void mobility_pmu_disable(void) { }
> +static inline void mobility_pmu_enable(void) { }
> +#endif
> +
>  #endif /* __KERNEL__ */
>  #endif /* _POWERPC_RTAS_H */

It's not implemented in rtas, maybe consider putting this into a perf 
header?
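
A minimal sketch of that suggestion (the choice of header is an
assumption here -- asm/perf_event.h is one natural home, and the
author agrees to move it to a perf header later in this thread):

	/* arch/powerpc/include/asm/perf_event.h (sketch) */
	#ifdef CONFIG_PPC_PERF_CTRS
	void mobility_pmu_disable(void);
	void mobility_pmu_enable(void);
	#else
	static inline void mobility_pmu_disable(void) { }
	static inline void mobility_pmu_enable(void) { }
	#endif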

> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index bb0ee71..90da7fa 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -18,6 +18,7 @@
>  #include <asm/ptrace.h>
>  #include <asm/code-patching.h>
>  #include <asm/interrupt.h>
> +#include <asm/rtas.h>
>  
>  #ifdef CONFIG_PPC64
>  #include "internal.h"
> @@ -58,6 +59,7 @@ struct cpu_hw_events {
>  
>  	/* Store the PMC values */
>  	unsigned long pmcs[MAX_HWEVENTS];
> +	int migrate;
>  };
>  
>  static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
> @@ -1335,6 +1337,22 @@ static void power_pmu_disable(struct pmu *pmu)
>  }
>  
>  /*
> + * Called from powerpc mobility code
> + * before migration to disable counters
> + * if the PMU is active.
> + */
> +void mobility_pmu_disable(void)
> +{
> +	struct cpu_hw_events *cpuhw;
> +
> +	cpuhw = this_cpu_ptr(&cpu_hw_events);
> +	if (cpuhw->n_events != 0) {
> +		power_pmu_disable(NULL);
> +		cpuhw->migrate = 1;
> +	}

Rather than add this migrate logic into power_pmu_enable(), would you be 
able to do the work in the mobility callbacks instead? Something like
this:

void mobility_pmu_disable(void)
{
        struct cpu_hw_events *cpuhw;

        cpuhw = this_cpu_ptr(&cpu_hw_events);
        if (cpuhw->n_events != 0) {
                struct perf_event *event;
                int i;

                power_pmu_disable(NULL);
                /*
                 * Read off any pre-existing events because the register
                 * values may not be migrated.
                 */
                for (i = 0; i < cpuhw->n_events; ++i) {
                        event = cpuhw->event[i];
                        if (event->hw.idx) {
                                power_pmu_read(event);
                                event->hw.idx = 0;
                        }
                }
        }
}

void mobility_pmu_enable(void)
{
        struct cpu_hw_events *cpuhw;
 
        cpuhw = this_cpu_ptr(&cpu_hw_events);
        cpuhw->n_added = cpuhw->n_events;
        power_pmu_enable(NULL);
}

> +}
> +
> +/*
>   * Re-enable all events if disable == 0.
>   * If we were previously disabled and events were added, then
>   * put the new config on the PMU.
> @@ -1379,8 +1397,10 @@ static void power_pmu_enable(struct pmu *pmu)
>  	 * no need to recalculate MMCR* settings and reset the PMCs.
>  	 * Just reenable the PMU with the current MMCR* settings
>  	 * (possibly updated for removal of events).
> +	 * While reenabling PMU during partition migration, continue
> +	 * with normal flow.
>  	 */
> -	if (!cpuhw->n_added) {
> +	if (!cpuhw->n_added && !cpuhw->migrate) {
>  		mtspr(SPRN_MMCRA, cpuhw->mmcr.mmcra & ~MMCRA_SAMPLE_ENABLE);
>  		mtspr(SPRN_MMCR1, cpuhw->mmcr.mmcr1);
>  		if (ppmu->flags & PPMU_ARCH_31)
> @@ -1434,11 +1454,15 @@ static void power_pmu_enable(struct pmu *pmu)
>  	/*
>  	 * Read off any pre-existing events that need to move
>  	 * to another PMC.
> +	 * While enabling PMU during partition migration,
> +	 * skip power_pmu_read since all event count settings
> +	 * needs to be re-initialised after migration.
>  	 */
>  	for (i = 0; i < cpuhw->n_events; ++i) {
>  		event = cpuhw->event[i];
> -		if (event->hw.idx && event->hw.idx != hwc_index[i] + 1) {
> -			power_pmu_read(event);
> +		if ((event->hw.idx && event->hw.idx != hwc_index[i] + 1) || (cpuhw->migrate)) {
> +			if (!cpuhw->migrate)
> +				power_pmu_read(event);
>  			write_pmc(event->hw.idx, 0);
>  			event->hw.idx = 0;
>  		}
> @@ -1506,6 +1530,20 @@ static void power_pmu_enable(struct pmu *pmu)
>  	local_irq_restore(flags);
>  }
>  
> +/*
> + * Called from powerpc mobility code
> + * during migration completion to
> + * enable back PMU counters.
> + */
> +void mobility_pmu_enable(void)
> +{
> +	struct cpu_hw_events *cpuhw;
> +
> +	cpuhw = this_cpu_ptr(&cpu_hw_events);
> +	power_pmu_enable(NULL);
> +	cpuhw->migrate = 0;
> +}
> +
>  static int collect_events(struct perf_event *group, int max_count,
>  			  struct perf_event *ctrs[], u64 *events,
>  			  unsigned int *flags)
> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
> index e83e089..ff7a77c 100644
> --- a/arch/powerpc/platforms/pseries/mobility.c
> +++ b/arch/powerpc/platforms/pseries/mobility.c
> @@ -476,6 +476,8 @@ static int do_join(void *arg)
>  retry:
>  	/* Must ensure MSR.EE off for H_JOIN. */
>  	hard_irq_disable();
> +	/* Disable PMU before suspend */
> +	mobility_pmu_disable();
>  	hvrc = plpar_hcall_norets(H_JOIN);
>  
>  	switch (hvrc) {

Does the retry case cause mobility_pmu_disable to be called twice 
without mobility_pmu_enable called? Would be neater if it was
balanced.
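
Concretely, the concern is this shape in do_join() (a control-flow
sketch; the exact retry condition is elided):

	retry:
		/* Must ensure MSR.EE off for H_JOIN. */
		hard_irq_disable();
		mobility_pmu_disable();	/* runs again on every retry pass */
		hvrc = plpar_hcall_norets(H_JOIN);
		...
		/* spurious-wakeup path: */
		goto retry;		/* no mobility_pmu_enable() in between */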


> @@ -530,6 +532,8 @@ static int do_join(void *arg)
>  	 * reset the watchdog.
>  	 */
>  	touch_nmi_watchdog();
> +	/* Enable PMU after resuming */
> +	mobility_pmu_enable();
>  	return ret;
>  }

Not really a big deal but while you're updating it, you could leave 
touch_nmi_watchdog() at the very end.
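
That is (sketch):

	/* Enable PMU after resuming */
	mobility_pmu_enable();
	/* Keep the watchdog reset as the very last step. */
	touch_nmi_watchdog();
	return ret;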

Thanks,
Nick
Nathan Lynch Oct. 21, 2021, 5:17 p.m. UTC | #2
Athira Rajeev <atrajeev@linux.vnet.ibm.com> writes:
> [...]
>
> Example:
>  ./perf stat -e r1001e -I 1000
>            time             counts unit events
>      1.001010437         22,137,414      r1001e
>      2.002495447         15,455,821      r1001e
> <<>> As seen in the logs below, the counter values show zero
>        after the migration is completed.
> <<>>
>     86.142535370    129,392,333,440      r1001e
>     87.144714617                  0      r1001e
>     88.146526636                  0      r1001e
>     89.148085029                  0      r1001e

Confirmed in my environment:

    51.099987985            300,338      cache-misses
    52.101839374            296,586      cache-misses
    53.116089796            263,150      cache-misses
    54.117949249            232,290      cache-misses
    55.602029375     68,700,421,711      cache-misses
    56.610073969                  0      cache-misses
    57.614732000                  0      cache-misses

I wonder what it means that there is a very unlikely huge value before
the counter stops working -- I believe your example has this phenomenon
too.


> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
> index e83e089..ff7a77c 100644
> --- a/arch/powerpc/platforms/pseries/mobility.c
> +++ b/arch/powerpc/platforms/pseries/mobility.c
> @@ -476,6 +476,8 @@ static int do_join(void *arg)
>  retry:
>  	/* Must ensure MSR.EE off for H_JOIN. */
>  	hard_irq_disable();
> +	/* Disable PMU before suspend */
> +	mobility_pmu_disable();
>  	hvrc = plpar_hcall_norets(H_JOIN);
>  
>  	switch (hvrc) {
> @@ -530,6 +532,8 @@ static int do_join(void *arg)
>  	 * reset the watchdog.
>  	 */
>  	touch_nmi_watchdog();
> +	/* Enable PMU after resuming */
> +	mobility_pmu_enable();
>  	return ret;
>  }

We should minimize calls into other subsystems from this context (the
callback function we've passed to stop_machine); it's fairly sensitive.
Can this be moved out to pseries_migrate_partition() or similar?
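
One possible shape (a sketch only -- the wrapper name is invented
here, and because the PMU state is per-CPU, calls made from
pseries_migrate_partition() would need to reach every CPU, e.g. via
on_each_cpu()):

	static void mobility_pmu_disable_fn(void *unused)
	{
		mobility_pmu_disable();
	}

	/* from pseries_migrate_partition(), before triggering the join: */
	on_each_cpu(mobility_pmu_disable_fn, NULL, 1);

(and symmetrically for mobility_pmu_enable() after resume).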
Nathan Lynch Oct. 21, 2021, 5:33 p.m. UTC | #3
Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from Athira Rajeev's message of July 11, 2021 10:25 pm:
>> [...]
>> Since the Performance Monitor Counters (PMCs) are not saved/restored
>> during LPM, the PMC value reads as zero after migration while
>> 'event->hw.prev_count' still holds a non-zero pre-migration value.
>
> Interesting. Are they defined to not be migrated, or may not be 
> migrated?

PAPR may be silent on this... at least I haven't found anything yet. But
I'm not very familiar with perf counters.

How much assurance do we have that hardware events we've programmed on
the source can be reliably re-enabled on the destination, with the same
semantics? Aren't there some model-specific counters that don't make
sense to handle this way?


>> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
>> index 9dc97d2..cea72d7 100644
>> --- a/arch/powerpc/include/asm/rtas.h
>> +++ b/arch/powerpc/include/asm/rtas.h
>> @@ -380,5 +380,13 @@ static inline void rtas_initialize(void) { }
>>  static inline void read_24x7_sys_info(void) { }
>>  #endif
>>  
>> +#ifdef CONFIG_PPC_PERF_CTRS
>> +void mobility_pmu_disable(void);
>> +void mobility_pmu_enable(void);
>> +#else
>> +static inline void mobility_pmu_disable(void) { }
>> +static inline void mobility_pmu_enable(void) { }
>> +#endif
>> +
>>  #endif /* __KERNEL__ */
>>  #endif /* _POWERPC_RTAS_H */
>
> It's not implemented in rtas, maybe consider putting this into a perf 
> header?

+1
Michael Ellerman Oct. 22, 2021, 12:19 a.m. UTC | #4
Nathan Lynch <nathanl@linux.ibm.com> writes:
> Athira Rajeev <atrajeev@linux.vnet.ibm.com> writes:
>> [...]
>
> Confirmed in my environment:
>
>     51.099987985            300,338      cache-misses
>     52.101839374            296,586      cache-misses
>     53.116089796            263,150      cache-misses
>     54.117949249            232,290      cache-misses
>     55.602029375     68,700,421,711      cache-misses
>     56.610073969                  0      cache-misses
>     57.614732000                  0      cache-misses
>
> I wonder what it means that there is a very unlikely huge value before
> the counter stops working -- I believe your example has this phenomenon
> too.

AFAICS the patch is not reading the PMC values before the migration, so
I suspect we're losing some counts just before the migration and then
the delta is going negative somewhere, leading to an implausibly large
count.
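
For reference, the existing delta helper in core-book3s.c masks the
difference to 32 bits (a simplified sketch of in-tree code, not part
of this patch):

	static unsigned long check_and_compute_delta(unsigned long prev,
						     unsigned long val)
	{
		unsigned long delta = (val - prev) & 0xfffffffful;
		/* (the real helper also guards against counter rollback) */
		return delta;
	}

so a PMC that comes back as zero with a large prev_count can produce
a huge wrapped delta, consistent with the implausible counts above.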

cheers
Madhavan Srinivasan Oct. 22, 2021, 3:33 a.m. UTC | #5
On 10/21/21 11:03 PM, Nathan Lynch wrote:
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Excerpts from Athira Rajeev's message of July 11, 2021 10:25 pm:
>>> [...]
>>> Since the Performance Monitor Counters (PMCs) are not saved/restored
>>> during LPM, the PMC value reads as zero after migration while
>>> 'event->hw.prev_count' still holds a non-zero pre-migration value.
>> Interesting. Are they defined to not be migrated, or may not be
>> migrated?
> PAPR may be silent on this... at least I haven't found anything yet. But
> I'm not very familiar with perf counters.

IIUC, from internal discussion with pHYP, migration of the counters
is an OS responsibility.

> How much assurance do we have that hardware events we've programmed on
> the source can be reliably re-enabled on the destination, with the same
> semantics? Aren't there some model-specific counters that don't make
> sense to handle this way?

Migration to the same processor generation/model should be OK, but
not to a different generation/model (though that is a case to
handle). That said, this patch is to fix the issue of the large
values seen when migrating.

Nicholas Piggin Oct. 22, 2021, 5:11 a.m. UTC | #6
Excerpts from Michael Ellerman's message of October 22, 2021 10:19 am:
> Nathan Lynch <nathanl@linux.ibm.com> writes:
>> Athira Rajeev <atrajeev@linux.vnet.ibm.com> writes:
>> [...]
>> I wonder what it means that there is a very unlikely huge value before
>> the counter stops working -- I believe your example has this phenomenon
>> too.
> 
> AFAICS the patch is not reading the PMC values before the migration, so

My suggested change I think should take care of that.

Thanks,
Nick
Athira Rajeev Oct. 25, 2021, 5:07 p.m. UTC | #7
> On 21-Oct-2021, at 4:24 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> Excerpts from Athira Rajeev's message of July 11, 2021 10:25 pm:
>> [...]
>> +#ifdef CONFIG_PPC_PERF_CTRS
>> +void mobility_pmu_disable(void);
>> +void mobility_pmu_enable(void);
>> +#else
>> +static inline void mobility_pmu_disable(void) { }
>> +static inline void mobility_pmu_enable(void) { }
>> +#endif
>> +
>> #endif /* __KERNEL__ */
>> #endif /* _POWERPC_RTAS_H */
> 
> It's not implemented in rtas, maybe consider putting this into a perf 
> header?

Hi Nick,
Thanks for the review comments.

Sure, I will move this to the perf_event header file.

>> [...]
>> /*
>> + * Called from powerpc mobility code
>> + * before migration to disable counters
>> + * if the PMU is active.
>> + */
>> +void mobility_pmu_disable(void)
>> +{
>> +	struct cpu_hw_events *cpuhw;
>> +
>> +	cpuhw = this_cpu_ptr(&cpu_hw_events);
>> +	if (cpuhw->n_events != 0) {
>> +		power_pmu_disable(NULL);
>> +		cpuhw->migrate = 1;
>> +	}
> 
> Rather than add this migrate logic into power_pmu_enable(), would you be 
> able to do the work in the mobility callbacks instead? Something like
> this:
> 
> void mobility_pmu_disable(void)
> {
>        struct cpu_hw_events *cpuhw;
> 
>        cpuhw = this_cpu_ptr(&cpu_hw_events);
>        if (cpuhw->n_events != 0) {
>                struct perf_event *event;
>                int i;
> 
>                power_pmu_disable(NULL);
>                /*
>                 * Read off any pre-existing events because the register
>                 * values may not be migrated.
>                 */
>                for (i = 0; i < cpuhw->n_events; ++i) {
>                        event = cpuhw->event[i];
>                        if (event->hw.idx) {
>                                power_pmu_read(event);
>                                event->hw.idx = 0;
>                        }
>                }
>        }
> }
> 
> void mobility_pmu_enable(void)
> {
>        struct cpu_hw_events *cpuhw;
> 
>        cpuhw = this_cpu_ptr(&cpu_hw_events);
>        cpuhw->n_added = cpuhw->n_events;
>        power_pmu_enable(NULL);
> }
> 

Ok Nick, will address these changes in the next version.
With this approach, I won’t need the “migrate” field.
>> [...]
>> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
>> index e83e089..ff7a77c 100644
>> --- a/arch/powerpc/platforms/pseries/mobility.c
>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>> @@ -476,6 +476,8 @@ static int do_join(void *arg)
>> retry:
>> 	/* Must ensure MSR.EE off for H_JOIN. */
>> 	hard_irq_disable();
>> +	/* Disable PMU before suspend */
>> +	mobility_pmu_disable();
>> 	hvrc = plpar_hcall_norets(H_JOIN);
>> 
>> 	switch (hvrc) {
> 
> Does the retry case cause mobility_pmu_disable to be called twice 
> without mobility_pmu_enable called? Would be neater if it was
> balanced.
> 

I will move these mobility callbacks to “pseries_migrate_partition”
as Nathan suggested, and will send a V3 with the new changes.

Thanks
Athira

Athira Rajeev Oct. 25, 2021, 5:09 p.m. UTC | #8
> On 21-Oct-2021, at 10:47 PM, Nathan Lynch <nathanl@linux.ibm.com> wrote:
> 
> [...]
> We should minimize calls into other subsystems from this context (the
> callback function we've passed to stop_machine); it's fairly sensitive.
> Can this be moved out to pseries_migrate_partition() or similar?

Hi Nathan

Thanks for the review.
I will move the callbacks to “pseries_migrate_partition” in the next version.

Athira.
Athira Rajeev Oct. 25, 2021, 5:10 p.m. UTC | #9
> On 21-Oct-2021, at 11:03 PM, Nathan Lynch <nathanl@linux.ibm.com> wrote:
> 
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Excerpts from Athira Rajeev's message of July 11, 2021 10:25 pm:
>>> [...]
>>> +#ifdef CONFIG_PPC_PERF_CTRS
>>> +void mobility_pmu_disable(void);
>>> +void mobility_pmu_enable(void);
>>> +#else
>>> +static inline void mobility_pmu_disable(void) { }
>>> +static inline void mobility_pmu_enable(void) { }
>>> +#endif
>>> +
>>> #endif /* __KERNEL__ */
>>> #endif /* _POWERPC_RTAS_H */
>> 
>> It's not implemented in rtas, maybe consider putting this into a perf 
>> header?
> 
> +1

Sure, I will move this to the perf_event header file.

Thanks
Athira

Patch

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 9dc97d2..cea72d7 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -380,5 +380,13 @@  static inline void rtas_initialize(void) { }
 static inline void read_24x7_sys_info(void) { }
 #endif
 
+#ifdef CONFIG_PPC_PERF_CTRS
+void mobility_pmu_disable(void);
+void mobility_pmu_enable(void);
+#else
+static inline void mobility_pmu_disable(void) { }
+static inline void mobility_pmu_enable(void) { }
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _POWERPC_RTAS_H */
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index bb0ee71..90da7fa 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -18,6 +18,7 @@ 
 #include <asm/ptrace.h>
 #include <asm/code-patching.h>
 #include <asm/interrupt.h>
+#include <asm/rtas.h>
 
 #ifdef CONFIG_PPC64
 #include "internal.h"
@@ -58,6 +59,7 @@  struct cpu_hw_events {
 
 	/* Store the PMC values */
 	unsigned long pmcs[MAX_HWEVENTS];
+	int migrate;
 };
 
 static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
@@ -1335,6 +1337,22 @@  static void power_pmu_disable(struct pmu *pmu)
 }
 
 /*
+ * Called from powerpc mobility code
+ * before migration to disable counters
+ * if the PMU is active.
+ */
+void mobility_pmu_disable(void)
+{
+	struct cpu_hw_events *cpuhw;
+
+	cpuhw = this_cpu_ptr(&cpu_hw_events);
+	if (cpuhw->n_events != 0) {
+		power_pmu_disable(NULL);
+		cpuhw->migrate = 1;
+	}
+}
+
+/*
  * Re-enable all events if disable == 0.
  * If we were previously disabled and events were added, then
  * put the new config on the PMU.
@@ -1379,8 +1397,10 @@  static void power_pmu_enable(struct pmu *pmu)
 	 * no need to recalculate MMCR* settings and reset the PMCs.
 	 * Just reenable the PMU with the current MMCR* settings
 	 * (possibly updated for removal of events).
+	 * While reenabling PMU during partition migration, continue
+	 * with normal flow.
 	 */
-	if (!cpuhw->n_added) {
+	if (!cpuhw->n_added && !cpuhw->migrate) {
 		mtspr(SPRN_MMCRA, cpuhw->mmcr.mmcra & ~MMCRA_SAMPLE_ENABLE);
 		mtspr(SPRN_MMCR1, cpuhw->mmcr.mmcr1);
 		if (ppmu->flags & PPMU_ARCH_31)
@@ -1434,11 +1454,15 @@  static void power_pmu_enable(struct pmu *pmu)
 	/*
 	 * Read off any pre-existing events that need to move
 	 * to another PMC.
+	 * While enabling PMU during partition migration,
+	 * skip power_pmu_read since all event count settings
+	 * needs to be re-initialised after migration.
 	 */
 	for (i = 0; i < cpuhw->n_events; ++i) {
 		event = cpuhw->event[i];
-		if (event->hw.idx && event->hw.idx != hwc_index[i] + 1) {
-			power_pmu_read(event);
+		if ((event->hw.idx && event->hw.idx != hwc_index[i] + 1) || (cpuhw->migrate)) {
+			if (!cpuhw->migrate)
+				power_pmu_read(event);
 			write_pmc(event->hw.idx, 0);
 			event->hw.idx = 0;
 		}
@@ -1506,6 +1530,20 @@  static void power_pmu_enable(struct pmu *pmu)
 	local_irq_restore(flags);
 }
 
+/*
+ * Called from powerpc mobility code
+ * during migration completion to
+ * enable back PMU counters.
+ */
+void mobility_pmu_enable(void)
+{
+	struct cpu_hw_events *cpuhw;
+
+	cpuhw = this_cpu_ptr(&cpu_hw_events);
+	power_pmu_enable(NULL);
+	cpuhw->migrate = 0;
+}
+
 static int collect_events(struct perf_event *group, int max_count,
 			  struct perf_event *ctrs[], u64 *events,
 			  unsigned int *flags)
diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index e83e089..ff7a77c 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -476,6 +476,8 @@  static int do_join(void *arg)
 retry:
 	/* Must ensure MSR.EE off for H_JOIN. */
 	hard_irq_disable();
+	/* Disable PMU before suspend */
+	mobility_pmu_disable();
 	hvrc = plpar_hcall_norets(H_JOIN);
 
 	switch (hvrc) {
@@ -530,6 +532,8 @@  static int do_join(void *arg)
 	 * reset the watchdog.
 	 */
 	touch_nmi_watchdog();
+	/* Enable PMU after resuming */
+	mobility_pmu_enable();
 	return ret;
 }