[for-2.12,v3,1/3] spapr/rtas: disable the decrementer interrupt when a CPU is unplugged

Message ID 20171120100347.8601-2-clg@kaod.org
State New
Series disable the decrementer interrupt when a CPU is unplugged

Commit Message

Cédric Le Goater Nov. 20, 2017, 10:03 a.m. UTC
When a CPU is stopped with the 'stop-self' RTAS call, its state
'halted' is switched to 1 and, in this case, the MSR is not taken into
account anymore in the cpu_has_work() routine. Only the pending
hardware interrupts are checked with their LPCR:PECE* enablement bit.

If the DECR timer fires after 'stop-self' is called and before the CPU
'stop' state is reached, the nearly-dead CPU will have some work to do
and the guest will crash. This case happens very frequently with the
not yet upstream P9 XIVE exploitation mode. In XICS mode, the DECR is
occasionally fired but after 'stop' state, so no work is to be done
and the guest survives.

I suspect there is a race between the QEMU mainloop triggering the
timers and the TCG CPU thread but I could not quite identify the root
cause. To be safe, let's disable in the LPCR all the exceptions which
can cause an exit while the CPU is in power-saving mode and reenable
them when the CPU is started.

For this purpose, we introduce a little helper routine to calculate
the PECE bits for a processor variant. We could also use the mask
value LPCR_PECE_L_MASK for the P8 and P9 processors. Bits 47 and 48
are reserved on P7, but the mask is still compatible.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

Changes in v3:

 - introduced a cpu_ppc_papr_pece_bits() helper to gather the PECE
   bits depending on the CPU family.   
 - enabled Power-saving mode Exit Cause exceptions only on the boot CPU.
 
Changes in v2:

 - used a new routine ppc_cpu_pvr_match() to discriminate CPU versions
 - removed the LPCR:PECE* enablement bit when the CPU is initialized
   if it is a secondary

 hw/ppc/spapr_rtas.c         |  9 +++++++++
 target/ppc/cpu.h            |  1 +
 target/ppc/translate_init.c | 33 +++++++++++++++++++++++++--------
 3 files changed, 35 insertions(+), 8 deletions(-)
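As an illustration of the behaviour described in the commit message, here is a minimal, self-contained model of the wake-up check. It is a sketch only, not QEMU code: the DemoCPU type, the demo_* functions and the DEMO_* bit values are all hypothetical and do not match the real LPCR layout. It shows that a halted CPU ignores the MSR and only wakes on a pending interrupt whose PECE bit is set, so clearing the PECE bits at 'stop-self' keeps a late DECR from giving the CPU work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit values for illustration; NOT the real QEMU definitions. */
#define DEMO_IRQ_EXT        (1u << 0)  /* pending external interrupt */
#define DEMO_IRQ_DECR       (1u << 1)  /* pending decrementer interrupt */
#define DEMO_LPCR_PECE_EXT  (1u << 0)  /* external interrupt may exit power-saving */
#define DEMO_LPCR_PECE_DECR (1u << 1)  /* decrementer may exit power-saving */
#define DEMO_LPCR_PECE_ALL  (DEMO_LPCR_PECE_EXT | DEMO_LPCR_PECE_DECR)

typedef struct DemoCPU {
    bool halted;       /* stands in for CPUState 'halted' */
    bool msr_ee;       /* stands in for the MSR external-enable bit */
    uint32_t lpcr;     /* stands in for env->spr[SPR_LPCR] */
    uint32_t pending;  /* pending hardware interrupts */
} DemoCPU;

/* Simplified model of the check described for cpu_has_work(). */
bool demo_cpu_has_work(const DemoCPU *cpu)
{
    assert(cpu != NULL);
    if (cpu->halted) {
        /* The MSR is ignored: only interrupts enabled by PECE bits count. */
        uint32_t wake = 0;
        if (cpu->lpcr & DEMO_LPCR_PECE_EXT) {
            wake |= DEMO_IRQ_EXT;
        }
        if (cpu->lpcr & DEMO_LPCR_PECE_DECR) {
            wake |= DEMO_IRQ_DECR;
        }
        return (cpu->pending & wake) != 0;
    }
    return cpu->msr_ee && cpu->pending != 0;
}

/* Model of 'stop-self' with this patch: halt and clear the PECE bits. */
void demo_stop_self(DemoCPU *cpu)
{
    cpu->halted = true;
    cpu->msr_ee = false;
    cpu->lpcr &= ~DEMO_LPCR_PECE_ALL;
}

/* Model of 'start-cpu' with this patch: re-enable the PECE bits. */
void demo_start_cpu(DemoCPU *cpu)
{
    cpu->lpcr |= DEMO_LPCR_PECE_ALL;
    cpu->halted = false;
}
```

In this model, a DECR that becomes pending after demo_stop_self() leaves demo_cpu_has_work() false, whereas with the PECE bits still set it would return true.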

Comments

David Gibson Nov. 22, 2017, 2:33 a.m. UTC | #1
On Mon, Nov 20, 2017 at 11:03:45AM +0100, Cédric Le Goater wrote:
> When a CPU is stopped with the 'stop-self' RTAS call, its state
> 'halted' is switched to 1 and, in this case, the MSR is not taken into
> account anymore in the cpu_has_work() routine. Only the pending
> hardware interrupts are checked with their LPCR:PECE* enablement bit.
> 
> If the DECR timer fires after 'stop-self' is called and before the CPU
> 'stop' state is reached, the nearly-dead CPU will have some work to do
> and the guest will crash. This case happens very frequently with the
> not yet upstream P9 XIVE exploitation mode. In XICS mode, the DECR is
> occasionally fired but after 'stop' state, so no work is to be done
> and the guest survives.
> 
> I suspect there is a race between the QEMU mainloop triggering the
> timers and the TCG CPU thread but I could not quite identify the root
> cause. To be safe, let's disable in the LPCR all the exceptions which
> can cause an exit while the CPU is in power-saving mode and reenable
> them when the CPU is started.
> 
> For this purpose, we introduce a little helper routine to calculate
> the PECE bits for a processor variant. We could also use the mask
> value LPCR_PECE_L_MASK for the P8 and P9 processors. bit 47 and 48 are
> reserved on P7 but it is still compatible.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

I'm not thrilled about addressing this without 100% knowing what's
going on, but this seems like a sensible change in any case, so I'm ok
with applying something like this.

A detail however..

[snip]
>  #if !defined(CONFIG_USER_ONLY)
> +
> +target_ulong cpu_ppc_papr_pece_bits(CPUPPCState *env)
> +{
> +    switch (env->mmu_model) {
> +    case POWERPC_MMU_3_00:
> +        return LPCR_PDEE | LPCR_HDEE | LPCR_EEE | LPCR_DEE | LPCR_OEE;
> +    default:
> +        /* P7 and P8 has slightly different PECE bits, mostly because P8 adds
> +         * bit 47 and 48 which are reserved on P7. Here we set them all, which
> +         * will work as expected for both implementations
> +         */
> +        return LPCR_P8_PECE0 | LPCR_P8_PECE1 | LPCR_P8_PECE2 | LPCR_P8_PECE3 |
> +            LPCR_P8_PECE4;
> +    }
> +}

..since we're working in this area, might as well clean up this
inappropriate use of mmu_model.  Two options which I'd be ok with:

1) Add a pece_bits field to the PowerPCCPUClass, correctly initialized
for the various processors.

2) A similar helper but using ppc_check_compat() to check the arch
level, instead of using env->mmu_model.
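
For illustration only, option 1 could look roughly like the following self-contained sketch. It is not taken from the thread or from QEMU: DemoCPUClass, DemoCPU, the demo_* functions and the DEMO_PECE_* masks are hypothetical stand-ins, and the mask values are placeholders rather than real LPCR bits. The point is that each CPU class carries its own PECE mask, so the helper no longer needs to inspect mmu_model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder masks for illustration; NOT the real LPCR bit values. */
#define DEMO_PECE_P8 UINT64_C(0x0f)
#define DEMO_PECE_P9 UINT64_C(0xf0)

/* Hypothetical stand-in for PowerPCCPUClass with a pece_bits field. */
typedef struct DemoCPUClass {
    const char *name;
    uint64_t pece_bits;  /* initialized once per processor family */
} DemoCPUClass;

/* Hypothetical stand-in for a CPU instance. */
typedef struct DemoCPU {
    const DemoCPUClass *cls;
    uint64_t lpcr;       /* stands in for env->spr[SPR_LPCR] */
} DemoCPU;

const DemoCPUClass demo_power8_class = { "POWER8", DEMO_PECE_P8 };
const DemoCPUClass demo_power9_class = { "POWER9", DEMO_PECE_P9 };

/* The helper becomes a plain lookup in the class data. */
uint64_t demo_pece_bits(const DemoCPU *cpu)
{
    assert(cpu != NULL && cpu->cls != NULL);
    return cpu->cls->pece_bits;
}

/* What the start-cpu path would do with the mask. */
void demo_enable_pece(DemoCPU *cpu)
{
    cpu->lpcr |= demo_pece_bits(cpu);
}

/* What the stop-self path would do with the mask. */
void demo_disable_pece(DemoCPU *cpu)
{
    cpu->lpcr &= ~demo_pece_bits(cpu);
}
```

Option 2 would keep a switch-style helper but select the mask from the architecture level reported by ppc_check_compat() rather than from a class field; it is not sketched here.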

Cédric Le Goater Nov. 22, 2017, 6:55 p.m. UTC | #2
On 11/22/2017 03:33 AM, David Gibson wrote:
> On Mon, Nov 20, 2017 at 11:03:45AM +0100, Cédric Le Goater wrote:
>> When a CPU is stopped with the 'stop-self' RTAS call, its state
>> 'halted' is switched to 1 and, in this case, the MSR is not taken into
>> account anymore in the cpu_has_work() routine. Only the pending
>> hardware interrupts are checked with their LPCR:PECE* enablement bit.
>>
>> If the DECR timer fires after 'stop-self' is called and before the CPU
>> 'stop' state is reached, the nearly-dead CPU will have some work to do
>> and the guest will crash. This case happens very frequently with the
>> not yet upstream P9 XIVE exploitation mode. In XICS mode, the DECR is
>> occasionally fired but after 'stop' state, so no work is to be done
>> and the guest survives.
>>
>> I suspect there is a race between the QEMU mainloop triggering the
>> timers and the TCG CPU thread but I could not quite identify the root
>> cause. To be safe, let's disable in the LPCR all the exceptions which
>> can cause an exit while the CPU is in power-saving mode and reenable
>> them when the CPU is started.
>>
>> For this purpose, we introduce a little helper routine to calculate
>> the PECE bits for a processor variant. We could also use the mask
>> value LPCR_PECE_L_MASK for the P8 and P9 processors. bit 47 and 48 are
>> reserved on P7 but it is still compatible.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> 
> I'm not thrilled about addressing this without 100% knowing what's
> going on, 

Me neither :/ I have spent hours, even days, going through QEMU logs
trying to catch a possible race in the state machine of the CPUs and
didn't find it. I need a better understanding of the internals.

> but this seems like a sensible change in any case, so I'm ok
> with applying something like this.
>
> A detail however..
> 
> [snip]
>>  #if !defined(CONFIG_USER_ONLY)
>> +
>> +target_ulong cpu_ppc_papr_pece_bits(CPUPPCState *env)
>> +{
>> +    switch (env->mmu_model) {
>> +    case POWERPC_MMU_3_00:
>> +        return LPCR_PDEE | LPCR_HDEE | LPCR_EEE | LPCR_DEE | LPCR_OEE;
>> +    default:
>> +        /* P7 and P8 has slightly different PECE bits, mostly because P8 adds
>> +         * bit 47 and 48 which are reserved on P7. Here we set them all, which
>> +         * will work as expected for both implementations
>> +         */
>> +        return LPCR_P8_PECE0 | LPCR_P8_PECE1 | LPCR_P8_PECE2 | LPCR_P8_PECE3 |
>> +            LPCR_P8_PECE4;
>> +    }
>> +}
> 
> ..since we're working in this area, might as well clean up this
> inappropriate use of mmu_model.  Two options which I'd be ok with:
> 
> 1) Add a pece_bits field to the PowerPCCPUClass, correctly initialized
> for the various processors.
> 
> 2) A similar helper but using ppc_check_compat() to check the arch
> level, instead of using env->mmu_model.
> 

OK. 

Thanks,

C.
Patch

diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
index cdf0b607a0a0..b5cff3ab29d3 100644
--- a/hw/ppc/spapr_rtas.c
+++ b/hw/ppc/spapr_rtas.c
@@ -174,6 +174,10 @@  static void rtas_start_cpu(PowerPCCPU *cpu_, sPAPRMachineState *spapr,
         kvm_cpu_synchronize_state(cs);
 
         env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
+
+        /* Enable Power-saving mode Exit Cause exceptions for the new CPU */
+        env->spr[SPR_LPCR] |= cpu_ppc_papr_pece_bits(env);
+
         env->nip = start;
         env->gpr[3] = r3;
         cs->halted = 0;
@@ -210,6 +214,11 @@  static void rtas_stop_self(PowerPCCPU *cpu, sPAPRMachineState *spapr,
      * no need to bother with specific bits, we just clear it.
      */
     env->msr = 0;
+
+    /* Disable Power-saving mode Exit Cause exceptions for the CPU.
+     * Leaving them enabled could deliver an interrupt to a dying CPU
+     * and crash the guest */
+    env->spr[SPR_LPCR] &= ~cpu_ppc_papr_pece_bits(env);
 }
 
 static inline int sysparm_st(target_ulong addr, target_ulong len,
diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index 989761b79569..7c84344421f3 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -1327,6 +1327,7 @@  void store_booke_tsr (CPUPPCState *env, target_ulong val);
 void ppc_tlb_invalidate_all (CPUPPCState *env);
 void ppc_tlb_invalidate_one (CPUPPCState *env, target_ulong addr);
 void cpu_ppc_set_papr(PowerPCCPU *cpu, PPCVirtualHypervisor *vhyp);
+target_ulong cpu_ppc_papr_pece_bits(CPUPPCState *env);
 #endif
 #endif
 
diff --git a/target/ppc/translate_init.c b/target/ppc/translate_init.c
index b9c49c22f29f..a0bf5e01dc52 100644
--- a/target/ppc/translate_init.c
+++ b/target/ppc/translate_init.c
@@ -8901,11 +8901,28 @@  POWERPC_FAMILY(POWER9)(ObjectClass *oc, void *data)
 }
 
 #if !defined(CONFIG_USER_ONLY)
+
+target_ulong cpu_ppc_papr_pece_bits(CPUPPCState *env)
+{
+    switch (env->mmu_model) {
+    case POWERPC_MMU_3_00:
+        return LPCR_PDEE | LPCR_HDEE | LPCR_EEE | LPCR_DEE | LPCR_OEE;
+    default:
+        /* P7 and P8 have slightly different PECE bits, mostly because P8 adds
+         * bits 47 and 48 which are reserved on P7. Here we set them all, which
+         * will work as expected for both implementations
+         */
+        return LPCR_P8_PECE0 | LPCR_P8_PECE1 | LPCR_P8_PECE2 | LPCR_P8_PECE3 |
+            LPCR_P8_PECE4;
+    }
+}
+
 void cpu_ppc_set_papr(PowerPCCPU *cpu, PPCVirtualHypervisor *vhyp)
 {
     CPUPPCState *env = &cpu->env;
     ppc_spr_t *lpcr = &env->spr_cb[SPR_LPCR];
     ppc_spr_t *amor = &env->spr_cb[SPR_AMOR];
+    CPUState *cs = CPU(cpu);
 
     cpu->vhyp = vhyp;
 
@@ -8947,16 +8964,16 @@  void cpu_ppc_set_papr(PowerPCCPU *cpu, PPCVirtualHypervisor *vhyp)
         } else {
             lpcr->default_value &= ~(LPCR_UPRT | LPCR_GTSE);
         }
-        lpcr->default_value |= LPCR_PDEE | LPCR_HDEE | LPCR_EEE | LPCR_DEE |
-                               LPCR_OEE;
         break;
     default:
-        /* P7 and P8 has slightly different PECE bits, mostly because P8 adds
-         * bit 47 and 48 which are reserved on P7. Here we set them all, which
-         * will work as expected for both implementations
-         */
-        lpcr->default_value |= LPCR_P8_PECE0 | LPCR_P8_PECE1 | LPCR_P8_PECE2 |
-                               LPCR_P8_PECE3 | LPCR_P8_PECE4;
+        ;
+    }
+
+    /* Only enable Power-saving mode Exit Cause exceptions on the boot
+     * CPU. The RTAS command start-cpu will enable them on secondaries.
+     */
+    if (cs == first_cpu) {
+        lpcr->default_value |= cpu_ppc_papr_pece_bits(env);
     }
 
     /* We should be followed by a CPU reset but update the active value