
[1/2] spapr/rtas: disable the decrementer interrupt when a CPU is unplugged

Message ID 20171005164959.26024-2-clg@kaod.org
State New
Series disable the decrementer interrupt when a CPU is unplugged

Commit Message

Cédric Le Goater Oct. 5, 2017, 4:49 p.m. UTC
When a CPU is stopped with the 'stop-self' RTAS call, its state
'halted' is switched to 1 and, in this case, the MSR is not taken into
account anymore in the cpu_has_work() routine. Only the pending
hardware interrupts are checked with their LPCR:PECE* enablement bit.

If the DECR timer fires after 'stop-self' is called and before the CPU
'stop' state is reached, the nearly-dead CPU will have some work to do
and the guest will crash. This case happens very frequently with the
not yet upstream P9 XIVE exploitation mode. In XICS mode, the DECR is
occasionally fired but after 'stop' state, so no work is to be done
and the guest survives.

I suspect there is a race between the QEMU mainloop triggering the
timers and the TCG CPU thread but I could not quite identify the root
cause. To be safe, let's disable the decrementer interrupt in the LPCR
when the CPU is halted and reenable it when the CPU is restarted.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/ppc/spapr_rtas.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
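
The behaviour described in the commit message can be modelled with a small standalone sketch. It is not QEMU code: the struct, the function names and the bit positions below are all hypothetical, chosen only to show why a pending DECR gives a halted CPU "work" unless its wake-up bit in the LPCR is cleared. The running-CPU branch is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit positions for this sketch only. They are NOT the
 * values QEMU uses for LPCR_DEE or LPCR_P8_PECE3. */
#define SKETCH_LPCR_DECR_WAKE (1ULL << 0)
#define SKETCH_LPCR_EXT_WAKE  (1ULL << 1)

/* Hypothetical pending-interrupt flags. */
#define SKETCH_IRQ_DECR (1u << 0)
#define SKETCH_IRQ_EXT  (1u << 1)

/* Hypothetical, heavily reduced CPU state. */
struct sketch_cpu {
    bool halted;
    bool msr_ee;      /* stands in for the MSR interrupt enable */
    uint64_t lpcr;    /* wake-up enable bits */
    uint32_t pending; /* pending hardware interrupts */
};

/* When halted, the MSR is ignored: a pending interrupt counts as work
 * only if its wake-up bit is set in the LPCR. */
static bool sketch_cpu_has_work(const struct sketch_cpu *cpu)
{
    if (!cpu->halted) {
        /* Simplified: a running CPU honours the MSR. */
        return cpu->msr_ee && cpu->pending != 0;
    }
    if ((cpu->pending & SKETCH_IRQ_DECR) &&
        (cpu->lpcr & SKETCH_LPCR_DECR_WAKE)) {
        return true;
    }
    if ((cpu->pending & SKETCH_IRQ_EXT) &&
        (cpu->lpcr & SKETCH_LPCR_EXT_WAKE)) {
        return true;
    }
    return false;
}

/* Models 'stop-self': clear the MSR and the DECR wake-up bit, then halt. */
static void sketch_stop_self(struct sketch_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->msr_ee = false;
    cpu->lpcr &= ~SKETCH_LPCR_DECR_WAKE;
    cpu->halted = true;
}

/* Models 'start-cpu': restore the DECR wake-up bit and leave the halt. */
static void sketch_start_cpu(struct sketch_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->lpcr |= SKETCH_LPCR_DECR_WAKE;
    cpu->halted = false;
}
```

In this model, a DECR that becomes pending after sketch_stop_self() no longer makes sketch_cpu_has_work() return true. Without the LPCR update in sketch_stop_self() it would, which is the situation the patch guards against.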

Comments

David Gibson Oct. 6, 2017, 9:07 a.m. UTC | #1
On Thu, Oct 05, 2017 at 06:49:58PM +0200, Cédric Le Goater wrote:
> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
> index cdf0b607a0a0..2389220c9738 100644
> --- a/hw/ppc/spapr_rtas.c
> +++ b/hw/ppc/spapr_rtas.c
> @@ -174,6 +174,15 @@ static void rtas_start_cpu(PowerPCCPU *cpu_, sPAPRMachineState *spapr,
>          kvm_cpu_synchronize_state(cs);
>  
>          env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
> +
> +        /* Enable DECR interrupt */
> +        if (env->mmu_model == POWERPC_MMU_3_00) {

Hm.  Checking mmu_model doesn't seem right to me.  I mean, it'll get
the right answer in practice, but the LPCR programming has nothing
whatsoever to do with the MMU.

I think explicitly checking if cpu_ is a POWER9 instance with
object_dynamic_cast would be a better option.

> +            env->spr[SPR_LPCR] |= LPCR_DEE;
> +        } else {
> +            /* P7 and P8 both have same bit for DECR */
> +            env->spr[SPR_LPCR] |= LPCR_P8_PECE3;
> +        }
> +
>          env->nip = start;
>          env->gpr[3] = r3;
>          cs->halted = 0;
> @@ -210,6 +219,13 @@ static void rtas_stop_self(PowerPCCPU *cpu, sPAPRMachineState *spapr,
>       * no need to bother with specific bits, we just clear it.
>       */
>      env->msr = 0;
> +
> +    if (env->mmu_model == POWERPC_MMU_3_00) {
> +        env->spr[SPR_LPCR] &= ~LPCR_DEE;
> +    } else {
> +        /* P7 and P8 both have same bit for DECR */
> +        env->spr[SPR_LPCR] &= ~LPCR_P8_PECE3;
> +    }
>  }
>  
>  static inline int sysparm_st(target_ulong addr, target_ulong len,
Benjamin Herrenschmidt Oct. 6, 2017, 9:53 a.m. UTC | #2
On Fri, 2017-10-06 at 20:07 +1100, David Gibson wrote:
> Hm.  Checking mmu_model doesn't seem right to me.  I mean, it'll get
> the right answer in practice, but the LPCR programming has nothing
> whatsoever to do with the MMU.
> 
> I think explicitly checking if cpu_ is a POWER9 instance with
> object_dynamic_cast would be a better option.

Best is ARCH 300 ... do we have arch versions outside of the MMU model
these days?

Ben.
David Gibson Oct. 6, 2017, 10:10 a.m. UTC | #3
On Fri, Oct 06, 2017 at 11:53:30AM +0200, Benjamin Herrenschmidt wrote:
> On Fri, 2017-10-06 at 20:07 +1100, David Gibson wrote:
> > Hm.  Checking mmu_model doesn't seem right to me.  I mean, it'll get
> > the right answer in practice, but the LPCR programming has nothing
> > whatsoever to do with the MMU.
> > 
> > I think explicitly checking if cpu_ is a POWER9 instance with
> > object_dynamic_cast would be a better option.
> 
> Best is ARCH 300 ... do we have arch versions outside of the MMU model
> these days?

Not that I could spot easily.  Apart from implicitly in the cpu
family.
Cédric Le Goater Oct. 6, 2017, 9:15 p.m. UTC | #4
On 10/06/2017 11:07 AM, David Gibson wrote:
> On Thu, Oct 05, 2017 at 06:49:58PM +0200, Cédric Le Goater wrote:
>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>> index cdf0b607a0a0..2389220c9738 100644
>> --- a/hw/ppc/spapr_rtas.c
>> +++ b/hw/ppc/spapr_rtas.c
>> @@ -174,6 +174,15 @@ static void rtas_start_cpu(PowerPCCPU *cpu_, sPAPRMachineState *spapr,
>>          kvm_cpu_synchronize_state(cs);
>>  
>>          env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
>> +
>> +        /* Enable DECR interrupt */
>> +        if (env->mmu_model == POWERPC_MMU_3_00) {
> 
> Hm.  Checking mmu_model doesn't seem right to me.  I mean, it'll get
> the right answer in practice, but the LPCR programming has nothing
> whatsoever to do with the MMU.
> 
> I think explicitly checking if cpu_ is a POWER9 instance with
> object_dynamic_cast would be a better option.

OK. So I guess we should change the switch statement in cpu_ppc_set_papr()
also.

C. 

David Gibson Oct. 7, 2017, 5:16 a.m. UTC | #5
On Fri, Oct 06, 2017 at 11:15:31PM +0200, Cédric Le Goater wrote:
> On 10/06/2017 11:07 AM, David Gibson wrote:
> > On Thu, Oct 05, 2017 at 06:49:58PM +0200, Cédric Le Goater wrote:
> >> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
> >> index cdf0b607a0a0..2389220c9738 100644
> >> --- a/hw/ppc/spapr_rtas.c
> >> +++ b/hw/ppc/spapr_rtas.c
> >> @@ -174,6 +174,15 @@ static void rtas_start_cpu(PowerPCCPU *cpu_, sPAPRMachineState *spapr,
> >>          kvm_cpu_synchronize_state(cs);
> >>  
> >>          env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
> >> +
> >> +        /* Enable DECR interrupt */
> >> +        if (env->mmu_model == POWERPC_MMU_3_00) {
> > 
> > Hm.  Checking mmu_model doesn't seem right to me.  I mean, it'll get
> > the right answer in practice, but the LPCR programming has nothing
> > whatsoever to do with the MMU.
> > 
> > I think explicitly checking if cpu_ is a POWER9 instance with
> > object_dynamic_cast would be a better option.
> 
> OK. So I guess we should change the switch statement in cpu_ppc_set_papr()
> also.

Yeah, I guess so.  No rush.
Cédric Le Goater Oct. 9, 2017, 2:28 p.m. UTC | #6
On 10/06/2017 12:10 PM, David Gibson wrote:
> On Fri, Oct 06, 2017 at 11:53:30AM +0200, Benjamin Herrenschmidt wrote:
>> On Fri, 2017-10-06 at 20:07 +1100, David Gibson wrote:
>>> Hm.  Checking mmu_model doesn't seem right to me.  I mean, it'll get
>>> the right answer in practice, but the LPCR programming has nothing
>>> whatsoever to do with the MMU.
>>>
>>> I think explicitly checking if cpu_ is a POWER9 instance with
>>> object_dynamic_cast would be a better option.
>>
>> Best is ARCH 300 ... do we have arch versions outside of the MMU model
>> these days?
> 
> Not that I could spot easily.  Apart from implicitly in the cpu
> family.
> 

How about:

	pcc->pvr_match(pcc, CPU_POWERPC_LOGICAL_3_00);

C.
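
A standalone sketch of how a pvr_match-style check could replace the mmu_model test and remove the duplicated if/else in both RTAS handlers. Everything here is a model, not QEMU code: struct sketch_cpu_class, the callback, the helper names and the LPCR bit positions are hypothetical, and the logical PVR values are assumed for illustration. Whether the real pvr_match callbacks accept a logical PVR is not settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values standing in for CPU_POWERPC_LOGICAL_3_00 and
 * CPU_POWERPC_LOGICAL_2_07. */
#define SKETCH_LOGICAL_3_00 0x0F000005u
#define SKETCH_LOGICAL_2_07 0x0F000004u

/* Placeholder bit positions; NOT the real LPCR_DEE / LPCR_P8_PECE3. */
#define SKETCH_LPCR_DEE      (1ULL << 0)
#define SKETCH_LPCR_P8_PECE3 (1ULL << 1)

/* Hypothetical reduced CPU class with a pvr_match-like callback. */
struct sketch_cpu_class {
    uint32_t logical_pvr;
    bool (*pvr_match)(const struct sketch_cpu_class *pcc, uint32_t pvr);
};

/* Model callback: assumes the class matches exactly one logical PVR. */
static bool sketch_pvr_match(const struct sketch_cpu_class *pcc, uint32_t pvr)
{
    return pcc->logical_pvr == pvr;
}

/* Pick the DECR wake-up bit from the architecture level rather than
 * from the MMU model. */
static uint64_t sketch_decr_wake_bit(const struct sketch_cpu_class *pcc)
{
    assert(pcc != NULL && pcc->pvr_match != NULL);
    return pcc->pvr_match(pcc, SKETCH_LOGICAL_3_00)
           ? SKETCH_LPCR_DEE
           : SKETCH_LPCR_P8_PECE3; /* P7 and P8 share the same bit */
}

/* One helper for both start-cpu (enable) and stop-self (disable). */
static uint64_t sketch_set_decr_wake(const struct sketch_cpu_class *pcc,
                                     uint64_t lpcr, bool enable)
{
    uint64_t bit = sketch_decr_wake_bit(pcc);

    return enable ? (lpcr | bit) : (lpcr & ~bit);
}
```

With a helper of this shape, each handler would need a single call, passing true when the CPU is started and false when it stops itself.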

Patch

diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
index cdf0b607a0a0..2389220c9738 100644
--- a/hw/ppc/spapr_rtas.c
+++ b/hw/ppc/spapr_rtas.c
@@ -174,6 +174,15 @@  static void rtas_start_cpu(PowerPCCPU *cpu_, sPAPRMachineState *spapr,
         kvm_cpu_synchronize_state(cs);
 
         env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
+
+        /* Enable DECR interrupt */
+        if (env->mmu_model == POWERPC_MMU_3_00) {
+            env->spr[SPR_LPCR] |= LPCR_DEE;
+        } else {
+            /* P7 and P8 both have same bit for DECR */
+            env->spr[SPR_LPCR] |= LPCR_P8_PECE3;
+        }
+
         env->nip = start;
         env->gpr[3] = r3;
         cs->halted = 0;
@@ -210,6 +219,13 @@  static void rtas_stop_self(PowerPCCPU *cpu, sPAPRMachineState *spapr,
      * no need to bother with specific bits, we just clear it.
      */
     env->msr = 0;
+
+    if (env->mmu_model == POWERPC_MMU_3_00) {
+        env->spr[SPR_LPCR] &= ~LPCR_DEE;
+    } else {
+        /* P7 and P8 both have same bit for DECR */
+        env->spr[SPR_LPCR] &= ~LPCR_P8_PECE3;
+    }
 }
 
 static inline int sysparm_st(target_ulong addr, target_ulong len,