diff mbox series

[RFC/PATCH,2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry

Message ID 1585656658-1838-3-git-send-email-ego@linux.vnet.ibm.com
State RFC
Headers show
Series Add support for stop instruction inside KVM guest | expand

Commit Message

Gautham R. Shenoy March 31, 2020, 12:10 p.m. UTC
From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

ISA v3.0 allows the guest to execute a stop instruction. For this, the
PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
scheduling in the guest vCPU.

Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
set. This patch changes the behaviour to enter the guest with
PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
unconditionally clear these bits. Ideally this should be done
conditionally on platforms where the guest stop instruction has no
Bugs (starting POWER9 DD2.3).

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_hv.c            |  2 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
 2 files changed, 14 insertions(+), 13 deletions(-)

Comments

Nicholas Piggin April 3, 2020, 2:20 a.m. UTC | #1
Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> ISA v3.0 allows the guest to execute a stop instruction. For this, the
> PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> scheduling in the guest vCPU.
> 
> Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> set. This patch changes the behaviour to enter the guest with
> PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> unconditionally clear these bits. Ideally this should be done
> conditionally on platforms where the guest stop instruction has no
> Bugs (starting POWER9 DD2.3).

How will guests know that they can use this facility safely after your
series? You need both DD2.3 and a patched KVM.

> 
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv.c            |  2 +-
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
>  2 files changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index cdb7224..36d059a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
>  	mtspr(SPRN_IC, vcpu->arch.ic);
>  	mtspr(SPRN_PID, vcpu->arch.pid);
>  
> -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
>  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
>  
>  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index dbc2fec..c2daec3 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
>  	mtspr	SPRN_PID, r7
>  	mtspr	SPRN_WORT, r8
>  BEGIN_FTR_SECTION
> +	/* POWER9-only registers */
> +	ld	r5, VCPU_TID(r4)
> +	ld	r6, VCPU_PSSCR(r4)
> +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> +	andc	r6, r6, r7
> +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> +	ld	r7, VCPU_HFSCR(r4)
> +	mtspr	SPRN_TIDR, r5
> +	mtspr	SPRN_PSSCR, r6
> +	mtspr	SPRN_HFSCR, r7
> +FTR_SECTION_ELSE

Why did you move these around? Just because the POWER9 section became
larger than the other?

That's a real wart in the instruction patching implementation, I think
we can fix it by padding with nops in the macros.

Can you just add the additional required nops to the top branch without
changing them around for this patch, so it's easier to see what's going
on? The end result will be the same after patching. Actually changing
these around can have a slight unintended consequence in that code that
runs before features were patched will execute the IF code. Not a
problem here, but another reason why the instruction patching 
restriction is annoying.

Thanks,
Nick

>  	/* POWER8-only registers */
>  	ld	r5, VCPU_TCSCR(r4)
>  	ld	r6, VCPU_ACOP(r4)
> @@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
>  	mtspr	SPRN_CSIGR, r7
>  	mtspr	SPRN_TACR, r8
>  	nop
> -FTR_SECTION_ELSE
> -	/* POWER9-only registers */
> -	ld	r5, VCPU_TID(r4)
> -	ld	r6, VCPU_PSSCR(r4)
> -	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> -	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
> -	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> -	ld	r7, VCPU_HFSCR(r4)
> -	mtspr	SPRN_TIDR, r5
> -	mtspr	SPRN_PSSCR, r6
> -	mtspr	SPRN_HFSCR, r7
> -ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
>  8:
>  
>  	ld	r5, VCPU_SPRG0(r4)
> -- 
> 1.9.4
> 
>
Gautham R. Shenoy April 3, 2020, 9:31 a.m. UTC | #2
On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > 
> > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > scheduling in the guest vCPU.
> > 
> > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > set. This patch changes the behaviour to enter the guest with
> > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > unconditionally clear these bits. Ideally this should be done
> > conditionally on platforms where the guest stop instruction has no
> > Bugs (starting POWER9 DD2.3).
> 
> How will guests know that they can use this facility safely after your
> series? You need both DD2.3 and a patched KVM.


Yes, this is something that isn't addressed in this series (mentioned
in the cover letter), which is a POC demonstrating that the stop0lite
state in guest works.

However, to answer your question, this is the scheme that I had in
mind :

OPAL:
   On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"

Hypervisor Kernel:
    1. If "idle-stop-guest" dt-cpu-feature is discovered, then
       we set bool enable_guest_stop = true;

    2. During KVM guest entry, clear PSSCR[ESL|EC] iff
       enable_guest_stop == true.

    3. In kvm_vm_ioctl_check_extension(), for a new capability
       KVM_CAP_STOP, return true iff enable_guest_top == true.

QEMU:
   Check with the hypervisor if KVM_CAP_STOP is present. If so,
   indicate the presence to the guest via device tree.

Guest Kernel:
   Check for the presence of guest stop state support in
   device-tree. If available, enable the stop0lite in the cpuidle
   driver. 
   

We still have a challenge of migrating a guest which started on a
hypervisor supporting guest stop state to a hypervisor without it.
The target hypervisor should atleast have Patch 1 of this series, so
that we don't crash the guest.

> 
> > 
> > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kvm/book3s_hv.c            |  2 +-
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
> >  2 files changed, 14 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index cdb7224..36d059a 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
> >  	mtspr(SPRN_IC, vcpu->arch.ic);
> >  	mtspr(SPRN_PID, vcpu->arch.pid);
> >  
> > -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> > +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
> >  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
> >  
> >  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > index dbc2fec..c2daec3 100644
> > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
> >  	mtspr	SPRN_PID, r7
> >  	mtspr	SPRN_WORT, r8
> >  BEGIN_FTR_SECTION
> > +	/* POWER9-only registers */
> > +	ld	r5, VCPU_TID(r4)
> > +	ld	r6, VCPU_PSSCR(r4)
> > +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> > +	andc	r6, r6, r7
> > +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > +	ld	r7, VCPU_HFSCR(r4)
> > +	mtspr	SPRN_TIDR, r5
> > +	mtspr	SPRN_PSSCR, r6
> > +	mtspr	SPRN_HFSCR, r7
> > +FTR_SECTION_ELSE
> 
> Why did you move these around? Just because the POWER9 section became
> larger than the other?

Yes.

> 
> That's a real wart in the instruction patching implementation, I think
> we can fix it by padding with nops in the macros.
> 
> Can you just add the additional required nops to the top branch without
> changing them around for this patch, so it's easier to see what's going
> on? The end result will be the same after patching. Actually changing
> these around can have a slight unintended consequence in that code that
> runs before features were patched will execute the IF code. Not a
> problem here, but another reason why the instruction patching 
> restriction is annoying.

Sure, I will repost this patch with additional nops instead of
moving them around.

> 
> Thanks,
> Nick
> 
> >  	/* POWER8-only registers */
> >  	ld	r5, VCPU_TCSCR(r4)
> >  	ld	r6, VCPU_ACOP(r4)
> > @@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
> >  	mtspr	SPRN_CSIGR, r7
> >  	mtspr	SPRN_TACR, r8
> >  	nop
> > -FTR_SECTION_ELSE
> > -	/* POWER9-only registers */
> > -	ld	r5, VCPU_TID(r4)
> > -	ld	r6, VCPU_PSSCR(r4)
> > -	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > -	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
> > -	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > -	ld	r7, VCPU_HFSCR(r4)
> > -	mtspr	SPRN_TIDR, r5
> > -	mtspr	SPRN_PSSCR, r6
> > -	mtspr	SPRN_HFSCR, r7
> > -ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> > +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
> >  8:
> >  
> >  	ld	r5, VCPU_SPRG0(r4)
> > -- 
> > 1.9.4
> > 
> > 

--
Thanks and Regards
gautham.
David Gibson April 6, 2020, 9:58 a.m. UTC | #3
On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > 
> > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > scheduling in the guest vCPU.
> > > 
> > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > set. This patch changes the behaviour to enter the guest with
> > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > unconditionally clear these bits. Ideally this should be done
> > > conditionally on platforms where the guest stop instruction has no
> > > Bugs (starting POWER9 DD2.3).
> > 
> > How will guests know that they can use this facility safely after your
> > series? You need both DD2.3 and a patched KVM.
> 
> 
> Yes, this is something that isn't addressed in this series (mentioned
> in the cover letter), which is a POC demonstrating that the stop0lite
> state in guest works.
> 
> However, to answer your question, this is the scheme that I had in
> mind :
> 
> OPAL:
>    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> 
> Hypervisor Kernel:
>     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
>        we set bool enable_guest_stop = true;
> 
>     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
>        enable_guest_stop == true.
> 
>     3. In kvm_vm_ioctl_check_extension(), for a new capability
>        KVM_CAP_STOP, return true iff enable_guest_top == true.
> 
> QEMU:
>    Check with the hypervisor if KVM_CAP_STOP is present. If so,
>    indicate the presence to the guest via device tree.

Nack.  Presenting different capabilities to the guest depending on
host capabilities (rather than explicit options) is never ok.  It
means that depending on the system you start on you may or may not be
able to migrate to other systems that you're supposed to be able to,

> Guest Kernel:
>    Check for the presence of guest stop state support in
>    device-tree. If available, enable the stop0lite in the cpuidle
>    driver. 
>    
> 
> We still have a challenge of migrating a guest which started on a
> hypervisor supporting guest stop state to a hypervisor without it.
> The target hypervisor should atleast have Patch 1 of this series, so
> that we don't crash the guest.
> 
> > 
> > > 
> > > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > > ---
> > >  arch/powerpc/kvm/book3s_hv.c            |  2 +-
> > >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
> > >  2 files changed, 14 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > index cdb7224..36d059a 100644
> > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
> > >  	mtspr(SPRN_IC, vcpu->arch.ic);
> > >  	mtspr(SPRN_PID, vcpu->arch.pid);
> > >  
> > > -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> > > +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
> > >  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
> > >  
> > >  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> > > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > index dbc2fec..c2daec3 100644
> > > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
> > >  	mtspr	SPRN_PID, r7
> > >  	mtspr	SPRN_WORT, r8
> > >  BEGIN_FTR_SECTION
> > > +	/* POWER9-only registers */
> > > +	ld	r5, VCPU_TID(r4)
> > > +	ld	r6, VCPU_PSSCR(r4)
> > > +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > > +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> > > +	andc	r6, r6, r7
> > > +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > > +	ld	r7, VCPU_HFSCR(r4)
> > > +	mtspr	SPRN_TIDR, r5
> > > +	mtspr	SPRN_PSSCR, r6
> > > +	mtspr	SPRN_HFSCR, r7
> > > +FTR_SECTION_ELSE
> > 
> > Why did you move these around? Just because the POWER9 section became
> > larger than the other?
> 
> Yes.
> 
> > 
> > That's a real wart in the instruction patching implementation, I think
> > we can fix it by padding with nops in the macros.
> > 
> > Can you just add the additional required nops to the top branch without
> > changing them around for this patch, so it's easier to see what's going
> > on? The end result will be the same after patching. Actually changing
> > these around can have a slight unintended consequence in that code that
> > runs before features were patched will execute the IF code. Not a
> > problem here, but another reason why the instruction patching 
> > restriction is annoying.
> 
> Sure, I will repost this patch with additional nops instead of
> moving them around.
> 
> > 
> > Thanks,
> > Nick
> > 
> > >  	/* POWER8-only registers */
> > >  	ld	r5, VCPU_TCSCR(r4)
> > >  	ld	r6, VCPU_ACOP(r4)
> > > @@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
> > >  	mtspr	SPRN_CSIGR, r7
> > >  	mtspr	SPRN_TACR, r8
> > >  	nop
> > > -FTR_SECTION_ELSE
> > > -	/* POWER9-only registers */
> > > -	ld	r5, VCPU_TID(r4)
> > > -	ld	r6, VCPU_PSSCR(r4)
> > > -	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > > -	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
> > > -	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > > -	ld	r7, VCPU_HFSCR(r4)
> > > -	mtspr	SPRN_TIDR, r5
> > > -	mtspr	SPRN_PSSCR, r6
> > > -	mtspr	SPRN_HFSCR, r7
> > > -ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> > > +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
> > >  8:
> > >  
> > >  	ld	r5, VCPU_SPRG0(r4)
>
Gautham R. Shenoy April 7, 2020, 12:33 p.m. UTC | #4
Hello Nicholas,

On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:


[..snip..]
> > > 
> > > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > > ---
> > >  arch/powerpc/kvm/book3s_hv.c            |  2 +-
> > >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +++++++++++++------------
> > >  2 files changed, 14 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > index cdb7224..36d059a 100644
> > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
> > >  	mtspr(SPRN_IC, vcpu->arch.ic);
> > >  	mtspr(SPRN_PID, vcpu->arch.pid);
> > >  
> > > -	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> > > +	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
> > >  	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
> > >  
> > >  	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> > > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > index dbc2fec..c2daec3 100644
> > > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > > @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
> > >  	mtspr	SPRN_PID, r7
> > >  	mtspr	SPRN_WORT, r8
> > >  BEGIN_FTR_SECTION
> > > +	/* POWER9-only registers */
> > > +	ld	r5, VCPU_TID(r4)
> > > +	ld	r6, VCPU_PSSCR(r4)
> > > +	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
> > > +	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> > > +	andc	r6, r6, r7
> > > +	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> > > +	ld	r7, VCPU_HFSCR(r4)
> > > +	mtspr	SPRN_TIDR, r5
> > > +	mtspr	SPRN_PSSCR, r6
> > > +	mtspr	SPRN_HFSCR, r7
> > > +FTR_SECTION_ELSE
> > 
> > Why did you move these around? Just because the POWER9 section became
> > larger than the other?
> 
> Yes.
> 
> > 
> > That's a real wart in the instruction patching implementation, I think
> > we can fix it by padding with nops in the macros.
> > 
> > Can you just add the additional required nops to the top branch without
> > changing them around for this patch, so it's easier to see what's going
> > on? The end result will be the same after patching. Actually changing
> > these around can have a slight unintended consequence in that code that
> > runs before features were patched will execute the IF code. Not a
> > problem here, but another reason why the instruction patching 
> > restriction is annoying.
> 
> Sure, I will repost this patch with additional nops instead of
> moving them around.
> 

Below is the same patch without rearranging the FTR_SECTION blocks,
but with an extra nop. 

---
 arch/powerpc/kvm/book3s_hv.c            | 2 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index c52871c..efa7d3e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3433,7 +3433,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 	mtspr(SPRN_IC, vcpu->arch.ic);
 	mtspr(SPRN_PID, vcpu->arch.pid);
 
-	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
+	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
 	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
 
 	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 780a499..83a69dc 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -833,12 +833,14 @@ BEGIN_FTR_SECTION
 	mtspr	SPRN_CSIGR, r7
 	mtspr	SPRN_TACR, r8
 	nop
+	nop
 FTR_SECTION_ELSE
 	/* POWER9-only registers */
 	ld	r5, VCPU_TID(r4)
 	ld	r6, VCPU_PSSCR(r4)
 	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
-	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
+	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
+	andc	r6, r6, r7
 	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
 	ld	r7, VCPU_HFSCR(r4)
 	mtspr	SPRN_TIDR, r5
Gautham R. Shenoy April 7, 2020, 1:25 p.m. UTC | #5
Hello David,

On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > 
> > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > scheduling in the guest vCPU.
> > > > 
> > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > set. This patch changes the behaviour to enter the guest with
> > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > unconditionally clear these bits. Ideally this should be done
> > > > conditionally on platforms where the guest stop instruction has no
> > > > Bugs (starting POWER9 DD2.3).
> > > 
> > > How will guests know that they can use this facility safely after your
> > > series? You need both DD2.3 and a patched KVM.
> > 
> > 
> > Yes, this is something that isn't addressed in this series (mentioned
> > in the cover letter), which is a POC demonstrating that the stop0lite
> > state in guest works.
> > 
> > However, to answer your question, this is the scheme that I had in
> > mind :
> > 
> > OPAL:
> >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > 
> > Hypervisor Kernel:
> >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> >        we set bool enable_guest_stop = true;
> > 
> >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> >        enable_guest_stop == true.
> > 
> >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> >        KVM_CAP_STOP, return true iff enable_guest_top == true.
> > 
> > QEMU:
> >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> >    indicate the presence to the guest via device tree.
> 
> Nack.  Presenting different capabilities to the guest depending on
> host capabilities (rather than explicit options) is never ok.  It
> means that depending on the system you start on you may or may not be
> able to migrate to other systems that you're supposed to be able to,

I agree that blocking migration for the unavailability of this feature
is not desirable. Could you point me to some other capabilities in KVM
which have been implemented via explicit options?

The ISA 3.0 allows the guest to execute the "stop" instruction. If the
Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
"stop" instruction in the causes a Hypervisor Facility Unavailable
Exception, thus giving the hypervisor a chance to emulate the
instruction. However, in the current code, when the hypervisor
receives this exception, it sends a PROGKILL to the guest which
results in crashing the guest.

Patch 1 of this series emulates wakeup from the "stop"
instruction. Would the following scheme be ok?

OPAL:
	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"

Hypervisor Kernel:

	   If "idle-stop-guest" dt feature is available, then, before
	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
	   bits allowing the guest to safely execute stop instruction.

	   If "idle-stop-guest" dt feature is not available, then, the
	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
	   guest "stop" instruction execution to trap back into the
	   hypervisor. We then emulate a wakeup from the stop
	   instruction (Patch 1 of this series).

Guest Kernel:
      If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
      stop0lite cpuidle state.

This allows us to migrate the KVM guest across any POWER9
Hypervisor. The minimal addition that the Hypervisor would need is
Patch 1 of this series.
	   



--
Thanks and Regards
gautham.
David Gibson April 8, 2020, 2:29 a.m. UTC | #6
On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> Hello David,
> 
> On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > 
> > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > scheduling in the guest vCPU.
> > > > > 
> > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > set. This patch changes the behaviour to enter the guest with
> > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > unconditionally clear these bits. Ideally this should be done
> > > > > conditionally on platforms where the guest stop instruction has no
> > > > > Bugs (starting POWER9 DD2.3).
> > > > 
> > > > How will guests know that they can use this facility safely after your
> > > > series? You need both DD2.3 and a patched KVM.
> > > 
> > > 
> > > Yes, this is something that isn't addressed in this series (mentioned
> > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > state in guest works.
> > > 
> > > However, to answer your question, this is the scheme that I had in
> > > mind :
> > > 
> > > OPAL:
> > >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > 
> > > Hypervisor Kernel:
> > >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > >        we set bool enable_guest_stop = true;
> > > 
> > >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > >        enable_guest_stop == true.
> > > 
> > >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> > >        KVM_CAP_STOP, return true iff enable_guest_top == true.
> > > 
> > > QEMU:
> > >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > >    indicate the presence to the guest via device tree.
> > 
> > Nack.  Presenting different capabilities to the guest depending on
> > host capabilities (rather than explicit options) is never ok.  It
> > means that depending on the system you start on you may or may not be
> > able to migrate to other systems that you're supposed to be able to,
> 
> I agree that blocking migration for the unavailability of this feature
> is not desirable. Could you point me to some other capabilities in KVM
> which have been implemented via explicit options?

TBH, most of the options for the 'pseries' machine type are in this
category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
resizing extension), ic-mode (which irq controllers are available to
the guest).

> The ISA 3.0 allows the guest to execute the "stop" instruction.

So, this was a bug in DD2.2's implementation of the architecture?

> If the
> Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> "stop" instruction in the causes a Hypervisor Facility Unavailable
> Exception, thus giving the hypervisor a chance to emulate the
> instruction. However, in the current code, when the hypervisor
> receives this exception, it sends a PROGKILL to the guest which
> results in crashing the guest.
> 
> Patch 1 of this series emulates wakeup from the "stop"
> instruction. Would the following scheme be ok?
> 
> OPAL:
> 	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> 
> Hypervisor Kernel:
> 
> 	   If "idle-stop-guest" dt feature is available, then, before
> 	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> 	   bits allowing the guest to safely execute stop instruction.
> 
> 	   If "idle-stop-guest" dt feature is not available, then, the
> 	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> 	   guest "stop" instruction execution to trap back into the
> 	   hypervisor. We then emulate a wakeup from the stop
> 	   instruction (Patch 1 of this series).
> 
> Guest Kernel:
>       If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
>       stop0lite cpuidle state.
> 
> This allows us to migrate the KVM guest across any POWER9
> Hypervisor. The minimal addition that the Hypervisor would need is
> Patch 1 of this series.

That could be workable.  Some caveats, though:

 * How does the latency of the trap-and-emulate compare to the guest
   using H_CEDE in the first place?  i.e. how big a negative impact
   will this have for guests running on DD2.2 hosts?

 * We'll only be able to enable this in a new qemu machine type
   version (say, pseries-5.1.0).  Otherwise a guest could start
   thinking it can use stop states, then be migrated to an older qemu
   or host kernel without the support and crash.
Gautham R. Shenoy April 13, 2020, 10:25 a.m. UTC | #7
Hello David,

On Wed, Apr 08, 2020 at 12:29:57PM +1000, David Gibson wrote:
> On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> > Hello David,
> > 
> > On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > > 
> > > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > > scheduling in the guest vCPU.
> > > > > > 
> > > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > > set. This patch changes the behaviour to enter the guest with
> > > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > > unconditionally clear these bits. Ideally this should be done
> > > > > > conditionally on platforms where the guest stop instruction has no
> > > > > > Bugs (starting POWER9 DD2.3).
> > > > > 
> > > > > How will guests know that they can use this facility safely after your
> > > > > series? You need both DD2.3 and a patched KVM.
> > > > 
> > > > 
> > > > Yes, this is something that isn't addressed in this series (mentioned
> > > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > > state in guest works.
> > > > 
> > > > However, to answer your question, this is the scheme that I had in
> > > > mind :
> > > > 
> > > > OPAL:
> > > >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > > 
> > > > Hypervisor Kernel:
> > > >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > > >        we set bool enable_guest_stop = true;
> > > > 
> > > >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > > >        enable_guest_stop == true.
> > > > 
> > > >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> > > >        KVM_CAP_STOP, return true iff enable_guest_top == true.
> > > > 
> > > > QEMU:
> > > >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > > >    indicate the presence to the guest via device tree.
> > > 
> > > Nack.  Presenting different capabilities to the guest depending on
> > > host capabilities (rather than explicit options) is never ok.  It
> > > means that depending on the system you start on you may or may not be
> > > able to migrate to other systems that you're supposed to be able to,
> > 
> > I agree that blocking migration for the unavailability of this feature
> > is not desirable. Could you point me to some other capabilities in KVM
> > which have been implemented via explicit options?
> 
> TBH, most of the options for the 'pseries' machine type are in this
> category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
> Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
> guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
> resizing extension), ic-mode (which irq controllers are available to
> the guest).


Thanks. I will follow this suit.

> 
> > The ISA 3.0 allows the guest to execute the "stop" instruction.
> 
> So, this was a bug in DD2.2's implementation of the architecture?

Yes, the previous versions could miss wakeup events when stop was
executed in HV=0,PR=0 mode. So, the hypervisor had to block that.


> 
> > If the
> > Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> > "stop" instruction in the causes a Hypervisor Facility Unavailable
> > Exception, thus giving the hypervisor a chance to emulate the
> > instruction. However, in the current code, when the hypervisor
> > receives this exception, it sends a PROGKILL to the guest which
> > results in crashing the guest.
> > 
> > Patch 1 of this series emulates wakeup from the "stop"
> > instruction. Would the following scheme be ok?
> > 
> > OPAL:
> > 	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > 
> > Hypervisor Kernel:
> > 
> > 	   If "idle-stop-guest" dt feature is available, then, before
> > 	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> > 	   bits allowing the guest to safely execute stop instruction.
> > 
> > 	   If "idle-stop-guest" dt feature is not available, then, the
> > 	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> > 	   guest "stop" instruction execution to trap back into the
> > 	   hypervisor. We then emulate a wakeup from the stop
> > 	   instruction (Patch 1 of this series).
> > 
> > Guest Kernel:
> >       If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
> >       stop0lite cpuidle state.
> > 
> > This allows us to migrate the KVM guest across any POWER9
> > Hypervisor. The minimal addition that the Hypervisor would need is
> > Patch 1 of this series.
> 
> That could be workable.  Some caveats, though:
> 
>  * How does the latency of the trap-and-emulate compare to the guest
>    using H_CEDE in the first place?  i.e. how big a negative impact
>    will this have for guests running on DD2.2 hosts?


The wakeup latency of trap-and-emulated stop0lite (referred to as
"stop0lite Emulated" in the tables below) the compares favorably
compared to H_CEDE. It is in the order of 5-6us while the wakeup
latency of H_CEDE is ~25-30us.

======================================================================
Wakeup Latency measured using a timer (in ns) [Lower is better]
======================================================================
Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
----------------------------------------------------------------------
snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
----------------------------------------------------------------------
stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
----------------------------------------------------------------------
stop0lite  |   60        | 2378    | 7659   | 5006   |5093.6 |1578.7 |  
Emulated   |             |         |        |        |       |       |
----------------------------------------------------------------------
Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
======================================================================


======================================================================
Wakeup latency measured using an IPI (in ns) [Lower is better]
======================================================================
Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
           |samples |         |        |        |         |          |
----------------------------------------------------------------------
snooze     |   60   |     2089|    4228|    2259|  2342.31|    316.56|
----------------------------------------------------------------------
stop0lite  |   60   |     1947|    3674|    2653|  2610.57|    266.73|
----------------------------------------------------------------------
stop0lite  |   60   |     3580|    8154|    5596|  5644.95|   1368.44|
Emulated   |        |         |        |        |         |          |
----------------------------------------------------------------------
Shared CEDE|   60   |    20147|   36305|   21827| 26762.65|   6875.01|
======================================================================

> 
>  * We'll only be able to enable this in a new qemu machine type
>    version (say, pseries-5.1.0).  Otherwise a guest could start
>    thinking it can use stop states, then be migrated to an older qemu
>    or host kernel without the support and crash.

That makese sense. In fact in the case of not being able to backport
Patch 1 to all the older hypervisor kernels, we will need a way of
gating the guest from using stop-states and then migrating onto an
older hypervisor kernel. Associating this with a new qemu machine type
version should solve this problem, assuming that all the newer qemus
will also be running on newer hypervisor kernels.



> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson

--
Thanks and Regards
gautham.
David Gibson April 14, 2020, 2:17 a.m. UTC | #8
On Mon, Apr 13, 2020 at 03:55:49PM +0530, Gautham R Shenoy wrote:
> Hello David,
> 
> On Wed, Apr 08, 2020 at 12:29:57PM +1000, David Gibson wrote:
> > On Tue, Apr 07, 2020 at 06:55:26PM +0530, Gautham R Shenoy wrote:
> > > Hello David,
> > > 
> > > On Mon, Apr 06, 2020 at 07:58:19PM +1000, David Gibson wrote:
> > > > On Fri, Apr 03, 2020 at 03:01:03PM +0530, Gautham R Shenoy wrote:
> > > > > On Fri, Apr 03, 2020 at 12:20:26PM +1000, Nicholas Piggin wrote:
> > > > > > Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> > > > > > > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > > > > > > 
> > > > > > > ISA v3.0 allows the guest to execute a stop instruction. For this, the
> > > > > > > PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> > > > > > > scheduling in the guest vCPU.
> > > > > > > 
> > > > > > > Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> > > > > > > set. This patch changes the behaviour to enter the guest with
> > > > > > > PSSCR[ESL|EC] bits cleared. This is a RFC patch where we
> > > > > > > unconditionally clear these bits. Ideally this should be done
> > > > > > > conditionally on platforms where the guest stop instruction has no
> > > > > > > Bugs (starting POWER9 DD2.3).
> > > > > > 
> > > > > > How will guests know that they can use this facility safely after your
> > > > > > series? You need both DD2.3 and a patched KVM.
> > > > > 
> > > > > 
> > > > > Yes, this is something that isn't addressed in this series (mentioned
> > > > > in the cover letter), which is a POC demonstrating that the stop0lite
> > > > > state in guest works.
> > > > > 
> > > > > However, to answer your question, this is the scheme that I had in
> > > > > mind :
> > > > > 
> > > > > OPAL:
> > > > >    On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > > > 
> > > > > Hypervisor Kernel:
> > > > >     1. If "idle-stop-guest" dt-cpu-feature is discovered, then
> > > > >        we set bool enable_guest_stop = true;
> > > > > 
> > > > >     2. During KVM guest entry, clear PSSCR[ESL|EC] iff
> > > > >        enable_guest_stop == true.
> > > > > 
> > > > >     3. In kvm_vm_ioctl_check_extension(), for a new capability
> > > > >        KVM_CAP_STOP, return true iff enable_guest_top == true.
> > > > > 
> > > > > QEMU:
> > > > >    Check with the hypervisor if KVM_CAP_STOP is present. If so,
> > > > >    indicate the presence to the guest via device tree.
> > > > 
> > > > Nack.  Presenting different capabilities to the guest depending on
> > > > host capabilities (rather than explicit options) is never ok.  It
> > > > means that depending on the system you start on you may or may not be
> > > > able to migrate to other systems that you're supposed to be able to,
> > > 
> > > I agree that blocking migration for the unavailability of this feature
> > > is not desirable. Could you point me to some other capabilities in KVM
> > > which have been implemented via explicit options?
> > 
> > TBH, most of the options for the 'pseries' machine type are in this
> > category: cap-vsx, cap-dfp, cap-htm, a bunch related to various
> > Spectre mitigations, cap-hpt-max-page-size (maximum page size for hash
> > guests), cap-nested-hv, cap-large-decr, cap-fwnmi, resize-hpt (HPT
> > resizing extension), ic-mode (which irq controllers are available to
> > the guest).
> 
> 
> Thanks. I will follow this suit.
> 
> > 
> > > The ISA 3.0 allows the guest to execute the "stop" instruction.
> > 
> > So, this was a bug in DD2.2's implementation of the architecture?
> 
> Yes, the previous versions could miss wakeup events when stop was
> executed in HV=0,PR=0 mode. So, the hypervisor had to block that.
> 
> 
> > 
> > > If the
> > > Hypervisor hasn't cleared the PSSCR[ESL|EC] then, guest executing the
> > > "stop" instruction in the causes a Hypervisor Facility Unavailable
> > > Exception, thus giving the hypervisor a chance to emulate the
> > > instruction. However, in the current code, when the hypervisor
> > > receives this exception, it sends a PROGKILL to the guest which
> > > results in crashing the guest.
> > > 
> > > Patch 1 of this series emulates wakeup from the "stop"
> > > instruction. Would the following scheme be ok?
> > > 
> > > OPAL:
> > > 	On Procs >= DD2.3 : we publish a dt-cpu-feature "idle-stop-guest"
> > > 
> > > Hypervisor Kernel:
> > > 
> > > 	   If "idle-stop-guest" dt feature is available, then, before
> > > 	   entering the guest, the hypervisor clears the PSSCR[EC|ESL]
> > > 	   bits allowing the guest to safely execute stop instruction.
> > > 
> > > 	   If "idle-stop-guest" dt feature is not available, then, the
> > > 	   Hypervisor sets the PSSCR[ESL|EC] bits, thereby causing a
> > > 	   guest "stop" instruction execution to trap back into the
> > > 	   hypervisor. We then emulate a wakeup from the stop
> > > 	   instruction (Patch 1 of this series).
> > > 
> > > Guest Kernel:
> > >       If (cpu_has_feature(CPU_FTR_ARCH_300)) only then use the
> > >       stop0lite cpuidle state.
> > > 
> > > This allows us to migrate the KVM guest across any POWER9
> > > Hypervisor. The minimal addition that the Hypervisor would need is
> > > Patch 1 of this series.
> > 
> > That could be workable.  Some caveats, though:
> > 
> >  * How does the latency of the trap-and-emulate compare to the guest
> >    using H_CEDE in the first place?  i.e. how big a negative impact
> >    will this have for guests running on DD2.2 hosts?
> 
> 
> The wakeup latency of trap-and-emulated stop0lite (referred to as
> "stop0lite Emulated" in the tables below) the compares favorably
> compared to H_CEDE. It is in the order of 5-6us while the wakeup
> latency of H_CEDE is ~25-30us.

Ok.  So allowing the guest to use stop0lite everywhere, but having it
emulated on the older CPUs should work reasonably well.

> 
> ======================================================================
> Wakeup Latency measured using a timer (in ns) [Lower is better]
> ======================================================================
> Idle state |  Nr samples |  Min    | Max    | Median | Avg   | Stddev|
> ----------------------------------------------------------------------
> snooze     |   60        |  787    | 1059   |  938   | 937.4 | 42.27 |
> ----------------------------------------------------------------------
> stop0lite  |   60        |  770    | 1182   |  948   | 946.4 | 67.41 |
> ----------------------------------------------------------------------
> stop0lite  |   60        | 2378    | 7659   | 5006   |5093.6 |1578.7 |  
> Emulated   |             |         |        |        |       |       |
> ----------------------------------------------------------------------
> Shared CEDE|   60        | 9550    | 36694  | 29219  |28564.1|3545.9 |
> ======================================================================
> 
> 
> ======================================================================
> Wakeup latency measured using an IPI (in ns) [Lower is better]
> ======================================================================
> Idle state |  Nr    |  Min    | Max    | Median | Avg     | Stddev   |
>            |samples |         |        |        |         |          |
> ----------------------------------------------------------------------
> snooze     |   60   |     2089|    4228|    2259|  2342.31|    316.56|
> ----------------------------------------------------------------------
> stop0lite  |   60   |     1947|    3674|    2653|  2610.57|    266.73|
> ----------------------------------------------------------------------
> stop0lite  |   60   |     3580|    8154|    5596|  5644.95|   1368.44|
> Emulated   |        |         |        |        |         |          |
> ----------------------------------------------------------------------
> Shared CEDE|   60   |    20147|   36305|   21827| 26762.65|   6875.01|
> ======================================================================
> 
> > 
> >  * We'll only be able to enable this in a new qemu machine type
> >    version (say, pseries-5.1.0).  Otherwise a guest could start
> >    thinking it can use stop states, then be migrated to an older qemu
> >    or host kernel without the support and crash.
> 
> That makese sense. In fact in the case of not being able to backport
> Patch 1 to all the older hypervisor kernels, we will need a way of
> gating the guest from using stop-states and then migrating onto an
> older hypervisor kernel. Associating this with a new qemu machine type
> version should solve this problem, assuming that all the newer qemus
> will also be running on newer hypervisor kernels.

We can't assume that automatically, but we can enforce it with a
pretty standard mechanism.  The way to do it is this:

 * Make a new spapr capability flag that enables guest use of the lite
   stop instructions
 * In the capability's '.apply' hook, verify that the host kernel can
   support this, and if not fail
 * Enable the new capability by default in the new machine type

So, running the new machine type with default options on an old kernel
will fail with a meaningful error, but existing setups with an old
qemu, or old machine type.
diff mbox series

Patch

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cdb7224..36d059a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3424,7 +3424,7 @@  static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 	mtspr(SPRN_IC, vcpu->arch.ic);
 	mtspr(SPRN_PID, vcpu->arch.pid);
 
-	mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
+	mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
 	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
 
 	mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index dbc2fec..c2daec3 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -823,6 +823,18 @@  END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
 	mtspr	SPRN_PID, r7
 	mtspr	SPRN_WORT, r8
 BEGIN_FTR_SECTION
+	/* POWER9-only registers */
+	ld	r5, VCPU_TID(r4)
+	ld	r6, VCPU_PSSCR(r4)
+	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
+	lis 	r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
+	andc	r6, r6, r7
+	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
+	ld	r7, VCPU_HFSCR(r4)
+	mtspr	SPRN_TIDR, r5
+	mtspr	SPRN_PSSCR, r6
+	mtspr	SPRN_HFSCR, r7
+FTR_SECTION_ELSE
 	/* POWER8-only registers */
 	ld	r5, VCPU_TCSCR(r4)
 	ld	r6, VCPU_ACOP(r4)
@@ -833,18 +845,7 @@  BEGIN_FTR_SECTION
 	mtspr	SPRN_CSIGR, r7
 	mtspr	SPRN_TACR, r8
 	nop
-FTR_SECTION_ELSE
-	/* POWER9-only registers */
-	ld	r5, VCPU_TID(r4)
-	ld	r6, VCPU_PSSCR(r4)
-	lbz	r8, HSTATE_FAKE_SUSPEND(r13)
-	oris	r6, r6, PSSCR_EC@h	/* This makes stop trap to HV */
-	rldimi	r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
-	ld	r7, VCPU_HFSCR(r4)
-	mtspr	SPRN_TIDR, r5
-	mtspr	SPRN_PSSCR, r6
-	mtspr	SPRN_HFSCR, r7
-ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
 8:
 
 	ld	r5, VCPU_SPRG0(r4)