[v3,09/19] KVM: arm64: Implement PSCI SYSTEM_SUSPEND

Message ID 20220223041844.3984439-10-oupton@google.com
State Not Applicable
Series KVM: arm64: Implement PSCI SYSTEM_SUSPEND

Commit Message

Oliver Upton Feb. 23, 2022, 4:18 a.m. UTC
ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND" describes a PSCI call that allows
software to request that a system be placed in the deepest possible
low-power state. Effectively, software can use this to suspend itself to
RAM. Note that the semantics of this PSCI call are very similar to
CPU_SUSPEND, which is already implemented in KVM.

Implement the SYSTEM_SUSPEND in KVM. Similar to CPU_SUSPEND, the
low-power state is implemented as a guest WFI. Synchronously reset the
calling CPU before entering the WFI, such that the vCPU may immediately
resume execution when a wakeup event is recognized.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/arm64/kvm/psci.c  | 51 ++++++++++++++++++++++++++++++++++++++++++
 arch/arm64/kvm/reset.c |  3 ++-
 2 files changed, 53 insertions(+), 1 deletion(-)

Comments

Marc Zyngier Feb. 24, 2022, 2:02 p.m. UTC | #1
On Wed, 23 Feb 2022 04:18:34 +0000,
Oliver Upton <oupton@google.com> wrote:
> 
> ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND" describes a PSCI call that allows
> software to request that a system be placed in the deepest possible
> low-power state. Effectively, software can use this to suspend itself to
> RAM. Note that the semantics of this PSCI call are very similar to
> CPU_SUSPEND, which is already implemented in KVM.
> 
> Implement the SYSTEM_SUSPEND in KVM. Similar to CPU_SUSPEND, the
> low-power state is implemented as a guest WFI. Synchronously reset the
> calling CPU before entering the WFI, such that the vCPU may immediately
> resume execution when a wakeup event is recognized.
> 
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/arm64/kvm/psci.c  | 51 ++++++++++++++++++++++++++++++++++++++++++
>  arch/arm64/kvm/reset.c |  3 ++-
>  2 files changed, 53 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
> index 77a00913cdfd..41adaaf2234a 100644
> --- a/arch/arm64/kvm/psci.c
> +++ b/arch/arm64/kvm/psci.c
> @@ -208,6 +208,50 @@ static void kvm_psci_system_reset(struct kvm_vcpu *vcpu)
>  	kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_RESET);
>  }
>  
> +static int kvm_psci_system_suspend(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_reset_state reset_state;
> +	struct kvm *kvm = vcpu->kvm;
> +	struct kvm_vcpu *tmp;
> +	bool denied = false;
> +	unsigned long i;
> +
> +	reset_state.pc = smccc_get_arg1(vcpu);
> +	if (!kvm_ipa_valid(kvm, reset_state.pc)) {
> +		smccc_set_retval(vcpu, PSCI_RET_INVALID_ADDRESS, 0, 0, 0);
> +		return 1;
> +	}
> +
> +	reset_state.r0 = smccc_get_arg2(vcpu);
> +	reset_state.be = kvm_vcpu_is_be(vcpu);
> +	reset_state.reset = true;
> +
> +	/*
> +	 * The SYSTEM_SUSPEND PSCI call requires that all vCPUs (except the
> +	 * calling vCPU) be in an OFF state, as determined by the
> +	 * implementation.
> +	 *
> +	 * See ARM DEN0022D, 5.19 "SYSTEM_SUSPEND" for more details.
> +	 */
> +	mutex_lock(&kvm->lock);
> +	kvm_for_each_vcpu(i, tmp, kvm) {
> +		if (tmp != vcpu && !kvm_arm_vcpu_powered_off(tmp)) {
> +			denied = true;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&kvm->lock);

This looks dodgy. Nothing seems to prevent userspace from setting the
mp_state to RUNNING in parallel with this, as only the vcpu mutex is
held when this ioctl is issued.

It looks to me that what you want is what lock_all_vcpus() does
(Alexandru has a patch moving it out of the vgic code as part of his
SPE series).
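
For illustration only, the all-or-nothing locking that lock_all_vcpus()
provides can be modelled in plain userspace C. This is a minimal sketch,
not kernel code: struct model_vcpu, the model_* helpers and the
atomic_flag standing in for the vcpu mutex are all assumptions made for
the example. The caller is skipped on the assumption that it already
holds its own lock while handling the PSCI call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vCPU; not the kernel's struct kvm_vcpu. */
struct model_vcpu {
	atomic_flag busy;	/* models the per-vCPU mutex (set = held) */
	bool powered_off;	/* models the "powered off" mp_state */
};

/* Release vcpus[0..n-1] in reverse order, leaving the caller's lock alone. */
static void model_unlock_others(struct model_vcpu *vcpus, size_t n,
				size_t caller)
{
	while (n > 0) {
		n--;
		if (n != caller)
			atomic_flag_clear(&vcpus[n].busy);
	}
}

/*
 * Try to take every lock except the caller's. A trylock is used so that a
 * busy vCPU makes the whole operation fail rather than block; on failure
 * the locks already taken are rolled back.
 */
static bool model_trylock_others(struct model_vcpu *vcpus, size_t n,
				 size_t caller)
{
	for (size_t i = 0; i < n; i++) {
		if (i == caller)
			continue;
		if (atomic_flag_test_and_set(&vcpus[i].busy)) {
			model_unlock_others(vcpus, i, caller);
			return false;
		}
	}
	return true;
}

/* True only if every other vCPU could be locked and was found powered off. */
static bool model_suspend_allowed(struct model_vcpu *vcpus, size_t n,
				  size_t caller)
{
	bool allowed = true;

	assert(caller < n);
	if (!model_trylock_others(vcpus, n, caller))
		return false;

	for (size_t i = 0; i < n; i++) {
		if (i != caller && !vcpus[i].powered_off) {
			allowed = false;
			break;
		}
	}

	model_unlock_others(vcpus, n, caller);
	return allowed;
}
```

The answer is only stable while the locks are held. Once they are
dropped, the state of any other vcpu can change again.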

It is also pretty unclear what the interaction with userspace is once
you have released the lock. If the VMM starts a vcpu other than the
suspending one, what is its state? The spec doesn't seem to help
here. I can see two options:

- either all the vcpus have the same reset state applied to them as
  they come up, unless they are started with CPU_ON by a vcpu that has
  already booted (but there is a single 'context_id' provided, and I
  fear this is going to confuse the OS)...

- or only the suspending vcpu can resume the system, and we must fail
  a change of mp_state for the other vcpus.

What do you think?

> +
> +	if (denied) {
> +		smccc_set_retval(vcpu, PSCI_RET_DENIED, 0, 0, 0);
> +		return 1;
> +	}
> +
> +	__kvm_reset_vcpu(vcpu, &reset_state);
> +	kvm_vcpu_wfi(vcpu);

I have mixed feelings about this. The vcpu has reset before being in
WFI, while it really should be the other way around and userspace
could rely on observing the transition.

What breaks if you change this?

Thanks,

	M.
Oliver Upton Feb. 24, 2022, 7:35 p.m. UTC | #2
Hi Marc,

Thanks for reviewing the series. ACK to the nits and smaller comments
you've made, I'll incorporate that feedback in the next series.

On Thu, Feb 24, 2022 at 02:02:34PM +0000, Marc Zyngier wrote:
> On Wed, 23 Feb 2022 04:18:34 +0000,
> Oliver Upton <oupton@google.com> wrote:
> > 
> > ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND" describes a PSCI call that allows
> > software to request that a system be placed in the deepest possible
> > low-power state. Effectively, software can use this to suspend itself to
> > RAM. Note that the semantics of this PSCI call are very similar to
> > CPU_SUSPEND, which is already implemented in KVM.
> > 
> > Implement the SYSTEM_SUSPEND in KVM. Similar to CPU_SUSPEND, the
> > low-power state is implemented as a guest WFI. Synchronously reset the
> > calling CPU before entering the WFI, such that the vCPU may immediately
> > resume execution when a wakeup event is recognized.
> > 
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > ---
> >  arch/arm64/kvm/psci.c  | 51 ++++++++++++++++++++++++++++++++++++++++++
> >  arch/arm64/kvm/reset.c |  3 ++-
> >  2 files changed, 53 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
> > index 77a00913cdfd..41adaaf2234a 100644
> > --- a/arch/arm64/kvm/psci.c
> > +++ b/arch/arm64/kvm/psci.c
> > @@ -208,6 +208,50 @@ static void kvm_psci_system_reset(struct kvm_vcpu *vcpu)
> >  	kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_RESET);
> >  }
> >  
> > +static int kvm_psci_system_suspend(struct kvm_vcpu *vcpu)
> > +{
> > +	struct vcpu_reset_state reset_state;
> > +	struct kvm *kvm = vcpu->kvm;
> > +	struct kvm_vcpu *tmp;
> > +	bool denied = false;
> > +	unsigned long i;
> > +
> > +	reset_state.pc = smccc_get_arg1(vcpu);
> > +	if (!kvm_ipa_valid(kvm, reset_state.pc)) {
> > +		smccc_set_retval(vcpu, PSCI_RET_INVALID_ADDRESS, 0, 0, 0);
> > +		return 1;
> > +	}
> > +
> > +	reset_state.r0 = smccc_get_arg2(vcpu);
> > +	reset_state.be = kvm_vcpu_is_be(vcpu);
> > +	reset_state.reset = true;
> > +
> > +	/*
> > +	 * The SYSTEM_SUSPEND PSCI call requires that all vCPUs (except the
> > +	 * calling vCPU) be in an OFF state, as determined by the
> > +	 * implementation.
> > +	 *
> > +	 * See ARM DEN0022D, 5.19 "SYSTEM_SUSPEND" for more details.
> > +	 */
> > +	mutex_lock(&kvm->lock);
> > +	kvm_for_each_vcpu(i, tmp, kvm) {
> > +		if (tmp != vcpu && !kvm_arm_vcpu_powered_off(tmp)) {
> > +			denied = true;
> > +			break;
> > +		}
> > +	}
> > +	mutex_unlock(&kvm->lock);
> 
> This looks dodgy. Nothing seems to prevent userspace from setting the
> mp_state to RUNNING in parallel with this, as only the vcpu mutex is
> held when this ioctl is issued.
> 
> It looks to me that what you want is what lock_all_vcpus() does
> (Alexandru has a patch moving it out of the vgic code as part of his
> SPE series).
> 
> It is also pretty unclear what the interaction with userspace is once
> you have released the lock. If the VMM starts a vcpu other than the
> suspending one, what is its state? The spec doesn't seem to help
> here. I can see two options:
> 
> - either all the vcpus have the same reset state applied to them as
>   they come up, unless they are started with CPU_ON by a vcpu that has
>   already booted (but there is a single 'context_id' provided, and I
>   fear this is going to confuse the OS)...
> 
> - or only the suspending vcpu can resume the system, and we must fail
>   a change of mp_state for the other vcpus.
> 
> What do you think?

Definitely the latter. The documentation of SYSTEM_SUSPEND is quite
shaky on this, but it would appear that the intention is for the caller
to be the first CPU to wake up.

> > +
> > +	if (denied) {
> > +		smccc_set_retval(vcpu, PSCI_RET_DENIED, 0, 0, 0);
> > +		return 1;
> > +	}
> > +
> > +	__kvm_reset_vcpu(vcpu, &reset_state);
> > +	kvm_vcpu_wfi(vcpu);
> 
> I have mixed feelings about this. The vcpu has reset before being in
> WFI, while it really should be the other way around and userspace
> could rely on observing the transition.
> 
> What breaks if you change this?

I don't think that userspace would be able to observe the transition
even if we WFI before the reset. I imagine that would take the form
of setting KVM_REQ_VCPU_RESET, which we explicitly handle before
letting userspace access the vCPU's state as of commit
6826c6849b46 ("KVM: arm64: Handle PSCI resets before userspace
touches vCPU state").

Given this, I felt it was probably best to avoid all the indirection and
just do the vCPU reset in the handling of SYSTEM_SUSPEND. It does,
however, imply that we have slightly different behavior when userspace
exits are enabled, as that will happen pre-reset and pre-WFI.

--
Oliver
Marc Zyngier Feb. 25, 2022, 6:58 p.m. UTC | #3
On Thu, 24 Feb 2022 19:35:33 +0000,
Oliver Upton <oupton@google.com> wrote:
> 
> Hi Marc,
> 
> Thanks for reviewing the series. ACK to the nits and smaller comments
> you've made, I'll incorporate that feedback in the next series.
> 
> On Thu, Feb 24, 2022 at 02:02:34PM +0000, Marc Zyngier wrote:
> > On Wed, 23 Feb 2022 04:18:34 +0000,
> > Oliver Upton <oupton@google.com> wrote:
> > > 
> > > ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND" describes a PSCI call that allows
> > > software to request that a system be placed in the deepest possible
> > > low-power state. Effectively, software can use this to suspend itself to
> > > RAM. Note that the semantics of this PSCI call are very similar to
> > > CPU_SUSPEND, which is already implemented in KVM.
> > > 
> > > Implement the SYSTEM_SUSPEND in KVM. Similar to CPU_SUSPEND, the
> > > low-power state is implemented as a guest WFI. Synchronously reset the
> > > calling CPU before entering the WFI, such that the vCPU may immediately
> > > resume execution when a wakeup event is recognized.
> > > 
> > > Signed-off-by: Oliver Upton <oupton@google.com>
> > > ---
> > >  arch/arm64/kvm/psci.c  | 51 ++++++++++++++++++++++++++++++++++++++++++
> > >  arch/arm64/kvm/reset.c |  3 ++-
> > >  2 files changed, 53 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
> > > index 77a00913cdfd..41adaaf2234a 100644
> > > --- a/arch/arm64/kvm/psci.c
> > > +++ b/arch/arm64/kvm/psci.c
> > > @@ -208,6 +208,50 @@ static void kvm_psci_system_reset(struct kvm_vcpu *vcpu)
> > >  	kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_RESET);
> > >  }
> > >  
> > > +static int kvm_psci_system_suspend(struct kvm_vcpu *vcpu)
> > > +{
> > > +	struct vcpu_reset_state reset_state;
> > > +	struct kvm *kvm = vcpu->kvm;
> > > +	struct kvm_vcpu *tmp;
> > > +	bool denied = false;
> > > +	unsigned long i;
> > > +
> > > +	reset_state.pc = smccc_get_arg1(vcpu);
> > > +	if (!kvm_ipa_valid(kvm, reset_state.pc)) {
> > > +		smccc_set_retval(vcpu, PSCI_RET_INVALID_ADDRESS, 0, 0, 0);
> > > +		return 1;
> > > +	}
> > > +
> > > +	reset_state.r0 = smccc_get_arg2(vcpu);
> > > +	reset_state.be = kvm_vcpu_is_be(vcpu);
> > > +	reset_state.reset = true;
> > > +
> > > +	/*
> > > +	 * The SYSTEM_SUSPEND PSCI call requires that all vCPUs (except the
> > > +	 * calling vCPU) be in an OFF state, as determined by the
> > > +	 * implementation.
> > > +	 *
> > > +	 * See ARM DEN0022D, 5.19 "SYSTEM_SUSPEND" for more details.
> > > +	 */
> > > +	mutex_lock(&kvm->lock);
> > > +	kvm_for_each_vcpu(i, tmp, kvm) {
> > > +		if (tmp != vcpu && !kvm_arm_vcpu_powered_off(tmp)) {
> > > +			denied = true;
> > > +			break;
> > > +		}
> > > +	}
> > > +	mutex_unlock(&kvm->lock);
> > 
> > This looks dodgy. Nothing seems to prevent userspace from setting the
> > mp_state to RUNNING in parallel with this, as only the vcpu mutex is
> > held when this ioctl is issued.
> > 
> > It looks to me that what you want is what lock_all_vcpus() does
> > (Alexandru has a patch moving it out of the vgic code as part of his
> > SPE series).
> > 
> > It is also pretty unclear what the interaction with userspace is once
> > you have released the lock. If the VMM starts a vcpu other than the
> > suspending one, what is its state? The spec doesn't seem to help
> > here. I can see two options:
> > 
> > - either all the vcpus have the same reset state applied to them as
> >   they come up, unless they are started with CPU_ON by a vcpu that has
> >   already booted (but there is a single 'context_id' provided, and I
> >   fear this is going to confuse the OS)...
> > 
> > - or only the suspending vcpu can resume the system, and we must fail
> >   a change of mp_state for the other vcpus.
> > 
> > What do you think?
> 
> Definitely the latter. The documentation of SYSTEM_SUSPEND is quite
> shaky on this, but it would appear that the intention is for the caller
> to be the first CPU to wake up.

Yup. We now have clarification on the intent of the spec (only the
caller CPU can resume the system), and this needs to be tightened.

> 
> > > +
> > > +	if (denied) {
> > > +		smccc_set_retval(vcpu, PSCI_RET_DENIED, 0, 0, 0);
> > > +		return 1;
> > > +	}
> > > +
> > > +	__kvm_reset_vcpu(vcpu, &reset_state);
> > > +	kvm_vcpu_wfi(vcpu);
> > 
> > I have mixed feelings about this. The vcpu has reset before being in
> > WFI, while it really should be the other way around and userspace
> > could rely on observing the transition.
> > 
> > What breaks if you change this?
> 
> I don't think that userspace would be able to observe the transition
> even if we WFI before the reset.

I disagree. Userspace can issue a signal at any point, which would
trigger a return from WFI and an exit to userspace, and I don't think
this should result in a reset being observed.

This also means that SYSTEM_SUSPEND must be robust wrt signal
delivery, which it doesn't seem to be.
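
One way to picture the ordering argued for here is a small model in
which SYSTEM_SUSPEND only stages the resume state, and the reset is
applied when a real wakeup event arrives. This is a hedged sketch in
userspace C, not what the patch does: struct susp_vcpu, the susp_*
helpers and the two wake reasons are hypothetical names chosen for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reasons for leaving the blocked (WFI) state. */
enum susp_wake_reason {
	SUSP_WAKE_SIGNAL,	/* userspace signal: exit, but not a wakeup */
	SUSP_WAKE_INTERRUPT,	/* genuine wakeup event for the guest */
};

/* Hypothetical vCPU model; only the fields needed for the example. */
struct susp_vcpu {
	uint64_t pc;
	uint64_t x0;
	bool suspended;		/* SYSTEM_SUSPEND accepted, not yet woken */
	uint64_t resume_pc;	/* staged entry point address */
	uint64_t resume_ctx;	/* staged context_id */
};

/* Accept the call: stage the reset state, leave live registers untouched. */
static void susp_enter(struct susp_vcpu *v, uint64_t entry, uint64_t ctx)
{
	assert(v);
	v->suspended = true;
	v->resume_pc = entry;
	v->resume_ctx = ctx;
}

/*
 * Handle a return from the blocked state. Returns true if the guest may
 * run. A signal leaves the vCPU suspended with its pre-reset registers, so
 * userspace never observes a reset that has not logically happened yet.
 */
static bool susp_wake(struct susp_vcpu *v, enum susp_wake_reason why)
{
	assert(v);
	if (!v->suspended)
		return true;
	if (why == SUSP_WAKE_SIGNAL)
		return false;

	/* Real wakeup: apply the staged reset state now. */
	v->pc = v->resume_pc;
	v->x0 = v->resume_ctx;
	v->suspended = false;
	return true;
}
```

With this shape, a signal-induced exit can be retried any number of
times without changing what userspace sees.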

> I imagine that would take the form
> of setting KVM_REQ_VCPU_RESET, which we explicitly handle before
> letting userspace access the vCPU's state as of commit
> 6826c6849b46 ("KVM: arm64: Handle PSCI resets before userspace
> touches vCPU state").

In that case, the vcpu is ready to run, and is not blocked by
anything, so this is quite different.

>
> Given this, I felt it was probably best to avoid all the indirection and
> just do the vCPU reset in the handling of SYSTEM_SUSPEND. It does,
> however, imply that we have slightly different behavior when userspace
> exits are enabled, as that will happen pre-reset and pre-WFI.

And that's exactly the sort of behaviour I'd like to avoid if at all
possible. But maybe we don't need to support the standalone version
that doesn't involve userspace?

	M.
Oliver Upton March 3, 2022, 1:01 a.m. UTC | #4
On Fri, Feb 25, 2022 at 06:58:13PM +0000, Marc Zyngier wrote:
> On Thu, 24 Feb 2022 19:35:33 +0000,
> Oliver Upton <oupton@google.com> wrote:
> > 
> > Hi Marc,
> > 
> > Thanks for reviewing the series. ACK to the nits and smaller comments
> > you've made, I'll incorporate that feedback in the next series.
> > 
> > On Thu, Feb 24, 2022 at 02:02:34PM +0000, Marc Zyngier wrote:
> > > On Wed, 23 Feb 2022 04:18:34 +0000,
> > > Oliver Upton <oupton@google.com> wrote:
> > > > 
> > > > ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND" describes a PSCI call that allows
> > > > software to request that a system be placed in the deepest possible
> > > > low-power state. Effectively, software can use this to suspend itself to
> > > > RAM. Note that the semantics of this PSCI call are very similar to
> > > > CPU_SUSPEND, which is already implemented in KVM.
> > > > 
> > > > Implement the SYSTEM_SUSPEND in KVM. Similar to CPU_SUSPEND, the
> > > > low-power state is implemented as a guest WFI. Synchronously reset the
> > > > calling CPU before entering the WFI, such that the vCPU may immediately
> > > > resume execution when a wakeup event is recognized.
> > > > 
> > > > Signed-off-by: Oliver Upton <oupton@google.com>
> > > > ---
> > > >  arch/arm64/kvm/psci.c  | 51 ++++++++++++++++++++++++++++++++++++++++++
> > > >  arch/arm64/kvm/reset.c |  3 ++-
> > > >  2 files changed, 53 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
> > > > index 77a00913cdfd..41adaaf2234a 100644
> > > > --- a/arch/arm64/kvm/psci.c
> > > > +++ b/arch/arm64/kvm/psci.c
> > > > @@ -208,6 +208,50 @@ static void kvm_psci_system_reset(struct kvm_vcpu *vcpu)
> > > >  	kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_RESET);
> > > >  }
> > > >  
> > > > +static int kvm_psci_system_suspend(struct kvm_vcpu *vcpu)
> > > > +{
> > > > +	struct vcpu_reset_state reset_state;
> > > > +	struct kvm *kvm = vcpu->kvm;
> > > > +	struct kvm_vcpu *tmp;
> > > > +	bool denied = false;
> > > > +	unsigned long i;
> > > > +
> > > > +	reset_state.pc = smccc_get_arg1(vcpu);
> > > > +	if (!kvm_ipa_valid(kvm, reset_state.pc)) {
> > > > +		smccc_set_retval(vcpu, PSCI_RET_INVALID_ADDRESS, 0, 0, 0);
> > > > +		return 1;
> > > > +	}
> > > > +
> > > > +	reset_state.r0 = smccc_get_arg2(vcpu);
> > > > +	reset_state.be = kvm_vcpu_is_be(vcpu);
> > > > +	reset_state.reset = true;
> > > > +
> > > > +	/*
> > > > +	 * The SYSTEM_SUSPEND PSCI call requires that all vCPUs (except the
> > > > +	 * calling vCPU) be in an OFF state, as determined by the
> > > > +	 * implementation.
> > > > +	 *
> > > > +	 * See ARM DEN0022D, 5.19 "SYSTEM_SUSPEND" for more details.
> > > > +	 */
> > > > +	mutex_lock(&kvm->lock);
> > > > +	kvm_for_each_vcpu(i, tmp, kvm) {
> > > > +		if (tmp != vcpu && !kvm_arm_vcpu_powered_off(tmp)) {
> > > > +			denied = true;
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +	mutex_unlock(&kvm->lock);
> > > 
> > > This looks dodgy. Nothing seems to prevent userspace from setting the
> > > mp_state to RUNNING in parallel with this, as only the vcpu mutex is
> > > held when this ioctl is issued.
> > > 
> > > It looks to me that what you want is what lock_all_vcpus() does
> > > (Alexandru has a patch moving it out of the vgic code as part of his
> > > SPE series).
> > > 
> > > It is also pretty unclear what the interaction with userspace is once
> > > you have released the lock. If the VMM starts a vcpu other than the
> > > suspending one, what is its state? The spec doesn't seem to help
> > > here. I can see two options:
> > > 
> > > - either all the vcpus have the same reset state applied to them as
> > >   they come up, unless they are started with CPU_ON by a vcpu that has
> > >   already booted (but there is a single 'context_id' provided, and I
> > >   fear this is going to confuse the OS)...
> > > 
> > > - or only the suspending vcpu can resume the system, and we must fail
> > >   a change of mp_state for the other vcpus.
> > > 
> > > What do you think?
> > 
> > Definitely the latter. The documentation of SYSTEM_SUSPEND is quite
> > shaky on this, but it would appear that the intention is for the caller
> > to be the first CPU to wake up.
> 
> Yup. We now have clarification on the intent of the spec (only the
> caller CPU can resume the system), and this needs to be tightened.
> 

I'm beginning to wonder if the VMM/KVM split implementation of
system-scoped PSCI calls can ever be right. There exists a critical
section in all system-wide PSCI calls that currently spans an exit to
userspace. I cannot devise a sane way to guard such a critical section
when we are returning control to userspace.

For example, KVM offlines all of the CPUs except for the exiting CPU
when handling SYSTEM_RESET or SYSTEM_OFF, but nothing prevents an
interleaving KVM_ARM_VCPU_INIT or KVM_SET_MP_STATE from disturbing the
state of the VM. Couldn't even say it's a userspace bug, either, because
a different vCPU could do something before the caller has exited. Even
if we grab all the vCPU mutexes, we'd need to drop them before exiting
to userspace.

If userspace decides to reject the PSCI call, we're giving control
back to the guest in a wildly different state than it had making the
PSCI call. Again, the PSCI spec is vague on this matter, but I believe
the intuitive answer is that we should not change the VM state if the call
is rejected. This could upset an otherwise well-behaved KVM guest.

Doing SYSTEM_SUSPEND in userspace is better, as KVM avoids mucking with
the VM state before the PSCI call is actually accepted. However, any of
the consistency checks in the kernel for SYSTEM_SUSPEND are entirely
moot. Anything can happen between the exit to userspace and the moment
userspace actually recognizes the SYSTEM_SUSPEND call on the exiting
CPU.

KVM rejecting attempts to resume vCPUs besides the caller will break
a correct userspace, given the inherent race that crops up when exiting.
Blocking attempts to resume other vCPUs could have unintended
consequences as well. It seems that we'd need to prevent
KVM_ARM_VCPU_INIT calls as well as KVM_SET_MP_STATE, even though the
former could be used in a valid SYSTEM_SUSPEND implementation.

I really do hate to go back to the drawing board on the PSCI stuff
again, but there seems to be a fundamental issue in how system-scoped
calls are handled. Userspace is probably the only place where we could
quiesce the VM state, assess if the PSCI call should be accepted, and
change the VM state.

Do you think all of this is an issue as well?

--
Oliver
Marc Zyngier March 3, 2022, 11:37 a.m. UTC | #5
On Thu, 03 Mar 2022 01:01:40 +0000,
Oliver Upton <oupton@google.com> wrote:
> 
>
> I'm beginning to wonder if the VMM/KVM split implementation of
> system-scoped PSCI calls can ever be right. There exists a critical
> section in all system-wide PSCI calls that currently spans an exit to
> userspace. I cannot devise a sane way to guard such a critical section
> when we are returning control to userspace.
> 
> For example, KVM offlines all of the CPUs except for the exiting CPU
> when handling SYSTEM_RESET or SYSTEM_OFF, but nothing prevents an
> interleaving KVM_ARM_VCPU_INIT or KVM_SET_MP_STATE from disturbing the
> state of the VM. Couldn't even say it's a userspace bug, either, because
> a different vCPU could do something before the caller has exited. Even
> if we grab all the vCPU mutexes, we'd need to drop them before exiting
> to userspace.
> 
> If userspace decides to reject the PSCI call, we're giving control
> back to the guest in a wildly different state than it had making the
> PSCI call. Again, the PSCI spec is vague on this matter, but I believe
> the intuitive answer is that we should not change the VM state if the call
> is rejected. This could upset an otherwise well-behaved KVM guest.

Sure. But this is the equivalent of a buggy firmware/hardware, and a
failing PSCI reboot is likely to have had destructive effects. Is it
nice? Absolutely not. Is it a problem in practice? It hasn't been in the
10+ years this API has been implemented.

The alternative is to be able to forward all the PSCI events to
userspace and let it deal with it. It has long been at the back of my
mind to allow userspace to request ranges of hypercalls to be
forwarded directly, without any in-kernel handling. I'm all for it,
but this must be a buy-in from the VMM.

> Doing SYSTEM_SUSPEND in userspace is better, as KVM avoids mucking with
> the VM state before the PSCI call is actually accepted. However, any of
> the consistency checks in the kernel for SYSTEM_SUSPEND are entirely
> moot. Anything can happen between the exit to userspace and the moment
> userspace actually recognizes the SYSTEM_SUSPEND call on the exiting
> CPU.

I agree. Maybe we just don't do any and only exit to userspace on the
calling vcpu. It then becomes the responsibility of userspace to take
the other vcpus out of the kernel and change their state if required.
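
As a rough sketch of what that userspace responsibility could look like,
the model below quiesces the other vcpus, checks that they are off, and
only then stages the suspend. It is an assumption-laden illustration in
plain C, not a real VMM or the KVM API: struct vmm_vcpu,
struct vmm_suspend_request and vmm_system_suspend() are hypothetical.
The return codes use the values defined by the PSCI specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* PSCI return codes, with the values defined by the PSCI specification. */
#define PSCI_RET_SUCCESS	0
#define PSCI_RET_DENIED		(-3)

/* Hypothetical VMM-side view of a vCPU. */
struct vmm_vcpu {
	bool powered_off;	/* VMM's record of the vCPU power state */
	bool paused;		/* vCPU thread has been kicked out of the kernel */
};

/* Hypothetical record of an accepted suspend, consumed on wakeup. */
struct vmm_suspend_request {
	bool pending;
	uint64_t entry_point;
	uint64_t context_id;
};

static int32_t vmm_system_suspend(struct vmm_vcpu *vcpus, size_t n,
				  size_t caller, uint64_t entry_point,
				  uint64_t context_id,
				  struct vmm_suspend_request *req)
{
	bool denied = false;

	assert(caller < n);

	/* 1. Quiesce: stop every other vCPU so its state cannot change. */
	for (size_t i = 0; i < n; i++)
		if (i != caller)
			vcpus[i].paused = true;

	/* 2. Assess: all vCPUs other than the caller must be off. */
	for (size_t i = 0; i < n; i++)
		if (i != caller && !vcpus[i].powered_off)
			denied = true;

	/* 3a. Reject: undo the quiesce and leave all other state untouched. */
	if (denied) {
		for (size_t i = 0; i < n; i++)
			if (i != caller)
				vcpus[i].paused = false;
		return PSCI_RET_DENIED;
	}

	/* 3b. Accept: stage the resume state for the calling vCPU only. */
	req->pending = true;
	req->entry_point = entry_point;
	req->context_id = context_id;
	return PSCI_RET_SUCCESS;
}
```

Because the check runs after the other vcpus have been stopped, the
decision cannot be invalidated by a concurrent state change, and a
rejected call hands the guest back the state it had when it made the
call.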

> 
> KVM rejecting attempts to resume vCPUs besides the caller will break
> a correct userspace, given the inherent race that crops up when exiting.
> Blocking attempts to resume other vCPUs could have unintended
> consequences as well. It seems that we'd need to prevent
> KVM_ARM_VCPU_INIT calls as well as KVM_SET_MP_STATE, even though the
> former could be used in a valid SYSTEM_SUSPEND implementation.

I don't think we need to enforce this if we leave suspend entirely to
userspace. At the end of the day, we rely on the VMM not to screw up
the guest. If the VMM restarts the wrong vcpu, that's bad behaviour,
but there are a million other ways for the VMM to mess the guest up.

> I really do hate to go back to the drawing board on the PSCI stuff
> again, but there seems to be a fundamental issue in how system-scoped
> calls are handled. Userspace is probably the only place where we could
> quiesce the VM state, assess if the PSCI call should be accepted, and
> change the VM state.
>
> Do you think all of this is an issue as well?

I don't think we should worry too much about the other system events.
They are now ABI, and changing them is tricky. For suspend, I think
punting the whole thing to userspace is doable. Otherwise, the
alternative is to implement full userspace PSCI support, which is
going to be a lot of work (and a lot of ABI discussions...).

Thanks,

	M.

Patch

diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
index 77a00913cdfd..41adaaf2234a 100644
--- a/arch/arm64/kvm/psci.c
+++ b/arch/arm64/kvm/psci.c
@@ -208,6 +208,50 @@  static void kvm_psci_system_reset(struct kvm_vcpu *vcpu)
 	kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_RESET);
 }
 
+static int kvm_psci_system_suspend(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_reset_state reset_state;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_vcpu *tmp;
+	bool denied = false;
+	unsigned long i;
+
+	reset_state.pc = smccc_get_arg1(vcpu);
+	if (!kvm_ipa_valid(kvm, reset_state.pc)) {
+		smccc_set_retval(vcpu, PSCI_RET_INVALID_ADDRESS, 0, 0, 0);
+		return 1;
+	}
+
+	reset_state.r0 = smccc_get_arg2(vcpu);
+	reset_state.be = kvm_vcpu_is_be(vcpu);
+	reset_state.reset = true;
+
+	/*
+	 * The SYSTEM_SUSPEND PSCI call requires that all vCPUs (except the
+	 * calling vCPU) be in an OFF state, as determined by the
+	 * implementation.
+	 *
+	 * See ARM DEN0022D, 5.19 "SYSTEM_SUSPEND" for more details.
+	 */
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, tmp, kvm) {
+		if (tmp != vcpu && !kvm_arm_vcpu_powered_off(tmp)) {
+			denied = true;
+			break;
+		}
+	}
+	mutex_unlock(&kvm->lock);
+
+	if (denied) {
+		smccc_set_retval(vcpu, PSCI_RET_DENIED, 0, 0, 0);
+		return 1;
+	}
+
+	__kvm_reset_vcpu(vcpu, &reset_state);
+	kvm_vcpu_wfi(vcpu);
+	return 1;
+}
+
 static void kvm_psci_narrow_to_32bit(struct kvm_vcpu *vcpu)
 {
 	int i;
@@ -343,6 +387,8 @@  static int kvm_psci_1_0_call(struct kvm_vcpu *vcpu)
 		case PSCI_0_2_FN_MIGRATE_INFO_TYPE:
 		case PSCI_0_2_FN_SYSTEM_OFF:
 		case PSCI_0_2_FN_SYSTEM_RESET:
+		case PSCI_1_0_FN_SYSTEM_SUSPEND:
+		case PSCI_1_0_FN64_SYSTEM_SUSPEND:
 		case PSCI_1_0_FN_PSCI_FEATURES:
 		case ARM_SMCCC_VERSION_FUNC_ID:
 			val = 0;
@@ -352,6 +398,11 @@  static int kvm_psci_1_0_call(struct kvm_vcpu *vcpu)
 			break;
 		}
 		break;
+	case PSCI_1_0_FN_SYSTEM_SUSPEND:
+		kvm_psci_narrow_to_32bit(vcpu);
+		fallthrough;
+	case PSCI_1_0_FN64_SYSTEM_SUSPEND:
+		return kvm_psci_system_suspend(vcpu);
 	default:
 		return kvm_psci_0_2_call(vcpu);
 	}
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index f879a8f6a99c..006e7a75ceba 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -215,7 +215,8 @@  static bool vcpu_allowed_register_width(struct kvm_vcpu *vcpu)
  *
  * Note: This function can be called from two paths:
  *  - The KVM_ARM_VCPU_INIT ioctl
- *  - handling a request issued by another VCPU in the PSCI handling code
+ *  - handling a request issued by possibly another VCPU in the PSCI handling
+ *    code
  *
  * In the first case, the VCPU will not be loaded, and in the second case the
  * VCPU will be loaded.  Because this function operates purely on the