[v6] kvm: better MWAIT emulation for guests
diff mbox

Message ID 1491911135-216950-1-git-send-email-agraf@suse.de
State New
Headers show

Commit Message

Alexander Graf April 11, 2017, 11:45 a.m. UTC
From: "Michael S. Tsirkin" <mst@redhat.com>

Guests that are heavy on futexes end up IPI'ing each other a lot. That
can lead to significant slowdowns and latency increase for those guests
when running within KVM.

If only a single guest is needed on a host, we have a lot of spare host
CPU time we can throw at the problem. Modern CPUs implement a feature
called "MWAIT" which allows guests to wake up sleeping remote CPUs without
an IPI - thus without an exit - at the expense of never going out of guest
context.

The decision whether this is something sensible to use should be up to the
VM admin, so to user space. We can however allow MWAIT execution on systems
that support it properly hardware wise.

This patch adds a CAP to user space and a KVM cpuid leaf to indicate
availability of native MWAIT execution. With that enabled, the worst a
guest can do is waste as many cycles as a "jmp ." would do, so it's not
a privilege problem.

We consciously do *not* expose the feature in our CPUID bitmap, as most
people will want to benefit from sleeping vCPUs to allow for over commit.

Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
[agraf: fix amd, change commit message]
Signed-off-by: Alexander Graf <agraf@suse.de>

---

v5 -> v6:

  - Fix AMD check, so that we're consistent between svm and vmx
  - Clarify commit message
---
 Documentation/virtual/kvm/api.txt    |  9 +++++++++
 Documentation/virtual/kvm/cpuid.txt  |  6 ++++++
 arch/x86/include/uapi/asm/kvm_para.h |  1 +
 arch/x86/kvm/cpuid.c                 |  3 +++
 arch/x86/kvm/svm.c                   |  7 +++++--
 arch/x86/kvm/vmx.c                   |  6 ++++--
 arch/x86/kvm/x86.c                   |  3 +++
 arch/x86/kvm/x86.h                   | 36 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h             |  1 +
 9 files changed, 68 insertions(+), 4 deletions(-)

Comments

Gabriel L. Somlo April 11, 2017, 12:41 p.m. UTC | #1
On Tue, Apr 11, 2017 at 01:45:35PM +0200, Alexander Graf wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> 
> Guests that are heavy on futexes end up IPI'ing each other a lot. That
> can lead to significant slowdowns and latency increase for those guests
> when running within KVM.
> 
> If only a single guest is needed on a host, we have a lot of spare host
> CPU time we can throw at the problem. Modern CPUs implement a feature
> called "MWAIT" which allows guests to wake up sleeping remote CPUs without
> an IPI - thus without an exit - at the expense of never going out of guest
> context.
> 
> The decision whether this is something sensible to use should be up to the
> VM admin, so to user space. We can however allow MWAIT execution on systems
> that support it properly hardware wise.
> 
> This patch adds a CAP to user space and a KVM cpuid leaf to indicate
> availability of native MWAIT execution. With that enabled, the worst a
> guest can do is waste as many cycles as a "jmp ." would do, so it's not
> a privilege problem.

Did you mean "hlt" rather than "jmp" ?

> We consciously do *not* expose the feature in our CPUID bitmap, as most
> people will want to benefit from sleeping vCPUs to allow for over commit.
> 
> Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>

That's maybe a bit inacurate, I didn't actually report anything *this*
patch is trying to address (that was rather commit 87c00572ba05aa8c9d).

Maybe

Acked-by: Gabriel Somlo <gsomlo@gmail.com>

would be a more accurate statement instead?

Thanks much,
--Gabriel

> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> [agraf: fix amd, change commit message]
> Signed-off-by: Alexander Graf <agraf@suse.de>
> 
> ---
> 
> v5 -> v6:
> 
>   - Fix AMD check, so that we're consistent between svm and vmx
>   - Clarify commit message
> ---
>  Documentation/virtual/kvm/api.txt    |  9 +++++++++
>  Documentation/virtual/kvm/cpuid.txt  |  6 ++++++
>  arch/x86/include/uapi/asm/kvm_para.h |  1 +
>  arch/x86/kvm/cpuid.c                 |  3 +++
>  arch/x86/kvm/svm.c                   |  7 +++++--
>  arch/x86/kvm/vmx.c                   |  6 ++++--
>  arch/x86/kvm/x86.c                   |  3 +++
>  arch/x86/kvm/x86.h                   | 36 ++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h             |  1 +
>  9 files changed, 68 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 1a18484..dacc3d7 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4094,3 +4094,12 @@ reserved.
>   2: MIPS64 or microMIPS64 with access to all address segments.
>      Both registers and addresses are 64-bits wide.
>      It will be possible to run 64-bit or 32-bit guest code.
> +
> +8.8 KVM_CAP_X86_GUEST_MWAIT
> +
> +Architectures: x86
> +
> +This capability indicates that guest using memory monotoring instructions
> +(MWAIT/MWAITX) to stop the virtual CPU will not cause a VM exit.  As such time
> +spent while virtual CPU is halted in this way will then be accounted for as
> +guest running time on the host (as opposed to e.g. HLT).
> diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
> index 3c65feb..04c201c 100644
> --- a/Documentation/virtual/kvm/cpuid.txt
> +++ b/Documentation/virtual/kvm/cpuid.txt
> @@ -54,6 +54,12 @@ KVM_FEATURE_PV_UNHALT              ||     7 || guest checks this feature bit
>                                     ||       || before enabling paravirtualized
>                                     ||       || spinlock support.
>  ------------------------------------------------------------------------------
> +KVM_FEATURE_MWAIT                  ||     8 || guest can use monitor/mwait
> +                                   ||       || to halt the VCPU without exits,
> +                                   ||       || time spent while halted in this
> +                                   ||       || way is accounted for on host as
> +                                   ||       || VCPU run time.
> +------------------------------------------------------------------------------
>  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
>                                     ||       || per-cpu warps are expected in
>                                     ||       || kvmclock.
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index cff0bb6..9cc77a7 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -24,6 +24,7 @@
>  #define KVM_FEATURE_STEAL_TIME		5
>  #define KVM_FEATURE_PV_EOI		6
>  #define KVM_FEATURE_PV_UNHALT		7
> +#define KVM_FEATURE_MWAIT		8
>  
>  /* The last 8 bits are used to indicate how to interpret the flags field
>   * in pvclock structure. If no bits are set, all flags are ignored.
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index efde6cc..5638102 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -594,6 +594,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  		if (sched_info_on())
>  			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
>  
> +		if (kvm_mwait_in_guest())
> +			entry->eax |= (1 << KVM_FEATURE_MWAIT);
> +
>  		entry->ebx = 0;
>  		entry->ecx = 0;
>  		entry->edx = 0;
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 1b203ab..c41f03e 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1198,10 +1198,13 @@ static void init_vmcb(struct vcpu_svm *svm)
>  	set_intercept(svm, INTERCEPT_CLGI);
>  	set_intercept(svm, INTERCEPT_SKINIT);
>  	set_intercept(svm, INTERCEPT_WBINVD);
> -	set_intercept(svm, INTERCEPT_MONITOR);
> -	set_intercept(svm, INTERCEPT_MWAIT);
>  	set_intercept(svm, INTERCEPT_XSETBV);
>  
> +	if (!kvm_mwait_in_guest()) {
> +		set_intercept(svm, INTERCEPT_MONITOR);
> +		set_intercept(svm, INTERCEPT_MWAIT);
> +	}
> +
>  	control->iopm_base_pa = iopm_base;
>  	control->msrpm_base_pa = __pa(svm->msrpm);
>  	control->int_ctl = V_INTR_MASKING_MASK;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index cfdb0d9..2e62f9d 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3547,11 +3547,13 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  	      CPU_BASED_USE_IO_BITMAPS |
>  	      CPU_BASED_MOV_DR_EXITING |
>  	      CPU_BASED_USE_TSC_OFFSETING |
> -	      CPU_BASED_MWAIT_EXITING |
> -	      CPU_BASED_MONITOR_EXITING |
>  	      CPU_BASED_INVLPG_EXITING |
>  	      CPU_BASED_RDPMC_EXITING;
>  
> +	if (!kvm_mwait_in_guest())
> +		min |= CPU_BASED_MWAIT_EXITING |
> +			CPU_BASED_MONITOR_EXITING;
> +
>  	opt = CPU_BASED_TPR_SHADOW |
>  	      CPU_BASED_USE_MSR_BITMAPS |
>  	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bc47e2..ffaa76d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2679,6 +2679,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_ADJUST_CLOCK:
>  		r = KVM_CLOCK_TSC_STABLE;
>  		break;
> +	case KVM_CAP_X86_GUEST_MWAIT:
> +		r = kvm_mwait_in_guest();
> +		break;
>  	case KVM_CAP_X86_SMM:
>  		/* SMBASE is usually relocated above 1M on modern chipsets,
>  		 * and SMM handlers might indeed rely on 4G segment limits,
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index e8ff3e4..6120670 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -1,6 +1,8 @@
>  #ifndef ARCH_X86_KVM_X86_H
>  #define ARCH_X86_KVM_X86_H
>  
> +#include <asm/processor.h>
> +#include <asm/mwait.h>
>  #include <linux/kvm_host.h>
>  #include <asm/pvclock.h>
>  #include "kvm_cache_regs.h"
> @@ -212,4 +214,38 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
>  	    __rem;						\
>  	 })
>  
> +static inline bool kvm_mwait_in_guest(void)
> +{
> +	unsigned int eax, ebx, ecx, edx;
> +
> +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
> +		return false;
> +
> +	switch (boot_cpu_data.x86_vendor) {
> +	case X86_VENDOR_AMD:
> +		/* All AMD CPUs have a working MWAIT implementation */
> +		return true;
> +	case X86_VENDOR_INTEL:
> +		/* Handle Intel below */
> +		break;
> +	default:
> +		return false;
> +	}
> +
> +	/*
> +	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
> +	 * they would allow guest to stop the CPU completely by disabling
> +	 * interrupts then invoking MWAIT.
> +	 */
> +	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
> +		return false;
> +
> +	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
> +
> +	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
> +		return false;
> +
> +	return true;
> +}
> +
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 1e1a6c7..a3f03d3 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -890,6 +890,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_MIPS_VZ 137
>  #define KVM_CAP_MIPS_TE 138
>  #define KVM_CAP_MIPS_64BIT 139
> +#define KVM_CAP_X86_GUEST_MWAIT 140
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> -- 
> 1.8.5.6
>
Gabriel L. Somlo April 11, 2017, 12:43 p.m. UTC | #2
On Tue, Apr 11, 2017 at 08:41:11AM -0400, Gabriel L. Somlo wrote:
> On Tue, Apr 11, 2017 at 01:45:35PM +0200, Alexander Graf wrote:
> > From: "Michael S. Tsirkin" <mst@redhat.com>
> > 
> > Guests that are heavy on futexes end up IPI'ing each other a lot. That
> > can lead to significant slowdowns and latency increase for those guests
> > when running within KVM.
> > 
> > If only a single guest is needed on a host, we have a lot of spare host
> > CPU time we can throw at the problem. Modern CPUs implement a feature
> > called "MWAIT" which allows guests to wake up sleeping remote CPUs without
> > an IPI - thus without an exit - at the expense of never going out of guest
> > context.
> > 
> > The decision whether this is something sensible to use should be up to the
> > VM admin, so to user space. We can however allow MWAIT execution on systems
> > that support it properly hardware wise.
> > 
> > This patch adds a CAP to user space and a KVM cpuid leaf to indicate
> > availability of native MWAIT execution. With that enabled, the worst a
> > guest can do is waste as many cycles as a "jmp ." would do, so it's not
> > a privilege problem.
> 
> Did you mean "hlt" rather than "jmp" ?

Or, rather, "nop" ? (apologies, still waiting for the coffee to
cool down to a drinkable temperature... :)
 
> > We consciously do *not* expose the feature in our CPUID bitmap, as most
> > people will want to benefit from sleeping vCPUs to allow for over commit.
> > 
> > Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
> 
> That's maybe a bit inacurate, I didn't actually report anything *this*
> patch is trying to address (that was rather commit 87c00572ba05aa8c9d).
> 
> Maybe
> 
> Acked-by: Gabriel Somlo <gsomlo@gmail.com>
> 
> would be a more accurate statement instead?
> 
> Thanks much,
> --Gabriel
> 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > [agraf: fix amd, change commit message]
> > Signed-off-by: Alexander Graf <agraf@suse.de>
> > 
> > ---
> > 
> > v5 -> v6:
> > 
> >   - Fix AMD check, so that we're consistent between svm and vmx
> >   - Clarify commit message
> > ---
> >  Documentation/virtual/kvm/api.txt    |  9 +++++++++
> >  Documentation/virtual/kvm/cpuid.txt  |  6 ++++++
> >  arch/x86/include/uapi/asm/kvm_para.h |  1 +
> >  arch/x86/kvm/cpuid.c                 |  3 +++
> >  arch/x86/kvm/svm.c                   |  7 +++++--
> >  arch/x86/kvm/vmx.c                   |  6 ++++--
> >  arch/x86/kvm/x86.c                   |  3 +++
> >  arch/x86/kvm/x86.h                   | 36 ++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/kvm.h             |  1 +
> >  9 files changed, 68 insertions(+), 4 deletions(-)
> > 
> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > index 1a18484..dacc3d7 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -4094,3 +4094,12 @@ reserved.
> >   2: MIPS64 or microMIPS64 with access to all address segments.
> >      Both registers and addresses are 64-bits wide.
> >      It will be possible to run 64-bit or 32-bit guest code.
> > +
> > +8.8 KVM_CAP_X86_GUEST_MWAIT
> > +
> > +Architectures: x86
> > +
> > +This capability indicates that guest using memory monotoring instructions
> > +(MWAIT/MWAITX) to stop the virtual CPU will not cause a VM exit.  As such time
> > +spent while virtual CPU is halted in this way will then be accounted for as
> > +guest running time on the host (as opposed to e.g. HLT).
> > diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
> > index 3c65feb..04c201c 100644
> > --- a/Documentation/virtual/kvm/cpuid.txt
> > +++ b/Documentation/virtual/kvm/cpuid.txt
> > @@ -54,6 +54,12 @@ KVM_FEATURE_PV_UNHALT              ||     7 || guest checks this feature bit
> >                                     ||       || before enabling paravirtualized
> >                                     ||       || spinlock support.
> >  ------------------------------------------------------------------------------
> > +KVM_FEATURE_MWAIT                  ||     8 || guest can use monitor/mwait
> > +                                   ||       || to halt the VCPU without exits,
> > +                                   ||       || time spent while halted in this
> > +                                   ||       || way is accounted for on host as
> > +                                   ||       || VCPU run time.
> > +------------------------------------------------------------------------------
> >  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
> >                                     ||       || per-cpu warps are expected in
> >                                     ||       || kvmclock.
> > diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> > index cff0bb6..9cc77a7 100644
> > --- a/arch/x86/include/uapi/asm/kvm_para.h
> > +++ b/arch/x86/include/uapi/asm/kvm_para.h
> > @@ -24,6 +24,7 @@
> >  #define KVM_FEATURE_STEAL_TIME		5
> >  #define KVM_FEATURE_PV_EOI		6
> >  #define KVM_FEATURE_PV_UNHALT		7
> > +#define KVM_FEATURE_MWAIT		8
> >  
> >  /* The last 8 bits are used to indicate how to interpret the flags field
> >   * in pvclock structure. If no bits are set, all flags are ignored.
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index efde6cc..5638102 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -594,6 +594,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> >  		if (sched_info_on())
> >  			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> >  
> > +		if (kvm_mwait_in_guest())
> > +			entry->eax |= (1 << KVM_FEATURE_MWAIT);
> > +
> >  		entry->ebx = 0;
> >  		entry->ecx = 0;
> >  		entry->edx = 0;
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index 1b203ab..c41f03e 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -1198,10 +1198,13 @@ static void init_vmcb(struct vcpu_svm *svm)
> >  	set_intercept(svm, INTERCEPT_CLGI);
> >  	set_intercept(svm, INTERCEPT_SKINIT);
> >  	set_intercept(svm, INTERCEPT_WBINVD);
> > -	set_intercept(svm, INTERCEPT_MONITOR);
> > -	set_intercept(svm, INTERCEPT_MWAIT);
> >  	set_intercept(svm, INTERCEPT_XSETBV);
> >  
> > +	if (!kvm_mwait_in_guest()) {
> > +		set_intercept(svm, INTERCEPT_MONITOR);
> > +		set_intercept(svm, INTERCEPT_MWAIT);
> > +	}
> > +
> >  	control->iopm_base_pa = iopm_base;
> >  	control->msrpm_base_pa = __pa(svm->msrpm);
> >  	control->int_ctl = V_INTR_MASKING_MASK;
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index cfdb0d9..2e62f9d 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -3547,11 +3547,13 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
> >  	      CPU_BASED_USE_IO_BITMAPS |
> >  	      CPU_BASED_MOV_DR_EXITING |
> >  	      CPU_BASED_USE_TSC_OFFSETING |
> > -	      CPU_BASED_MWAIT_EXITING |
> > -	      CPU_BASED_MONITOR_EXITING |
> >  	      CPU_BASED_INVLPG_EXITING |
> >  	      CPU_BASED_RDPMC_EXITING;
> >  
> > +	if (!kvm_mwait_in_guest())
> > +		min |= CPU_BASED_MWAIT_EXITING |
> > +			CPU_BASED_MONITOR_EXITING;
> > +
> >  	opt = CPU_BASED_TPR_SHADOW |
> >  	      CPU_BASED_USE_MSR_BITMAPS |
> >  	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 6bc47e2..ffaa76d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2679,6 +2679,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  	case KVM_CAP_ADJUST_CLOCK:
> >  		r = KVM_CLOCK_TSC_STABLE;
> >  		break;
> > +	case KVM_CAP_X86_GUEST_MWAIT:
> > +		r = kvm_mwait_in_guest();
> > +		break;
> >  	case KVM_CAP_X86_SMM:
> >  		/* SMBASE is usually relocated above 1M on modern chipsets,
> >  		 * and SMM handlers might indeed rely on 4G segment limits,
> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > index e8ff3e4..6120670 100644
> > --- a/arch/x86/kvm/x86.h
> > +++ b/arch/x86/kvm/x86.h
> > @@ -1,6 +1,8 @@
> >  #ifndef ARCH_X86_KVM_X86_H
> >  #define ARCH_X86_KVM_X86_H
> >  
> > +#include <asm/processor.h>
> > +#include <asm/mwait.h>
> >  #include <linux/kvm_host.h>
> >  #include <asm/pvclock.h>
> >  #include "kvm_cache_regs.h"
> > @@ -212,4 +214,38 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
> >  	    __rem;						\
> >  	 })
> >  
> > +static inline bool kvm_mwait_in_guest(void)
> > +{
> > +	unsigned int eax, ebx, ecx, edx;
> > +
> > +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
> > +		return false;
> > +
> > +	switch (boot_cpu_data.x86_vendor) {
> > +	case X86_VENDOR_AMD:
> > +		/* All AMD CPUs have a working MWAIT implementation */
> > +		return true;
> > +	case X86_VENDOR_INTEL:
> > +		/* Handle Intel below */
> > +		break;
> > +	default:
> > +		return false;
> > +	}
> > +
> > +	/*
> > +	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
> > +	 * they would allow guest to stop the CPU completely by disabling
> > +	 * interrupts then invoking MWAIT.
> > +	 */
> > +	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
> > +		return false;
> > +
> > +	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
> > +
> > +	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  #endif
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 1e1a6c7..a3f03d3 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -890,6 +890,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_MIPS_VZ 137
> >  #define KVM_CAP_MIPS_TE 138
> >  #define KVM_CAP_MIPS_64BIT 139
> > +#define KVM_CAP_X86_GUEST_MWAIT 140
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > -- 
> > 1.8.5.6
> >
Alexander Graf April 11, 2017, 12:43 p.m. UTC | #3
On 04/11/2017 02:41 PM, Gabriel L. Somlo wrote:
> On Tue, Apr 11, 2017 at 01:45:35PM +0200, Alexander Graf wrote:
>> From: "Michael S. Tsirkin" <mst@redhat.com>
>>
>> Guests that are heavy on futexes end up IPI'ing each other a lot. That
>> can lead to significant slowdowns and latency increase for those guests
>> when running within KVM.
>>
>> If only a single guest is needed on a host, we have a lot of spare host
>> CPU time we can throw at the problem. Modern CPUs implement a feature
>> called "MWAIT" which allows guests to wake up sleeping remote CPUs without
>> an IPI - thus without an exit - at the expense of never going out of guest
>> context.
>>
>> The decision whether this is something sensible to use should be up to the
>> VM admin, so to user space. We can however allow MWAIT execution on systems
>> that support it properly hardware wise.
>>
>> This patch adds a CAP to user space and a KVM cpuid leaf to indicate
>> availability of native MWAIT execution. With that enabled, the worst a
>> guest can do is waste as many cycles as a "jmp ." would do, so it's not
>> a privilege problem.
> Did you mean "hlt" rather than "jmp" ?

No, hlt wouldn't waste cycles, "jmp ." does.

The point I'm trying to make here is that by removing the MWAIT trap we 
don't give the guest more CPU time than we would've granted it before.

>
>> We consciously do *not* expose the feature in our CPUID bitmap, as most
>> people will want to benefit from sleeping vCPUs to allow for over commit.
>>
>> Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
> That's maybe a bit inacurate, I didn't actually report anything *this*
> patch is trying to address (that was rather commit 87c00572ba05aa8c9d).
>
> Maybe
>
> Acked-by: Gabriel Somlo <gsomlo@gmail.com>
>
> would be a more accurate statement instead?

Works for me :). I'm sure whoever applies this can swizzle the tag?


Alex
Kyle Copperfield via April 11, 2017, 5:10 p.m. UTC | #4
This might be more useful if it could be dynamically toggled on and
off, depending on system load.

On Tue, Apr 11, 2017 at 4:45 AM, Alexander Graf <agraf@suse.de> wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
>
> Guests that are heavy on futexes end up IPI'ing each other a lot. That
> can lead to significant slowdowns and latency increase for those guests
> when running within KVM.
>
> If only a single guest is needed on a host, we have a lot of spare host
> CPU time we can throw at the problem. Modern CPUs implement a feature
> called "MWAIT" which allows guests to wake up sleeping remote CPUs without
> an IPI - thus without an exit - at the expense of never going out of guest
> context.
>
> The decision whether this is something sensible to use should be up to the
> VM admin, so to user space. We can however allow MWAIT execution on systems
> that support it properly hardware wise.
>
> This patch adds a CAP to user space and a KVM cpuid leaf to indicate
> availability of native MWAIT execution. With that enabled, the worst a
> guest can do is waste as many cycles as a "jmp ." would do, so it's not
> a privilege problem.
>
> We consciously do *not* expose the feature in our CPUID bitmap, as most
> people will want to benefit from sleeping vCPUs to allow for over commit.
>
> Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> [agraf: fix amd, change commit message]
> Signed-off-by: Alexander Graf <agraf@suse.de>
>
> ---
>
> v5 -> v6:
>
>   - Fix AMD check, so that we're consistent between svm and vmx
>   - Clarify commit message
> ---
>  Documentation/virtual/kvm/api.txt    |  9 +++++++++
>  Documentation/virtual/kvm/cpuid.txt  |  6 ++++++
>  arch/x86/include/uapi/asm/kvm_para.h |  1 +
>  arch/x86/kvm/cpuid.c                 |  3 +++
>  arch/x86/kvm/svm.c                   |  7 +++++--
>  arch/x86/kvm/vmx.c                   |  6 ++++--
>  arch/x86/kvm/x86.c                   |  3 +++
>  arch/x86/kvm/x86.h                   | 36 ++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h             |  1 +
>  9 files changed, 68 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 1a18484..dacc3d7 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4094,3 +4094,12 @@ reserved.
>   2: MIPS64 or microMIPS64 with access to all address segments.
>      Both registers and addresses are 64-bits wide.
>      It will be possible to run 64-bit or 32-bit guest code.
> +
> +8.8 KVM_CAP_X86_GUEST_MWAIT
> +
> +Architectures: x86
> +
> +This capability indicates that guest using memory monotoring instructions
> +(MWAIT/MWAITX) to stop the virtual CPU will not cause a VM exit.  As such time
> +spent while virtual CPU is halted in this way will then be accounted for as
> +guest running time on the host (as opposed to e.g. HLT).
> diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
> index 3c65feb..04c201c 100644
> --- a/Documentation/virtual/kvm/cpuid.txt
> +++ b/Documentation/virtual/kvm/cpuid.txt
> @@ -54,6 +54,12 @@ KVM_FEATURE_PV_UNHALT              ||     7 || guest checks this feature bit
>                                     ||       || before enabling paravirtualized
>                                     ||       || spinlock support.
>  ------------------------------------------------------------------------------
> +KVM_FEATURE_MWAIT                  ||     8 || guest can use monitor/mwait
> +                                   ||       || to halt the VCPU without exits,
> +                                   ||       || time spent while halted in this
> +                                   ||       || way is accounted for on host as
> +                                   ||       || VCPU run time.
> +------------------------------------------------------------------------------
>  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
>                                     ||       || per-cpu warps are expected in
>                                     ||       || kvmclock.
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index cff0bb6..9cc77a7 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -24,6 +24,7 @@
>  #define KVM_FEATURE_STEAL_TIME         5
>  #define KVM_FEATURE_PV_EOI             6
>  #define KVM_FEATURE_PV_UNHALT          7
> +#define KVM_FEATURE_MWAIT              8
>
>  /* The last 8 bits are used to indicate how to interpret the flags field
>   * in pvclock structure. If no bits are set, all flags are ignored.
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index efde6cc..5638102 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -594,6 +594,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>                 if (sched_info_on())
>                         entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
>
> +               if (kvm_mwait_in_guest())
> +                       entry->eax |= (1 << KVM_FEATURE_MWAIT);
> +
>                 entry->ebx = 0;
>                 entry->ecx = 0;
>                 entry->edx = 0;
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 1b203ab..c41f03e 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1198,10 +1198,13 @@ static void init_vmcb(struct vcpu_svm *svm)
>         set_intercept(svm, INTERCEPT_CLGI);
>         set_intercept(svm, INTERCEPT_SKINIT);
>         set_intercept(svm, INTERCEPT_WBINVD);
> -       set_intercept(svm, INTERCEPT_MONITOR);
> -       set_intercept(svm, INTERCEPT_MWAIT);
>         set_intercept(svm, INTERCEPT_XSETBV);
>
> +       if (!kvm_mwait_in_guest()) {
> +               set_intercept(svm, INTERCEPT_MONITOR);
> +               set_intercept(svm, INTERCEPT_MWAIT);
> +       }
> +
>         control->iopm_base_pa = iopm_base;
>         control->msrpm_base_pa = __pa(svm->msrpm);
>         control->int_ctl = V_INTR_MASKING_MASK;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index cfdb0d9..2e62f9d 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3547,11 +3547,13 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>               CPU_BASED_USE_IO_BITMAPS |
>               CPU_BASED_MOV_DR_EXITING |
>               CPU_BASED_USE_TSC_OFFSETING |
> -             CPU_BASED_MWAIT_EXITING |
> -             CPU_BASED_MONITOR_EXITING |
>               CPU_BASED_INVLPG_EXITING |
>               CPU_BASED_RDPMC_EXITING;
>
> +       if (!kvm_mwait_in_guest())
> +               min |= CPU_BASED_MWAIT_EXITING |
> +                       CPU_BASED_MONITOR_EXITING;
> +
>         opt = CPU_BASED_TPR_SHADOW |
>               CPU_BASED_USE_MSR_BITMAPS |
>               CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bc47e2..ffaa76d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2679,6 +2679,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>         case KVM_CAP_ADJUST_CLOCK:
>                 r = KVM_CLOCK_TSC_STABLE;
>                 break;
> +       case KVM_CAP_X86_GUEST_MWAIT:
> +               r = kvm_mwait_in_guest();
> +               break;
>         case KVM_CAP_X86_SMM:
>                 /* SMBASE is usually relocated above 1M on modern chipsets,
>                  * and SMM handlers might indeed rely on 4G segment limits,
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index e8ff3e4..6120670 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -1,6 +1,8 @@
>  #ifndef ARCH_X86_KVM_X86_H
>  #define ARCH_X86_KVM_X86_H
>
> +#include <asm/processor.h>
> +#include <asm/mwait.h>
>  #include <linux/kvm_host.h>
>  #include <asm/pvclock.h>
>  #include "kvm_cache_regs.h"
> @@ -212,4 +214,38 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
>             __rem;                                              \
>          })
>
> +static inline bool kvm_mwait_in_guest(void)
> +{
> +       unsigned int eax, ebx, ecx, edx;
> +
> +       if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
> +               return false;
> +
> +       switch (boot_cpu_data.x86_vendor) {
> +       case X86_VENDOR_AMD:
> +               /* All AMD CPUs have a working MWAIT implementation */
> +               return true;
> +       case X86_VENDOR_INTEL:
> +               /* Handle Intel below */
> +               break;
> +       default:
> +               return false;
> +       }
> +
> +       /*
> +        * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
> +        * they would allow guest to stop the CPU completely by disabling
> +        * interrupts then invoking MWAIT.
> +        */
> +       if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
> +               return false;
> +
> +       cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
> +
> +       if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
> +               return false;
> +
> +       return true;
> +}
> +
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 1e1a6c7..a3f03d3 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -890,6 +890,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_MIPS_VZ 137
>  #define KVM_CAP_MIPS_TE 138
>  #define KVM_CAP_MIPS_64BIT 139
> +#define KVM_CAP_X86_GUEST_MWAIT 140
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> --
> 1.8.5.6
>
Alexander Graf April 11, 2017, 6:23 p.m. UTC | #5
> Am 11.04.2017 um 19:10 schrieb Jim Mattson <jmattson@google.com>:
> 
> This might be more useful if it could be dynamically toggled on and
> off, depending on system load.

What would trapping mwait (currently) buy you?

As it stands today, before this patch, mwait is simply implemented as a nop, so enabling the trap just means you're wasting as much cpu time, but never send the pCPU idle. With this patch, the CPU at least has the chance to go idle.

Keep in mind that this patch does *not* advertise the mwait cpuid feature bit to the guest.

What you're referring to I guess is actual mwait emulation. That is indeed more useful, but a bigger patch than this and needs some more thought on how to properly cache the monitor'ed pages.


Alex
Kyle Copperfield via April 12, 2017, 2:34 p.m. UTC | #6
Actually, we have rejected commit 87c00572ba05aa8c ("kvm: x86: emulate
monitor and mwait instructions as nop"), so when we intercept
MONITOR/MWAIT, we synthesize #UD. Perhaps it is this difference from
vanilla kvm that motivates the following idea...

Since we're still not going to report MONITOR support in CPUID, the
only guests of consequence are paravirtual guests. What if a
paravirtual guest was aware of the fact that sometimes MONITOR/MWAIT
would work as architected, and sometimes they would raise #UD (or do
something else that's guest-visible, to indicate that the hypevisor is
intercepting the instructions). Such a guest could first try a
MONITOR/MWAIT-based idle loop and then fall back on a HLT-based idle
loop if the hypervisor rejected its use of MONITOR/MWAIT.

We already have the loose concept of "this pCPU has other things to
do," which is encoded in the variable-sized PLE window. With
MONITOR/MWAIT, the choice is binary, but a simple implementation could
tie the two together, by allowing the guest to use MONITOR/MWAIT
whenever the PLE window exceeds a certain threshold. Or the decision
could be left to the userspace agent.

On Tue, Apr 11, 2017 at 11:23 AM, Alexander Graf <agraf@suse.de> wrote:
>
>
>> Am 11.04.2017 um 19:10 schrieb Jim Mattson <jmattson@google.com>:
>>
>> This might be more useful if it could be dynamically toggled on and
>> off, depending on system load.
>
> What would trapping mwait (currently) buy you?
>
> As it stands today, before this patch, mwait is simply implemented as a nop, so enabling the trap just means you're wasting as much cpu time, but never send the pCPU idle. With this patch, the CPU at least has the chance to go idle.
>
> Keep in mind that this patch does *not* advertise the mwait cpuid feature bit to the guest.
>
> What you're referring to I guess is actual mwait emulation. That is indeed more useful, but a bigger patch than this and needs some more thought on how to properly cache the monitor'ed pages.
>
>
> Alex
>
>
Alexander Graf April 12, 2017, 2:54 p.m. UTC | #7
On 12.04.17 16:34, Jim Mattson wrote:
> Actually, we have rejected commit 87c00572ba05aa8c ("kvm: x86: emulate
> monitor and mwait instructions as nop"), so when we intercept
> MONITOR/MWAIT, we synthesize #UD. Perhaps it is this difference from
> vanilla kvm that motivates the following idea...

So you're not running upstream kvm? In that case, you can just not take 
this patch either :).

> Since we're still not going to report MONITOR support in CPUID, the
> only guests of consequence are paravirtual guests. What if a

Only if someone actually implemented something for PV guests, yes.

The real motivation is to allow user space to force set the MONITOR 
CPUID flag. That way an admin can - if he really wants to - dedicate 
pCPUs to the VM.

I agree that we don't need the kvm pv flag for that. I'd be happy to 
drop that if everyone agrees.

> paravirtual guest was aware of the fact that sometimes MONITOR/MWAIT
> would work as architected, and sometimes they would raise #UD (or do
> something else that's guest-visible, to indicate that the hypevisor is
> intercepting the instructions). Such a guest could first try a
> MONITOR/MWAIT-based idle loop and then fall back on a HLT-based idle
> loop if the hypervisor rejected its use of MONITOR/MWAIT.

How would that work? That guest would have to atomically notify all 
other vCPUs that wakeup notifications now go via IPIs instead of cache 
line dirtying.

That's probably as much work to get right as it would be to just emulate 
MWAIT inside kvm ;).

> We already have the loose concept of "this pCPU has other things to
> do," which is encoded in the variable-sized PLE window. With
> MONITOR/MWAIT, the choice is binary, but a simple implementation could
> tie the two together, by allowing the guest to use MONITOR/MWAIT
> whenever the PLE window exceeds a certain threshold. Or the decision
> could be left to the userspace agent.

I agree, and that's basically the idea I mentioned earlier with MWAIT 
emulation. We could (for well behaved guests) switch between emulating 
MWAIT and running native MWAIT.



Alex
Kyle Copperfield via April 12, 2017, 3:20 p.m. UTC | #8
On Wed, Apr 12, 2017 at 7:54 AM, Alexander Graf <agraf@suse.de> wrote:
>
>
> On 12.04.17 16:34, Jim Mattson wrote:
>>
>> Actually, we have rejected commit 87c00572ba05aa8c ("kvm: x86: emulate
>> monitor and mwait instructions as nop"), so when we intercept
>> MONITOR/MWAIT, we synthesize #UD. Perhaps it is this difference from
>> vanilla kvm that motivates the following idea...
>
>
> So you're not running upstream kvm? In that case, you can just not take this
> patch either :).
This patch should be harmless. :-)
>
>> Since we're still not going to report MONITOR support in CPUID, the
>> only guests of consequence are paravirtual guests. What if a
>
>
> Only if someone actually implemented something for PV guests, yes.
>
> The real motivation is to allow user space to force set the MONITOR CPUID
> flag. That way an admin can - if he really wants to - dedicate pCPUs to the
> VM.
>
> I agree that we don't need the kvm pv flag for that. I'd be happy to drop
> that if everyone agrees.
>
>> paravirtual guest was aware of the fact that sometimes MONITOR/MWAIT
>> would work as architected, and sometimes they would raise #UD (or do
>> something else that's guest-visible, to indicate that the hypevisor is
>> intercepting the instructions). Such a guest could first try a
>> MONITOR/MWAIT-based idle loop and then fall back on a HLT-based idle
>> loop if the hypervisor rejected its use of MONITOR/MWAIT.
>
>
> How would that work? That guest would have to atomically notify all other
> vCPUs that wakeup notifications now go via IPIs instead of cache line
> dirtying.
>
> That's probably as much work to get right as it would be to just emulate
> MWAIT inside kvm ;).
True. I don't have an easy solution to that problem.
>
>> We already have the loose concept of "this pCPU has other things to
>> do," which is encoded in the variable-sized PLE window. With
>> MONITOR/MWAIT, the choice is binary, but a simple implementation could
>> tie the two together, by allowing the guest to use MONITOR/MWAIT
>> whenever the PLE window exceeds a certain threshold. Or the decision
>> could be left to the userspace agent.
>
>
> I agree, and that's basically the idea I mentioned earlier with MWAIT
> emulation. We could (for well behaved guests) switch between emulating MWAIT
> and running native MWAIT.
Yes, that would probably be the preferred solution.
>
>
>
> Alex
Michael S. Tsirkin April 12, 2017, 4:29 p.m. UTC | #9
On Wed, Apr 12, 2017 at 04:54:10PM +0200, Alexander Graf wrote:
> 
> 
> On 12.04.17 16:34, Jim Mattson wrote:
> > Actually, we have rejected commit 87c00572ba05aa8c ("kvm: x86: emulate
> > monitor and mwait instructions as nop"), so when we intercept
> > MONITOR/MWAIT, we synthesize #UD. Perhaps it is this difference from
> > vanilla kvm that motivates the following idea...
> 
> So you're not running upstream kvm? In that case, you can just not take this
> patch either :).
> 
> > Since we're still not going to report MONITOR support in CPUID, the
> > only guests of consequence are paravirtual guests. What if a
> 
> Only if someone actually implemented something for PV guests, yes.
> 
> The real motivation is to allow user space to force set the MONITOR CPUID
> flag. That way an admin can - if he really wants to - dedicate pCPUs to the
> VM.
> 
> I agree that we don't need the kvm pv flag for that. I'd be happy to drop
> that if everyone agrees.

I don't really agree we do not need the PV flag. mwait on kvm is
different from mwait on bare metal in that you are heavily penalized by
scheduler for polling unless you configure the host just so.
HLT lets you give up the host CPU if you know you won't need
it for a long time.

So while many people can get by with monitor cpuid (those that isolate
host CPUs) and it's a valuable option to have, I think a PV flag is also
a valuable option and can be set for more configurations.

Guest has an idle driver calling mwait on short waits and halt on longer
ones.  I'm in fact testing an idle driver using such a PV flag and will
post when ready (after vacation ~3 weeks from now probably).

> > paravirtual guest was aware of the fact that sometimes MONITOR/MWAIT
> > would work as architected, and sometimes they would raise #UD (or do
> > something else that's guest-visible, to indicate that the hypevisor is
> > intercepting the instructions). Such a guest could first try a
> > MONITOR/MWAIT-based idle loop and then fall back on a HLT-based idle
> > loop if the hypervisor rejected its use of MONITOR/MWAIT.
> 
> How would that work? That guest would have to atomically notify all other
> vCPUs that wakeup notifications now go via IPIs instead of cache line
> dirtying.
> 
> That's probably as much work to get right as it would be to just emulate
> MWAIT inside kvm ;).
> 
> > We already have the loose concept of "this pCPU has other things to
> > do," which is encoded in the variable-sized PLE window. With
> > MONITOR/MWAIT, the choice is binary, but a simple implementation could
> > tie the two together, by allowing the guest to use MONITOR/MWAIT
> > whenever the PLE window exceeds a certain threshold. Or the decision
> > could be left to the userspace agent.
> 
> I agree, and that's basically the idea I mentioned earlier with MWAIT
> emulation. We could (for well behaved guests) switch between emulating MWAIT
> and running native MWAIT.
> 
> 
> 
> Alex
Paolo Bonzini April 21, 2017, 10:02 a.m. UTC | #10
On 12/04/2017 18:29, Michael S. Tsirkin wrote:
> I don't really agree we do not need the PV flag. mwait on kvm is
> different from mwait on bare metal in that you are heavily penalized by
> scheduler for polling unless you configure the host just so.
> HLT lets you give up the host CPU if you know you won't need
> it for a long time.
> 
> So while many people can get by with monitor cpuid (those that isolate
> host CPUs) and it's a valuable option to have, I think a PV flag is also
> a valuable option and can be set for more configurations.
> 
> Guest has an idle driver calling mwait on short waits and halt on longer
> ones.  I'm in fact testing an idle driver using such a PV flag and will
> post when ready (after vacation ~3 weeks from now probably).

For now I think I'm removing the PV flag, making this just an
optimization of commit 87c00572ba05aa8c ("kvm: x86: emulate
monitor and mwait instructions as nop").

We can add it for 4.13 together with the idle driver.

Paolo
Alexander Graf April 21, 2017, 10:05 a.m. UTC | #11
On 21.04.17 12:02, Paolo Bonzini wrote:
>
>
> On 12/04/2017 18:29, Michael S. Tsirkin wrote:
>> I don't really agree we do not need the PV flag. mwait on kvm is
>> different from mwait on bare metal in that you are heavily penalized by
>> scheduler for polling unless you configure the host just so.
>> HLT lets you give up the host CPU if you know you won't need
>> it for a long time.
>>
>> So while many people can get by with monitor cpuid (those that isolate
>> host CPUs) and it's a valuable option to have, I think a PV flag is also
>> a valuable option and can be set for more configurations.
>>
>> Guest has an idle driver calling mwait on short waits and halt on longer
>> ones.  I'm in fact testing an idle driver using such a PV flag and will
>> post when ready (after vacation ~3 weeks from now probably).
>
> For now I think I'm removing the PV flag, making this just an
> optimization of commit 87c00572ba05aa8c ("kvm: x86: emulate
> monitor and mwait instructions as nop").
>
> We can add it for 4.13 together with the idle driver.

I think that's a perfectly reasonable approach, yes. We can always add 
the PV flag with the driver.

Thanks a lot!


Alex
Paolo Bonzini April 21, 2017, 10:06 a.m. UTC | #12
On 21/04/2017 12:05, Alexander Graf wrote:
> 
> 
> On 21.04.17 12:02, Paolo Bonzini wrote:
>>
>>
>> On 12/04/2017 18:29, Michael S. Tsirkin wrote:
>>> I don't really agree we do not need the PV flag. mwait on kvm is
>>> different from mwait on bare metal in that you are heavily penalized by
>>> scheduler for polling unless you configure the host just so.
>>> HLT lets you give up the host CPU if you know you won't need
>>> it for a long time.
>>>
>>> So while many people can get by with monitor cpuid (those that isolate
>>> host CPUs) and it's a valuable option to have, I think a PV flag is also
>>> a valuable option and can be set for more configurations.
>>>
>>> Guest has an idle driver calling mwait on short waits and halt on longer
>>> ones.  I'm in fact testing an idle driver using such a PV flag and will
>>> post when ready (after vacation ~3 weeks from now probably).
>>
>> For now I think I'm removing the PV flag, making this just an
>> optimization of commit 87c00572ba05aa8c ("kvm: x86: emulate
>> monitor and mwait instructions as nop").
>>
>> We can add it for 4.13 together with the idle driver.
> 
> I think that's a perfectly reasonable approach, yes. We can always add
> the PV flag with the driver.
> 
> Thanks a lot!

Queuing the patch for 4.12.

Paolo

Patch
diff mbox

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 1a18484..dacc3d7 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4094,3 +4094,12 @@  reserved.
  2: MIPS64 or microMIPS64 with access to all address segments.
     Both registers and addresses are 64-bits wide.
     It will be possible to run 64-bit or 32-bit guest code.
+
+8.8 KVM_CAP_X86_GUEST_MWAIT
+
+Architectures: x86
+
+This capability indicates that guest using memory monotoring instructions
+(MWAIT/MWAITX) to stop the virtual CPU will not cause a VM exit.  As such time
+spent while virtual CPU is halted in this way will then be accounted for as
+guest running time on the host (as opposed to e.g. HLT).
diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 3c65feb..04c201c 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -54,6 +54,12 @@  KVM_FEATURE_PV_UNHALT              ||     7 || guest checks this feature bit
                                    ||       || before enabling paravirtualized
                                    ||       || spinlock support.
 ------------------------------------------------------------------------------
+KVM_FEATURE_MWAIT                  ||     8 || guest can use monitor/mwait
+                                   ||       || to halt the VCPU without exits,
+                                   ||       || time spent while halted in this
+                                   ||       || way is accounted for on host as
+                                   ||       || VCPU run time.
+------------------------------------------------------------------------------
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
                                    ||       || per-cpu warps are expected in
                                    ||       || kvmclock.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index cff0bb6..9cc77a7 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -24,6 +24,7 @@ 
 #define KVM_FEATURE_STEAL_TIME		5
 #define KVM_FEATURE_PV_EOI		6
 #define KVM_FEATURE_PV_UNHALT		7
+#define KVM_FEATURE_MWAIT		8
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index efde6cc..5638102 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -594,6 +594,9 @@  static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
 
+		if (kvm_mwait_in_guest())
+			entry->eax |= (1 << KVM_FEATURE_MWAIT);
+
 		entry->ebx = 0;
 		entry->ecx = 0;
 		entry->edx = 0;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1b203ab..c41f03e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1198,10 +1198,13 @@  static void init_vmcb(struct vcpu_svm *svm)
 	set_intercept(svm, INTERCEPT_CLGI);
 	set_intercept(svm, INTERCEPT_SKINIT);
 	set_intercept(svm, INTERCEPT_WBINVD);
-	set_intercept(svm, INTERCEPT_MONITOR);
-	set_intercept(svm, INTERCEPT_MWAIT);
 	set_intercept(svm, INTERCEPT_XSETBV);
 
+	if (!kvm_mwait_in_guest()) {
+		set_intercept(svm, INTERCEPT_MONITOR);
+		set_intercept(svm, INTERCEPT_MWAIT);
+	}
+
 	control->iopm_base_pa = iopm_base;
 	control->msrpm_base_pa = __pa(svm->msrpm);
 	control->int_ctl = V_INTR_MASKING_MASK;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index cfdb0d9..2e62f9d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3547,11 +3547,13 @@  static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	      CPU_BASED_USE_IO_BITMAPS |
 	      CPU_BASED_MOV_DR_EXITING |
 	      CPU_BASED_USE_TSC_OFFSETING |
-	      CPU_BASED_MWAIT_EXITING |
-	      CPU_BASED_MONITOR_EXITING |
 	      CPU_BASED_INVLPG_EXITING |
 	      CPU_BASED_RDPMC_EXITING;
 
+	if (!kvm_mwait_in_guest())
+		min |= CPU_BASED_MWAIT_EXITING |
+			CPU_BASED_MONITOR_EXITING;
+
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bc47e2..ffaa76d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2679,6 +2679,9 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ADJUST_CLOCK:
 		r = KVM_CLOCK_TSC_STABLE;
 		break;
+	case KVM_CAP_X86_GUEST_MWAIT:
+		r = kvm_mwait_in_guest();
+		break;
 	case KVM_CAP_X86_SMM:
 		/* SMBASE is usually relocated above 1M on modern chipsets,
 		 * and SMM handlers might indeed rely on 4G segment limits,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index e8ff3e4..6120670 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -1,6 +1,8 @@ 
 #ifndef ARCH_X86_KVM_X86_H
 #define ARCH_X86_KVM_X86_H
 
+#include <asm/processor.h>
+#include <asm/mwait.h>
 #include <linux/kvm_host.h>
 #include <asm/pvclock.h>
 #include "kvm_cache_regs.h"
@@ -212,4 +214,38 @@  static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
 	    __rem;						\
 	 })
 
+static inline bool kvm_mwait_in_guest(void)
+{
+	unsigned int eax, ebx, ecx, edx;
+
+	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
+		return false;
+
+	switch (boot_cpu_data.x86_vendor) {
+	case X86_VENDOR_AMD:
+		/* All AMD CPUs have a working MWAIT implementation */
+		return true;
+	case X86_VENDOR_INTEL:
+		/* Handle Intel below */
+		break;
+	default:
+		return false;
+	}
+
+	/*
+	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
+	 * they would allow guest to stop the CPU completely by disabling
+	 * interrupts then invoking MWAIT.
+	 */
+	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+		return false;
+
+	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
+
+	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
+		return false;
+
+	return true;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1e1a6c7..a3f03d3 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -890,6 +890,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_MIPS_VZ 137
 #define KVM_CAP_MIPS_TE 138
 #define KVM_CAP_MIPS_64BIT 139
+#define KVM_CAP_X86_GUEST_MWAIT 140
 
 #ifdef KVM_CAP_IRQ_ROUTING