
KVM: enabling per domain PLE

Message ID 88A92D351643BA4CB23E3031551706260FD89CC8@SHSMSX102.ccr.corp.intel.com
State New

Commit Message

Hu, Xuekun Oct. 16, 2012, 6:53 a.m. UTC
Setting the same PLE parameter arbitrarily for different
workloads is not a good solution. This patch enables
per-domain PLE, which gives the user the ability to set the
PLE parameters per domain for better performance.

Signed-off-by: Xuekun Hu <xuekun.hu@intel.com>
---
 arch/x86/include/asm/kvm_host.h |    6 ++++++
 arch/x86/kvm/svm.c              |   21 +++++++++++++++++++++
 arch/x86/kvm/vmx.c              |   32 ++++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c              |   15 ++++++++++++++-
 include/linux/kvm.h             |    3 +++
 5 files changed, 74 insertions(+), 3 deletions(-)
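For illustration only, here is a minimal userspace sketch of how the proposed
interface might be driven. It assumes the KVM_CAP_PLE, KVM_SET_PLE_GAP and
KVM_SET_PLE_WINDOW definitions added by the patch below are present in the
installed <linux/kvm.h>; the values 128 and 4096 are the vmx.c module defaults
at the time, used here only as example settings, and error handling is omitted.

/*
 * Sketch (not part of the patch): set per-domain PLE values from userspace.
 * The patch's x86.c handler takes the value directly in the ioctl argument,
 * so no pointer indirection is used here.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	int vm = ioctl(kvm, KVM_CREATE_VM, 0);

	/* Only touch PLE if the host advertises the new capability. */
	if (ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_PLE) > 0) {
		ioctl(vm, KVM_SET_PLE_GAP, 128);
		ioctl(vm, KVM_SET_PLE_WINDOW, 4096);
	} else {
		fprintf(stderr, "per-domain PLE not supported\n");
	}

	return 0;
}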

Comments

Avi Kivity Oct. 16, 2012, 9:24 a.m. UTC | #1
On 10/16/2012 08:53 AM, Hu, Xuekun wrote:
> Setting the same PLE parameter arbitrarily for different
> workloads is not a good solution. 

True.

> The solution enables
> per domain PLE which gives user ability to set PLE parameter
> for different domain for better performance.

The problem with this is that it requires an administrator to understand
the workload, not only of the guest, but also of other guests on the
machine.  With low overcommit, a high PLE window reduces unneeded exits,
but with high overcommit we need those exits to reduce spinning.

In addition, most kvm hosts don't have an administrator.  They are
controlled by a management system, which means we'll need some algorithm
in userspace to control the PLE window.  Taking the two together, we
need a dynamic (for changing workloads) algorithm.

There are threads discussing this dynamic algorithm, we are making slow
progress because it's such a difficult problem, but I think this is much
more useful than anything requiring user intervention.
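(For context, one possible shape such a dynamic algorithm could take, sketched
here purely for illustration and not taken from the threads mentioned above, is
to grow the per-VM window when a pause-loop exit fails to find a useful vCPU to
yield to, and to shrink it when directed yield succeeds. All constants below are
made-up example values.)

#include <stdbool.h>

#define PLE_WINDOW_MIN     4096
#define PLE_WINDOW_MAX     (512 * 1024)
#define PLE_WINDOW_GROW    2	/* multiply when the exit was unproductive */
#define PLE_WINDOW_SHRINK  2	/* divide when yielding to another vCPU helped */

static unsigned int adjust_ple_window(unsigned int window, bool yield_succeeded)
{
	if (yield_succeeded)
		window /= PLE_WINDOW_SHRINK;	/* high overcommit: exit sooner */
	else
		window *= PLE_WINDOW_GROW;	/* low overcommit: let the guest spin longer */

	if (window < PLE_WINDOW_MIN)
		window = PLE_WINDOW_MIN;
	if (window > PLE_WINDOW_MAX)
		window = PLE_WINDOW_MAX;

	return window;
}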
Hu, Xuekun Oct. 17, 2012, 8:02 a.m. UTC | #2
> 
> The problem with this is that it requires an administrator to understand the
> workload, not only of the guest, but also of other guests on the machine.
> With low overcommit, a high PLE window reduces unneeded exits, but with
> high overcommit we need those exits to reduce spinning.
> 
> In addition, most kvm hosts don't have an administrator.  They are controlled
> by a management system, which means we'll need some algorithm in
> userspace to control the PLE window.  Taking the two together, we need a
> dynamic (for changing workloads) algorithm.
> 
> There are threads discussing this dynamic algorithm, we are making slow
> progress because it's such a difficult problem, but I think this is much more
> useful than anything requiring user intervention.

Avi, agreed that dynamic adaptive PLE would be the best solution. However,
as you said, it is currently a difficult problem. Our solution just gives a
choice to users who know how to set the two PLE values. So it is a
compromise, which should be better than nothing for now? :-)

Your comments?
Avi Kivity Oct. 17, 2012, 10:11 a.m. UTC | #3
On 10/17/2012 10:02 AM, Hu, Xuekun wrote:
>> 
>> The problem with this is that it requires an administrator to understand the
>> workload, not only of the guest, but also of other guests on the machine.
>> With low overcommit, a high PLE window reduces unneeded exits, but with
>> high overcommit we need those exits to reduce spinning.
>> 
>> In addition, most kvm hosts don't have an administrator.  They are controlled
>> by a management system, which means we'll need some algorithm in
>> userspace to control the PLE window.  Taking the two together, we need a
>> dynamic (for changing workloads) algorithm.
>> 
>> There are threads discussing this dynamic algorithm, we are making slow
>> progress because it's such a difficult problem, but I think this is much more
>> useful than anything requiring user intervention.
> 
> Avi, agreed that dynamic adaptive ple should be the best solution. However
> currently it is a difficult problem like you said. Our solution just gives user
> a choice who know how to set the two PLE values. So the solution is a compromise
> solution, which should be better than nothing, for now? :-)

Let's see how the PLE thread works out.  Yes the patches give the user
control, but we need to make sure the user knows how to control it (in
fact your patch doesn't even update the documentation).  Just throwing
out a new ioctl, even if it is documented, doesn't mean that userspace
will begin to use it, or that users will exploit it.

Do you have a specific use case in mind?
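(As a point of reference for the documentation gap mentioned above, entries in
Documentation/virtual/kvm/api.txt for the new ioctls could look roughly like
the following. The section numbers are placeholders, and nothing like this is
part of the posted patch.)

4.xx KVM_SET_PLE_GAP

Capability: KVM_CAP_PLE
Architectures: x86
Type: vm ioctl
Parameters: __u32 ple_gap (passed in the ioctl argument)
Returns: 0 on success

Sets the pause-loop exiting "gap" for this VM: the maximum TSC delta between
two successive PAUSE executions that is still treated as part of the same
spin loop.

4.xx KVM_SET_PLE_WINDOW

Capability: KVM_CAP_PLE
Architectures: x86
Type: vm ioctl
Parameters: __u32 ple_window (passed in the ioctl argument)
Returns: 0 on success

Sets the pause-loop exiting "window" for this VM: the number of TSC cycles a
vCPU may spend in a pause loop before a VM exit is taken.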
Hu, Xuekun Oct. 18, 2012, 9:11 a.m. UTC | #4
> -----Original Message-----
> From: Avi Kivity [mailto:avi@redhat.com]
> Sent: Wednesday, October 17, 2012 6:12 PM
> To: Hu, Xuekun
> Cc: kvm@vger.kernel.org; qemu-devel@nongnu.org; Zhang, Xiantao
> Subject: Re: [Patch]KVM: enabling per domain PLE
> 
> On 10/17/2012 10:02 AM, Hu, Xuekun wrote:
> >>
> >> The problem with this is that it requires an administrator to
> >> understand the workload, not only of the guest, but also of other guests on
> the machine.
> >> With low overcommit, a high PLE window reduces unneeded exits, but
> >> with high overcommit we need those exits to reduce spinning.
> >>
> >> In addition, most kvm hosts don't have an administrator.  They are
> >> controlled by a management system, which means we'll need some
> >> algorithm in userspace to control the PLE window.  Taking the two
> >> together, we need a dynamic (for changing workloads) algorithm.
> >>
> >> There are threads discussing this dynamic algorithm, we are making
> >> slow progress because it's such a difficult problem, but I think this
> >> is much more useful than anything requiring user intervention.
> >
> > Avi, agreed that dynamic adaptive ple should be the best solution.
> > However currently it is a difficult problem like you said. Our
> > solution just gives user a choice who know how to set the two PLE
> > values. So the solution is a compromise solution, which should be
> > better than nothing, for now? :-)
> 
> Let's see how the PLE thread works out.  Yes the patches give the user control,
> but we need to make sure the user knows how to control it (in fact your patch
> doesn't even update the documentation).  Just throwing out a new ioctl, even
> if it is documented, doesn't mean that userspace will begin to use it, or that
> users will exploit it.
> 
> Do you have a specific use case in mind?
> 

Yes, some cloud vendors already know that different PLE values have a big performance
impact on their applications. They want an interface that lets them set the values
themselves. And I think the big cloud vendors should have administrators with
experience in PLE tuning. :-)

> --
> error compiling committee.c: too many arguments to function
Hu, Xuekun Oct. 29, 2012, 12:43 p.m. UTC | #5
Hi, Avi

> 
> Yes, some cloud vendors already knew that different PLE values has big
> performance impact on their applications. They want one interface for them to
> set. And I think the big cloud vendors should have administrators that have
> experience on PLE tuning. :-)
> 

At the current stage, do you think we still need to pursue a dynamic adaptive PLE solution?


> 
> > --
> > error compiling committee.c: too many arguments to function

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b2e11f4..b54f38b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -569,6 +569,8 @@  struct kvm_arch {
 	#ifdef CONFIG_KVM_MMU_AUDIT
 	int audit_point;
 	#endif
+	int ple_gap;
+	int ple_window;
 };
 
 struct kvm_vm_stat {
@@ -621,6 +623,10 @@  struct kvm_x86_ops {
 	int (*hardware_setup)(void);               /* __init */
 	void (*hardware_unsetup)(void);            /* __exit */
 	bool (*cpu_has_accelerated_tpr)(void);
+	bool (*cpu_has_ple)(void);
+	void (*ple_setup)(struct kvm *kvm);                   /* __init */
+	void (*set_ple_gap)(struct kvm *kvm, int value);
+	void (*set_ple_window)(struct kvm *kvm, int value);
 	void (*cpuid_update)(struct kvm_vcpu *vcpu);
 
 	/* Create, but do not attach this VCPU */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d017df3..311ed96 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3976,6 +3976,23 @@  static bool svm_cpu_has_accelerated_tpr(void)
 	return false;
 }
 
+static bool cpu_has_svm_ple(void)
+{
+	return false;
+}
+
+static void svm_ple_setup(struct kvm *kvm)
+{
+}
+
+static void svm_set_ple_gap(struct kvm *kvm, int value)
+{
+}
+
+static void svm_set_ple_window(struct kvm *kvm, int value)
+{
+}
+
 static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	return 0;
@@ -4229,6 +4246,10 @@  static struct kvm_x86_ops svm_x86_ops = {
 	.hardware_enable = svm_hardware_enable,
 	.hardware_disable = svm_hardware_disable,
 	.cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr,
+	.cpu_has_ple = cpu_has_svm_ple,
+	.ple_setup = svm_ple_setup,
+	.set_ple_gap = svm_set_ple_gap,
+	.set_ple_window = svm_set_ple_window,
 
 	.vcpu_create = svm_create_vcpu,
 	.vcpu_free = svm_free_vcpu,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ad6b1dd..b761669 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3805,6 +3805,29 @@  static void ept_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull);
 }
 
+static void vmx_ple_setup(struct kvm *kvm)
+{
+	/*
+	 * set up ple default value
+	 */
+	if (ple_gap) {
+		kvm->arch.ple_gap = ple_gap;
+		kvm->arch.ple_window = ple_window;
+	}
+}
+
+static void vmx_set_ple_gap(struct kvm *kvm, int value)
+{
+	if (ple_gap)
+		kvm->arch.ple_gap = value;
+}
+
+static void vmx_set_ple_window(struct kvm *kvm, int value)
+{
+	if (ple_gap)
+		kvm->arch.ple_window = value;
+}
+
 /*
  * Sets up the vmcs for emulated real mode.
  */
@@ -3814,6 +3837,7 @@  static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 	unsigned long a;
 #endif
 	int i;
+	struct kvm *kvm = vmx->vcpu.kvm;
 
 	/* I/O */
 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a));
@@ -3836,8 +3860,8 @@  static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 	}
 
 	if (ple_gap) {
-		vmcs_write32(PLE_GAP, ple_gap);
-		vmcs_write32(PLE_WINDOW, ple_window);
+		vmcs_write32(PLE_GAP, kvm->arch.ple_gap);
+		vmcs_write32(PLE_WINDOW, kvm->arch.ple_window);
 	}
 
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
@@ -7233,6 +7257,10 @@  static struct kvm_x86_ops vmx_x86_ops = {
 	.hardware_enable = hardware_enable,
 	.hardware_disable = hardware_disable,
 	.cpu_has_accelerated_tpr = report_flexpriority,
+	.cpu_has_ple = cpu_has_vmx_ple,
+	.ple_setup = vmx_ple_setup,
+	.set_ple_gap = vmx_set_ple_gap,
+	.set_ple_window = vmx_set_ple_window,
 
 	.vcpu_create = vmx_create_vcpu,
 	.vcpu_free = vmx_free_vcpu,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1eefebe..89b8291 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2197,6 +2197,9 @@  int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_VAPIC:
 		r = !kvm_x86_ops->cpu_has_accelerated_tpr();
 		break;
+	case KVM_CAP_PLE:
+		r = kvm_x86_ops->cpu_has_ple();
+		break;
 	case KVM_CAP_NR_VCPUS:
 		r = KVM_SOFT_MAX_VCPUS;
 		break;
@@ -3462,7 +3465,16 @@  long kvm_arch_vm_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
-
+	case KVM_SET_PLE_GAP: {
+		kvm_x86_ops->set_ple_gap(kvm, arg);
+		r = 0;
+		break;
+	}
+	case KVM_SET_PLE_WINDOW: {
+		kvm_x86_ops->set_ple_window(kvm, arg);
+		r = 0;
+		break;
+	}
 	default:
 		;
 	}
@@ -6295,6 +6307,7 @@  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
 	mutex_init(&kvm->arch.apic_map_lock);
 
+	kvm_x86_ops->ple_setup(kvm);
 	return 0;
 }
 
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 0a6d6ba..02d499e 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -626,6 +626,7 @@  struct kvm_ppc_smmu_info {
 #define KVM_CAP_READONLY_MEM 81
 #endif
 #define KVM_CAP_IRQFD_RESAMPLE 82
+#define KVM_CAP_PLE 83
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -786,6 +787,8 @@  struct kvm_msi {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_SET_PLE_GAP           _IOW(KVMIO,  0x49, __u32)
+#define KVM_SET_PLE_WINDOW        _IOW(KVMIO,  0x4a, __u32)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {