[19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl

Message ID 20190107191006.10648-3-clg@kaod.org
State Changes Requested
Headers show
Series
  • KVM: PPC: Book3S HV: add XIVE native exploitation mode
Related show

Commit Message

Cédric Le Goater Jan. 7, 2019, 7:10 p.m.
This will be used to destroy the KVM XICS or XIVE device when the
sPAPR machine is reseted. When the VM boots, the CAS negotiation
process will determine which interrupt mode to use and the appropriate
KVM device will then be created.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/linux/kvm_host.h              |  2 ++
 include/uapi/linux/kvm.h              |  2 ++
 arch/powerpc/kvm/book3s_xive.c        | 38 +++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_xive_native.c | 24 +++++++++++++++++
 virt/kvm/kvm_main.c                   | 39 +++++++++++++++++++++++++++
 5 files changed, 104 insertions(+), 1 deletion(-)

Comments

Paul Mackerras Jan. 22, 2019, 5:42 a.m. | #1
On Mon, Jan 07, 2019 at 08:10:06PM +0100, Cédric Le Goater wrote:
> This will be used to destroy the KVM XICS or XIVE device when the
> sPAPR machine is reseted. When the VM boots, the CAS negotiation
> process will determine which interrupt mode to use and the appropriate
> KVM device will then be created.

What would be the consequence if we didn't destroy the device?

The reason I ask is that we will have to be much more careful about
memory allocation lifetimes with this patch.  Having KVM devices last
until the KVM instance is destroyed means that we generally avoid
use-after-free bugs.  With this patch we will have to do a careful
analysis of the lifetime of the xive structures vs. possible accesses
on other threads to prove there are no use-after-free bugs.

For example, it is not sufficient to set any pointers in struct kvm or
struct kvm_vcpu that point into xive structures to NULL before freeing
the structures.  There could be code on another CPU that has read the
pointer value before you set it to NULL and then goes and accesses it
after you have freed it.  You need to prove that can't happen,
possibly using some sort of explicit synchronization that ensures that
no other CPU could still be accessing the structure at the time when
you free it.  RCU can help with this, but in general means you need
RCU synchronization primitives (rcu_read_lock() etc.) at all the
places where you use the pointer, which I don't think you currently
have.

If there is a good fundamental reason why this can't happen, even
though you don't have explicit synchronization, then at a minimum you
need to explain that in the patch description, and ideally also in
code comments.

Paul.
Cédric Le Goater Jan. 23, 2019, 6:39 p.m. | #2
On 1/22/19 6:42 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 08:10:06PM +0100, Cédric Le Goater wrote:
>> This will be used to destroy the KVM XICS or XIVE device when the
>> sPAPR machine is reseted. When the VM boots, the CAS negotiation
>> process will determine which interrupt mode to use and the appropriate
>> KVM device will then be created.
> 
> What would be the consequence if we didn't destroy the device?

So, if we don't destroy the device, it would mean that we are 
maintaining its availability under the KVM PPC structures, VM and
vCPUs, I think the changes would be significant to have two interrupt 
devices unde the VM. We would also need a way to activate one or 
the other depending on the interrupt mode chosen by CAS. In other 
words, it's moving all the interrupt mode politics from QEMU to KVM. 
It's possible of course but I would prefer to leave the ugly details 
in QEMU.  

Let's suppose now that we keep the device alive but disconnect the 
presenters from it, and from the VM also. We would have an unused 
device in the VM. We would need way to keep an handle on it (fd 
certainly) and a KVM interface to soft reset a KVM device partially 
initialized. That's one other option.

It seemed easier to do an hard reset : create/destroy.  

> The reason I ask is that we will have to be much more careful about
> memory allocation lifetimes with this patch. 

yes. bad refcounting will lead the host kernel to a crash. 

> Having KVM devices last
> until the KVM instance is destroyed means that we generally avoid
> use-after-free bugs.  With this patch we will have to do a careful
> analysis of the lifetime of the xive structures vs. possible accesses
> on other threads to prove there are no use-after-free bugs.
> 
> For example, it is not sufficient to set any pointers in struct kvm or
> struct kvm_vcpu that point into xive structures to NULL before freeing
> the structures.  There could be code on another CPU that has read the
> pointer value before you set it to NULL and then goes and accesses it
> after you have freed it.  You need to prove that can't happen,
> possibly using some sort of explicit synchronization that ensures that
> no other CPU could still be accessing the structure at the time when
> you free it.  RCU can help with this, but in general means you need
> RCU synchronization primitives (rcu_read_lock() etc.) at all the
> places where you use the pointer, which I don't think you currently
> have.

no. indeed. I have overlooked the synchronization aspect.

> If there is a good fundamental reason why this can't happen, even
> though you don't have explicit synchronization, then at a minimum you
> need to explain that in the patch description, and ideally also in
> code comments.

OK. I did leave that patch at the end for one reason. It needs more care.

Thanks,

C.
Benjamin Herrenschmidt Jan. 23, 2019, 9:32 p.m. | #3
On Wed, 2019-01-23 at 19:39 +0100, Cédric Le Goater wrote:
> > The reason I ask is that we will have to be much more careful about
> > memory allocation lifetimes with this patch. 
> 
> yes. bad refcounting will lead the host kernel to a crash. 

One way to alleviate that is to make sure this is only supported on
selected devices such as XICS via some flag or the presence of a
callback.

Cheers,
Ben.

Patch

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c38cc5eb7e73..259b6885dc74 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1218,6 +1218,8 @@  struct kvm_device_ops {
 	 */
 	void (*destroy)(struct kvm_device *dev);
 
+	int (*delete)(struct kvm_device *dev);
+
 	int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
 	int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
 	int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 52bf74a1616e..b00cb4d986cf 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1331,6 +1331,8 @@  struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
 
+#define KVM_DELETE_DEVICE	  _IOWR(KVMIO,  0xf0, struct kvm_create_device)
+
 /*
  * ioctls for vcpu fds
  */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 9b4751713554..5449fb4c87f9 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1109,11 +1109,19 @@  void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
 void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
-	struct kvmppc_xive *xive = xc->xive;
+	struct kvmppc_xive *xive;
 	int i;
 
+	if (!kvmppc_xics_enabled(vcpu))
+		return;
+
+	if (!xc)
+		return;
+
 	pr_devel("cleanup_vcpu(cpu=%d)\n", xc->server_num);
 
+	xive = xc->xive;
+
 	/* Ensure no interrupt is still routed to that VP */
 	xc->valid = false;
 	kvmppc_xive_disable_vcpu_interrupts(vcpu);
@@ -1150,6 +1158,10 @@  void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 	}
 	/* Free the VP */
 	kfree(xc);
+
+	/* Cleanup the vcpu */
+	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
+	vcpu->arch.xive_vcpu = NULL;
 }
 
 int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
@@ -1861,6 +1873,29 @@  static void kvmppc_xive_free(struct kvm_device *dev)
 	kfree(dev);
 }
 
+static int kvmppc_xive_delete(struct kvm_device *dev)
+{
+	struct kvm *kvm = dev->kvm;
+	unsigned int i;
+	struct kvm_vcpu *vcpu;
+
+	if (!kvm->arch.xive)
+		return -EPERM;
+
+	/*
+	 * call kick_all_cpus_sync() to ensure that all CPUs have
+	 * executed any pending interrupts
+	 */
+	if (is_kvmppc_hv_enabled(kvm))
+		kick_all_cpus_sync();
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvmppc_xive_cleanup_vcpu(vcpu);
+
+	kvmppc_xive_free(dev);
+	return 0;
+}
+
 static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 {
 	struct kvmppc_xive *xive;
@@ -2035,6 +2070,7 @@  struct kvm_device_ops kvm_xive_ops = {
 	.create = kvmppc_xive_create,
 	.init = kvmppc_xive_init,
 	.destroy = kvmppc_xive_free,
+	.delete = kvmppc_xive_delete,
 	.set_attr = xive_set_attr,
 	.get_attr = xive_get_attr,
 	.has_attr = xive_has_attr,
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 12edac29995e..7367962e670a 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -979,6 +979,29 @@  static void kvmppc_xive_native_free(struct kvm_device *dev)
 	kfree(dev);
 }
 
+static int kvmppc_xive_native_delete(struct kvm_device *dev)
+{
+	struct kvm *kvm = dev->kvm;
+	unsigned int i;
+	struct kvm_vcpu *vcpu;
+
+	if (!kvm->arch.xive)
+		return -EPERM;
+
+	/*
+	 * call kick_all_cpus_sync() to ensure that all CPUs have
+	 * executed any pending interrupts
+	 */
+	if (is_kvmppc_hv_enabled(kvm))
+		kick_all_cpus_sync();
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvmppc_xive_native_cleanup_vcpu(vcpu);
+
+	kvmppc_xive_native_free(dev);
+	return 0;
+}
+
 /*
  * ESB MMIO address of chip 0
  */
@@ -1350,6 +1373,7 @@  struct kvm_device_ops kvm_xive_native_ops = {
 	.create = kvmppc_xive_native_create,
 	.init = kvmppc_xive_native_init,
 	.destroy = kvmppc_xive_native_free,
+	.delete = kvmppc_xive_native_delete,
 	.set_attr = kvmppc_xive_native_set_attr,
 	.get_attr = kvmppc_xive_native_get_attr,
 	.has_attr = kvmppc_xive_native_has_attr,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1f888a103f78..c93c35c43675 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3009,6 +3009,31 @@  static int kvm_ioctl_create_device(struct kvm *kvm,
 	return 0;
 }
 
+static int kvm_ioctl_delete_device(struct kvm *kvm,
+				   struct kvm_create_device *cd)
+{
+	struct fd f;
+	struct kvm_device *dev;
+	int ret;
+
+	f = fdget(cd->fd);
+	if (!f.file)
+		return -EBADF;
+
+	dev = kvm_device_from_filp(f.file);
+	fdput(f);
+
+	if (!dev)
+		return -EPERM;
+
+	mutex_lock(&kvm->lock);
+	list_del(&dev->vm_node);
+	mutex_unlock(&kvm->lock);
+	ret = dev->ops->delete(dev);
+
+	return ret;
+}
+
 static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 {
 	switch (arg) {
@@ -3253,6 +3278,20 @@  static long kvm_vm_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_DELETE_DEVICE: {
+		struct kvm_create_device cd;
+
+		r = -EFAULT;
+		if (copy_from_user(&cd, argp, sizeof(cd)))
+			goto out;
+
+		r = kvm_ioctl_delete_device(kvm, &cd);
+		if (r)
+			goto out;
+
+		r = 0;
+		break;
+	}
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;