diff mbox series

[v5,16/16] KVM: PPC: Book3S HV: XIVE: introduce a 'release' device operation

Message ID 20190410170448.3923-17-clg@kaod.org
State Superseded
Headers show
Series KVM: PPC: Book3S HV: add XIVE native exploitation mode | expand

Commit Message

Cédric Le Goater April 10, 2019, 5:04 p.m. UTC
When a P9 sPAPR VM boots, the CAS negotiation process determines which
interrupt mode to use (XICS legacy or XIVE native) and invokes a
machine reset to activate the chosen mode.

To be able to switch from one mode to another, we introduce the
capability to release a KVM device without destroying the VM. The KVM
device interface is extended with a new 'release' operation which is
called when the file descriptor of the device is closed.

Such operations are defined for the XICS-on-XIVE and the XIVE native
KVM devices. They clear the vCPU interrupt presenters that could be
attached and then destroy the device.

This is not considered as a safe operation as the vCPUs are still
running and could be referencing the KVM device through their
presenters. To protect the system from any breakage, the kvmppc_xive
objects representing both KVM devices are now stored in an array under
the VM. Allocation is performed on first usage and memory is freed
only when the VM exits.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/kvm_host.h   |  1 +
 arch/powerpc/kvm/book3s_xive.h        |  1 +
 include/linux/kvm_host.h              |  1 +
 arch/powerpc/kvm/book3s_xive.c        | 99 ++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_xive_native.c | 45 +++++++++++-
 virt/kvm/kvm_main.c                   | 13 ++++
 6 files changed, 156 insertions(+), 4 deletions(-)

Comments

Paul Mackerras April 11, 2019, 3:16 a.m. UTC | #1
On Wed, Apr 10, 2019 at 07:04:48PM +0200, Cédric Le Goater wrote:
> When a P9 sPAPR VM boots, the CAS negotiation process determines which
> interrupt mode to use (XICS legacy or XIVE native) and invokes a
> machine reset to activate the chosen mode.
> 
> To be able to switch from one mode to another, we introduce the
> capability to release a KVM device without destroying the VM. The KVM
> device interface is extended with a new 'release' operation which is
> called when the file descriptor of the device is closed.

I believe the release operation is not called until all of the mmaps
using the fd are unmapped - which is a good thing for us, since it
means the guest can't possibly be accessing the XIVE directly.
You might want to reword that last paragraph to mention that.

> Such operations are defined for the XICS-on-XIVE and the XIVE native
> KVM devices. They clear the vCPU interrupt presenters that could be
> attached and then destroy the device.
> 
> This is not considered as a safe operation as the vCPUs are still
> running and could be referencing the KVM device through their
> presenters. To protect the system from any breakage, the kvmppc_xive
> objects representing both KVM devices are now stored in an array under
> the VM. Allocation is performed on first usage and memory is freed
> only when the VM exits.

One quick comment below:

> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index 480a3fc6b9fd..064a9f2ae678 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -1100,11 +1100,19 @@ void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
>  void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
>  {
>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> -	struct kvmppc_xive *xive = xc->xive;
> +	struct kvmppc_xive *xive;
>  	int i;
>  
> +	if (!kvmppc_xics_enabled(vcpu))
> +		return;

Should that be kvmppc_xive_enabled() rather than xics?

Paul.
David Gibson April 11, 2019, 4:38 a.m. UTC | #2
On Thu, Apr 11, 2019 at 01:16:25PM +1000, Paul Mackerras wrote:
> On Wed, Apr 10, 2019 at 07:04:48PM +0200, Cédric Le Goater wrote:
> > When a P9 sPAPR VM boots, the CAS negotiation process determines which
> > interrupt mode to use (XICS legacy or XIVE native) and invokes a
> > machine reset to activate the chosen mode.
> > 
> > To be able to switch from one mode to another, we introduce the
> > capability to release a KVM device without destroying the VM. The KVM
> > device interface is extended with a new 'release' operation which is
> > called when the file descriptor of the device is closed.
> 
> I believe the release operation is not called until all of the mmaps
> using the fd are unmapped - which is a good thing for us, since it
> means the guest can't possibly be accessing the XIVE directly.
> You might want to reword that last paragraph to mention that.
> 
> > Such operations are defined for the XICS-on-XIVE and the XIVE native
> > KVM devices. They clear the vCPU interrupt presenters that could be
> > attached and then destroy the device.
> > 
> > This is not considered as a safe operation as the vCPUs are still
> > running and could be referencing the KVM device through their
> > presenters. To protect the system from any breakage, the kvmppc_xive
> > objects representing both KVM devices are now stored in an array under
> > the VM. Allocation is performed on first usage and memory is freed
> > only when the VM exits.
> 
> One quick comment below:
> 
> > diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> > index 480a3fc6b9fd..064a9f2ae678 100644
> > --- a/arch/powerpc/kvm/book3s_xive.c
> > +++ b/arch/powerpc/kvm/book3s_xive.c
> > @@ -1100,11 +1100,19 @@ void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
> >  void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
> >  {
> >  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> > -	struct kvmppc_xive *xive = xc->xive;
> > +	struct kvmppc_xive *xive;
> >  	int i;
> >  
> > +	if (!kvmppc_xics_enabled(vcpu))
> > +		return;
> 
> Should that be kvmppc_xive_enabled() rather than xics?

I think I asked that on an earlier iteration, and the answer is no.
The names are confusing, but this file is all about xics-on-xive
rather than xive native.  So here we're checking what's available from
the guest's point of view, so "xics", but most of the surrounding
functions are named "xive" because that's the backend.
Cédric Le Goater April 11, 2019, 6:12 a.m. UTC | #3
On 4/11/19 6:38 AM, David Gibson wrote:
> On Thu, Apr 11, 2019 at 01:16:25PM +1000, Paul Mackerras wrote:
>> On Wed, Apr 10, 2019 at 07:04:48PM +0200, Cédric Le Goater wrote:
>>> When a P9 sPAPR VM boots, the CAS negotiation process determines which
>>> interrupt mode to use (XICS legacy or XIVE native) and invokes a
>>> machine reset to activate the chosen mode.
>>>
>>> To be able to switch from one mode to another, we introduce the
>>> capability to release a KVM device without destroying the VM. The KVM
>>> device interface is extended with a new 'release' operation which is
>>> called when the file descriptor of the device is closed.
>>
>> I believe the release operation is not called until all of the mmaps
>> using the fd are unmapped - which is a good thing for us, since it
>> means the guest can't possibly be accessing the XIVE directly.
yes.

>> You might want to reword that last paragraph to mention that.

ok. 

>>> Such operations are defined for the XICS-on-XIVE and the XIVE native
>>> KVM devices. They clear the vCPU interrupt presenters that could be
>>> attached and then destroy the device.
>>>
>>> This is not considered as a safe operation as the vCPUs are still
>>> running and could be referencing the KVM device through their
>>> presenters. To protect the system from any breakage, the kvmppc_xive
>>> objects representing both KVM devices are now stored in an array under
>>> the VM. Allocation is performed on first usage and memory is freed
>>> only when the VM exits.
>>
>> One quick comment below:
>>
>>> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
>>> index 480a3fc6b9fd..064a9f2ae678 100644
>>> --- a/arch/powerpc/kvm/book3s_xive.c
>>> +++ b/arch/powerpc/kvm/book3s_xive.c
>>> @@ -1100,11 +1100,19 @@ void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
>>>  void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
>>>  {
>>>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>>> -	struct kvmppc_xive *xive = xc->xive;
>>> +	struct kvmppc_xive *xive;
>>>  	int i;
>>>  
>>> +	if (!kvmppc_xics_enabled(vcpu))
>>> +		return;
>>
>> Should that be kvmppc_xive_enabled() rather than xics?
> 
> I think I asked that on an earlier iteration, and the answer is no.
> The names are confusing, but this file is all about xics-on-xive
> rather than xive native.  So here we're checking what's available from
> the guest's point of view, so "xics", but most of the surrounding
> functions are named "xive" because that's the backend.
> 

yes. 

The relevant part is at the end of the kvmppc_xive_connect_vcpu() routine :

  int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
  			     struct kvm_vcpu *vcpu, u32 cpu)
  {
	...
  	vcpu->arch.irq_type = KVMPPC_IRQ_XICS;
	return 0;
  }



David suggested a few cleanups that we could do in the xics-on-xive 
device. We might want to introduce a KVMPPC_IRQ_XICS_ON_XIVE flag also. 
First, I would like to get rid of references to the kvmppc_xive struct 
and remove some useless attributes to improve locking.

Once the XIVE native mode is merged, all kernels above 4.14 running on 
a P9 sPAPR guest will switch to XIVE and the xics-on-xive device will 
only be useful for nested.


C.
Paul Mackerras April 11, 2019, 10:27 a.m. UTC | #4
On Wed, Apr 10, 2019 at 07:04:48PM +0200, Cédric Le Goater wrote:
> When a P9 sPAPR VM boots, the CAS negotiation process determines which
> interrupt mode to use (XICS legacy or XIVE native) and invokes a
> machine reset to activate the chosen mode.
> 
> To be able to switch from one mode to another, we introduce the
> capability to release a KVM device without destroying the VM. The KVM
> device interface is extended with a new 'release' operation which is
> called when the file descriptor of the device is closed.

Unfortunately, I think there is now a memory leak:

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ea2018ae1cd7..ea2619d5ca98 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2938,6 +2938,19 @@ static int kvm_device_release(struct inode *inode, struct file *filp)
>  	struct kvm_device *dev = filp->private_data;
>  	struct kvm *kvm = dev->kvm;
>  
> +	if (!dev)
> +		return -ENODEV;
> +
> +	if (dev->kvm != kvm)
> +		return -EPERM;
> +
> +	if (dev->ops->release) {
> +		mutex_lock(&kvm->lock);
> +		list_del(&dev->vm_node);

Because the device is now no longer in the kvm->devices list,
kvm_destroy_devices() won't find it there and therefore won't call the
device's destroy method.  In fact now the device's destroy method will
never get called; I can't see how kvmppc_xive_free() or
kvmppc_xive_native_free() will ever get called.  Thus the memory for
the kvmppc_xive structs will never get freed as far as I can see.

We could fix that by freeing both of the kvm->arch.xive_devices
entries at VM destruction time.

If it is true that any device that has a release method will never see
its destroy method being called, then that needs to be documented
clearly somewhere.

Paul.
Cédric Le Goater April 11, 2019, 11:48 a.m. UTC | #5
On 4/11/19 12:27 PM, Paul Mackerras wrote:
> On Wed, Apr 10, 2019 at 07:04:48PM +0200, Cédric Le Goater wrote:
>> When a P9 sPAPR VM boots, the CAS negotiation process determines which
>> interrupt mode to use (XICS legacy or XIVE native) and invokes a
>> machine reset to activate the chosen mode.
>>
>> To be able to switch from one mode to another, we introduce the
>> capability to release a KVM device without destroying the VM. The KVM
>> device interface is extended with a new 'release' operation which is
>> called when the file descriptor of the device is closed.
> 
> Unfortunately, I think there is now a memory leak:
> 
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index ea2018ae1cd7..ea2619d5ca98 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -2938,6 +2938,19 @@ static int kvm_device_release(struct inode *inode, struct file *filp)
>>  	struct kvm_device *dev = filp->private_data;
>>  	struct kvm *kvm = dev->kvm;
>>  
>> +	if (!dev)
>> +		return -ENODEV;
>> +
>> +	if (dev->kvm != kvm)
>> +		return -EPERM;
>> +
>> +	if (dev->ops->release) {
>> +		mutex_lock(&kvm->lock);
>> +		list_del(&dev->vm_node);
> 
> Because the device is now no longer in the kvm->devices list,
> kvm_destroy_devices() won't find it there and therefore won't call the
> device's destroy method.  In fact now the device's destroy method will
> never get called; I can't see how kvmppc_xive_free() or
> kvmppc_xive_native_free() will ever get called.  Thus the memory for
> the kvmppc_xive structs will never get freed as far as I can see.

ah yes. indeed ...

> We could fix that by freeing both of the kvm->arch.xive_devices
> entries at VM destruction time.

That is what I was doing in the first patch I sent : 

    http://patchwork.ozlabs.org/patch/1082303/

It worked fine and then, I had this better (worse) idea which I included 
in v5. 
 
> If it is true that any device that has a release method will never see
> its destroy method being called, then that needs to be documented
> clearly somewhere.

Yes. Closing the fd should take care of it. I have to rework that patch.

Thanks,

C.
diff mbox series

Patch

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 9cc6abdce1b9..ed059c95e56a 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -314,6 +314,7 @@  struct kvm_arch {
 #ifdef CONFIG_KVM_XICS
 	struct kvmppc_xics *xics;
 	struct kvmppc_xive *xive;
+	struct kvmppc_xive *xive_devices[2];
 	struct kvmppc_passthru_irqmap *pimap;
 #endif
 	struct kvmppc_ops *kvm_ops;
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index e011622dc038..426146332984 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -283,6 +283,7 @@  void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
 int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio);
 int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
 				  bool single_escalation);
+struct kvmppc_xive *kvmppc_xive_get_device(struct kvm *kvm, u32 type);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 831d963451d8..3b444620d8fc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1246,6 +1246,7 @@  struct kvm_device_ops {
 	long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
 		      unsigned long arg);
 	int (*mmap)(struct kvm_device *dev, struct vm_area_struct *vma);
+	void (*release)(struct kvm_device *dev);
 };
 
 void kvm_device_get(struct kvm_device *dev);
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 480a3fc6b9fd..064a9f2ae678 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1100,11 +1100,19 @@  void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
 void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
-	struct kvmppc_xive *xive = xc->xive;
+	struct kvmppc_xive *xive;
 	int i;
 
+	if (!kvmppc_xics_enabled(vcpu))
+		return;
+
+	if (!xc)
+		return;
+
 	pr_devel("cleanup_vcpu(cpu=%d)\n", xc->server_num);
 
+	xive = xc->xive;
+
 	/* Ensure no interrupt is still routed to that VP */
 	xc->valid = false;
 	kvmppc_xive_disable_vcpu_interrupts(vcpu);
@@ -1141,6 +1149,10 @@  void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 	}
 	/* Free the VP */
 	kfree(xc);
+
+	/* Cleanup the vcpu */
+	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
+	vcpu->arch.xive_vcpu = NULL;
 }
 
 int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
@@ -1158,7 +1170,7 @@  int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 	}
 	if (xive->kvm != vcpu->kvm)
 		return -EPERM;
-	if (vcpu->arch.irq_type)
+	if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
 		return -EBUSY;
 	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
 		pr_devel("Duplicate !\n");
@@ -1824,6 +1836,9 @@  void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
 	}
 }
 
+/*
+ * Called when VM is destroyed
+ */
 static void kvmppc_xive_free(struct kvm_device *dev)
 {
 	struct kvmppc_xive *xive = dev->private;
@@ -1851,6 +1866,83 @@  static void kvmppc_xive_free(struct kvm_device *dev)
 	kfree(dev);
 }
 
+/*
+ * Called when device fd is closed
+ */
+static void kvmppc_xive_release(struct kvm_device *dev)
+{
+	struct kvmppc_xive *xive = dev->private;
+	struct kvm *kvm = xive->kvm;
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	pr_devel("Releasing xive device\n");
+
+	/*
+	 * When releasing the KVM device fd, the vCPUs can still be
+	 * running and we should clean up the vCPU interrupt
+	 * presenters first.
+	 */
+	if (atomic_read(&kvm->online_vcpus) != 0) {
+		/*
+		 * call kick_all_cpus_sync() to ensure that all CPUs
+		 * have executed any pending interrupts
+		 */
+		if (is_kvmppc_hv_enabled(kvm))
+			kick_all_cpus_sync();
+
+		/*
+		 * TODO: There is still a race window with the early
+		 * checks in kvmppc_native_connect_vcpu()
+		 */
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			kvmppc_xive_cleanup_vcpu(vcpu);
+	}
+
+	if (kvm)
+		kvm->arch.xive = NULL;
+
+	/* Mask and free interrupts */
+	for (i = 0; i <= xive->max_sbid; i++) {
+		if (xive->src_blocks[i])
+			kvmppc_xive_free_sources(xive->src_blocks[i]);
+		kfree(xive->src_blocks[i]);
+		xive->src_blocks[i] = NULL;
+	}
+
+	if (xive->vp_base != XIVE_INVALID_VP)
+		xive_native_free_vp_block(xive->vp_base);
+
+	/* Do no free device private data. Destroy handler will */
+	kfree(dev);
+}
+
+/*
+ * When the guest chooses the interrupt mode (XICS legacy or XIVE
+ * native), the VM will switch of KVM device. The previous device will
+ * be "released" before the new one is created.
+ *
+ * Until we are sure all execution paths are well protected, provide a
+ * fail safe (transitional) method for device destruction, in which
+ * the XIVE device pointer is recycled and not directly freed.
+ */
+struct kvmppc_xive *kvmppc_xive_get_device(struct kvm *kvm, u32 type)
+{
+	struct kvmppc_xive *xive;
+	bool xive_native_index = type == KVM_DEV_TYPE_XIVE;
+
+	xive = kvm->arch.xive_devices[xive_native_index];
+
+	if (!xive) {
+		xive = kzalloc(sizeof(*xive), GFP_KERNEL);
+		kvm->arch.xive_devices[xive_native_index] = xive;
+	} else {
+		memset(xive, 0, sizeof(*xive));
+	}
+
+	return xive;
+}
+
 static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 {
 	struct kvmppc_xive *xive;
@@ -1859,7 +1951,7 @@  static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 
 	pr_devel("Creating xive for partition\n");
 
-	xive = kzalloc(sizeof(*xive), GFP_KERNEL);
+	xive = kvmppc_xive_get_device(kvm, type);
 	if (!xive)
 		return -ENOMEM;
 
@@ -2024,6 +2116,7 @@  struct kvm_device_ops kvm_xive_ops = {
 	.name = "kvm-xive",
 	.create = kvmppc_xive_create,
 	.init = kvmppc_xive_init,
+	.release = kvmppc_xive_release,
 	.destroy = kvmppc_xive_free,
 	.set_attr = xive_set_attr,
 	.get_attr = xive_get_attr,
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 62648f833adf..ba1c55a93673 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -964,6 +964,9 @@  static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 	return -ENXIO;
 }
 
+/*
+ * Called when VM is destroyed
+ */
 static void kvmppc_xive_native_free(struct kvm_device *dev)
 {
 	struct kvmppc_xive *xive = dev->private;
@@ -991,6 +994,45 @@  static void kvmppc_xive_native_free(struct kvm_device *dev)
 	kfree(dev);
 }
 
+/*
+ * Called when device fd is closed
+ */
+static void kvmppc_xive_native_release(struct kvm_device *dev)
+{
+	struct kvmppc_xive *xive = dev->private;
+	struct kvm *kvm = xive->kvm;
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	pr_devel("Releasing xive native device\n");
+
+	/*
+	 * When releasing the KVM device fd, the vCPUs can still be
+	 * running and we should clean up the vCPU interrupt
+	 * presenters first.
+	 */
+	if (atomic_read(&kvm->online_vcpus) != 0) {
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			kvmppc_xive_native_cleanup_vcpu(vcpu);
+	}
+
+	if (kvm)
+		kvm->arch.xive = NULL;
+
+	for (i = 0; i <= xive->max_sbid; i++) {
+		if (xive->src_blocks[i])
+			kvmppc_xive_free_sources(xive->src_blocks[i]);
+		kfree(xive->src_blocks[i]);
+		xive->src_blocks[i] = NULL;
+	}
+
+	if (xive->vp_base != XIVE_INVALID_VP)
+		xive_native_free_vp_block(xive->vp_base);
+
+	/* Do no free device private data. Destroy handler will */
+	kfree(dev);
+}
+
 static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 {
 	struct kvmppc_xive *xive;
@@ -1002,7 +1044,7 @@  static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 	if (kvm->arch.xive)
 		return -EEXIST;
 
-	xive = kzalloc(sizeof(*xive), GFP_KERNEL);
+	xive = kvmppc_xive_get_device(kvm, type);
 	if (!xive)
 		return -ENOMEM;
 
@@ -1182,6 +1224,7 @@  struct kvm_device_ops kvm_xive_native_ops = {
 	.name = "kvm-xive-native",
 	.create = kvmppc_xive_native_create,
 	.init = kvmppc_xive_native_init,
+	.release = kvmppc_xive_native_release,
 	.destroy = kvmppc_xive_native_free,
 	.set_attr = kvmppc_xive_native_set_attr,
 	.get_attr = kvmppc_xive_native_get_attr,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ea2018ae1cd7..ea2619d5ca98 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2938,6 +2938,19 @@  static int kvm_device_release(struct inode *inode, struct file *filp)
 	struct kvm_device *dev = filp->private_data;
 	struct kvm *kvm = dev->kvm;
 
+	if (!dev)
+		return -ENODEV;
+
+	if (dev->kvm != kvm)
+		return -EPERM;
+
+	if (dev->ops->release) {
+		mutex_lock(&kvm->lock);
+		list_del(&dev->vm_node);
+		dev->ops->release(dev);
+		mutex_unlock(&kvm->lock);
+	}
+
 	kvm_put_kvm(kvm);
 	return 0;
 }