diff mbox series

[1/1] kvm/book3s_64: Fixes crash caused by not cleaning vhost IOTLB

Message ID 20191217210658.73144-1-leonardo@linux.ibm.com
State Changes Requested
Headers show
Series [1/1] kvm/book3s_64: Fixes crash caused by not cleaning vhost IOTLB | expand

Commit Message

Leonardo Bras Dec. 17, 2019, 9:06 p.m. UTC
Fixes a bug that happens when a virtual machine is created without DDW,
with vhost supporting a virtio-net device.

In this scenario, an IOMMU with 32-bit DMA window will possibly map
IOVA's to different memory addresses.

As the code works today, H_STUFF_TCE hypercall will be dealt only with
kvm code, which does not invalidate the IOTLB entry in vhost, meaning
that at some point, and old entry can cause an access to a previous
memory address that IOVA pointed.

Example:
- virtio-net passes IOVA N to vhost, which point to M1
- vhost tries IOTLB, but miss
- vhost translates IOVA N and stores result to IOTLB
- vhost writes to M1
- (some IOMMU usage)
- virtio-net passes IOVA N to vhost, which now points to M2
- vhost tries IOTLB, and translates IOVA N to M1
- vhost writes to M1 <error, should write to M2>

The reason why this error was not so evident, is probably because the
IOTLB was small enough to almost always miss at the point an IOVA was
reused. Raising the IOTLB size to 32k (which is a module parameter that
defaults to 2k) is enough to reproduce the bug in +90% of the runs.
It usually takes less than 10 seconds of netperf to cause this bug
to happen.

A few minutes after reproducing this bug, the guest usually crash.

Fixing this bug involves cleaning a IOVA entry from IOTLB.
The guest kernel trigger this by doing a H_STUFF_TCE hypercall with
tce_value == 0.

This change fixes this bug by returning H_TOO_HARD on kvmppc_h_stuff_tce
when tce_value == 0, which causes kvm to let qemu deal with this.
In this case, qemu does free the vhost IOTLB entry, which fixes the bug.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_64_vio.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Alexey Kardashevskiy Dec. 18, 2019, 4:53 a.m. UTC | #1
On 18/12/2019 08:06, Leonardo Bras wrote:
> Fixes a bug that happens when a virtual machine is created without DDW,
> with vhost supporting a virtio-net device.
> 
> In this scenario, an IOMMU with 32-bit DMA window will possibly map
> IOVA's to different memory addresses.
> 
> As the code works today, H_STUFF_TCE hypercall will be dealt only with
> kvm code, which does not invalidate the IOTLB entry in vhost, meaning
> that at some point, and old entry can cause an access to a previous
> memory address that IOVA pointed.
> 
> Example:
> - virtio-net passes IOVA N to vhost, which point to M1
> - vhost tries IOTLB, but miss
> - vhost translates IOVA N and stores result to IOTLB
> - vhost writes to M1
> - (some IOMMU usage)
> - virtio-net passes IOVA N to vhost, which now points to M2
> - vhost tries IOTLB, and translates IOVA N to M1
> - vhost writes to M1 <error, should write to M2>
> 
> The reason why this error was not so evident, is probably because the
> IOTLB was small enough to almost always miss at the point an IOVA was
> reused. Raising the IOTLB size to 32k (which is a module parameter that
> defaults to 2k) is enough to reproduce the bug in +90% of the runs.
> It usually takes less than 10 seconds of netperf to cause this bug
> to happen.
> 
> A few minutes after reproducing this bug, the guest usually crash.
> 
> Fixing this bug involves cleaning a IOVA entry from IOTLB.
> The guest kernel trigger this by doing a H_STUFF_TCE hypercall with
> tce_value == 0.
> 
> This change fixes this bug by returning H_TOO_HARD on kvmppc_h_stuff_tce
> when tce_value == 0, which causes kvm to let qemu deal with this.
> In this case, qemu does free the vhost IOTLB entry, which fixes the bug.
> 
> Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_64_vio.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 883a66e76638..841eff3f6392 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -710,6 +710,9 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	if (tce_value == 0)


H_STUFF_TCE is always called with 0. Well, may be some AIX somewhere
calls it with a value other than zero, and I probably saw some other
value somewhere but in QEMU/KVM case it is 0 so you effectively disable
in-kernel acceleration of H_STUFF_TCE which is undesirable.

For now we should disable in-kernel H_STUFF_TCE/... handlers in QEMU
just like we do for VFIO for older host kernels:

https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_iommu.c;h=3d3bcc86496a5277d62f7855fbb09c013c015f27;hb=HEAD#l208

I am not sure what a proper solution would be, something like an eventfd
and KVM's kvmppc_h_stuff_tce() signalling vhost that the latter needs to
invalidate iotlbs. Or we can just say that we do not allow KVM
acceleration if there is vhost+iommu on the same liobn (== vPHB, pretty
much). Thanks,



> +		return H_TOO_HARD;
> +
>  	/* Check permission bits only to allow userspace poison TCE for debug */
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>
Leonardo Bras Dec. 18, 2019, 11:28 p.m. UTC | #2
On Wed, 2019-12-18 at 15:53 +1100, Alexey Kardashevskiy wrote:
> H_STUFF_TCE is always called with 0. Well, may be some AIX somewhere
> calls it with a value other than zero, and I probably saw some other
> value somewhere but in QEMU/KVM case it is 0 so you effectively disable
> in-kernel acceleration of H_STUFF_TCE which is
> undesirable.
> 

Thanks for the feedback!

> For now we should disable in-kernel H_STUFF_TCE/... handlers in QEMU
> just like we do for VFIO for older host kernels:
> 
> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_iommu.c;h=3d3bcc86496a5277d62f7855fbb09c013c015f27;hb=HEAD#l208
 
I am still reading into this temporary solution, I could still not
understand how it works.

> I am not sure what a proper solution would be, something like an eventfd
> and KVM's kvmppc_h_stuff_tce() signaling vhost that the latter needs to
> invalidate iotlbs. Or we can just say that we do not allow KVM
> acceleration if there is vhost+iommu on the same liobn (== vPHB, pretty
> much). Thanks,

I am not used to eventfd, but i agree it's a valid solution to talk to
QEMU and then use it to send a message via /dev/vhost.
KVM -> QEMU -> vhost

But I can't get my mind out of another solution: doing it in
kernelspace.  I am not sure how that would work, though.

If I could understand correctly, there is a vhost IOTLB per vhost_dev,
and H_STUFF_TCE is not called in 64-bit DMA case (for tce_value == 0
case, at least), which makes sense, given it doesn't need to invalidate
entries on IOTLB.

So, we would need to somehow replace `return H_TOO_HARD` in this patch
with code that could call vhost_process_iotlb_msg() with
VHOST_IOTLB_INVALIDATE.

For that, I would need to know what are the vhost_dev's of that
process, which I don't know if it's possible to do currently (or safe
at all).

I am thinking of linking all vhost_dev's with a list (list.h) that
could be searched, comparing `mm_struct *` of the calling task with all
vhost_dev's, and removing the entry of all IOTLB that hits.

Not sure if that's the best approach to find the related vhost_dev's.

What do you think?

Best regards,
Leonardo Bras
Alexey Kardashevskiy Dec. 20, 2019, 3:35 a.m. UTC | #3
On 19/12/2019 10:28, Leonardo Bras wrote:
> On Wed, 2019-12-18 at 15:53 +1100, Alexey Kardashevskiy wrote:
>> H_STUFF_TCE is always called with 0. Well, may be some AIX somewhere
>> calls it with a value other than zero, and I probably saw some other
>> value somewhere but in QEMU/KVM case it is 0 so you effectively disable
>> in-kernel acceleration of H_STUFF_TCE which is
>> undesirable.
>>
> 
> Thanks for the feedback!
> 
>> For now we should disable in-kernel H_STUFF_TCE/... handlers in QEMU
>> just like we do for VFIO for older host kernels:
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_iommu.c;h=3d3bcc86496a5277d62f7855fbb09c013c015f27;hb=HEAD#l208
>  
> I am still reading into this temporary solution, I could still not
> understand how it works.
> 
>> I am not sure what a proper solution would be, something like an eventfd
>> and KVM's kvmppc_h_stuff_tce() signaling vhost that the latter needs to
>> invalidate iotlbs. Or we can just say that we do not allow KVM
>> acceleration if there is vhost+iommu on the same liobn (== vPHB, pretty
>> much). Thanks,
> 
> I am not used to eventfd, but i agree it's a valid solution to talk to
> QEMU and then use it to send a message via /dev/vhost.
> KVM -> QEMU -> vhost
> 
> But I can't get my mind out of another solution: doing it in
> kernelspace.  I am not sure how that would work, though.
> 
> If I could understand correctly, there is a vhost IOTLB per vhost_dev,
> and H_STUFF_TCE is not called in 64-bit DMA case (for tce_value == 0
> case, at least), which makes sense, given it doesn't need to invalidate
> entries on IOTLB.
> 
> So, we would need to somehow replace `return H_TOO_HARD` in this patch
> with code that could call vhost_process_iotlb_msg() with
> VHOST_IOTLB_INVALIDATE.
> 
> For that, I would need to know what are the vhost_dev's of that
> process, which I don't know if it's possible to do currently (or safe
> at all).
> 
> I am thinking of linking all vhost_dev's with a list (list.h) that
> could be searched, comparing `mm_struct *` of the calling task with all
> vhost_dev's, and removing the entry of all IOTLB that hits.
> 
> Not sure if that's the best approach to find the related vhost_dev's.
> 
> What do you think?


As discussed in slack, we need to do the same thing we do with physical
devices when we invalidate hardware IOMMU translation caches via
tbl->it_ops->tce_kill. The problem to solve now is how we tell KVM/PPC
about vhost/iotlb (is there an fd?), something similar to the existing
KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE. I guess x86 handles all the mappings
in QEMU and therefore they do not have this problem. Thanks,
diff mbox series

Patch

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 883a66e76638..841eff3f6392 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -710,6 +710,9 @@  long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	if (tce_value == 0)
+		return H_TOO_HARD;
+
 	/* Check permission bits only to allow userspace poison TCE for debug */
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;