mbox series

[v15,0/7] MTE support for KVM guest

Message ID 20210614090525.4338-1-steven.price@arm.com
Headers show
Series MTE support for KVM guest | expand

Message

Steven Price June 14, 2021, 9:05 a.m. UTC
This series adds support for using the Arm Memory Tagging Extensions
(MTE) in a KVM guest.

I realise there are still open questions[1] around the performance of
this series (the 'big lock', tag_sync_lock, introduced in the first
patch). But there should be no impact on non-MTE workloads and until we
get real MTE-enabled hardware it's hard to know whether there is a need
for something more sophisticated or not. Peter Collingbourne's patch[3]
to clear the tags at page allocation time should hide more of the impact
for non-VM cases. So the remaining concern is around VM startup which
could be effectively serialised through the lock.

Changes since v14[2]:

 * Dropped "Handle MTE tags zeroing" patch in favour of Peter's similar
   patch[3] (now in arm64 tree).

 * Improved documentation following Catalin's review.

[1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
[2]: https://lore.kernel.org/r/20210607110816.25762-1-steven.price@arm.com/
[3]: https://lore.kernel.org/r/20210602235230.3928842-4-pcc@google.com/

Steven Price (7):
  arm64: mte: Handle race when synchronising tags
  arm64: mte: Sync tags for pages where PTE is untagged
  KVM: arm64: Introduce MTE VM feature
  KVM: arm64: Save/restore MTE registers
  KVM: arm64: Expose KVM_ARM_CAP_MTE
  KVM: arm64: ioctl to fetch/store tags in a guest
  KVM: arm64: Document MTE capability and ioctl

 Documentation/virt/kvm/api.rst             | 57 +++++++++++++++
 arch/arm64/include/asm/kvm_arm.h           |  3 +-
 arch/arm64/include/asm/kvm_emulate.h       |  3 +
 arch/arm64/include/asm/kvm_host.h          | 12 ++++
 arch/arm64/include/asm/kvm_mte.h           | 66 +++++++++++++++++
 arch/arm64/include/asm/mte-def.h           |  1 +
 arch/arm64/include/asm/mte.h               |  8 ++-
 arch/arm64/include/asm/pgtable.h           | 22 +++++-
 arch/arm64/include/asm/sysreg.h            |  3 +-
 arch/arm64/include/uapi/asm/kvm.h          | 11 +++
 arch/arm64/kernel/asm-offsets.c            |  2 +
 arch/arm64/kernel/mte.c                    | 54 ++++++++++++--
 arch/arm64/kvm/arm.c                       | 16 +++++
 arch/arm64/kvm/guest.c                     | 82 ++++++++++++++++++++++
 arch/arm64/kvm/hyp/entry.S                 |  7 ++
 arch/arm64/kvm/hyp/exception.c             |  3 +-
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 ++++++
 arch/arm64/kvm/mmu.c                       | 42 ++++++++++-
 arch/arm64/kvm/reset.c                     |  3 +-
 arch/arm64/kvm/sys_regs.c                  | 32 +++++++--
 include/uapi/linux/kvm.h                   |  2 +
 21 files changed, 429 insertions(+), 21 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

Comments

Catalin Marinas June 17, 2021, 12:13 p.m. UTC | #1
On Mon, Jun 14, 2021 at 10:05:18AM +0100, Steven Price wrote:
> I realise there are still open questions[1] around the performance of
> this series (the 'big lock', tag_sync_lock, introduced in the first
> patch). But there should be no impact on non-MTE workloads and until we
> get real MTE-enabled hardware it's hard to know whether there is a need
> for something more sophisticated or not. Peter Collingbourne's patch[3]
> to clear the tags at page allocation time should hide more of the impact
> for non-VM cases. So the remaining concern is around VM startup which
> could be effectively serialised through the lock.
[...]
> [1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org

Start-up, VM resume, migration could be affected by this lock, basically
any time you fault a page into the guest. As you said, for now it should
be fine as long as the hardware doesn't support MTE or qemu doesn't
enable MTE in guests. But the problem won't go away.

We have a partial solution with an array of locks to mitigate against
this but there's still the question of whether we should actually bother
for something that's unlikely to happen in practice: MAP_SHARED memory
in guests (ignoring the stage 1 case for now).

If MAP_SHARED in guests is not a realistic use-case, we have the vma in
user_mem_abort() and if the VM_SHARED flag is set together with MTE
enabled for guests, we can reject the mapping.

We can discuss the stage 1 case separately from this series.
Marc Zyngier June 17, 2021, 1:15 p.m. UTC | #2
On Thu, 17 Jun 2021 13:13:22 +0100,
Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> On Mon, Jun 14, 2021 at 10:05:18AM +0100, Steven Price wrote:
> > I realise there are still open questions[1] around the performance of
> > this series (the 'big lock', tag_sync_lock, introduced in the first
> > patch). But there should be no impact on non-MTE workloads and until we
> > get real MTE-enabled hardware it's hard to know whether there is a need
> > for something more sophisticated or not. Peter Collingbourne's patch[3]
> > to clear the tags at page allocation time should hide more of the impact
> > for non-VM cases. So the remaining concern is around VM startup which
> > could be effectively serialised through the lock.
> [...]
> > [1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
> 
> Start-up, VM resume, migration could be affected by this lock, basically
> any time you fault a page into the guest. As you said, for now it should
> be fine as long as the hardware doesn't support MTE or qemu doesn't
> enable MTE in guests. But the problem won't go away.

Indeed. And I find it odd to say "it's not a problem, we don't have
any HW available". By this token, why should we merge this work the
first place, or any of the MTE work that has gone into the kernel over
the past years?

> We have a partial solution with an array of locks to mitigate against
> this but there's still the question of whether we should actually bother
> for something that's unlikely to happen in practice: MAP_SHARED memory
> in guests (ignoring the stage 1 case for now).
> 
> If MAP_SHARED in guests is not a realistic use-case, we have the vma in
> user_mem_abort() and if the VM_SHARED flag is set together with MTE
> enabled for guests, we can reject the mapping.

That's a reasonable approach. I wonder whether we could do that right
at the point where the memslot is associated with the VM, like this:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a36a2e3082d8..ebd3b3224386 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1376,6 +1376,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 		if (!vma)
 			break;
 
+		if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED)
+			return -EINVAL;
+
 		/*
 		 * Take the intersection of this VMA with the memory region
 		 */

which takes the problem out of the fault path altogether? We document
the restriction and move on. With that, we can use a non-locking
version of mte_sync_page_tags().

> We can discuss the stage 1 case separately from this series.

Works for me.

Thanks,

	M.
Steven Price June 17, 2021, 1:24 p.m. UTC | #3
On 17/06/2021 14:15, Marc Zyngier wrote:
> On Thu, 17 Jun 2021 13:13:22 +0100,
> Catalin Marinas <catalin.marinas@arm.com> wrote:
>>
>> On Mon, Jun 14, 2021 at 10:05:18AM +0100, Steven Price wrote:
>>> I realise there are still open questions[1] around the performance of
>>> this series (the 'big lock', tag_sync_lock, introduced in the first
>>> patch). But there should be no impact on non-MTE workloads and until we
>>> get real MTE-enabled hardware it's hard to know whether there is a need
>>> for something more sophisticated or not. Peter Collingbourne's patch[3]
>>> to clear the tags at page allocation time should hide more of the impact
>>> for non-VM cases. So the remaining concern is around VM startup which
>>> could be effectively serialised through the lock.
>> [...]
>>> [1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
>>
>> Start-up, VM resume, migration could be affected by this lock, basically
>> any time you fault a page into the guest. As you said, for now it should
>> be fine as long as the hardware doesn't support MTE or qemu doesn't
>> enable MTE in guests. But the problem won't go away.
> 
> Indeed. And I find it odd to say "it's not a problem, we don't have
> any HW available". By this token, why should we merge this work the
> first place, or any of the MTE work that has gone into the kernel over
> the past years?
> 
>> We have a partial solution with an array of locks to mitigate against
>> this but there's still the question of whether we should actually bother
>> for something that's unlikely to happen in practice: MAP_SHARED memory
>> in guests (ignoring the stage 1 case for now).
>>
>> If MAP_SHARED in guests is not a realistic use-case, we have the vma in
>> user_mem_abort() and if the VM_SHARED flag is set together with MTE
>> enabled for guests, we can reject the mapping.
> 
> That's a reasonable approach. I wonder whether we could do that right
> at the point where the memslot is associated with the VM, like this:
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index a36a2e3082d8..ebd3b3224386 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1376,6 +1376,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  		if (!vma)
>  			break;
>  
> +		if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED)
> +			return -EINVAL;
> +
>  		/*
>  		 * Take the intersection of this VMA with the memory region
>  		 */
> 
> which takes the problem out of the fault path altogether? We document
> the restriction and move on. With that, we can use a non-locking
> version of mte_sync_page_tags().

Does this deal with the case where the VMAs are changed after the
memslot is created? While we can do the check here to give the VMM a
heads-up if it gets it wrong, I think we also need it in
user_mem_abort() to deal with a VMM which mmap()s over the VA of the
memslot. Or am I missing something?

But if everyone is happy with the restriction (just for KVM) of not
allowing MTE+VM_SHARED then that sounds like a good way forward.

Thanks,

Steve

>> We can discuss the stage 1 case separately from this series.
> 
> Works for me.
> 
> Thanks,
> 
> 	M.
>
Marc Zyngier June 17, 2021, 1:53 p.m. UTC | #4
On Thu, 17 Jun 2021 14:24:25 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> On 17/06/2021 14:15, Marc Zyngier wrote:
> > On Thu, 17 Jun 2021 13:13:22 +0100,
> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> >>
> >> On Mon, Jun 14, 2021 at 10:05:18AM +0100, Steven Price wrote:
> >>> I realise there are still open questions[1] around the performance of
> >>> this series (the 'big lock', tag_sync_lock, introduced in the first
> >>> patch). But there should be no impact on non-MTE workloads and until we
> >>> get real MTE-enabled hardware it's hard to know whether there is a need
> >>> for something more sophisticated or not. Peter Collingbourne's patch[3]
> >>> to clear the tags at page allocation time should hide more of the impact
> >>> for non-VM cases. So the remaining concern is around VM startup which
> >>> could be effectively serialised through the lock.
> >> [...]
> >>> [1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
> >>
> >> Start-up, VM resume, migration could be affected by this lock, basically
> >> any time you fault a page into the guest. As you said, for now it should
> >> be fine as long as the hardware doesn't support MTE or qemu doesn't
> >> enable MTE in guests. But the problem won't go away.
> > 
> > Indeed. And I find it odd to say "it's not a problem, we don't have
> > any HW available". By this token, why should we merge this work the
> > first place, or any of the MTE work that has gone into the kernel over
> > the past years?
> > 
> >> We have a partial solution with an array of locks to mitigate against
> >> this but there's still the question of whether we should actually bother
> >> for something that's unlikely to happen in practice: MAP_SHARED memory
> >> in guests (ignoring the stage 1 case for now).
> >>
> >> If MAP_SHARED in guests is not a realistic use-case, we have the vma in
> >> user_mem_abort() and if the VM_SHARED flag is set together with MTE
> >> enabled for guests, we can reject the mapping.
> > 
> > That's a reasonable approach. I wonder whether we could do that right
> > at the point where the memslot is associated with the VM, like this:
> > 
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index a36a2e3082d8..ebd3b3224386 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1376,6 +1376,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> >  		if (!vma)
> >  			break;
> >  
> > +		if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED)
> > +			return -EINVAL;
> > +
> >  		/*
> >  		 * Take the intersection of this VMA with the memory region
> >  		 */
> > 
> > which takes the problem out of the fault path altogether? We document
> > the restriction and move on. With that, we can use a non-locking
> > version of mte_sync_page_tags().
> 
> Does this deal with the case where the VMAs are changed after the
> memslot is created? While we can do the check here to give the VMM a
> heads-up if it gets it wrong, I think we also need it in
> user_mem_abort() to deal with a VMM which mmap()s over the VA of the
> memslot. Or am I missing something?

No, you're right. I wish the memslot API wasn't so lax... Anyway, even
a VMA flag check in user_mem_abort() will be cheaper than this new BKL.

> But if everyone is happy with the restriction (just for KVM) of not
> allowing MTE+VM_SHARED then that sounds like a good way forward.

Definitely works for me.

	M.