diff mbox series

[v4,22/33] hostmem-epc: Add the reset interface for EPC backend reset

Message ID 20210719112136.57018-23-yang.zhong@intel.com
State New
Headers show
Series Qemu SGX virtualization | expand

Commit Message

Yang Zhong July 19, 2021, 11:21 a.m. UTC
Add the sgx_memory_backend_reset() interface to handle EPC backend
reset when VM is reset. This reset function will destroy previous
backend memory region and re-mmap the EPC section for guest.

Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 backends/hostmem-epc.c | 16 ++++++++++++++++
 include/hw/i386/pc.h   |  2 ++
 2 files changed, 18 insertions(+)

Comments

Paolo Bonzini Sept. 10, 2021, 3:10 p.m. UTC | #1
On 19/07/21 13:21, Yang Zhong wrote:
> +void sgx_memory_backend_reset(HostMemoryBackend *backend, int fd,
> +                              Error **errp)
> +{
> +    MemoryRegion *mr = &backend->mr;
> +
> +    mr->enabled = false;
> +
> +    /* destroy the old memory region if it exist */
> +    if (fd > 0 && mr->destructor) {
> +        mr->destructor(mr);
> +    }
> +
> +    sgx_epc_backend_memory_alloc(backend, errp);
> +}
> +

Jarkko, Sean, Kai,

this I think is problematic because it has a race window while 
/dev/sgx_vepc is closed and then reopened.  First, the vEPC space could 
be exhausted by somebody doing another mmap in the meanwhile.  Second, 
somebody might (for whatever reason) remove /dev/sgx_vepc while QEMU runs.

Yang explained to me (offlist) that this is needed because Windows fails 
to reboot without it.  We would need a way to ask Linux to reinitialize 
the vEPC, that doesn't involve munmap/mmap; this could be for example 
fallocate(FALLOC_FL_ZERO_RANGE).

What do you all think?

Paolo
Sean Christopherson Sept. 10, 2021, 3:34 p.m. UTC | #2
On Fri, Sep 10, 2021, Paolo Bonzini wrote:
> On 19/07/21 13:21, Yang Zhong wrote:
> > +void sgx_memory_backend_reset(HostMemoryBackend *backend, int fd,
> > +                              Error **errp)
> > +{
> > +    MemoryRegion *mr = &backend->mr;
> > +
> > +    mr->enabled = false;
> > +
> > +    /* destroy the old memory region if it exist */
> > +    if (fd > 0 && mr->destructor) {
> > +        mr->destructor(mr);
> > +    }
> > +
> > +    sgx_epc_backend_memory_alloc(backend, errp);
> > +}
> > +
> 
> Jarkko, Sean, Kai,
> 
> this I think is problematic because it has a race window while /dev/sgx_vepc
> is closed and then reopened.  First, the vEPC space could be exhausted by
> somebody doing another mmap in the meanwhile.  Second, somebody might (for
> whatever reason) remove /dev/sgx_vepc while QEMU runs.

Yep, both error cases are possible.

> Yang explained to me (offlist) that this is needed because Windows fails to
> reboot without it.  We would need a way to ask Linux to reinitialize the
> vEPC, that doesn't involve munmap/mmap; this could be for example
> fallocate(FALLOC_FL_ZERO_RANGE).
> 
> What do you all think?

Mechanically, FALLOC_FL_ZERO_RANGE could work, but it's definitely a stretch of
the semantics as the underlying memory would not actually be zeroed.

The only other option that comes to mind is a dedicated ioctl().
Paolo Bonzini Sept. 10, 2021, 5:09 p.m. UTC | #3
On 10/09/21 17:34, Sean Christopherson wrote:
>> Yang explained to me (offlist) that this is needed because Windows fails to
>> reboot without it.  We would need a way to ask Linux to reinitialize the
>> vEPC, that doesn't involve munmap/mmap; this could be for example
>> fallocate(FALLOC_FL_ZERO_RANGE).
>>
>> What do you all think?
>
> Mechanically, FALLOC_FL_ZERO_RANGE could work, but it's definitely a stretch of
> the semantics as the underlying memory would not actually be zeroed.

The contents are not visible to anyone, so they might well be zero
(not entirely serious, but also not entirely unserious).

> The only other option that comes to mind is a dedicated ioctl().

If it is not too restrictive to do it for the whole mapping at once,
that would be fine.  If it makes sense to do it for a range, however,
the effort of defining a ioctl() API is probably not worth it when
fallocate() is available.

Either way, I suppose it would be just something like

	/* for fallocate; otherwise just use xa_for_each */
	size = min_t(unsigned long, size, -start);
	end = (start + size - 1) >> PAGE_SHIFT;
	start >>= PAGE_SHIFT;

	/*
	 * Removing in two passes lets us remove SECS pages as well,
	 * since they can only be EREMOVE'd after all their child pages.
	 * An EREMOVE failure on the second pass means that the SECS
	 * page still has children on another instance.  Since it is
	 * still in the xarray, bury our head in the sand and ignore
	 * it; sgx_vepc_release() will deal with it.
	 */
	LIST_HEAD(secs_pages);
         xa_for_each_range(&vepc->page_array, index, entry, start, end) {
                 if (!sgx_vepc_free_page(entry))
                         list_add_tail(&epc_page->list, &secs_pages);
         }

         list_for_each_entry_safe(epc_page, tmp, &secs_pages, list) {
		list_del(&epc_page->list);
                 sgx_vepc_free_page(epc_page);
         }
  
So another advantage of the ioctl would be e.g. being able to return the
number of successfully EREMOVEd pages.  But since QEMU would not do
anything with the return value and _also_ bury its head in the sand,
that would only be useful if you guys have other uses in mind.

Paolo
Sean Christopherson Sept. 10, 2021, 5:34 p.m. UTC | #4
On Fri, Sep 10, 2021, Paolo Bonzini wrote:
> On 10/09/21 17:34, Sean Christopherson wrote:
> > > Yang explained to me (offlist) that this is needed because Windows fails to
> > > reboot without it.  We would need a way to ask Linux to reinitialize the
> > > vEPC, that doesn't involve munmap/mmap; this could be for example
> > > fallocate(FALLOC_FL_ZERO_RANGE).
> > > 
> > > What do you all think?
> > 
> > Mechanically, FALLOC_FL_ZERO_RANGE could work, but it's definitely a stretch of
> > the semantics as the underlying memory would not actually be zeroed.
> 
> The contents are not visible to anyone, so they might well be zero
> (not entirely serious, but also not entirely unserious).

Yeah, it wouldn't be a sticking point, just odd.

> > The only other option that comes to mind is a dedicated ioctl().
> 
> If it is not too restrictive to do it for the whole mapping at once,
> that would be fine.

Oooh, rats.  That reminds me of a complication.  If QEMU creates multiple EPC
sections, e.g. for a vNUMA setup, resetting each section individually will fail
if the guest did an unclean RESET and a given enclaves has EPC pages from multiple
sections.  E.g. an SECS in vEPC[X] can have children in vEPC[0..N-1], and all
those children need to be removed before the SECS can be removed.  Yay SGX!

> If it makes sense to do it for a range, however,
> the effort of defining a ioctl() API is probably not worth it when
> fallocate() is available.
> 
> Either way, I suppose it would be just something like
> 
> 	/* for fallocate; otherwise just use xa_for_each */
> 	size = min_t(unsigned long, size, -start);
> 	end = (start + size - 1) >> PAGE_SHIFT;
> 	start >>= PAGE_SHIFT;
> 
> 	/*
> 	 * Removing in two passes lets us remove SECS pages as well,
> 	 * since they can only be EREMOVE'd after all their child pages.
> 	 * An EREMOVE failure on the second pass means that the SECS
> 	 * page still has children on another instance.  Since it is
> 	 * still in the xarray, bury our head in the sand and ignore
> 	 * it; sgx_vepc_release() will deal with it.
> 	 */
> 	LIST_HEAD(secs_pages);
>         xa_for_each_range(&vepc->page_array, index, entry, start, end) {
>                 if (!sgx_vepc_free_page(entry))
>                         list_add_tail(&epc_page->list, &secs_pages);
>         }
> 
>         list_for_each_entry_safe(epc_page, tmp, &secs_pages, list) {
> 		list_del(&epc_page->list);
>                 sgx_vepc_free_page(epc_page);
>         }
> So another advantage of the ioctl would be e.g. being able to return the
> number of successfully EREMOVEd pages.  But since QEMU would not do
> anything with the return value and _also_ bury its head in the sand,
> that would only be useful if you guys have other uses in mind.

There are two options: 1) QEMU has to handle "failure", or 2) the kernel provides
an ioctl() that takes multiple vEPC fds and handles the SECS dependencies.  #1 is
probably the least awful option.  For #2, in addition to the kernel having to deal
with multiple fds, it would also need a second list_head object in each page so
that it could track which pages failed to be removed.  Using the existing list_head
would work for now, but it won't work if/when an EPC cgroup is added.

Note, for #1, QEMU would have to potentially do three passes.

  1. Remove child pages for a given vEPC.
  2. Remove SECS for a given vEPC that were pinned by children in the same vEPC.
  3. Remove SECS for all vEPC that were pinned by children in different vEPC.
Paolo Bonzini Sept. 10, 2021, 7:51 p.m. UTC | #5
On 10/09/21 19:34, Sean Christopherson wrote:
> On Fri, Sep 10, 2021, Paolo Bonzini wrote:
>> On 10/09/21 17:34, Sean Christopherson wrote:
>>> The only other option that comes to mind is a dedicated ioctl().
>>
>> If it is not too restrictive to do it for the whole mapping at once,
>> that would be fine.
> 
> Oooh, rats.  That reminds me of a complication.  If QEMU creates multiple EPC
> sections, e.g. for a vNUMA setup, resetting each section individually will fail
> if the guest did an unclean RESET and a given enclaves has EPC pages from multiple
> sections.  E.g. an SECS in vEPC[X] can have children in vEPC[0..N-1], and all
> those children need to be removed before the SECS can be removed.  Yay SGX!
> 
> There are two options: 1) QEMU has to handle "failure", or 2) the kernel provides
> an ioctl() that takes multiple vEPC fds and handles the SECS dependencies.  #1 is
> probably the least awful option.  For #2, in addition to the kernel having to deal
> with multiple fds, it would also need a second list_head object in each page so
> that it could track which pages failed to be removed.  Using the existing list_head
> would work for now, but it won't work if/when an EPC cgroup is added.
> 
> Note, for #1, QEMU would have to potentially do three passes.
> 
>    1. Remove child pages for a given vEPC.
>    2. Remove SECS for a given vEPC that were pinned by children in the same vEPC.
>    3. Remove SECS for all vEPC that were pinned by children in different vEPC.

It's also possible that QEMU handles failure, but the kernel does two 
passes; then QEMU can just do two passes.  The kernel will overall do 
four passes, but:

1) the second (SECS pinned by children in the same vEPC) would be 
cheaper than a full second pass

2) the fourth would actually do nothing, because there would be no pages 
failing the EREMOV'al.

A hypothetical other SGX client that only uses one vEPC will do the 
right thing with a single pass.

Paolo
Sean Christopherson Sept. 10, 2021, 8:21 p.m. UTC | #6
On Fri, Sep 10, 2021, Paolo Bonzini wrote:
> On 10/09/21 19:34, Sean Christopherson wrote:
> > On Fri, Sep 10, 2021, Paolo Bonzini wrote:
> > > On 10/09/21 17:34, Sean Christopherson wrote:
> > > > The only other option that comes to mind is a dedicated ioctl().
> > > 
> > > If it is not too restrictive to do it for the whole mapping at once,
> > > that would be fine.
> > 
> > Oooh, rats.  That reminds me of a complication.  If QEMU creates multiple EPC
> > sections, e.g. for a vNUMA setup, resetting each section individually will fail
> > if the guest did an unclean RESET and a given enclaves has EPC pages from multiple
> > sections.  E.g. an SECS in vEPC[X] can have children in vEPC[0..N-1], and all
> > those children need to be removed before the SECS can be removed.  Yay SGX!
> > 
> > There are two options: 1) QEMU has to handle "failure", or 2) the kernel provides
> > an ioctl() that takes multiple vEPC fds and handles the SECS dependencies.  #1 is
> > probably the least awful option.  For #2, in addition to the kernel having to deal
> > with multiple fds, it would also need a second list_head object in each page so
> > that it could track which pages failed to be removed.  Using the existing list_head
> > would work for now, but it won't work if/when an EPC cgroup is added.
> > 
> > Note, for #1, QEMU would have to potentially do three passes.
> > 
> >    1. Remove child pages for a given vEPC.
> >    2. Remove SECS for a given vEPC that were pinned by children in the same vEPC.
> >    3. Remove SECS for all vEPC that were pinned by children in different vEPC.
> 
> It's also possible that QEMU handles failure, but the kernel does two
> passes; then QEMU can just do two passes.  The kernel will overall do four
> passes, but:
> 
> 1) the second (SECS pinned by children in the same vEPC) would be cheaper
> than a full second pass

The problem is that this would require a list_head (or temp allocations) to track
the SECS pages that failed the first time 'round.  For vEPC destruction, the kernel
can use sgx_epc_page.list because it can take the pages off the active/allocated
list, but that's not an option in this case because the presumably-upcoming EPC
cgroup needs to keep pages on the list to handle OOM.

The kernel's ioctl/syscall/whatever could return the number of pages that were
not freed, or maybe just -EAGAIN, and userspace could use that to know it needs
to do another reset to free everything.

My thought for QEMU was to do (bad pseudocode):

	/* Retry to EREMOVE pinned SECS pages if necessary. */
	ret = ioctl(SGX_VEPC_RESET, ...);
	if (ret)
		ret = ioctl(SGX_VEPC_RESET, ...);

	/*
	 * Tag the VM as needed yet another round of resets to ERMOVE SECS pages
	 * that were pinned across vEPC sections.
	 */
	vm->sgx_epc_final_reset_needed = !!ret;

> 2) the fourth would actually do nothing, because there would be no pages
> failing the EREMOV'al.
> 
> A hypothetical other SGX client that only uses one vEPC will do the right
> thing with a single pass.
Paolo Bonzini Sept. 10, 2021, 8:57 p.m. UTC | #7
Il ven 10 set 2021, 22:21 Sean Christopherson <seanjc@google.com> ha
scritto:

> > It's also possible that QEMU handles failure, but the kernel does two
> > passes; then QEMU can just do two passes.  The kernel will overall do
> four
> > passes, but:
> >
> > 1) the second (SECS pinned by children in the same vEPC) would be cheaper
> > than a full second pass
>
> The problem is that this would require a list_head (or temp allocations)
> to track
> the SECS pages that failed the first time 'round.  For vEPC destruction,
> the kernel
> can use sgx_epc_page.list because it can take the pages off the
> active/allocated
> list, but that's not an option in this case because the
> presumably-upcoming EPC
> cgroup needs to keep pages on the list to handle OOM.
>

Good point, so yeah: let's go for a ioctl that does full removal, returning
the number of failures. I will try and cobble up a patch unless Kai beats
me to it.

Thanks for the quick discussion!

Paolo


> The kernel's ioctl/syscall/whatever could return the number of pages that
> were
> not freed, or maybe just -EAGAIN, and userspace could use that to know it
> needs
> to do another reset to free everything.
>
> My thought for QEMU was to do (bad pseudocode):
>
>         /* Retry to EREMOVE pinned SECS pages if necessary. */
>         ret = ioctl(SGX_VEPC_RESET, ...);
>         if (ret)
>                 ret = ioctl(SGX_VEPC_RESET, ...);
>
>         /*
>          * Tag the VM as needed yet another round of resets to ERMOVE SECS
> pages
>          * that were pinned across vEPC sections.
>          */
>         vm->sgx_epc_final_reset_needed = !!ret;
>
> > 2) the fourth would actually do nothing, because there would be no pages
> > failing the EREMOV'al.
> >
> > A hypothetical other SGX client that only uses one vEPC will do the right
> > thing with a single pass.
>
>
Jarkko Sakkinen Sept. 13, 2021, 8:17 p.m. UTC | #8
On Fri, 2021-09-10 at 17:10 +0200, Paolo Bonzini wrote:
> On 19/07/21 13:21, Yang Zhong wrote:
> > +void sgx_memory_backend_reset(HostMemoryBackend *backend, int fd,
> > +                              Error **errp)
> > +{
> > +    MemoryRegion *mr = &backend->mr;
> > +
> > +    mr->enabled = false;
> > +
> > +    /* destroy the old memory region if it exist */
> > +    if (fd > 0 && mr->destructor) {
> > +        mr->destructor(mr);
> > +    }
> > +
> > +    sgx_epc_backend_memory_alloc(backend, errp);
> > +}
> > +
> 
> Jarkko, Sean, Kai,
> 
> this I think is problematic because it has a race window while 
> /dev/sgx_vepc is closed and then reopened.  First, the vEPC space could 
> be exhausted by somebody doing another mmap in the meanwhile.  Second, 
> somebody might (for whatever reason) remove /dev/sgx_vepc while QEMU runs.

1: Why is it a problem that mmap() could fail?

2: Are you speaking about removing device node? If you have succesfully
   mapped /dev/sgx_vepc, that should not have much effect (file refcount).

/Jarkko
Sean Christopherson Sept. 13, 2021, 8:37 p.m. UTC | #9
On Mon, Sep 13, 2021, Jarkko Sakkinen wrote:
> On Fri, 2021-09-10 at 17:10 +0200, Paolo Bonzini wrote:
> > On 19/07/21 13:21, Yang Zhong wrote:
> > > +void sgx_memory_backend_reset(HostMemoryBackend *backend, int fd,
> > > +                              Error **errp)
> > > +{
> > > +    MemoryRegion *mr = &backend->mr;
> > > +
> > > +    mr->enabled = false;
> > > +
> > > +    /* destroy the old memory region if it exist */
> > > +    if (fd > 0 && mr->destructor) {
> > > +        mr->destructor(mr);
> > > +    }
> > > +
> > > +    sgx_epc_backend_memory_alloc(backend, errp);
> > > +}
> > > +
> > 
> > Jarkko, Sean, Kai,
> > 
> > this I think is problematic because it has a race window while 
> > /dev/sgx_vepc is closed and then reopened.  First, the vEPC space could 
> > be exhausted by somebody doing another mmap in the meanwhile.  Second, 
> > somebody might (for whatever reason) remove /dev/sgx_vepc while QEMU runs.
> 
> 1: Why is it a problem that mmap() could fail?

The flow in question is QEMU's emulation of a guest RESET.  If mmap() fails, QEMU
either has to kill the VM or disable SGX.  In either case, it's fatal to a running
workload/VM.

> 2: Are you speaking about removing device node? If you have succesfully
>    mapped /dev/sgx_vepc, that should not have much effect (file refcount).

Paolo was calling out that doing munmap() before mmap() would allow /dev/sgx_vepc
to be removed because QEMU would no longer hold a reference to /dev/sgx_vepc.  That
would again be fatal to the VM as QEMU would not be able to re-mmap() guest EPC.
Jarkko Sakkinen Sept. 13, 2021, 9:23 p.m. UTC | #10
On Mon, 2021-09-13 at 20:37 +0000, Sean Christopherson wrote:
> On Mon, Sep 13, 2021, Jarkko Sakkinen wrote:
> > On Fri, 2021-09-10 at 17:10 +0200, Paolo Bonzini wrote:
> > > On 19/07/21 13:21, Yang Zhong wrote:
> > > > +void sgx_memory_backend_reset(HostMemoryBackend *backend, int fd,
> > > > +                              Error **errp)
> > > > +{
> > > > +    MemoryRegion *mr = &backend->mr;
> > > > +
> > > > +    mr->enabled = false;
> > > > +
> > > > +    /* destroy the old memory region if it exist */
> > > > +    if (fd > 0 && mr->destructor) {
> > > > +        mr->destructor(mr);
> > > > +    }
> > > > +
> > > > +    sgx_epc_backend_memory_alloc(backend, errp);
> > > > +}
> > > > +
> > > 
> > > Jarkko, Sean, Kai,
> > > 
> > > this I think is problematic because it has a race window while 
> > > /dev/sgx_vepc is closed and then reopened.  First, the vEPC space could 
> > > be exhausted by somebody doing another mmap in the meanwhile.  Second, 
> > > somebody might (for whatever reason) remove /dev/sgx_vepc while QEMU runs.
> > 
> > 1: Why is it a problem that mmap() could fail?
> 
> The flow in question is QEMU's emulation of a guest RESET.  If mmap() fails, QEMU
> either has to kill the VM or disable SGX.  In either case, it's fatal to a running
> workload/VM.

Thanks for the explanations.

Isn't this more about badly configured system/workloads? That's
at least for me the existential question.

I'm interested of legit workloads where this behaviour could still
cause any issues.

I'd guess than in e.g. data center environment, you'd have firly
strict orchestration for this type of resource so that you know
that workloads have an appropriate bandwidth.

/Jarkko
diff mbox series

Patch

diff --git a/backends/hostmem-epc.c b/backends/hostmem-epc.c
index b512a68cb0..3bd1535d82 100644
--- a/backends/hostmem-epc.c
+++ b/backends/hostmem-epc.c
@@ -16,6 +16,7 @@ 
 #include "qom/object_interfaces.h"
 #include "qapi/error.h"
 #include "sysemu/hostmem.h"
+#include "hw/i386/pc.h"
 
 #define TYPE_MEMORY_BACKEND_EPC "memory-backend-epc"
 
@@ -55,6 +56,21 @@  sgx_epc_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     g_free(name);
 }
 
+void sgx_memory_backend_reset(HostMemoryBackend *backend, int fd,
+                              Error **errp)
+{
+    MemoryRegion *mr = &backend->mr;
+
+    mr->enabled = false;
+
+    /* destroy the old memory region if it exist */
+    if (fd > 0 && mr->destructor) {
+        mr->destructor(mr);
+    }
+
+    sgx_epc_backend_memory_alloc(backend, errp);
+}
+
 static void sgx_epc_backend_instance_init(Object *obj)
 {
     HostMemoryBackend *m = MEMORY_BACKEND(obj);
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 2fd6e6193b..c725dfef5b 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -204,6 +204,8 @@  extern const size_t pc_compat_6_0_len;
 
 /* sgx-epc.c */
 void pc_machine_init_sgx_epc(PCMachineState *pcms);
+void sgx_memory_backend_reset(HostMemoryBackend *backend, int fd,
+                              Error **errp);
 
 extern GlobalProperty pc_compat_5_2[];
 extern const size_t pc_compat_5_2_len;