Patchwork Permit -mem-path without sync mmu

login
register
mail settings
Submitter David Gibson
Date Aug. 5, 2011, 4:02 a.m.
Message ID <1312516970-26606-1-git-send-email-david@gibson.dropbear.id.au>
Download mbox | patch
Permalink /patch/108603/
State New
Headers show

Comments

David Gibson - Aug. 5, 2011, 4:02 a.m.
At present, an explicit test disallows use of -mem-path when kvm is enabled
but KVM_CAP_SYNC_MMU is not set.  In particular, this prevents the user
from using hugetlbfs to back the guest memory.

I can see no reason for this check, and when I asked about it previously,
the only theory offered was that this was a limitation of the very early
days of kvm which only happened to match the SYNC_MMU flag by accident.

This patch, therefore, removes the check.  This is of particular use to
us on POWER, where we haven't yet implement SYNC_MMU, but where backing
the guest with hugepages is possible, and in fact mandatory (for now).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 exec.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)
Jan Kiszka - Aug. 5, 2011, 6:16 a.m.
On 2011-08-05 06:02, David Gibson wrote:
> At present, an explicit test disallows use of -mem-path when kvm is enabled
> but KVM_CAP_SYNC_MMU is not set.  In particular, this prevents the user
> from using hugetlbfs to back the guest memory.
> 
> I can see no reason for this check, and when I asked about it previously,
> the only theory offered was that this was a limitation of the very early
> days of kvm which only happened to match the SYNC_MMU flag by accident.
> 
> This patch, therefore, removes the check.  This is of particular use to
> us on POWER, where we haven't yet implement SYNC_MMU, but where backing
> the guest with hugepages is possible, and in fact mandatory (for now).
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  exec.c |    5 -----
>  1 files changed, 0 insertions(+), 5 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 476b507..041637c 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2818,11 +2818,6 @@ static void *file_ram_alloc(RAMBlock *block,
>          return NULL;
>      }
>  
> -    if (kvm_enabled() && !kvm_has_sync_mmu()) {
> -        fprintf(stderr, "host lacks kvm mmu notifiers, -mem-path unsupported\n");
> -        return NULL;
> -    }
> -
>      if (asprintf(&filename, "%s/qemu_back_mem.XXXXXX", path) == -1) {
>          return NULL;
>      }

This is nothing trivial, see ce9a92411d in qemu-kvm or
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/27380. And it
should rather target uq/master. CCing Avi, Marcelo, and the kvm list.

Jan
Marcelo Tosatti - Aug. 5, 2011, 3:30 p.m.
On Fri, Aug 05, 2011 at 08:16:42AM +0200, Jan Kiszka wrote:
> On 2011-08-05 06:02, David Gibson wrote:
> > At present, an explicit test disallows use of -mem-path when kvm is enabled
> > but KVM_CAP_SYNC_MMU is not set.  In particular, this prevents the user
> > from using hugetlbfs to back the guest memory.
> > 
> > I can see no reason for this check, and when I asked about it previously,
> > the only theory offered was that this was a limitation of the very early
> > days of kvm which only happened to match the SYNC_MMU flag by accident.
> > 
> > This patch, therefore, removes the check.  This is of particular use to
> > us on POWER, where we haven't yet implement SYNC_MMU, but where backing
> > the guest with hugepages is possible, and in fact mandatory (for now).
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  exec.c |    5 -----
> >  1 files changed, 0 insertions(+), 5 deletions(-)
> > 
> > diff --git a/exec.c b/exec.c
> > index 476b507..041637c 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -2818,11 +2818,6 @@ static void *file_ram_alloc(RAMBlock *block,
> >          return NULL;
> >      }
> >  
> > -    if (kvm_enabled() && !kvm_has_sync_mmu()) {
> > -        fprintf(stderr, "host lacks kvm mmu notifiers, -mem-path unsupported\n");
> > -        return NULL;
> > -    }
> > -
> >      if (asprintf(&filename, "%s/qemu_back_mem.XXXXXX", path) == -1) {
> >          return NULL;
> >      }
> 
> This is nothing trivial, see ce9a92411d in qemu-kvm or
> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/27380. And it
> should rather target uq/master. CCing Avi, Marcelo, and the kvm list.
> 
> Jan

Yes, the check cannot be removed because there is the possibility of
corruption using hugepages without mmu notifiers (described in the 
archived message above).

Why are mmu notifiers not implemented for PPC again?
David Gibson - Aug. 8, 2011, 6:03 a.m.
On Fri, Aug 05, 2011 at 12:30:53PM -0300, Marcelo Tosatti wrote:
> On Fri, Aug 05, 2011 at 08:16:42AM +0200, Jan Kiszka wrote:
> > On 2011-08-05 06:02, David Gibson wrote:
> > > At present, an explicit test disallows use of -mem-path when kvm is enabled
> > > but KVM_CAP_SYNC_MMU is not set.  In particular, this prevents the user
> > > from using hugetlbfs to back the guest memory.
> > > 
> > > I can see no reason for this check, and when I asked about it previously,
> > > the only theory offered was that this was a limitation of the very early
> > > days of kvm which only happened to match the SYNC_MMU flag by accident.
> > > 
> > > This patch, therefore, removes the check.  This is of particular use to
> > > us on POWER, where we haven't yet implement SYNC_MMU, but where backing
> > > the guest with hugepages is possible, and in fact mandatory (for now).
> > > 
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  exec.c |    5 -----
> > >  1 files changed, 0 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/exec.c b/exec.c
> > > index 476b507..041637c 100644
> > > --- a/exec.c
> > > +++ b/exec.c
> > > @@ -2818,11 +2818,6 @@ static void *file_ram_alloc(RAMBlock *block,
> > >          return NULL;
> > >      }
> > >  
> > > -    if (kvm_enabled() && !kvm_has_sync_mmu()) {
> > > -        fprintf(stderr, "host lacks kvm mmu notifiers, -mem-path unsupported\n");
> > > -        return NULL;
> > > -    }
> > > -
> > >      if (asprintf(&filename, "%s/qemu_back_mem.XXXXXX", path) == -1) {
> > >          return NULL;
> > >      }
> > 
> > This is nothing trivial, see ce9a92411d in qemu-kvm or
> > http://thread.gmane.org/gmane.comp.emulators.kvm.devel/27380. And it
> > should rather target uq/master. CCing Avi, Marcelo, and the kvm list.
> > 
> > Jan

Well, sending the patch flushed out the real reason for that check, at
least, as I thought it might.

> Yes, the check cannot be removed because there is the possibility of
> corruption using hugepages without mmu notifiers (described in the 
> archived message above).

Ok, so.  If I understand the archived message correctly.  First, this
check *is* all about hugepages - which is not obvious from the test
itself.

Second, if userspace qemu passing hugepages to kvm can cause (host)
kernel memory corruption, that is clearly a host kernel bug.  So am I
correct in thinking this is basically just a safety feature if qemu is
run on a buggy kernel.  Presumably this bug was corrected at some
point?  Is the presence of the SYNC_MMU feature just being used as a
proxy for "is this kernel recent enough to have the corruption bug
fixed"?

In any case this test sure as hell needs a big comment next to it
explaining this context.

> Why are mmu notifiers not implemented for PPC again?

It's just not done yet; we're working on it.  (That is, mmu notifiers
are certainly present on PPC, it's just they're not wired up to kvm,
yet).
Avi Kivity - Aug. 8, 2011, 8:24 a.m.
On 08/08/2011 09:03 AM, David Gibson wrote:
> Second, if userspace qemu passing hugepages to kvm can cause (host)
> kernel memory corruption, that is clearly a host kernel bug.  So am I
> correct in thinking this is basically just a safety feature if qemu is
> run on a buggy kernel.

Seems so, yes.  2.6.2[456] are exploitable.  We only found out after 
these were all released.

> Presumably this bug was corrected at some
> point?  Is the presence of the SYNC_MMU feature just being used as a
> proxy for "is this kernel recent enough to have the corruption bug
> fixed"?

SYNC_MMU actually fixes the bug.

> In any case this test sure as hell needs a big comment next to it
> explaining this context.

Yes.


>
> >  Why are mmu notifiers not implemented for PPC again?
>
> It's just not done yet; we're working on it.  (That is, mmu notifiers
> are certainly present on PPC, it's just they're not wired up to kvm,
> yet).
>

If ppc doesn't have this issue even without SYNC_MMU, we can make the 
check x86 specific.
David Gibson - Aug. 10, 2011, 5:10 a.m.
On Mon, Aug 08, 2011 at 11:24:09AM +0300, Avi Kivity wrote:
> On 08/08/2011 09:03 AM, David Gibson wrote:
> >Second, if userspace qemu passing hugepages to kvm can cause (host)
> >kernel memory corruption, that is clearly a host kernel bug.  So am I
> >correct in thinking this is basically just a safety feature if qemu is
> >run on a buggy kernel.
> 
> Seems so, yes.  2.6.2[456] are exploitable.  We only found out after
> these were all released.
> 
> >Presumably this bug was corrected at some
> >point?  Is the presence of the SYNC_MMU feature just being used as a
> >proxy for "is this kernel recent enough to have the corruption bug
> >fixed"?
> 
> SYNC_MMU actually fixes the bug.

Ah, so SYNC_MMU fixed the bug on x86, and all the other archs without
SYNC_MMU were left with a serious memory corruption bug, under a
userspace bandaid.  Thanks for that.

As I understand the bug that causes the problem, it's because removing
all the hugepage VMAs from userspace will cause the inode (and
therefore address_space) for the hugepage file to be freed, but not
the pages (because another ref is held by kvm).  Then when kvm
releases the pages, the address_space will be touched after free from
free_huge_page().

This would seem to be a genuine bug in the hugepage code, which has
just been hidden by SYNC_MMU.  It should be quite easy to fix - the
mapping is only stored in the struct page to get to the hugetlbfs
superblock, so we could just store a direct superblock pointer
instead, and bump it's refcount when we put that in the page private
pointer.

But then I'm not sure how qemu would detect that it's on a kernel
where the bug is fixed and allow -mem-path to be used again.  Any
ideas?
Avi Kivity - Aug. 10, 2011, 9:01 a.m.
On 08/10/2011 08:10 AM, David Gibson wrote:
> On Mon, Aug 08, 2011 at 11:24:09AM +0300, Avi Kivity wrote:
> >  On 08/08/2011 09:03 AM, David Gibson wrote:
> >  >Second, if userspace qemu passing hugepages to kvm can cause (host)
> >  >kernel memory corruption, that is clearly a host kernel bug.  So am I
> >  >correct in thinking this is basically just a safety feature if qemu is
> >  >run on a buggy kernel.
> >
> >  Seems so, yes.  2.6.2[456] are exploitable.  We only found out after
> >  these were all released.
> >
> >  >Presumably this bug was corrected at some
> >  >point?  Is the presence of the SYNC_MMU feature just being used as a
> >  >proxy for "is this kernel recent enough to have the corruption bug
> >  >fixed"?
> >
> >  SYNC_MMU actually fixes the bug.
>
> Ah, so SYNC_MMU fixed the bug on x86, and all the other archs without
> SYNC_MMU were left with a serious memory corruption bug, under a
> userspace bandaid.  Thanks for that.

Unfortunately it's all too easy to ignore non-x86.

It may be considered that not implementing SYNC_MMU is a bug in itself, 
as it allows userspace to pin arbitrary amounts of user memory.  At 
least on x86 we had shrinkers that kill off shadow page tables under 
memory pressure, unpinning memory, but I don't see it on ppc.

> As I understand the bug that causes the problem, it's because removing
> all the hugepage VMAs from userspace will cause the inode (and
> therefore address_space) for the hugepage file to be freed, but not
> the pages (because another ref is held by kvm).  Then when kvm
> releases the pages, the address_space will be touched after free from
> free_huge_page().
>
> This would seem to be a genuine bug in the hugepage code, which has
> just been hidden by SYNC_MMU.  It should be quite easy to fix - the
> mapping is only stored in the struct page to get to the hugetlbfs
> superblock, so we could just store a direct superblock pointer
> instead, and bump it's refcount when we put that in the page private
> pointer.
>
> But then I'm not sure how qemu would detect that it's on a kernel
> where the bug is fixed and allow -mem-path to be used again.  Any
> ideas?

If it's just a kernel bug, the fix belongs in the kernel, not in qemu.

We used to have KVM_CAPs to declare this sort of thing 
(KVM_CAP_HUGETLBFS_WORKS_EVEN_WITHOUT_SYNC_MMU) but I don't think it was 
a good idea.
David Gibson - Aug. 11, 2011, 6:09 a.m.
On Wed, Aug 10, 2011 at 12:01:04PM +0300, Avi Kivity wrote:
> On 08/10/2011 08:10 AM, David Gibson wrote:
> >On Mon, Aug 08, 2011 at 11:24:09AM +0300, Avi Kivity wrote:
> >>  On 08/08/2011 09:03 AM, David Gibson wrote:
[snip]
> >This would seem to be a genuine bug in the hugepage code, which has
> >just been hidden by SYNC_MMU.  It should be quite easy to fix - the
> >mapping is only stored in the struct page to get to the hugetlbfs
> >superblock, so we could just store a direct superblock pointer
> >instead, and bump it's refcount when we put that in the page private
> >pointer.
> >
> >But then I'm not sure how qemu would detect that it's on a kernel
> >where the bug is fixed and allow -mem-path to be used again.  Any
> >ideas?
> 
> If it's just a kernel bug, the fix belongs in the kernel, not in qemu.

Obviously.

> We used to have KVM_CAPs to declare this sort of thing
> (KVM_CAP_HUGETLBFS_WORKS_EVEN_WITHOUT_SYNC_MMU) but I don't think it
> was a good idea.

I tend to agree - especially since there's nothing actually kvm
specific about this bug.  AFAICT a driver which did gup on hugepages
could trigger the bug equally well.

Patch

diff --git a/exec.c b/exec.c
index 476b507..041637c 100644
--- a/exec.c
+++ b/exec.c
@@ -2818,11 +2818,6 @@  static void *file_ram_alloc(RAMBlock *block,
         return NULL;
     }
 
-    if (kvm_enabled() && !kvm_has_sync_mmu()) {
-        fprintf(stderr, "host lacks kvm mmu notifiers, -mem-path unsupported\n");
-        return NULL;
-    }
-
     if (asprintf(&filename, "%s/qemu_back_mem.XXXXXX", path) == -1) {
         return NULL;
     }