Message ID | 1312516970-26606-1-git-send-email-david@gibson.dropbear.id.au |
---|---|
State | New |
Headers | show |
On 2011-08-05 06:02, David Gibson wrote: > At present, an explicit test disallows use of -mem-path when kvm is enabled > but KVM_CAP_SYNC_MMU is not set. In particular, this prevents the user > from using hugetlbfs to back the guest memory. > > I can see no reason for this check, and when I asked about it previously, > the only theory offered was that this was a limitation of the very early > days of kvm which only happened to match the SYNC_MMU flag by accident. > > This patch, therefore, removes the check. This is of particular use to > us on POWER, where we haven't yet implement SYNC_MMU, but where backing > the guest with hugepages is possible, and in fact mandatory (for now). > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > --- > exec.c | 5 ----- > 1 files changed, 0 insertions(+), 5 deletions(-) > > diff --git a/exec.c b/exec.c > index 476b507..041637c 100644 > --- a/exec.c > +++ b/exec.c > @@ -2818,11 +2818,6 @@ static void *file_ram_alloc(RAMBlock *block, > return NULL; > } > > - if (kvm_enabled() && !kvm_has_sync_mmu()) { > - fprintf(stderr, "host lacks kvm mmu notifiers, -mem-path unsupported\n"); > - return NULL; > - } > - > if (asprintf(&filename, "%s/qemu_back_mem.XXXXXX", path) == -1) { > return NULL; > } This is nothing trivial, see ce9a92411d in qemu-kvm or http://thread.gmane.org/gmane.comp.emulators.kvm.devel/27380. And it should rather target uq/master. CCing Avi, Marcelo, and the kvm list. Jan
On Fri, Aug 05, 2011 at 08:16:42AM +0200, Jan Kiszka wrote: > On 2011-08-05 06:02, David Gibson wrote: > > At present, an explicit test disallows use of -mem-path when kvm is enabled > > but KVM_CAP_SYNC_MMU is not set. In particular, this prevents the user > > from using hugetlbfs to back the guest memory. > > > > I can see no reason for this check, and when I asked about it previously, > > the only theory offered was that this was a limitation of the very early > > days of kvm which only happened to match the SYNC_MMU flag by accident. > > > > This patch, therefore, removes the check. This is of particular use to > > us on POWER, where we haven't yet implement SYNC_MMU, but where backing > > the guest with hugepages is possible, and in fact mandatory (for now). > > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > > --- > > exec.c | 5 ----- > > 1 files changed, 0 insertions(+), 5 deletions(-) > > > > diff --git a/exec.c b/exec.c > > index 476b507..041637c 100644 > > --- a/exec.c > > +++ b/exec.c > > @@ -2818,11 +2818,6 @@ static void *file_ram_alloc(RAMBlock *block, > > return NULL; > > } > > > > - if (kvm_enabled() && !kvm_has_sync_mmu()) { > > - fprintf(stderr, "host lacks kvm mmu notifiers, -mem-path unsupported\n"); > > - return NULL; > > - } > > - > > if (asprintf(&filename, "%s/qemu_back_mem.XXXXXX", path) == -1) { > > return NULL; > > } > > This is nothing trivial, see ce9a92411d in qemu-kvm or > http://thread.gmane.org/gmane.comp.emulators.kvm.devel/27380. And it > should rather target uq/master. CCing Avi, Marcelo, and the kvm list. > > Jan Yes, the check cannot be removed because there is the possibility of corruption using hugepages without mmu notifiers (described in the archived message above). Why are mmu notifiers not implemented for PPC again?
On Fri, Aug 05, 2011 at 12:30:53PM -0300, Marcelo Tosatti wrote: > On Fri, Aug 05, 2011 at 08:16:42AM +0200, Jan Kiszka wrote: > > On 2011-08-05 06:02, David Gibson wrote: > > > At present, an explicit test disallows use of -mem-path when kvm is enabled > > > but KVM_CAP_SYNC_MMU is not set. In particular, this prevents the user > > > from using hugetlbfs to back the guest memory. > > > > > > I can see no reason for this check, and when I asked about it previously, > > > the only theory offered was that this was a limitation of the very early > > > days of kvm which only happened to match the SYNC_MMU flag by accident. > > > > > > This patch, therefore, removes the check. This is of particular use to > > > us on POWER, where we haven't yet implement SYNC_MMU, but where backing > > > the guest with hugepages is possible, and in fact mandatory (for now). > > > > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > > > --- > > > exec.c | 5 ----- > > > 1 files changed, 0 insertions(+), 5 deletions(-) > > > > > > diff --git a/exec.c b/exec.c > > > index 476b507..041637c 100644 > > > --- a/exec.c > > > +++ b/exec.c > > > @@ -2818,11 +2818,6 @@ static void *file_ram_alloc(RAMBlock *block, > > > return NULL; > > > } > > > > > > - if (kvm_enabled() && !kvm_has_sync_mmu()) { > > > - fprintf(stderr, "host lacks kvm mmu notifiers, -mem-path unsupported\n"); > > > - return NULL; > > > - } > > > - > > > if (asprintf(&filename, "%s/qemu_back_mem.XXXXXX", path) == -1) { > > > return NULL; > > > } > > > > This is nothing trivial, see ce9a92411d in qemu-kvm or > > http://thread.gmane.org/gmane.comp.emulators.kvm.devel/27380. And it > > should rather target uq/master. CCing Avi, Marcelo, and the kvm list. > > > > Jan Well, sending the patch flushed out the real reason for that check, at least, as I thought it might. > Yes, the check cannot be removed because there is the possibility of > corruption using hugepages without mmu notifiers (described in the > archived message above). Ok, so. If I understand the archived message correctly. First, this check *is* all about hugepages - which is not obvious from the test itself. Second, if userspace qemu passing hugepages to kvm can cause (host) kernel memory corruption, that is clearly a host kernel bug. So am I correct in thinking this is basically just a safety feature if qemu is run on a buggy kernel. Presumably this bug was corrected at some point? Is the presence of the SYNC_MMU feature just being used as a proxy for "is this kernel recent enough to have the corruption bug fixed"? In any case this test sure as hell needs a big comment next to it explaining this context. > Why are mmu notifiers not implemented for PPC again? It's just not done yet; we're working on it. (That is, mmu notifiers are certainly present on PPC, it's just they're not wired up to kvm, yet).
On 08/08/2011 09:03 AM, David Gibson wrote: > Second, if userspace qemu passing hugepages to kvm can cause (host) > kernel memory corruption, that is clearly a host kernel bug. So am I > correct in thinking this is basically just a safety feature if qemu is > run on a buggy kernel. Seems so, yes. 2.6.2[456] are exploitable. We only found out after these were all released. > Presumably this bug was corrected at some > point? Is the presence of the SYNC_MMU feature just being used as a > proxy for "is this kernel recent enough to have the corruption bug > fixed"? SYNC_MMU actually fixes the bug. > In any case this test sure as hell needs a big comment next to it > explaining this context. Yes. > > > Why are mmu notifiers not implemented for PPC again? > > It's just not done yet; we're working on it. (That is, mmu notifiers > are certainly present on PPC, it's just they're not wired up to kvm, > yet). > If ppc doesn't have this issue even without SYNC_MMU, we can make the check x86 specific.
On Mon, Aug 08, 2011 at 11:24:09AM +0300, Avi Kivity wrote: > On 08/08/2011 09:03 AM, David Gibson wrote: > >Second, if userspace qemu passing hugepages to kvm can cause (host) > >kernel memory corruption, that is clearly a host kernel bug. So am I > >correct in thinking this is basically just a safety feature if qemu is > >run on a buggy kernel. > > Seems so, yes. 2.6.2[456] are exploitable. We only found out after > these were all released. > > >Presumably this bug was corrected at some > >point? Is the presence of the SYNC_MMU feature just being used as a > >proxy for "is this kernel recent enough to have the corruption bug > >fixed"? > > SYNC_MMU actually fixes the bug. Ah, so SYNC_MMU fixed the bug on x86, and all the other archs without SYNC_MMU were left with a serious memory corruption bug, under a userspace bandaid. Thanks for that. As I understand the bug that causes the problem, it's because removing all the hugepage VMAs from userspace will cause the inode (and therefore address_space) for the hugepage file to be freed, but not the pages (because another ref is held by kvm). Then when kvm releases the pages, the address_space will be touched after free from free_huge_page(). This would seem to be a genuine bug in the hugepage code, which has just been hidden by SYNC_MMU. It should be quite easy to fix - the mapping is only stored in the struct page to get to the hugetlbfs superblock, so we could just store a direct superblock pointer instead, and bump it's refcount when we put that in the page private pointer. But then I'm not sure how qemu would detect that it's on a kernel where the bug is fixed and allow -mem-path to be used again. Any ideas?
On 08/10/2011 08:10 AM, David Gibson wrote: > On Mon, Aug 08, 2011 at 11:24:09AM +0300, Avi Kivity wrote: > > On 08/08/2011 09:03 AM, David Gibson wrote: > > >Second, if userspace qemu passing hugepages to kvm can cause (host) > > >kernel memory corruption, that is clearly a host kernel bug. So am I > > >correct in thinking this is basically just a safety feature if qemu is > > >run on a buggy kernel. > > > > Seems so, yes. 2.6.2[456] are exploitable. We only found out after > > these were all released. > > > > >Presumably this bug was corrected at some > > >point? Is the presence of the SYNC_MMU feature just being used as a > > >proxy for "is this kernel recent enough to have the corruption bug > > >fixed"? > > > > SYNC_MMU actually fixes the bug. > > Ah, so SYNC_MMU fixed the bug on x86, and all the other archs without > SYNC_MMU were left with a serious memory corruption bug, under a > userspace bandaid. Thanks for that. Unfortunately it's all too easy to ignore non-x86. It may be considered that not implementing SYNC_MMU is a bug in itself, as it allows userspace to pin arbitrary amounts of user memory. At least on x86 we had shrinkers that kill off shadow page tables under memory pressure, unpinning memory, but I don't see it on ppc. > As I understand the bug that causes the problem, it's because removing > all the hugepage VMAs from userspace will cause the inode (and > therefore address_space) for the hugepage file to be freed, but not > the pages (because another ref is held by kvm). Then when kvm > releases the pages, the address_space will be touched after free from > free_huge_page(). > > This would seem to be a genuine bug in the hugepage code, which has > just been hidden by SYNC_MMU. It should be quite easy to fix - the > mapping is only stored in the struct page to get to the hugetlbfs > superblock, so we could just store a direct superblock pointer > instead, and bump it's refcount when we put that in the page private > pointer. > > But then I'm not sure how qemu would detect that it's on a kernel > where the bug is fixed and allow -mem-path to be used again. Any > ideas? If it's just a kernel bug, the fix belongs in the kernel, not in qemu. We used to have KVM_CAPs to declare this sort of thing (KVM_CAP_HUGETLBFS_WORKS_EVEN_WITHOUT_SYNC_MMU) but I don't think it was a good idea.
On Wed, Aug 10, 2011 at 12:01:04PM +0300, Avi Kivity wrote: > On 08/10/2011 08:10 AM, David Gibson wrote: > >On Mon, Aug 08, 2011 at 11:24:09AM +0300, Avi Kivity wrote: > >> On 08/08/2011 09:03 AM, David Gibson wrote: [snip] > >This would seem to be a genuine bug in the hugepage code, which has > >just been hidden by SYNC_MMU. It should be quite easy to fix - the > >mapping is only stored in the struct page to get to the hugetlbfs > >superblock, so we could just store a direct superblock pointer > >instead, and bump it's refcount when we put that in the page private > >pointer. > > > >But then I'm not sure how qemu would detect that it's on a kernel > >where the bug is fixed and allow -mem-path to be used again. Any > >ideas? > > If it's just a kernel bug, the fix belongs in the kernel, not in qemu. Obviously. > We used to have KVM_CAPs to declare this sort of thing > (KVM_CAP_HUGETLBFS_WORKS_EVEN_WITHOUT_SYNC_MMU) but I don't think it > was a good idea. I tend to agree - especially since there's nothing actually kvm specific about this bug. AFAICT a driver which did gup on hugepages could trigger the bug equally well.
diff --git a/exec.c b/exec.c index 476b507..041637c 100644 --- a/exec.c +++ b/exec.c @@ -2818,11 +2818,6 @@ static void *file_ram_alloc(RAMBlock *block, return NULL; } - if (kvm_enabled() && !kvm_has_sync_mmu()) { - fprintf(stderr, "host lacks kvm mmu notifiers, -mem-path unsupported\n"); - return NULL; - } - if (asprintf(&filename, "%s/qemu_back_mem.XXXXXX", path) == -1) { return NULL; }
At present, an explicit test disallows use of -mem-path when kvm is enabled but KVM_CAP_SYNC_MMU is not set. In particular, this prevents the user from using hugetlbfs to back the guest memory. I can see no reason for this check, and when I asked about it previously, the only theory offered was that this was a limitation of the very early days of kvm which only happened to match the SYNC_MMU flag by accident. This patch, therefore, removes the check. This is of particular use to us on POWER, where we haven't yet implement SYNC_MMU, but where backing the guest with hugepages is possible, and in fact mandatory (for now). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- exec.c | 5 ----- 1 files changed, 0 insertions(+), 5 deletions(-)