Patchwork [QEMU] Transparent Hugepage Support #3

Submitter: Andrea Arcangeli
Date: March 17, 2010, 2:59 p.m.
Message ID: <20100317145950.GA5752@random.random>
Permalink: /patch/47947/
State: New

Comments

Andrea Arcangeli - March 17, 2010, 2:59 p.m.
From: Andrea Arcangeli <aarcange@redhat.com>

This allows proper alignment so NPT/EPT can take advantage of the
Linux host backing the guest memory with hugepages. It also ensures
that when KVM isn't used the first 2M of guest physical memory are
backed by a large TLB entry. To complete it, the patch also notifies
the kernel with madvise that this memory is important to back with
hugepages (needed for both KVM and QEMU), so that hugepages can be
used in embedded systems without any memory waste, and in the future
it will allow khugepaged to prioritize collapsing hugepages in the
madvised regions.
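
As background, a minimal standalone sketch of the
allocate-aligned-then-advise pattern described above, using plain
POSIX/Linux calls rather than qemu's wrappers (sizes are only
examples):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2 * 1024 * 1024) /* x86_64 transparent hugepage size */

int main(void)
{
    void *ram;
    size_t size = 8 * (size_t)HPAGE_SIZE; /* 16M, for example */

    /* Align the host virtual start on the hugepage size so the kernel
     * can back the region with 2M pages from the first byte. */
    if (posix_memalign(&ram, HPAGE_SIZE, size))
        return 1;

#ifdef MADV_HUGEPAGE
    /* Tell the kernel this region is important to back with hugepages. */
    madvise(ram, size, MADV_HUGEPAGE);
#endif

    printf("guest ram at %p\n", ram);
    return 0;
}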

Ideally the max hugepage size provided by the kernel's transparent
hugepage support should be exported by some sysfs file, but there is
no reason to expect x86_64 hosts to have hugepages larger than 2M, or
to expect larger sizes to be supported by the kernel's transparent
hugepage support in the short or medium term. So we can defer the
invention of a fixed kernel API until that happens; by then we'll
surely have a better clue about the best way to provide that
information to userland, and it will be a few-line change to adapt
qemu to use it, so there's no hurry to do it right now. Meanwhile the
code below remains optimal, and there is no risk of memory waste
since virtual memory is practically zero cost on 64bit archs.
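
If the kernel ever exports that value, the few-line change could look
roughly like the sketch below; the sysfs path is an assumed
placeholder, not a file the kernel provides today:

#include <stdio.h>

/* Hypothetical sysfs file exporting the max transparent hugepage size. */
#define THP_SIZE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"

long preferred_ram_align(void)
{
    long size = 2 * 1024 * 1024; /* fall back to the x86_64 default */
    FILE *f = fopen(THP_SIZE_PATH, "r");

    if (f) {
        if (fscanf(f, "%ld", &size) != 1)
            size = 2 * 1024 * 1024;
        fclose(f);
    }
    return size;
}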

NOTE: if the callers of qemu_ram_alloc change significantly, we may
later need to pass a second parameter to qemu_ram_alloc telling it
the first guest physical address that the allocated memory block will
correspond to. I'd defer this change for later too, as it may never
be needed.

I verified this is more than enough to get the maximum benefit from
the kernel-side feature:

cat /sys/kernel/debug/kvm/largepages 
301

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
Paul Brook - March 17, 2010, 3:05 p.m.
> +       if (size >= PREFERRED_RAM_ALIGN)
> +               new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN, size);
> 

Is this deliberately bigger-than rather than multiple-of?
Having the size not be a multiple of alignment seems somewhat strange, it's 
always going to be wrong at one end...

Paul
Andrea Arcangeli - March 17, 2010, 3:14 p.m.
On Wed, Mar 17, 2010 at 03:05:57PM +0000, Paul Brook wrote:
> > +       if (size >= PREFERRED_RAM_ALIGN)
> > +               new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN, size);
> > 
> 
> Is this deliberately bigger-than rather than multiple-of?
> Having the size not be a multiple of alignment seems somewhat strange, it's 
> always going to be wrong at one end...

Size not being a multiple is legitimate, I think: the below-4G chunk
isn't required to end 2M aligned; all that matters is that the
above-4G chunk then starts aligned. In short, one thing to add in the
future as a parameter to qemu_ram_alloc is the guest physical address
that the host virtual address corresponds to. The guest physical
address that the host retval corresponds to has to be aligned with
PREFERRED_RAM_ALIGN for NPT/EPT to work. I don't think it's a big
concern right now.
Paul Brook - March 17, 2010, 3:21 p.m.
> On Wed, Mar 17, 2010 at 03:05:57PM +0000, Paul Brook wrote:
> > > +       if (size >= PREFERRED_RAM_ALIGN)
> > > +               new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN,
> > > size);
> >
> > Is this deliberately bigger-than rather than multiple-of?
> > Having the size not be a multiple of alignment seems somewhat strange,
> > it's always going to be wrong at one end...
> 
> Size not being a multiple is legitimate, I think: the below-4G chunk
> isn't required to end 2M aligned; all that matters is that the
> above-4G chunk then starts aligned. In short, one thing to add in the
> future as a parameter to qemu_ram_alloc is the guest physical address
> that the host virtual address corresponds to.

In general you don't know this at allocation time.

> The guest physical address that the host retval corresponds to has
> to be aligned with PREFERRED_RAM_ALIGN for NPT/EPT to work. I don't
> think it's a big concern right now.
 
If you're allocating chunks that aren't multiples of the relevant page
size, then I don't think you can expect anything particularly sensible
to happen.

Paul
Andrea Arcangeli - March 17, 2010, 3:35 p.m.
On Wed, Mar 17, 2010 at 03:21:26PM +0000, Paul Brook wrote:
> > On Wed, Mar 17, 2010 at 03:05:57PM +0000, Paul Brook wrote:
> > > > +       if (size >= PREFERRED_RAM_ALIGN)
> > > > +               new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN,
> > > > size);
> > >
> > > Is this deliberately bigger-than rather than multiple-of?
> > > Having the size not be a multiple of alignment seems somewhat strange,
> > > it's always going to be wrong at one end...
> > 
> > Size not being a multiple is legitimate, I think: the below-4G chunk
> > isn't required to end 2M aligned; all that matters is that the
> > above-4G chunk then starts aligned. In short, one thing to add in the
> > future as a parameter to qemu_ram_alloc is the guest physical address
> > that the host virtual address corresponds to.
> 
> In general you don't know this at allocation time.

The caller knows it; it's not like the caller is outside of qemu, it's
not some library. We know this is enough for the caller that exists
now.

Again, there is absolutely no relation between the "size" and
this. Size can be anything; it's absolutely irrelevant.

All that matters is the _start_: both the guest physical address
_start_ and the host virtual address _start_. And they don't have to
be aligned to 2M; their alignment or misalignment simply has to
match, and this is the simplest way to make them match.
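
To make the matching-starts point concrete, here is the byte-address
form of the condition from the patch comment, checked on illustrative
numbers:

#include <stdio.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

int main(void)
{
    /* Guest physical and host virtual _start_ of a ram block. */
    unsigned long gpa = 0x100000000UL;    /* 4G boundary, 2M aligned */
    unsigned long hva = 0x7f1240000000UL; /* 2M aligned */

    /* A hugepage can only map a pair of starts whose offsets within a
     * 2M region are identical. */
    printf("%d\n", ((gpa ^ hva) & (HPAGE_SIZE - 1)) == 0); /* 1: match */

    /* Equally misaligned starts still match: both 1M into a 2M region. */
    gpa += 1024 * 1024;
    hva += 1024 * 1024;
    printf("%d\n", ((gpa ^ hva) & (HPAGE_SIZE - 1)) == 0); /* 1: match */

    /* Mismatched alignment breaks it. */
    hva += 4096;
    printf("%d\n", ((gpa ^ hva) & (HPAGE_SIZE - 1)) == 0); /* 0: no match */
    return 0;
}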

> > The guest physical address that the host retval corresponds to has
> > to be aligned with PREFERRED_RAM_ALIGN for NPT/EPT to work. I don't
> > think it's a big concern right now.
>  
> If you're allocating chunks that aren't multiples of the relevant
> page size, then I don't think you can expect anything particularly
> sensible to happen.

If you want me to do a bigger, more complex patch that passes down to
qemu_ram_alloc the actual guest physical address that the virtual
address returned by qemu_ram_alloc will correspond to, I will do
it. That would likely be something like qemu_ram_alloc_align.

And if somebody volunteers so I don't have to do it, you're welcome.

I don't care how this happens, but it must happen.
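
For illustration only, a sketch of what such a qemu_ram_alloc_align
could look like; the name is just the hypothetical one mentioned above
and the offset logic is an assumption, not the real patch:

#include <stdlib.h>

#define PREFERRED_RAM_ALIGN (2 * 1024 * 1024)

/* Hypothetical helper: allocate host memory for a ram block that the
 * machine will map at guest physical address guest_phys, giving the
 * host virtual start the same offset within a 2M region so the starts
 * match even when guest_phys itself isn't 2M aligned. */
void *qemu_ram_alloc_align(size_t size, unsigned long guest_phys)
{
    size_t offset = guest_phys & (PREFERRED_RAM_ALIGN - 1);
    void *base;

    /* Over-allocate by one alignment unit so we can slide forward to
     * the wanted offset; a real implementation would keep base around
     * for freeing, or use mmap and trim the edges. */
    if (posix_memalign(&base, PREFERRED_RAM_ALIGN,
                       size + PREFERRED_RAM_ALIGN))
        return NULL;
    return (char *)base + offset;
}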
Paul Brook - March 17, 2010, 3:52 p.m.
> > > Size not being a multiple is legitimate, I think: the below-4G chunk
> > > isn't required to end 2M aligned; all that matters is that the
> > > above-4G chunk then starts aligned. In short, one thing to add in the
> > > future as a parameter to qemu_ram_alloc is the guest physical address
> > > that the host virtual address corresponds to.
> >
> > In general you don't know this at allocation time.
> 
> The caller knows it; it's not like the caller is outside of qemu,
> it's not some library. We know this is enough for the caller that
> exists now.

No we don't.  As discussed previously, there are machines where the physical 
location of RAM is configurable at runtime.  In fact it's common for the ram 
to be completely absent at reset.

Paul
Andrea Arcangeli - March 17, 2010, 3:55 p.m.
On Wed, Mar 17, 2010 at 03:52:15PM +0000, Paul Brook wrote:
> > > > Size not being a multiple is legitimate, I think: the below-4G chunk
> > > > isn't required to end 2M aligned; all that matters is that the
> > > > above-4G chunk then starts aligned. In short, one thing to add in the
> > > > future as a parameter to qemu_ram_alloc is the guest physical address
> > > > that the host virtual address corresponds to.
> > >
> > > In general you don't know this at allocation time.
> > 
> > The caller knows it; it's not like the caller is outside of qemu,
> > it's not some library. We know this is enough for the caller that
> > exists now.
> 
> No we don't.  As discussed previously, there are machines where the physical 
> location of RAM is configurable at runtime.  In fact it's common for the ram 
> to be completely absent at reset.

This is why PREFERRED_RAM_ALIGN is only defined for __x86_64__. I'm
not talking about other archs that may never support transparent
hugepages in the kernel because of other architectural constraints
that may prevent mapping hugepages mixed with regular pages in the
same vma.
Paul Brook - March 17, 2010, 4:07 p.m.
> On Wed, Mar 17, 2010 at 03:52:15PM +0000, Paul Brook wrote:
> > > > > Size not being a multiple is legitimate, I think: the below-4G
> > > > > chunk isn't required to end 2M aligned; all that matters is that
> > > > > the above-4G chunk then starts aligned. In short, one thing to
> > > > > add in the future as a parameter to qemu_ram_alloc is the guest
> > > > > physical address that the host virtual address corresponds to.
> > > >
> > > > In general you don't know this at allocation time.
> > >
> > > The caller knows it; it's not like the caller is outside of qemu,
> > > it's not some library. We know this is enough for the caller that
> > > exists now.
> >
> > No we don't.  As discussed previously, there are machines where the
> > physical location of RAM is configurable at runtime.  In fact it's common
> > for the ram to be completely absent at reset.
> 
> This is why PREFERRED_RAM_ALIGN is only defined for __x86_64__. I'm
> not talking about other archs that may never support transparent
> hugepages in the kernel because of other architectural constraints
> that may prevent mapping hugepages mixed with regular pages in the
> same vma.

__x86_64__ only tells you about the host. I'm talking about the guest machine.

Paul
Andrea Arcangeli - March 17, 2010, 4:23 p.m.
On Wed, Mar 17, 2010 at 04:07:09PM +0000, Paul Brook wrote:
> > On Wed, Mar 17, 2010 at 03:52:15PM +0000, Paul Brook wrote:
> > > > > > Size not being a multiple is legitimate, I think: the below-4G
> > > > > > chunk isn't required to end 2M aligned; all that matters is that
> > > > > > the above-4G chunk then starts aligned. In short, one thing to
> > > > > > add in the future as a parameter to qemu_ram_alloc is the guest
> > > > > > physical address that the host virtual address corresponds to.
> > > > >
> > > > > In general you don't know this at allocation time.
> > > >
> > > > The caller knows it; it's not like the caller is outside of qemu,
> > > > it's not some library. We know this is enough for the caller that
> > > > exists now.
> > >
> > > No we don't.  As discussed previously, there are machines where the
> > > physical location of RAM is configurable at runtime.  In fact it's common
> > > for the ram to be completely absent at reset.
> > 
> > This is why PREFERRED_RAM_ALIGN is only defined for __x86_64__. I'm
> > not talking about other archs that may never support transparent
> > hugepages in the kernel because of other architectural constraints
> > that may prevent mapping hugepages mixed with regular pages in the
> > same vma.
> 
> __x86_64__ only tells you about the host. I'm talking about the guest machine.

When it's qemu and not kvm (so when the guest might not be an x86
arch), the guest physical address becomes as irrelevant as the size,
and only the host virtual address has to start 2M aligned on an
x86_64 host.

I think this already takes care of all practical issues, and there's
no need for further work until pc.c starts allocating chunks of ram
at guest physical addresses that aren't 2M aligned. Maybe if we add
memory hotplug or something.

Patch

diff --git a/exec.c b/exec.c
index 14767b7..ab33f6b 100644
--- a/exec.c
+++ b/exec.c
@@ -2745,6 +2745,18 @@  static void *file_ram_alloc(ram_addr_t memory, const char *path)
 }
 #endif
 
+#if defined(__linux__) && defined(__x86_64__)
+/*
+ * Align on the max transparent hugepage size so that
+ * "(gfn ^ pfn) & (HPAGE_SIZE-1) == 0" to allow KVM to
+ * take advantage of hugepages with NPT/EPT or to
+ * ensure the first 2M of the guest physical ram will
+ * be mapped by the same hugetlb for QEMU (it is worth
+ * it even without NPT/EPT).
+ */
+#define PREFERRED_RAM_ALIGN (2*1024*1024)
+#endif
+
 ram_addr_t qemu_ram_alloc(ram_addr_t size)
 {
     RAMBlock *new_block;
@@ -2768,11 +2780,19 @@  ram_addr_t qemu_ram_alloc(ram_addr_t size)
                                 PROT_EXEC|PROT_READ|PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
 #else
-        new_block->host = qemu_vmalloc(size);
+#ifdef PREFERRED_RAM_ALIGN
+        if (size >= PREFERRED_RAM_ALIGN)
+            new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN, size);
+        else
+#endif
+            new_block->host = qemu_vmalloc(size);
 #endif
 #ifdef MADV_MERGEABLE
         madvise(new_block->host, size, MADV_MERGEABLE);
 #endif
+#ifdef MADV_HUGEPAGE
+        madvise(new_block->host, size, MADV_HUGEPAGE);
+#endif
     }
     new_block->offset = last_ram_offset;
     new_block->length = size;