Patchwork [QEMU] Transparent Hugepage Support #2

Submitter Andrea Arcangeli
Date March 16, 2010, 4:46 p.m.
Message ID <20100316164641.GB5717@random.random>
Download mbox | patch
Permalink /patch/47871/
State New
Headers show

Comments

Andrea Arcangeli - March 16, 2010, 4:46 p.m.
From: Andrea Arcangeli <aarcange@redhat.com>

This allows proper alignment so that NPT/EPT can take advantage of the
Linux host backing the guest memory with hugepages. It also ensures
that, when KVM isn't used, the first 2M of guest physical memory are
backed by a large TLB entry. To complete it, the patch also uses
madvise to notify the kernel that this memory is worth backing with
hugepages (needed for both KVM and QEMU), so that hugepages can be
used in embedded systems without any memory waste, and so that in the
future khugepaged can prioritize collapsing hugepages in the madvised
regions.

Ideally the maximum hugepage size provided by the kernel's transparent
hugepage support should be exported through some sysfs file. However,
there is no reason to expect an x86_64 host to have hugepages larger
than 2M, or to expect larger sizes to be supported by the kernel's
transparent hugepage code in the short or medium term, so we can defer
inventing a fixed kernel API until that happens. By then we'll have a
much better idea of the best way to provide that information to
userland, and adapting qemu to use it will be a few-line change, so
there's no hurry to do it now. Meanwhile the change below remains
optimal, and there is no risk of memory waste since virtual memory is
practically zero cost on 64-bit archs.

NOTE: if the callers of qemu_ram_alloc change significantly, we may
later need to pass a second parameter to qemu_ram_alloc telling it the
first guest physical address that corresponds to the memory block
being allocated. I'd defer this change too, as it may never be needed.

I verified this is more than enough to get the max benefit from the
kernel side feature.

cat /sys/kernel/debug/kvm/largepages 
301

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
Jamie Lokier - March 16, 2010, 5:10 p.m.
Andrea Arcangeli wrote:
> +		 * take advantage of hugepages with NPT/EPP or to

Spelling: NPT/EPT?

-- Jamie
Paul Brook - March 16, 2010, 11:48 p.m.
> +#if defined(__linux__) && defined(__x86_64__)
> +#define MAX_TRANSPARENT_HUGEPAGE_SIZE (2*1024*1024)
> +       if (size >= MAX_TRANSPARENT_HUGEPAGE_SIZE)

I'd prefer something like:

#if defined(__linux__) && defined(__x86_64__)
/* [...Allow the host to use huge pages easily...].  */
#define PREFERRED_RAM_ALIGN (2*1024*1024)
#endif

qemu_ram_alloc(...)
{...
#ifdef PREFERRED_RAM_ALIGN
  if ((size & (PREFERRED_RAM_ALIGN - 1)) == 0) {
    new_block->host = qemu_memalign(...)
  }
#endif


i.e. separate the architecture specific knowledge from the generic alignment 
code.

Paul
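Paul's suggestion above, filled out into a compilable sketch (the
function name ram_alloc_host is hypothetical, and posix_memalign/malloc
stand in for QEMU's qemu_memalign/qemu_vmalloc helpers):

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * The architecture-specific knowledge lives in one place: define
 * PREFERRED_RAM_ALIGN only where a larger alignment pays off.
 */
#if defined(__linux__) && defined(__x86_64__)
/* Allow the host to back guest RAM with 2M transparent hugepages. */
#define PREFERRED_RAM_ALIGN (2 * 1024 * 1024)
#endif

/* Hypothetical stand-in for the allocation path in qemu_ram_alloc. */
static void *ram_alloc_host(size_t size)
{
#ifdef PREFERRED_RAM_ALIGN
    /* Align only sizes that are a whole multiple of the alignment. */
    if ((size & (PREFERRED_RAM_ALIGN - 1)) == 0) {
        void *p = NULL;
        if (posix_memalign(&p, PREFERRED_RAM_ALIGN, size) == 0)
            return p;
    }
#endif
    /* Generic fallback with no special alignment. */
    return malloc(size);
}
```

The generic code only tests whether PREFERRED_RAM_ALIGN is defined, so
adding another architecture is a two-line change in the #if block.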

Patch

diff --git a/exec.c b/exec.c
index b0b6056..9552366 100644
--- a/exec.c
+++ b/exec.c
@@ -2733,11 +2733,30 @@  ram_addr_t qemu_ram_alloc(ram_addr_t size)
                                 PROT_EXEC|PROT_READ|PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
 #else
-        new_block->host = qemu_vmalloc(size);
+#if defined(__linux__) && defined(__x86_64__)
+#define MAX_TRANSPARENT_HUGEPAGE_SIZE (2*1024*1024)
+	if (size >= MAX_TRANSPARENT_HUGEPAGE_SIZE)
+		/*
+		 * Align on the max transparent hugepage size so that
+		 * "(gfn ^ pfn) & (HPAGE_SIZE-1) == 0" to allow KVM to
+		 * take advantage of hugepages with NPT/EPP or to
+		 * ensure the first 2M of the guest physical ram will
+		 * be mapped by the same hugetlb for QEMU (it is worth
+		 * it even without NPT/EPT).
+		 */
+		new_block->host = qemu_memalign(MAX_TRANSPARENT_HUGEPAGE_SIZE,
+						size);
+	else
+#undef MAX_TRANSPARENT_HUGEPAGE_SIZE
+#endif 
+		new_block->host = qemu_vmalloc(size);
 #endif
 #ifdef MADV_MERGEABLE
         madvise(new_block->host, size, MADV_MERGEABLE);
 #endif
+#ifdef MADV_HUGEPAGE
+        madvise(new_block->host, size, MADV_HUGEPAGE);
+#endif
     }
     new_block->offset = last_ram_offset;
     new_block->length = size;