Patchwork [QEMU] transparent hugepage support

Submitter Andrea Arcangeli
Date March 11, 2010, 3:14 p.m.
Message ID <20100311151427.GE5677@random.random>
Permalink /patch/47342/
State New

Comments

Andrea Arcangeli - March 11, 2010, 3:14 p.m.
From: Andrea Arcangeli <aarcange@redhat.com>

This will allow proper alignment so NPT/EPT can take advantage of linux host
backing the guest memory with hugepages (only relevant for KVM and not for QEMU
that has no NPT/EPT support). To complete it, it will also notify
the kernel that this memory is important to be backed by hugepages with
madvise (needed for both KVM and QEMU).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
Avi Kivity - March 11, 2010, 3:52 p.m.
On 03/11/2010 05:14 PM, Andrea Arcangeli wrote:
> From: Andrea Arcangeli<aarcange@redhat.com>
>
> This will allow proper alignment so NPT/EPT can take advantage of linux host
> backing the guest memory with hugepages (only relevant for KVM and not for QEMU
> that has no NPT/EPT support). To complete it, it will also notify
> the kernel that this memory is important to be backed by hugepages with
> madvise (needed for both KVM and QEMU).
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> ---
>
> diff --git a/exec.c b/exec.c
> index 891e0ee..aedd133 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2628,11 +2628,25 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
>                                   PROT_EXEC|PROT_READ|PROT_WRITE,
>                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>   #else
> -        new_block->host = qemu_vmalloc(size);
> -#endif
> +#if defined(__linux__) && defined(TARGET_HPAGE_BITS)
> +	if (!kvm_enabled())
> +#endif
> +		new_block->host = qemu_vmalloc(size);
> +#if defined(__linux__) && defined(TARGET_HPAGE_BITS)
> +	else
> +		/*
> +		 * Align on HPAGE_SIZE so "(gfn ^ pfn) &
> +		 * (HPAGE_SIZE-1) == 0" to allow KVM to take advantage
> +		 * of hugepages with NPT/EPT.
> +		 */
> +		new_block->host = qemu_memalign(1 << TARGET_HPAGE_BITS, size);
>    

That is a little wasteful.  How about a hint to mmap() requesting proper 
alignment (MAP_HPAGE_ALIGN)?

Failing that, modify qemu_memalign() to trim excess memory.

Come to think of it, posix_memalign() needs to do that (but doesn't).

> +#endif
>   #ifdef MADV_MERGEABLE
>           madvise(new_block->host, size, MADV_MERGEABLE);
>   #endif
> +#ifdef MADV_HUGEPAGE
> +        madvise(new_block->host, size, MADV_HUGEPAGE);
> +#endif
>       }
Andrea Arcangeli - March 11, 2010, 4:05 p.m.
On Thu, Mar 11, 2010 at 05:52:16PM +0200, Avi Kivity wrote:
> That is a little wasteful.  How about a hint to mmap() requesting proper 
> alignment (MAP_HPAGE_ALIGN)?

So you suggest adding a new kernel feature to mmap? Not sure if it's
worth it, considering it'd also increase the number of vmas because it
will have to leave a hole. Wasting 2M-4k of virtual memory is likely
cheaper than having 1 more vma in the rbtree for every page fault. So
I think it's better to just malloc and adjust ourselves to the next
aligned offset, which I think qemu_memalign already does in userland.

What we could ask the kernel for is the HPAGE_SIZE. Also, thinking a bit
more about it, what we really care about is the
HOST_HPAGE_SIZE. That said, I doubt it makes a lot of difference
for kvm, and this only changes the kvm path. I'm open to suggestions
of where to get the HPAGE_SIZE from and how to call it...

> Failing that, modify qemu_memalign() to trim excess memory.
> 
> Come to think of it, posix_memalign() needs to do that (but doesn't).

It's hard to tell because of the amount of #ifdefs in .c files, but it
seems to be using posix_memalign.

If we don't touch these additional pages allocated and there's no
transparent hugepage support in the kernel, you won't waste any more
memory and less vmas will be generated this way than with a kernel
option to reduce the virtual memory waste. Basically the idea is to
waste virtual memory to avoid wasting cpu.

In short we should make sure it only wastes virtual memory...
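The trade-off described above (spend virtual address space rather than ask the kernel for alignment) can be sketched in C as follows. This is only an illustration under stated assumptions: `alloc_hpage_aligned` is a hypothetical helper, not a QEMU function, and it uses a plain anonymous mmap() instead of qemu_memalign(). The slack before and after the aligned range is never touched, so it should cost address space but no physical memory.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not part of QEMU): reserve `size` bytes of anonymous
 * memory whose start address is a multiple of `align` (a power of two).
 * The raw mapping base is not returned, so this sketch cannot unmap the
 * region later; a real allocator would have to remember it.
 */
static void *alloc_hpage_aligned(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Over-reserve by `align` bytes; pages we never touch stay unbacked. */
    size_t total = size + align;
    void *raw = mmap(NULL, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    /* Round the base up to the next multiple of `align`. */
    uintptr_t start = ((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1);

#ifdef MADV_HUGEPAGE
    /* Advisory only: the result is ignored if the kernel refuses the hint. */
    (void)madvise((void *)start, size, MADV_HUGEPAGE);
#endif
    return (void *)start;
}
```

A caller would pass the host hugepage size as `align`, for example `alloc_hpage_aligned(ram_size, 2u << 20)` on a host with 2M hugepages.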
Paul Brook - March 11, 2010, 4:28 p.m.
> > +		/*
> > +		 * Align on HPAGE_SIZE so "(gfn ^ pfn) &
> > +		 * (HPAGE_SIZE-1) == 0" to allow KVM to take advantage
> > +		 * of hugepages with NPT/EPT.
> > +		 */
> > +		new_block->host = qemu_memalign(1 << TARGET_HPAGE_BITS, size);

This should not be target dependent. i.e. it should be the host page size.

> That is a little wasteful.  How about a hint to mmap() requesting proper
> alignment (MAP_HPAGE_ALIGN)?

I'd kinda hope that we wouldn't need to. i.e. the host kernel is smart enough 
to automatically align large allocations anyway.

This is probably a useful optimization regardless of KVM.

Paul
Andrea Arcangeli - March 11, 2010, 4:46 p.m.
On Thu, Mar 11, 2010 at 04:28:04PM +0000, Paul Brook wrote:
> > > +		/*
> > > +		 * Align on HPAGE_SIZE so "(gfn ^ pfn) &
> > > +		 * (HPAGE_SIZE-1) == 0" to allow KVM to take advantage
> > > +		 * of hugepages with NPT/EPT.
> > > +		 */
> > > +		new_block->host = qemu_memalign(1 << TARGET_HPAGE_BITS, size);
> 
> This should not be target dependent. i.e. it should be the host page size.

Yep I noticed. I'm not aware of an official way to get that
information out of the kernel (hugepagesize in /proc/meminfo is
dependent on hugetlbfs which in turn is not a dependency for
transparent hugepage support) but hey I can add it myself to
/sys/kernel/mm/transparent_hugepage/hugepage_size !

> > That is a little wasteful.  How about a hint to mmap() requesting proper
> > alignment (MAP_HPAGE_ALIGN)?
> 
> I'd kinda hope that we wouldn't need to. i.e. the host kernel is smart enough 
> to automatically align large allocations anyway.

The kernel won't do that, and the main reason is to avoid creating more
vmas. It's more efficient to waste virtual space and have userland
allocate more than needed than to ask the kernel for alignment and force it
to create more vmas because of the holes that generates. Virtual
memory costs nothing.

Also khugepaged can later zero out the pte_none regions to create a
full segment all backed by hugepages, however if we do that khugepaged
will eat into the free memory space. At the moment I kept khugepaged a
zero-memory-footprint thing. But I'm currently adding an option called
collapse_unmapped to allow khugepaged to collapse unmapped pages too
so if there are only 2/3 pages in the region before the memalign, they
also can be mapped by a large tlb to allow qemu run faster.

> This is probably a useful optimization regardless of KVM.

HPAGE alignment is only useful with KVM because it can only payoff
with EPT/NPT, transparent hugepage already works fine without that
(but ok it'd be a microoptimization for the first and last few pages
in the whole vma). This is why I made it conditional to
kvm_enabled(). I can remove the kvm_enabled() check if you worry about
the first and last pages in the huge anon vma.

OTOH the madvise(MADV_HUGEPAGE) is surely good idea for qemu too. KVM
normally runs on 64bit hosts, so it's no big deal if we waste 1M of
virtual memory here and there but I thought on qemu you preferred not
to have alignment and have the first few and last few pages in a vma
not backed by large tlb. Ideally we should also align on hpage size if
sizeof(long) = 8. Not sure what's the recommended way to code that
though and it'll make it a bit more complex for little good.
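One possible way to code the "only align when sizeof(long) = 8" idea is a compile-time check on the pointer width, so that 32-bit hosts keep the default alignment and do not lose scarce address space. The sketch below is an assumption about how this could look; `choose_ram_align` is a hypothetical name, and `UINTPTR_MAX` is used instead of `sizeof(long)` because `long` is not 64 bits wide on every 64-bit ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pick the alignment for a guest RAM block.
 * `hpage_size` is the host hugepage size (0 if unknown) and `page_size`
 * is the normal host page size.
 */
static size_t choose_ram_align(size_t hpage_size, size_t page_size)
{
    assert(page_size != 0);
#if UINTPTR_MAX > 0xffffffffu
    /* 64-bit host: virtual address space is plentiful, align on hugepages. */
    return hpage_size ? hpage_size : page_size;
#else
    /* 32-bit host: keep the default alignment to save address space. */
    (void)hpage_size;
    return page_size;
#endif
}
```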
Paul Brook - March 11, 2010, 5:55 p.m.
> On Thu, Mar 11, 2010 at 04:28:04PM +0000, Paul Brook wrote:
> > > > +		/*
> > > > +		 * Align on HPAGE_SIZE so "(gfn ^ pfn) &
> > > > +		 * (HPAGE_SIZE-1) == 0" to allow KVM to take advantage
> > > > +		 * of hugepages with NPT/EPT.
> > > > +		 */
> > > > +		new_block->host = qemu_memalign(1 << TARGET_HPAGE_BITS, size);
> >
> > This should not be target dependent. i.e. it should be the host page
> > size.
> 
> Yep I noticed. I'm not aware of an official way to get that
> information out of the kernel (hugepagesize in /proc/meminfo is
> dependent on hugetlbfs which in turn is not a dependency for
> transparent hugepage support) but hey I can add it myself to
> /sys/kernel/mm/transparent_hugepage/hugepage_size !

sysconf(_SC_HUGEPAGESIZE); would seem to be the obvious answer.
 
> > > That is a little wasteful.  How about a hint to mmap() requesting
> > > proper alignment (MAP_HPAGE_ALIGN)?
> >
> > I'd kinda hope that we wouldn't need to. i.e. the host kernel is smart
> > enough to automatically align large allocations anyway.
> 
> The kernel won't do that, and the main reason is to avoid creating more
> vmas. It's more efficient to waste virtual space and have userland
> allocate more than needed than to ask the kernel for alignment and force it
> to create more vmas because of the holes that generates. Virtual
> memory costs nothing.

Huh. That seems unfortunate :-(

> Also khugepaged can later zero out the pte_none regions to create a
> full segment all backed by hugepages, however if we do that khugepaged
> will eat into the free memory space. At the moment I kept khugepaged a
> zero-memory-footprint thing. But I'm currently adding an option called
> collapse_unmapped to allow khugepaged to collapse unmapped pages too
> so if there are only 2/3 pages in the region before the memalign, they
> also can be mapped by a large tlb to allow qemu run faster.

I don't really understand what you're getting at here. Surely a naturally 
aligned block is always going to be easier to defragment than a misaligned 
block.

If the allocation size is not a multiple of the preferred alignment, then you 
probably lose either way, and we shouldn't be requesting increased alignment.

> > This is probably a useful optimization regardless of KVM.
> 
> HPAGE alignment is only useful with KVM because it can only payoff
> with EPT/NPT, transparent hugepage already works fine without that
> (but ok it'd be a microoptimization for the first and last few pages
> in the whole vma). This is why I made it conditional to
> kvm_enabled(). I can remove the kvm_enabled() check if you worry about
> the first and last pages in the huge anon vma.

I wouldn't be surprised if putting the start of guest ram on a large TLB entry 
was a win. Your guest kernel often lives there!

> OTOH the madvise(MADV_HUGEPAGE) is surely good idea for qemu too. KVM
> normally runs on 64bit hosts, so it's no big deal if we waste 1M of
> virtual memory here and there but I thought on qemu you preferred not
> to have alignment and have the first few and last few pages in a vma
> not backed by large tlb. Ideally we should also align on hpage size if
> sizeof(long) = 8. Not sure what's the recommended way to code that
> though and it'll make it a bit more complex for little good.

Assuming we're allocating in large chunks, I doubt an extra hugepage worth of 
VMA is a big issue.

Either way I'd argue that this isn't something qemu should have to care about, 
and is actually a bug in posix_memalign.

Paul
Andrea Arcangeli - March 11, 2010, 6:49 p.m.
On Thu, Mar 11, 2010 at 05:55:10PM +0000, Paul Brook wrote:
> sysconf(_SC_HUGEPAGESIZE); would seem to be the obvious answer.

There's not just one hugepage size and that thing doesn't exist yet
plus it'd require mangling over glibc too. If it existed I could use
it but I think this is better:

$ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size 
2097152

Ok? If this file doesn't exist we won't align. If it does, we align for
plain qemu too, not only for kvm, which addresses the concern below about
the first and last pages.

> > Also khugepaged can later zero out the pte_none regions to create a
> > full segment all backed by hugepages, however if we do that khugepaged
> > will eat into the free memory space. At the moment I kept khugepaged a
> > zero-memory-footprint thing. But I'm currently adding an option called
> > collapse_unmapped to allow khugepaged to collapse unmapped pages too
> > so if there are only 2/3 pages in the region before the memalign, they
> > also can be mapped by a large tlb to allow qemu run faster.
> 
> I don't really understand what you're getting at here. Surely a naturally 
> aligned block is always going to be easier to defragment than a misaligned 
> block.

Basically what I was saying is about touching subpages 0 and 1 of a
hugepage; then posix_memalign extends the vma and nobody is ever going
to touch pages 2-511 because those are the virtual addresses
wasted. Before, khugepaged couldn't allocate a hugepage for only pages
0 and 1 because the vma stopped there, but later, after the vma is
extended, it can. So previously I wasn't mapping this range with a
hugepage, but now I'm mapping it with a hugepage too. And a sysfs
control will select the max number of unmapped subpages for the
collapse to happen. For just 1 subpage mapped in the hugepage virtual
range, it won't make sense to use large tlb and waste 511 pages of
ram.

> If the allocation size is not a multiple of the preferred alignment, then you 
> probably lose either way, and we shouldn't be requesting increased alignment.

That's probably good idea. Also note, if we were to allocate
separately the 0-640k 1m-end, for NPT to work we'd need to start the
second block misaligned at a 1m address. So maybe I should move the
alignment out of qemu_ram_alloc and have it in the caller?

> I wouldn't be surprised if putting the start of guest ram on a large TLB entry 
> was a win. Your guest kernel often lives there!

Yep, that's easy to handle with the hpage_pmd_size ;).

> Assuming we're allocating in large chunks, I doubt an extra hugepage worth of 
> VMA is a big issue.
> 
> Either way I'd argue that this isn't something qemu should have to care about, 
> and is actually a bug in posix_memalign.

Hmm the last is a weird claim considering posix_memalign gets an explicit
alignment parameter and it surely can't choose what alignment to
use. We can argue about the kernel side having to align automatically
but again if it would do that, it'd generate unnecessary vma holes
which we don't want.

I think it's quite simple, just use my new sysfs control, if it exists
always use that alignment instead of the default. We've only to decide
if to align inside or outside of qemu_ram_alloc.
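A minimal sketch of how userland could consume such a sysfs control is shown below. It assumes the file contains a single decimal byte count, as in the `cat` output above. `read_hpage_size` is a hypothetical helper, and returning 0 stands for "file missing or unusable, do not align".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: read a hugepage size in bytes from a sysfs-style
 * file such as /sys/kernel/mm/transparent_hugepage/hpage_pmd_size.
 * Returns 0 if the file is absent, unparsable, or not a power of two,
 * in which case the caller keeps its default alignment.
 */
static size_t read_hpage_size(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (!f)
        return 0;

    unsigned long long value = 0;
    int parsed = fscanf(f, "%llu", &value);
    fclose(f);

    /* An alignment must be a non-zero power of two to be usable. */
    if (parsed != 1 || value == 0 || (value & (value - 1)) != 0)
        return 0;
    return (size_t)value;
}
```

The path is passed in rather than hard-coded so that the same function works whichever file name the kernel side settles on.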
Paul Brook - March 12, 2010, 11:36 a.m.
> On Thu, Mar 11, 2010 at 05:55:10PM +0000, Paul Brook wrote:
> > sysconf(_SC_HUGEPAGESIZE); would seem to be the obvious answer.
> 
> There's not just one hugepage size 

We only have one madvise flag...

> and that thing doesn't exist yet
> plus it'd require mangling over glibc too. If it existed I could use
> it but I think this is better:
 
> $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> 2097152

Is "pmd" x86 specific?

> > If the allocation size is not a multiple of the preferred alignment, then
> > you probably lose either way, and we shouldn't be requesting increased
> > alignment.
> 
> That's probably good idea. Also note, if we were to allocate
> separately the 0-640k 1m-end, for NPT to work we'd need to start the
> second block misaligned at a 1m address. So maybe I should move the
> alignment out of qemu_ram_alloc and have it in the caller?

I think the only viable solution if you care about EPT/NPT is to not do that. 
With your current code the 1m-end region will be misaligned - your code 
allocates it on a 2M boundary. I suspect you actually want (base % 2M) == 1M. 
Aligning on a 1M boundary will only DTRT half the time.
 
> > I wouldn't be surprised if putting the start of guest ram on a large TLB
> > entry was a win. Your guest kernel often lives there!
> 
> Yep, that's easy to handle with the hpage_pmd_size ;).

But that's only going to happen if you align the allocation.

> > Assuming we're allocating in large chunks, I doubt an extra hugepage
> > worth of VMA is a big issue.
> >
> > Either way I'd argue that this isn't something qemu should have to care
> > about, and is actually a bug in posix_memalign.
> 
> Hmm the last is a weird claim considering posix_memalign gets an explicit
> alignment parameter and it surely can't choose what alignment to
> use. We can argue about the kernel side having to align automatically
> but again if it would do that, it'd generate unnecessary vma holes
> which we don't want.

It can't choose what align to use, but it can (should?) choose how to achieve 
that alignment.

Paul
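The "(base % 2M) == 1M" requirement is a special case of the condition in the patch comment: the host virtual address and the guest physical address must have the same offset within a hugepage. The arithmetic can be sketched as below; both function names are hypothetical, and addresses are held in `uint64_t` purely to keep the example independent of the host pointer width.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if host and guest addresses share the same offset inside a hugepage. */
static bool same_hpage_offset(uint64_t host, uint64_t guest, uint64_t hpage)
{
    assert(hpage != 0 && (hpage & (hpage - 1)) == 0);
    return ((host ^ guest) & (hpage - 1)) == 0;
}

/*
 * Hypothetical helper: smallest address >= raw whose offset within a
 * hugepage equals that of guest_phys. For guest_phys = 1M and hpage = 2M
 * the result satisfies (result % 2M) == 1M.
 */
static uint64_t align_congruent(uint64_t raw, uint64_t guest_phys,
                                uint64_t hpage)
{
    assert(hpage != 0 && (hpage & (hpage - 1)) == 0);

    uint64_t mask = hpage - 1;
    uint64_t want = guest_phys & mask;
    /* Forward distance from raw to the next address with low bits == want. */
    uint64_t delta = (want - (raw & mask)) & mask;
    return raw + delta;
}
```

With this formulation a block that starts at guest address 0 is simply aligned on the hugepage size, so the single-allocation case discussed in the thread needs no special handling.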
Andrea Arcangeli - March 12, 2010, 2:52 p.m.
On Fri, Mar 12, 2010 at 11:36:33AM +0000, Paul Brook wrote:
> > On Thu, Mar 11, 2010 at 05:55:10PM +0000, Paul Brook wrote:
> > > sysconf(_SC_HUGEPAGESIZE); would seem to be the obvious answer.
> > 
> > There's not just one hugepage size 
> 
> We only have one madvise flag...

Transparent hugepage support means _really_ transparent, it's not up
to userland to know what hugepage size the kernel uses. There is no
way for userland to notice anything but that it runs faster.

The madvise flag is one and it only exists for 1 reason: embedded
systems that may want to turn off the transparency feature to avoid
the risk of using a little more memory during anonymous memory
copy-on-writes after fork or similar. But for things like kvm there is
absolutely zero memory waste in enabling hugepages so even embedded
definitely wants to enable transparent hugepage and run faster on
their underpowered CPU.

If it wasn't for embedded the madvise flag would need to be dropped as
it would be pointless. It's not about the page size at all.

> > and that thing doesn't exist yet
> > plus it'd require mangling over glibc too. If it existed I could use
> > it but I think this is better:
>  
> > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > 2097152
> 
> Is "pmd" x86 specific?

It's linux specific, this is common code, nothing x86 specific. In
fact on x86 it's not called pmd but Page Directory. I've actually no
idea what pmd stands for but it's definitely not x86 specific and it's
just about the linux common code common to all archs. The reason this
is called hpage_pmd_size is because it's a #define HPAGE_PMD_SIZE in
the kernel code. So this entirely matches the kernel internals
_common_code_.

> > > If the allocation size is not a multiple of the preferred alignment, then
> > > you probably lose either way, and we shouldn't be requesting increased
> > > alignment.
> > 
> > That's probably good idea. Also note, if we were to allocate
> > separately the 0-640k 1m-end, for NPT to work we'd need to start the
> > second block misaligned at a 1m address. So maybe I should move the
> > alignment out of qemu_ram_alloc and have it in the caller?
> 
> I think the only viable solution if you care about EPT/NPT is to not do that. 
> With your current code the 1m-end region will be misaligned - your code 

Well with my current code on top of current qemu code, there is no
risk of misalignment because the 0-4G is allocated in a single
qemu_ram_alloc. I'm sure it works right because
/debugfs/kvm/largepages shows all ram in largepages and otherwise I
wouldn't get a reproducible 6% boost on kernel compiles in guest even
on a common $150 quad core workstation (without even thinking about the
boost on huge systems).

> allocates it on a 2M boundary. I suspect you actually want (base % 2M) == 1M. 
> Aligning on a 1M boundary will only DTRT half the time.

The 1m-end is a hypothetical worry that came to mind as I was
discussing the issue with you. Basically my point is that if the pc.c
code changes and starts to qemu_ram_alloc the 0-640k and
1M-4G ranges with two separate calls (this is _not_ what qemu does
right now), the alignment in qemu_ram_alloc that works right now
would then stop working.

This is why I thought maybe it's more correct (and less
virtual-ram-wasteful) to move the alignment in the caller even if the
patch will grow in size and it'll be pc.c specific (which it wouldn't
need to if other archs will support transparent hugepage).

I think with what you're saying above you're basically agreeing with
me I should move the alignment in the caller. Correct me if I
misunderstood.

> But that's only going to happen if you align the allocation.

Yep, this is why I agree with you, it's better to always align even
when kvm_enabled() == 0.

> It can't choose what align to use, but it can (should?) choose how to achieve 
> that alignment.

Ok but I don't see a problem in how it achieves it, in fact I think
it's more efficient than a kernel assisted alignment that would then
force to split the vma generating a micro-slowdown.
Paul Brook - March 12, 2010, 4:04 p.m.
> > > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > > 2097152
> >
> > Is "pmd" x86 specific?
> 
> It's linux specific, this is common code, nothing x86 specific. In
> fact on x86 it's not called pmd but Page Directory. I've actually no
> idea what pmd stands for but it's definitely not x86 specific and it's
> just about the linux common code common to all archs. The reason this
> is called hpage_pmd_size is because it's a #define HPAGE_PMD_SIZE in
> the kernel code. So this entirely matches the kernel internals
> _common_code_.

Hmm, ok. I'm guessing linux doesn't support anything other than "huge" and 
"normal" page sizes now, so it's a question of whether we want it to expose 
current implementation details, or say "Align big in-memory things this much 
for optimal TLB behavior".

Paul
Paul Brook - March 12, 2010, 4:10 p.m.
> > allocates it on a 2M boundary. I suspect you actually want (base % 2M) ==
> > 1M. Aligning on a 1M boundary will only DTRT half the time.
> 
> The 1m-end is a hypothetical worry that came to mind as I was
> discussing the issue with you. Basically my point is that if the pc.c
> code changes and starts to qemu_ram_alloc the 0-640k and
> 1M-4G ranges with two separate calls (this is _not_ what qemu does
> right now), the alignment in qemu_ram_alloc that works right now
> would then stop working.
> 
> This is why I thought maybe it's more correct (and less
> virtual-ram-wasteful) to move the alignment in the caller even if the
> patch will grow in size and it'll be pc.c specific (which it wouldn't
> need to if other archs will support transparent hugepage).
> 
> I think with what you're saying above you're basically agreeing with
> me I should move the alignment in the caller. Correct me if I
> misunderstood.

I don't think the target-specific code should know or care about this.
Anthony recently proposed a different API for allocating guest RAM that would 
potentially make some of this information available to common code.  However 
that has significant issues once you try to use it for anything other than 
the trivial PC machine.  In particular I don't believe it is reasonable to 
assume RAM is always mapped at a fixed guest address.

Paul
Andrea Arcangeli - March 12, 2010, 4:17 p.m.
On Fri, Mar 12, 2010 at 04:04:03PM +0000, Paul Brook wrote:
> > > > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > > > 2097152
> > >
> > > Is "pmd" x86 specific?
> > 
> > It's linux specific, this is common code, nothing x86 specific. In
> > fact on x86 it's not called pmd but Page Directory. I've actually no
> > idea what pmd stands for but it's definitely not x86 specific and it's
> > just about the linux common code common to all archs. The reason this
> > is called hpage_pmd_size is because it's a #define HPAGE_PMD_SIZE in
> > the kernel code. So this entirely matches the kernel internals
> > _common_code_.
> 
> Hmm, ok. I'm guessing linux doesn't support anything other than "huge" and 
> "normal" page sizes now, so it's a question of whether we want it to expose 
> current implementation details, or say "Align big in-memory things this much 
> for optimal TLB behavior".

hugetlbfs already exposes the implementation detail. So if you want
that it's already available. The whole point of going the extra mile
with a transparent solution is to avoid userland to increase in
complexity and to keep it as unaware of hugepages as possible. The
madvise hint basically means "this won't risk to waste memory if you
use large tlb on this mapping" and also "this mapping is more
important than others to be backed by hugepages". It's up to the
kernel what to do next. For example right now khugepaged doesn't
prioritize scanning the madvise regions first, it basically doesn't
matter for hypervisor solutions in the cloud (all anon memory in the
system is only allocated by kvm...). But later we may prioritize it
and try to be smarter from the hint given by userland.
Paul Brook - March 12, 2010, 4:24 p.m.
> On Fri, Mar 12, 2010 at 04:04:03PM +0000, Paul Brook wrote:
> > > > > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > > > > 2097152
>
> > Hmm, ok. I'm guessing linux doesn't support anything other than "huge"
> > and "normal" page sizes now, so it's a question of whether we want it to
> > expose current implementation details, or say "Align big in-memory things
> > this much for optimal TLB behavior".
> 
> hugetlbfs already exposes the implementation detail. So if you want
> that it's already available. The whole point of going the extra mile
> with a transparent solution is to avoid userland to increase in
> complexity and to keep it as unaware of hugepages as possible. The
> madvise hint basically means "this won't risk to waste memory if you
> use large tlb on this mapping" and also "this mapping is more
> important than others to be backed by hugepages". It's up to the
> kernel what to do next. For example right now khugepaged doesn't
> prioritize scanning the madvise regions first, it basically doesn't
> matter for hypervisor solutions in the cloud (all anon memory in the
> system is only allocated by kvm...). But later we may prioritize it
> and try to be smarter from the hint given by userland.

So shouldn't [the name of] the value the kernel provides for recommended 
alignment be equally implementation agnostic?

Paul
Andrea Arcangeli - March 12, 2010, 4:57 p.m.
On Fri, Mar 12, 2010 at 04:24:24PM +0000, Paul Brook wrote:
> > On Fri, Mar 12, 2010 at 04:04:03PM +0000, Paul Brook wrote:
> > > > > > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > > > > > 2097152
> >
> > > Hmm, ok. I'm guessing linux doesn't support anything other than "huge"
> > > and "normal" page sizes now, so it's a question of whether we want it to
> > > expose current implementation details, or say "Align big in-memory things
> > > this much for optimal TLB behavior".
> > 
> > hugetlbfs already exposes the implementation detail. So if you want
> > that it's already available. The whole point of going the extra mile
> > with a transparent solution is to avoid userland to increase in
> > complexity and to keep it as unaware of hugepages as possible. The
> > madvise hint basically means "this won't risk to waste memory if you
> > use large tlb on this mapping" and also "this mapping is more
> > important than others to be backed by hugepages". It's up to the
> > kernel what to do next. For example right now khugepaged doesn't
> > prioritize scanning the madvise regions first, it basically doesn't
> > matter for hypervisor solutions in the cloud (all anon memory in the
> > system is only allocated by kvm...). But later we may prioritize it
> > and try to be smarter from the hint given by userland.
> 
> So shouldn't [the name of] the value the kernel provides for recommended 
> alignment be equally implementation agnostic?

Is sys/kernel/mm/transparent_hugepage directory implementation
agnostic in the first place?

This is not black and white issue, the idea of transparency is to have
userland to know as little as possible but without actually losing any
feature (in fact getting _more_!) than hugetlbfs that requires
userland to setup the whole thing, lose paging, lose ksm (yeah it also
loses ksm right now but we'll fix that with transparent hugepage
support later) etc...

If we want to fully take advantage of the feature (i.e. NPT and qemu
first 2M of guest physical ram where usually kernel resides) userspace
has to know the alignment size the kernel recommends. And so this
information can't be implementation agnostic. In short we do
everything possible to avoid changing userland, and this results in
a change of only a few lines, but that change is required, be
it a hint to ask the kernel to align or the use of posix_memalign (which is
more efficient as virtual memory is cheaper than vmas IMHO).

Only thing I'm undecided about is if this should be called
hpage_pmd_size or just hpage_size. Suppose amd/intel next year adds
64k pages too and the kernel decides to use them too if it fails to
allocate a 2M page. So we escalate the fallback from 2M -> 64k -> 4k,
and HPAGE_PMD_SIZE becomes 64k. Still qemu has to align on the max
possible hpage_size provided by transparent hugepage. So with this new
reasoning I think hpage_size or max_hpage_size would be better sysfs
name for this. What do you think? hpage_size or max_hpage_size?
Paul Brook - March 12, 2010, 5:10 p.m.
> > So shouldn't [the name of] the value the kernel provides for recommended
> > alignment be equally implementation agnostic?
> 
> Is sys/kernel/mm/transparent_hugepage directory implementation
> agnostic in the first place?

It's about as agnostic as MADV_HUGEPAGE :-)

> If we want to fully take advantage of the feature (i.e. NPT and qemu
> first 2M of guest physical ram where usually kernel resides) userspace
> has to know the alignment size the kernel recommends.

This is KVM specific, so my gut reaction is you should be asking KVM.

> Only thing I'm undecided about is if this should be called
> hpage_pmd_size or just hpage_size. Suppose amd/intel next year adds
> 64k pages too and the kernel decides to use them too if it fails to
> allocate a 2M page. So we escalate the fallback from 2M -> 64k -> 4k,
> and HPAGE_PMD_SIZE becomes 64k. Still qemu has to align on the max
> possible hpage_size provided by transparent hugepage. So with this new
> reasoning I think hpage_size or max_hpage_size would be better sysfs
> name for this. What do you think? 

Agreed.

> hpage_size or max_hpage_size?

No particular preference. Or you could have .../page_sizes list all available 
sizes, and have qemu take the first one (or last depending on sort order).

Paul
Andrea Arcangeli - March 12, 2010, 5:41 p.m.
On Fri, Mar 12, 2010 at 05:10:54PM +0000, Paul Brook wrote:
> > > So shouldn't [the name of] the value the kernel provides for recommended
> > > alignment be equally implementation agnostic?
> > 
> > Is sys/kernel/mm/transparent_hugepage directory implementation
> > agnostic in the first place?
> 
> It's about as agnostic as MADV_HUGEPAGE :-)

Exactly! ;) Again it's no black and white... we expose a minimum so
the kernel later can do better for qemu.

> > If we want to fully take advantage of the feature (i.e. NPT and qemu
> > first 2M of guest physical ram where usually kernel resides) userspace
> > has to know the alignment size the kernel recommends.
> 
> This is KVM specific, so my gut reaction is you should be asking KVM.

Hey I just said this is also needed for qemu to be backed by hugepages
on the first 2M of guest ram. Not KVM specific anymore if we want to
optimize that bit too. Performance difference for small system with
one qemu only, will be in the 2% range, for KVM it's from 6% to 15%
depending on the workload. But it's not kvm specific.

> No particular preference. Or you could have .../page_sizes list all available 
> sizes, and have qemu take the first one (or last depending on sort order).

That would also work. Considering that the current transparent
hugepage support won't support more than one hugepage size, I think
it's ok to call it hpage_size. That amd/intel will add a 64k page size
is purely hypothetical, and I guess by the time transparent hugepages
are 1G in size, the basic page will be 2M and 4k will be obsolete. But
hey if you prefer page_sizes or max_page_size let me know... The
semantics will be able to stand for the long run: I'll document that
hpage_size exports the alignment that userland should use with
posix_memalign to be sure the whole allocated space can be backed by
hugepages.
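
To make the hpage_size semantics concrete, here is a minimal sketch of what
userland could do with it. It assumes the value is exported as a plain
decimal byte count in a file named hpage_size under the transparent_hugepage
directory; that file name is only the one proposed in this thread and may
not match what the kernel ends up exporting. read_hpage_size and
alloc_guest_ram are hypothetical helper names, not existing qemu functions.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Hypothetical sysfs file: the name proposed in this thread, not a
 * confirmed kernel interface. */
#define HPAGE_SIZE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_size"

/* Return the alignment read from path, or 0 if it is missing or unusable. */
static size_t read_hpage_size(const char *path)
{
    FILE *f = fopen(path, "r");
    unsigned long long val = 0;

    if (!f)
        return 0;
    if (fscanf(f, "%llu", &val) != 1)
        val = 0;
    fclose(f);
    /* posix_memalign needs a power of two, so reject anything else. */
    if (val == 0 || (val & (val - 1)) != 0)
        return 0;
    return (size_t)val;
}

/* Allocate size bytes aligned on align and hint that hugepages are wanted. */
static void *alloc_guest_ram(size_t size, size_t align)
{
    void *p = NULL;

    /* posix_memalign requires at least pointer-size alignment. */
    if (align < sizeof(void *))
        align = sizeof(void *);
    if (posix_memalign(&p, align, size) != 0)
        return NULL;
#ifdef MADV_HUGEPAGE
    /* Advisory only: a failure here is not fatal, so it is ignored. */
    madvise(p, size, MADV_HUGEPAGE);
#endif
    return p;
}
```

A caller would use something like
alloc_guest_ram(size, read_hpage_size(HPAGE_SIZE_PATH)), which falls back to
an ordinary pointer-aligned allocation when the file is not there.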
Paul Brook - March 12, 2010, 6:17 p.m.
> > No particular preference. Or you could have .../page_sizes list all
> > available sizes, and have qemu take the first one (or last depending on
> > sort order).
> 
> That would also work. Considering that the current transparent
> hugepage support won't support more than one hugepage size, I think
> it's ok to call it hpage_size, the fact that amd/intel will add a 64k
> page size is purely hypothetical

It's only hypothetical on x86. Many other architectures already support this 
(at least ARM, MIPS, IA64, SPARC).

Paul
Andrea Arcangeli - March 12, 2010, 6:36 p.m.
On Fri, Mar 12, 2010 at 06:17:05PM +0000, Paul Brook wrote:
> > > No particular preference. Or you could have .../page_sizes list all
> > > available sizes, and have qemu take the first one (or last depending on
> > > sort order).
> > 
> > That would also work. Considering that the current transparent
> > hugepage support won't support more than one hugepage size, I think
> > it's ok to call it hpage_size, the fact that amd/intel will add a 64k
> > page size is purely hypothetical
> 
> It's only hypothetical on x86. Many other architectures already support this 
> (at least ARM, MIPS, IA64, SPARC).

Hmm the kernel won't support mixed page size right now but ok this is
irrelevant because the API should stand. I think we'll be lucky enough
if 1 HPAGE size will be supported...

ia64 won't be able to take advantage of it either because it can't mix
different page sizes in the same vma, which is a requirement for
transparency and fallback to regular page size.

My point is that there is no need to show the smaller page sizes to
userland, only the max one is relevant and this isn't going to change
and I'm uncomfortable adding plural stuff to a patch that doesn't
contemplate mixed page sizes, and for the time being multiple hpage
sizes aren't even on the horizon... I hate APIs...

What if we defer this whole issue and we just align on 2M if
/sys/kernel/mm/transparent_hugepage exists without checking /sys? ;)
If there's a way to define the host hpage size that will be enough. 2M
has been there since PAE was introduced and it's not going to change
overnight, so there's plenty of time later to add hpage_sizes.
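
A minimal sketch of that deferred approach, assuming only that the presence
of the directory means transparent hugepages are available and that 2M is
the right size whenever it is. guest_ram_alignment and DEFAULT_HPAGE_SIZE
are hypothetical names used for illustration only.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define THP_SYSFS_DIR "/sys/kernel/mm/transparent_hugepage"
/* Assumption from the thread: 2M hugepages (x86 PAE and x86-64). */
#define DEFAULT_HPAGE_SIZE ((size_t)2 << 20)

/* Return 2M if thp_dir exists, otherwise the base page size. */
static size_t guest_ram_alignment(const char *thp_dir)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t base = page > 0 ? (size_t)page : 4096;

    if (access(thp_dir, F_OK) == 0)
        return DEFAULT_HPAGE_SIZE > base ? DEFAULT_HPAGE_SIZE : base;
    return base;
}
```

Calling guest_ram_alignment(THP_SYSFS_DIR) never reads a size from the
kernel, so it would have to be revisited for any host whose hugepage size is
not 2M.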
Paul Brook - March 12, 2010, 6:41 p.m.
> My point is that there is no need to show the smaller page sizes to
> userland, only the max one is relevant and this isn't going to change
> and I'm uncomfortable adding plural stuff to a patch that doesn't
> contemplate mixed page sizes, and for the time being multiple hpage
> sizes aren't even on the horizon... I hate APIs...

I don't care enough to continue arguing about the utility of multiple page 
sizes :-)

> What if we defer this whole issue and we just align on 2M if
> /sys/kernel/mm/transparent_hugepage exists without checking /sys? ;)

Doesn't non-PAE (i.e. most 32-bit x86) use 4M huge pages? There's still a good 
number of those knocking about.

Paul
Andrea Arcangeli - March 12, 2010, 6:51 p.m.
On Fri, Mar 12, 2010 at 06:41:56PM +0000, Paul Brook wrote:
> Doesn't non-PAE (i.e. most 32-bit x86) use 4M huge pages? There's still a good 
> number of those knocking about.

Yep, but a 32bit x86 host won't support transparent hugepages (certain
bits overflow in the kernel implementation, there's no more space in
the page struct). It's not worth it IMHO.
Jamie Lokier - March 12, 2010, 10:40 p.m.
Andrea Arcangeli wrote:
> On Fri, Mar 12, 2010 at 06:41:56PM +0000, Paul Brook wrote:
> > Doesn't non-PAE (i.e. most 32-bit x86) use 4M huge pages? There's still a good 
> > number of those knocking about.
> 
> Yep, but a 32bit x86 host won't support transparent hugepages (certain
> bits overflow in the kernel implementation, there's no more space in
> the page struct). It's not worth it IMHO.

Perhaps the performance figures would be even more significant for
32-bit x86 than 64-bit, due to the smaller TLBs.

I guess that's not what you mean by worth it?

-- Jamie
Avi Kivity - March 13, 2010, 8:28 a.m.
On 03/11/2010 06:05 PM, Andrea Arcangeli wrote:
> On Thu, Mar 11, 2010 at 05:52:16PM +0200, Avi Kivity wrote:
>    
>> That is a little wasteful.  How about a hint to mmap() requesting proper
>> alignment (MAP_HPAGE_ALIGN)?
>>      
> So you suggest adding a new kernel feature to mmap? Not sure if it's
> worth it, considering it'd also increase the number of vmas because it
> will have to leave a hole. Wasting 2M-4k of virtual memory is likely
> cheaper than having 1 more vma in the rbtree for every page fault. So
> I think it's better to just malloc and adjust ourselves on the next
> offset which is done in userland by qemu_memalign I think.
>
>    

Won't we get a new vma anyway due to the madvise() call later?

But I agree it isn't worth it.
Andrea Arcangeli - March 13, 2010, 5:47 p.m.
On Sat, Mar 13, 2010 at 10:28:32AM +0200, Avi Kivity wrote:
> On 03/11/2010 06:05 PM, Andrea Arcangeli wrote:
> > On Thu, Mar 11, 2010 at 05:52:16PM +0200, Avi Kivity wrote:
> >    
> >> That is a little wasteful.  How about a hint to mmap() requesting proper
> >> alignment (MAP_HPAGE_ALIGN)?
> >>      
> > So you suggest adding a new kernel feature to mmap? Not sure if it's
> > worth it, considering it'd also increase the number of vmas because it
> > will have to leave a hole. Wasting 2M-4k of virtual memory is likely
> > cheaper than having 1 more vma in the rbtree for every page fault. So
> > I think it's better to just malloc and adjust ourselves on the next
> > offset which is done in userland by qemu_memalign I think.
> >
> >    
> 
> Won't we get a new vma anyway due to the madvise() call later?

As long as MADV_HUGEPAGE is set on all of them, the kernel will merge
the vmas together.

So if we do stuff like "alloc from 0 to 4G-4k" and then alloc "4G to
8G" this will avoid a vma split. (Initially the mmap will create a
vma, but it'll be immediately merged away by madvise through vma_merge.)

> But I agree it isn't worth it.

2 vma or 1 vma isn't measurable of course, but yes the point is that
it's not worth it because doing it in userland is theoretically better
too for performance.
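
For illustration, the userland approach discussed above (over-allocate, then
adjust to the next aligned offset, wasting at most 2M-4k of virtual memory)
can be sketched as follows. This is not what qemu_memalign does internally;
struct aligned_map, align_up, max_waste and mmap_aligned are hypothetical
names, and the sketch deliberately leaves the unused head and tail mapped so
that no hole and no extra vma is created.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* The raw mapping is kept so the caller can munmap() it later. */
struct aligned_map {
    void *raw;       /* address returned by mmap */
    size_t raw_len;  /* length passed to mmap */
    void *aligned;   /* first align-aligned address inside the mapping */
};

/* Round addr up to a multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
    return (addr + align - 1) & ~((uintptr_t)align - 1);
}

/* Worst-case virtual memory wasted: 2M-4k for 2M alignment and 4k pages. */
static size_t max_waste(size_t align, size_t page_size)
{
    return align > page_size ? align - page_size : 0;
}

/* Map size bytes plus the worst-case slack and return the aligned start. */
static struct aligned_map mmap_aligned(size_t size, size_t align)
{
    struct aligned_map m = { NULL, 0, NULL };
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t total = size + max_waste(align, page);
    void *raw = mmap(NULL, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (raw == MAP_FAILED)
        return m;
    m.raw = raw;
    m.raw_len = total;
    m.aligned = (void *)align_up((uintptr_t)raw, align);
    return m;
}
```

The slack is only virtual address space: pages in the unused head and tail
are never touched, so they are never backed by physical memory.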

Patch

diff --git a/exec.c b/exec.c
index 891e0ee..aedd133 100644
--- a/exec.c
+++ b/exec.c
@@ -2628,11 +2628,25 @@  ram_addr_t qemu_ram_alloc(ram_addr_t size)
                                 PROT_EXEC|PROT_READ|PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
 #else
-        new_block->host = qemu_vmalloc(size);
-#endif
+#if defined(__linux__) && defined(TARGET_HPAGE_BITS)
+	if (!kvm_enabled())
+#endif
+		new_block->host = qemu_vmalloc(size);
+#if defined(__linux__) && defined(TARGET_HPAGE_BITS)
+	else
+		/*
+		 * Align on HPAGE_SIZE so "(gfn ^ pfn) &
+		 * (HPAGE_SIZE-1) == 0" to allow KVM to take advantage
+		 * of hugepages with NPT/EPT.
+		 */
+		new_block->host = qemu_memalign(1 << TARGET_HPAGE_BITS, size);
+#endif 
 #ifdef MADV_MERGEABLE
         madvise(new_block->host, size, MADV_MERGEABLE);
 #endif
+#ifdef MADV_HUGEPAGE
+        madvise(new_block->host, size, MADV_HUGEPAGE);
+#endif
     }
     new_block->offset = last_ram_offset;
     new_block->length = size;
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index ef7d951..26044eb 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -873,6 +873,7 @@  uint64_t cpu_get_tsc(CPUX86State *env);
 #define X86_DUMP_CCOP 0x0002 /* dump qemu flag cache */
 
 #define TARGET_PAGE_BITS 12
+#define TARGET_HPAGE_BITS (TARGET_PAGE_BITS+9)
 
 #define cpu_init cpu_x86_init
 #define cpu_exec cpu_x86_exec
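
As a rough illustration of the alignment condition quoted in the exec.c
comment, the sketch below checks it on byte addresses rather than on the
gfn/pfn frame numbers the comment names; that substitution is an assumption
made to keep the example short. HPAGE_BITS mirrors TARGET_HPAGE_BITS (12+9,
i.e. 2M), and hpage_congruent is a hypothetical helper. Note the outer
parentheses: in C, == binds tighter than &, so the expression as literally
written in the comment would not compute the intended test.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS  12
#define HPAGE_BITS (PAGE_BITS + 9)               /* 21 bits, i.e. 2M */
#define HPAGE_SIZE ((uint64_t)1 << HPAGE_BITS)

/* Nonzero when both addresses sit at the same offset within a hugepage,
 * which is what lets one hugepage mapping cover both sides. */
static int hpage_congruent(uint64_t guest_phys, uint64_t host_virt)
{
    return ((guest_phys ^ host_virt) & (HPAGE_SIZE - 1)) == 0;
}
```

With guest RAM starting at guest-physical 0, the condition holds exactly
when the host allocation starts on a 2M boundary, which is why the patch
switches to qemu_memalign under KVM.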