Message ID: 20100311151427.GE5677@random.random
State: New
On 03/11/2010 05:14 PM, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> This will allow proper alignment so NPT/EPT can take advantage of linux host
> backing the guest memory with hugepages (only relevant for KVM and not for QEMU
> that has no NPT/EPT support). To complete it, it will also notify
> the kernel that this memory is important to be backed by hugepages with
> madvise (needed for both KVM and QEMU).
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>
> diff --git a/exec.c b/exec.c
> index 891e0ee..aedd133 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2628,11 +2628,25 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
>                                  PROT_EXEC|PROT_READ|PROT_WRITE,
>                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>  #else
> -        new_block->host = qemu_vmalloc(size);
> -#endif
> +#if defined(__linux__) && defined(TARGET_HPAGE_BITS)
> +        if (!kvm_enabled())
> +#endif
> +            new_block->host = qemu_vmalloc(size);
> +#if defined(__linux__) && defined(TARGET_HPAGE_BITS)
> +        else
> +            /*
> +             * Align on HPAGE_SIZE so "(gfn ^ pfn) &
> +             * (HPAGE_SIZE-1) == 0" to allow KVM to take advantage
> +             * of hugepages with NPT/EPT.
> +             */
> +            new_block->host = qemu_memalign(1 << TARGET_HPAGE_BITS, size);

That is a little wasteful. How about a hint to mmap() requesting proper
alignment (MAP_HPAGE_ALIGN)? Failing that, modify qemu_memalign() to trim
excess memory.

Come to think of it, posix_memalign() needs to do that (but doesn't).

> +#endif
>  #ifdef MADV_MERGEABLE
>      madvise(new_block->host, size, MADV_MERGEABLE);
>  #endif
> +#ifdef MADV_HUGEPAGE
> +    madvise(new_block->host, size, MADV_HUGEPAGE);
> +#endif
>  }
On Thu, Mar 11, 2010 at 05:52:16PM +0200, Avi Kivity wrote:
> That is a little wasteful. How about a hint to mmap() requesting proper
> alignment (MAP_HPAGE_ALIGN)?

So you suggest adding a new kernel feature to mmap? Not sure if it's worth
it, considering it'd also increase the number of vmas because it will have
to leave a hole. Wasting 2M-4k of virtual memory is likely cheaper than
having one more vma in the rbtree for every page fault. So I think it's
better to just malloc and adjust ourselves to the next offset, which is
done in userland by qemu_memalign I think.

What we could ask the kernel is the HPAGE_SIZE. Also, thinking a bit more
about it, what we really care about is the HOST_HPAGE_SIZE. That said, I
doubt it makes a lot of difference for kvm, and this only changes the kvm
path. I'm open to suggestions of where to get the HPAGE_SIZE from and what
to call it...

> Failing that, modify qemu_memalign() to trim excess memory.
>
> Come to think of it, posix_memalign() needs to do that (but doesn't).

It's hard to tell because of the amount of #ifdefs in .c files, but it
seems to be using posix_memalign. If we don't touch these additional pages
allocated and there's no transparent hugepage support in the kernel, you
won't waste any more memory, and fewer vmas will be generated this way than
with a kernel option to reduce the virtual memory waste. Basically the idea
is to waste virtual memory to avoid wasting cpu. In short we should make
sure it only wastes virtual memory...
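[Editorial note: as an illustration of the trimming approach Avi suggests,
an aligned allocator can over-allocate with mmap() and unmap the surplus
head and tail, so that unlike posix_memalign() no extra virtual memory
stays mapped. A minimal sketch, assuming a power-of-two, page-multiple
alignment; the name trimmed_memalign is hypothetical, and as Andrea points
out, each unmapped hole costs extra vmas:

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

/* Hypothetical trimming allocator: over-allocate with mmap(), then give
 * the misaligned head and the excess tail back to the kernel. Assumes
 * `align` is a power of two and a multiple of the page size. */
static void *trimmed_memalign(size_t align, size_t size)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size = (size + pagesz - 1) & ~(pagesz - 1);   /* munmap needs page units */

    uint8_t *base = mmap(NULL, size + align, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    uint8_t *aligned = (uint8_t *)(((uintptr_t)base + align - 1)
                                   & ~(uintptr_t)(align - 1));
    if (aligned != base)
        munmap(base, aligned - base);                 /* trim misaligned head */
    munmap(aligned + size, align - (aligned - base)); /* trim excess tail */
    return aligned;
}

The two munmap() calls buy back `align` bytes of address space at the price
of the extra vmas Andrea describes; qemu_memalign()'s over-allocation makes
the opposite trade, wasting only virtual memory.]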
> > +            /*
> > +             * Align on HPAGE_SIZE so "(gfn ^ pfn) &
> > +             * (HPAGE_SIZE-1) == 0" to allow KVM to take advantage
> > +             * of hugepages with NPT/EPT.
> > +             */
> > +            new_block->host = qemu_memalign(1 << TARGET_HPAGE_BITS, size);

This should not be target dependent, i.e. it should be the host page size.

> That is a little wasteful. How about a hint to mmap() requesting proper
> alignment (MAP_HPAGE_ALIGN)?

I'd kinda hope that we wouldn't need to, i.e. the host kernel is smart
enough to automatically align large allocations anyway.

This is probably a useful optimization regardless of KVM.

Paul
On Thu, Mar 11, 2010 at 04:28:04PM +0000, Paul Brook wrote:
> > +            /*
> > +             * Align on HPAGE_SIZE so "(gfn ^ pfn) &
> > +             * (HPAGE_SIZE-1) == 0" to allow KVM to take advantage
> > +             * of hugepages with NPT/EPT.
> > +             */
> > +            new_block->host = qemu_memalign(1 << TARGET_HPAGE_BITS, size);
>
> This should not be target dependent, i.e. it should be the host page size.

Yep, I noticed. I'm not aware of an official way to get that information
out of the kernel (hugepagesize in /proc/meminfo depends on hugetlbfs,
which in turn is not a dependency for transparent hugepage support), but
hey, I can add it myself to
/sys/kernel/mm/transparent_hugepage/hugepage_size!

> > That is a little wasteful. How about a hint to mmap() requesting proper
> > alignment (MAP_HPAGE_ALIGN)?
>
> I'd kinda hope that we wouldn't need to, i.e. the host kernel is smart
> enough to automatically align large allocations anyway.

The kernel won't do that, and the main reason is to avoid creating more
vmas: it's more efficient to waste virtual space and have userland allocate
more than needed than to ask the kernel for alignment and force it to
create more vmas because of the resulting holes. Virtual memory costs
nothing.

Also, khugepaged can later zero out the pte_none regions to create a full
segment all backed by hugepages, but if we do that khugepaged will eat into
the free memory space. At the moment I kept khugepaged a
zero-memory-footprint thing. But I'm currently adding an option called
collapse_unmapped to allow khugepaged to collapse unmapped pages too, so if
there are only 2-3 pages in the region before the memalign, they can also
be mapped by a large tlb to make qemu run faster.

> This is probably a useful optimization regardless of KVM.

HPAGE alignment is only useful with KVM because it can only pay off with
EPT/NPT; transparent hugepage already works fine without it (though it'd be
a micro-optimization for the first and last few pages in the whole vma).
This is why I made it conditional on kvm_enabled(). I can remove the
kvm_enabled() check if you worry about the first and last pages in the huge
anon vma.

OTOH the madvise(MADV_HUGEPAGE) is surely a good idea for qemu too. KVM
normally runs on 64bit hosts, so it's no big deal if we waste 1M of virtual
memory here and there, but I thought on qemu you preferred not to have
alignment and to leave the first few and last few pages in a vma not backed
by a large tlb. Ideally we should also align on hpage size only if
sizeof(long) == 8. Not sure what's the recommended way to code that,
though, and it'll make it a bit more complex for little good.
> On Thu, Mar 11, 2010 at 04:28:04PM +0000, Paul Brook wrote:
> > This should not be target dependent, i.e. it should be the host page
> > size.
>
> Yep, I noticed. I'm not aware of an official way to get that information
> out of the kernel (hugepagesize in /proc/meminfo depends on hugetlbfs,
> which in turn is not a dependency for transparent hugepage support), but
> hey, I can add it myself to
> /sys/kernel/mm/transparent_hugepage/hugepage_size!

sysconf(_SC_HUGEPAGESIZE); would seem to be the obvious answer.

> > That is a little wasteful. How about a hint to mmap() requesting
> > proper alignment (MAP_HPAGE_ALIGN)?
>
> > I'd kinda hope that we wouldn't need to, i.e. the host kernel is smart
> > enough to automatically align large allocations anyway.
>
> The kernel won't do that, and the main reason is to avoid creating more
> vmas: it's more efficient to waste virtual space and have userland
> allocate more than needed than to ask the kernel for alignment and force
> it to create more vmas because of the resulting holes. Virtual memory
> costs nothing.

Huh. That seems unfortunate :-(

> Also, khugepaged can later zero out the pte_none regions to create a
> full segment all backed by hugepages, but if we do that khugepaged will
> eat into the free memory space. At the moment I kept khugepaged a
> zero-memory-footprint thing. But I'm currently adding an option called
> collapse_unmapped to allow khugepaged to collapse unmapped pages too, so
> if there are only 2-3 pages in the region before the memalign, they can
> also be mapped by a large tlb to make qemu run faster.

I don't really understand what you're getting at here. Surely a naturally
aligned block is always going to be easier to defragment than a misaligned
block. If the allocation size is not a multiple of the preferred alignment,
then you probably lose either way, and we shouldn't be requesting increased
alignment.

> HPAGE alignment is only useful with KVM because it can only pay off with
> EPT/NPT; transparent hugepage already works fine without it (though it'd
> be a micro-optimization for the first and last few pages in the whole
> vma). This is why I made it conditional on kvm_enabled(). I can remove
> the kvm_enabled() check if you worry about the first and last pages in
> the huge anon vma.

I wouldn't be surprised if putting the start of guest ram on a large TLB
entry was a win. Your guest kernel often lives there!

> OTOH the madvise(MADV_HUGEPAGE) is surely a good idea for qemu too. KVM
> normally runs on 64bit hosts, so it's no big deal if we waste 1M of
> virtual memory here and there, but I thought on qemu you preferred not
> to have alignment and to leave the first few and last few pages in a vma
> not backed by a large tlb. Ideally we should also align on hpage size
> only if sizeof(long) == 8. Not sure what's the recommended way to code
> that, though, and it'll make it a bit more complex for little good.

Assuming we're allocating in large chunks, I doubt an extra hugepage worth
of VMA is a big issue.

Either way I'd argue that this isn't something qemu should have to care
about, and is actually a bug in posix_memalign.

Paul
On Thu, Mar 11, 2010 at 05:55:10PM +0000, Paul Brook wrote:
> sysconf(_SC_HUGEPAGESIZE); would seem to be the obvious answer.

There's not just one hugepage size, and that thing doesn't exist yet; plus
it'd require changes to glibc too. If it existed I could use it, but I
think this is better:

$ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
2097152

Ok? If this file doesn't exist we won't align, so we also align on qemu,
not only on kvm, for the concern below about the first and last bytes.

> > Also, khugepaged can later zero out the pte_none regions to create a
> > full segment all backed by hugepages, but if we do that khugepaged
> > will eat into the free memory space. At the moment I kept khugepaged a
> > zero-memory-footprint thing. But I'm currently adding an option called
> > collapse_unmapped to allow khugepaged to collapse unmapped pages too,
> > so if there are only 2-3 pages in the region before the memalign, they
> > can also be mapped by a large tlb to make qemu run faster.
>
> I don't really understand what you're getting at here. Surely a
> naturally aligned block is always going to be easier to defragment than
> a misaligned block.

Basically what I was saying is about touching subpages 0 and 1 of a
hugepage; then posix_memalign extends the vma, and nobody is ever going to
touch pages 2-511 because those are the wasted virtual addresses.
Previously khugepaged couldn't allocate a hugepage for only pages 0 and 1,
because the vma stopped there, but after the vma is extended it can. So
before I wasn't mapping this range with a hugepage, and now I'm mapping it
with a hugepage too. And a sysfs control will select the max number of
unmapped subpages for the collapse to happen. For just 1 subpage mapped in
the hugepage virtual range, it won't make sense to use a large tlb and
waste 511 pages of ram.

> If the allocation size is not a multiple of the preferred alignment,
> then you probably lose either way, and we shouldn't be requesting
> increased alignment.

That's probably a good idea. Also note, if we were to allocate the 0-640k
and 1m-end ranges separately, for NPT to work we'd need to start the second
block misaligned at a 1m address. So maybe I should move the alignment out
of qemu_ram_alloc and have it in the caller?

> I wouldn't be surprised if putting the start of guest ram on a large TLB
> entry was a win. Your guest kernel often lives there!

Yep, that's easy to handle with hpage_pmd_size ;).

> Assuming we're allocating in large chunks, I doubt an extra hugepage
> worth of VMA is a big issue.
>
> Either way I'd argue that this isn't something qemu should have to care
> about, and is actually a bug in posix_memalign.

Hmm, the last is a weird claim, considering posix_memalign gets an explicit
alignment parameter and it surely can't choose what alignment to use. We
can argue about the kernel side having to align automatically, but again,
if it did that, it'd generate unnecessary vma holes which we don't want. I
think it's quite simple: just use my new sysfs control; if it exists,
always use that alignment instead of the default. We've only to decide
whether to align inside or outside of qemu_ram_alloc.
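[Editorial note: probing the proposed sysfs file from QEMU could look like
the following sketch. The file's path and format follow Andrea's proposal
in this thread, not a then-released kernel ABI:

#include <stdio.h>

/* Probe the hugepage size proposed in this thread. Returns 0 when the
 * sysfs file is absent (no transparent hugepage support), in which case
 * the caller should skip the extra alignment. */
static long host_hpage_size(void)
{
    long size = 0;
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
    if (!f)
        return 0;
    if (fscanf(f, "%ld", &size) != 1)
        size = 0;
    fclose(f);
    return size;
}

qemu_ram_alloc() could then call qemu_memalign(host_hpage_size(), size)
whenever the probe succeeds, with no target-dependent constant involved.]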
> On Thu, Mar 11, 2010 at 05:55:10PM +0000, Paul Brook wrote:
> > sysconf(_SC_HUGEPAGESIZE); would seem to be the obvious answer.
>
> There's not just one hugepage size

We only have one madvise flag...

> and that thing doesn't exist yet; plus it'd require changes to glibc
> too. If it existed I could use it, but I think this is better:
>
> $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> 2097152

Is "pmd" x86 specific?

> > If the allocation size is not a multiple of the preferred alignment,
> > then you probably lose either way, and we shouldn't be requesting
> > increased alignment.
>
> That's probably a good idea. Also note, if we were to allocate the
> 0-640k and 1m-end ranges separately, for NPT to work we'd need to start
> the second block misaligned at a 1m address. So maybe I should move the
> alignment out of qemu_ram_alloc and have it in the caller?

I think the only viable solution if you care about EPT/NPT is to not do
that. With your current code the 1m-end region will be misaligned - your
code allocates it on a 2M boundary. I suspect you actually want
(base % 2M) == 1M. Aligning on a 1M boundary will only DTRT half the time.

> > I wouldn't be surprised if putting the start of guest ram on a large
> > TLB entry was a win. Your guest kernel often lives there!
>
> Yep, that's easy to handle with hpage_pmd_size ;).

But that's only going to happen if you align the allocation.

> > Assuming we're allocating in large chunks, I doubt an extra hugepage
> > worth of VMA is a big issue.
> >
> > Either way I'd argue that this isn't something qemu should have to
> > care about, and is actually a bug in posix_memalign.
>
> Hmm, the last is a weird claim, considering posix_memalign gets an
> explicit alignment parameter and it surely can't choose what alignment
> to use. We can argue about the kernel side having to align
> automatically, but again, if it did that, it'd generate unnecessary vma
> holes which we don't want.

It can't choose what alignment to use, but it can (should?) choose how to
achieve that alignment.

Paul
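[Editorial note: Paul's (base % 2M) == 1M observation generalizes. KVM can
only back a guest page with a host hugepage when guest-physical and
host-virtual addresses share the same offset within a hugepage, so a block
whose guest base is 1M must land at a host address congruent to 1M mod 2M;
plain 2M alignment gets it wrong. A hypothetical helper showing the
arithmetic:

#include <stdint.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

/* Hypothetical: given an over-sized buffer at `buf`, compute the host
 * virtual address at which a RAM block with guest-physical base `gpa`
 * should be placed so that (gpa ^ hva) & (HPAGE_SIZE - 1) == 0, i.e.
 * both sides share the same offset within a hugepage. */
static uintptr_t place_block(uintptr_t buf, uint64_t gpa)
{
    uintptr_t want = gpa & (HPAGE_SIZE - 1);          /* e.g. 1M for gpa = 1M */
    uintptr_t have = buf & (HPAGE_SIZE - 1);
    return buf + ((want - have) & (HPAGE_SIZE - 1));  /* advance to match */
}

For gpa = 1M and a 2M-aligned buffer this returns buf + 1M, exactly the
(base % 2M) == 1M placement Paul describes.]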
On Fri, Mar 12, 2010 at 11:36:33AM +0000, Paul Brook wrote:
> > There's not just one hugepage size
>
> We only have one madvise flag...

Transparent hugepage support means _really_ transparent; it's not up to
userland to know what hugepage size the kernel uses. There is no way for
userland to notice anything except that it runs faster. There is one
madvise flag, and it exists for one reason only: embedded systems may want
to turn off the transparency feature to avoid the risk of using a little
more memory during anonymous memory copy-on-writes after fork or similar.
But for things like kvm there is absolutely zero memory waste in enabling
hugepages, so even embedded definitely wants to enable transparent hugepage
and run faster on their underpowered CPUs. If it weren't for embedded, the
madvise flag would need to be dropped, as it would be pointless. It's not
about the page size at all.

> > and that thing doesn't exist yet; plus it'd require changes to glibc
> > too. If it existed I could use it, but I think this is better:
> >
> > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > 2097152
>
> Is "pmd" x86 specific?

It's linux specific; this is common code, nothing x86 specific. In fact on
x86 it's not called pmd but Page Directory. I've actually no idea what pmd
stands for, but it's definitely not x86 specific and it's just about the
linux common code shared by all archs. The reason this is called
hpage_pmd_size is that it's a #define HPAGE_PMD_SIZE in the kernel code, so
this exactly matches the kernel's internal _common_code_.

> > If the allocation size is not a multiple of the preferred alignment,
> > then you probably lose either way, and we shouldn't be requesting
> > increased alignment.
> >
> > That's probably a good idea. Also note, if we were to allocate the
> > 0-640k and 1m-end ranges separately, for NPT to work we'd need to
> > start the second block misaligned at a 1m address. So maybe I should
> > move the alignment out of qemu_ram_alloc and have it in the caller?
>
> I think the only viable solution if you care about EPT/NPT is to not do
> that. With your current code the 1m-end region will be misaligned - your
> code

Well, with my current code on top of current qemu code, there is no risk of
misalignment because the 0-4G range is allocated in a single
qemu_ram_alloc. I'm sure it works right because /debugfs/kvm/largepages
shows all ram in largepages, and otherwise I wouldn't get a reproducible 6%
boost on kernel compiles in the guest, even on a common $150 quad core
workstation (without even thinking of the boost on huge systems).

> allocates it on a 2M boundary. I suspect you actually want
> (base % 2M) == 1M. Aligning on a 1M boundary will only DTRT half the
> time.

The 1m-end case is a hypothetical worry that came to mind as I was
discussing the issue with you. Basically my point is that if the pc.c code
changes and qemu_ram_allocs the 0-640k and 1M-4G ranges with two separate
calls (this is _not_ what qemu does right now), the alignment in
qemu_ram_alloc that works right now would stop working.

This is why I thought it may be more correct (and less
virtual-ram-wasteful) to move the alignment into the caller, even if the
patch grows in size and becomes pc.c specific (which it wouldn't need to be
if other archs gain transparent hugepage support).
I think with what you're saying above you're basically agreeing with me
that I should move the alignment into the caller. Correct me if I
misunderstood.

> But that's only going to happen if you align the allocation.

Yep, this is why I agree with you: it's better to always align, even when
kvm_enabled() == 0.

> It can't choose what alignment to use, but it can (should?) choose how
> to achieve that alignment.

Ok, but I don't see a problem in how it achieves it; in fact I think it's
more efficient than a kernel-assisted alignment that would then force the
vma to split, generating a micro-slowdown.
> > > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > > 2097152
> >
> > Is "pmd" x86 specific?
>
> It's linux specific; this is common code, nothing x86 specific. In fact
> on x86 it's not called pmd but Page Directory. I've actually no idea
> what pmd stands for, but it's definitely not x86 specific and it's just
> about the linux common code shared by all archs. The reason this is
> called hpage_pmd_size is that it's a #define HPAGE_PMD_SIZE in the
> kernel code, so this exactly matches the kernel's internal
> _common_code_.

Hmm, ok. I'm guessing linux doesn't support anything other than "huge" and
"normal" page sizes now, so it's a question of whether we want it to expose
current implementation details, or say "Align big in-memory things this
much for optimal TLB behavior".

Paul
> > allocates it on a 2M boundary. I suspect you actually want
> > (base % 2M) == 1M. Aligning on a 1M boundary will only DTRT half the
> > time.
>
> The 1m-end case is a hypothetical worry that came to mind as I was
> discussing the issue with you. Basically my point is that if the pc.c
> code changes and qemu_ram_allocs the 0-640k and 1M-4G ranges with two
> separate calls (this is _not_ what qemu does right now), the alignment
> in qemu_ram_alloc that works right now would stop working.
>
> This is why I thought it may be more correct (and less
> virtual-ram-wasteful) to move the alignment into the caller, even if the
> patch grows in size and becomes pc.c specific (which it wouldn't need to
> be if other archs gain transparent hugepage support).
>
> I think with what you're saying above you're basically agreeing with me
> that I should move the alignment into the caller. Correct me if I
> misunderstood.

I don't think the target specific code should know or care about this.
Anthony recently proposed a different API for allocating guest RAM that
would potentially make some of this information available to common code.
However, that has significant issues once you try to use it for anything
other than the trivial PC machine. In particular, I don't believe it is
reasonable to assume RAM is always mapped at a fixed guest address.

Paul
On Fri, Mar 12, 2010 at 04:04:03PM +0000, Paul Brook wrote:
> > > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > > 2097152
>
> Hmm, ok. I'm guessing linux doesn't support anything other than "huge"
> and "normal" page sizes now, so it's a question of whether we want it to
> expose current implementation details, or say "Align big in-memory
> things this much for optimal TLB behavior".

hugetlbfs already exposes the implementation detail, so if you want that
it's already available. The whole point of going the extra mile with a
transparent solution is to avoid increasing userland complexity and to keep
it as unaware of hugepages as possible. The madvise hint basically means
"this won't risk wasting memory if you use a large tlb on this mapping" and
also "this mapping is more important than others to be backed by
hugepages". It's up to the kernel what to do next. For example, right now
khugepaged doesn't prioritize scanning the madvise regions first; it
basically doesn't matter for hypervisor solutions in the cloud (all anon
memory in the system is only allocated by kvm...). But later we may
prioritize it and try to be smarter with the hint given by userland.
> On Fri, Mar 12, 2010 at 04:04:03PM +0000, Paul Brook wrote:
> > > > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > > > 2097152
>
> > Hmm, ok. I'm guessing linux doesn't support anything other than "huge"
> > and "normal" page sizes now, so it's a question of whether we want it
> > to expose current implementation details, or say "Align big in-memory
> > things this much for optimal TLB behavior".
>
> hugetlbfs already exposes the implementation detail, so if you want that
> it's already available. The whole point of going the extra mile with a
> transparent solution is to avoid increasing userland complexity and to
> keep it as unaware of hugepages as possible. The madvise hint basically
> means "this won't risk wasting memory if you use a large tlb on this
> mapping" and also "this mapping is more important than others to be
> backed by hugepages". It's up to the kernel what to do next. For
> example, right now khugepaged doesn't prioritize scanning the madvise
> regions first; it basically doesn't matter for hypervisor solutions in
> the cloud (all anon memory in the system is only allocated by kvm...).
> But later we may prioritize it and try to be smarter with the hint given
> by userland.

So shouldn't [the name of] the value the kernel provides for recommended
alignment be equally implementation agnostic?

Paul
On Fri, Mar 12, 2010 at 04:24:24PM +0000, Paul Brook wrote:
> So shouldn't [the name of] the value the kernel provides for recommended
> alignment be equally implementation agnostic?

Is the /sys/kernel/mm/transparent_hugepage directory implementation
agnostic in the first place? This is not a black and white issue. The idea
of transparency is to have userland know as little as possible, but without
actually losing any feature (in fact gaining _more_!) compared to
hugetlbfs, which requires userland to set up the whole thing, lose paging,
lose ksm (yeah, it also loses ksm right now, but we'll fix that with
transparent hugepage support later), etc...

If we want to fully take advantage of the feature (i.e. NPT, and the first
2M of guest physical ram, where the kernel usually resides), userspace has
to know the alignment size the kernel recommends, and so this information
can't be implementation agnostic. In short, we do everything possible to
avoid changing userland, and what results is in fact a few-liner change,
but this few-liner change is required: be it a hint asking the kernel to
align, or posix_memalign (which is more efficient, as virtual memory is
cheaper than vmas IMHO).

The only thing I'm undecided about is whether this should be called
hpage_pmd_size or just hpage_size. Suppose amd/intel adds 64k pages next
year and the kernel decides to use them too when it fails to allocate a 2M
page. Then we escalate the fallback from 2M -> 64k -> 4k, and
HPAGE_PMD_SIZE becomes 64k. Still, qemu has to align on the max possible
hpage size provided by transparent hugepage. With this new reasoning I
think hpage_size or max_hpage_size would be a better sysfs name.

What do you think? hpage_size or max_hpage_size?
> > So shouldn't [the name of] the value the kernel provides for
> > recommended alignment be equally implementation agnostic?
>
> Is the /sys/kernel/mm/transparent_hugepage directory implementation
> agnostic in the first place?

It's about as agnostic as MADV_HUGEPAGE :-)

> If we want to fully take advantage of the feature (i.e. NPT, and the
> first 2M of guest physical ram, where the kernel usually resides),
> userspace has to know the alignment size the kernel recommends.

This is KVM specific, so my gut reaction is you should be asking KVM.

> The only thing I'm undecided about is whether this should be called
> hpage_pmd_size or just hpage_size. Suppose amd/intel adds 64k pages next
> year and the kernel decides to use them too when it fails to allocate a
> 2M page. Then we escalate the fallback from 2M -> 64k -> 4k, and
> HPAGE_PMD_SIZE becomes 64k. Still, qemu has to align on the max possible
> hpage size provided by transparent hugepage. With this new reasoning I
> think hpage_size or max_hpage_size would be a better sysfs name.

Agreed.

> hpage_size or max_hpage_size?

No particular preference. Or you could have .../page_sizes list all
available sizes, and have qemu take the first one (or the last, depending
on sort order).

Paul
On Fri, Mar 12, 2010 at 05:10:54PM +0000, Paul Brook wrote:
> > Is the /sys/kernel/mm/transparent_hugepage directory implementation
> > agnostic in the first place?
>
> It's about as agnostic as MADV_HUGEPAGE :-)

Exactly! ;) Again, it's not black and white... we expose a minimum so the
kernel can later do better for qemu.

> This is KVM specific, so my gut reaction is you should be asking KVM.

Hey, I just said this is also needed for qemu, so the first 2M of guest ram
can be backed by hugepages. It's not KVM specific anymore if we want to
optimize that bit too. The performance difference for a small system
running one qemu will be in the 2% range; for KVM it's from 6% to 15%
depending on the workload. But it's not kvm specific.

> No particular preference. Or you could have .../page_sizes list all
> available sizes, and have qemu take the first one (or the last, depending
> on sort order).

That would also work. Considering that the current transparent hugepage
support won't handle more than one page size, I think it's ok to call it
hpage_size; the fact that amd/intel will add a 64k page size is purely
hypothetical, and I guess by the time transparent hugepages are 1G in size,
the basic page will be 2M and 4k will be obsolete. But hey, if you prefer
page_sizes or max_page_size let me know... The semantics will be able to
stand for the long run: I'll write that hpage_size shall export the
alignment that userland should use with posix_memalign to be sure the whole
allocated space can be backed by hugepages.
> > No particular preference. Or you could have .../page_sizes list all
> > available sizes, and have qemu take the first one (or the last,
> > depending on sort order).
>
> That would also work. Considering that the current transparent hugepage
> support won't handle more than one page size, I think it's ok to call it
> hpage_size; the fact that amd/intel will add a 64k page size is purely
> hypothetical

It's only hypothetical on x86. Many other architectures already support
this (at least ARM, MIPS, IA64, SPARC).

Paul
On Fri, Mar 12, 2010 at 06:17:05PM +0000, Paul Brook wrote:
> It's only hypothetical on x86. Many other architectures already support
> this (at least ARM, MIPS, IA64, SPARC).

Hmm, the kernel won't support mixed page sizes right now, but ok, this is
irrelevant because the API should stand. I think we'll be lucky enough if
one HPAGE size is supported... ia64 won't be able to take advantage of it
either, because it can't mix different page sizes in the same vma, which is
a requirement for transparency and fallback to the regular page size. My
point is that there is no need to show the smaller page sizes to userland;
only the max one is relevant, and this isn't going to change. I'm
uncomfortable adding plural names to a patch that doesn't contemplate mixed
page sizes, and for the time being multiple hpage sizes aren't even on the
horizon... I hate APIs...

What if we defer this whole issue and just align on 2M if
/sys/kernel/mm/transparent_hugepage exists, without reading any size from
/sys? ;) If there's a way to define the host hpage size that will be
enough; 2M has been there since PAE was introduced and it's not going to
change overnight, so there's plenty of time later to add hpage_sizes...
> My point is that there is no need to show the smaller page sizes to
> userland; only the max one is relevant, and this isn't going to change.
> I'm uncomfortable adding plural names to a patch that doesn't
> contemplate mixed page sizes, and for the time being multiple hpage
> sizes aren't even on the horizon... I hate APIs...

I don't care enough to continue arguing about the utility of multiple page
sizes :-)

> What if we defer this whole issue and just align on 2M if
> /sys/kernel/mm/transparent_hugepage exists, without reading any size
> from /sys? ;)

Doesn't non-PAE (i.e. most 32-bit x86) use 4M huge pages? There's still a
good number of those knocking about.

Paul
On Fri, Mar 12, 2010 at 06:41:56PM +0000, Paul Brook wrote:
> Doesn't non-PAE (i.e. most 32-bit x86) use 4M huge pages? There's still
> a good number of those knocking about.

Yep, but a 32bit x86 host won't support transparent hugepage (certain bits
overflow in the kernel implementation; there's no more space in the page
struct). It's not worth it IMHO.
Andrea Arcangeli wrote:
> On Fri, Mar 12, 2010 at 06:41:56PM +0000, Paul Brook wrote:
> > Doesn't non-PAE (i.e. most 32-bit x86) use 4M huge pages? There's
> > still a good number of those knocking about.
>
> Yep, but a 32bit x86 host won't support transparent hugepage (certain
> bits overflow in the kernel implementation; there's no more space in the
> page struct). It's not worth it IMHO.

Perhaps the performance figures would be even more significant for 32-bit
x86 than 64-bit, due to the smaller TLBs. I guess that's not what you mean
by worth it?

-- Jamie
On 03/11/2010 06:05 PM, Andrea Arcangeli wrote:
> On Thu, Mar 11, 2010 at 05:52:16PM +0200, Avi Kivity wrote:
> > That is a little wasteful. How about a hint to mmap() requesting
> > proper alignment (MAP_HPAGE_ALIGN)?
>
> So you suggest adding a new kernel feature to mmap? Not sure if it's
> worth it, considering it'd also increase the number of vmas because it
> will have to leave a hole. Wasting 2M-4k of virtual memory is likely
> cheaper than having one more vma in the rbtree for every page fault. So
> I think it's better to just malloc and adjust ourselves to the next
> offset, which is done in userland by qemu_memalign I think.

Won't we get a new vma anyway due to the madvise() call later?

But I agree it isn't worth it.
On Sat, Mar 13, 2010 at 10:28:32AM +0200, Avi Kivity wrote:
> Won't we get a new vma anyway due to the madvise() call later?

As long as MADV_HUGEPAGE is set on all of them, the vmas will be merged
together. So if we do stuff like "alloc from 0 to 4G-4k" and then "alloc
from 4G to 8G", this will avoid a vma split. (Initially the mmap will
create a vma, but it'll be immediately removed by madvise with vma_merge.)

> But I agree it isn't worth it.

Two vmas or one vma isn't measurable, of course, but yes, the point is that
it's not worth it, because doing it in userland is theoretically better for
performance too.
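[Editorial note: the vma_merge behavior Andrea describes can be
sanity-checked from userland: apply MADV_HUGEPAGE to two adjacent halves of
one anonymous mapping and count the entries in /proc/self/maps. With
identical flags on both halves the kernel should merge them back into a
single vma. A rough standalone sketch of kernel behavior, not QEMU code:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4UL * 1024 * 1024;
    char *a = mmap(NULL, 2 * len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (a == MAP_FAILED)
        return 1;
#ifdef MADV_HUGEPAGE
    /* Two madvise calls with the same flag over adjacent halves: the
     * first splits the vma, the second lets vma_merge rejoin them. */
    madvise(a, len, MADV_HUGEPAGE);
    madvise(a + len, len, MADV_HUGEPAGE);
#endif
    /* Count how many /proc/self/maps entries overlap [a, a + 2*len). */
    char line[256];
    FILE *f = fopen("/proc/self/maps", "r");
    int vmas = 0;
    while (f && fgets(line, sizeof(line), f)) {
        unsigned long lo, hi;
        if (sscanf(line, "%lx-%lx", &lo, &hi) == 2 &&
            lo < (unsigned long)(a + 2 * len) && hi > (unsigned long)a)
            vmas++;
    }
    if (f)
        fclose(f);
    printf("vmas covering region: %d\n", vmas);  /* expect 1 */
    return 0;
}
]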
diff --git a/exec.c b/exec.c
index 891e0ee..aedd133 100644
--- a/exec.c
+++ b/exec.c
@@ -2628,11 +2628,25 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
                                  PROT_EXEC|PROT_READ|PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
 #else
-        new_block->host = qemu_vmalloc(size);
-#endif
+#if defined(__linux__) && defined(TARGET_HPAGE_BITS)
+        if (!kvm_enabled())
+#endif
+            new_block->host = qemu_vmalloc(size);
+#if defined(__linux__) && defined(TARGET_HPAGE_BITS)
+        else
+            /*
+             * Align on HPAGE_SIZE so "(gfn ^ pfn) &
+             * (HPAGE_SIZE-1) == 0" to allow KVM to take advantage
+             * of hugepages with NPT/EPT.
+             */
+            new_block->host = qemu_memalign(1 << TARGET_HPAGE_BITS, size);
+#endif
 #ifdef MADV_MERGEABLE
         madvise(new_block->host, size, MADV_MERGEABLE);
 #endif
+#ifdef MADV_HUGEPAGE
+        madvise(new_block->host, size, MADV_HUGEPAGE);
+#endif
     }
     new_block->offset = last_ram_offset;
     new_block->length = size;

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index ef7d951..26044eb 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -873,6 +873,7 @@ uint64_t cpu_get_tsc(CPUX86State *env);
 #define X86_DUMP_CCOP 0x0002 /* dump qemu flag cache */
 
 #define TARGET_PAGE_BITS 12
+#define TARGET_HPAGE_BITS (TARGET_PAGE_BITS+9)
 
 #define cpu_init cpu_x86_init
 #define cpu_exec cpu_x86_exec
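[Editorial note: pulling the thread's conclusions together (a runtime host
hugepage size instead of the target-dependent TARGET_HPAGE_BITS, alignment
for plain qemu as well as kvm, plus the madvise hint), a guest RAM
allocator might look like the following sketch. The helper name
alloc_guest_ram is hypothetical, host_hpage_size() is the probe sketched
earlier in the thread, and this is not the code that was merged:

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Hypothetical helper: probe the host hugepage size at runtime, align
 * the allocation on it per the hpage_size semantics Andrea proposes,
 * and advise the kernel. Falls back to page-size alignment when
 * transparent hugepage support is absent. */
static void *alloc_guest_ram(size_t size)
{
    long align = host_hpage_size();   /* sysfs probe sketched earlier */
    void *p = NULL;

    if (align <= 0)
        align = sysconf(_SC_PAGESIZE);
    if (posix_memalign(&p, align, size) != 0)
        return NULL;
#ifdef MADV_HUGEPAGE
    madvise(p, size, MADV_HUGEPAGE);  /* mark as important for hugepages */
#endif
    return p;
}
]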