Patchwork KVM: PPC: Book3S HV: Make the guest MMU hash table size configurable

Submitter Paul Mackerras
Date April 27, 2012, 3:55 a.m.
Message ID <20120427035545.GB1216@drongo>
Permalink /patch/155375/
State New

Comments

Paul Mackerras - April 27, 2012, 3:55 a.m.
At present, on powerpc with Book 3S HV KVM, the kernel allocates a
fixed-size MMU hashed page table (HPT) to store the hardware PTEs for
the guest.  The hash table is currently always 16MB in size, but this
is larger than necessary for small guests (i.e. those with less than
about 1GB of RAM) and too small for large guests.  Furthermore, there
is no way for userspace to clear it out when resetting the guest.

This adds a new ioctl to enable qemu to control the size of the guest
hash table, and to clear it out when resetting the guest.  The
KVM_PPC_ALLOCATE_HTAB ioctl is a VM ioctl and takes as its parameter a
pointer to a u32 containing the desired order of the HPT (log base 2
of the size in bytes), which is updated on successful return to the
actual order of the HPT which was allocated.

There must be no vcpus running at the time of this ioctl.  To enforce
this, we now keep a count of the number of vcpus running in
kvm->arch.vcpus_running.

If the ioctl is called when a HPT has already been allocated, we don't
reallocate the HPT but just clear it out.  We first clear the
kvm->arch.rma_setup_done flag, which has two effects: (a) since we hold
the kvm->lock mutex, it will prevent any vcpus from starting to run until
we're done, and (b) it means that the first vcpu to run after we're done
will re-establish the VRMA if necessary.

If userspace doesn't call this ioctl before running the first vcpu, the
kernel will allocate a default-sized HPT at that point.  We do it then
rather than when creating the VM, as the code did previously, so that
userspace has a chance to do the ioctl if it wants.

When allocating the HPT, we can allocate either from the kernel page
allocator, or from the preallocated pool.  If userspace is asking for
a different size from the preallocated HPTs, we first try to allocate
using the kernel page allocator.  Then we try to allocate from the
preallocated pool, and then if that fails, we try allocating decreasing
sizes from the kernel page allocator, down to the minimum size allowed
(256kB).

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 Documentation/virtual/kvm/api.txt        |   34 +++++++++
 arch/powerpc/include/asm/kvm_book3s_64.h |    7 +-
 arch/powerpc/include/asm/kvm_host.h      |    4 +
 arch/powerpc/include/asm/kvm_ppc.h       |    3 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  117 +++++++++++++++++++++++-------
 arch/powerpc/kvm/book3s_hv.c             |   40 +++++++---
 arch/powerpc/kvm/book3s_hv_builtin.c     |    4 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |   15 ++--
 arch/powerpc/kvm/powerpc.c               |   18 +++++
 include/linux/kvm.h                      |    3 +
 10 files changed, 191 insertions(+), 54 deletions(-)
Avi Kivity - April 29, 2012, 1:37 p.m.
On 04/27/2012 06:55 AM, Paul Mackerras wrote:
> At present, on powerpc with Book 3S HV KVM, the kernel allocates a
> fixed-size MMU hashed page table (HPT) to store the hardware PTEs for
> the guest.  The hash table is currently always 16MB in size, but this
> is larger than necessary for small guests (i.e. those with less than
> about 1GB of RAM) and too small for large guests.  Furthermore, there
> is no way for userspace to clear it out when resetting the guest.
>
> This adds a new ioctl to enable qemu to control the size of the guest
> hash table, and to clear it out when resetting the guest.  The
> KVM_PPC_ALLOCATE_HTAB ioctl is a VM ioctl and takes as its parameter a
> pointer to a u32 containing the desired order of the HPT (log base 2
> of the size in bytes), which is updated on successful return to the
> actual order of the HPT which was allocated.
>
> There must be no vcpus running at the time of this ioctl.  To enforce
> this, we now keep a count of the number of vcpus running in
> kvm->arch.vcpus_running.
>
> If the ioctl is called when a HPT has already been allocated, we don't
> reallocate the HPT but just clear it out.  We first clear the
> kvm->arch.rma_setup_done flag, which has two effects: (a) since we hold
> the kvm->lock mutex, it will prevent any vcpus from starting to run until
> we're done, and (b) it means that the first vcpu to run after we're done
> will re-establish the VRMA if necessary.
>
> If userspace doesn't call this ioctl before running the first vcpu, the
> kernel will allocate a default-sized HPT at that point.  We do it then
> rather than when creating the VM, as the code did previously, so that
> userspace has a chance to do the ioctl if it wants.
>
> When allocating the HPT, we can allocate either from the kernel page
> allocator, or from the preallocated pool.  If userspace is asking for
> a different size from the preallocated HPTs, we first try to allocate
> using the kernel page allocator.  Then we try to allocate from the
> preallocated pool, and then if that fails, we try allocating decreasing
> sizes from the kernel page allocator, down to the minimum size allowed
> (256kB).
>
>

How difficult is it to have the kernel resize the HPT on demand?  Guest
size is meaningless in the presence of memory hotplug, and having
unprivileged userspace pin down large amounts of kernel memory is
undesirable.

On x86 we grow and shrink the mmu resources in response to guest demand
and host memory pressure.  We can do this because the data structures
are not authoritative (don't know if that's the case for ppc) and
because they can be grown incrementally (pretty sure that isn't the case
on ppc).  Still, if we can do this at KVM_SET_USER_MEMORY_REGION time
instead of a separate ioctl, I think it's better.
Paul Mackerras - April 30, 2012, 4:40 a.m.
On Sun, Apr 29, 2012 at 04:37:33PM +0300, Avi Kivity wrote:

> How difficult is it to have the kernel resize the HPT on demand?

Quite difficult, unfortunately.  The guest kernel knows the size of
the HPT, and the paravirt interface for updating it relies on the
guest knowing it, since it is used in the hash function (the computed
hash is taken modulo the HPT size).

And even if it were possible to notify the guest that the size was
changing, since it is a hash table, changing the size requires
traversing the table to move hash entries to their new locations.
When reducing the size one only has to traverse the part that is going
away, but even that will be at least half of the table since the size
is always a power of 2.

>  Guest
> size is meaningless in the presence of memory hotplug, and having
> unprivileged userspace pin down large amounts of kernel memory us
> undesirable.

I agree.  The HPT is certainly not ideal.  However, it's what we have
to deal with on POWER hardware.

One idea I had is to reserve some contiguous physical memory at boot
time, say a couple of percent of system memory, and use that as a pool
to allocate HPTs from.  That would limit the impact on the rest of the
system and also make it more likely that we can find the necessary
amount of physically contiguous memory.

> On x86 we grow and shrink the mmu resources in response to guest demand
> and host memory pressure.  We can do this because the data structures
> are not authoritative (don't know if that's the case for ppc) and
> because they can be grown incrementally (pretty sure that isn't the case
> on ppc).  Still, if we can do this at KVM_SET_USER_MEMORY_REGION time
> instead of a separate ioctl, I think it's better.

It's not practical to grow the HPT after the guest has started
booting.  It is possible to have two HPTs: one that the guest sees,
which can be in pageable memory, and another shadow HPT that the
hardware uses, which has to be in physically contiguous memory.  In
this model the size of the shadow HPT can be changed at will, at the
expense of having to reestablish the entries in it, though that can be
done on demand.  I have avoided that approach until now because it
uses more memory and is slower than just having a single HPT.

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity - April 30, 2012, 8:31 a.m.
On 04/30/2012 07:40 AM, Paul Mackerras wrote:
> On Sun, Apr 29, 2012 at 04:37:33PM +0300, Avi Kivity wrote:
>
> > How difficult is it to have the kernel resize the HPT on demand?
>
> Quite difficult, unfortunately.  The guest kernel knows the size of
> the HPT, and the paravirt interface for updating it relies on the
> guest knowing it, since it is used in the hash function (the computed
> hash is taken modulo the HPT size).
>
> And even if it were possible to notify the guest that the size was
> changing, since it is a hash table, changing the size requires
> traversing the table to move hash entries to their new locations.
> When reducing the size one only has to traverse the part that is going
> away, but even that will be at least half of the table since the size
> is always a power of 2.

I'm no x86 fan but I'm glad we have nothing like that over there.

>
> >  Guest
> > size is meaningless in the presence of memory hotplug, and having
> > unprivileged userspace pin down large amounts of kernel memory is
> > undesirable.
>
> I agree.  The HPT is certainly not ideal.  However, it's what we have
> to deal with on POWER hardware.
>
> One idea I had is to reserve some contiguous physical memory at boot
> time, say a couple of percent of system memory, and use that as a pool
> to allocate HPTs from.  That would limit the impact on the rest of the
> system and also make it more likely that we can find the necessary
> amount of physically contiguous memory.

Doesn't that limit the number of guests that can run?

> > On x86 we grow and shrink the mmu resources in response to guest demand
> > and host memory pressure.  We can do this because the data structures
> > are not authoritative (don't know if that's the case for ppc) and
> > because they can be grown incrementally (pretty sure that isn't the case
> > on ppc).  Still, if we can do this at KVM_SET_USER_MEMORY_REGION time
> > instead of a separate ioctl, I think it's better.
>
> It's not practical to grow the HPT after the guest has started
> booting.  It is possible to have two HPTs: one that the guest sees,
> which can be in pageable memory, and another shadow HPT that the
> hardware uses, which has to be in physically contiguous memory.  In
> this model the size of the shadow HPT can be changed at will, at the
> expense of having to reestablish the entries in it, though that can be
> done on demand.  I have avoided that approach until now because it
> uses more memory and is slower than just having a single HPT.

This is similar to x86 in the pre npt/ept days, it's indeed slow.  I
guess we'll be stuck with the pv hash until you get nested lookups (at
least a nested hash lookup is just 3 accesses instead of 24).

How are limits managed?  Won't a user creating a thousand guests with a
16MB hash each bring a server to its knees?
Paul Mackerras - April 30, 2012, 11:54 a.m.
On Mon, Apr 30, 2012 at 11:31:42AM +0300, Avi Kivity wrote:
> On 04/30/2012 07:40 AM, Paul Mackerras wrote:
> > On Sun, Apr 29, 2012 at 04:37:33PM +0300, Avi Kivity wrote:
> >
> > > How difficult is it to have the kernel resize the HPT on demand?
> >
> > Quite difficult, unfortunately.  The guest kernel knows the size of
> > the HPT, and the paravirt interface for updating it relies on the
> > guest knowing it, since it is used in the hash function (the computed
> > hash is taken modulo the HPT size).
> >
> > And even if it were possible to notify the guest that the size was
> > changing, since it is a hash table, changing the size requires
> > traversing the table to move hash entries to their new locations.
> > When reducing the size one only has to traverse the part that is going
> > away, but even that will be at least half of the table since the size
> > is always a power of 2.
> 
> I'm no x86 fan but I'm glad we have nothing like that over there.

:)

> >
> > >  Guest
> > > size is meaningless in the presence of memory hotplug, and having
> > > unprivileged userspace pin down large amounts of kernel memory is
> > > undesirable.
> >
> > I agree.  The HPT is certainly not ideal.  However, it's what we have
> > to deal with on POWER hardware.
> >
> > One idea I had is to reserve some contiguous physical memory at boot
> > time, say a couple of percent of system memory, and use that as a pool
> > to allocate HPTs from.  That would limit the impact on the rest of the
> > system and also make it more likely that we can find the necessary
> > amount of physically contiguous memory.
> 
> Doesn't that limit the number of guests that can run?

It does, but so does the amount of physical memory in the host.  I
believe that with 2% to 3% of the host memory reserved for HPTs, we'll
run out of memory for the guests before we run out of HPTs (even with
KSM).

> > > On x86 we grow and shrink the mmu resources in response to guest demand
> > > and host memory pressure.  We can do this because the data structures
> > > are not authoritative (don't know if that's the case for ppc) and
> > > because they can be grown incrementally (pretty sure that isn't the case
> > > on ppc).  Still, if we can do this at KVM_SET_USER_MEMORY_REGION time
> > > instead of a separate ioctl, I think it's better.
> >
> > It's not practical to grow the HPT after the guest has started
> > booting.  It is possible to have two HPTs: one that the guest sees,
> > which can be in pageable memory, and another shadow HPT that the
> > hardware uses, which has to be in physically contiguous memory.  In
> > this model the size of the shadow HPT can be changed at will, at the
> > expense of having to reestablish the entries in it, though that can be
> > done on demand.  I have avoided that approach until now because it
> > uses more memory and is slower than just having a single HPT.
> 
> This is similar to x86 in the pre npt/ept days, it's indeed slow.  I
> guess we'll be stuck with the pv hash until you get nested lookups (at
> least a nested hash lookup is just 3 accesses instead of 24).

How do you get 24?  Naively I would have thought that with a 4-level
guest page table and a 4-level host page table you would get 16
accesses.  I have seen a research paper that shows that those accesses
can be cached really well, whereas accesses in a hash generally don't
cache well at all.

> How are limits managed?  Won't a user creating a thousand guests with a
> 16MB hash each bring a server to its knees?

Well, that depends on how much memory the server has.  In my
experience the limit seems to be about 300 to 400 guests on a POWER7
with 128GB of RAM; that's with each guest getting 0.5GB of RAM (about
the minimum needed to boot Fedora or RHEL successfully) and using KSM.
Beyond that it gets really short of memory and starts thrashing.  It
seems to be the guest memory that consumes the memory rather than the
HPTs, which are much smaller.  And for a 0.5GB guest, a 1MB HPT is
ample, so 1000 guests then only use up 1GB.  Part of the point of my
patch is to allow userspace to make the HPT be 1MB rather than 16MB
for small guests like these.

Paul.
Avi Kivity - April 30, 2012, 1:34 p.m.
On 04/30/2012 02:54 PM, Paul Mackerras wrote:
> > >
> > > It's not practical to grow the HPT after the guest has started
> > > booting.  It is possible to have two HPTs: one that the guest sees,
> > > which can be in pageable memory, and another shadow HPT that the
> > > hardware uses, which has to be in physically contiguous memory.  In
> > > this model the size of the shadow HPT can be changed at will, at the
> > > expense of having to reestablish the entries in it, though that can be
> > > done on demand.  I have avoided that approach until now because it
> > > uses more memory and is slower than just having a single HPT.
> > 
> > This is similar to x86 in the pre npt/ept days, it's indeed slow.  I
> > guess we'll be stuck with the pv hash until you get nested lookups (at
> > least a nested hash lookup is just 3 accesses instead of 24).
>
> How do you get 24?  Naively I would have thought that with a 4-level
> guest page table and a 4-level host page table you would get 16
> accesses.  

Each of the four guest ptes requires 4 host ptes accesses for
translation, plus one access to fetch the pte itself.  Finally the data
access itself needs 4 host ptes.  4*5+4 = 24 -- so we need 25 memory
accesses in a guest to fetch a word with an empty TLB, compared to 5 on
bare metal.

> I have seen a research paper that shows that those accesses
> can be cached really well, whereas accesses in a hash generally don't
> cache well at all.

Yes, generally the first three levels (on both guest and host) cache
well, plus there are intermediate TLB entries for them.  The last level
misses on large guests (since it occupies 0.2% of memory even with a
single mm_struct), which is why we (=Andrea) implemented transparent
huge pages that remove it completely.  Another thing that can't be done
on ppc IIUC.  Maybe you should talk to your hardware people.

> > How are limits managed?  Won't a user creating a thousand guests with a
> > 16MB hash each bring a server to its knees?
>
> Well, that depends on how much memory the server has.  In my
> experience the limit seems to be about 300 to 400 guests on a POWER7
> with 128GB of RAM; that's with each guest getting 0.5GB of RAM (about
> the minimum needed to boot Fedora or RHEL successfully) and using KSM.
> Beyond that it gets really short of memory and starts thrashing.  It
> seems to be the guest memory that consumes the memory rather than the
> HPTs, which are much smaller.  And for a 0.5GB guest, a 1MB HPT is
> ample, so 1000 guests then only use up 1GB.  Part of the point of my
> patch is to allow userspace to make the HPT be 1MB rather than 16MB
> for small guests like these.
>

Okay, sounds reasonable for the constraints you have.

I guess you can still make the sizing automatic by deferring hash table
allocation until the first KVM_RUN (when you know the size of guest
memory) but that leads to awkward locking and doesn't mesh well with
memory hotplug.
Paul Mackerras - May 1, 2012, 9:49 p.m.
On Mon, Apr 30, 2012 at 04:34:33PM +0300, Avi Kivity wrote:

> Okay, sounds reasonable for the constraints you have.
> 
> I guess you can still make the sizing automatic by deferring hash table
> allocation until the first KVM_RUN (when you know the size of guest
> memory) but that leads to awkward locking and doesn't mesh well with
> memory hotplug.

The trouble with waiting until the first KVM_RUN is that we have to
tell the guest the size of the hash table via the device tree, which
userspace has placed in guest memory.  The kernel doesn't have any
direct knowledge of the location or contents of the device tree.
That's why I need an ioctl to enable the kernel and userspace to come
to agreement about the size, so that userspace can then put the size
in the device tree.

Paul.
Alexander Graf - May 2, 2012, 12:52 p.m.
On 04/27/2012 05:55 AM, Paul Mackerras wrote:
> At present, on powerpc with Book 3S HV KVM, the kernel allocates a
> fixed-size MMU hashed page table (HPT) to store the hardware PTEs for
> the guest.  The hash table is currently always 16MB in size, but this
> is larger than necessary for small guests (i.e. those with less than
> about 1GB of RAM) and too small for large guests.  Furthermore, there
> is no way for userspace to clear it out when resetting the guest.
>
> This adds a new ioctl to enable qemu to control the size of the guest
> hash table, and to clear it out when resetting the guest.  The
> KVM_PPC_ALLOCATE_HTAB ioctl is a VM ioctl and takes as its parameter a
> pointer to a u32 containing the desired order of the HPT (log base 2
> of the size in bytes), which is updated on successful return to the
> actual order of the HPT which was allocated.
>
> There must be no vcpus running at the time of this ioctl.  To enforce
> this, we now keep a count of the number of vcpus running in
> kvm->arch.vcpus_running.
>
> If the ioctl is called when a HPT has already been allocated, we don't
> reallocate the HPT but just clear it out.  We first clear the
> kvm->arch.rma_setup_done flag, which has two effects: (a) since we hold
> the kvm->lock mutex, it will prevent any vcpus from starting to run until
> we're done, and (b) it means that the first vcpu to run after we're done
> will re-establish the VRMA if necessary.
>
> If userspace doesn't call this ioctl before running the first vcpu, the
> kernel will allocate a default-sized HPT at that point.  We do it then
> rather than when creating the VM, as the code did previously, so that
> userspace has a chance to do the ioctl if it wants.
>
> When allocating the HPT, we can allocate either from the kernel page
> allocator, or from the preallocated pool.  If userspace is asking for
> a different size from the preallocated HPTs, we first try to allocate
> using the kernel page allocator.  Then we try to allocate from the
> preallocated pool, and then if that fails, we try allocating decreasing
> sizes from the kernel page allocator, down to the minimum size allowed
> (256kB).
>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
>   Documentation/virtual/kvm/api.txt        |   34 +++++++++
>   arch/powerpc/include/asm/kvm_book3s_64.h |    7 +-
>   arch/powerpc/include/asm/kvm_host.h      |    4 +
>   arch/powerpc/include/asm/kvm_ppc.h       |    3 +-
>   arch/powerpc/kvm/book3s_64_mmu_hv.c      |  117 +++++++++++++++++++++++-------
>   arch/powerpc/kvm/book3s_hv.c             |   40 +++++++---
>   arch/powerpc/kvm/book3s_hv_builtin.c     |    4 +-
>   arch/powerpc/kvm/book3s_hv_rm_mmu.c      |   15 ++--
>   arch/powerpc/kvm/powerpc.c               |   18 +++++
>   include/linux/kvm.h                      |    3 +
>   10 files changed, 191 insertions(+), 54 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 81ff39f..3629d70 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1689,6 +1689,40 @@ where the guest will clear the flag: when the soft lockup watchdog timer resets
>   itself or when a soft lockup is detected.  This ioctl can be called any time
>   after pausing the vcpu, but before it is resumed.
>
> +4.71 KVM_PPC_ALLOCATE_HTAB
> +
> +Capability: KVM_CAP_PPC_ALLOC_HTAB
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: Pointer to u32 containing hash table order (in/out)
> +Returns: 0 on success, -1 on error
> +
> +This requests the host kernel to allocate an MMU hash table for a
> +guest using the PAPR paravirtualization interface.  This only does
> +anything if the kernel is configured to use the Book 3S HV style of
> +virtualization.  Otherwise the capability doesn't exist and the ioctl
> +returns an ENOTTY error.  The rest of this description assumes Book 3S
> +HV.
> +
> +There must be no vcpus running when this ioctl is called; if there
> +are, it will do nothing and return an EBUSY error.
> +
> +The parameter is a pointer to a 32-bit unsigned integer variable
> +containing the order (log base 2) of the desired size of the hash
> +table, which must be between 18 and 46.  On successful return from the
> +ioctl, it will have been updated with the order of the hash table that
> +was allocated.
> +
> +If no hash table has been allocated when any vcpu is asked to run
> +(with the KVM_RUN ioctl), the host kernel will allocate a
> +default-sized hash table (16 MB).
> +
> +If this ioctl is called when a hash table has already been allocated,
> +the kernel will clear out the existing hash table (zero all HPTEs) and
> +return the hash table order in the parameter.  (If the guest is using
> +the virtualized real-mode area (VRMA) facility, the kernel will
> +re-create the VRMA HPTEs on the next KVM_RUN of any vcpu.)
> +
>   5. The kvm_run structure
>
>   Application code obtains a pointer to the kvm_run structure by
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
> index b0c08b1..0dd1d86 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -36,11 +36,8 @@ static inline void svcpu_put(struct kvmppc_book3s_shadow_vcpu *svcpu)
>   #define SPAPR_TCE_SHIFT		12
>
>   #ifdef CONFIG_KVM_BOOK3S_64_HV
> -/* For now use fixed-size 16MB page table */
> -#define HPT_ORDER	24
> -#define HPT_NPTEG	(1ul << (HPT_ORDER - 7))	/* 128B per pteg */
> -#define HPT_NPTE	(HPT_NPTEG << 3)		/* 8 PTEs per PTEG */
> -#define HPT_HASH_MASK	(HPT_NPTEG - 1)
> +#define KVM_DEFAULT_HPT_ORDER	24	/* 16MB HPT by default */
> +extern int kvm_hpt_order;		/* order of preallocated HPTs */
>   #endif
>
>   #define VRMA_VSID	0x1ffffffUL	/* 1TB VSID reserved for VRMA */
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 42a527e..78f2e27 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -237,6 +237,10 @@ struct kvm_arch {
>   	unsigned long vrma_slb_v;
>   	int rma_setup_done;
>   	int using_mmu_notifiers;
> +	u32 hpt_order;
> +	atomic_t vcpus_running;
> +	unsigned long hpt_npte;
> +	unsigned long hpt_mask;
>   	struct list_head spapr_tce_tables;
>   	spinlock_t slot_phys_lock;
>   	unsigned long *slot_phys[KVM_MEM_SLOTS_NUM];
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 7f0a3da..fee7150 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -117,7 +117,8 @@ extern void kvmppc_core_destroy_mmu(struct kvm_vcpu *vcpu);
>   extern int kvmppc_kvm_pv(struct kvm_vcpu *vcpu);
>   extern void kvmppc_map_magic(struct kvm_vcpu *vcpu);
>
> -extern long kvmppc_alloc_hpt(struct kvm *kvm);
> +extern long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp);
> +extern long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp);
>   extern void kvmppc_free_hpt(struct kvm *kvm);
>   extern long kvmppc_prepare_vrma(struct kvm *kvm,
>   				struct kvm_userspace_memory_region *mem);
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 8e6401f..ac669a1 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -37,56 +37,115 @@
>   /* POWER7 has 10-bit LPIDs, PPC970 has 6-bit LPIDs */
>   #define MAX_LPID_970	63
>
> -long kvmppc_alloc_hpt(struct kvm *kvm)
> +/* Power architecture requires HPT is at least 256kB */
> +#define PPC_MIN_HPT_ORDER	18
> +
> +long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
>   {
>   	unsigned long hpt;
> -	long lpid;
>   	struct revmap_entry *rev;
>   	struct kvmppc_linear_info *li;
> +	long order = kvm_hpt_order;
>
> -	/* Allocate guest's hashed page table */
> -	li = kvm_alloc_hpt();
> -	if (li) {
> -		/* using preallocated memory */
> -		hpt = (ulong)li->base_virt;
> -		kvm->arch.hpt_li = li;
> -	} else {
> -		/* using dynamic memory */
> +	if (htab_orderp) {
> +		order = *htab_orderp;
> +		if (order < PPC_MIN_HPT_ORDER)
> +			order = PPC_MIN_HPT_ORDER;
> +	}
> +
> +	/*
> +	 * If the user wants a different size from default,
> +	 * try first to allocate it from the kernel page allocator.
> +	 */
> +	hpt = 0;
> +	if (order != kvm_hpt_order) {
>   		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|
> -				       __GFP_NOWARN, HPT_ORDER - PAGE_SHIFT);
> +				       __GFP_NOWARN, order - PAGE_SHIFT);

I like the idea quite a lot, but shouldn't we set, through a kernel module
option, a maximum threshold on the memory a user space program can pin this way?

> +		if (!hpt)
> +			--order;
>   	}
>
> +	/* Next try to allocate from the preallocated pool */
>   	if (!hpt) {
> -		pr_err("kvm_alloc_hpt: Couldn't alloc HPT\n");
> -		return -ENOMEM;
> +		li = kvm_alloc_hpt();
> +		if (li) {
> +			hpt = (ulong)li->base_virt;
> +			kvm->arch.hpt_li = li;
> +			order = kvm_hpt_order;
> +		}
>   	}
> +
> +	/* Lastly try successively smaller sizes from the page allocator */
> +	while (!hpt && order > PPC_MIN_HPT_ORDER) {
> +		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|
> +				       __GFP_NOWARN, order - PAGE_SHIFT);
> +		if (!hpt)
> +			--order;
> +	}
> +
> +	if (!hpt)
> +		return -ENOMEM;
> +
>   	kvm->arch.hpt_virt = hpt;
> +	kvm->arch.hpt_order = order;
> +	/* HPTEs are 2**4 bytes long */
> +	kvm->arch.hpt_npte = 1ul << (order - 4);
> +	/* 128 (2**7) bytes in each HPTEG */
> +	kvm->arch.hpt_mask = (1ul << (order - 7)) - 1;
>
>   	/* Allocate reverse map array */
> -	rev = vmalloc(sizeof(struct revmap_entry) * HPT_NPTE);
> +	rev = vmalloc(sizeof(struct revmap_entry) * kvm->arch.hpt_npte);
>   	if (!rev) {
>   		pr_err("kvmppc_alloc_hpt: Couldn't alloc reverse map array\n");
>   		goto out_freehpt;
>   	}
>   	kvm->arch.revmap = rev;
> +	kvm->arch.sdr1 = __pa(hpt) | (order - 18);
>
> -	lpid = kvmppc_alloc_lpid();
> -	if (lpid < 0)
> -		goto out_freeboth;
> -
> -	kvm->arch.sdr1 = __pa(hpt) | (HPT_ORDER - 18);
> -	kvm->arch.lpid = lpid;
> +	pr_info("KVM guest htab at %lx (order %ld), LPID %x\n",
> +		hpt, order, kvm->arch.lpid);
>
> -	pr_info("KVM guest htab at %lx, LPID %lx\n", hpt, lpid);
> +	if (htab_orderp)
> +		*htab_orderp = order;
>   	return 0;
>
> - out_freeboth:
> -	vfree(rev);
>    out_freehpt:
> -	free_pages(hpt, HPT_ORDER - PAGE_SHIFT);
> +	if (kvm->arch.hpt_li)
> +		kvm_release_hpt(kvm->arch.hpt_li);
> +	else
> +		free_pages(hpt, order - PAGE_SHIFT);
>   	return -ENOMEM;
>   }
>
> +long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
> +{
> +	long err = -EBUSY;
> +	long order;
> +
> +	mutex_lock(&kvm->lock);
> +	if (kvm->arch.rma_setup_done) {
> +		kvm->arch.rma_setup_done = 0;
> +		/* order rma_setup_done vs. vcpus_running */
> +		smp_mb();
> +		if (atomic_read(&kvm->arch.vcpus_running)) {

Is this safe? What if we're running one vcpu thread and one hpt reset 
thread. The vcpu thread starts, is just before the vcpu_run function, 
the reset thread checks if anything is running, nothing is running, the 
vcpu thread goes on to do its thing and boom things break.

But then again, what's the problem with having vcpus running while we're 
clearing the HPT? Stale TLB entries?


Alex

Paul Mackerras - May 2, 2012, 11:49 p.m.
On Wed, May 02, 2012 at 02:52:14PM +0200, Alexander Graf wrote:

> >+	/*
> >+	 * If the user wants a different size from default,
> >+	 * try first to allocate it from the kernel page allocator.
> >+	 */
> >+	hpt = 0;
> >+	if (order != kvm_hpt_order) {
> >  		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|
> >-				       __GFP_NOWARN, HPT_ORDER - PAGE_SHIFT);
> >+				       __GFP_NOWARN, order - PAGE_SHIFT);
> 
> I like the idea quite a lot, but shouldn't we set a max memory
> threshold through a kernel module option that a user space program
> could pin this way?

The page allocator will hand out at most 16MB with the default setting
of CONFIG_FORCE_MAX_ZONEORDER.  Any larger request will fail, and
userspace can only do one allocation per VM.

Of course, userspace can start hundreds of VMs and use up memory that
way - but that is exactly the same situation as we have already, where
we allocate 16MB per VM at VM creation time, so this patch makes
things no worse in that regard.
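
For readers following along, the fallback strategy in kvmppc_alloc_hpt() — try the requested order in the page allocator, fall back to the preallocated 16 MB pool, then retry successively smaller orders down to the architectural minimum — can be sketched in user space. This is a simplified mirror only; the `avail` bitmask and `pool_free` flag are stand-ins invented here for the kernel's `__get_free_pages()` and `kvm_alloc_hpt()`, not real interfaces:

```c
#include <assert.h>
#include <stdint.h>

#define PPC_MIN_HPT_ORDER 18   /* 256 kB architectural minimum */
#define KVM_DEF_HPT_ORDER 24   /* 16 MB default; also the preallocated pool size */

/* Returns the order actually obtained, or -1 (the kernel returns -ENOMEM).
 * Bit n of 'avail' set means an order-n page allocation would succeed. */
static int pick_hpt_order(int requested, uint64_t avail, int pool_free)
{
	int order = requested < PPC_MIN_HPT_ORDER ? PPC_MIN_HPT_ORDER : requested;

	/* 1. A non-default size goes to the page allocator first. */
	if (order != KVM_DEF_HPT_ORDER) {
		if (avail & (1ULL << order))
			return order;
		--order;	/* failed: step down before the later retries */
	}

	/* 2. Fall back to the preallocated pool (always default-sized). */
	if (pool_free)
		return KVM_DEF_HPT_ORDER;

	/* 3. Retry successively smaller sizes from the page allocator. */
	for (; order >= PPC_MIN_HPT_ORDER; --order)
		if (avail & (1ULL << order))
			return order;

	return -1;
}
```

Note how a failed oversize request (say order 26) still degrades gracefully: the pool is tried next, and only then do the smaller page-allocator sizes come into play.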

> >+long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
> >+{
> >+	long err = -EBUSY;
> >+	long order;
> >+
> >+	mutex_lock(&kvm->lock);
> >+	if (kvm->arch.rma_setup_done) {
> >+		kvm->arch.rma_setup_done = 0;
> >+		/* order rma_setup_done vs. vcpus_running */
> >+		smp_mb();
> >+		if (atomic_read(&kvm->arch.vcpus_running)) {
> 
> Is this safe? What if we're running one vcpu thread and one hpt
> reset thread. The vcpu thread starts, is just before the vcpu_run
> function, the reset thread checks if anything is running, nothing is
> running, the vcpu thread goes on to do its thing and boom things
> break.

No - in the situation you describe, the vcpu thread will see that
kvm->arch.rma_setup_done is clear and call kvmppc_hv_setup_htab_rma(),
where the first thing it does is mutex_lock(&kvm->lock), and it sits
there until the reset thread has finished and unlocked the
mutex.  The vcpu thread has to see rma_setup_done clear since the
reset thread clears it and then has a barrier, before testing
vcpus_running.

If on the other hand the vcpu thread gets there first and sees that
kvm->arch.rma_setup_done is still set, then the reset thread must see
that kvm->arch.vcpus_running is non-zero, since the vcpu thread first
increments vcpus_running and then tests rma_setup_done, with a barrier
in between.
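
The handshake Paul describes is the classic store-then-load pattern: each side stores its own flag, executes a full barrier, then loads the other side's flag, so at least one of them is guaranteed to observe the other. A user-space sketch with C11 atomics (atomic_thread_fence() standing in for smp_mb(); the kvm->lock mutex that serializes the actual setup is elided):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified analogs of the two shared fields. */
static atomic_int rma_setup_done = ATOMIC_VAR_INIT(1);
static atomic_int vcpus_running  = ATOMIC_VAR_INIT(0);

/* Reset side: clear rma_setup_done, full barrier, check vcpus_running.
 * Returns true if the reset may proceed to clear the HPT. */
static bool reset_may_proceed(void)
{
	atomic_store(&rma_setup_done, 0);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() analog */
	if (atomic_load(&vcpus_running) != 0) {
		atomic_store(&rma_setup_done, 1);	/* back off: EBUSY path */
		return false;
	}
	return true;
}

/* Vcpu side: bump vcpus_running, full barrier, check rma_setup_done.
 * Returns true if setup must be (re)done under the mutex before running. */
static bool vcpu_needs_setup(void)
{
	atomic_fetch_add(&vcpus_running, 1);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load(&rma_setup_done) == 0;
}
```

Whichever interleaving occurs, either the reset thread sees a nonzero vcpus_running and bails out, or the vcpu sees rma_setup_done cleared and serializes on the mutex — the lost-wakeup scenario in the question above cannot happen.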

> But then again, what's the problem with having vcpus running while
> we're clearing the HPT? Stale TLB entries?

If a vcpu is running, it could possibly call kvmppc_h_enter() to set a
HPTE at the exact same time that we are clearing that HPTE, and we
could end up with inconsistent results.  I didn't want to have to add
extra locking in kvmppc_h_enter() to make sure that couldn't happen,
since kvmppc_h_enter() is quite a hot path.  Instead I added a little
bit of overhead on the vcpu entry/exit, since that happens much more
rarely.  In any case, if we're resetting the VM, there shouldn't be
any vcpus running.

Your comment has however reminded me that I need to arrange for a full
TLB flush to occur the first time we run each vcpu after the reset.
I'll do a new version that includes that.

Paul.
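
As background for the API text below: the kernel only enforces the 18–46 order bounds, so userspace must pick an order itself before issuing the KVM_PPC_ALLOCATE_HTAB ioctl. A hedged sketch of one plausible sizing heuristic — the 1/128-of-RAM ratio is an assumption here, commonly used by userspace, not part of the kernel interface:

```c
#include <assert.h>
#include <stdint.h>

/* Bounds from the KVM_PPC_ALLOCATE_HTAB documentation. */
#define HPT_ORDER_MIN 18	/* 256 kB */
#define HPT_ORDER_MAX 46

/* Pick a hash table order for a guest of the given RAM size: the smallest
 * power of two at least ram_bytes / 128, clamped to the legal range. */
static uint32_t hpt_order_for_ram(uint64_t ram_bytes)
{
	uint64_t target = ram_bytes / 128;
	uint32_t order = HPT_ORDER_MIN;

	while (order < HPT_ORDER_MAX && (1ULL << order) < target)
		++order;
	return order;
}
```

The resulting value would be passed by pointer to the VM ioctl and read back afterwards, since the kernel may return a smaller order than requested if allocation fails at the desired size.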

Patch

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 81ff39f..3629d70 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1689,6 +1689,40 @@  where the guest will clear the flag: when the soft lockup watchdog timer resets
 itself or when a soft lockup is detected.  This ioctl can be called any time
 after pausing the vcpu, but before it is resumed.
 
+4.71 KVM_PPC_ALLOCATE_HTAB
+
+Capability: KVM_CAP_PPC_ALLOC_HTAB
+Architectures: powerpc
+Type: vm ioctl
+Parameters: Pointer to u32 containing hash table order (in/out)
+Returns: 0 on success, -1 on error
+
+This requests the host kernel to allocate an MMU hash table for a
+guest using the PAPR paravirtualization interface.  This only does
+anything if the kernel is configured to use the Book 3S HV style of
+virtualization.  Otherwise the capability doesn't exist and the ioctl
+returns an ENOTTY error.  The rest of this description assumes Book 3S
+HV.
+
+There must be no vcpus running when this ioctl is called; if there
+are, it will do nothing and return an EBUSY error.
+
+The parameter is a pointer to a 32-bit unsigned integer variable
+containing the order (log base 2) of the desired size of the hash
+table, which must be between 18 and 46.  On successful return from the
+ioctl, it will have been updated with the order of the hash table that
+was allocated.
+
+If no hash table has been allocated when any vcpu is asked to run
+(with the KVM_RUN ioctl), the host kernel will allocate a
+default-sized hash table (16 MB).
+
+If this ioctl is called when a hash table has already been allocated,
+the kernel will clear out the existing hash table (zero all HPTEs) and
+return the hash table order in the parameter.  (If the guest is using
+the virtualized real-mode area (VRMA) facility, the kernel will
+re-create the VRMA HPTEs on the next KVM_RUN of any vcpu.)
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index b0c08b1..0dd1d86 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -36,11 +36,8 @@  static inline void svcpu_put(struct kvmppc_book3s_shadow_vcpu *svcpu)
 #define SPAPR_TCE_SHIFT		12
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
-/* For now use fixed-size 16MB page table */
-#define HPT_ORDER	24
-#define HPT_NPTEG	(1ul << (HPT_ORDER - 7))	/* 128B per pteg */
-#define HPT_NPTE	(HPT_NPTEG << 3)		/* 8 PTEs per PTEG */
-#define HPT_HASH_MASK	(HPT_NPTEG - 1)
+#define KVM_DEFAULT_HPT_ORDER	24	/* 16MB HPT by default */
+extern int kvm_hpt_order;		/* order of preallocated HPTs */
 #endif
 
 #define VRMA_VSID	0x1ffffffUL	/* 1TB VSID reserved for VRMA */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 42a527e..78f2e27 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -237,6 +237,10 @@  struct kvm_arch {
 	unsigned long vrma_slb_v;
 	int rma_setup_done;
 	int using_mmu_notifiers;
+	u32 hpt_order;
+	atomic_t vcpus_running;
+	unsigned long hpt_npte;
+	unsigned long hpt_mask;
 	struct list_head spapr_tce_tables;
 	spinlock_t slot_phys_lock;
 	unsigned long *slot_phys[KVM_MEM_SLOTS_NUM];
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 7f0a3da..fee7150 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -117,7 +117,8 @@  extern void kvmppc_core_destroy_mmu(struct kvm_vcpu *vcpu);
 extern int kvmppc_kvm_pv(struct kvm_vcpu *vcpu);
 extern void kvmppc_map_magic(struct kvm_vcpu *vcpu);
 
-extern long kvmppc_alloc_hpt(struct kvm *kvm);
+extern long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp);
+extern long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp);
 extern void kvmppc_free_hpt(struct kvm *kvm);
 extern long kvmppc_prepare_vrma(struct kvm *kvm,
 				struct kvm_userspace_memory_region *mem);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 8e6401f..ac669a1 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -37,56 +37,115 @@ 
 /* POWER7 has 10-bit LPIDs, PPC970 has 6-bit LPIDs */
 #define MAX_LPID_970	63
 
-long kvmppc_alloc_hpt(struct kvm *kvm)
+/* The Power architecture requires the HPT to be at least 256 kB */
+#define PPC_MIN_HPT_ORDER	18
+
+long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 {
 	unsigned long hpt;
-	long lpid;
 	struct revmap_entry *rev;
 	struct kvmppc_linear_info *li;
+	long order = kvm_hpt_order;
 
-	/* Allocate guest's hashed page table */
-	li = kvm_alloc_hpt();
-	if (li) {
-		/* using preallocated memory */
-		hpt = (ulong)li->base_virt;
-		kvm->arch.hpt_li = li;
-	} else {
-		/* using dynamic memory */
+	if (htab_orderp) {
+		order = *htab_orderp;
+		if (order < PPC_MIN_HPT_ORDER)
+			order = PPC_MIN_HPT_ORDER;
+	}
+
+	/*
+	 * If the user wants a different size from default,
+	 * try first to allocate it from the kernel page allocator.
+	 */
+	hpt = 0;
+	if (order != kvm_hpt_order) {
 		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|
-				       __GFP_NOWARN, HPT_ORDER - PAGE_SHIFT);
+				       __GFP_NOWARN, order - PAGE_SHIFT);
+		if (!hpt)
+			--order;
 	}
 
+	/* Next try to allocate from the preallocated pool */
 	if (!hpt) {
-		pr_err("kvm_alloc_hpt: Couldn't alloc HPT\n");
-		return -ENOMEM;
+		li = kvm_alloc_hpt();
+		if (li) {
+			hpt = (ulong)li->base_virt;
+			kvm->arch.hpt_li = li;
+			order = kvm_hpt_order;
+		}
 	}
+
+	/* Lastly try successively smaller sizes from the page allocator */
+	while (!hpt && order > PPC_MIN_HPT_ORDER) {
+		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|
+				       __GFP_NOWARN, order - PAGE_SHIFT);
+		if (!hpt)
+			--order;
+	}
+
+	if (!hpt)
+		return -ENOMEM;
+
 	kvm->arch.hpt_virt = hpt;
+	kvm->arch.hpt_order = order;
+	/* HPTEs are 2**4 bytes long */
+	kvm->arch.hpt_npte = 1ul << (order - 4);
+	/* 128 (2**7) bytes in each HPTEG */
+	kvm->arch.hpt_mask = (1ul << (order - 7)) - 1;
 
 	/* Allocate reverse map array */
-	rev = vmalloc(sizeof(struct revmap_entry) * HPT_NPTE);
+	rev = vmalloc(sizeof(struct revmap_entry) * kvm->arch.hpt_npte);
 	if (!rev) {
 		pr_err("kvmppc_alloc_hpt: Couldn't alloc reverse map array\n");
 		goto out_freehpt;
 	}
 	kvm->arch.revmap = rev;
+	kvm->arch.sdr1 = __pa(hpt) | (order - 18);
 
-	lpid = kvmppc_alloc_lpid();
-	if (lpid < 0)
-		goto out_freeboth;
-
-	kvm->arch.sdr1 = __pa(hpt) | (HPT_ORDER - 18);
-	kvm->arch.lpid = lpid;
+	pr_info("KVM guest htab at %lx (order %ld), LPID %x\n",
+		hpt, order, kvm->arch.lpid);
 
-	pr_info("KVM guest htab at %lx, LPID %lx\n", hpt, lpid);
+	if (htab_orderp)
+		*htab_orderp = order;
 	return 0;
 
- out_freeboth:
-	vfree(rev);
  out_freehpt:
-	free_pages(hpt, HPT_ORDER - PAGE_SHIFT);
+	if (kvm->arch.hpt_li)
+		kvm_release_hpt(kvm->arch.hpt_li);
+	else
+		free_pages(hpt, order - PAGE_SHIFT);
 	return -ENOMEM;
 }
 
+long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
+{
+	long err = -EBUSY;
+	long order;
+
+	mutex_lock(&kvm->lock);
+	if (kvm->arch.rma_setup_done) {
+		kvm->arch.rma_setup_done = 0;
+		/* order rma_setup_done vs. vcpus_running */
+		smp_mb();
+		if (atomic_read(&kvm->arch.vcpus_running)) {
+			kvm->arch.rma_setup_done = 1;
+			goto out;
+		}
+	}
+	if (kvm->arch.hpt_virt) {
+		order = kvm->arch.hpt_order;
+		memset((void *)kvm->arch.hpt_virt, 0, 1ul << order);
+		*htab_orderp = order;
+		err = 0;
+	} else {
+		err = kvmppc_alloc_hpt(kvm, htab_orderp);
+		order = *htab_orderp;
+	}
+ out:
+	mutex_unlock(&kvm->lock);
+	return err;
+}
+
 void kvmppc_free_hpt(struct kvm *kvm)
 {
 	kvmppc_free_lpid(kvm->arch.lpid);
@@ -94,7 +153,8 @@  void kvmppc_free_hpt(struct kvm *kvm)
 	if (kvm->arch.hpt_li)
 		kvm_release_hpt(kvm->arch.hpt_li);
 	else
-		free_pages(kvm->arch.hpt_virt, HPT_ORDER - PAGE_SHIFT);
+		free_pages(kvm->arch.hpt_virt,
+			   kvm->arch.hpt_order - PAGE_SHIFT);
 }
 
 /* Bits in first HPTE dword for pagesize 4k, 64k or 16M */
@@ -119,6 +179,7 @@  void kvmppc_map_vrma(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
 	unsigned long psize;
 	unsigned long hp0, hp1;
 	long ret;
+	struct kvm *kvm = vcpu->kvm;
 
 	psize = 1ul << porder;
 	npages = memslot->npages >> (porder - PAGE_SHIFT);
@@ -127,8 +188,8 @@  void kvmppc_map_vrma(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
 	if (npages > 1ul << (40 - porder))
 		npages = 1ul << (40 - porder);
 	/* Can't use more than 1 HPTE per HPTEG */
-	if (npages > HPT_NPTEG)
-		npages = HPT_NPTEG;
+	if (npages > kvm->arch.hpt_mask + 1)
+		npages = kvm->arch.hpt_mask + 1;
 
 	hp0 = HPTE_V_1TB_SEG | (VRMA_VSID << (40 - 16)) |
 		HPTE_V_BOLTED | hpte0_pgsize_encoding(psize);
@@ -138,7 +199,7 @@  void kvmppc_map_vrma(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
 	for (i = 0; i < npages; ++i) {
 		addr = i << porder;
 		/* can't use hpt_hash since va > 64 bits */
-		hash = (i ^ (VRMA_VSID ^ (VRMA_VSID << 25))) & HPT_HASH_MASK;
+		hash = (i ^ (VRMA_VSID ^ (VRMA_VSID << 25))) & kvm->arch.hpt_mask;
 		/*
 		 * We assume that the hash table is empty and no
 		 * vcpus are using it at this stage.  Since we create
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9079357..f8d477b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -56,7 +56,7 @@ 
 /* #define EXIT_DEBUG_INT */
 
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
-static int kvmppc_hv_setup_rma(struct kvm_vcpu *vcpu);
+static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
@@ -1068,11 +1068,15 @@  int kvmppc_vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu)
 		return -EINTR;
 	}
 
-	/* On the first time here, set up VRMA or RMA */
+	atomic_inc(&vcpu->kvm->arch.vcpus_running);
+	/* Order vcpus_running vs. rma_setup_done, see kvmppc_alloc_reset_hpt */
+	smp_mb();
+
+	/* On the first time here, set up HTAB and VRMA or RMA */
 	if (!vcpu->kvm->arch.rma_setup_done) {
-		r = kvmppc_hv_setup_rma(vcpu);
+		r = kvmppc_hv_setup_htab_rma(vcpu);
 		if (r)
-			return r;
+			goto out;
 	}
 
 	flush_fp_to_thread(current);
@@ -1090,6 +1094,9 @@  int kvmppc_vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu)
 			kvmppc_core_prepare_to_enter(vcpu);
 		}
 	} while (r == RESUME_GUEST);
+
+ out:
+	atomic_dec(&vcpu->kvm->arch.vcpus_running);
 	return r;
 }
 
@@ -1384,7 +1391,7 @@  void kvmppc_core_commit_memory_region(struct kvm *kvm,
 {
 }
 
-static int kvmppc_hv_setup_rma(struct kvm_vcpu *vcpu)
+static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu)
 {
 	int err = 0;
 	struct kvm *kvm = vcpu->kvm;
@@ -1403,6 +1410,15 @@  static int kvmppc_hv_setup_rma(struct kvm_vcpu *vcpu)
 	if (kvm->arch.rma_setup_done)
 		goto out;	/* another vcpu beat us to it */
 
+	/* Allocate hashed page table (if not done already) and reset it */
+	if (!kvm->arch.hpt_virt) {
+		err = kvmppc_alloc_hpt(kvm, NULL);
+		if (err) {
+			pr_err("KVM: Couldn't alloc HPT\n");
+			goto out;
+		}
+	}
+
 	/* Look up the memslot for guest physical address 0 */
 	memslot = gfn_to_memslot(kvm, 0);
 
@@ -1514,13 +1530,15 @@  static int kvmppc_hv_setup_rma(struct kvm_vcpu *vcpu)
 
 int kvmppc_core_init_vm(struct kvm *kvm)
 {
-	long r;
-	unsigned long lpcr;
+	unsigned long lpcr;
+	long lpid;
 
-	/* Allocate hashed page table */
-	r = kvmppc_alloc_hpt(kvm);
-	if (r)
-		return r;
+	/* Allocate the guest's logical partition ID */
+
+	lpid = kvmppc_alloc_lpid();
+	if (lpid < 0)
+		return -ENOMEM;
+	kvm->arch.lpid = lpid;
 
 	INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
 
@@ -1530,7 +1547,6 @@  int kvmppc_core_init_vm(struct kvm *kvm)
 
 	if (cpu_has_feature(CPU_FTR_ARCH_201)) {
 		/* PPC970; HID4 is effectively the LPCR */
-		unsigned long lpid = kvm->arch.lpid;
 		kvm->arch.host_lpid = 0;
 		kvm->arch.host_lpcr = lpcr = mfspr(SPRN_HID4);
 		lpcr &= ~((3 << HID4_LPID1_SH) | (0xful << HID4_LPID5_SH));
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index e1b60f5..99777eb 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -25,6 +25,8 @@  static void __init kvm_linear_init_one(ulong size, int count, int type);
 static struct kvmppc_linear_info *kvm_alloc_linear(int type);
 static void kvm_release_linear(struct kvmppc_linear_info *ri);
 
+int kvm_hpt_order = KVM_DEFAULT_HPT_ORDER;
+
 /*************** RMA *************/
 
 /*
@@ -209,7 +211,7 @@  static void kvm_release_linear(struct kvmppc_linear_info *ri)
 void __init kvm_linear_init(void)
 {
 	/* HPT */
-	kvm_linear_init_one(1 << HPT_ORDER, kvm_hpt_count, KVM_LINEAR_HPT);
+	kvm_linear_init_one(1 << kvm_hpt_order, kvm_hpt_count, KVM_LINEAR_HPT);
 
 	/* RMA */
 	/* Only do this on PPC970 in HV mode */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index def880a..4f4f24e 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -237,7 +237,7 @@  long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
 
 	/* Find and lock the HPTEG slot to use */
  do_insert:
-	if (pte_index >= HPT_NPTE)
+	if (pte_index >= kvm->arch.hpt_npte)
 		return H_PARAMETER;
 	if (likely((flags & H_EXACT) == 0)) {
 		pte_index &= ~7UL;
@@ -352,7 +352,7 @@  long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
 	unsigned long v, r, rb;
 	struct revmap_entry *rev;
 
-	if (pte_index >= HPT_NPTE)
+	if (pte_index >= kvm->arch.hpt_npte)
 		return H_PARAMETER;
 	hpte = (unsigned long *)(kvm->arch.hpt_virt + (pte_index << 4));
 	while (!try_lock_hpte(hpte, HPTE_V_HVLOCK))
@@ -419,7 +419,8 @@  long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu)
 				i = 4;
 				break;
 			}
-			if (req != 1 || flags == 3 || pte_index >= HPT_NPTE) {
+			if (req != 1 || flags == 3 ||
+			    pte_index >= kvm->arch.hpt_npte) {
 				/* parameter error */
 				args[j] = ((0xa0 | flags) << 56) + pte_index;
 				ret = H_PARAMETER;
@@ -520,7 +521,7 @@  long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
 	struct revmap_entry *rev;
 	unsigned long v, r, rb, mask, bits;
 
-	if (pte_index >= HPT_NPTE)
+	if (pte_index >= kvm->arch.hpt_npte)
 		return H_PARAMETER;
 
 	hpte = (unsigned long *)(kvm->arch.hpt_virt + (pte_index << 4));
@@ -582,7 +583,7 @@  long kvmppc_h_read(struct kvm_vcpu *vcpu, unsigned long flags,
 	int i, n = 1;
 	struct revmap_entry *rev = NULL;
 
-	if (pte_index >= HPT_NPTE)
+	if (pte_index >= kvm->arch.hpt_npte)
 		return H_PARAMETER;
 	if (flags & H_READ_4) {
 		pte_index &= ~3;
@@ -677,7 +678,7 @@  long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
 		somask = (1UL << 28) - 1;
 		vsid = (slb_v & ~SLB_VSID_B) >> SLB_VSID_SHIFT;
 	}
-	hash = (vsid ^ ((eaddr & somask) >> pshift)) & HPT_HASH_MASK;
+	hash = (vsid ^ ((eaddr & somask) >> pshift)) & kvm->arch.hpt_mask;
 	avpn = slb_v & ~(somask >> 16);	/* also includes B */
 	avpn |= (eaddr & somask) >> 16;
 
@@ -722,7 +723,7 @@  long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
 		if (val & HPTE_V_SECONDARY)
 			break;
 		val |= HPTE_V_SECONDARY;
-		hash = hash ^ HPT_HASH_MASK;
+		hash = hash ^ kvm->arch.hpt_mask;
 	}
 	return -1;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 58ad860..c9933fe 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -246,6 +246,7 @@  int kvm_dev_ioctl_check_extension(long ext)
 #endif
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	case KVM_CAP_SPAPR_TCE:
+	case KVM_CAP_PPC_ALLOC_HTAB:
 		r = 1;
 		break;
 	case KVM_CAP_PPC_SMT:
@@ -794,6 +795,23 @@  long kvm_arch_vm_ioctl(struct file *filp,
 			r = -EFAULT;
 		break;
 	}
+
+	case KVM_PPC_ALLOCATE_HTAB: {
+		struct kvm *kvm = filp->private_data;
+		u32 htab_order;
+
+		r = -EFAULT;
+		if (get_user(htab_order, (u32 __user *)argp))
+			break;
+		r = kvmppc_alloc_reset_hpt(kvm, &htab_order);
+		if (r)
+			break;
+		r = -EFAULT;
+		if (put_user(htab_order, (u32 __user *)argp))
+			break;
+		r = 0;
+		break;
+	}
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
 
 	default:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 7a9dd4b..f57cbcb 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -590,6 +590,7 @@  struct kvm_ppc_pvinfo {
 #define KVM_CAP_SYNC_REGS 74
 #define KVM_CAP_PCI_2_3 75
 #define KVM_CAP_KVMCLOCK_CTRL 76
+#define KVM_CAP_PPC_ALLOC_HTAB 77
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -789,6 +790,8 @@  struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_PCI_2_3 */
 #define KVM_ASSIGN_SET_INTX_MASK  _IOW(KVMIO,  0xa4, \
 				       struct kvm_assigned_pci_dev)
+/* Available with KVM_CAP_PPC_ALLOC_HTAB */
+#define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa5, __u32)
 
 /*
  * ioctls for vcpu fds