[RFC,v0] spapr: Disable memory hotplug when HTAB size is insufficient

Message ID 1440387111-23689-1-git-send-email-bharata@linux.vnet.ibm.com
State New

Commit Message

Bharata B Rao Aug. 24, 2015, 3:31 a.m. UTC
The hash table size allocated to guest depends on the maxmem size.
If the host isn't able to allocate the required hash table size but
instead allocates less than the optimal requested size, then it will
not be possible to grow the RAM until maxmem via memory hotplug.
Attempts to hotplug memory till maxmem could fail and this failure
isn't being currently handled gracefully by the guest kernel thereby
causing guest kernel oops.

This should eventually get fixed when we move to completely in-kernel
memory hotplug instead of the current method where userspace tool drmgr
drives the hotplug. Until the in-kernel memory hotplug is available
for PowerKVM, disable memory hotplug when requested hash table size
isn't allocated.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
---
Applies against spapr-next branch of David Gibson's tree.

 hw/ppc/spapr.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comments

Anshuman Khandual Aug. 24, 2015, 4:34 a.m. UTC | #1
On 08/24/2015 09:01 AM, Bharata B Rao wrote:
> The hash table size allocated to guest depends on the maxmem size.
> If the host isn't able to allocate the required hash table size but
> instead allocates less than the optimal requested size, then it will
> not be possible to grow the RAM until maxmem via memory hotplug.
> Attempts to hotplug memory till maxmem could fail and this failure
> isn't being currently handled gracefully by the guest kernel thereby
> causing guest kernel oops.
> 
> This should eventually get fixed when we move to completely in-kernel
> memory hotplug instead of the current method where userspace tool drmgr
> drives the hotplug. Until the in-kernel memory hotplug is available
> for PowerKVM, disable memory hotplug when requested hash table size
> isn't allocated.

Even when in-kernel memory hotplug becomes available on PKVM,
it still makes sense to disable memory hotplug when the hash table
size received from the host is not sufficient for the permissible/
requested maximum memory size of the guest. What's the point of
enabling memory hotplug when we know it cannot fulfill all the
memory hotplug requests?

IIUC, the hash table size received from the host can sometimes
be greater than what is required for the current memory size and
less than what is required for the max hot-pluggable memory on the
guest. With this patch we will disable memory hotplug in that case,
but then why use a hash table which is bigger than required for the
current memory size? We will not be doing *any* memory hotplug at
all afterwards, so let's shrink the hash table to what is just
required for the current memory requested by the guest and save
some RAM on the system.
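
The three cases discussed above (table too small, table covering only the
current memory, table covering maxmem) can be sketched as follows. This is
only an illustration, not QEMU code: the 1/128 RAM-to-HTAB ratio, the
minimum shift of 18, the cap of 46, and all function names are assumptions
made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define HTAB_SHIFT_MIN 18  /* assumed floor: 256 KiB table */
#define HTAB_SHIFT_MAX 46  /* assumed cap */

/*
 * Hypothetical helper: smallest shift such that a table of (1 << shift)
 * bytes is at least ram_size / 128 (assumed rule of thumb).
 */
static int htab_shift_for_ram(uint64_t ram_size)
{
    int shift = HTAB_SHIFT_MIN;

    assert(ram_size > 0);
    while (shift < HTAB_SHIFT_MAX && (1ULL << (shift + 7)) < ram_size) {
        shift++;
    }
    return shift;
}

enum htab_fit {
    HTAB_TOO_SMALL,            /* cannot even cover the current memory */
    HTAB_COVERS_CURRENT_ONLY,  /* the in-between case described above */
    HTAB_COVERS_MAXMEM         /* hotplug up to maxmem is possible */
};

/* Classify the shift granted by the host against current RAM and maxmem. */
static enum htab_fit classify_htab(int granted_shift, uint64_t ram_size,
                                   uint64_t maxram_size)
{
    if (granted_shift >= htab_shift_for_ram(maxram_size)) {
        return HTAB_COVERS_MAXMEM;
    }
    if (granted_shift >= htab_shift_for_ram(ram_size)) {
        return HTAB_COVERS_CURRENT_ONLY;
    }
    return HTAB_TOO_SMALL;
}

/* Bytes given back if the table were shrunk to fit the current RAM only. */
static uint64_t htab_bytes_saved(int granted_shift, uint64_t ram_size)
{
    int needed = htab_shift_for_ram(ram_size);

    if (granted_shift <= needed) {
        return 0;
    }
    return (1ULL << granted_shift) - (1ULL << needed);
}
```

Under these assumptions, a guest with 4 GiB of current memory and 32 GiB of
maxmem that is granted a shift of 26 lands in the in-between case, and
shrinking the table to the shift needed for 4 GiB would give back 32 MiB.
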
Bharata B Rao Sept. 2, 2015, 3:28 a.m. UTC | #2
On Mon, Aug 24, 2015 at 09:01:51AM +0530, Bharata B Rao wrote:
> The hash table size allocated to guest depends on the maxmem size.
> If the host isn't able to allocate the required hash table size but
> instead allocates less than the optimal requested size, then it will
> not be possible to grow the RAM until maxmem via memory hotplug.
> Attempts to hotplug memory till maxmem could fail and this failure
> isn't being currently handled gracefully by the guest kernel thereby
> causing guest kernel oops.
> 
> This should eventually get fixed when we move to completely in-kernel
> memory hotplug instead of the current method where userspace tool drmgr
> drives the hotplug. Until the in-kernel memory hotplug is available
> for PowerKVM, disable memory hotplug when requested hash table size
> isn't allocated.

David - Do you have any views on how to go about this ? Due to the way
we do hotplug currently using drmgr, it appears that it is very difficult
to have a graceful recovery within the guest kernel when memory hotplug
request can't be fulfilled due to insufficient HTAB size. (Anshuman can
elaborate on this with the exact description on why it is so hard to
recover).

Do you think disabling memory hotplug upfront is a reasonable workaround
for this problem ?

Nathan - When you enable in-kernel memory hotplug for PowerKVM, will you
be exporting something for the userspace (capability ?) to check and
determine the presence of in-kernel memory hotplug feature so that we
can depend on graceful recovery instead of upfront disablement of
memory hotplug from QEMU ?

Regards,
Bharata.
David Gibson Sept. 3, 2015, 2:34 a.m. UTC | #3
On Wed, Sep 02, 2015 at 08:58:54AM +0530, Bharata B Rao wrote:
> On Mon, Aug 24, 2015 at 09:01:51AM +0530, Bharata B Rao wrote:
> > The hash table size allocated to guest depends on the maxmem size.
> > If the host isn't able to allocate the required hash table size but
> > instead allocates less than the optimal requested size, then it will
> > not be possible to grow the RAM until maxmem via memory hotplug.
> > Attempts to hotplug memory till maxmem could fail and this failure
> > isn't being currently handled gracefully by the guest kernel thereby
> > causing guest kernel oops.
> > 
> > This should eventually get fixed when we move to completely in-kernel
> > memory hotplug instead of the current method where userspace tool drmgr
> > drives the hotplug. Until the in-kernel memory hotplug is available
> > for PowerKVM, disable memory hotplug when requested hash table size
> > isn't allocated.
> 
> David - Do you have any views on how to go about this ? Due to the way
> we do hotplug currently using drmgr, it appears that it is very difficult
> to have a graceful recovery within the guest kernel when memory hotplug
> request can't be fulfilled due to insufficient HTAB size. (Anshuman can
> elaborate on this with the exact description on why it is so hard to
> recover).
> 
> Do you think disabling memory hotplug upfront is a reasonable workaround
> for this problem ?
> 
> Nathan - When you enable in-kernel memory hotplug for PowerKVM, will you
> be exporting something for the userspace (capability ?) to check and
> > determine the presence of in-kernel memory hotplug feature so that we
> can depend on graceful recovery instead of upfront disablement of
> memory hotplug from QEMU ?

So, I kind of dislike magically disabling requested options - it can
make debugging problems really confusing.

In theory, what I'd prefer is to just not start the guest if we don't
get a big enough hash table to cover maxram.  Unfortunately we don't
discover this until reset time at which point it is not
straightforward to bail out cleanly :/
Nathan Fontenot Sept. 3, 2015, 6:50 p.m. UTC | #4
On 09/01/2015 10:28 PM, Bharata B Rao wrote:
> On Mon, Aug 24, 2015 at 09:01:51AM +0530, Bharata B Rao wrote:
>> The hash table size allocated to guest depends on the maxmem size.
>> If the host isn't able to allocate the required hash table size but
>> instead allocates less than the optimal requested size, then it will
>> not be possible to grow the RAM until maxmem via memory hotplug.
>> Attempts to hotplug memory till maxmem could fail and this failure
>> isn't being currently handled gracefully by the guest kernel thereby
>> causing guest kernel oops.
>>
>> This should eventually get fixed when we move to completely in-kernel
>> memory hotplug instead of the current method where userspace tool drmgr
>> drives the hotplug. Until the in-kernel memory hotplug is available
>> for PowerKVM, disable memory hotplug when requested hash table size
>> isn't allocated.
> 
> David - Do you have any views on how to go about this ? Due to the way
> we do hotplug currently using drmgr, it appears that it is very difficult
> to have a graceful recovery within the guest kernel when memory hotplug
> request can't be fulfilled due to insufficient HTAB size. (Anshuman can
> elaborate on this with the exact description on why it is so hard to
> recover).
> 
> Do you think disabling memory hotplug upfront is a reasonable workaround
> for this problem ?
> 
> Nathan - When you enable in-kernel memory hotplug for PowerKVM, will you
> be exporting something for the userspace (capability ?) to check and
> determine the presence of in-kernel memory hotplug feature so that we
> can depend on graceful recovery instead of upfront disablement of
> memory hotplug from QEMU ?
> 

I did not have any plans currently to export something indicating we are
using the in-kernel memory hotplug code.

Perhaps this is something we should consider adding to the PAPR update
proposal that is being worked? Something to indicate we can gracefully handle
adding memory beyond HTAB size.

-Nathan
Michael Roth Sept. 4, 2015, 3:33 p.m. UTC | #5
Quoting Nathan Fontenot (2015-09-03 13:50:59)
> On 09/01/2015 10:28 PM, Bharata B Rao wrote:
> > On Mon, Aug 24, 2015 at 09:01:51AM +0530, Bharata B Rao wrote:
> >> The hash table size allocated to guest depends on the maxmem size.
> >> If the host isn't able to allocate the required hash table size but
> >> instead allocates less than the optimal requested size, then it will
> >> not be possible to grow the RAM until maxmem via memory hotplug.
> >> Attempts to hotplug memory till maxmem could fail and this failure
> >> isn't being currently handled gracefully by the guest kernel thereby
> >> causing guest kernel oops.
> >>
> >> This should eventually get fixed when we move to completely in-kernel
> >> memory hotplug instead of the current method where userspace tool drmgr
> >> drives the hotplug. Until the in-kernel memory hotplug is available
> >> for PowerKVM, disable memory hotplug when requested hash table size
> >> isn't allocated.
> > 
> > David - Do you have any views on how to go about this ? Due to the way
> > we do hotplug currently using drmgr, it appears that it is very difficult
> > to have a graceful recovery within the guest kernel when memory hotplug
> > request can't be fulfilled due to insufficient HTAB size. (Anshuman can
> > elaborate on this with the exact description on why it is so hard to
> > recover).
> > 
> > Do you think disabling memory hotplug upfront is a reasonable workaround
> > for this problem ?
> > 
> > Nathan - When you enable in-kernel memory hotplug for PowerKVM, will you
> > be exporting something for the userspace (capability ?) to check and
> > determine the presence of in-kernel memory hotplug feature so that we
> > can depend on graceful recovery instead of upfront disablement of
> > memory hotplug from QEMU ?
> > 
> 
> I did not have any plans currently to export something indicating we are
> using the in-kernel memory hotplug code.
> 
> Perhaps this is something we should consider adding to the PAPR update
> proposal that is being worked? Something to indicate we can gracefully handle
> adding memory beyond HTAB size.

That might make sense, but I'm curious what constitutes graceful
recovery in this context. What can we do with in-kernel hotplug that's not
possible with userspace tools? If it's graceful failure, is there really
nothing that can be done by QEMU at the DRC level to get the same
result?

> 
> -Nathan
>
Nathan Fontenot Sept. 4, 2015, 3:49 p.m. UTC | #6
On 09/04/2015 10:33 AM, Michael Roth wrote:
> Quoting Nathan Fontenot (2015-09-03 13:50:59)
>> On 09/01/2015 10:28 PM, Bharata B Rao wrote:
>>> On Mon, Aug 24, 2015 at 09:01:51AM +0530, Bharata B Rao wrote:
>>>> The hash table size allocated to guest depends on the maxmem size.
>>>> If the host isn't able to allocate the required hash table size but
>>>> instead allocates less than the optimal requested size, then it will
>>>> not be possible to grow the RAM until maxmem via memory hotplug.
>>>> Attempts to hotplug memory till maxmem could fail and this failure
>>>> isn't being currently handled gracefully by the guest kernel thereby
>>>> causing guest kernel oops.
>>>>
>>>> This should eventually get fixed when we move to completely in-kernel
>>>> memory hotplug instead of the current method where userspace tool drmgr
>>>> drives the hotplug. Until the in-kernel memory hotplug is available
>>>> for PowerKVM, disable memory hotplug when requested hash table size
>>>> isn't allocated.
>>>
>>> David - Do you have any views on how to go about this ? Due to the way
>>> we do hotplug currently using drmgr, it appears that it is very difficult
>>> to have a graceful recovery within the guest kernel when memory hotplug
>>> request can't be fulfilled due to insufficient HTAB size. (Anshuman can
>>> elaborate on this with the exact description on why it is so hard to
>>> recover).
>>>
>>> Do you think disabling memory hotplug upfront is a reasonable workaround
>>> for this problem ?
>>>
>>> Nathan - When you enable in-kernel memory hotplug for PowerKVM, will you
>>> be exporting something for the userspace (capability ?) to check and
>>> determine the presence of in-kernel memory hotplug feature so that we
>>> can depend on graceful recovery instead of upfront disablement of
>>> memory hotplug from QEMU ?
>>>
>>
>> I did not have any plans currently to export something indicating we are
>> using the in-kernel memory hotplug code.
>>
>> Perhaps this is something we should consider adding to the PAPR update
>> proposal that is being worked? Something to indicate we can gracefully handle
>> adding memory beyond HTAB size.
> 
> That might make sense, but I'm curious what constitutes graceful
> recovery in this context. What can we do with in-kernel hotplug that's not
> possible with userspace tools? If it's graceful failure, is there really
> nothing that can be done by QEMU at the DRC level to get the same
> result?

I don't have an answer for how to recover gracefully or if it will be possible.
If/when we can determine how to do that my thought was to use the PAPR updates
we are working on to indicate to QEMU that the guest is able to handle this
situation.

-Nathan
Michael Roth Sept. 4, 2015, 4:12 p.m. UTC | #7
Quoting Nathan Fontenot (2015-09-04 10:49:18)
> On 09/04/2015 10:33 AM, Michael Roth wrote:
> > Quoting Nathan Fontenot (2015-09-03 13:50:59)
> >> On 09/01/2015 10:28 PM, Bharata B Rao wrote:
> >>> On Mon, Aug 24, 2015 at 09:01:51AM +0530, Bharata B Rao wrote:
> >>>> The hash table size allocated to guest depends on the maxmem size.
> >>>> If the host isn't able to allocate the required hash table size but
> >>>> instead allocates less than the optimal requested size, then it will
> >>>> not be possible to grow the RAM until maxmem via memory hotplug.
> >>>> Attempts to hotplug memory till maxmem could fail and this failure
> >>>> isn't being currently handled gracefully by the guest kernel thereby
> >>>> causing guest kernel oops.
> >>>>
> >>>> This should eventually get fixed when we move to completely in-kernel
> >>>> memory hotplug instead of the current method where userspace tool drmgr
> >>>> drives the hotplug. Until the in-kernel memory hotplug is available
> >>>> for PowerKVM, disable memory hotplug when requested hash table size
> >>>> isn't allocated.
> >>>
> >>> David - Do you have any views on how to go about this ? Due to the way
> >>> we do hotplug currently using drmgr, it appears that it is very difficult
> >>> to have a graceful recovery within the guest kernel when memory hotplug
> >>> request can't be fulfilled due to insufficient HTAB size. (Anshuman can
> >>> elaborate on this with the exact description on why it is so hard to
> >>> recover).
> >>>
> >>> Do you think disabling memory hotplug upfront is a reasonable workaround
> >>> for this problem ?
> >>>
> >>> Nathan - When you enable in-kernel memory hotplug for PowerKVM, will you
> >>> be exporting something for the userspace (capability ?) to check and
> >>> determine the presence of in-kernel memory hotplug feature so that we
> >>> can depend on graceful recovery instead of upfront disablement of
> >>> memory hotplug from QEMU ?
> >>>
> >>
> >> I did not have any plans currently to export something indicating we are
> >> using the in-kernel memory hotplug code.
> >>
> >> Perhaps this is something we should consider adding to the PAPR update
> >> proposal that is being worked? Something to indicate we can gracefully handle
> >> adding memory beyond HTAB size.
> > 
> > That might make sense, but I'm curious what constitutes graceful
> > recovery in this context. What can we do with in-kernel hotplug that's not
> > possible with userspace tools? If it's graceful failure, is there really
> > nothing that can be done by QEMU at the DRC level to get the same
> > result?
> 
> I don't have an answer for how to recover gracefully or if it will be possible.

Sorry, I meant it as a general question. Bharata mentioned Anshuman might have
some further details?

> If/when we can determine how to do that my thought was to use the PAPR updates
> we are working on to indicate to QEMU that the guest is able to handle this
> situation.

Agreed, if it's something that only makes sense for updated guest
kernels an architecture flag would be good. But if it's possible to do
something compatible with existing guests that would be ideal. Not sure
that's been ruled out yet.

> 
> -Nathan
>
Anshuman Khandual Sept. 9, 2015, 9:06 a.m. UTC | #8
On 09/04/2015 09:42 PM, Michael Roth wrote:
> Quoting Nathan Fontenot (2015-09-04 10:49:18)
>> On 09/04/2015 10:33 AM, Michael Roth wrote:
>>> Quoting Nathan Fontenot (2015-09-03 13:50:59)
>>>> On 09/01/2015 10:28 PM, Bharata B Rao wrote:
>>>>> On Mon, Aug 24, 2015 at 09:01:51AM +0530, Bharata B Rao wrote:
>>>>>> The hash table size allocated to guest depends on the maxmem size.
>>>>>> If the host isn't able to allocate the required hash table size but
>>>>>> instead allocates less than the optimal requested size, then it will
>>>>>> not be possible to grow the RAM until maxmem via memory hotplug.
>>>>>> Attempts to hotplug memory till maxmem could fail and this failure
>>>>>> isn't being currently handled gracefully by the guest kernel thereby
>>>>>> causing guest kernel oops.
>>>>>>
>>>>>> This should eventually get fixed when we move to completely in-kernel
>>>>>> memory hotplug instead of the current method where userspace tool drmgr
>>>>>> drives the hotplug. Until the in-kernel memory hotplug is available
>>>>>> for PowerKVM, disable memory hotplug when requested hash table size
>>>>>> isn't allocated.
>>>>>
>>>>> David - Do you have any views on how to go about this ? Due to the way
>>>>> we do hotplug currently using drmgr, it appears that it is very difficult
>>>>> to have a graceful recovery within the guest kernel when memory hotplug
>>>>> request can't be fulfilled due to insufficient HTAB size. (Anshuman can
>>>>> elaborate on this with the exact description on why it is so hard to
>>>>> recover).
>>>>>
>>>>> Do you think disabling memory hotplug upfront is a reasonable workaround
>>>>> for this problem ?
>>>>>
>>>>> Nathan - When you enable in-kernel memory hotplug for PowerKVM, will you
>>>>> be exporting something for the userspace (capability ?) to check and
>>>>> determine the presence of in-kernel memory hotplug feature so that we
>>>>> can depend on graceful recovery instead of upfront disablement of
>>>>> memory hotplug from QEMU ?
>>>>>
>>>>
>>>> I did not have any plans currently to export something indicating we are
>>>> using the in-kernel memory hotplug code.
>>>>
>>>> Perhaps this is something we should consider adding to the PAPR update
>>>> proposal that is being worked? Something to indicate we can gracefully handle
>>>> adding memory beyond HTAB size.
>>>
>>> That might make sense, but I'm curious what constitutes graceful
>>> recovery in this context. What can we do with in-kernel hotplug that's not
>>> possible with userspace tools? If it's graceful failure, is there really
>>> nothing that can be done by QEMU at the DRC level to get the same
>>> result?
>>
>> I don't have an answer for how to recover gracefully or if it will be possible.
> 
> Sorry, I meant it as a general question. Bharata mentioned Anshuman might have
> some further details?

Graceful recovery in the kernel seems to be difficult (though I cannot
say whether it is impossible) because of the way we have implemented
the memory hotplug function with the help of the userspace tool called
'drmgr'. It achieves memory hotplug in two distinct steps after
receiving the platform notification.

(1) Update the /proc/ofdt
(2) Write into /sys/devices/system/memory/probe

Both of the above steps try to add the new memory block into the kernel
(generic and arch specific representations). Now if step (2) fails, we
restore the /proc/ofdt value to the original state present before we started
the hotplug operation. In short, this does not gracefully roll back all the
changes we had done in step (2) and step (1). One of the reasons is
the fact that it happens in two distinct steps from user space.

Had it been attempted through a single step, kernel would have right away
reverted any changes before exiting back into the userspace. New in-kernel
memory hotplug method follows this principle now.
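
The single-step principle described above can be sketched with a toy model.
This is not kernel or drmgr code: the struct, its fields, and every function
name are hypothetical, and the "HTAB capacity" check merely stands in for
whatever makes step (2) fail.

```c
#include <stdbool.h>

/* Toy state: what each of the two steps modifies. */
struct hotplug_state {
    int dt_blocks;      /* blocks described in the device tree, step (1) */
    int online_blocks;  /* blocks known to the memory core, step (2) */
    int htab_capacity;  /* blocks the hash table can map (assumed limit) */
};

/* Step (1): record the new block in the device tree. */
static bool dt_add_block(struct hotplug_state *s)
{
    s->dt_blocks++;
    return true;
}

/* Undo of step (1). */
static void dt_remove_block(struct hotplug_state *s)
{
    s->dt_blocks--;
}

/* Step (2): add the block to the memory core; fails when the HTAB is full. */
static bool mem_probe_block(struct hotplug_state *s)
{
    if (s->online_blocks >= s->htab_capacity) {
        return false;
    }
    s->online_blocks++;
    return true;
}

/*
 * Single entry point: either both steps take effect or neither does,
 * because the failure of step (2) is seen, and step (1) undone, before
 * control returns to the caller.
 */
static bool hotplug_add_block(struct hotplug_state *s)
{
    if (!dt_add_block(s)) {
        return false;
    }
    if (!mem_probe_block(s)) {
        dt_remove_block(s);
        return false;
    }
    return true;
}
```

With two separate calls from userspace, the undo of step (1) depends on the
tool noticing the failure of step (2) and knowing how to reverse everything
step (1) did; with one entry point that responsibility stays in one place.
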

Patch

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index c3268c5..4a07a7d 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -92,6 +92,9 @@ 
 
 #define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
 
+/* TODO: Move this to sPAPRMachineState ? */
+static bool spapr_memory_hotplug_disabled;
+
 static XICSState *try_create_xics(const char *type, int nr_servers,
                                   int nr_irqs, Error **errp)
 {
@@ -983,6 +986,14 @@  static void spapr_reset_htab(sPAPRMachineState *spapr)
 
     if (shift > 0) {
         /* Kernel handles htab, we don't need to allocate one */
+        if (shift != spapr->htab_shift) {
+            /*
+             * Disable memory hotplug since we didn't get the requested
+             * hash table size.
+             */
+            spapr_memory_hotplug_disabled = true;
+        }
+
         spapr->htab_shift = shift;
         kvmppc_kern_htab = true;
 
@@ -2149,6 +2160,11 @@  static void spapr_machine_device_plug(HotplugHandler *hotplug_dev,
             return;
         }
 
+        if (spapr_memory_hotplug_disabled) {
+            error_setg(errp, "Insufficient HTAB size to support memory hotplug");
+            return;
+        }
+
         spapr_memory_plug(hotplug_dev, dev, node, errp);
     }
 }
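
For readers who want to follow the control flow without the rest of spapr.c,
here is a standalone model of what the patch does. It is a sketch only: the
struct and function names are hypothetical, and the flag lives in the struct
(as the TODO in the patch suggests) rather than in a file-scope static.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant bits of sPAPRMachineState. */
struct machine_model {
    int htab_shift;               /* requested shift, then the granted one */
    bool memory_hotplug_disabled;
};

/*
 * Models the reset path: granted_shift > 0 means the kernel manages the
 * HTAB and reports the shift it actually allocated.
 */
static void model_reset_htab(struct machine_model *m, int granted_shift)
{
    if (granted_shift > 0) {
        if (granted_shift != m->htab_shift) {
            /* Did not get the requested size: disable memory hotplug. */
            m->memory_hotplug_disabled = true;
        }
        m->htab_shift = granted_shift;
    }
}

/* Models the plug path: returns an error string, or NULL on success. */
static const char *model_memory_plug(const struct machine_model *m)
{
    if (m->memory_hotplug_disabled) {
        return "Insufficient HTAB size to support memory hotplug";
    }
    return NULL;
}
```

Note that in this model, as in the patch, nothing clears the flag again: a
later reset sees the stored shift equal to the granted one, so the comparison
no longer fires, but hotplug stays disabled for the lifetime of the machine.
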