[RFC v0 for-2.7] spapr: Work around the memory hotplug failure with DDW

Message ID 1461066701-23212-1-git-send-email-bharata@linux.vnet.ibm.com
State New

Commit Message

Bharata B Rao April 19, 2016, 11:51 a.m. UTC
Memory hotplug can fail for some combinations of RAM and maxmem when
DDW is enabled in the presence of devices like nec-xhci-usb. DDW depends
on the maximum memory addressable by the guest, and this value is
currently calculated wrongly by the guest kernel routine
memory_hotplug_max(). While there is an attempt to fix the guest
kernel (*), this patch works around the problem within QEMU itself.

The memory_hotplug_max() routine in the guest kernel arrives at the max
addressable memory by multiplying the lmb-size with the lmb-count
obtained from the ibm,dynamic-memory property (see the sketch after the
list below). There are two assumptions here:

- All LMBs are part of ibm,dynamic-memory: This is not true for PowerKVM,
  where only hot-pluggable LMBs are present in this property.
- The memory area comprising RAM and the hotplug region is contiguous:
  This needn't always be true for PowerKVM, as there can be a gap
  between boot time RAM and the hotplug region.
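
For reference, the guest-side calculation that both assumptions feed
into looks roughly like this (a simplified sketch, not the literal
kernel routine):

    #include <stdint.h>

    /* Simplified sketch: the guest trusts ibm,dynamic-memory to list
     * every LMB, starting at address 0, so it takes the product below
     * as the highest addressable memory. Neither assumption holds for
     * PowerKVM guests before this patch. */
    static uint64_t memory_hotplug_max_sketch(uint32_t lmb_count,
                                              uint64_t lmb_size)
    {
        return (uint64_t)lmb_count * lmb_size;
    }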

This workaround involves having all the LMBs (the RMA, the rest of the
boot time LMBs, and the hot-pluggable LMBs) as part of
ibm,dynamic-memory so that the guest kernel's calculation of max
addressable memory comes out correct, resulting in a correct DDW value
which prevents memory hotplug failures. memory@0 is created for the
RMA, but the RMA LMBs are also represented as "reserved" LMBs in
ibm,dynamic-memory. Parts of this are essentially a revert of
e8f986fc57a664a74b9f685b466506366a15201b.

In addition to this, the alignment of the hotplug memory region is
reduced from the current 1G to 256M (the LMB size in PowerKVM) so that
we don't end up with any gaps between boot time RAM and the hotplug
region; a worked example follows.
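
As an illustration (hypothetical sizes, assuming 256M LMBs), with 4352M
(4G + 256M) of boot time RAM:

    hotplug base, 1G alignment:   align_up(4352M, 1G)   = 5120M -> 768M gap
    hotplug base, 256M alignment: align_up(4352M, 256M) = 4352M -> no gap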

This change has a side effect on how the memory nodes are represented
in the DT. As an example, consider a guest with the following
memory-related command line options:

-m 4G,slots=32,maxmem=8G -numa node,nodeid=0,mem=2G -numa node,nodeid=1,mem=2G
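
For this configuration (again assuming 256M LMBs), the numbers work out
as follows:

    old code:   nr_lmbs   = (8G - 4G) / 256M = 16 dynamic LMBs
                guest max = 16 * 256M = 4G   (wrong: hotplug addresses
                                              reach at least 8G)
    this patch: nr_lmbs   = 8G / 256M = 32 dynamic LMBs
                guest max = 32 * 256M = 8G   (correct)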

Before this change, the guest would have

Scenario 1
----------
memory@0 for RMA
memory@80000000 for rest of the boot time memory
ibm,dynamic-reconfiguration-memory for hot-pluggable memory.

After this commit, the guest will have

Scenario 2
----------
memory@0 for RMA
ibm,dynamic-reconfiguration-memory for the entire memory including
RMA, boot time memory as well as hot-pluggable memory.

If an existing guest with DT nodes as in Scenario 1 above is migrated
to a QEMU that has this change, it continues to have the Scenario 1 DT
nodes at the target. However, after the first reboot the DT
representation changes over to Scenario 2.
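
For context, each per-LMB entry that spapr_populate_drconf_memory()
writes into ibm,dynamic-memory consists of six 32-bit cells. A
hypothetical struct view (not an actual QEMU type; the comments map the
cells to the dynamic_memory[] indices used in the diff below):

    struct sketch_drconf_mem_entry {
        uint32_t addr_hi, addr_lo; /* dynamic_memory[0..1]: LMB base   */
        uint32_t drc_index;        /* dynamic_memory[2]: DRC index     */
        uint32_t reserved;         /* dynamic_memory[3]: always 0      */
        uint32_t assoc_index;      /* dynamic_memory[4]: index into
                                      ibm,associativity-lookup-arrays  */
        uint32_t flags;            /* dynamic_memory[5]: ASSIGNED,
                                      RESERVED (new here), or 0        */
    };                             /* all cells big-endian in the DT   */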

I haven't yet looked at Jian Jun's DRC migration patchset to ascertain
if this change works well with DRC migration.

(*) https://patchwork.ozlabs.org/patch/606912/

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
---
 hw/ppc/spapr.c         | 59 +++++++++++++++++++++++++++++++++++---------------
 include/hw/ppc/spapr.h |  1 +
 2 files changed, 43 insertions(+), 17 deletions(-)

Comments

Alexey Kardashevskiy April 26, 2016, 3:14 a.m. UTC | #1
On 04/19/2016 09:51 PM, Bharata B Rao wrote:
> Memory hotplug can fail for some combinations of RAM and maxmem when
> DDW is enabled in the presence of devices like nec-xhci-usb. DDW depends
> on the maximum memory addressable by the guest, and this value is
> currently calculated wrongly by the guest kernel routine
> memory_hotplug_max(). While there is an attempt to fix the guest
> kernel (*), this patch works around the problem within QEMU itself.
>
> The memory_hotplug_max() routine in the guest kernel arrives at the max
> addressable memory by multiplying the lmb-size with the lmb-count
> obtained from the ibm,dynamic-memory property. There are two
> assumptions here:
>
> - All LMBs are part of ibm,dynamic-memory: This is not true for PowerKVM,
>   where only hot-pluggable LMBs are present in this property.
> - The memory area comprising RAM and the hotplug region is contiguous:
>   This needn't always be true for PowerKVM, as there can be a gap
>   between boot time RAM and the hotplug region.
>
> This workaround involves having all the LMBs (the RMA, the rest of the
> boot time LMBs, and the hot-pluggable LMBs) as part of
> ibm,dynamic-memory so that the guest kernel's calculation of max
> addressable memory comes out correct, resulting in a correct DDW value
> which prevents memory hotplug failures. memory@0 is created for the
> RMA, but the RMA LMBs are also represented as "reserved" LMBs in
> ibm,dynamic-memory. Parts of this are essentially a revert of
> e8f986fc57a664a74b9f685b466506366a15201b.
>
> In addition to this, the alignment of the hotplug memory region is
> reduced from the current 1G to 256M (the LMB size in PowerKVM) so that
> we don't end up with any gaps between boot time RAM and the hotplug
> region.
>
> This change has a side effect on how the memory nodes in DT are
> represented before and after this change. As an example consider
> a guest with the following memory related command line options:
>
> -m 4G,slots=32,maxmem=8G -numa node,nodeid=0,mem=2G -numa node,nodeid=1,mem=2G
>
> Before this change, the guest would have
>
> Scenario 1
> ----------
> memory@0 for RMA
> memory@80000000 for rest of the boot time memory
> ibm,dynamic-reconfiguration-memory for hot-pluggable memory.
>
> After this commit, the guest will have
>
> Scenario 2
> ----------
> memory@0 for RMA
> ibm,dynamic-reconfiguration-memory for the entire memory including
> RMA, boot time memory as well as hot-pluggable memory.
>
> If an existing guest having DT nodes as in Scenario 1 above is migrated
> to a QEMU which has this change, at the target, it continues to have the
> DT nodes as in Scenario 1. However after 1st reboot, the DT representation
> changes over to Scenario 2.


Does this patch also enable hot-unplug of RMA and boot time LMBs?

Also, this would look nicer on top of "[RFC for-2.7 00/11] A new 
infrastructure for guest device trees" where all memory DT nodes are in one 
place so it is easier to follow what exactly you...



> I haven't yet looked at Jian Jun's DRC migration patchset to ascertain
> if this change works well with DRC migration.
>
> (*) https://patchwork.ozlabs.org/patch/606912/
>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
> ---
>   hw/ppc/spapr.c         | 59 +++++++++++++++++++++++++++++++++++---------------
>   include/hw/ppc/spapr.h |  1 +
>   2 files changed, 43 insertions(+), 17 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 79a70a9..6d8de2e 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -566,7 +566,6 @@ static int spapr_populate_memory(sPAPRMachineState *spapr, void *fdt)
>           }
>           if (!mem_start) {
>               /* ppc_spapr_init() checks for rma_size <= node0_size already */
> -            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
>               mem_start += spapr->rma_size;
>               node_size -= spapr->rma_size;
>           }
> @@ -759,18 +758,13 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>       int ret, i, offset;
>       uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
>       uint32_t prop_lmb_size[] = {0, cpu_to_be32(lmb_size)};
> -    uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
> +    uint32_t nr_rma_lmbs = spapr->rma_size / lmb_size;
> +    uint32_t nr_lmbs = machine->maxram_size / lmb_size;
> +    uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
>       uint32_t *int_buf, *cur_index, buf_len;
>       int nr_nodes = nb_numa_nodes ? nb_numa_nodes : 1;
>
>       /*
> -     * Don't create the node if there are no DR LMBs.
> -     */
> -    if (!nr_lmbs) {
> -        return 0;
> -    }
> -
> -    /*
>        * Allocate enough buffer size to fit in ibm,dynamic-memory
>        * or ibm,associativity-lookup-arrays
>        */
> @@ -802,9 +796,15 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>       for (i = 0; i < nr_lmbs; i++) {
>           sPAPRDRConnector *drc;
>           sPAPRDRConnectorClass *drck;
> -        uint64_t addr = i * lmb_size + spapr->hotplug_memory.base;;
> +        uint64_t addr;
>           uint32_t *dynamic_memory = cur_index;
>
> +        if (i < nr_assigned_lmbs) {
> +            addr = i * lmb_size;
> +        } else {
> +            addr = (i - nr_assigned_lmbs) * lmb_size +
> +                    spapr->hotplug_memory.base;
> +        }
>           drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB,
>                                          addr/lmb_size);
>           g_assert(drc);
> @@ -817,7 +817,11 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>           dynamic_memory[4] = cpu_to_be32(numa_get_node(addr, NULL));
>           if (addr < machine->ram_size ||
>                       memory_region_present(get_system_memory(), addr)) {
> -            dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
> +            if (i < nr_rma_lmbs) {
> +                dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_RESERVED);
> +            } else {
> +                dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
> +            }
>           } else {
>               dynamic_memory[5] = cpu_to_be32(0);
>           }
> @@ -879,6 +883,8 @@ int spapr_h_cas_compose_response(sPAPRMachineState *spapr,
>       /* Generate ibm,dynamic-reconfiguration-memory node if required */
>       if (memory_update && smc->dr_lmb_enabled) {
>           _FDT((spapr_populate_drconf_memory(spapr, fdt)));
> +    } else {
> +        _FDT((spapr_populate_memory(spapr, fdt)));
>       }
>
>       /* Pack resulting tree */
> @@ -916,10 +922,23 @@ static void spapr_finalize_fdt(sPAPRMachineState *spapr,
>       /* open out the base tree into a temp buffer for the final tweaks */
>       _FDT((fdt_open_into(spapr->fdt_skel, fdt, FDT_MAX_SIZE)));
>
> -    ret = spapr_populate_memory(spapr, fdt);
> -    if (ret < 0) {
> -        fprintf(stderr, "couldn't setup memory nodes in fdt\n");
> -        exit(1);
> +    /*
> +     * Add memory@0 node to represent RMA. Rest of the memory is either
> +     * represented by memory nodes or ibm,dynamic-reconfiguration-memory
> +     * node later during ibm,client-architecture-support call.
> +     *
> +     * If NUMA is configured, ensure that memory@0 ends up in the
> +     * first memory-less node.
> +     */
> +    if (nb_numa_nodes) {
> +        for (i = 0; i < nb_numa_nodes; ++i) {
> +            if (numa_info[i].node_mem) {
> +                spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
> +                break;
> +            }
> +        }
> +    } else {
> +        spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
>       }
>
>       ret = spapr_populate_vdevice(spapr->vio_bus, fdt);
> @@ -1659,14 +1678,20 @@ static void spapr_create_lmb_dr_connectors(sPAPRMachineState *spapr)
>   {
>       MachineState *machine = MACHINE(spapr);
>       uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
> -    uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
> +    uint32_t nr_lmbs = machine->maxram_size / lmb_size;
> +    uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
>       int i;
>
>       for (i = 0; i < nr_lmbs; i++) {
>           sPAPRDRConnector *drc;
>           uint64_t addr;
>
> -        addr = i * lmb_size + spapr->hotplug_memory.base;
> +        if (i < nr_assigned_lmbs) {
> +            addr = i * lmb_size;
> +        } else {
> +            addr = (i - nr_assigned_lmbs) * lmb_size +
> +                    spapr->hotplug_memory.base;
> +        }
>           drc = spapr_dr_connector_new(OBJECT(spapr), SPAPR_DR_CONNECTOR_TYPE_LMB,
>                                        addr/lmb_size);
>           qemu_register_reset(spapr_drc_reset, drc);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 098d85d..9f2050d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -627,5 +627,6 @@ int spapr_rng_populate_dt(void *fdt);
>    * property under ibm,dynamic-reconfiguration-memory node.
>    */
>   #define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
> +#define SPAPR_LMB_FLAGS_RESERVED 0x00000080
>
>   #endif /* !defined (__HW_SPAPR_H__) */
>
Bharata B Rao April 26, 2016, 5:30 a.m. UTC | #2
On Tue, Apr 26, 2016 at 01:14:39PM +1000, Alexey Kardashevskiy wrote:
> On 04/19/2016 09:51 PM, Bharata B Rao wrote:
> >Memory hotplug can fail for some combinations of RAM and maxmem when
> >DDW is enabled in the presence of devices like nec-xhci-usb. DDW depends
> >on the maximum memory addressable by the guest, and this value is
> >currently calculated wrongly by the guest kernel routine
> >memory_hotplug_max(). While there is an attempt to fix the guest
> >kernel (*), this patch works around the problem within QEMU itself.
> >
> >The memory_hotplug_max() routine in the guest kernel arrives at the max
> >addressable memory by multiplying the lmb-size with the lmb-count
> >obtained from the ibm,dynamic-memory property. There are two
> >assumptions here:
> >
> >- All LMBs are part of ibm,dynamic-memory: This is not true for PowerKVM,
> >  where only hot-pluggable LMBs are present in this property.
> >- The memory area comprising RAM and the hotplug region is contiguous:
> >  This needn't always be true for PowerKVM, as there can be a gap
> >  between boot time RAM and the hotplug region.
> >
> >This workaround involves having all the LMBs (the RMA, the rest of the
> >boot time LMBs, and the hot-pluggable LMBs) as part of
> >ibm,dynamic-memory so that the guest kernel's calculation of max
> >addressable memory comes out correct, resulting in a correct DDW value
> >which prevents memory hotplug failures. memory@0 is created for the
> >RMA, but the RMA LMBs are also represented as "reserved" LMBs in
> >ibm,dynamic-memory. Parts of this are essentially a revert of
> >e8f986fc57a664a74b9f685b466506366a15201b.
> >
> >In addition to this, the alignment of the hotplug memory region is
> >reduced from the current 1G to 256M (the LMB size in PowerKVM) so that
> >we don't end up with any gaps between boot time RAM and the hotplug
> >region.
> >
> >This change has a side effect on how the memory nodes in DT are
> >represented before and after this change. As an example consider
> >a guest with the following memory related command line options:
> >
> >-m 4G,slots=32,maxmem=8G -numa node,nodeid=0,mem=2G -numa node,nodeid=1,mem=2G
> >
> >Before this change, the guest would have
> >
> >Scenario 1
> >----------
> >memory@0 for RMA
> >memory@80000000 for rest of the boot time memory
> >ibm,dynamic-reconfiguration-memory for hot-pluggable memory.
> >
> >After this commit, the guest will have
> >
> >Scenario 2
> >----------
> >memory@0 for RMA
> >ibm,dynamic-reconfiguration-memory for the entire memory including
> >RMA, boot time memory as well as hot-pluggable memory.
> >
> >If an existing guest having DT nodes as in Scenario 1 above is migrated
> >to a QEMU which has this change, at the target, it continues to have the
> >DT nodes as in Scenario 1. However after 1st reboot, the DT representation
> >changes over to Scenario 2.
> 
> 
> Does this patch also enable hot-unplug of RMA and boot time LMBs?

No, this patch will not enable hot-unplug for RMA or boot time LMBs.

> 
> Also, this would look nicer on top of "[RFC for-2.7 00/11] A new
> infrastructure for guest device trees" where all memory DT nodes are in one
> place so it is easier to follow what exactly you...

Let me take a look at that patchset.

Regards,
Bharata.
Michael Roth May 3, 2016, 11:30 p.m. UTC | #3
Quoting Bharata B Rao (2016-04-19 06:51:41)
> Memory hotplug can fail for some combinations of RAM and maxmem when
> DDW is enabled in the presence of devices like nec-xhci-usb. DDW depends
> on the maximum memory addressable by the guest, and this value is
> currently calculated wrongly by the guest kernel routine
> memory_hotplug_max(). While there is an attempt to fix the guest
> kernel (*), this patch works around the problem within QEMU itself.
>
> The memory_hotplug_max() routine in the guest kernel arrives at the max
> addressable memory by multiplying the lmb-size with the lmb-count
> obtained from the ibm,dynamic-memory property. There are two
> assumptions here:
>
> - All LMBs are part of ibm,dynamic-memory: This is not true for PowerKVM,
>   where only hot-pluggable LMBs are present in this property.
> - The memory area comprising RAM and the hotplug region is contiguous:
>   This needn't always be true for PowerKVM, as there can be a gap
>   between boot time RAM and the hotplug region.
>
> This workaround involves having all the LMBs (the RMA, the rest of the
> boot time LMBs, and the hot-pluggable LMBs) as part of
> ibm,dynamic-memory so that the guest kernel's calculation of max
> addressable memory comes out correct, resulting in a correct DDW value
> which prevents memory hotplug failures. memory@0 is created for the
> RMA, but the RMA LMBs are also represented as "reserved" LMBs in
> ibm,dynamic-memory. Parts of this are essentially a revert of
> e8f986fc57a664a74b9f685b466506366a15201b.
>
> In addition to this, the alignment of the hotplug memory region is
> reduced from the current 1G to 256M (the LMB size in PowerKVM) so that
> we don't end up with any gaps between boot time RAM and the hotplug
> region.

I don't see the actual change to SPAPR_HOTPLUG_MEM_ALIGN here? Is it aligned
by some other means?
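
If it had been included, it would presumably have been a one-line change
along these lines (hypothetical, reconstructed from the commit message;
as confirmed below, the hunk was dropped from the posted patch):

    --- a/include/hw/ppc/spapr.h
    +++ b/include/hw/ppc/spapr.h
    -#define SPAPR_HOTPLUG_MEM_ALIGN (1ULL << 30) /* 1G alignment */
    +#define SPAPR_HOTPLUG_MEM_ALIGN SPAPR_MEMORY_BLOCK_SIZE /* 256M */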

> 
> This change has a side effect on how the memory nodes in DT are
> represented before and after this change. As an example consider
> a guest with the following memory related command line options:
> 
> -m 4G,slots=32,maxmem=8G -numa node,nodeid=0,mem=2G -numa node,nodeid=1,mem=2G
> 
> Before this change, the guest would have
> 
> Scenario 1
> ----------
> memory@0 for RMA
> memory@80000000 for rest of the boot time memory
> ibm,dynamic-reconfiguration-memory for hot-pluggable memory.
> 
> After this commit, the guest will have
> 
> Scenario 2
> ----------
> memory@0 for RMA
> ibm,dynamic-reconfiguration-memory for the entire memory including
> RMA, boot time memory as well as hot-pluggable memory.
> 
> If an existing guest having DT nodes as in Scenario 1 above is migrated
> to a QEMU which has this change, at the target, it continues to have the
> DT nodes as in Scenario 1. However after 1st reboot, the DT representation
> changes over to Scenario 2.
> 
> I haven't yet looked at Jian Jun's DRC migration patchset to ascertain
> if this change works well with DRC migration.
> 
> (*) https://patchwork.ozlabs.org/patch/606912/
> 
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
> Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
> ---
>  hw/ppc/spapr.c         | 59 +++++++++++++++++++++++++++++++++++---------------
>  include/hw/ppc/spapr.h |  1 +
>  2 files changed, 43 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 79a70a9..6d8de2e 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -566,7 +566,6 @@ static int spapr_populate_memory(sPAPRMachineState *spapr, void *fdt)
>          }
>          if (!mem_start) {
>              /* ppc_spapr_init() checks for rma_size <= node0_size already */
> -            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
>              mem_start += spapr->rma_size;
>              node_size -= spapr->rma_size;
>          }
> @@ -759,18 +758,13 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>      int ret, i, offset;
>      uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
>      uint32_t prop_lmb_size[] = {0, cpu_to_be32(lmb_size)};
> -    uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
> +    uint32_t nr_rma_lmbs = spapr->rma_size / lmb_size;
> +    uint32_t nr_lmbs = machine->maxram_size / lmb_size;
> +    uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
>      uint32_t *int_buf, *cur_index, buf_len;
>      int nr_nodes = nb_numa_nodes ? nb_numa_nodes : 1;
> 
>      /*
> -     * Don't create the node if there are no DR LMBs.
> -     */
> -    if (!nr_lmbs) {
> -        return 0;
> -    }
> -
> -    /*
>       * Allocate enough buffer size to fit in ibm,dynamic-memory
>       * or ibm,associativity-lookup-arrays
>       */
> @@ -802,9 +796,15 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>      for (i = 0; i < nr_lmbs; i++) {
>          sPAPRDRConnector *drc;
>          sPAPRDRConnectorClass *drck;
> -        uint64_t addr = i * lmb_size + spapr->hotplug_memory.base;;
> +        uint64_t addr;
>          uint32_t *dynamic_memory = cur_index;
> 
> +        if (i < nr_assigned_lmbs) {
> +            addr = i * lmb_size;
> +        } else {
> +            addr = (i - nr_assigned_lmbs) * lmb_size +
> +                    spapr->hotplug_memory.base;

If the fix relies on there being no gap between hotplug_memory.base and
machine->ram_size, could we instead assert that (nr_assigned_lmbs *
lmb_size == spapr->hotplug_memory.base) and then use the same addr
calculation for all lmbs (here and elsewhere)?
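
A minimal sketch of that suggestion (hypothetical code, assuming the
missing alignment change makes the layout contiguous):

    /* With hotplug_memory.base == ram_size there is one linear LMB
     * space, so a single formula covers boot-time and hotplug LMBs. */
    g_assert((uint64_t)nr_assigned_lmbs * lmb_size ==
             spapr->hotplug_memory.base);
    addr = (uint64_t)i * lmb_size;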

Otherwise the patch looks good, and it seems like a reasonable
workaround to have on the QEMU side even if we still pursue the kernel
fix.

> +        }
>          drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB,
>                                         addr/lmb_size);
>          g_assert(drc);
> @@ -817,7 +817,11 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>          dynamic_memory[4] = cpu_to_be32(numa_get_node(addr, NULL));
>          if (addr < machine->ram_size ||
>                      memory_region_present(get_system_memory(), addr)) {
> -            dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
> +            if (i < nr_rma_lmbs) {
> +                dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_RESERVED);
> +            } else {
> +                dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
> +            }
>          } else {
>              dynamic_memory[5] = cpu_to_be32(0);
>          }
> @@ -879,6 +883,8 @@ int spapr_h_cas_compose_response(sPAPRMachineState *spapr,
>      /* Generate ibm,dynamic-reconfiguration-memory node if required */
>      if (memory_update && smc->dr_lmb_enabled) {
>          _FDT((spapr_populate_drconf_memory(spapr, fdt)));
> +    } else {
> +        _FDT((spapr_populate_memory(spapr, fdt)));
>      }
> 
>      /* Pack resulting tree */
> @@ -916,10 +922,23 @@ static void spapr_finalize_fdt(sPAPRMachineState *spapr,
>      /* open out the base tree into a temp buffer for the final tweaks */
>      _FDT((fdt_open_into(spapr->fdt_skel, fdt, FDT_MAX_SIZE)));
> 
> -    ret = spapr_populate_memory(spapr, fdt);
> -    if (ret < 0) {
> -        fprintf(stderr, "couldn't setup memory nodes in fdt\n");
> -        exit(1);
> +    /*
> +     * Add memory@0 node to represent RMA. Rest of the memory is either
> +     * represented by memory nodes or ibm,dynamic-reconfiguration-memory
> +     * node later during ibm,client-architecture-support call.
> +     *
> +     * If NUMA is configured, ensure that memory@0 ends up in the
> +     * first memory-less node.
> +     */
> +    if (nb_numa_nodes) {
> +        for (i = 0; i < nb_numa_nodes; ++i) {
> +            if (numa_info[i].node_mem) {
> +                spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
> +                break;
> +            }
> +        }
> +    } else {
> +        spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
>      }
> 
>      ret = spapr_populate_vdevice(spapr->vio_bus, fdt);
> @@ -1659,14 +1678,20 @@ static void spapr_create_lmb_dr_connectors(sPAPRMachineState *spapr)
>  {
>      MachineState *machine = MACHINE(spapr);
>      uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
> -    uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
> +    uint32_t nr_lmbs = machine->maxram_size / lmb_size;
> +    uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
>      int i;
> 
>      for (i = 0; i < nr_lmbs; i++) {
>          sPAPRDRConnector *drc;
>          uint64_t addr;
> 
> -        addr = i * lmb_size + spapr->hotplug_memory.base;
> +        if (i < nr_assigned_lmbs) {
> +            addr = i * lmb_size;
> +        } else {
> +            addr = (i - nr_assigned_lmbs) * lmb_size +
> +                    spapr->hotplug_memory.base;
> +        }
>          drc = spapr_dr_connector_new(OBJECT(spapr), SPAPR_DR_CONNECTOR_TYPE_LMB,
>                                       addr/lmb_size);
>          qemu_register_reset(spapr_drc_reset, drc);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 098d85d..9f2050d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -627,5 +627,6 @@ int spapr_rng_populate_dt(void *fdt);
>   * property under ibm,dynamic-reconfiguration-memory node.
>   */
>  #define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
> +#define SPAPR_LMB_FLAGS_RESERVED 0x00000080
> 
>  #endif /* !defined (__HW_SPAPR_H__) */
> -- 
> 2.1.0
>
Bharata B Rao May 4, 2016, 2:38 a.m. UTC | #4
On Tue, May 03, 2016 at 06:30:51PM -0500, Michael Roth wrote:
> Quoting Bharata B Rao (2016-04-19 06:51:41)
> > Memory hotplug can fail for some combinations of RAM and maxmem when
> > DDW is enabled in the presence of devices like nec-xhci-usb. DDW depends
> > on the maximum memory addressable by the guest, and this value is
> > currently calculated wrongly by the guest kernel routine
> > memory_hotplug_max(). While there is an attempt to fix the guest
> > kernel (*), this patch works around the problem within QEMU itself.
> >
> > The memory_hotplug_max() routine in the guest kernel arrives at the max
> > addressable memory by multiplying the lmb-size with the lmb-count
> > obtained from the ibm,dynamic-memory property. There are two
> > assumptions here:
> >
> > - All LMBs are part of ibm,dynamic-memory: This is not true for PowerKVM,
> >   where only hot-pluggable LMBs are present in this property.
> > - The memory area comprising RAM and the hotplug region is contiguous:
> >   This needn't always be true for PowerKVM, as there can be a gap
> >   between boot time RAM and the hotplug region.
> >
> > This workaround involves having all the LMBs (the RMA, the rest of the
> > boot time LMBs, and the hot-pluggable LMBs) as part of
> > ibm,dynamic-memory so that the guest kernel's calculation of max
> > addressable memory comes out correct, resulting in a correct DDW value
> > which prevents memory hotplug failures. memory@0 is created for the
> > RMA, but the RMA LMBs are also represented as "reserved" LMBs in
> > ibm,dynamic-memory. Parts of this are essentially a revert of
> > e8f986fc57a664a74b9f685b466506366a15201b.
> >
> > In addition to this, the alignment of the hotplug memory region is
> > reduced from the current 1G to 256M (the LMB size in PowerKVM) so that
> > we don't end up with any gaps between boot time RAM and the hotplug
> > region.
> 
> I don't see the actual change to SPAPR_HOTPLUG_MEM_ALIGN here? Is it aligned
> by some other means?

Looks like the alignment change went missing in the patch that I sent :(

> 
> > 
> > This change has a side effect on how the memory nodes in DT are
> > represented before and after this change. As an example consider
> > a guest with the following memory related command line options:
> > 
> > -m 4G,slots=32,maxmem=8G -numa node,nodeid=0,mem=2G -numa node,nodeid=1,mem=2G
> > 
> > Before this change, the guest would have
> > 
> > Scenario 1
> > ----------
> > memory@0 for RMA
> > memory@80000000 for rest of the boot time memory
> > ibm,dynamic-reconfiguration-memory for hot-pluggable memory.
> > 
> > After this commit, the guest will have
> > 
> > Scenario 2
> > ----------
> > memory@0 for RMA
> > ibm,dynamic-reconfiguration-memory for the entire memory including
> > RMA, boot time memory as well as hot-pluggable memory.
> > 
> > If an existing guest having DT nodes as in Scenario 1 above is migrated
> > to a QEMU which has this change, at the target, it continues to have the
> > DT nodes as in Scenario 1. However after 1st reboot, the DT representation
> > changes over to Scenario 2.
> > 
> > I haven't yet looked at Jian Jun's DRC migration patchset to ascertain
> > if this change works well with DRC migration.
> > 
> > (*) https://patchwork.ozlabs.org/patch/606912/
> > 
> > Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> > Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
> > Cc: Michael Roth <mdroth@linux.vnet.ibm.com>
> > ---
> >  hw/ppc/spapr.c         | 59 +++++++++++++++++++++++++++++++++++---------------
> >  include/hw/ppc/spapr.h |  1 +
> >  2 files changed, 43 insertions(+), 17 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 79a70a9..6d8de2e 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -566,7 +566,6 @@ static int spapr_populate_memory(sPAPRMachineState *spapr, void *fdt)
> >          }
> >          if (!mem_start) {
> >              /* ppc_spapr_init() checks for rma_size <= node0_size already */
> > -            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
> >              mem_start += spapr->rma_size;
> >              node_size -= spapr->rma_size;
> >          }
> > @@ -759,18 +758,13 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
> >      int ret, i, offset;
> >      uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
> >      uint32_t prop_lmb_size[] = {0, cpu_to_be32(lmb_size)};
> > -    uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
> > +    uint32_t nr_rma_lmbs = spapr->rma_size / lmb_size;
> > +    uint32_t nr_lmbs = machine->maxram_size / lmb_size;
> > +    uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
> >      uint32_t *int_buf, *cur_index, buf_len;
> >      int nr_nodes = nb_numa_nodes ? nb_numa_nodes : 1;
> > 
> >      /*
> > -     * Don't create the node if there are no DR LMBs.
> > -     */
> > -    if (!nr_lmbs) {
> > -        return 0;
> > -    }
> > -
> > -    /*
> >       * Allocate enough buffer size to fit in ibm,dynamic-memory
> >       * or ibm,associativity-lookup-arrays
> >       */
> > @@ -802,9 +796,15 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
> >      for (i = 0; i < nr_lmbs; i++) {
> >          sPAPRDRConnector *drc;
> >          sPAPRDRConnectorClass *drck;
> > -        uint64_t addr = i * lmb_size + spapr->hotplug_memory.base;;
> > +        uint64_t addr;
> >          uint32_t *dynamic_memory = cur_index;
> > 
> > +        if (i < nr_assigned_lmbs) {
> > +            addr = i * lmb_size;
> > +        } else {
> > +            addr = (i - nr_assigned_lmbs) * lmb_size +
> > +                    spapr->hotplug_memory.base;
> 
> If the fix relies on there being no gap between hotplug_memory.base and
> machine->ram_size, could we instead assert that (nr_assigned_lmbs *
> lmb_size == spapr->hotplug_memory.base) and then use the same addr
> calculation for all lmbs (here and elsewhere)?

Yes, of course.

> 
> Otherwise the patch looks good, and it seems like a reasonable
> workaround to have on the QEMU side even if we still pursue the kernel
> fix.

I am beginning to think that we should probably not work around this
guest kernel bug in the memory DT generation code, but instead work
around it in the QEMU DDW code. Wherever the DDW code in QEMU creates
the DDW based on the window size obtained from the guest via an RTAS
call, it could just ignore that wrong guest-provided value and use the
right value (based on maxmem) which QEMU already knows about; see the
sketch below.

That way we will be spared from moving between different DT
representations for memory across QEMU versions.
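
Such a workaround might look roughly like this (a hypothetical sketch;
the exact DDW code path and the names are assumptions, not actual QEMU
code):

    /* Hypothetical guard in the DDW window-creation path: disregard
     * the (possibly underestimated) guest-derived window size and size
     * the window from the maximum RAM QEMU itself knows about. */
    uint64_t win_size = pow2ceil(MACHINE(spapr)->maxram_size);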

Regards,
Bharata.

Patch

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 79a70a9..6d8de2e 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -566,7 +566,6 @@  static int spapr_populate_memory(sPAPRMachineState *spapr, void *fdt)
         }
         if (!mem_start) {
             /* ppc_spapr_init() checks for rma_size <= node0_size already */
-            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
             mem_start += spapr->rma_size;
             node_size -= spapr->rma_size;
         }
@@ -759,18 +758,13 @@  static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
     int ret, i, offset;
     uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
     uint32_t prop_lmb_size[] = {0, cpu_to_be32(lmb_size)};
-    uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
+    uint32_t nr_rma_lmbs = spapr->rma_size / lmb_size;
+    uint32_t nr_lmbs = machine->maxram_size / lmb_size;
+    uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
     uint32_t *int_buf, *cur_index, buf_len;
     int nr_nodes = nb_numa_nodes ? nb_numa_nodes : 1;
 
     /*
-     * Don't create the node if there are no DR LMBs.
-     */
-    if (!nr_lmbs) {
-        return 0;
-    }
-
-    /*
      * Allocate enough buffer size to fit in ibm,dynamic-memory
      * or ibm,associativity-lookup-arrays
      */
@@ -802,9 +796,15 @@  static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
     for (i = 0; i < nr_lmbs; i++) {
         sPAPRDRConnector *drc;
         sPAPRDRConnectorClass *drck;
-        uint64_t addr = i * lmb_size + spapr->hotplug_memory.base;;
+        uint64_t addr;
         uint32_t *dynamic_memory = cur_index;
 
+        if (i < nr_assigned_lmbs) {
+            addr = i * lmb_size;
+        } else {
+            addr = (i - nr_assigned_lmbs) * lmb_size +
+                    spapr->hotplug_memory.base;
+        }
         drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB,
                                        addr/lmb_size);
         g_assert(drc);
@@ -817,7 +817,11 @@  static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
         dynamic_memory[4] = cpu_to_be32(numa_get_node(addr, NULL));
         if (addr < machine->ram_size ||
                     memory_region_present(get_system_memory(), addr)) {
-            dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
+            if (i < nr_rma_lmbs) {
+                dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_RESERVED);
+            } else {
+                dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
+            }
         } else {
             dynamic_memory[5] = cpu_to_be32(0);
         }
@@ -879,6 +883,8 @@  int spapr_h_cas_compose_response(sPAPRMachineState *spapr,
     /* Generate ibm,dynamic-reconfiguration-memory node if required */
     if (memory_update && smc->dr_lmb_enabled) {
         _FDT((spapr_populate_drconf_memory(spapr, fdt)));
+    } else {
+        _FDT((spapr_populate_memory(spapr, fdt)));
     }
 
     /* Pack resulting tree */
@@ -916,10 +922,23 @@  static void spapr_finalize_fdt(sPAPRMachineState *spapr,
     /* open out the base tree into a temp buffer for the final tweaks */
     _FDT((fdt_open_into(spapr->fdt_skel, fdt, FDT_MAX_SIZE)));
 
-    ret = spapr_populate_memory(spapr, fdt);
-    if (ret < 0) {
-        fprintf(stderr, "couldn't setup memory nodes in fdt\n");
-        exit(1);
+    /*
+     * Add memory@0 node to represent RMA. Rest of the memory is either
+     * represented by memory nodes or ibm,dynamic-reconfiguration-memory
+     * node later during ibm,client-architecture-support call.
+     *
+     * If NUMA is configured, ensure that memory@0 ends up in the
+     * first memory-less node.
+     */
+    if (nb_numa_nodes) {
+        for (i = 0; i < nb_numa_nodes; ++i) {
+            if (numa_info[i].node_mem) {
+                spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
+                break;
+            }
+        }
+    } else {
+        spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
     }
 
     ret = spapr_populate_vdevice(spapr->vio_bus, fdt);
@@ -1659,14 +1678,20 @@  static void spapr_create_lmb_dr_connectors(sPAPRMachineState *spapr)
 {
     MachineState *machine = MACHINE(spapr);
     uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
-    uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
+    uint32_t nr_lmbs = machine->maxram_size / lmb_size;
+    uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
     int i;
 
     for (i = 0; i < nr_lmbs; i++) {
         sPAPRDRConnector *drc;
         uint64_t addr;
 
-        addr = i * lmb_size + spapr->hotplug_memory.base;
+        if (i < nr_assigned_lmbs) {
+            addr = i * lmb_size;
+        } else {
+            addr = (i - nr_assigned_lmbs) * lmb_size +
+                    spapr->hotplug_memory.base;
+        }
         drc = spapr_dr_connector_new(OBJECT(spapr), SPAPR_DR_CONNECTOR_TYPE_LMB,
                                      addr/lmb_size);
         qemu_register_reset(spapr_drc_reset, drc);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 098d85d..9f2050d 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -627,5 +627,6 @@  int spapr_rng_populate_dt(void *fdt);
  * property under ibm,dynamic-reconfiguration-memory node.
  */
 #define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
+#define SPAPR_LMB_FLAGS_RESERVED 0x00000080
 
 #endif /* !defined (__HW_SPAPR_H__) */