diff mbox

[RFC] hw/arm/virt: Add support for NUMA on ARM64

Message ID 1417524986-9196-1-git-send-email-zhaoshenglong@huawei.com
State New
Headers show

Commit Message

Shannon Zhao Dec. 2, 2014, 12:56 p.m. UTC
Add support for NUMA on ARM64. Tested successfully running a guest
Linux kernel with the following patch applied:

- arm64:numa: adding numa support for arm64 platforms.
http://www.spinics.net/lists/arm-kernel/msg365316.html

Example qemu command line:
qemu-system-aarch64 \
    -enable-kvm -smp 4\
    -kernel Image \
    -m 512 -machine virt,kernel_irqchip=on \
    -initrd guestfs.cpio.gz \
    -cpu host -nographic \
    -numa node,mem=256M,cpus=0-1,nodeid=0 \
    -numa node,mem=256M,cpus=2-3,nodeid=1 \
    -append "console=ttyAMA0 root=/dev/ram"

Todo:
1)The NUMA nodes information in DT is not finalized yet, so this
patch might need to be further modified to follow any changes in it.

2)Consider IO-NUMA as well

Please refer to the following url for NUMA DT node details:

- Documentation: arm64/arm: dt bindings for numa.
http://www.spinics.net/lists/arm-kernel/msg380200.html

Example: 2 Node system each having 2 CPUs and a Memory

	numa-map {
		#address-cells = <2>;
		#size-cells = <1>;
		#node-count = <2>;
		mem-map =  <0x0 0x40000000 0>,
		           <0x0 0x50000000 1>;

		cpu-map = <0 1 0>,
			  <2 3 1>;

		node-matrix = <0 0 10>,
			      <0 1 20>,
			      <1 0 20>,
			      <1 1 10>;
	};

- mem-map:	This property defines the association between a range of
		memory and the proximity domain/numa node to which it belongs.

- cpu-map:	This property defines the association of range of processors
		(range of cpu ids) and the proximity domain to which
		the processor belongs.

- node-matrix:	This table provides a matrix that describes the relative
		distance (memory latency) between all System Localities.
		The value of each Entry[i j distance] in node-matrix table,
		where i represents a row of a matrix and j represents a
		column of a matrix, indicates the relative distances
		from Proximity Domain/Numa node i to every other
		node j in the system (including itself).

Signed-off-by: Shannon Zhao <zhaoshenglong@huawei.com>
---
 hw/arm/boot.c |   25 ------------
 hw/arm/virt.c |  120 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 113 insertions(+), 32 deletions(-)

Comments

Peter Maydell Dec. 2, 2014, 4 p.m. UTC | #1
On 2 December 2014 at 12:56, Shannon Zhao <zhaoshenglong@huawei.com> wrote:
> Add support for NUMA on ARM64. Tested successfully running a guest
> Linux kernel with the following patch applied:
>
> - arm64:numa: adding numa support for arm64 platforms.
> http://www.spinics.net/lists/arm-kernel/msg365316.html
>
> Example qemu command line:
> qemu-system-aarch64 \
>     -enable-kvm -smp 4\
>     -kernel Image \
>     -m 512 -machine virt,kernel_irqchip=on \
>     -initrd guestfs.cpio.gz \
>     -cpu host -nographic \
>     -numa node,mem=256M,cpus=0-1,nodeid=0 \
>     -numa node,mem=256M,cpus=2-3,nodeid=1 \
>     -append "console=ttyAMA0 root=/dev/ram"

Thanks for sending this patch. Unfortunately I know very
little about NUMA so it's a bit difficult for me to
evaluate (would anybody else care to look at it?).
Nevertheless, it's on my list of patches to review;
ping me if I don't get to it within a week or two.

-- PMM
Shannon Zhao Dec. 3, 2014, 2:46 a.m. UTC | #2
On 2014/12/3 0:00, Peter Maydell wrote:
> On 2 December 2014 at 12:56, Shannon Zhao <zhaoshenglong@huawei.com> wrote:
>> Add support for NUMA on ARM64. Tested successfully running a guest
>> Linux kernel with the following patch applied:
>>
>> - arm64:numa: adding numa support for arm64 platforms.
>> http://www.spinics.net/lists/arm-kernel/msg365316.html
>>
>> Example qemu command line:
>> qemu-system-aarch64 \
>>     -enable-kvm -smp 4\
>>     -kernel Image \
>>     -m 512 -machine virt,kernel_irqchip=on \
>>     -initrd guestfs.cpio.gz \
>>     -cpu host -nographic \
>>     -numa node,mem=256M,cpus=0-1,nodeid=0 \
>>     -numa node,mem=256M,cpus=2-3,nodeid=1 \
>>     -append "console=ttyAMA0 root=/dev/ram"
> 
> Thanks for sending this patch. Unfortunately I know very
> little about NUMA so it's a bit difficult for me to
> evaluate (would anybody else care to look at it?).
> Nevertheless, it's on my list of patches to review;
> ping me if I don't get to it within a week or two.
> 

Ok, thanks for your reply :-)

Something about NUMA can get from wiki:

http://en.wikipedia.org/wiki/Non-uniform_memory_access

Thanks,
Shannon
Peter Maydell Dec. 8, 2014, 1:49 p.m. UTC | #3
On 2 December 2014 at 12:56, Shannon Zhao <zhaoshenglong@huawei.com> wrote:
> Add support for NUMA on ARM64. Tested successfully running a guest
> Linux kernel with the following patch applied:

I'm still hoping for review from somebody who better understands
how QEMU and NUMA should interact, but in the meantime some comments
at a code level:

>  hw/arm/boot.c |   25 ------------
>  hw/arm/virt.c |  120 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 113 insertions(+), 32 deletions(-)
>
> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> index 0014c34..c20fee4 100644
> --- a/hw/arm/boot.c
> +++ b/hw/arm/boot.c
> @@ -335,7 +335,6 @@ static int load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
>  {
>      void *fdt = NULL;
>      int size, rc;
> -    uint32_t acells, scells;
>
>      if (binfo->dtb_filename) {
>          char *filename;
> @@ -369,30 +368,6 @@ static int load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
>          return 0;
>      }
>
> -    acells = qemu_fdt_getprop_cell(fdt, "/", "#address-cells");
> -    scells = qemu_fdt_getprop_cell(fdt, "/", "#size-cells");
> -    if (acells == 0 || scells == 0) {
> -        fprintf(stderr, "dtb file invalid (#address-cells or #size-cells 0)\n");
> -        goto fail;
> -    }
> -
> -    if (scells < 2 && binfo->ram_size >= (1ULL << 32)) {
> -        /* This is user error so deserves a friendlier error message
> -         * than the failure of setprop_sized_cells would provide
> -         */
> -        fprintf(stderr, "qemu: dtb file not compatible with "
> -                "RAM size > 4GB\n");
> -        goto fail;
> -    }
> -
> -    rc = qemu_fdt_setprop_sized_cells(fdt, "/memory", "reg",
> -                                      acells, binfo->loader_start,
> -                                      scells, binfo->ram_size);
> -    if (rc < 0) {
> -        fprintf(stderr, "couldn't set /memory/reg\n");
> -        goto fail;
> -    }
> -

This patchset seems to be moving the initialization of a lot of
the dtb from this generic code into the virt board. That doesn't
seem right to me -- why would NUMA support be specific to the
virt board? I would expect support for this to be in the generic
code (possibly controlled with a board option for "I support NUMA").
As it is your patch will break support for every other
board that uses device trees, because they rely on this code
which you've deleted here.

>      if (binfo->kernel_cmdline && *binfo->kernel_cmdline) {
>          rc = qemu_fdt_setprop_string(fdt, "/chosen", "bootargs",
>                                       binfo->kernel_cmdline);
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 78f618d..9d18a91 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -170,8 +170,6 @@ static void create_fdt(VirtBoardInfo *vbi)
>       * to fill in necessary properties later
>       */
>      qemu_fdt_add_subnode(fdt, "/chosen");
> -    qemu_fdt_add_subnode(fdt, "/memory");
> -    qemu_fdt_setprop_string(fdt, "/memory", "device_type", "memory");
>
>      /* Clock node, for the benefit of the UART. The kernel device tree
>       * binding documentation claims the PL011 node clock properties are
> @@ -235,6 +233,116 @@ static void fdt_add_psci_node(const VirtBoardInfo *vbi)
>      qemu_fdt_setprop_cell(fdt, "/psci", "migrate", migrate_fn);
>  }
>
> +static int virt_memory_init(MachineState *machine,
> +                            MemoryRegion *system_memory,
> +                            const VirtBoardInfo *vbi)
> +{
> +    MemoryRegion *ram = g_new(MemoryRegion, 1);
> +    CPUState *cpu;
> +    int min_cpu = 0, max_cpu = 0;
> +    int i, j, count, len;
> +    uint32_t acells, scells;
> +
> +    acells = qemu_fdt_getprop_cell(vbi->fdt, "/", "#address-cells");
> +    scells = qemu_fdt_getprop_cell(vbi->fdt, "/", "#size-cells");
> +    if (acells == 0 || scells == 0) {
> +        fprintf(stderr, "dtb file invalid (#address-cells or #size-cells 0)\n");
> +        goto fail;
> +    }
> +
> +    if (scells < 2 && machine->ram_size >= (1ULL << 32)) {
> +        /* This is user error so deserves a friendlier error message
> +         * than the failure of setprop_sized_cells would provide
> +         */
> +        fprintf(stderr, "qemu: dtb file not compatible with "
> +                "RAM size > 4GB\n");
> +        goto fail;
> +    }
> +
> +    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
> +                                         machine->ram_size);
> +    memory_region_add_subregion(system_memory, vbi->memmap[VIRT_MEM].base, ram);
> +
> +    hwaddr mem_base = vbi->memmap[VIRT_MEM].base;
> +
> +    if (!nb_numa_nodes) {
> +        qemu_fdt_add_subnode(vbi->fdt, "/memory");
> +        qemu_fdt_setprop_string(vbi->fdt, "/memory", "device_type", "memory");
> +        qemu_fdt_setprop_sized_cells(vbi->fdt, "/memory", "reg",
> +                                      acells, mem_base,
> +                                      scells, machine->ram_size);
> +        return 0;
> +    }
> +
> +    qemu_fdt_add_subnode(vbi->fdt, "/numa-map");
> +    qemu_fdt_setprop_cell(vbi->fdt, "/numa-map", "#address-cells", 0x2);
> +    qemu_fdt_setprop_cell(vbi->fdt, "/numa-map", "#size-cells", 0x1);
> +    qemu_fdt_setprop_cell(vbi->fdt, "/numa-map", "#node-count", 0x2);
> +
> +    uint64_t *mem_map = g_malloc0(nb_numa_nodes * sizeof(uint64_t) * 6);
> +    uint64_t *cpu_map = g_malloc0(nb_numa_nodes * sizeof(uint64_t) * 6);
> +    uint64_t *node_matrix = g_malloc0(nb_numa_nodes * nb_numa_nodes
> +                                      * sizeof(uint64_t) * 6);

g_new0() is usually better than g_malloc0(count * sizeof(thing));

> +
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        uint64_t buffer[6] = {1, 0x00000000, 1, mem_base, 1, i};
> +        char *nodename;
> +        nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
> +        qemu_fdt_add_subnode(vbi->fdt, nodename);
> +        qemu_fdt_setprop_string(vbi->fdt, nodename, "device_type", "memory");
> +        qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg",
> +                                      acells, mem_base,
> +                                      scells, numa_info[i].node_mem-1);
> +        memcpy(mem_map + 6 * i, buffer, 6 * sizeof(*buffer));

Why do we create a local buffer array and then immediately do nothing
with it but memcpy() it somewhere else? I suspect this code would
be clearer if you defined a type (possibly a struct) where you're
currently using raw uint64_t array[6]. Then you could make your
mem_map, cpu_map and node_matrix variables have more sensible types
than raw uint64_t*, and you could just set the right element in them
using C array syntax rather than these memcpy calls.

> +        mem_base += numa_info[i].node_mem;
> +        g_free(nodename);
> +    }
> +    qemu_fdt_setprop_sized_cells_from_array(vbi->fdt, "/numa-map", "mem-map",
> +                        (nb_numa_nodes * 6) / 2, mem_map);

Lots of unnecessary hardcoded 6s here (and above and below).

> +
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        CPU_FOREACH(cpu) {
> +            if (test_bit(cpu->cpu_index, numa_info[i].node_cpu)) {
> +                if (cpu->cpu_index < min_cpu) {
> +                    min_cpu = cpu->cpu_index;
> +                }
> +                if (cpu->cpu_index > max_cpu) {
> +                    max_cpu = cpu->cpu_index;
> +                }
> +            }
> +        }
> +
> +        uint64_t buffer[6] = {1, min_cpu, 1, max_cpu, 1, i};
> +        memcpy(cpu_map + 6 * i, buffer, 6 * sizeof(*buffer));
> +        min_cpu = max_cpu + 1;
> +    }
> +
> +    qemu_fdt_setprop_sized_cells_from_array(vbi->fdt, "/numa-map", "cpu-map",
> +                        (nb_numa_nodes * 6) / 2, cpu_map);
> +    count = 0;
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        for (j = 0; j < nb_numa_nodes; j++) {
> +            len = 20;
> +            if (i == j) {
> +                len = 10;
> +            }
> +            uint64_t buffer[6] = {1, i, 1, j, 1, len};
> +            memcpy(node_matrix + 6 * count, buffer, 6 * sizeof(*buffer));
> +            count++;
> +        }
> +    }
> +    qemu_fdt_setprop_sized_cells_from_array(vbi->fdt, "/numa-map",
> +        "node-matrix", (nb_numa_nodes * nb_numa_nodes * 6) / 2, node_matrix);
> +
> +    g_free(mem_map);
> +    g_free(cpu_map);
> +    g_free(node_matrix);
> +
> +    return 0;
> +fail:
> +    return -1;
> +}
> +
>  static void fdt_add_timer_nodes(const VirtBoardInfo *vbi)
>  {
>      /* Note that on A15 h/w these interrupts are level-triggered,
> @@ -532,7 +640,6 @@ static void machvirt_init(MachineState *machine)
>      qemu_irq pic[NUM_IRQS];
>      MemoryRegion *sysmem = get_system_memory();
>      int n;
> -    MemoryRegion *ram = g_new(MemoryRegion, 1);
>      const char *cpu_model = machine->cpu_model;
>      VirtBoardInfo *vbi;
>
> @@ -585,10 +692,9 @@ static void machvirt_init(MachineState *machine)
>      fdt_add_cpu_nodes(vbi);
>      fdt_add_psci_node(vbi);
>
> -    memory_region_init_ram(ram, NULL, "mach-virt.ram", machine->ram_size,
> -                           &error_abort);
> -    vmstate_register_ram_global(ram);
> -    memory_region_add_subregion(sysmem, vbi->memmap[VIRT_MEM].base, ram);
> +    if (virt_memory_init(machine, sysmem, vbi) < 0) {
> +        exit(1);
> +    }
>
>      create_flash(vbi);
>
> --
> 1.7.1

thanks
-- PMM
Shannon Zhao Dec. 15, 2014, 1:06 a.m. UTC | #4
On 2014/12/8 21:49, Peter Maydell wrote:
> On 2 December 2014 at 12:56, Shannon Zhao <zhaoshenglong@huawei.com> wrote:
>> Add support for NUMA on ARM64. Tested successfully running a guest
>> Linux kernel with the following patch applied:
> 
> I'm still hoping for review from somebody who better understands
> how QEMU and NUMA should interact, but in the meantime some comments
> at a code level:
> 
>>  hw/arm/boot.c |   25 ------------
>>  hw/arm/virt.c |  120 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>  2 files changed, 113 insertions(+), 32 deletions(-)
>>
>> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
>> index 0014c34..c20fee4 100644
>> --- a/hw/arm/boot.c
>> +++ b/hw/arm/boot.c
>> @@ -335,7 +335,6 @@ static int load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
>>  {
>>      void *fdt = NULL;
>>      int size, rc;
>> -    uint32_t acells, scells;
>>
>>      if (binfo->dtb_filename) {
>>          char *filename;
>> @@ -369,30 +368,6 @@ static int load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
>>          return 0;
>>      }
>>
>> -    acells = qemu_fdt_getprop_cell(fdt, "/", "#address-cells");
>> -    scells = qemu_fdt_getprop_cell(fdt, "/", "#size-cells");
>> -    if (acells == 0 || scells == 0) {
>> -        fprintf(stderr, "dtb file invalid (#address-cells or #size-cells 0)\n");
>> -        goto fail;
>> -    }
>> -
>> -    if (scells < 2 && binfo->ram_size >= (1ULL << 32)) {
>> -        /* This is user error so deserves a friendlier error message
>> -         * than the failure of setprop_sized_cells would provide
>> -         */
>> -        fprintf(stderr, "qemu: dtb file not compatible with "
>> -                "RAM size > 4GB\n");
>> -        goto fail;
>> -    }
>> -
>> -    rc = qemu_fdt_setprop_sized_cells(fdt, "/memory", "reg",
>> -                                      acells, binfo->loader_start,
>> -                                      scells, binfo->ram_size);
>> -    if (rc < 0) {
>> -        fprintf(stderr, "couldn't set /memory/reg\n");
>> -        goto fail;
>> -    }
>> -
> 
> This patchset seems to be moving the initialization of a lot of
> the dtb from this generic code into the virt board. That doesn't
> seem right to me -- why would NUMA support be specific to the
> virt board? I would expect support for this to be in the generic
> code (possibly controlled with a board option for "I support NUMA").
> As it is your patch will break support for every other
> board that uses device trees, because they rely on this code
> which you've deleted here.
> 

Good suggestion. Will fix this :-)

>>      if (binfo->kernel_cmdline && *binfo->kernel_cmdline) {
>>          rc = qemu_fdt_setprop_string(fdt, "/chosen", "bootargs",
>>                                       binfo->kernel_cmdline);
>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>> index 78f618d..9d18a91 100644
>> --- a/hw/arm/virt.c
>> +++ b/hw/arm/virt.c
>> @@ -170,8 +170,6 @@ static void create_fdt(VirtBoardInfo *vbi)
>>       * to fill in necessary properties later
>>       */
>>      qemu_fdt_add_subnode(fdt, "/chosen");
>> -    qemu_fdt_add_subnode(fdt, "/memory");
>> -    qemu_fdt_setprop_string(fdt, "/memory", "device_type", "memory");
>>
>>      /* Clock node, for the benefit of the UART. The kernel device tree
>>       * binding documentation claims the PL011 node clock properties are
>> @@ -235,6 +233,116 @@ static void fdt_add_psci_node(const VirtBoardInfo *vbi)
>>      qemu_fdt_setprop_cell(fdt, "/psci", "migrate", migrate_fn);
>>  }
>>
>> +static int virt_memory_init(MachineState *machine,
>> +                            MemoryRegion *system_memory,
>> +                            const VirtBoardInfo *vbi)
>> +{
>> +    MemoryRegion *ram = g_new(MemoryRegion, 1);
>> +    CPUState *cpu;
>> +    int min_cpu = 0, max_cpu = 0;
>> +    int i, j, count, len;
>> +    uint32_t acells, scells;
>> +
>> +    acells = qemu_fdt_getprop_cell(vbi->fdt, "/", "#address-cells");
>> +    scells = qemu_fdt_getprop_cell(vbi->fdt, "/", "#size-cells");
>> +    if (acells == 0 || scells == 0) {
>> +        fprintf(stderr, "dtb file invalid (#address-cells or #size-cells 0)\n");
>> +        goto fail;
>> +    }
>> +
>> +    if (scells < 2 && machine->ram_size >= (1ULL << 32)) {
>> +        /* This is user error so deserves a friendlier error message
>> +         * than the failure of setprop_sized_cells would provide
>> +         */
>> +        fprintf(stderr, "qemu: dtb file not compatible with "
>> +                "RAM size > 4GB\n");
>> +        goto fail;
>> +    }
>> +
>> +    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
>> +                                         machine->ram_size);
>> +    memory_region_add_subregion(system_memory, vbi->memmap[VIRT_MEM].base, ram);
>> +
>> +    hwaddr mem_base = vbi->memmap[VIRT_MEM].base;
>> +
>> +    if (!nb_numa_nodes) {
>> +        qemu_fdt_add_subnode(vbi->fdt, "/memory");
>> +        qemu_fdt_setprop_string(vbi->fdt, "/memory", "device_type", "memory");
>> +        qemu_fdt_setprop_sized_cells(vbi->fdt, "/memory", "reg",
>> +                                      acells, mem_base,
>> +                                      scells, machine->ram_size);
>> +        return 0;
>> +    }
>> +
>> +    qemu_fdt_add_subnode(vbi->fdt, "/numa-map");
>> +    qemu_fdt_setprop_cell(vbi->fdt, "/numa-map", "#address-cells", 0x2);
>> +    qemu_fdt_setprop_cell(vbi->fdt, "/numa-map", "#size-cells", 0x1);
>> +    qemu_fdt_setprop_cell(vbi->fdt, "/numa-map", "#node-count", 0x2);
>> +
>> +    uint64_t *mem_map = g_malloc0(nb_numa_nodes * sizeof(uint64_t) * 6);
>> +    uint64_t *cpu_map = g_malloc0(nb_numa_nodes * sizeof(uint64_t) * 6);
>> +    uint64_t *node_matrix = g_malloc0(nb_numa_nodes * nb_numa_nodes
>> +                                      * sizeof(uint64_t) * 6);
> 
> g_new0() is usually better than g_malloc0(count * sizeof(thing));
> 

Ok

>> +
>> +    for (i = 0; i < nb_numa_nodes; i++) {
>> +        uint64_t buffer[6] = {1, 0x00000000, 1, mem_base, 1, i};
>> +        char *nodename;
>> +        nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
>> +        qemu_fdt_add_subnode(vbi->fdt, nodename);
>> +        qemu_fdt_setprop_string(vbi->fdt, nodename, "device_type", "memory");
>> +        qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg",
>> +                                      acells, mem_base,
>> +                                      scells, numa_info[i].node_mem-1);
>> +        memcpy(mem_map + 6 * i, buffer, 6 * sizeof(*buffer));
> 
> Why do we create a local buffer array and then immediately do nothing
> with it but memcpy() it somewhere else? I suspect this code would
> be clearer if you defined a type (possibly a struct) where you're
> currently using raw uint64_t array[6]. Then you could make your
> mem_map, cpu_map and node_matrix variables have more sensible types
> than raw uint64_t*, and you could just set the right element in them
> using C array syntax rather than these memcpy calls.
> 

OK, thanks for your suggestion.

>> +        mem_base += numa_info[i].node_mem;
>> +        g_free(nodename);
>> +    }
>> +    qemu_fdt_setprop_sized_cells_from_array(vbi->fdt, "/numa-map", "mem-map",
>> +                        (nb_numa_nodes * 6) / 2, mem_map);
> 
> Lots of unnecessary hardcoded 6s here (and above and below).
> 

Will replace them.

>> +
>> +    for (i = 0; i < nb_numa_nodes; i++) {
>> +        CPU_FOREACH(cpu) {
>> +            if (test_bit(cpu->cpu_index, numa_info[i].node_cpu)) {
>> +                if (cpu->cpu_index < min_cpu) {
>> +                    min_cpu = cpu->cpu_index;
>> +                }
>> +                if (cpu->cpu_index > max_cpu) {
>> +                    max_cpu = cpu->cpu_index;
>> +                }
>> +            }
>> +        }
>> +
>> +        uint64_t buffer[6] = {1, min_cpu, 1, max_cpu, 1, i};
>> +        memcpy(cpu_map + 6 * i, buffer, 6 * sizeof(*buffer));
>> +        min_cpu = max_cpu + 1;
>> +    }
>> +
>> +    qemu_fdt_setprop_sized_cells_from_array(vbi->fdt, "/numa-map", "cpu-map",
>> +                        (nb_numa_nodes * 6) / 2, cpu_map);
>> +    count = 0;
>> +    for (i = 0; i < nb_numa_nodes; i++) {
>> +        for (j = 0; j < nb_numa_nodes; j++) {
>> +            len = 20;
>> +            if (i == j) {
>> +                len = 10;
>> +            }
>> +            uint64_t buffer[6] = {1, i, 1, j, 1, len};
>> +            memcpy(node_matrix + 6 * count, buffer, 6 * sizeof(*buffer));
>> +            count++;
>> +        }
>> +    }
>> +    qemu_fdt_setprop_sized_cells_from_array(vbi->fdt, "/numa-map",
>> +        "node-matrix", (nb_numa_nodes * nb_numa_nodes * 6) / 2, node_matrix);
>> +
>> +    g_free(mem_map);
>> +    g_free(cpu_map);
>> +    g_free(node_matrix);
>> +
>> +    return 0;
>> +fail:
>> +    return -1;
>> +}
>> +
>>  static void fdt_add_timer_nodes(const VirtBoardInfo *vbi)
>>  {
>>      /* Note that on A15 h/w these interrupts are level-triggered,
>> @@ -532,7 +640,6 @@ static void machvirt_init(MachineState *machine)
>>      qemu_irq pic[NUM_IRQS];
>>      MemoryRegion *sysmem = get_system_memory();
>>      int n;
>> -    MemoryRegion *ram = g_new(MemoryRegion, 1);
>>      const char *cpu_model = machine->cpu_model;
>>      VirtBoardInfo *vbi;
>>
>> @@ -585,10 +692,9 @@ static void machvirt_init(MachineState *machine)
>>      fdt_add_cpu_nodes(vbi);
>>      fdt_add_psci_node(vbi);
>>
>> -    memory_region_init_ram(ram, NULL, "mach-virt.ram", machine->ram_size,
>> -                           &error_abort);
>> -    vmstate_register_ram_global(ram);
>> -    memory_region_add_subregion(sysmem, vbi->memmap[VIRT_MEM].base, ram);
>> +    if (virt_memory_init(machine, sysmem, vbi) < 0) {
>> +        exit(1);
>> +    }
>>
>>      create_flash(vbi);
>>
>> --
>> 1.7.1
> 
> thanks
> -- PMM
> 
> .
> 

Thanks for your comments. Anyone else has more comments about this?

Thanks,
Shannon
diff mbox

Patch

diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 0014c34..c20fee4 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -335,7 +335,6 @@  static int load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
 {
     void *fdt = NULL;
     int size, rc;
-    uint32_t acells, scells;
 
     if (binfo->dtb_filename) {
         char *filename;
@@ -369,30 +368,6 @@  static int load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
         return 0;
     }
 
-    acells = qemu_fdt_getprop_cell(fdt, "/", "#address-cells");
-    scells = qemu_fdt_getprop_cell(fdt, "/", "#size-cells");
-    if (acells == 0 || scells == 0) {
-        fprintf(stderr, "dtb file invalid (#address-cells or #size-cells 0)\n");
-        goto fail;
-    }
-
-    if (scells < 2 && binfo->ram_size >= (1ULL << 32)) {
-        /* This is user error so deserves a friendlier error message
-         * than the failure of setprop_sized_cells would provide
-         */
-        fprintf(stderr, "qemu: dtb file not compatible with "
-                "RAM size > 4GB\n");
-        goto fail;
-    }
-
-    rc = qemu_fdt_setprop_sized_cells(fdt, "/memory", "reg",
-                                      acells, binfo->loader_start,
-                                      scells, binfo->ram_size);
-    if (rc < 0) {
-        fprintf(stderr, "couldn't set /memory/reg\n");
-        goto fail;
-    }
-
     if (binfo->kernel_cmdline && *binfo->kernel_cmdline) {
         rc = qemu_fdt_setprop_string(fdt, "/chosen", "bootargs",
                                      binfo->kernel_cmdline);
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 78f618d..9d18a91 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -170,8 +170,6 @@  static void create_fdt(VirtBoardInfo *vbi)
      * to fill in necessary properties later
      */
     qemu_fdt_add_subnode(fdt, "/chosen");
-    qemu_fdt_add_subnode(fdt, "/memory");
-    qemu_fdt_setprop_string(fdt, "/memory", "device_type", "memory");
 
     /* Clock node, for the benefit of the UART. The kernel device tree
      * binding documentation claims the PL011 node clock properties are
@@ -235,6 +233,116 @@  static void fdt_add_psci_node(const VirtBoardInfo *vbi)
     qemu_fdt_setprop_cell(fdt, "/psci", "migrate", migrate_fn);
 }
 
+static int virt_memory_init(MachineState *machine,
+                            MemoryRegion *system_memory,
+                            const VirtBoardInfo *vbi)
+{
+    MemoryRegion *ram = g_new(MemoryRegion, 1);
+    CPUState *cpu;
+    int min_cpu = 0, max_cpu = 0;
+    int i, j, count, len;
+    uint32_t acells, scells;
+
+    acells = qemu_fdt_getprop_cell(vbi->fdt, "/", "#address-cells");
+    scells = qemu_fdt_getprop_cell(vbi->fdt, "/", "#size-cells");
+    if (acells == 0 || scells == 0) {
+        fprintf(stderr, "dtb file invalid (#address-cells or #size-cells 0)\n");
+        goto fail;
+    }
+
+    if (scells < 2 && machine->ram_size >= (1ULL << 32)) {
+        /* This is user error so deserves a friendlier error message
+         * than the failure of setprop_sized_cells would provide
+         */
+        fprintf(stderr, "qemu: dtb file not compatible with "
+                "RAM size > 4GB\n");
+        goto fail;
+    }
+
+    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram",
+                                         machine->ram_size);
+    memory_region_add_subregion(system_memory, vbi->memmap[VIRT_MEM].base, ram);
+
+    hwaddr mem_base = vbi->memmap[VIRT_MEM].base;
+
+    if (!nb_numa_nodes) {
+        qemu_fdt_add_subnode(vbi->fdt, "/memory");
+        qemu_fdt_setprop_string(vbi->fdt, "/memory", "device_type", "memory");
+        qemu_fdt_setprop_sized_cells(vbi->fdt, "/memory", "reg",
+                                      acells, mem_base,
+                                      scells, machine->ram_size);
+        return 0;
+    }
+
+    qemu_fdt_add_subnode(vbi->fdt, "/numa-map");
+    qemu_fdt_setprop_cell(vbi->fdt, "/numa-map", "#address-cells", 0x2);
+    qemu_fdt_setprop_cell(vbi->fdt, "/numa-map", "#size-cells", 0x1);
+    qemu_fdt_setprop_cell(vbi->fdt, "/numa-map", "#node-count", 0x2);
+
+    uint64_t *mem_map = g_malloc0(nb_numa_nodes * sizeof(uint64_t) * 6);
+    uint64_t *cpu_map = g_malloc0(nb_numa_nodes * sizeof(uint64_t) * 6);
+    uint64_t *node_matrix = g_malloc0(nb_numa_nodes * nb_numa_nodes
+                                      * sizeof(uint64_t) * 6);
+
+    for (i = 0; i < nb_numa_nodes; i++) {
+        uint64_t buffer[6] = {1, 0x00000000, 1, mem_base, 1, i};
+        char *nodename;
+        nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
+        qemu_fdt_add_subnode(vbi->fdt, nodename);
+        qemu_fdt_setprop_string(vbi->fdt, nodename, "device_type", "memory");
+        qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg",
+                                      acells, mem_base,
+                                      scells, numa_info[i].node_mem-1);
+        memcpy(mem_map + 6 * i, buffer, 6 * sizeof(*buffer));
+        mem_base += numa_info[i].node_mem;
+        g_free(nodename);
+    }
+    qemu_fdt_setprop_sized_cells_from_array(vbi->fdt, "/numa-map", "mem-map",
+                        (nb_numa_nodes * 6) / 2, mem_map);
+
+    for (i = 0; i < nb_numa_nodes; i++) {
+        CPU_FOREACH(cpu) {
+            if (test_bit(cpu->cpu_index, numa_info[i].node_cpu)) {
+                if (cpu->cpu_index < min_cpu) {
+                    min_cpu = cpu->cpu_index;
+                }
+                if (cpu->cpu_index > max_cpu) {
+                    max_cpu = cpu->cpu_index;
+                }
+            }
+        }
+
+        uint64_t buffer[6] = {1, min_cpu, 1, max_cpu, 1, i};
+        memcpy(cpu_map + 6 * i, buffer, 6 * sizeof(*buffer));
+        min_cpu = max_cpu + 1;
+    }
+
+    qemu_fdt_setprop_sized_cells_from_array(vbi->fdt, "/numa-map", "cpu-map",
+                        (nb_numa_nodes * 6) / 2, cpu_map);
+    count = 0;
+    for (i = 0; i < nb_numa_nodes; i++) {
+        for (j = 0; j < nb_numa_nodes; j++) {
+            len = 20;
+            if (i == j) {
+                len = 10;
+            }
+            uint64_t buffer[6] = {1, i, 1, j, 1, len};
+            memcpy(node_matrix + 6 * count, buffer, 6 * sizeof(*buffer));
+            count++;
+        }
+    }
+    qemu_fdt_setprop_sized_cells_from_array(vbi->fdt, "/numa-map",
+        "node-matrix", (nb_numa_nodes * nb_numa_nodes * 6) / 2, node_matrix);
+
+    g_free(mem_map);
+    g_free(cpu_map);
+    g_free(node_matrix);
+
+    return 0;
+fail:
+    return -1;
+}
+
 static void fdt_add_timer_nodes(const VirtBoardInfo *vbi)
 {
     /* Note that on A15 h/w these interrupts are level-triggered,
@@ -532,7 +640,6 @@  static void machvirt_init(MachineState *machine)
     qemu_irq pic[NUM_IRQS];
     MemoryRegion *sysmem = get_system_memory();
     int n;
-    MemoryRegion *ram = g_new(MemoryRegion, 1);
     const char *cpu_model = machine->cpu_model;
     VirtBoardInfo *vbi;
 
@@ -585,10 +692,9 @@  static void machvirt_init(MachineState *machine)
     fdt_add_cpu_nodes(vbi);
     fdt_add_psci_node(vbi);
 
-    memory_region_init_ram(ram, NULL, "mach-virt.ram", machine->ram_size,
-                           &error_abort);
-    vmstate_register_ram_global(ram);
-    memory_region_add_subregion(sysmem, vbi->memmap[VIRT_MEM].base, ram);
+    if (virt_memory_init(machine, sysmem, vbi) < 0) {
+        exit(1);
+    }
 
     create_flash(vbi);