Message ID: 20140624174038.GK4323@linux.vnet.ibm.com
State: New
On Tue, 24 Jun 2014 10:40:38 -0700
Nishanth Aravamudan <nacc@linux.vnet.ibm.com> wrote:

> Sparse node numbering occurs on powerpc in practice under PowerVM. In
> order to emulate the same NUMA topology under qemu, the assumption
> that NUMA nodes are linearly ordered has to be removed. qemu actually
> already supports (inasmuch as it doesn't throw an error) sparse node
> numbering by the end-user, but the information is effectively ignored
> and the nodes are populated linearly starting at 0. This means a
> user's node ID requests are not honored in the Linux kernel, and that
> the topology may not be as requested (in terms of the CPU/memory
> placement).
>
> Add a present field to NodeInfo which indicates if a given nodeid was
> present on the command-line or not. Current code relies on a memory
> value being passed for a node to indicate presence, which is
> insufficient in the presence of memoryless nodes or sparse numbering.
> Adjust the iteration of various NUMA loops to use the maximum known
> NUMA ID rather than the number of NUMA nodes.
>
> numa.c::set_numa_nodes() has become a bit more convoluted for
> round-robin'ing the CPUs over known nodes when not specified by the
> user.
>
> Note that architecture-specific code still possibly needs changes
> (forthcoming) for both sparse node numbering and memoryless nodes.
> This only puts in the generic infrastructure to support that.
> Examples:
>
> (1) qemu-system-x86_64 -enable-kvm -m 4096 -numa node,nodeid=3 -numa
>     node,nodeid=2 -smp 16
>
> Before:
>
> node 0 cpus: 0 2 4 6 8 10 12 14
> node 0 size: 2048 MB
> node 1 cpus: 1 3 5 7 9 11 13 15
> node 1 size: 2048 MB
>
> After:
>
> node 2 cpus: 0 2 4 6 8 10 12 14
> node 2 size: 2048 MB
> node 3 cpus: 1 3 5 7 9 11 13 15
> node 3 size: 2048 MB
>
> (2) qemu-system-x86_64 -enable-kvm -m 4096 -numa node,nodeid=0 -numa
>     node,nodeid=1 -numa node,nodeid=2 -numa node,nodeid=3 -smp 16
>
> node 0 cpus: 0 4 8 12
> node 0 size: 1024 MB
> node 1 cpus: 1 5 9 13
> node 1 size: 1024 MB
> node 2 cpus: 2 6 10 14
> node 2 size: 1024 MB
> node 3 cpus: 3 7 11 15
> node 3 size: 1024 MB
>
> (qemu) info numa
> 4 nodes
> node 0 cpus: 0 4 8 12
> node 0 size: 1024 MB
> node 1 cpus: 1 5 9 13
> node 1 size: 1024 MB
> node 2 cpus: 2 6 10 14
> node 2 size: 1024 MB
> node 3 cpus: 3 7 11 15
> node 3 size: 1024 MB
>
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Tested-by: Hu Tao <hutao@cn.fujitsu.com>
>
> ---
> Based off mst's for_upstream tag, which has the NodeInfo changes.
>
> v1 -> v2:
>   Modify set_numa_nodes loop for round-robin'ing CPUs.
>
> v2 -> v3:
>   Updated changelog to indicate problem being solved.
>   Updated memory_region_allocate_system_memory based upon feedback
>   from Hu.
>   Updated set_numa_nodes loop again to be simpler based upon feedback
>   from Hu.
>   Fixed bug with a mix of nodes with nodeid specified and without,
>   where the same nodeid would be used by the explicit specification
>   and the auto-numbering code.
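The per-node sizes in the examples above fall out of QEMU's default memory distribution: every node but the last gets an 8MB-aligned share of ram_size, and the last node absorbs the rounding remainder. A minimal standalone sketch of that arithmetic (plain C; the function name is hypothetical, not QEMU's):

```c
#include <stdint.h>

/* Default split: each node but the last gets an 8MB-aligned share of
 * ram_size; the last node absorbs whatever alignment left over. */
void default_node_mem(uint64_t ram_size, int nb_nodes, uint64_t *node_mem)
{
    uint64_t usedmem = 0;
    for (int i = 0; i < nb_nodes - 1; i++) {
        /* Round each share down to an 8MB (1 << 23) boundary. */
        node_mem[i] = (ram_size / nb_nodes) & ~((1ULL << 23) - 1);
        usedmem += node_mem[i];
    }
    node_mem[nb_nodes - 1] = ram_size - usedmem;
}
```

With -m 4096 and four nodes, 4096 MB / 4 = 1024 MB is already 8MB-aligned, so every node gets 1024 MB, matching example (2).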
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 277230d..b90bf66 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -145,11 +145,13 @@ extern int mem_prealloc;
>   */
>  #define MAX_CPUMASK_BITS 255
>
> -extern int nb_numa_nodes;
> +extern int nb_numa_nodes; /* Number of NUMA nodes */
> +extern int max_numa_node; /* Highest specified NUMA node ID */
>  typedef struct node_info {
>      uint64_t node_mem;
>      DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS);
>      struct HostMemoryBackend *node_memdev;
> +    bool present;

How about dropping 'present' and replacing array with a list
of only present nodes? That way it will be one more step closer
to converting numa infrastructure to a set of QOM objects.

>  } NodeInfo;
>  extern NodeInfo numa_info[MAX_NODES];
>  void set_numa_nodes(void);
> diff --git a/monitor.c b/monitor.c
> index c7f8797..4721996 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -2003,7 +2003,10 @@ static void do_info_numa(Monitor *mon, const QDict *qdict)
>      CPUState *cpu;
>
>      monitor_printf(mon, "%d nodes\n", nb_numa_nodes);
> -    for (i = 0; i < nb_numa_nodes; i++) {
> +    for (i = 0; i < max_numa_node; i++) {
> +        if (!numa_info[i].present) {
> +            continue;
> +        }
>          monitor_printf(mon, "node %d cpus:", i);
>          CPU_FOREACH(cpu) {
>              if (cpu->numa_node == i) {
> diff --git a/numa.c b/numa.c
> index e471afe..d6b86ab 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -53,7 +53,10 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
>      if (node->has_nodeid) {
>          nodenr = node->nodeid;
>      } else {
> -        nodenr = nb_numa_nodes;
> +        nodenr = 0;
> +        while (numa_info[nodenr].present) {
> +            nodenr++;
> +        }
>      }
>
>      if (nodenr >= MAX_NODES) {
> @@ -106,6 +109,10 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
>          numa_info[nodenr].node_mem = object_property_get_int(o, "size", NULL);
>          numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
>      }
> +    numa_info[nodenr].present = true;
> +    if (nodenr >= max_numa_node) {
> +        max_numa_node = nodenr + 1;
> +    }
>  }
>
>  int numa_init_func(QemuOpts *opts, void *opaque)
> @@ -155,7 +162,7 @@ void set_numa_nodes(void)
>  {
>      if (nb_numa_nodes > 0) {
>          uint64_t numa_total;
> -        int i;
> +        int i, j = -1;
>
>          if (nb_numa_nodes > MAX_NODES) {
>              nb_numa_nodes = MAX_NODES;
> @@ -164,27 +171,29 @@ void set_numa_nodes(void)
>          /* If no memory size if given for any node, assume the default case
>           * and distribute the available memory equally across all nodes
>           */
> -        for (i = 0; i < nb_numa_nodes; i++) {
> -            if (numa_info[i].node_mem != 0) {
> +        for (i = 0; i < max_numa_node; i++) {
> +            if (numa_info[i].present && numa_info[i].node_mem != 0) {
>                  break;
>              }
>          }
> -        if (i == nb_numa_nodes) {
> +        if (i == max_numa_node) {
>              uint64_t usedmem = 0;
>
>              /* On Linux, the each node's border has to be 8MB aligned,
>               * the final node gets the rest.
>               */
> -            for (i = 0; i < nb_numa_nodes - 1; i++) {
> -                numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
> -                                        ~((1 << 23UL) - 1);
> -                usedmem += numa_info[i].node_mem;
> +            for (i = 0; i < max_numa_node - 1; i++) {
> +                if (numa_info[i].present) {
> +                    numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
> +                                            ~((1 << 23UL) - 1);
> +                    usedmem += numa_info[i].node_mem;
> +                }
>              }
>              numa_info[i].node_mem = ram_size - usedmem;
>          }
>
>          numa_total = 0;
> -        for (i = 0; i < nb_numa_nodes; i++) {
> +        for (i = 0; i < max_numa_node; i++) {
>              numa_total += numa_info[i].node_mem;
>          }
>          if (numa_total != ram_size) {
> @@ -194,8 +203,9 @@ void set_numa_nodes(void)
>              exit(1);
>          }
>
> -        for (i = 0; i < nb_numa_nodes; i++) {
> -            if (!bitmap_empty(numa_info[i].node_cpu, MAX_CPUMASK_BITS)) {
> +        for (i = 0; i < max_numa_node; i++) {
> +            if (numa_info[i].present &&
> +                !bitmap_empty(numa_info[i].node_cpu, MAX_CPUMASK_BITS)) {
>                  break;
>              }
>          }
> @@ -203,9 +213,12 @@
>           * must cope with this anyway, because there are BIOSes out there in
>           * real machines which also use this scheme.
>           */
> -        if (i == nb_numa_nodes) {
> +        if (i == max_numa_node) {
>              for (i = 0; i < max_cpus; i++) {
> -                set_bit(i, numa_info[i % nb_numa_nodes].node_cpu);
> +                do {
> +                    j = (j + 1) % max_numa_node;
> +                } while (!numa_info[j].present);
> +                set_bit(i, numa_info[j].node_cpu);
>              }
>          }
>      }
> @@ -217,8 +230,9 @@ void set_numa_modes(void)
>      int i;
>
>      CPU_FOREACH(cpu) {
> -        for (i = 0; i < nb_numa_nodes; i++) {
> -            if (test_bit(cpu->cpu_index, numa_info[i].node_cpu)) {
> +        for (i = 0; i < max_numa_node; i++) {
> +            if (numa_info[i].present &&
> +                test_bit(cpu->cpu_index, numa_info[i].node_cpu)) {
>                  cpu->numa_node = i;
>              }
>          }
> @@ -266,10 +280,13 @@ void memory_region_allocate_system_memory(MemoryRegion *mr, Object *owner,
>      }
>
>      memory_region_init(mr, owner, name, ram_size);
> -    for (i = 0; i < MAX_NODES; i++) {
> +    for (i = 0; i < max_numa_node; i++) {
>          Error *local_err = NULL;
>          uint64_t size = numa_info[i].node_mem;
>          HostMemoryBackend *backend = numa_info[i].node_memdev;
> +        if (!numa_info[i].present) {
> +            continue;
> +        }
>          if (!backend) {
>              continue;
>          }
> diff --git a/vl.c b/vl.c
> index 54b4627..e1a6ab8 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -195,6 +195,7 @@ static QTAILQ_HEAD(, FWBootEntry) fw_boot_order =
>      QTAILQ_HEAD_INITIALIZER(fw_boot_order);
>
>  int nb_numa_nodes;
> +int max_numa_node;
>  NodeInfo numa_info[MAX_NODES];
>
>  uint8_t qemu_uuid[16];
> @@ -2967,10 +2968,12 @@ int main(int argc, char **argv, char **envp)
>
>      for (i = 0; i < MAX_NODES; i++) {
>          numa_info[i].node_mem = 0;
> +        numa_info[i].present = false;
>          bitmap_zero(numa_info[i].node_cpu, MAX_CPUMASK_BITS);
>      }
>
>      nb_numa_nodes = 0;
> +    max_numa_node = -1;
>      nb_nics = 0;
>
>      bdrv_init_with_whitelist();
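For illustration, the round-robin placement the patch adds to set_numa_nodes() can be pulled out into a standalone sketch (plain C; the array of present flags stands in for numa_info[i].present, and the function name is invented for this example):

```c
#include <stdbool.h>

/* Round-robin CPUs over only the present nodes, skipping gaps --
 * this mirrors the do/while loop the patch adds to set_numa_nodes(). */
void assign_cpus(const bool *present, int max_numa_node,
                 int max_cpus, int *cpu_to_node)
{
    int j = -1;
    for (int i = 0; i < max_cpus; i++) {
        do {
            j = (j + 1) % max_numa_node;
        } while (!present[j]);
        cpu_to_node[i] = j;
    }
}
```

With only nodes 2 and 3 present and 16 CPUs, this reproduces example (1): even CPUs land on node 2 and odd CPUs on node 3.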
On 25.06.2014 [13:21:34 +0200], Igor Mammedov wrote:
> On Tue, 24 Jun 2014 10:40:38 -0700
> Nishanth Aravamudan <nacc@linux.vnet.ibm.com> wrote:
<snip>
> > +    bool present;
> How about dropping 'present' and replacing array with a list
> of only present nodes?

If that would be preferred, I can move to that. I assume a simple
linked list is fine. Does qemu provide any infrastructure for defining
lists? I'll look through the source, but any pointers would be helpful.

Generally speaking, sparse NUMA nodes aren't that common, and when they
exist, the gaps aren't large. But it does seem to make sense that if we
have sparse IDs at all, we might as well move to a list.

In any case, moving to the list means we'd have a nodeid as part of the
structure instead.

> That way it will be one more step closer to converting numa
> infrastructure to a set of QOM objects.

Sounds like a good idea to me. I'll respin the patch soon.

Thanks,
Nish
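For reference, QEMU does ship generic intrusive-list macros (QTAILQ and friends in include/qemu/queue.h; the fw_boot_order list visible in the vl.c hunk above uses them). Purely as a hypothetical sketch of what a list-based NodeInfo might look like, with nodeid folded into the struct and insertion kept nodeid-sorted, here is a hand-rolled plain-C version (a real conversion would presumably use the QTAILQ macros instead):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical list-based NodeInfo: only present nodes are ever
 * allocated, and the nodeid becomes part of the structure. */
typedef struct NumaNode {
    int nodeid;
    uint64_t node_mem;
    struct NumaNode *next;
} NumaNode;

/* Insert keeping the list sorted by nodeid; returns the new head.
 * Sorted order preserves the nodeid-ordered iteration the ACPI
 * table-building code relies on. */
NumaNode *numa_list_insert(NumaNode *head, int nodeid, uint64_t mem)
{
    NumaNode *n = malloc(sizeof(*n));
    n->nodeid = nodeid;
    n->node_mem = mem;

    /* Walk via a pointer-to-pointer so head insertion needs no
     * special case. */
    NumaNode **pp = &head;
    while (*pp && (*pp)->nodeid < nodeid) {
        pp = &(*pp)->next;
    }
    n->next = *pp;
    *pp = n;
    return head;
}
```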
On Wed, Jun 25, 2014 at 09:13:59AM -0700, Nishanth Aravamudan wrote:
> On 25.06.2014 [13:21:34 +0200], Igor Mammedov wrote:
> > On Tue, 24 Jun 2014 10:40:38 -0700
> > Nishanth Aravamudan <nacc@linux.vnet.ibm.com> wrote:
<snip>
> > How about dropping 'present' and replacing array with a list
> > of only present nodes?
>
> If that would be preferred, I can move to that. I assume a simple
> linked-list is fine. Does qemu provide any infrastructure for defining
> lists? I'll look through the source but any pointers would be helpful.
>
> Generally speaking, sparse NUMA nodes aren't that common and when they
> exist, the gaps aren't large. But it does seem to make sense if we have
> sparse IDs at all, we might as well move to a list.
>
> In any case, moving to the list means we'd have a nodeid as part of the
> structure instead.
>
> > That way it will be one more step closer to converting numa
> > infrastructure to a set of QOM objects.
>
> Sounds like a good idea to me. I'll respin the patch soon.

Having a list makes sense; the only difference is that keeping a sparse
array sorted is much easier than keeping a sorted list (because the
ACPI tables are nodeid-ordered). That's why I suggested keeping the
array initially.

Adding a "present" field to the array is a trivial and easy-to-review
change. Changing NodeInfo to use linked lists is a more complex change
that I wouldn't want to include after soft freeze.

In other words:
 * Having a list is better than a sparse array; but:
 * Having a small sparse array with the "present" field is better than
   broken sparse nodeid support (IMO).

<rant>
Improving the code is always welcome (and adding a list would be an
improvement), but we don't need to implement all the refactoring
changes we can dream of before including a new feature. I like to
evaluate how the code looks compared to what we have today, not to what
we want it to look like in one year.

Could we (the QEMU community) please learn to sometimes accept simple
features or fixes (which don't make the code worse) without requiring
the whole code to be rewritten first?
</rant>
On 25.06.2014 [13:52:56 -0300], Eduardo Habkost wrote:
> On Wed, Jun 25, 2014 at 09:13:59AM -0700, Nishanth Aravamudan wrote:
<snip>
> Having a list makes sense, the only difference is that keeping a sparse
> array sorted is much easier than making a sorted list (because the ACPI
> tables are nodeid-ordered). That's why I suggested keeping the array
> initially.

And for non-ACPI platforms, it does feel like keeping the list sorted
is ideal, as it simplifies various loops, etc.

> Adding a "present" field to the array is a trivial and easy-to-review
> change. Changing NodeInfo to use linked lists is a more complex change
> that I wouldn't want to include after soft freeze.
>
> In other words:
>  * Having a list is better than a sparse array; but:
>  * Having a small sparse array with the "present" field is better
>    than broken sparse nodeid support (IMO).

Perhaps as a compromise I can work on the list conversion as a
follow-on patch?

Thanks,
Nish
On Wed, Jun 25, 2014 at 01:52:56PM -0300, Eduardo Habkost wrote:
> On Wed, Jun 25, 2014 at 09:13:59AM -0700, Nishanth Aravamudan wrote:
<snip>
> In other words:
>  * Having a list is better than a sparse array; but:
>  * Having a small sparse array with the "present" field is better
>    than broken sparse nodeid support (IMO).

I agree here. This patchset is still RFC, but if that's the only issue
I might apply it.

> <rant>
> Improving the code is always welcome (and adding a list would be an
> improvement), but we don't need to implement all refactoring changes we
> can dream of, before including a new feature. I like to evaluate how the
> code looks like compared to what we have today, not with what we want it
> to look like in one year.
>
> Could we (the QEMU community) please learn to sometimes accept simple
> features or fixes (which don't make the code worse) without requiring
> the whole code to be rewritten first?
> </rant>

Whether this would be a win for the community depends on the specifics
of each case. Contributors are often inclined not to look beyond their
immediate needs and their specific small project. Who gets to clean it
up then?
On Wed, Jun 25, 2014 at 10:04:18AM -0700, Nishanth Aravamudan wrote:
> On 25.06.2014 [13:52:56 -0300], Eduardo Habkost wrote:
<snip>
> > Adding a "present" field to the array is a trivial and easy-to-review
> > change. Changing NodeInfo to use linked lists is a more complex change
> > that I wouldn't want to include after soft freeze.
> >
> > In other words:
> >  * Having a list is better than a sparse array; but:
> >  * Having a small sparse array with the "present" field is better
> >    than broken sparse nodeid support (IMO).
>
> Perhaps as a compromise I can work on the list conversion as a
> follow-on patch?
>
> Thanks,
> Nish

Yes, that's fine I think, we can do this after 2.1.
On Wed, 25 Jun 2014 10:04:18 -0700
Nishanth Aravamudan <nacc@linux.vnet.ibm.com> wrote:
> On 25.06.2014 [13:52:56 -0300], Eduardo Habkost wrote:
<snip>
> And for non-ACPI platforms, it does feel like keeping the list sorted is
> ideal, as it simplifies various loops, etc.
>
> Perhaps as a compromise I can work on the list conversion as a
> follow-on patch?

Sure, it was just a suggestion that might make our lives easier in the
future.
On Wed, Jun 25, 2014 at 09:23:17PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 25, 2014 at 01:52:56PM -0300, Eduardo Habkost wrote:
<snip>
> > In other words:
> >  * Having a list is better than a sparse array; but:
> >  * Having a small sparse array with the "present" field is better
> >    than broken sparse nodeid support (IMO).
>
> I agree here. This patchset is still RFC but if that's
> the only issue I might apply it.

I don't think it is ready for 2.1, since it lacks arch-specific
changes. The patch itself only makes qemu able to bring up a guest
with sparse numa nodes; without arch-specific changes, the guest sees
only one node (at least on x86).

Hu
On 26.06.2014 [17:09:25 +0800], Hu Tao wrote:
> On Wed, Jun 25, 2014 at 09:23:17PM +0300, Michael S. Tsirkin wrote:
<snip>
> > I agree here. This patchset is still RFC but if that's
> > the only issue I might apply it.
>
> I don't think it is ready for 2.1 since it lacks arch-specific changes.
> The patch itself only makes qemu be able to bring up guest with sparse
> numa nodes, without arch-specific changes, guest sees only one node (at
> least on x86).

Well, it does no harm or foul to have it on its own. In that, nothing
is worse or better than it was -- although I would argue the qemu core
is better than it was :) I have sent the powerpc enabling patch, but
even with that, we need Alexey's series of 6 patches (which do fix a
regression) to fully support the sparse numbering. Maybe that means
I've convinced myself it's best to wait until after 2.1.

I honestly am not sure how to "fix" the x86 code to support sparse
numbering -- the APIC code seems to rely on it being contiguous. From
some googling, it seems the most common reason for non-contiguous
numbering on x86 is faulty hardware. I will keep fiddling if no one
else steps up, but I'm not an x86 NUMA expert :)

Thanks,
Nish
On Thu, Jun 26, 2014 at 10:58:07AM -0700, Nishanth Aravamudan wrote:
> On 26.06.2014 [17:09:25 +0800], Hu Tao wrote:
> > On Wed, Jun 25, 2014 at 09:23:17PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jun 25, 2014 at 01:52:56PM -0300, Eduardo Habkost wrote:
> > > > On Wed, Jun 25, 2014 at 09:13:59AM -0700, Nishanth Aravamudan wrote:
> > > > > On 25.06.2014 [13:21:34 +0200], Igor Mammedov wrote:
> > > > > > On Tue, 24 Jun 2014 10:40:38 -0700
> > > > > > Nishanth Aravamudan <nacc@linux.vnet.ibm.com> wrote:
> > > > >
> > > > > <snip>
> > > > >
> > > > > > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > > > > > > index 277230d..b90bf66 100644
> > > > > > > --- a/include/sysemu/sysemu.h
> > > > > > > +++ b/include/sysemu/sysemu.h
> > > > > > > @@ -145,11 +145,13 @@ extern int mem_prealloc;
> > > > > > >   */
> > > > > > >  #define MAX_CPUMASK_BITS 255
> > > > > > >
> > > > > > > -extern int nb_numa_nodes;
> > > > > > > +extern int nb_numa_nodes; /* Number of NUMA nodes */
> > > > > > > +extern int max_numa_node; /* Highest specified NUMA node ID */
> > > > > > >  typedef struct node_info {
> > > > > > >      uint64_t node_mem;
> > > > > > >      DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS);
> > > > > > >      struct HostMemoryBackend *node_memdev;
> > > > > > > +    bool present;
> > > > > >
> > > > > > How about dropping 'present' and replacing the array with a list
> > > > > > of only present nodes?
> > > > >
> > > > > If that would be preferred, I can move to that. I assume a simple
> > > > > linked list is fine. Does qemu provide any infrastructure for
> > > > > defining lists? I'll look through the source but any pointers
> > > > > would be helpful.
> > > > >
> > > > > Generally speaking, sparse NUMA nodes aren't that common and when
> > > > > they exist, the gaps aren't large. But it does seem to make sense
> > > > > that if we have sparse IDs at all, we might as well move to a list.
> > > > >
> > > > > In any case, moving to the list means we'd have a nodeid as part
> > > > > of the structure instead.
> > > > >
> > > > > > That way it will be one more step closer to converting numa
> > > > > > infrastructure to a set of QOM objects.
> > > > >
> > > > > Sounds like a good idea to me. I'll respin the patch soon.
> > > >
> > > > Having a list makes sense, the only difference is that keeping a
> > > > sparse array sorted is much easier than making a sorted list
> > > > (because the ACPI tables are nodeid-ordered). That's why I suggested
> > > > keeping the array initially.
> > > >
> > > > Adding a "present" field to the array is a trivial and
> > > > easy-to-review change. Changing NodeInfo to use linked lists is a
> > > > more complex change that I wouldn't want to include after soft
> > > > freeze.
> > > >
> > > > In other words:
> > > >  * Having a list is better than a sparse array; but:
> > > >  * Having a small sparse array with the "present" field is better
> > > >    than broken sparse nodeid support (IMO).
> > >
> > > I agree here. This patchset is still RFC but if that's
> > > the only issue I might apply it.
> >
> > I don't think it is ready for 2.1 since it lacks arch-specific changes.
> > The patch itself only makes qemu able to bring up a guest with sparse
> > NUMA nodes; without the arch-specific changes, the guest sees only one
> > node (at least on x86).
>
> Well, it does no harm or foul to have it, on its own. In that, nothing
> is worse or better than it was. Although I would argue the qemu core is
> better than it was :) I have sent the powerpc enabling patch, but even
> with that, we need Alexey's series of 6 patches (which do fix a
> regression) to fully support the sparse numbering. Maybe that means I've
> convinced myself it's best to wait until after 2.1.
>
> I honestly am not sure how to "fix" the x86 code to support the sparse
> numbering -- the APIC code seems to rely on the node IDs being
> contiguous. From some googling, it seems like the most common reason for
> non-contiguous numbering on x86 is faulty hardware. I will keep fiddling
> if no one else steps up, but I'm not an x86 NUMA expert :)

There are two ACPI table building methods: the fw_cfg-based one (where
SeaBIOS builds the SRAT table), and the acpi-build.c code. The
fw_cfg-based method assumes contiguous nodeids: it builds a table with
nb_numa_nodes entries, carrying only the memory size of each node.

The acpi-build.c code could easily use the "present" flag, and doesn't
seem to require contiguous nodeids. The ACPI specs don't seem to have
any requirement about contiguous proximity-domain IDs, but I am not sure
how guest OSes would react.

acpi-build.c has some confusing code to make sure the table has at least
nb_numa_nodes+2 entries (adding disabled entries at the end of the table
if necessary), but I don't understand why. Maybe it is just for
compatibility with SeaBIOS versions that calculated the SRAT table size
based on nb_numa_nodes beforehand.

But we need to do this carefully, as changing the behavior again later
would require compatibility code, and would make things worse. I don't
even know if guest OSes would properly handle non-contiguous node IDs.

In either case, QEMU is currently broken when nodeids are non-contiguous
or repeated on the command-line. So I suggest we do this for QEMU 2.1:

 * Add the "present" field (required to allow us to detect command-lines
   with missing nodeids);
 * Reject the command-line if a nodeid is specified twice;
 * Make the PC initialization code reject the configuration if some
   nodeid is not present. This is a bug fix, not a regression, as
   non-contiguous nodeids were never supported.

This way, we are not introducing new behavior that we would have to keep
compatible in the future, but we are fixing the bug where QEMU doesn't
complain about non-contiguous nodeids. And after 2.1, we can make PC
support non-contiguous nodeids if we agree it is useful.
On Tue, Jun 24, 2014 at 10:40:38AM -0700, Nishanth Aravamudan wrote:
> Sparse node numbering occurs on powerpc in practice under PowerVM. In
> order to emulate the same NUMA topology under qemu, the assumption that
> NUMA nodes are linearly ordered has to be removed. qemu actually already
> supports (inasmuch as it doesn't throw an error) sparse node numbering
> by the end-user, but the information is effectively ignored and the
> nodes are populated linearly starting at 0. This means a user's node ID
> requests are not honored in the Linux kernel, and that the topology may
> not be as requested (in terms of the CPU/memory placement).
>
> Add a present field to NodeInfo which indicates if a given nodeid was
> present on the command-line or not. Current code relies on a memory
> value being passed for a node to indicate presence, which is
> insufficient in the presence of memoryless nodes or sparse numbering.
> Adjust the iteration of various NUMA loops to use the maximum known NUMA
> ID rather than the number of NUMA nodes.
>
> numa.c::set_numa_nodes() has become a bit more convoluted for
> round-robin'ing the CPUs over known nodes when not specified by the
> user.
>
> Note that architecture-specific code still possibly needs changes
> (forthcoming) for both sparse node numbering and memoryless nodes. This
> only puts in the generic infrastructure to support that.
>
> Examples:
>
> (1) qemu-system-x86_64 -enable-kvm -m 4096 -numa node,nodeid=3 -numa
> node,nodeid=2 -smp 16
>
> Before:
>
> node 0 cpus: 0 2 4 6 8 10 12 14
> node 0 size: 2048 MB
> node 1 cpus: 1 3 5 7 9 11 13 15
> node 1 size: 2048 MB
>
> After:
>
> node 2 cpus: 0 2 4 6 8 10 12 14
> node 2 size: 2048 MB
> node 3 cpus: 1 3 5 7 9 11 13 15
> node 3 size: 2048 MB
>
> (2) qemu-system-x86_64 -enable-kvm -m 4096 -numa node,nodeid=0 -numa
> node,nodeid=1 -numa node,nodeid=2 -numa node,nodeid=3 -smp 16
>
> node 0 cpus: 0 4 8 12
> node 0 size: 1024 MB
> node 1 cpus: 1 5 9 13
> node 1 size: 1024 MB
> node 2 cpus: 2 6 10 14
> node 2 size: 1024 MB
> node 3 cpus: 3 7 11 15
> node 3 size: 1024 MB
>
> (qemu) info numa
> 4 nodes
> node 0 cpus: 0 4 8 12
> node 0 size: 1024 MB
> node 1 cpus: 1 5 9 13
> node 1 size: 1024 MB
> node 2 cpus: 2 6 10 14
> node 2 size: 1024 MB
> node 3 cpus: 3 7 11 15
> node 3 size: 1024 MB
>
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Tested-by: Hu Tao <hutao@cn.fujitsu.com>
>
> ---
> Based off mst's for_upstream tag, which has the NodeInfo changes.
>
> v1 -> v2:
>   Modify set_numa_nodes loop for round-robin'ing CPUs.
>
> v2 -> v3:
>   Updated changelog to indicate problem being solved.
>   Updated memory_region_allocate_system_memory based upon feedback from
>   Hu.
>   Updated set_numa_nodes loop again to be simpler based upon feedback
>   from Hu.
>   Fixed bug with a mix of nodes with nodeid specified and without, where
>   the same nodeid would be used by the explicit specification and the
>   auto-numbering code.
>
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 277230d..b90bf66 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -145,11 +145,13 @@ extern int mem_prealloc;
>   */
>  #define MAX_CPUMASK_BITS 255
>
> -extern int nb_numa_nodes;
> +extern int nb_numa_nodes; /* Number of NUMA nodes */

I would rename it to num_numa_nodes, so we can easily ensure all code
using nb_numa_nodes will be converted appropriately.

> +extern int max_numa_node; /* Highest specified NUMA node ID */

I would rename it max_numa_nodeid, to make it clear it is the maximum
ID, not the maximum number of nodes.

The comment is also inaccurate: it is the highest specified node ID,
plus one. I would specify it as:

    /* Highest NUMA node ID, plus one, so that for all present NUMA
     * nodes, nodeid < max_numa_nodeid
     */

>  typedef struct node_info {
>      uint64_t node_mem;
>      DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS);
>      struct HostMemoryBackend *node_memdev;
> +    bool present;
>  } NodeInfo;
>  extern NodeInfo numa_info[MAX_NODES];
>  void set_numa_nodes(void);
> diff --git a/monitor.c b/monitor.c
> index c7f8797..4721996 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -2003,7 +2003,10 @@ static void do_info_numa(Monitor *mon, const QDict *qdict)
>      CPUState *cpu;
>
>      monitor_printf(mon, "%d nodes\n", nb_numa_nodes);
> -    for (i = 0; i < nb_numa_nodes; i++) {
> +    for (i = 0; i < max_numa_node; i++) {
> +        if (!numa_info[i].present) {
> +            continue;
> +        }
>          monitor_printf(mon, "node %d cpus:", i);
>          CPU_FOREACH(cpu) {
>              if (cpu->numa_node == i) {
> diff --git a/numa.c b/numa.c
> index e471afe..d6b86ab 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -53,7 +53,10 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
>      if (node->has_nodeid) {
>          nodenr = node->nodeid;
>      } else {
> -        nodenr = nb_numa_nodes;
> +        nodenr = 0;
> +        while (numa_info[nodenr].present) {
> +            nodenr++;
> +        }
>      }
>
>      if (nodenr >= MAX_NODES) {
> @@ -106,6 +109,10 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
>          numa_info[nodenr].node_mem = object_property_get_int(o, "size", NULL);
>          numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
>      }
> +    numa_info[nodenr].present = true;

What about rejecting nodes which are already present? A node shouldn't
appear twice on the command-line. By repeating nodeids, a user can make
nb_numa_nodes > max_numa_node, and make existing code access memory
beyond numa_info[].

> +    if (nodenr >= max_numa_node) {
> +        max_numa_node = nodenr + 1;
> +    }
>  }
>
>  int numa_init_func(QemuOpts *opts, void *opaque)
> @@ -155,7 +162,7 @@ void set_numa_nodes(void)
>  {
>      if (nb_numa_nodes > 0) {
>          uint64_t numa_total;
> -        int i;
> +        int i, j = -1;

Can you please initialize j closer to the loop where it is used?

>
>          if (nb_numa_nodes > MAX_NODES) {
>              nb_numa_nodes = MAX_NODES;
> @@ -164,27 +171,29 @@ void set_numa_nodes(void)
>          /* If no memory size if given for any node, assume the default case
>           * and distribute the available memory equally across all nodes
>           */
> -        for (i = 0; i < nb_numa_nodes; i++) {
> -            if (numa_info[i].node_mem != 0) {
> +        for (i = 0; i < max_numa_node; i++) {
> +            if (numa_info[i].present && numa_info[i].node_mem != 0) {
>                  break;
>              }
>          }
> -        if (i == nb_numa_nodes) {
> +        if (i == max_numa_node) {
>              uint64_t usedmem = 0;
>
>              /* On Linux, the each node's border has to be 8MB aligned,
>               * the final node gets the rest.
>               */
> -            for (i = 0; i < nb_numa_nodes - 1; i++) {
> -                numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
> -                                        ~((1 << 23UL) - 1);
> -                usedmem += numa_info[i].node_mem;
> +            for (i = 0; i < max_numa_node - 1; i++) {
> +                if (numa_info[i].present) {
> +                    numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
> +                                            ~((1 << 23UL) - 1);
> +                    usedmem += numa_info[i].node_mem;
> +                }
>              }

This part is tricky: the following line works only because
numa_info[max_numa_node-1] is always present:

>              numa_info[i].node_mem = ram_size - usedmem;

So, what about adding assert(numa_info[i].present) here?

>          }
>
>          numa_total = 0;
> -        for (i = 0; i < nb_numa_nodes; i++) {
> +        for (i = 0; i < max_numa_node; i++) {
>              numa_total += numa_info[i].node_mem;
>          }
>          if (numa_total != ram_size) {
> @@ -194,8 +203,9 @@ void set_numa_nodes(void)
>              exit(1);
>          }
>
> -        for (i = 0; i < nb_numa_nodes; i++) {
> -            if (!bitmap_empty(numa_info[i].node_cpu, MAX_CPUMASK_BITS)) {
> +        for (i = 0; i < max_numa_node; i++) {
> +            if (numa_info[i].present &&
> +                !bitmap_empty(numa_info[i].node_cpu, MAX_CPUMASK_BITS)) {
>                  break;
>              }
>          }
> @@ -203,9 +213,12 @@ void set_numa_nodes(void)
>          * must cope with this anyway, because there are BIOSes out there in
>          * real machines which also use this scheme.
>          */
> -        if (i == nb_numa_nodes) {
> +        if (i == max_numa_node) {
>              for (i = 0; i < max_cpus; i++) {
> -                set_bit(i, numa_info[i % nb_numa_nodes].node_cpu);
> +                do {
> +                    j = (j + 1) % max_numa_node;
> +                } while (!numa_info[j].present);

If you change it from "do { } while" to "while { }", you don't need to
initialize j to -1.

> +                set_bit(i, numa_info[j].node_cpu);
>              }
>          }
>      }
> @@ -217,8 +230,9 @@ void set_numa_modes(void)
>      int i;
>
>      CPU_FOREACH(cpu) {
> -        for (i = 0; i < nb_numa_nodes; i++) {
> -            if (test_bit(cpu->cpu_index, numa_info[i].node_cpu)) {
> +        for (i = 0; i < max_numa_node; i++) {
> +            if (numa_info[i].present &&
> +                test_bit(cpu->cpu_index, numa_info[i].node_cpu)) {
>                  cpu->numa_node = i;
>              }
>          }
> @@ -266,10 +280,13 @@ void memory_region_allocate_system_memory(MemoryRegion *mr, Object *owner,
>      }
>
>      memory_region_init(mr, owner, name, ram_size);
> -    for (i = 0; i < MAX_NODES; i++) {
> +    for (i = 0; i < max_numa_node; i++) {
>          Error *local_err = NULL;
>          uint64_t size = numa_info[i].node_mem;
>          HostMemoryBackend *backend = numa_info[i].node_memdev;
> +        if (!numa_info[i].present) {
> +            continue;
> +        }
>          if (!backend) {
>              continue;
>          }
> diff --git a/vl.c b/vl.c
> index 54b4627..e1a6ab8 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -195,6 +195,7 @@ static QTAILQ_HEAD(, FWBootEntry) fw_boot_order =
>      QTAILQ_HEAD_INITIALIZER(fw_boot_order);
>
>  int nb_numa_nodes;
> +int max_numa_node;
>  NodeInfo numa_info[MAX_NODES];
>
>  uint8_t qemu_uuid[16];
> @@ -2967,10 +2968,12 @@ int main(int argc, char **argv, char **envp)
>
>      for (i = 0; i < MAX_NODES; i++) {
>          numa_info[i].node_mem = 0;
> +        numa_info[i].present = false;
>          bitmap_zero(numa_info[i].node_cpu, MAX_CPUMASK_BITS);
>      }
>
>      nb_numa_nodes = 0;
> +    max_numa_node = -1;

As we just need nodeids to be < max_numa_node, max_numa_node can be
safely initialized to 0 instead of -1.

>      nb_nics = 0;
>
>      bdrv_init_with_whitelist();
>

Considering the size of this patch (and the rest of the machine-specific
code which still needs to be changed), I have a suggestion:

We can first make QEMU reject non-contiguous nodeids on all
architectures, by introducing the "present" field, rejecting duplicate
nodeids, and rejecting the configuration if max_numa_node !=
nb_numa_nodes at the beginning of set_numa_nodes() (preferably with an
assert(numa_info[i].present) loop, too). This way, we won't need to
touch much of the code to keep existing behavior, and the patch will fix
bugs but be small and very easy to review.

After we do that, we can review and consider a patch that adds support
for non-contiguous nodeids, which would be a larger and more intrusive
patch. Maybe it will be 2.1 material, maybe it won't, I don't know.
On 26.06.2014 [16:37:05 -0300], Eduardo Habkost wrote:
> On Tue, Jun 24, 2014 at 10:40:38AM -0700, Nishanth Aravamudan wrote:

<snip>

> > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > index 277230d..b90bf66 100644
> > --- a/include/sysemu/sysemu.h
> > +++ b/include/sysemu/sysemu.h
> > @@ -145,11 +145,13 @@ extern int mem_prealloc;
> >   */
> >  #define MAX_CPUMASK_BITS 255
> >
> > -extern int nb_numa_nodes;
> > +extern int nb_numa_nodes; /* Number of NUMA nodes */
>
> I would rename it to num_numa_nodes, so we can easily ensure all code
> using nb_numa_nodes will be converted appropriately.
>
> > +extern int max_numa_node; /* Highest specified NUMA node ID */
>
> I would rename it max_numa_nodeid, to make it clear it is the maximum
> ID, not the maximum number of nodes.

Thanks, I'm rebasing onto your series now.

<snip>

> >  int numa_init_func(QemuOpts *opts, void *opaque)
> > @@ -155,7 +162,7 @@ void set_numa_nodes(void)
> >  {
> >      if (nb_numa_nodes > 0) {
> >          uint64_t numa_total;
> > -        int i;
> > +        int i, j = -1;
>
> Can you please initialize j closer to the loop where it is used?

Yep.

<snip>

> >              /* On Linux, the each node's border has to be 8MB aligned,
> >               * the final node gets the rest.
> >               */
> > -            for (i = 0; i < nb_numa_nodes - 1; i++) {
> > -                numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
> > -                                        ~((1 << 23UL) - 1);
> > -                usedmem += numa_info[i].node_mem;
> > +            for (i = 0; i < max_numa_node - 1; i++) {
> > +                if (numa_info[i].present) {
> > +                    numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
> > +                                            ~((1 << 23UL) - 1);
> > +                    usedmem += numa_info[i].node_mem;
> > +                }
> >              }
>
> This part is tricky: the following line works only because
> numa_info[max_numa_node-1] is always present:
>
> >              numa_info[i].node_mem = ram_size - usedmem;
>
> So, what about adding assert(numa_info[i].present) here?

Yep.

<snip>

> > @@ -203,9 +213,12 @@ void set_numa_nodes(void)
> >          * must cope with this anyway, because there are BIOSes out there in
> >          * real machines which also use this scheme.
> >          */
> > -        if (i == nb_numa_nodes) {
> > +        if (i == max_numa_node) {
> >              for (i = 0; i < max_cpus; i++) {
> > -                set_bit(i, numa_info[i % nb_numa_nodes].node_cpu);
> > +                do {
> > +                    j = (j + 1) % max_numa_node;
> > +                } while (!numa_info[j].present);
>
> If you change it from "do { } while" to "while { }", you don't need to
> initialize j to -1.

I don't think that's quite as simple as you make it out to be. If you
use a while() loop, we won't always increment j, which means once we've
found a present node, we'll always use that node. j here basically
represents the *last* used nodeid, which we don't want to use again when
we re-enter the for-loop; we want to use the next present nodeid. It
seems like the do {} while() handles this fine. I could use a while() if
I added another increment outside the loop, as follows:

    if (i == max_numa_nodeid) {
        for (i = 0, j = 0; i < max_cpus; i++) {
            while (!numa_info[j].present) {
                j = (j + 1) % (max_numa_nodeid);
            }
            set_bit(i, numa_info[j].node_cpu);
            j = (j + 1) % (max_numa_nodeid);
        }
    }

If you think that is cleaner, I'll use that version.

Thanks,
Nish
On Mon, Jun 30, 2014 at 11:26:10AM -0700, Nishanth Aravamudan wrote:
[...]
> > > -        if (i == nb_numa_nodes) {
> > > +        if (i == max_numa_node) {
> > >              for (i = 0; i < max_cpus; i++) {
> > > -                set_bit(i, numa_info[i % nb_numa_nodes].node_cpu);
> > > +                do {
> > > +                    j = (j + 1) % max_numa_node;
> > > +                } while (!numa_info[j].present);
> >
> > If you change it from "do { } while" to "while { }", you don't need to
> > initialize j to -1.
>
> I don't think that's quite as simple as you make it out to be. If you
> use a while() loop, we won't always increment j, which means once we've
> found a present node, we'll always use that node. j here basically
> represents the *last* used nodeid, which we don't want to use again when
> we re-enter the for-loop; we want to use the next present nodeid. It
> seems like the do {} while() handles this fine. I could use a while() if
> I added another increment outside the loop, as follows:
>
>     if (i == max_numa_nodeid) {
>         for (i = 0, j = 0; i < max_cpus; i++) {
>             while (!numa_info[j].present) {
>                 j = (j + 1) % (max_numa_nodeid);
>             }
>             set_bit(i, numa_info[j].node_cpu);
>             j = (j + 1) % (max_numa_nodeid);
>         }
>     }
>
> If you think that is cleaner, I'll use that version.

You are right, and your "do { } while" version looks cleaner than
duplicating the "j = (j + 1) % (max_numa_nodeid)" line.
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 277230d..b90bf66 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -145,11 +145,13 @@ extern int mem_prealloc;
  */
 #define MAX_CPUMASK_BITS 255
 
-extern int nb_numa_nodes;
+extern int nb_numa_nodes; /* Number of NUMA nodes */
+extern int max_numa_node; /* Highest specified NUMA node ID */
 typedef struct node_info {
     uint64_t node_mem;
     DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS);
     struct HostMemoryBackend *node_memdev;
+    bool present;
 } NodeInfo;
 extern NodeInfo numa_info[MAX_NODES];
 void set_numa_nodes(void);
diff --git a/monitor.c b/monitor.c
index c7f8797..4721996 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2003,7 +2003,10 @@ static void do_info_numa(Monitor *mon, const QDict *qdict)
     CPUState *cpu;
 
     monitor_printf(mon, "%d nodes\n", nb_numa_nodes);
-    for (i = 0; i < nb_numa_nodes; i++) {
+    for (i = 0; i < max_numa_node; i++) {
+        if (!numa_info[i].present) {
+            continue;
+        }
         monitor_printf(mon, "node %d cpus:", i);
         CPU_FOREACH(cpu) {
             if (cpu->numa_node == i) {
diff --git a/numa.c b/numa.c
index e471afe..d6b86ab 100644
--- a/numa.c
+++ b/numa.c
@@ -53,7 +53,10 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
     if (node->has_nodeid) {
         nodenr = node->nodeid;
     } else {
-        nodenr = nb_numa_nodes;
+        nodenr = 0;
+        while (numa_info[nodenr].present) {
+            nodenr++;
+        }
     }
 
     if (nodenr >= MAX_NODES) {
@@ -106,6 +109,10 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
         numa_info[nodenr].node_mem = object_property_get_int(o, "size", NULL);
         numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
     }
+    numa_info[nodenr].present = true;
+    if (nodenr >= max_numa_node) {
+        max_numa_node = nodenr + 1;
+    }
 }
 
 int numa_init_func(QemuOpts *opts, void *opaque)
@@ -155,7 +162,7 @@ void set_numa_nodes(void)
 {
     if (nb_numa_nodes > 0) {
         uint64_t numa_total;
-        int i;
+        int i, j = -1;
 
         if (nb_numa_nodes > MAX_NODES) {
             nb_numa_nodes = MAX_NODES;
@@ -164,27 +171,29 @@ void set_numa_nodes(void)
         /* If no memory size if given for any node, assume the default case
          * and distribute the available memory equally across all nodes
          */
-        for (i = 0; i < nb_numa_nodes; i++) {
-            if (numa_info[i].node_mem != 0) {
+        for (i = 0; i < max_numa_node; i++) {
+            if (numa_info[i].present && numa_info[i].node_mem != 0) {
                 break;
             }
         }
-        if (i == nb_numa_nodes) {
+        if (i == max_numa_node) {
             uint64_t usedmem = 0;
 
             /* On Linux, the each node's border has to be 8MB aligned,
              * the final node gets the rest.
              */
-            for (i = 0; i < nb_numa_nodes - 1; i++) {
-                numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
-                                        ~((1 << 23UL) - 1);
-                usedmem += numa_info[i].node_mem;
+            for (i = 0; i < max_numa_node - 1; i++) {
+                if (numa_info[i].present) {
+                    numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
+                                            ~((1 << 23UL) - 1);
+                    usedmem += numa_info[i].node_mem;
+                }
             }
             numa_info[i].node_mem = ram_size - usedmem;
         }
 
         numa_total = 0;
-        for (i = 0; i < nb_numa_nodes; i++) {
+        for (i = 0; i < max_numa_node; i++) {
             numa_total += numa_info[i].node_mem;
         }
         if (numa_total != ram_size) {
@@ -194,8 +203,9 @@ void set_numa_nodes(void)
             exit(1);
         }
 
-        for (i = 0; i < nb_numa_nodes; i++) {
-            if (!bitmap_empty(numa_info[i].node_cpu, MAX_CPUMASK_BITS)) {
+        for (i = 0; i < max_numa_node; i++) {
+            if (numa_info[i].present &&
+                !bitmap_empty(numa_info[i].node_cpu, MAX_CPUMASK_BITS)) {
                 break;
             }
         }
@@ -203,9 +213,12 @@ void set_numa_nodes(void)
          * must cope with this anyway, because there are BIOSes out there in
          * real machines which also use this scheme.
          */
-        if (i == nb_numa_nodes) {
+        if (i == max_numa_node) {
             for (i = 0; i < max_cpus; i++) {
-                set_bit(i, numa_info[i % nb_numa_nodes].node_cpu);
+                do {
+                    j = (j + 1) % max_numa_node;
+                } while (!numa_info[j].present);
+                set_bit(i, numa_info[j].node_cpu);
             }
         }
     }
@@ -217,8 +230,9 @@ void set_numa_modes(void)
     int i;
 
     CPU_FOREACH(cpu) {
-        for (i = 0; i < nb_numa_nodes; i++) {
-            if (test_bit(cpu->cpu_index, numa_info[i].node_cpu)) {
+        for (i = 0; i < max_numa_node; i++) {
+            if (numa_info[i].present &&
+                test_bit(cpu->cpu_index, numa_info[i].node_cpu)) {
                 cpu->numa_node = i;
             }
         }
@@ -266,10 +280,13 @@ void memory_region_allocate_system_memory(MemoryRegion *mr, Object *owner,
     }
 
     memory_region_init(mr, owner, name, ram_size);
-    for (i = 0; i < MAX_NODES; i++) {
+    for (i = 0; i < max_numa_node; i++) {
         Error *local_err = NULL;
         uint64_t size = numa_info[i].node_mem;
         HostMemoryBackend *backend = numa_info[i].node_memdev;
+        if (!numa_info[i].present) {
+            continue;
+        }
         if (!backend) {
             continue;
         }
diff --git a/vl.c b/vl.c
index 54b4627..e1a6ab8 100644
--- a/vl.c
+++ b/vl.c
@@ -195,6 +195,7 @@ static QTAILQ_HEAD(, FWBootEntry) fw_boot_order =
     QTAILQ_HEAD_INITIALIZER(fw_boot_order);
 
 int nb_numa_nodes;
+int max_numa_node;
 NodeInfo numa_info[MAX_NODES];
 
 uint8_t qemu_uuid[16];
@@ -2967,10 +2968,12 @@ int main(int argc, char **argv, char **envp)
 
     for (i = 0; i < MAX_NODES; i++) {
         numa_info[i].node_mem = 0;
+        numa_info[i].present = false;
         bitmap_zero(numa_info[i].node_cpu, MAX_CPUMASK_BITS);
     }
 
     nb_numa_nodes = 0;
+    max_numa_node = -1;
     nb_nics = 0;
 
     bdrv_init_with_whitelist();