Message ID: 1337754751-9018-2-git-send-email-kernelfans@gmail.com
State: New
On Wed, 2012-05-23 at 14:32 +0800, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> The guest's scheduler can not see the numa info on the host and
> this will result to the following scene:
> Supposing vcpu-a on nodeA, vcpu-b on nodeB, when load balance,
> the tasks' pull and push between these vcpus will cost more. But
> unfortunately, currently, the guest is just blind to this.
>
> This patch want to export the host numa info to the guest, and help
> guest to rebuild its sched domain based on host's info.

Hell no, we're not going to export sched domains, if kvm/qemu wants this
its all in sysfs.

The whole sched_domain stuff is a big enough pain as it is, exporting
this and making it a sodding API is the worst thing ever.

Whatever brainfart made you think this is needed anyway? sysfs contains
the host topology, qemu can already create whatever guest topology you
want (see the -smp and -numa arguments), so what gives?
On Wed, May 23, 2012 at 3:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2012-05-23 at 14:32 +0800, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> The guest's scheduler can not see the numa info on the host and
>> this will result to the following scene:
>> Supposing vcpu-a on nodeA, vcpu-b on nodeB, when load balance,
>> the tasks' pull and push between these vcpus will cost more. But
>> unfortunately, currently, the guest is just blind to this.
>>
>> This patch want to export the host numa info to the guest, and help
>> guest to rebuild its sched domain based on host's info.
>
> Hell no, we're not going to export sched domains, if kvm/qemu wants this
> its all in sysfs.
>
> The whole sched_domain stuff is a big enough pain as it is, exporting
> this and making it a sodding API is the worst thing ever.
>
> Whatever brainfart made you think this is needed anyway? sysfs contains
> the host topology, qemu can already create whatever guest topology you
> want (see the -smp and -numa arguments), so what gives?

I think -numa option will be used to emulate the special virtual
machine to customer, and do not necessary map to host topology. And
even we map them exactly with -numa option, the movement of vcpu
threads among host nodes will break the topology initialized by -numa
option. So give the guest a opportunity to adjust its topology?

Thanks and regards,
pingfan
On Wed, 2012-05-23 at 16:10 +0800, Liu ping fan wrote:
> the movement of vcpu
> threads among host nodes will break the topology initialized by -numa
> option.

You want to remap vcpu to nodes? Are you bloody insane? cpu:node maps
are assumed static, you cannot make that a dynamic map and pray things
keep working.

Also, have you any idea how expensive it is to rebuild the topology vs
migrating the vcpu?
On Wed, May 23, 2012 at 4:23 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2012-05-23 at 16:10 +0800, Liu ping fan wrote:
>> the movement of vcpu
>> threads among host nodes will break the topology initialized by -numa
>> option.
>
> You want to remap vcpu to nodes? Are you bloody insane? cpu:node maps
> are assumed static, you cannot make that a dynamic map and pray things
> keep working.
>
> Also, have you any idea how expensive it is to rebuild the topology vs
> migrating the vcpu?

No, do not rebuild the topology too frequently. Supposing vcpus in
node-A/B/C, now node-B is unplugged, so we need to migrate some of
vcpus from node-B to node-A, or to node-C.
On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
> so we need to migrate some of vcpus from node-B to node-A, or to
> node-C.

This is absolutely broken, you cannot do that.

A guest task might want to be node affine, it looks at the topology sets
a cpu affinity mask and expects to stay on that node.

But then you come along, and flip one of those cpus to another node. The
guest task will now run on another node and get remote memory accesses.

Similarly for the guest kernel, it assumes cpu:node maps are static, it
will use this for all kinds of things, including the allocation of
per-cpu memory to be node affine to that cpu.

If you go migrate cpus across nodes everything comes down.

Please go do something else, I'll do this.
On Wed, May 23, 2012 at 4:48 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
>> so we need to migrate some of vcpus from node-B to node-A, or to
>> node-C.
>
> This is absolutely broken, you cannot do that.
>
> A guest task might want to be node affine, it looks at the topology sets
> a cpu affinity mask and expects to stay on that node.
>
> But then you come along, and flip one of those cpus to another node. The
> guest task will now run on another node and get remote memory accesses.
>
Oh, I had thought using -smp to handle such situation. The memory
accesses cost problem can be partly handled by kvm, while opening a
gap for guest's scheduler to see the host numa info.

> Similarly for the guest kernel, it assumes cpu:node maps are static, it
> will use this for all kinds of things, including the allocation of
> per-cpu memory to be node affine to that cpu.
>
> If you go migrate cpus across nodes everything comes down.
>
> Please go do something else, I'll do this.

OK, thanks.
pingfan
On Wed, 2012-05-23 at 17:58 +0800, Liu ping fan wrote:
> > Please go do something else, I'll do this.
> OK

so that was to say never, as in dynamic cpu:node relations aren't going
to happen.

but tip/sched/numa contain the bits needed to make vnuma work.
On 05/23/2012 01:48 AM, Peter Zijlstra wrote:
> On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
>> so we need to migrate some of vcpus from node-B to node-A, or to
>> node-C.
> This is absolutely broken, you cannot do that.
>
> A guest task might want to be node affine, it looks at the topology sets
> a cpu affinity mask and expects to stay on that node.
>
> But then you come along, and flip one of those cpus to another node. The
> guest task will now run on another node and get remote memory accesses.

Insane, sure.  But, if the node has physically gone away, what do we do?
I think we've got to either kill the guest, or let it run somewhere
suboptimal.  Sounds like you're advocating killing it. ;)
On Wed, 2012-05-23 at 08:23 -0700, Dave Hansen wrote:
> On 05/23/2012 01:48 AM, Peter Zijlstra wrote:
> > On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
> >> so we need to migrate some of vcpus from node-B to node-A, or to
> >> node-C.
> > This is absolutely broken, you cannot do that.
> >
> > A guest task might want to be node affine, it looks at the topology sets
> > a cpu affinity mask and expects to stay on that node.
> >
> > But then you come along, and flip one of those cpus to another node. The
> > guest task will now run on another node and get remote memory accesses.
>
> Insane, sure.  But, if the node has physically gone away, what do we do?
> I think we've got to either kill the guest, or let it run somewhere
> suboptimal.  Sounds like you're advocating killing it. ;)

You all seem terribly confused. If you want a guest that 100% mirrors
the host topology you need hard-binding of all vcpu threads and clearly
you're in trouble if you unplug a host cpu while there's still a vcpu
expecting to run there. That's an administrator error and you get to
keep the pieces, I don't care.

In case you want simple virt-numa where a number of vcpus constitute a
vnode and have their memory all on the same node the vcpus are ran on,
what does it matter if you unplug something in the host? Just migrate
everything -- including memory.

But what Liu was proposing is completely insane and broken. You cannot
simply remap cpu:node relations. Wanting to do that shows a profound
lack of understanding. Our kernel assumes that a cpu remains on the same
node. All userspace that does anything with NUMA assumes the same. You
cannot change this.
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 14f7070..1246091 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -778,7 +778,7 @@ static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);
  * to a separate workqueue thread, which ends up processing the
  * above do_rebuild_sched_domains() function.
  */
-static void async_rebuild_sched_domains(void)
+void async_rebuild_sched_domains(void)
 {
 	queue_work(cpuset_wq, &rebuild_sched_domains_work);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e5212ae..3f72c1a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6343,6 +6343,60 @@ static struct sched_domain_topology_level default_topology[] = {
 	{ NULL, },
 };
 
+#ifdef CONFIG_VIRT_SCHED_DOMAIN
+/* fill in by host */
+DEFINE_PER_CPU(int, virt_numa_node);
+/* todo, exchange info about HOST_NUMNODES from host */
+#define HOST_NUMNODES 128
+/* keep map, node->cpumask; todo, make it dynamic allocated */
+static struct cpumask virt_node_to_cpumask_map[HOST_NUMNODES];
+
+static inline int virt_cpu_to_node(int cpu)
+{
+	return per_cpu(virt_numa_node, cpu);
+}
+
+const struct cpumask *virt_cpumask_of_node(int vnode)
+{
+	struct cpumask *msk = &virt_node_to_cpumask_map[vnode];
+	return msk;
+}
+
+static const struct cpumask *virt_cpu_cpu_mask(int cpu)
+{
+	return virt_cpumask_of_node(virt_cpu_to_node(cpu));
+}
+
+static struct sched_domain_topology_level virt_topology[] = {
+	{ sd_init_CPU, virt_cpu_cpu_mask, },
+#ifdef CONFIG_NUMA
+	{ sd_init_ALLNODES, cpu_allnodes_mask, },
+#endif
+	{ NULL, },
+};
+
+static int update_virt_numa_node(void)
+{
+	int i, cpu, apicid, vnode;
+
+	for (i = 0; i < HOST_NUMNODES; i++)
+		cpumask_clear(&virt_node_to_cpumask_map[i]);
+	for_each_possible_cpu(cpu) {
+		apicid = cpu_physical_id(cpu);
+		vnode = __vapicid_to_vnode[apicid];
+		per_cpu(virt_numa_node, cpu) = vnode;
+		cpumask_set_cpu(cpu, &virt_node_to_cpumask_map[vnode]);
+	}
+	return 0;
+}
+
+int rebuild_virt_sd(void)
+{
+	update_virt_numa_node();
+	async_rebuild_sched_domains();
+	return 0;
+}
+#endif
+
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
@@ -6689,9 +6743,11 @@ match1:
 	/* Build new domains */
 	for (i = 0; i < ndoms_new; i++) {
 		for (j = 0; j < ndoms_cur && !new_topology; j++) {
+#ifndef CONFIG_VIRT_SCHED_DOMAIN
 			if (cpumask_equal(doms_new[i], doms_cur[j])
 			    && dattrs_equal(dattr_new, i, dattr_cur, j))
 				goto match2;
+#endif
 		}
 		/* no match - add a new doms_new */
 		build_sched_domains(doms_new[i], dattr_new ? dattr_new + i : NULL);
@@ -6837,6 +6893,15 @@ void __init sched_init_smp(void)
 {
 	cpumask_var_t non_isolated_cpus;
 
+#ifdef CONFIG_VIRT_SCHED_DOMAIN
+	int i;
+	for (i = 0; i < MAX_LOCAL_APIC; i++) {
+		/* pretend all on the same node */
+		__vapicid_to_vnode[i] = 0;
+	}
+	update_virt_numa_node();
+	sched_domain_topology = virt_topology;
+#endif
 	alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
 	alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fb3acba..232482d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -8,6 +8,9 @@
 
 extern __read_mostly int scheduler_running;
 
+#ifdef CONFIG_VIRT_SCHED_DOMAIN
+extern s16 __vapicid_to_vnode[];
+#endif
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
@@ -198,6 +201,8 @@ struct cfs_bandwidth {
 };
 #endif /* CONFIG_CGROUP_SCHED */
 
+extern void async_rebuild_sched_domains(void);
+
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;