Patchwork [1/2] sched: add virt sched domain for the guest

Submitter Liu Ping Fan
Date May 23, 2012, 6:32 a.m.
Message ID <1337754751-9018-2-git-send-email-kernelfans@gmail.com>
Permalink /patch/160867/
State New

Comments

Liu Ping Fan - May 23, 2012, 6:32 a.m.
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

The guest's scheduler cannot see the host's NUMA info, which leads to
the following scenario:
  Suppose vcpu-a runs on nodeA and vcpu-b on nodeB. During load
balancing, pulling and pushing tasks between these vcpus costs more,
but the guest is currently blind to this.

This patch exports the host NUMA info to the guest and helps the
guest rebuild its sched domains based on the host's info.

--todo:
  vcpu hotplug will be considered.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 kernel/cpuset.c      |    2 +-
 kernel/sched/core.c  |   65 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    5 ++++
 3 files changed, 71 insertions(+), 1 deletions(-)
Peter Zijlstra - May 23, 2012, 7:54 a.m.
On Wed, 2012-05-23 at 14:32 +0800, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> The guest's scheduler cannot see the host's NUMA info, which leads to
> the following scenario:
>   Suppose vcpu-a runs on nodeA and vcpu-b on nodeB. During load
> balancing, pulling and pushing tasks between these vcpus costs more,
> but the guest is currently blind to this.
> 
> This patch exports the host NUMA info to the guest and helps the
> guest rebuild its sched domains based on the host's info.

Hell no, we're not going to export sched domains. If kvm/qemu wants
this, it's all in sysfs.

The whole sched_domain stuff is a big enough pain as it is, exporting
this and making it a sodding API is the worst thing ever. 

Whatever brainfart made you think this is needed anyway? sysfs contains
the host topology, qemu can already create whatever guest topology you
want (see the -smp and -numa arguments), so what gives?
Liu Ping Fan - May 23, 2012, 8:10 a.m.
On Wed, May 23, 2012 at 3:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2012-05-23 at 14:32 +0800, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> The guest's scheduler cannot see the host's NUMA info, which leads to
>> the following scenario:
>>   Suppose vcpu-a runs on nodeA and vcpu-b on nodeB. During load
>> balancing, pulling and pushing tasks between these vcpus costs more,
>> but the guest is currently blind to this.
>>
>> This patch exports the host NUMA info to the guest and helps the
>> guest rebuild its sched domains based on the host's info.
>
> Hell no, we're not going to export sched domains. If kvm/qemu wants
> this, it's all in sysfs.
>
> The whole sched_domain stuff is a big enough pain as it is, exporting
> this and making it a sodding API is the worst thing ever.
>
> Whatever brainfart made you think this is needed anyway? sysfs contains
> the host topology, qemu can already create whatever guest topology you
> want (see the -smp and -numa arguments), so what gives?

I think the -numa option is used to emulate a particular virtual
machine for the customer, and it does not necessarily map to the host
topology. And even if we map them exactly with the -numa option, the
movement of vcpu threads among host nodes will break the topology
initialized by -numa.
So why not give the guest an opportunity to adjust its topology?

Thanks and regards,
pingfan
Peter Zijlstra - May 23, 2012, 8:23 a.m.
On Wed, 2012-05-23 at 16:10 +0800, Liu ping fan wrote:
> the movement of vcpu
> threads among host nodes will break the topology initialized by -numa
> option. 

You want to remap vcpu to nodes? Are you bloody insane? cpu:node maps
are assumed static, you cannot make that a dynamic map and pray things
keep working.

Also, have you any idea how expensive it is to rebuild the topology vs
migrating the vcpu?
Liu Ping Fan - May 23, 2012, 8:34 a.m.
On Wed, May 23, 2012 at 4:23 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2012-05-23 at 16:10 +0800, Liu ping fan wrote:
>> the movement of vcpu
>> threads among host nodes will break the topology initialized by -numa
>> option.
>
> You want to remap vcpu to nodes? Are you bloody insane? cpu:node maps
> are assumed static, you cannot make that a dynamic map and pray things
> keep working.
>
> Also, have you any idea how expensive it is to rebuild the topology vs
> migrating the vcpu?
>
No, we would not rebuild the topology frequently. Suppose vcpus run
on node-A/B/C, and node-B is then unplugged; we need to migrate some
of the vcpus from node-B to node-A or to node-C.
Peter Zijlstra - May 23, 2012, 8:48 a.m.
On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
> so we need to migrate some of vcpus from node-B to node-A, or to
> node-C.

This is absolutely broken, you cannot do that.

A guest task might want to be node affine: it looks at the topology,
sets a cpu affinity mask, and expects to stay on that node.

But then you come along, and flip one of those cpus to another node. The
guest task will now run on another node and get remote memory accesses.

Similarly for the guest kernel, it assumes cpu:node maps are static, it
will use this for all kinds of things, including the allocation of
per-cpu memory to be node affine to that cpu.

If you go migrate cpus across nodes everything comes down.


Please go do something else, I'll do this.
Liu Ping Fan - May 23, 2012, 9:58 a.m.
On Wed, May 23, 2012 at 4:48 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
>> so we need to migrate some of vcpus from node-B to node-A, or to
>> node-C.
>
> This is absolutely broken, you cannot do that.
>
> A guest task might want to be node affine, it looks at the topology sets
> a cpu affinity mask and expects to stay on that node.
>
> But then you come along, and flip one of those cpus to another node. The
> guest task will now run on another node and get remote memory accesses.
>
Oh, I had thought of using -smp to handle such a situation. The
memory-access cost problem can be partly handled by kvm, while opening
a gap for the guest's scheduler to see the host NUMA info.

> Similarly for the guest kernel, it assumes cpu:node maps are static, it
> will use this for all kinds of things, including the allocation of
> per-cpu memory to be node affine to that cpu.
>
> If you go migrate cpus across nodes everything comes down.
>
>
> Please go do something else, I'll do this.

OK, thanks.
pingfan
Peter Zijlstra - May 23, 2012, 10:14 a.m.
On Wed, 2012-05-23 at 17:58 +0800, Liu ping fan wrote:
> > Please go do something else, I'll do this.
> 
OK, so that was to say never, as in dynamic cpu:node relations aren't
going to happen. But tip/sched/numa contains the bits needed to make
vnuma work.
Dave Hansen - May 23, 2012, 3:23 p.m.
On 05/23/2012 01:48 AM, Peter Zijlstra wrote:
> On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
>> > so we need to migrate some of vcpus from node-B to node-A, or to
>> > node-C.
> This is absolutely broken, you cannot do that.
> 
> A guest task might want to be node affine, it looks at the topology sets
> a cpu affinity mask and expects to stay on that node.
> 
> But then you come along, and flip one of those cpus to another node. The
> guest task will now run on another node and get remote memory accesses.

Insane, sure.  But, if the node has physically gone away, what do we do?
 I think we've got to either kill the guest, or let it run somewhere
suboptimal.  Sounds like you're advocating killing it. ;)
Peter Zijlstra - May 23, 2012, 3:52 p.m.
On Wed, 2012-05-23 at 08:23 -0700, Dave Hansen wrote:
> On 05/23/2012 01:48 AM, Peter Zijlstra wrote:
> > On Wed, 2012-05-23 at 16:34 +0800, Liu ping fan wrote:
> >> > so we need to migrate some of vcpus from node-B to node-A, or to
> >> > node-C.
> > This is absolutely broken, you cannot do that.
> > 
> > A guest task might want to be node affine, it looks at the topology sets
> > a cpu affinity mask and expects to stay on that node.
> > 
> > But then you come along, and flip one of those cpus to another node. The
> > guest task will now run on another node and get remote memory accesses.
> 
> Insane, sure.  But, if the node has physically gone away, what do we do?
>  I think we've got to either kill the guest, or let it run somewhere
> suboptimal.  Sounds like you're advocating killing it. ;)

You all seem terribly confused. If you want a guest that 100% mirrors
the host topology you need hard-binding of all vcpu threads and clearly
you're in trouble if you unplug a host cpu while there's still a vcpu
expecting to run there.

That's an administrator error and you get to keep the pieces, I don't
care.

In case you want simple virt-numa, where a number of vcpus constitute
a vnode and have all their memory on the same node the vcpus run on,
what does it matter if you unplug something in the host? Just migrate
everything -- including memory.

But what Liu was proposing is completely insane and broken. You cannot
simply remap cpu:node relations. Wanting to do that shows a profound
lack of understanding.

Our kernel assumes that a cpu remains on the same node. All userspace
that does anything with NUMA assumes the same. You cannot change this.

Patch

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 14f7070..1246091 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -778,7 +778,7 @@  static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);
  * to a separate workqueue thread, which ends up processing the
  * above do_rebuild_sched_domains() function.
  */
-static void async_rebuild_sched_domains(void)
+void async_rebuild_sched_domains(void)
 {
 	queue_work(cpuset_wq, &rebuild_sched_domains_work);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e5212ae..3f72c1a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6343,6 +6343,60 @@  static struct sched_domain_topology_level default_topology[] = {
 	{ NULL, },
 };
 
+#ifdef CONFIG_VIRT_SCHED_DOMAIN
+/* fill in by host */
+DEFINE_PER_CPU(int, virt_numa_node);
+/* todo, exchange info about HOST_NUMNODES from host */
+#define  HOST_NUMNODES  128
+/* keep map, node->cpumask; todo, make it dynamic allocated */
+static struct cpumask virt_node_to_cpumask_map[HOST_NUMNODES];
+
+static inline int virt_cpu_to_node(int cpu)
+{
+	return per_cpu(virt_numa_node, cpu);
+}
+
+const struct cpumask *virt_cpumask_of_node(int vnode)
+{
+	struct cpumask *msk = &virt_node_to_cpumask_map[vnode];
+	return msk;
+}
+
+static const struct cpumask *virt_cpu_cpu_mask(int cpu)
+{
+	return virt_cpumask_of_node(virt_cpu_to_node(cpu));
+}
+
+static struct sched_domain_topology_level virt_topology[] = {
+	{ sd_init_CPU, virt_cpu_cpu_mask, },
+#ifdef CONFIG_NUMA
+	{ sd_init_ALLNODES, cpu_allnodes_mask, },
+#endif
+	{ NULL, },
+};
+
+static int update_virt_numa_node(void)
+{
+	int i, cpu, apicid, vnode;
+	for (i = 0; i < HOST_NUMNODES; i++)
+		cpumask_clear(&virt_node_to_cpumask_map[i]);
+	for_each_possible_cpu(cpu) {
+		apicid = cpu_physical_id(cpu);
+		vnode = __vapicid_to_vnode[apicid];
+		per_cpu(virt_numa_node, cpu) = vnode;
+		cpumask_set_cpu(cpu, &virt_node_to_cpumask_map[vnode]);
+	}
+	return 0;
+}
+
+int rebuild_virt_sd(void)
+{
+	update_virt_numa_node();
+	async_rebuild_sched_domains();
+	return 0;
+}
+#endif
+
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
@@ -6689,9 +6743,11 @@  match1:
 	/* Build new domains */
 	for (i = 0; i < ndoms_new; i++) {
 		for (j = 0; j < ndoms_cur && !new_topology; j++) {
+#ifndef CONFIG_VIRT_SCHED_DOMAIN
 			if (cpumask_equal(doms_new[i], doms_cur[j])
 			    && dattrs_equal(dattr_new, i, dattr_cur, j))
 				goto match2;
+#endif
 		}
 		/* no match - add a new doms_new */
 		build_sched_domains(doms_new[i], dattr_new ? dattr_new + i : NULL);
@@ -6837,6 +6893,15 @@  void __init sched_init_smp(void)
 {
 	cpumask_var_t non_isolated_cpus;
 
+#ifdef CONFIG_VIRT_SCHED_DOMAIN
+	int i;
+	for (i = 0; i < MAX_LOCAL_APIC; i++) {
+		/* pretend all on the same node */
+		__vapicid_to_vnode[i] = 0;
+	}
+	update_virt_numa_node();
+	sched_domain_topology = virt_topology;
+#endif
 	alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
 	alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fb3acba..232482d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -8,6 +8,9 @@ 
 
 extern __read_mostly int scheduler_running;
 
+#ifdef CONFIG_VIRT_SCHED_DOMAIN
+extern s16 __vapicid_to_vnode[];
+#endif
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
@@ -198,6 +201,8 @@  struct cfs_bandwidth { };
 
 #endif	/* CONFIG_CGROUP_SCHED */
 
+extern void async_rebuild_sched_domains(void);
+
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;