Patchwork [v2,7/11] Use stop machine to update cpu maps

login
register
mail settings
Submitter Nathan Fontenot
Date March 25, 2013, 6:58 p.m.
Message ID <51509E3C.2030008@linux.vnet.ibm.com>
Download mbox | patch
Permalink /patch/230972/
State Changes Requested, archived
Delegated to: Michael Ellerman
Headers show

Comments

Nathan Fontenot - March 25, 2013, 6:58 p.m.
From: Jesse Larrew <jlarrew@linux.vnet.ibm.com>

The new PRRN firmware feature allows CPU and memory resources to be
transparently reassigned across NUMA boundaries. When this happens, the
kernel must update the node maps to reflect the new affinity
information.

Although the NUMA maps can be protected by locking primitives during the
update itself, this is insufficient to prevent concurrent accesses to these
structures. Since cpumask_of_node() hands out a pointer to these
structures, they can still be modified outside of the lock. Furthermore,
tracking down each usage of these pointers and adding locks would be quite
invasive and difficult to maintain.

Situations like these are best handled using stop_machine(). Since the NUMA
affinity updates are exceptionally rare events, this approach has the
benefit of not adding any overhead while accessing the NUMA maps during
normal operation.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c |   51 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 35 insertions(+), 16 deletions(-)
Paul Mackerras - April 4, 2013, 4:46 a.m.
On Mon, Mar 25, 2013 at 01:58:04PM -0500, Nathan Fontenot wrote:
> From: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
> 
> The new PRRN firmware feature allows CPU and memory resources to be
> transparently reassigned across NUMA boundaries. When this happens, the
> kernel must update the node maps to reflect the new affinity
> information.
> 
> Although the NUMA maps can be protected by locking primitives during the
> update itself, this is insufficient to prevent concurrent accesses to these
> structures. Since cpumask_of_node() hands out a pointer to these
> structures, they can still be modified outside of the lock. Furthermore,
> tracking down each usage of these pointers and adding locks would be quite
> invasive and difficult to maintain.
> 
> Situations like these are best handled using stop_machine(). Since the NUMA
> affinity updates are exceptionally rare events, this approach has the
> benefit of not adding any overhead while accessing the NUMA maps during
> normal operation.

I notice you do one stop_machine() call for every cpu whose affinity
has changed.  Couldn't we update the affinity for them all in one
stop_machine call?  Given that stopping the whole machine can be quite
slow, wouldn't it be better to do one call rather than potentially
many?

Paul.
Nathan Fontenot - April 5, 2013, 6:22 p.m.
On 04/03/2013 11:46 PM, Paul Mackerras wrote:
> On Mon, Mar 25, 2013 at 01:58:04PM -0500, Nathan Fontenot wrote:
>> From: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
>>
>> The new PRRN firmware feature allows CPU and memory resources to be
>> transparently reassigned across NUMA boundaries. When this happens, the
>> kernel must update the node maps to reflect the new affinity
>> information.
>>
>> Although the NUMA maps can be protected by locking primitives during the
>> update itself, this is insufficient to prevent concurrent accesses to these
>> structures. Since cpumask_of_node() hands out a pointer to these
>> structures, they can still be modified outside of the lock. Furthermore,
>> tracking down each usage of these pointers and adding locks would be quite
>> invasive and difficult to maintain.
>>
>> Situations like these are best handled using stop_machine(). Since the NUMA
>> affinity updates are exceptionally rare events, this approach has the
>> benefit of not adding any overhead while accessing the NUMA maps during
>> normal operation.
> 
> I notice you do one stop_machine() call for every cpu whose affinity
> has changed.  Couldn't we update the affinity for them all in one
> stop_machine call?  Given that stopping the whole machine can be quite
> slow, wouldn't it be better to do one call rather than potentially
> many?
> 

Agreed, having to call stop_machine() for each cpu that gets updated is
pretty brutal. The plus side is that PRRN events should a rare occurrence 
and not cause too much pain.

The current design ties into the of notification chain so that we can do
the affinity update when the affinity property in the device tree is updated.
Switching to doing one stop and updating all of the cpus would require a
design change....and....

I went back and looked at the code again and there is another issue with
way this is done. Tying into the of notification chain is great for
being informed of when a property changes but the code (from patch 6/11)

+	case OF_RECONFIG_ADD_PROPERTY:
+	case OF_RECONFIG_UPDATE_PROPERTY:
+		update = (struct of_prop_reconfig *)data;
+		if (!of_prop_cmp(update->dn->type, "cpu")) {
+			u32 core_id;
+			of_property_read_u32(update->dn, "reg", &core_id);
+			stage_topology_update(core_id);
+			rc = NOTIFY_OK;
+		}
+		break;

Does not check to see which property is being updated and just assumes
the affinity is being updated. This code as is will do an affinity update
every time any property of a cpu is updated or added.

Since this needs an update I will also look at possibly doing this so
that we call stop_machine only once.
Benjamin Herrenschmidt - April 23, 2013, 12:23 a.m.
On Fri, 2013-04-05 at 13:22 -0500, Nathan Fontenot wrote:

> Agreed, having to call stop_machine() for each cpu that gets updated is
> pretty brutal. The plus side is that PRRN events should a rare occurrence 
> and not cause too much pain.

So that doesn't happen on VPHN changes ?

> The current design ties into the of notification chain so that we can do
> the affinity update when the affinity property in the device tree is updated.
> Switching to doing one stop and updating all of the cpus would require a
> design change....and....
> 
> I went back and looked at the code again and there is another issue with
> way this is done. Tying into the of notification chain is great for
> being informed of when a property changes but the code (from patch 6/11)
> 
> +	case OF_RECONFIG_ADD_PROPERTY:
> +	case OF_RECONFIG_UPDATE_PROPERTY:
> +		update = (struct of_prop_reconfig *)data;
> +		if (!of_prop_cmp(update->dn->type, "cpu")) {
> +			u32 core_id;
> +			of_property_read_u32(update->dn, "reg", &core_id);
> +			stage_topology_update(core_id);
> +			rc = NOTIFY_OK;
> +		}
> +		break;
> 
> Does not check to see which property is being updated and just assumes
> the affinity is being updated. This code as is will do an affinity update
> every time any property of a cpu is updated or added.
> 
> Since this needs an update I will also look at possibly doing this so
> that we call stop_machine only once.

Any new patch set ?

Cheers,
Ben.

Patch

Index: powerpc/arch/powerpc/mm/numa.c
===================================================================
--- powerpc.orig/arch/powerpc/mm/numa.c	2013-03-20 12:26:36.000000000 -0500
+++ powerpc/arch/powerpc/mm/numa.c	2013-03-20 12:27:43.000000000 -0500
@@ -22,6 +22,7 @@ 
 #include <linux/pfn.h>
 #include <linux/cpuset.h>
 #include <linux/node.h>
+#include <linux/stop_machine.h>
 #include <asm/sparsemem.h>
 #include <asm/prom.h>
 #include <asm/smp.h>
@@ -1254,6 +1255,12 @@ 
 
 /* Virtual Processor Home Node (VPHN) support */
 #ifdef CONFIG_PPC_SPLPAR
+struct topology_update_data {
+	int cpu;
+	int old_nid;
+	int new_nid;
+};
+
 static u8 vphn_cpu_change_counts[NR_CPUS][MAX_DISTANCE_REF_POINTS];
 static cpumask_t cpu_associativity_changes_mask;
 static int vphn_enabled;
@@ -1405,34 +1412,46 @@ 
 }
 
 /*
+ * Update the CPU maps and sysfs entries for a single CPU when its NUMA
+ * characteristics change. This function doesn't perform any locking and is
+ * only safe to call from stop_machine().
+ */
+static int update_cpu_topology(void *data)
+{
+	struct topology_update_data *update = data;
+
+	if (!update)
+		return -EINVAL;
+
+	unregister_cpu_under_node(update->cpu, update->old_nid);
+	unmap_cpu_from_node(update->cpu);
+	map_cpu_to_node(update->cpu, update->new_nid);
+	register_cpu_under_node(update->cpu, update->new_nid);
+
+	return 0;
+}
+
+/*
  * Update the node maps and sysfs entries for each cpu whose home node
  * has changed. Returns 1 when the topology has changed, and 0 otherwise.
  */
 int arch_update_cpu_topology(void)
 {
-	int cpu, nid, old_nid, changed = 0;
+	int cpu, changed = 0;
+	struct topology_update_data update;
 	unsigned int associativity[VPHN_ASSOC_BUFSIZE] = {0};
 	struct device *dev;
 
 	for_each_cpu(cpu, &cpu_associativity_changes_mask) {
+		update.cpu = cpu;
 		vphn_get_associativity(cpu, associativity);
-		nid = associativity_to_nid(associativity);
-
-		if (nid < 0 || !node_online(nid))
-			nid = first_online_node;
+		update.new_nid = associativity_to_nid(associativity);
 
-		old_nid = numa_cpu_lookup_table[cpu];
-
-		/* Disable hotplug while we update the cpu
-		 * masks and sysfs.
-		 */
-		get_online_cpus();
-		unregister_cpu_under_node(cpu, old_nid);
-		unmap_cpu_from_node(cpu);
-		map_cpu_to_node(cpu, nid);
-		register_cpu_under_node(cpu, nid);
-		put_online_cpus();
+		if (update.new_nid < 0 || !node_online(update.new_nid))
+			update.new_nid = first_online_node;
 
+		update.old_nid = numa_cpu_lookup_table[cpu];
+		stop_machine(update_cpu_topology, &update, cpu_online_mask);
 		dev = get_cpu_device(cpu);
 		if (dev)
 			kobject_uevent(&dev->kobj, KOBJ_CHANGE);