
[1/2] powerpc/workqueue: update list of possible CPUs

Message ID 20170821134951.18848-1-lvivier@redhat.com (mailing list archive)
State Not Applicable

Commit Message

Laurent Vivier Aug. 21, 2017, 1:49 p.m. UTC
In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
is built.

Unfortunately, on powerpc, the Firmware is only able to provide the
node of a CPU if the CPU is present. So, in our case (possible CPU)
CPU ids are known, but as the CPU is not present, the node id is
unknown and all the unplugged CPUs are attached to node 0.

When we hotplug a CPU, there can be an inconsistency between
the node id known by the workqueue and the real node id of
the CPU.

This patch adds a hotplug hook that updates the node entry for the
hotplugged CPU in the wq_numa_possible_cpumask array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 arch/powerpc/kernel/smp.c |  1 +
 include/linux/workqueue.h |  1 +
 kernel/workqueue.c        | 21 +++++++++++++++++++++
 3 files changed, 23 insertions(+)

Comments

Tejun Heo Aug. 21, 2017, 2:48 p.m. UTC | #1
On Mon, Aug 21, 2017 at 03:49:50PM +0200, Laurent Vivier wrote:
> In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
> is built.
> 
> Unfortunately, on powerpc, the Firmware is only able to provide the
> node of a CPU if the CPU is present. So, in our case (possible CPU)
> CPU ids are known, but as the CPU is not present, the node id is
> unknown and all the unplugged CPUs are attached to node 0.

This is something powerpc needs to fix.  Workqueue isn't the only one
making this assumption.  mm as a whole assumes that CPU <-> node
mapping is stable regardless of hotplug events.  Powerpc people know
about the issue and AFAIK are working on it.

Thanks.
Michael Ellerman Aug. 22, 2017, 1:41 a.m. UTC | #2
Tejun Heo <tj@kernel.org> writes:

> On Mon, Aug 21, 2017 at 03:49:50PM +0200, Laurent Vivier wrote:
>> In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
>> is built.
>> 
>> Unfortunately, on powerpc, the Firmware is only able to provide the
>> node of a CPU if the CPU is present. So, in our case (possible CPU)
>> CPU ids are known, but as the CPU is not present, the node id is
>> unknown and all the unplugged CPUs are attached to node 0.
>
> This is something powerpc needs to fix.

There is no way for us to fix it.

At boot, for possible but not present CPUs, we have no way of knowing
the CPU <-> node mapping; firmware simply doesn't tell us.

> Workqueue isn't the only one making this assumption. mm as a whole
> assumes that CPU <-> node mapping is stable regardless of hotplug
> events.

At least in this case I don't think the mapping changes, it's just we
don't know the mapping at boot.

Currently we have to report possible but not present CPUs as belonging
to node 0, because otherwise we trip this helpful piece of code:

	for_each_possible_cpu(cpu) {
		node = cpu_to_node(cpu);
		if (WARN_ON(node == NUMA_NO_NODE)) {
			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
			/* happens iff arch is bonkers, let's just proceed */
			return;
		}

But if we remove that, we could then accurately report NUMA_NO_NODE at
boot, and then update the mapping when the CPU is hotplugged.
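
A rough sketch of what that could look like on the arch side, assuming the
WARN_ON() quoted above is relaxed to tolerate NUMA_NO_NODE.
numa_cpu_lookup_table[] and set_cpu_numa_node() already exist (in the
powerpc and generic topology code respectively); the two wrapper functions
and their names are purely hypothetical, not a tested patch:

#include <linux/init.h>
#include <linux/numa.h>
#include <linux/topology.h>
#include <asm/mmzone.h>         /* powerpc: numa_cpu_lookup_table[] */

/* boot: a possible but not present CPU gets "no node" instead of node 0 */
static void __init example_mark_cpu_node_unknown(int cpu)
{
        numa_cpu_lookup_table[cpu] = NUMA_NO_NODE;
        set_cpu_numa_node(cpu, NUMA_NO_NODE);
}

/* hotplug: firmware has now told us the real node id of this CPU */
static void example_update_cpu_node(int cpu, int nid)
{
        numa_cpu_lookup_table[cpu] = nid;
        set_cpu_numa_node(cpu, nid);
}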

cheers
Tejun Heo Aug. 22, 2017, 4:54 p.m. UTC | #3
Hello, Michael.

On Tue, Aug 22, 2017 at 11:41:41AM +1000, Michael Ellerman wrote:
> > This is something powerpc needs to fix.
> 
> There is no way for us to fix it.

I don't think that's true.  The CPU id used in kernel doesn't have to
match the physical one and arch code should be able to pre-map CPU IDs
to nodes and use the matching one when hotplugging CPUs.  I'm not
saying that's the best way to solve the problem tho.  It could be that
the best way forward is making cpu <-> node mapping dynamic and
properly synchronized.  However, please note that that does mean we
mess up node affinity for things like per-cpu memory which are
allocated before the cpu comes up, so there are some inherent benefits
to keeping the mapping static even if that involves indirection.

> > Workqueue isn't the only one making this assumption. mm as a whole
> > assumes that CPU <-> node mapping is stable regardless of hotplug
> > events.
> 
> At least in this case I don't think the mapping changes, it's just we
> don't know the mapping at boot.
> 
> Currently we have to report possible but not present CPUs as belonging
> to node 0, because otherwise we trip this helpful piece of code:
> 
> 	for_each_possible_cpu(cpu) {
> 		node = cpu_to_node(cpu);
> 		if (WARN_ON(node == NUMA_NO_NODE)) {
> 			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
> 			/* happens iff arch is bonkers, let's just proceed */
> 			return;
> 		}
> 
> But if we remove that, we could then accurately report NUMA_NO_NODE at
> boot, and then update the mapping when the CPU is hotplugged.

If you think that making this dynamic is the right way to go, I have
no objection but we should be doing this properly instead of patching
up what seems to be crashing right now.  What synchronization and
notification mechanisms do we need to make cpu <-> node mapping
dynamic?  Do we need any synchronization in memory allocation paths?
If not, why would it be safe?

Thanks.
Michael Ellerman Aug. 23, 2017, 11 a.m. UTC | #4
Hi Tejun,

Tejun Heo <tj@kernel.org> writes:
> Hello, Michael.
>
> On Tue, Aug 22, 2017 at 11:41:41AM +1000, Michael Ellerman wrote:
>> > This is something powerpc needs to fix.
>> 
>> There is no way for us to fix it.
>
> I don't think that's true.  The CPU id used in kernel doesn't have to
> match the physical one and arch code should be able to pre-map CPU IDs
> to nodes and use the matching one when hotplugging CPUs.  I'm not
> saying that's the best way to solve the problem tho.

We already virtualise the CPU numbers, but not the node IDs. And it's
the node IDs that are really the problem.

So yeah I guess we might be able to make that work, but I'd have to
think about it a bit more.

> It could be that the best way forward is making cpu <-> node mapping
> dynamic and properly synchronized.

We don't need it to be dynamic (at least for this bug).

Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just
that because some CPUs aren't present at boot we don't know what the
node mapping is. (Correct me if I'm wrong Laurent).

So all we need is:
 - the workqueue code to cope with CPUs that are possible but not online
   having NUMA_NO_NODE to begin with.
 - a way to update the workqueue cpumask when the CPU comes online.


Which seems reasonable to me?

cheers
Laurent Vivier Aug. 23, 2017, 11:17 a.m. UTC | #5
On 23/08/2017 13:00, Michael Ellerman wrote:
> Hi Tejun,
> 
> Tejun Heo <tj@kernel.org> writes:
>> Hello, Michael.
>>
>> On Tue, Aug 22, 2017 at 11:41:41AM +1000, Michael Ellerman wrote:
>>>> This is something powerpc needs to fix.
>>>
>>> There is no way for us to fix it.
>>
>> I don't think that's true.  The CPU id used in kernel doesn't have to
>> match the physical one and arch code should be able to pre-map CPU IDs
>> to nodes and use the matching one when hotplugging CPUs.  I'm not
>> saying that's the best way to solve the problem tho.
> 
> We already virtualise the CPU numbers, but not the node IDs. And it's
> the node IDs that are really the problem.
> 
> So yeah I guess we might be able to make that work, but I'd have to
> think about it a bit more.
> 
>> It could be that the best way forward is making cpu <-> node mapping
>> dynamic and properly synchronized.
> 
> We don't need it to be dynamic (at least for this bug).
> 
> Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just
> that because some CPUs aren't present at boot we don't know what the
> node mapping is. (Correct me if I'm wrong Laurent).

You're correct.

Qemu is started with:

    -numa node,cpus=0-1 -numa node,cpus=2-3 \
    -smp 2,maxcpus=4,cores=4,threads=1,sockets=1

Which means we have 2 nodes with cpu ids 0 and 1 on node 0, and cpu ids
2 and 3 on node 1, but at boot only 2 CPUs are present.

The problem I'm trying to fix with this series shows up when we hotplug a
third CPU, with id 2, onto node 1.

Thanks,
Laurent
Tejun Heo Aug. 23, 2017, 1:26 p.m. UTC | #6
Hello, Michael.

On Wed, Aug 23, 2017 at 09:00:39PM +1000, Michael Ellerman wrote:
> > I don't think that's true.  The CPU id used in kernel doesn't have to
> > match the physical one and arch code should be able to pre-map CPU IDs
> > to nodes and use the matching one when hotplugging CPUs.  I'm not
> > saying that's the best way to solve the problem tho.
> 
> We already virtualise the CPU numbers, but not the node IDs. And it's
> the node IDs that are really the problem.

Yeah, it just needs to match up new cpus to the cpu ids assigned to
the right node.

> > It could be that the best way forward is making cpu <-> node mapping
> > dynamic and properly synchronized.
> 
> We don't need it to be dynamic (at least for this bug).

The node mapping for that cpu id changes *dynamically* while the
system is running and that can race with node-affinity sensitive
operations such as memory allocations.

> Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just
> that because some CPUs aren't present at boot we don't know what the
> node mapping is. (Correct me if I'm wrong Laurent).
> 
> So all we need is:
>  - the workqueue code to cope with CPUs that are possible but not online
>    having NUMA_NO_NODE to begin with.
>  - a way to update the workqueue cpumask when the CPU comes online.
> 
> Which seems reasonable to me?

Please take a step back and think through the problem again.  You
can't bandaid it this way.

Thanks.
Laurent Vivier Aug. 24, 2017, 12:10 p.m. UTC | #7
On 23/08/2017 15:26, Tejun Heo wrote:
> Hello, Michael.
> 
> On Wed, Aug 23, 2017 at 09:00:39PM +1000, Michael Ellerman wrote:
>>> I don't think that's true.  The CPU id used in kernel doesn't have to
>>> match the physical one and arch code should be able to pre-map CPU IDs
>>> to nodes and use the matching one when hotplugging CPUs.  I'm not
>>> saying that's the best way to solve the problem tho.
>>
>> We already virtualise the CPU numbers, but not the node IDs. And it's
>> the node IDs that are really the problem.
> 
> Yeah, it just needs to match up new cpus to the cpu ids assigned to
> the right node.

We are not able to assign the cpu ids to the right node before the CPU
is present, because the firmware doesn't provide the CPU <-> node id
mapping before that.

>>> It could be that the best way forward is making cpu <-> node mapping
>>> dynamic and properly synchronized.
>>
>> We don't need it to be dynamic (at least for this bug).
> 
> The node mapping for that cpu id changes *dynamically* while the
> system is running and that can race with node-affinity sensitive
> operations such as memory allocations.

Memory is mapped to the node through its own firmware entry, so I don't
think cpu id change can affect memory affinity, and before we know the
node id of the CPU, the CPU is not present and thus it can't use memory.

>> Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just
>> that because some CPUs aren't present at boot we don't know what the
>> node mapping is. (Correct me if I'm wrong Laurent).
>>
>> So all we need is:
>>  - the workqueue code to cope with CPUs that are possible but not online
>>    having NUMA_NO_NODE to begin with.
>>  - a way to update the workqueue cpumask when the CPU comes online.
>>
>> Which seems reasonable to me?
> 
> Please take a step back and think through the problem again.  You
> can't bandaid it this way.

Could you give some ideas, proposals?
As the firmware doesn't provide the information before the CPU is really
plugged, I really don't know how to manage this problem.

Thanks,
Laurent
Tejun Heo Aug. 24, 2017, 1:51 p.m. UTC | #8
Hello, Laurent.

On Thu, Aug 24, 2017 at 02:10:31PM +0200, Laurent Vivier wrote:
> > Yeah, it just needs to match up new cpus to the cpu ids assigned to
> > the right node.
> 
> We are not able to assign the cpu ids to the right node before the CPU
> is present, because the firmware doesn't provide the CPU <-> node id
> mapping before that.

What I meant was to assign the matching CPU ID when the CPU becomes
present - ie. have CPU IDs available for different nodes and allocate
them to the new CPU according to its node mapping when it actually
comes up.  Please note that I'm not saying this is the way to go, just
that it is a solvable problem from the arch code.

> > The node mapping for that cpu id changes *dynamically* while the
> > system is running and that can race with node-affinity sensitive
> > operations such as memory allocations.
> 
> Memory is mapped to the node through its own firmware entry, so I don't
> think cpu id change can affect memory affinity, and before we know the
> node id of the CPU, the CPU is not present and thus it can't use memory.

The latter part isn't true.  For example, percpu memory gets allocated
for all possible CPUs according to their node affinity, so the memory
node association change which happens when the CPU comes up for the
first time can race against such allocations.  I don't know whether
that's actually problematic but we don't have *any* synchronization
around it.  If you think it's safe to have such races, please explain
why that is.
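
To make the race concrete, here is a simplified illustration of the pattern
being described, not the actual percpu allocator code; example_bufs and
example_alloc_bufs() are hypothetical names:

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/topology.h>

/* hypothetical per-possible-CPU buffers, placed once at boot */
static void *example_bufs[NR_CPUS];

static int __init example_alloc_bufs(void)
{
        unsigned int cpu;

        for_each_possible_cpu(cpu) {
                /* for a not-yet-present CPU this is node 0 (or NUMA_NO_NODE) */
                int node = cpu_to_node(cpu);

                /*
                 * The memory lands on 'node' and stays there, even if the
                 * CPU later turns out to belong to a different node.
                 */
                example_bufs[cpu] = kzalloc_node(4096, GFP_KERNEL, node);
                if (!example_bufs[cpu])
                        return -ENOMEM;
        }
        return 0;
}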

> > Please take a step back and think through the problem again.  You
> > can't bandaid it this way.
> 
> Could you give some ideas, proposals?
> As the firmware doesn't provide the information before the CPU is really
> plugged, I really don't know how to manage this problem.

There are two possible approaches, I think.

1. Make physical cpu -> logical cpu mapping indirect so that the
   kernel's cpu ID assignment is always on the right numa node.  This
   may mean that the kernel might have to keep more possible CPUs
   around than necessary but it does have the benefit that all memory
   allocations are affine to the right node.

2. Make cpu <-> node mapping properly dynamic.  Identify what sort of
   synchronization we'd need around the mapping changing dynamically.
   Note that we might not need much but it'll most likely need some.
   Build synchronization and notification infrastructure around it.

Thanks.
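
To make option 1 above a little more concrete, here is a minimal, untested
sketch. Every name in it (example_spare_ids, example_reserve_ids,
example_pick_logical_cpu) is hypothetical, and the round-robin split of
spare ids across nodes is only a placeholder policy:

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/topology.h>

/* hypothetical per-node pools of spare logical CPU ids */
static cpumask_t example_spare_ids[MAX_NUMNODES];

/* boot: park each possible-but-not-present id on a node and pin its mapping */
static void __init example_reserve_ids(void)
{
        unsigned int cpu;
        int node = 0;

        for_each_possible_cpu(cpu) {
                if (cpu_present(cpu))
                        continue;
                cpumask_set_cpu(cpu, &example_spare_ids[node]);
                set_cpu_numa_node(cpu, node);
                node = (node + 1) % num_possible_nodes();
        }
}

/* hot-add: hand out a logical id whose node already matches @nid */
static int example_pick_logical_cpu(int nid)
{
        unsigned int cpu = cpumask_first(&example_spare_ids[nid]);

        if (cpu >= nr_cpu_ids)
                return -ENODEV;
        cpumask_clear_cpu(cpu, &example_spare_ids[nid]);
        return cpu;
}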

Patch

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8d3320562c70..abc533146514 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -964,6 +964,7 @@ void start_secondary(void *unused)
 
 	set_numa_node(numa_cpu_lookup_table[cpu]);
 	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
+	wq_numa_add_possible_cpu(cpu);
 
 	smp_wmb();
 	notify_cpu_starting(cpu);
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index db6dc9dc0482..4ef030dae22c 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -644,6 +644,7 @@ int workqueue_online_cpu(unsigned int cpu);
 int workqueue_offline_cpu(unsigned int cpu);
 #endif
 
+void wq_numa_add_possible_cpu(unsigned int cpu);
 int __init workqueue_init_early(void);
 int __init workqueue_init(void);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ca937b0c3a96..d1a99e25d5da 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5486,6 +5486,27 @@ static inline void wq_watchdog_init(void) { }
 
 #endif	/* CONFIG_WQ_WATCHDOG */
 
+void wq_numa_add_possible_cpu(unsigned int cpu)
+{
+	int node;
+
+	if (num_possible_nodes() <= 1)
+		return;
+
+	if (wq_disable_numa)
+		return;
+
+	if (!wq_numa_enabled)
+		return;
+
+	node = cpu_to_node(cpu);
+	if (node == NUMA_NO_NODE)
+		return;
+
+	cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+}
+EXPORT_SYMBOL_GPL(wq_numa_add_possible_cpu);
+
 static void __init wq_numa_init(void)
 {
 	cpumask_var_t *tbl;