[v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

Message ID 1568724534-146242-1-git-send-email-linyunsheng@huawei.com

Commit Message

Yunsheng Lin Sept. 17, 2019, 12:48 p.m. UTC
When the return value of dev_to_node() is passed to cpumask_of_node()
without checking whether the device's node id is NUMA_NO_NODE, a
global-out-of-bounds access is detected by KASAN.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node()
should always return all online cpus when the caller passes NUMA_NO_NODE
as the node id, similar to the semantics the page allocator uses for
NUMA_NO_NODE.

But we cannot simply copy the page allocator logic, because the page
allocator does not enforce node affinity: it merely picks the node as
preferred and is then free to fall back to any other numa node. That is
not the case here; node_to_cpumask_map() restricts the result to the
particular node's cpus, which would give really non-deterministic
behavior depending on where the code is executed. So in fact we really
want to return cpu_online_mask for NUMA_NO_NODE.

There is also a debugging version of node_to_cpumask_map() for x86 and
arm64, used only when CONFIG_DEBUG_PER_CPU_MAPS is defined; this patch
changes it to handle NUMA_NO_NODE the same way as the normal
node_to_cpumask_map().

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
V6: Drop the cpu_all_mask -> cpu_online_mask change, since it seems a
    little controversial and may need deeper investigation; rebase on
    the latest linux-next.
V5: Drop the unsigned "fix" change for x86/arm64, and change the commit
    log according to Michal's comments.
V4: Have all these changes in a single patch.
V3: Change to only handle NUMA_NO_NODE, return cpu_online_mask for the
    NUMA_NO_NODE case, and change the commit log to better justify the
    change.
V2: Make the node id checking change for the other arches too.
---
 arch/arm64/include/asm/numa.h                    | 3 +++
 arch/arm64/mm/numa.c                             | 3 +++
 arch/mips/include/asm/mach-loongson64/topology.h | 4 +++-
 arch/s390/include/asm/topology.h                 | 3 +++
 arch/x86/include/asm/topology.h                  | 3 +++
 arch/x86/mm/numa.c                               | 3 +++
 6 files changed, 18 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra Sept. 23, 2019, 3:15 p.m. UTC | #1
On Tue, Sep 17, 2019 at 08:48:54PM +0800, Yunsheng Lin wrote:
> When passing the return value of dev_to_node() to cpumask_of_node()
> without checking if the device's node id is NUMA_NO_NODE, there is
> global-out-of-bounds detected by KASAN.
> 
> From the discussion [1], NUMA_NO_NODE really means no node affinity,
> which also means all cpus should be usable. So the cpumask_of_node()
> should always return all cpus online when user passes the node id as
> NUMA_NO_NODE, just like similar semantic that page allocator handles
> NUMA_NO_NODE.
> 
> But we cannot really copy the page allocator logic. Simply because the
> page allocator doesn't enforce the near node affinity. It just picks it
> up as a preferred node but then it is free to fallback to any other numa
> node. This is not the case here and node_to_cpumask_map will only restrict
> to the particular node's cpus which would have really non deterministic
> behavior depending on where the code is executed. So in fact we really
> want to return cpu_online_mask for NUMA_NO_NODE.
> 
> Also there is a debugging version of node_to_cpumask_map() for x86 and
> arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this
> patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map().
> 
> [1] https://lore.kernel.org/patchwork/patch/1125789/

That is bloody unusable, don't do that. Use:

  https://lkml.kernel.org/r/$MSGID

if anything. Then I can find it in my local mbox without having to
resort to touching a mouse and shitty browser software.

(also patchwork is absolute crap for reading email threads)

Anyway, I found it -- I think, I refused to click the link. I replied
there.

> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Acked-by: Michal Hocko <mhocko@suse.com>



> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 4123100e..9859acb 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
>   */
>  const struct cpumask *cpumask_of_node(int node)
>  {
> +	if (node == NUMA_NO_NODE)
> +		return cpu_online_mask;

This mandates the caller holds cpus_read_lock() or something, I'm pretty
sure that if I put:

	lockdep_assert_cpus_held();

here, it comes apart real quick. Without holding the cpu hotplug lock,
the online mask is gibberish.

> +
>  	if ((unsigned)node >= nr_node_ids) {
>  		printk(KERN_WARNING
>  			"cpumask_of_node(%d): (unsigned)node >= nr_node_ids(%u)\n",

I still think this makes absolutely no sense what so ever.
Michal Hocko Sept. 23, 2019, 3:28 p.m. UTC | #2
On Mon 23-09-19 17:15:19, Peter Zijlstra wrote:
> On Tue, Sep 17, 2019 at 08:48:54PM +0800, Yunsheng Lin wrote:
> > When passing the return value of dev_to_node() to cpumask_of_node()
> > without checking if the device's node id is NUMA_NO_NODE, there is
> > global-out-of-bounds detected by KASAN.
> > 
> > From the discussion [1], NUMA_NO_NODE really means no node affinity,
> > which also means all cpus should be usable. So the cpumask_of_node()
> > should always return all cpus online when user passes the node id as
> > NUMA_NO_NODE, just like similar semantic that page allocator handles
> > NUMA_NO_NODE.
> > 
> > But we cannot really copy the page allocator logic. Simply because the
> > page allocator doesn't enforce the near node affinity. It just picks it
> > up as a preferred node but then it is free to fallback to any other numa
> > node. This is not the case here and node_to_cpumask_map will only restrict
> > to the particular node's cpus which would have really non deterministic
> > behavior depending on where the code is executed. So in fact we really
> > want to return cpu_online_mask for NUMA_NO_NODE.
> > 
> > Also there is a debugging version of node_to_cpumask_map() for x86 and
> > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this
> > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map().
> > 
> > [1] https://lore.kernel.org/patchwork/patch/1125789/
> 
> That is bloody unusable, don't do that. Use:
> 
>   https://lkml.kernel.org/r/$MSGID
> 
> if anything. Then I can find it in my local mbox without having to
> resort to touching a mouse and shitty browser software.
> 
> (also patchwork is absolute crap for reading email threads)
> 
> Anyway, I found it -- I think, I refused to click the link. I replied
> there.
> 
> > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> > Suggested-by: Michal Hocko <mhocko@kernel.org>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> 
> 
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 4123100e..9859acb 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
> >   */
> >  const struct cpumask *cpumask_of_node(int node)
> >  {
> > +	if (node == NUMA_NO_NODE)
> > +		return cpu_online_mask;
> 
> This mandates the caller holds cpus_read_lock() or something, I'm pretty
> sure that if I put:
> 
> 	lockdep_assert_cpus_held();

Is this documented somewhere? Also how does that differ from a normal
case when a proper node is used? The cpumask will always be dynamic in
the cpu hotplug presence, right?

> here, it comes apart real quick. Without holding the cpu hotplug lock,
> the online mask is gibberish.

Can the returned cpu mask go away?
Peter Zijlstra Sept. 23, 2019, 3:48 p.m. UTC | #3
On Mon, Sep 23, 2019 at 05:28:56PM +0200, Michal Hocko wrote:
> On Mon 23-09-19 17:15:19, Peter Zijlstra wrote:

> > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > index 4123100e..9859acb 100644
> > > --- a/arch/x86/mm/numa.c
> > > +++ b/arch/x86/mm/numa.c
> > > @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
> > >   */
> > >  const struct cpumask *cpumask_of_node(int node)
> > >  {
> > > +	if (node == NUMA_NO_NODE)
> > > +		return cpu_online_mask;
> > 
> > This mandates the caller holds cpus_read_lock() or something, I'm pretty
> > sure that if I put:
> > 
> > 	lockdep_assert_cpus_held();
> 
> Is this documented somewhere?

No idea... common sense :-)

> Also how does that differ from a normal
> case when a proper node is used? The cpumask will always be dynamic in
> the cpu hotplug presence, right?

As per normal yes, and I'm fairly sure there's a ton of bugs. Any
'online' state is subject to change except when you're holding
sufficient locks to stop it.

Disabling preemption also stabilizes it, because cpu unplug relies on
stop-machine.

> > here, it comes apart real quick. Without holding the cpu hotplug lock,
> > the online mask is gibberish.
> 
> Can the returned cpu mask go away?

No, the cpu_online_mask itself has static storage, the contents OTOH can
change at will. Very little practical difference :-)
Michal Hocko Sept. 23, 2019, 4:52 p.m. UTC | #4
On Mon 23-09-19 17:48:52, Peter Zijlstra wrote:
> On Mon, Sep 23, 2019 at 05:28:56PM +0200, Michal Hocko wrote:
> > On Mon 23-09-19 17:15:19, Peter Zijlstra wrote:
> 
> > > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > > index 4123100e..9859acb 100644
> > > > --- a/arch/x86/mm/numa.c
> > > > +++ b/arch/x86/mm/numa.c
> > > > @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
> > > >   */
> > > >  const struct cpumask *cpumask_of_node(int node)
> > > >  {
> > > > +	if (node == NUMA_NO_NODE)
> > > > +		return cpu_online_mask;
> > > 
> > > This mandates the caller holds cpus_read_lock() or something, I'm pretty
> > > sure that if I put:
> > > 
> > > 	lockdep_assert_cpus_held();
> > 
> > Is this documented somewhere?
> 
> No idea... common sense :-)

I thought that and cpu hotplug were forbidden to be used in the same
sentence :p

> > Also how does that differ from a normal
> > case when a proper node is used? The cpumask will always be dynamic in
> > the cpu hotplug presence, right?
> 
> As per normal yes, and I'm fairly sure there's a ton of bugs. Any
> 'online' state is subject to change except when you're holding
> sufficient locks to stop it.
> 
> Disabling preemption also stabilizes it, because cpu unplug relies on
> stop-machine.

OK, I guess it is fair to document that callers should be careful when
using this if they absolutely need stability. But I strongly suspect
they simply do not care all that much. They mostly care to have
something that gives them an idea of which CPUs are close to the device,
and they can tolerate some race.

In other words this is more of an optimization than a correctness issue.
 
> > > here, it comes apart real quick. Without holding the cpu hotplug lock,
> > > the online mask is gibberish.
> > 
> > Can the returned cpu mask go away?
> 
> No, the cpu_online_mask itself has static storage, the contents OTOH can
> change at will. Very little practical difference :-)
 
OK, thanks for the confirmation. I was worried that I've overlooked
something.

To the NUMA_NO_NODE itself. Your earlier email noted:
: > +
: >  	if ((unsigned)node >= nr_node_ids) {
: >  		printk(KERN_WARNING
: >  			"cpumask_of_node(%d): (unsigned)node >= nr_node_ids(%u)\n",
: 
: I still think this makes absolutely no sense what so ever.

Did you mean the NUMA_NO_NODE handling or the specific node >= nr_node_ids
check?

Because, as to NUMA_NO_NODE, I believe this makes sense: it is the only
way to express that a device is not bound to any numa node. Even the
ACPI standard considers this optional. Yunsheng Lin has referred to the
specific part of the standard in one of the earlier discussions.
Trying to guess the node affinity is worse than providing all CPUs IMHO.
Peter Zijlstra Sept. 23, 2019, 8:34 p.m. UTC | #5
On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> On Mon 23-09-19 17:48:52, Peter Zijlstra wrote:

> To the NUMA_NO_NODE itself. Your earlier email noted:
> : > +
> : >  	if ((unsigned)node >= nr_node_ids) {
> : >  		printk(KERN_WARNING
> : >  			"cpumask_of_node(%d): (unsigned)node >= nr_node_ids(%u)\n",
> : 
> : I still think this makes absolutely no sense what so ever.
> 
> Did you mean the NUMA_NO_NODE handling or the specific node >= nr_node_ids
> check?

The NUMA_NO_NODE thing. It is a physical impossibility. And if the
device description doesn't give us a node, then the description is
incomplete and wrong and we should bloody well complain about it.

> Because as to NUMA_NO_NODE I believe this makes sense because this is
> the only way that a device is not bound to any numa node.

Which is a physical impossibility.

> I even the
> ACPI standard is considering this optional. Yunsheng Lin has referred to
> the specific part of the standard in one of the earlier discussions.
> Trying to guess the node affinity is worse than providing all CPUs IMHO.

I'm saying the ACPI standard is wrong. Explain to me how it is
physically possible to have a device without NUMA affinity in a NUMA
system?

 1) The fundamental interconnect is not uniform.
 2) The device needs to actually be somewhere.

From these it seems to follow that access to the device is subject to
NUMA.
Yunsheng Lin Sept. 24, 2019, 1:29 a.m. UTC | #6
On 2019/9/24 4:34, Peter Zijlstra wrote:
> On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
>> On Mon 23-09-19 17:48:52, Peter Zijlstra wrote:
> 
>> To the NUMA_NO_NODE itself. Your earlier email noted:
>> : > +
>> : >  	if ((unsigned)node >= nr_node_ids) {
>> : >  		printk(KERN_WARNING
>> : >  			"cpumask_of_node(%d): (unsigned)node >= nr_node_ids(%u)\n",
>> : 
>> : I still think this makes absolutely no sense what so ever.
>>
>> Did you mean the NUMA_NO_NODE handling or the specific node >= nr_node_ids
>> check?
> 
> The NUMA_NO_NODE thing. It's is physical impossibility. And if the
> device description doesn't give us a node, then the description is
> incomplete and wrong and we should bloody well complain about it.
> 
>> Because as to NUMA_NO_NODE I believe this makes sense because this is
>> the only way that a device is not bound to any numa node.
> 
> Which is a physical impossibility.
> 
>> I even the
>> ACPI standard is considering this optional. Yunsheng Lin has referred to
>> the specific part of the standard in one of the earlier discussions.
>> Trying to guess the node affinity is worse than providing all CPUs IMHO.
> 
> I'm saying the ACPI standard is wrong. Explain to me how it is
> physically possible to have a device without NUMA affinity in a NUMA
> system?
> 
>  1) The fundamental interconnect is not uniform.
>  2) The device needs to actually be somewhere.
> 

From what I can see, NUMA_NO_NODE may make sense in the below cases:

1) Theoretically, there could be a device that can access all the memory
uniformly and can be accessed by all cpus uniformly even in a NUMA system.
Suppose we have two nodes, and the device just sits in the middle of the
interconnect between the two nodes.

Even if we define a third node solely for the device, we would still need
to look at the node distances to decide that the device can be accessed
uniformly.

Or we can declare that the device is accessed uniformly by setting
its node to NUMA_NO_NODE.


2) Many virtual devices, such as the tun or loopback netdevices, are
also accessed uniformly by all cpus.
Michal Hocko Sept. 24, 2019, 7:47 a.m. UTC | #7
On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
[...]
> > I even the
> > ACPI standard is considering this optional. Yunsheng Lin has referred to
> > the specific part of the standard in one of the earlier discussions.
> > Trying to guess the node affinity is worse than providing all CPUs IMHO.
> 
> I'm saying the ACPI standard is wrong.

Even if you were right on this, the reality is that HW is likely to
follow that standard, and we cannot rule out NUMA_NO_NODE being
specified. As of now we would access beyond the defined array, and that
is clearly a bug.

Let's assume for a moment that this really is a bug. What are you going
to do about it? BUG_ON? I do not really see any solution besides either
providing something sensible or BUG_ON. If you are worried about a
conditional, then this should be pretty easy to solve by starting the
array at index -1 and associating that entry with the online cpu mask.
Peter Zijlstra Sept. 24, 2019, 9:17 a.m. UTC | #8
On Tue, Sep 24, 2019 at 09:47:51AM +0200, Michal Hocko wrote:
> On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> > On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> [...]
> > > I even the
> > > ACPI standard is considering this optional. Yunsheng Lin has referred to
> > > the specific part of the standard in one of the earlier discussions.
> > > Trying to guess the node affinity is worse than providing all CPUs IMHO.
> > 
> > I'm saying the ACPI standard is wrong.
> 
> Even if you were right on this the reality is that a HW is likely to
> follow that standard and we cannot rule out NUMA_NO_NODE being
> specified. As of now we would access beyond the defined array and that
> is clearly a bug.

Right, because the device node is wrong, so we fix _that_!

> Let's assume that this is really a bug for a moment. What are you going
> to do about that? BUG_ON? I do not really see any solution besides to either
> provide something sensible or BUG_ON. If you are worried about a
> conditional then this should be pretty easy to solve by starting the
> array at -1 index and associate it with the online cpu mask.

The same thing I proposed earlier; force the device node to 0 (or any
other convenient random valid value) and issue a FW_BUG message to the
console.
Peter Zijlstra Sept. 24, 2019, 9:25 a.m. UTC | #9
On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
> On 2019/9/24 4:34, Peter Zijlstra wrote:

> > I'm saying the ACPI standard is wrong. Explain to me how it is
> > physically possible to have a device without NUMA affinity in a NUMA
> > system?
> > 
> >  1) The fundamental interconnect is not uniform.
> >  2) The device needs to actually be somewhere.
> > 
> 
> From what I can see, NUMA_NO_NODE may make sense in the below case:
> 
> 1) Theoretically, there would be a device that can access all the memory
> uniformly and can be accessed by all cpus uniformly even in a NUMA system.
> Suppose we have two nodes, and the device just sit in the middle of the
> interconnect between the two nodes.
> 
> Even we define a third node solely for the device, we may need to look at
> the node distance to decide the device can be accessed uniformly.
> 
> Or we can decide that the device can be accessed uniformly by setting
> it's node to NUMA_NO_NODE.

This is indeed a theoretical case; it doesn't scale. The moment you're
adding multiple sockets or even board interconnects this all goes out
the window.

And in this case, forcing the device to either node is fine.

> 2) For many virtual deivces, such as tun or loopback netdevice, they
> are also accessed uniformly by all cpus.

Not true; the virtual device will sit in memory local to some node.

And as with physical devices, you probably want at least one (virtual)
queue per node.
Michal Hocko Sept. 24, 2019, 10:56 a.m. UTC | #10
On Tue 24-09-19 11:17:14, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 09:47:51AM +0200, Michal Hocko wrote:
> > On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> > > On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> > [...]
> > > > I even the
> > > > ACPI standard is considering this optional. Yunsheng Lin has referred to
> > > > the specific part of the standard in one of the earlier discussions.
> > > > Trying to guess the node affinity is worse than providing all CPUs IMHO.
> > > 
> > > I'm saying the ACPI standard is wrong.
> > 
> > Even if you were right on this the reality is that a HW is likely to
> > follow that standard and we cannot rule out NUMA_NO_NODE being
> > specified. As of now we would access beyond the defined array and that
> > is clearly a bug.
> 
> Right, because the device node is wrong, so we fix _that_!
> 
> > Let's assume that this is really a bug for a moment. What are you going
> > to do about that? BUG_ON? I do not really see any solution besides to either
> > provide something sensible or BUG_ON. If you are worried about a
> > conditional then this should be pretty easy to solve by starting the
> > array at -1 index and associate it with the online cpu mask.
> 
> The same thing I proposed earlier; force the device node to 0 (or any
> other convenient random valid value) and issue a FW_BUG message to the
> console.

Why would you "fix" anything and how do you know that node 0 is the
right choice? I have seen setups with node 0 without any memory and
similar unexpected things.

To be honest, I really fail to see why you object to the simple semantics
that NUMA_NO_NODE implies all usable cpus. Could you explain that please?
Yunsheng Lin Sept. 24, 2019, 11:07 a.m. UTC | #11
On 2019/9/24 17:25, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
>> On 2019/9/24 4:34, Peter Zijlstra wrote:
> 
>>> I'm saying the ACPI standard is wrong. Explain to me how it is
>>> physically possible to have a device without NUMA affinity in a NUMA
>>> system?
>>>
>>>  1) The fundamental interconnect is not uniform.
>>>  2) The device needs to actually be somewhere.
>>>
>>
>> From what I can see, NUMA_NO_NODE may make sense in the below case:
>>
>> 1) Theoretically, there would be a device that can access all the memory
>> uniformly and can be accessed by all cpus uniformly even in a NUMA system.
>> Suppose we have two nodes, and the device just sit in the middle of the
>> interconnect between the two nodes.
>>
>> Even we define a third node solely for the device, we may need to look at
>> the node distance to decide the device can be accessed uniformly.
>>
>> Or we can decide that the device can be accessed uniformly by setting
>> it's node to NUMA_NO_NODE.
> 
> This is indeed a theoretical case; it doesn't scale. The moment you're
> adding multiple sockets or even board interconnects this all goes out
> the window.
> 
> And in this case, forcing the device to either node is fine.

Not really.
For packet sending and receiving, the buffer memory may be allocated
dynamically. The node of the tx buffer memory is mainly based on the
cpu that is doing the sending, the node of the rx buffer memory is
mainly based on the cpu the device's interrupt handler is running on,
and the device's interrupt affinity is mainly based on the node id of
the device.

We can bind the processes that are using the device to both nodes
in order to utilize memory on both nodes for packet sending.

But for packet receiving, node 1 may not be used, because the node
id of the device is forced to node 0 and the default is to bind
the interrupt to a cpu on the same node.

If node_to_cpumask_map() returns all usable cpus when the device's node
id is NUMA_NO_NODE, then the interrupt can be binded to the cpus on both nodes.

> 
>> 2) For many virtual deivces, such as tun or loopback netdevice, they
>> are also accessed uniformly by all cpus.
> 
> Not true; the virtual device will sit in memory local to some node.
> 
> And as with physical devices, you probably want at least one (virtual)
> queue per node.

There may be similar handling as above for virtual devices too.

Peter Zijlstra Sept. 24, 2019, 11:23 a.m. UTC | #12
On Tue, Sep 24, 2019 at 12:56:22PM +0200, Michal Hocko wrote:
> On Tue 24-09-19 11:17:14, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 09:47:51AM +0200, Michal Hocko wrote:
> > > On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> > > > On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > I even the
> > > > > ACPI standard is considering this optional. Yunsheng Lin has referred to
> > > > > the specific part of the standard in one of the earlier discussions.
> > > > > Trying to guess the node affinity is worse than providing all CPUs IMHO.
> > > > 
> > > > I'm saying the ACPI standard is wrong.
> > > 
> > > Even if you were right on this the reality is that a HW is likely to
> > > follow that standard and we cannot rule out NUMA_NO_NODE being
> > > specified. As of now we would access beyond the defined array and that
> > > is clearly a bug.
> > 
> > Right, because the device node is wrong, so we fix _that_!
> > 
> > > Let's assume that this is really a bug for a moment. What are you going
> > > to do about that? BUG_ON? I do not really see any solution besides to either
> > > provide something sensible or BUG_ON. If you are worried about a
> > > conditional then this should be pretty easy to solve by starting the
> > > array at -1 index and associate it with the online cpu mask.
> > 
> > The same thing I proposed earlier; force the device node to 0 (or any
> > other convenient random valid value) and issue a FW_BUG message to the
> > console.
> 
> Why would you "fix" anything and how do you know that node 0 is the
> right choice? I have seen setups with node 0 without any memory and
> similar unexpected things.

We don't know 0 is right; but we know 'unknown' is wrong, so we FW_BUG
and pick _something_.

> To be honest I really fail to see why to object to a simple semantic
> that NUMA_NO_NODE imply all usable cpus. Could you explain that please?

Because it feels wrong. The device needs to be _somewhere_. It simply
cannot be node-less.
Peter Zijlstra Sept. 24, 2019, 11:28 a.m. UTC | #13
On Tue, Sep 24, 2019 at 07:07:36PM +0800, Yunsheng Lin wrote:
> On 2019/9/24 17:25, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
> >> On 2019/9/24 4:34, Peter Zijlstra wrote:
> > 
> >>> I'm saying the ACPI standard is wrong. Explain to me how it is
> >>> physically possible to have a device without NUMA affinity in a NUMA
> >>> system?
> >>>
> >>>  1) The fundamental interconnect is not uniform.
> >>>  2) The device needs to actually be somewhere.
> >>>
> >>
> >> From what I can see, NUMA_NO_NODE may make sense in the below case:
> >>
> >> 1) Theoretically, there would be a device that can access all the memory
> >> uniformly and can be accessed by all cpus uniformly even in a NUMA system.
> >> Suppose we have two nodes, and the device just sit in the middle of the
> >> interconnect between the two nodes.
> >>
> >> Even we define a third node solely for the device, we may need to look at
> >> the node distance to decide the device can be accessed uniformly.
> >>
> >> Or we can decide that the device can be accessed uniformly by setting
> >> it's node to NUMA_NO_NODE.
> > 
> > This is indeed a theoretical case; it doesn't scale. The moment you're
> > adding multiple sockets or even board interconnects this all goes out
> > the window.
> > 
> > And in this case, forcing the device to either node is fine.
> 
> Not really.
> For packet sending and receiving, the buffer memory may be allocated
> dynamically. Node of tx buffer memory is mainly based on the cpu
> that is sending sending, node of rx buffer memory is mainly based on
> the cpu the interrupt handler of the device is running on, and the
> device' interrupt affinity is mainly based on node id of the device.
> 
> We can bind the processes that are using the device to both nodes
> in order to utilize memory on both nodes for packet sending.
> 
> But for packet receiving, the node1 may not be used becuase the node
> id of device is forced to node 0, which is the default way to bind
> the interrupt to the cpu of the same node.
> 
> If node_to_cpumask_map() returns all usable cpus when the device's node
> id is NUMA_NO_NODE, then interrupt can be binded to the cpus on both nodes.

s/binded/bound/

Sure; the data can be allocated wherever, but the control structures are
not dynamically allocated every time. They are persistent, and they will
be local to some node.

Anyway, are you saying this stupid corner case is actually relevant?
Because how does it scale out? What if you have 8 sockets, with each
socket having 2 nodes and 1 such magic device. Then returning all CPUs
is just plain wrong.

> >> 2) For many virtual deivces, such as tun or loopback netdevice, they
> >> are also accessed uniformly by all cpus.
> > 
> > Not true; the virtual device will sit in memory local to some node.
> > 
> > And as with physical devices, you probably want at least one (virtual)
> > queue per node.
> 
> There may be similar handling as above for virtual device too.

And it'd be similarly broken.
Yunsheng Lin Sept. 24, 2019, 11:44 a.m. UTC | #14
On 2019/9/24 19:28, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 07:07:36PM +0800, Yunsheng Lin wrote:
>> On 2019/9/24 17:25, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
>>>> On 2019/9/24 4:34, Peter Zijlstra wrote:
>>>
>>>>> I'm saying the ACPI standard is wrong. Explain to me how it is
>>>>> physically possible to have a device without NUMA affinity in a NUMA
>>>>> system?
>>>>>
>>>>>  1) The fundamental interconnect is not uniform.
>>>>>  2) The device needs to actually be somewhere.
>>>>>
>>>>
>>>> From what I can see, NUMA_NO_NODE may make sense in the below case:
>>>>
>>>> 1) Theoretically, there would be a device that can access all the memory
>>>> uniformly and can be accessed by all cpus uniformly even in a NUMA system.
>>>> Suppose we have two nodes, and the device just sit in the middle of the
>>>> interconnect between the two nodes.
>>>>
>>>> Even we define a third node solely for the device, we may need to look at
>>>> the node distance to decide the device can be accessed uniformly.
>>>>
>>>> Or we can decide that the device can be accessed uniformly by setting
>>>> it's node to NUMA_NO_NODE.
>>>
>>> This is indeed a theoretical case; it doesn't scale. The moment you're
>>> adding multiple sockets or even board interconnects this all goes out
>>> the window.
>>>
>>> And in this case, forcing the device to either node is fine.
>>
>> Not really.
>> For packet sending and receiving, the buffer memory may be allocated
>> dynamically. Node of tx buffer memory is mainly based on the cpu
>> that is sending sending, node of rx buffer memory is mainly based on
>> the cpu the interrupt handler of the device is running on, and the
>> device' interrupt affinity is mainly based on node id of the device.
>>
>> We can bind the processes that are using the device to both nodes
>> in order to utilize memory on both nodes for packet sending.
>>
>> But for packet receiving, the node1 may not be used becuase the node
>> id of device is forced to node 0, which is the default way to bind
>> the interrupt to the cpu of the same node.
>>
>> If node_to_cpumask_map() returns all usable cpus when the device's node
>> id is NUMA_NO_NODE, then interrupt can be binded to the cpus on both nodes.
> 
> s/binded/bound/
> 
> Sure; the data can be allocated wherever, but the control structures are
> not dynamically allocated every time. They are persistent, and they will
> be local to some node.
> 
> Anyway, are you saying this stupid corner case is actually relevant?
> Because how does it scale out? What if you have 8 sockets, with each
> socket having 2 nodes and 1 such magic device. Then returning all CPUs
> is just plain wrong.

Yes, the hardware may not scale out, but what about the virtual device?

> 
>>>> 2) For many virtual deivces, such as tun or loopback netdevice, they
>>>> are also accessed uniformly by all cpus.
>>>
>>> Not true; the virtual device will sit in memory local to some node.
>>>
>>> And as with physical devices, you probably want at least one (virtual)
>>> queue per node.
>>
>> There may be similar handling as above for virtual device too.
> 
> And it'd be similarly broken.

From [1], there are a lot of devices with a node id of NUMA_NO_NODE that
trigger the FW_BUG warning.

[1] https://lore.kernel.org/lkml/5a188e2b-6c07-a9db-fbaa-561e9362d3ba@huawei.com/


Michal Hocko Sept. 24, 2019, 11:54 a.m. UTC | #15
On Tue 24-09-19 13:23:49, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 12:56:22PM +0200, Michal Hocko wrote:
[...]
> > To be honest I really fail to see why you object to a simple semantic
> > where NUMA_NO_NODE implies all usable cpus. Could you explain that please?
> 
> Because it feels wrong. The device needs to be _somewhere_. It simply
> cannot be node-less.

What if it doesn't have any numa preference for whatever reason? There
is no other way to express that than NUMA_NO_NODE.

Anyway, I am not going to argue more about this because it seems more of
a discussion about "HW shouldn't be doing that although the specification
allows it", which cannot really have any outcome except "feels
correct/wrong".

If you really feel strongly about this then we should think of a proper
way to prevent this from happening, because an out-of-bounds access is
certainly not something we really want, right?
Peter Zijlstra Sept. 24, 2019, 11:58 a.m. UTC | #16
On Tue, Sep 24, 2019 at 07:44:28PM +0800, Yunsheng Lin wrote:
> From [1], there are a lot of devices with a node id of NUMA_NO_NODE that
> trigger the FW_BUG warning.
> 
> [1] https://lore.kernel.org/lkml/5a188e2b-6c07-a9db-fbaa-561e9362d3ba@huawei.com/

So aside from all the software devices, which we can (and really should)
fix, these:

26.470076]  pci0000:00: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
26.815436]  pci0000:7b: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
27.004447]  pci0000:7a: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
27.236797]  pci0000:78: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
27.505833]  pci0000:7c: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
28.056452]  pci0000:74: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
28.470018]  pci0000:80: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
28.726411]  pci0000:bb: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
28.916656]  pci0000:ba: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
29.152839]  pci0000:b8: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
29.425808]  pci0000:bc: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
29.718593]  pci0000:b4: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.

look like actual problems. How can PCI devices not have a node assigned?
Yunsheng Lin Sept. 24, 2019, 12:09 p.m. UTC | #17
On 2019/9/24 19:58, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 07:44:28PM +0800, Yunsheng Lin wrote:
>> From [1], there are a lot of devices with a node id of NUMA_NO_NODE that
>> trigger the FW_BUG warning.
>>
>> [1] https://lore.kernel.org/lkml/5a188e2b-6c07-a9db-fbaa-561e9362d3ba@huawei.com/
> 
> So aside from all the software devices, which we can (and really should)
> fix, these:
> 
> 26.470076]  pci0000:00: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 26.815436]  pci0000:7b: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 27.004447]  pci0000:7a: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 27.236797]  pci0000:78: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 27.505833]  pci0000:7c: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 28.056452]  pci0000:74: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 28.470018]  pci0000:80: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 28.726411]  pci0000:bb: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 28.916656]  pci0000:ba: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 29.152839]  pci0000:b8: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 29.425808]  pci0000:bc: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 29.718593]  pci0000:b4: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 
> look like actual problems. How can PCI devices not have a node assigned?

The above PCI devices do not have a node assigned because I downgraded
the BIOS to an older version that does not implement the "Proximity Domain"
feature specified by ACPI. Since the feature is optional, the BIOS vendor
denied that it is a BIOS bug.

Peter Zijlstra Sept. 24, 2019, 12:09 p.m. UTC | #18
On Tue, Sep 24, 2019 at 01:54:01PM +0200, Michal Hocko wrote:
> On Tue 24-09-19 13:23:49, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 12:56:22PM +0200, Michal Hocko wrote:
> [...]
> > > To be honest I really fail to see why you object to a simple semantic
> > > where NUMA_NO_NODE implies all usable cpus. Could you explain that please?
> > 
> > Because it feels wrong. The device needs to be _somewhere_. It simply
> > cannot be node-less.
> 
> What if it doesn't have any numa preference for whatever reason? There
> is no other way to express that than NUMA_NO_NODE.

Like I said; how does that physically work? The device needs to be
somewhere. It _must_ have a preference.

> Anyway, I am not going to argue more about this because it seems more of
> a discussion about "HW shouldn't be doing that although the specification
> allows it", which cannot really have any outcome except "feels
> correct/wrong".

We can push back and say we don't respect the specification because it
is batshit insane ;-)

> If you really feel strongly about this then we should think of a proper
> way to prevent this from happening, because an out-of-bounds access is
> certainly not something we really want, right?

I just genuinely don't understand it. And I refuse to duct tape it.

And as shown in that email here:

  https://lkml.kernel.org/r/5a188e2b-6c07-a9db-fbaa-561e9362d3ba@huawei.com

there is a ton of broken...

15.061682] node node0: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
...
15.285602] node node3: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.

15.360241] cpu cpu0: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
...
24.768305] cpu cpu127: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.

39.623339] clockevents clockevent0: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
...
48.769530] clockevents clockevent127: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.

That's all broken for no reason.. those things actually _have_ a trivial
node affinity.

By silently accepting we let this stuff fester.

Now granted; there's a number of virtual devices that really don't have
a node affinity, but then, those are not hurt by forcing them onto a
random node, they really don't do anything. Like:

48.913502] event_source armv8_pmuv3_0: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
48.985462] event_source breakpoint: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
49.057120] event_source uprobe: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
49.128431] event_source kprobe: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
49.199742] event_source tracepoint: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
49.271399] event_source software: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.

That's just fake devices to get a sysfs entry.
Michal Hocko Sept. 24, 2019, 12:25 p.m. UTC | #19
On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 01:54:01PM +0200, Michal Hocko wrote:
> > On Tue 24-09-19 13:23:49, Peter Zijlstra wrote:
> > > On Tue, Sep 24, 2019 at 12:56:22PM +0200, Michal Hocko wrote:
> > [...]
> > > > To be honest I really fail to see why you object to a simple semantic
> > > > where NUMA_NO_NODE implies all usable cpus. Could you explain that please?
> > > 
> > > Because it feels wrong. The device needs to be _somewhere_. It simply
> > > cannot be node-less.
> > 
> > What if it doesn't have any numa preference for whatever reason? There
> > is no other way to express that than NUMA_NO_NODE.
> 
> Like I said; how does that physically work? The device needs to be
> somewhere. It _must_ have a preference.
> 
> > Anyway, I am not going to argue more about this because it seems more of
> > a discussion about "HW shouldn't be doing that although the specification
> > allows it", which cannot really have any outcome except "feels
> > correct/wrong".
> 
> We can push back and say we don't respect the specification because it
> is batshit insane ;-)

Here is my fingers crossed.

[...]

> Now granted; there's a number of virtual devices that really don't have
> a node affinity, but then, those are not hurt by forcing them onto a
> random node, they really don't do anything. Like:

Do you really consider a random node a better fix than simply living
with a more robust NUMA_NO_NODE which tells the actual state? Page
allocator would effectively use the local node in that case. Any code
using the cpumask will know that any of the online cpus are usable.

Compare that to a wild guess that might be easily wrong and have subtle
side effects which are really hard to debug. You will only see a higher
utilization on a specific node. Good luck with a bug report like that.

Anyway, I really do not feel strongly about that. If you really consider
it a bad idea then I can live with that. This just felt easier and
reasonably consistent to address. Implementing the guessing and fighting
vendors who really do not feel like providing a real affinity sounds
harder and more error-prone.
Peter Zijlstra Sept. 24, 2019, 12:43 p.m. UTC | #20
On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
> On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:

> > We can push back and say we don't respect the specification because it
> > is batshit insane ;-)
> 
> Here is my fingers crossed.
> 
> [...]
> 
> > Now granted; there's a number of virtual devices that really don't have
> > a node affinity, but then, those are not hurt by forcing them onto a
> > random node, they really don't do anything. Like:
> 
> Do you really consider a random node a better fix than simply living
> with a more robust NUMA_NO_NODE which tells the actual state? Page
> > allocator would effectively use the local node in that case. Any code
> using the cpumask will know that any of the online cpus are usable.

For the pmu devices? Yes, those 'devices' aren't actually used for
anything other than sysfs entries.

Nothing else uses the struct device.

> Compare that to a wild guess that might be easily wrong and have subtle
> side effects which are really hard to debug. You will only see a higher
> utilization on a specific node. Good luck with a bug report like that.

We'd have the FW_BUG in the dmesg, which should be a big fat clue.
Peter Zijlstra Sept. 24, 2019, 12:59 p.m. UTC | #21
On Tue, Sep 24, 2019 at 02:43:25PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
> > On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
> 
> > > We can push back and say we don't respect the specification because it
> > > is batshit insane ;-)
> > 
> > Here is my fingers crossed.
> > 
> > [...]
> > 
> > > Now granted; there's a number of virtual devices that really don't have
> > > a node affinity, but then, those are not hurt by forcing them onto a
> > > random node, they really don't do anything. Like:
> > 
> > Do you really consider a random node a better fix than simply living
> > with a more robust NUMA_NO_NODE which tells the actual state? Page
> > > allocator would effectively use the local node in that case. Any code
> > using the cpumask will know that any of the online cpus are usable.
> 
> For the pmu devices? Yes, those 'devices' aren't actually used for
> anything other than sysfs entries.
> 
> Nothing else uses the struct device.

The below would get rid of the PMU and workqueue warnings with no
side-effects (the device isn't used for anything except sysfs).

I'm stuck in the device code for BDIs; I can't find a sane place to set
the node before it gets added, due to it using device_create_vargs().

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4f08b17d6426..2a64dcc3d70f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9965,6 +9965,7 @@ static int pmu_dev_alloc(struct pmu *pmu)
 	if (!pmu->dev)
 		goto out;
 
+	set_dev_node(pmu->dev, 0);
 	pmu->dev->groups = pmu->attr_groups;
 	device_initialize(pmu->dev);
 	ret = dev_set_name(pmu->dev, "%s", pmu->name);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bc2e09a8ea61..efafc4590bbe 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5613,6 +5613,7 @@ int workqueue_sysfs_register(struct workqueue_struct *wq)
 	wq_dev->dev.bus = &wq_subsys;
 	wq_dev->dev.release = wq_device_release;
 	dev_set_name(&wq_dev->dev, "%s", wq->name);
+	set_dev_node(&wq_dev->dev, 0);
 
 	/*
 	 * unbound_attrs are created separately.  Suppress uevent until
Michal Hocko Sept. 24, 2019, 1:19 p.m. UTC | #22
On Tue 24-09-19 14:59:36, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 02:43:25PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
> > > On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
> > 
> > > > We can push back and say we don't respect the specification because it
> > > > is batshit insane ;-)
> > > 
> > > Here is my fingers crossed.
> > > 
> > > [...]
> > > 
> > > > Now granted; there's a number of virtual devices that really don't have
> > > > a node affinity, but then, those are not hurt by forcing them onto a
> > > > random node, they really don't do anything. Like:
> > > 
> > > Do you really consider a random node a better fix than simply living
> > > with a more robust NUMA_NO_NODE which tells the actual state? Page
> > > > allocator would effectively use the local node in that case. Any code
> > > using the cpumask will know that any of the online cpus are usable.
> > 
> > For the pmu devices? Yes, those 'devices' aren't actually used for
> > anything other than sysfs entries.
> > 
> > Nothing else uses the struct device.
> 
> The below would get rid of the PMU and workqueue warnings with no
> side-effects (the device isn't used for anything except sysfs).

Hardcoding to 0 is simply wrong if node 0 is cpuless, for example...
Yunsheng Lin Sept. 25, 2019, 9:14 a.m. UTC | #23
On 2019/9/24 21:19, Michal Hocko wrote:
> On Tue 24-09-19 14:59:36, Peter Zijlstra wrote:
>> On Tue, Sep 24, 2019 at 02:43:25PM +0200, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
>>>> On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
>>>
>>>>> We can push back and say we don't respect the specification because it
>>>>> is batshit insane ;-)
>>>>
>>>> Here is my fingers crossed.
>>>>
>>>> [...]
>>>>
>>>>> Now granted; there's a number of virtual devices that really don't have
>>>>> a node affinity, but then, those are not hurt by forcing them onto a
>>>>> random node, they really don't do anything. Like:
>>>>
>>>> Do you really consider a random node a better fix than simply living
>>>> with a more robust NUMA_NO_NODE which tells the actual state? Page
>>>> allocator would effectively use the local node in that case. Any code
>>>> using the cpumask will know that any of the online cpus are usable.
>>>
>>> For the pmu devices? Yes, those 'devices' aren't actually used for
>>> anything other than sysfs entries.
>>>
>>> Nothing else uses the struct device.
>>
>> The below would get rid of the PMU and workqueue warnings with no
>> side-effects (the device isn't used for anything except sysfs).
> 
> Hardcoding to 0 is simply wrong, if the node0 is cpuless for example...

Hi, Peter & Michal

From the discussion above, it seems that making node_to_cpumask_map()
NUMA_NO_NODE aware is the most feasible way to move forward.

Any suggestions?

>
Peter Zijlstra Sept. 25, 2019, 10:40 a.m. UTC | #24
On Tue, Sep 24, 2019 at 03:19:39PM +0200, Michal Hocko wrote:

> > The below would get rid of the PMU and workqueue warnings with no
> > side-effects (the device isn't used for anything except sysfs).
> 
> Hardcoding to 0 is simply wrong, if the node0 is cpuless for example...

It doesn't matter.... that 0 is _never_ used. These are fake devices,
and all we care about is getting rid of that error.

If it makes you feel better we can make it -2 and have dev_to_node()
WARN if it ever sees one.
Peter Zijlstra Sept. 25, 2019, 10:41 a.m. UTC | #25
On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
> From the discussion above, it seems that making node_to_cpumask_map()
> NUMA_NO_NODE aware is the most feasible way to move forward.

That's still wrong.
Michal Hocko Sept. 25, 2019, 1:25 p.m. UTC | #26
On Wed 25-09-19 12:40:40, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 03:19:39PM +0200, Michal Hocko wrote:
> 
> > > The below would get rid of the PMU and workqueue warnings with no
> > > side-effects (the device isn't used for anything except sysfs).
> > 
> > Hardcoding to 0 is simply wrong, if the node0 is cpuless for example...
> 
> It doesn't matter.... that 0 is _never_ used. These are fake devices,
> and all we care about is getting rid of that error.

That is a very subtle and hard-to-review assumption. Even if it holds
now, a future change might easily break it AFAIU. It also assumes that
you catch all such special devices.

I am sorry but I still do not understand why you consider this
whack-a-mole better than simply living with the fact that NUMA_NO_NODE is
a reality and that using the full cpu mask is a reasonable answer to that.
Anyway, I feel we are going in circles here, so I will leave the final
decision to you.

> If it makes you feel better we can make it -2 and have dev_to_node()
> WARN if it ever sees one.

That would help.
Peter Zijlstra Sept. 25, 2019, 4:31 p.m. UTC | #27
On Wed, Sep 25, 2019 at 03:25:44PM +0200, Michal Hocko wrote:
> I am sorry but I still do not understand why you consider this
> whack-a-mole better than simply living with the fact that NUMA_NO_NODE
> is a reality and that using the full cpu mask is a reasonable answer to that.

Because it doesn't make physical sense. A device _cannot_ be local to
all CPUs in a NUMA system.
Peter Zijlstra Sept. 25, 2019, 9:45 p.m. UTC | #28
On Wed, Sep 25, 2019 at 06:31:54PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2019 at 03:25:44PM +0200, Michal Hocko wrote:
> > I am sorry but I still do not understand why you consider this
> > whack-a-mole better than simply living with the fact that NUMA_NO_NODE
> > is a reality and that using the full cpu mask is a reasonable answer to that.
> 
> Because it doesn't make physical sense. A device _cannot_ be local to
> all CPUs in a NUMA system.

The below patch still gives a fair amount of noise on my fairly old and
crufty IVB-EP, but it gets rid of most of the simple stuff.

[    2.890739] [Firmware Bug]: device: 'platform': no node assigned on NUMA capable HW
[    2.901855] [Firmware Bug]: device: 'vtcon0': no node assigned on NUMA capable HW
[    2.911804] [Firmware Bug]: device: 'id': no node assigned on NUMA capable HW
[    3.800832] [Firmware Bug]: device: 'fbcon': no node assigned on NUMA capable HW
[    4.824808] [Firmware Bug]: device: 'LNXSYSTM:00': no node assigned on NUMA capable HW
[    5.112739] [Firmware Bug]: device: 'pci0000:00': no node assigned on NUMA capable HW
[    6.703425] [Firmware Bug]: device: 'pci0000:80': no node assigned on NUMA capable HW
[    7.049515] [Firmware Bug]: device: 'ACPI0004:00': no node assigned on NUMA capable HW
[    7.078823] [Firmware Bug]: device: 'ACPI0004:01': no node assigned on NUMA capable HW
[    7.149889] [Firmware Bug]: device: 'pci0000:7f': no node assigned on NUMA capable HW
[    7.158798] [Firmware Bug]: device: '0000:7f': no node assigned on NUMA capable HW
[    7.183796] [Firmware Bug]: device: '0000:7f:08.0': no node assigned on NUMA capable HW
[    7.199796] [Firmware Bug]: device: '0000:7f:09.0': no node assigned on NUMA capable HW
[    7.215792] [Firmware Bug]: device: '0000:7f:0a.0': no node assigned on NUMA capable HW
[    7.231791] [Firmware Bug]: device: '0000:7f:0a.1': no node assigned on NUMA capable HW
[    7.247793] [Firmware Bug]: device: '0000:7f:0a.2': no node assigned on NUMA capable HW
[    7.262794] [Firmware Bug]: device: '0000:7f:0a.3': no node assigned on NUMA capable HW
[    7.278789] [Firmware Bug]: device: '0000:7f:0b.0': no node assigned on NUMA capable HW
[    7.294787] [Firmware Bug]: device: '0000:7f:0b.3': no node assigned on NUMA capable HW
[    7.310794] [Firmware Bug]: device: '0000:7f:0c.0': no node assigned on NUMA capable HW
[    7.325796] [Firmware Bug]: device: '0000:7f:0c.1': no node assigned on NUMA capable HW
[    7.341790] [Firmware Bug]: device: '0000:7f:0c.2': no node assigned on NUMA capable HW
[    7.357789] [Firmware Bug]: device: '0000:7f:0c.3': no node assigned on NUMA capable HW
[    7.373789] [Firmware Bug]: device: '0000:7f:0c.4': no node assigned on NUMA capable HW
[    7.388789] [Firmware Bug]: device: '0000:7f:0d.0': no node assigned on NUMA capable HW
[    7.404791] [Firmware Bug]: device: '0000:7f:0d.1': no node assigned on NUMA capable HW
[    7.420789] [Firmware Bug]: device: '0000:7f:0d.2': no node assigned on NUMA capable HW
[    7.436790] [Firmware Bug]: device: '0000:7f:0d.3': no node assigned on NUMA capable HW
[    7.451789] [Firmware Bug]: device: '0000:7f:0d.4': no node assigned on NUMA capable HW
[    7.467799] [Firmware Bug]: device: '0000:7f:0e.0': no node assigned on NUMA capable HW
[    7.483797] [Firmware Bug]: device: '0000:7f:0e.1': no node assigned on NUMA capable HW
[    7.499830] [Firmware Bug]: device: '0000:7f:0f.0': no node assigned on NUMA capable HW
[    7.515825] [Firmware Bug]: device: '0000:7f:0f.1': no node assigned on NUMA capable HW
[    7.530823] [Firmware Bug]: device: '0000:7f:0f.2': no node assigned on NUMA capable HW
[    7.546824] [Firmware Bug]: device: '0000:7f:0f.3': no node assigned on NUMA capable HW
[    7.562823] [Firmware Bug]: device: '0000:7f:0f.4': no node assigned on NUMA capable HW
[    7.578822] [Firmware Bug]: device: '0000:7f:0f.5': no node assigned on NUMA capable HW
[    7.594830] [Firmware Bug]: device: '0000:7f:10.0': no node assigned on NUMA capable HW
[    7.609834] [Firmware Bug]: device: '0000:7f:10.1': no node assigned on NUMA capable HW
[    7.625825] [Firmware Bug]: device: '0000:7f:10.2': no node assigned on NUMA capable HW
[    7.641824] [Firmware Bug]: device: '0000:7f:10.3': no node assigned on NUMA capable HW
[    7.657825] [Firmware Bug]: device: '0000:7f:10.4': no node assigned on NUMA capable HW
[    7.673824] [Firmware Bug]: device: '0000:7f:10.5': no node assigned on NUMA capable HW
[    7.689792] [Firmware Bug]: device: '0000:7f:10.6': no node assigned on NUMA capable HW
[    7.704825] [Firmware Bug]: device: '0000:7f:10.7': no node assigned on NUMA capable HW
[    7.720791] [Firmware Bug]: device: '0000:7f:13.0': no node assigned on NUMA capable HW
[    7.736793] [Firmware Bug]: device: '0000:7f:13.1': no node assigned on NUMA capable HW
[    7.752791] [Firmware Bug]: device: '0000:7f:13.4': no node assigned on NUMA capable HW
[    7.767780] [Firmware Bug]: device: '0000:7f:13.5': no node assigned on NUMA capable HW
[    7.783793] [Firmware Bug]: device: '0000:7f:13.6': no node assigned on NUMA capable HW
[    7.799790] [Firmware Bug]: device: '0000:7f:16.0': no node assigned on NUMA capable HW
[    7.815789] [Firmware Bug]: device: '0000:7f:16.1': no node assigned on NUMA capable HW
[    7.831795] [Firmware Bug]: device: '0000:7f:16.2': no node assigned on NUMA capable HW
[    7.882888] [Firmware Bug]: device: 'pci0000:ff': no node assigned on NUMA capable HW
[    7.891785] [Firmware Bug]: device: '0000:ff': no node assigned on NUMA capable HW
[    7.916800] [Firmware Bug]: device: '0000:ff:08.0': no node assigned on NUMA capable HW
[    7.932798] [Firmware Bug]: device: '0000:ff:09.0': no node assigned on NUMA capable HW
[    7.948795] [Firmware Bug]: device: '0000:ff:0a.0': no node assigned on NUMA capable HW
[    7.964800] [Firmware Bug]: device: '0000:ff:0a.1': no node assigned on NUMA capable HW
[    7.979794] [Firmware Bug]: device: '0000:ff:0a.2': no node assigned on NUMA capable HW
[    7.995799] [Firmware Bug]: device: '0000:ff:0a.3': no node assigned on NUMA capable HW
[    8.011809] [Firmware Bug]: device: '0000:ff:0b.0': no node assigned on NUMA capable HW
[    8.027793] [Firmware Bug]: device: '0000:ff:0b.3': no node assigned on NUMA capable HW
[    8.042793] [Firmware Bug]: device: '0000:ff:0c.0': no node assigned on NUMA capable HW
[    8.058793] [Firmware Bug]: device: '0000:ff:0c.1': no node assigned on NUMA capable HW
[    8.074804] [Firmware Bug]: device: '0000:ff:0c.2': no node assigned on NUMA capable HW
[    8.090794] [Firmware Bug]: device: '0000:ff:0c.3': no node assigned on NUMA capable HW
[    8.105793] [Firmware Bug]: device: '0000:ff:0c.4': no node assigned on NUMA capable HW
[    8.121800] [Firmware Bug]: device: '0000:ff:0d.0': no node assigned on NUMA capable HW
[    8.137805] [Firmware Bug]: device: '0000:ff:0d.1': no node assigned on NUMA capable HW
[    8.153795] [Firmware Bug]: device: '0000:ff:0d.2': no node assigned on NUMA capable HW
[    8.169797] [Firmware Bug]: device: '0000:ff:0d.3': no node assigned on NUMA capable HW
[    8.184796] [Firmware Bug]: device: '0000:ff:0d.4': no node assigned on NUMA capable HW
[    8.200802] [Firmware Bug]: device: '0000:ff:0e.0': no node assigned on NUMA capable HW
[    8.216803] [Firmware Bug]: device: '0000:ff:0e.1': no node assigned on NUMA capable HW
[    8.232832] [Firmware Bug]: device: '0000:ff:0f.0': no node assigned on NUMA capable HW
[    8.248821] [Firmware Bug]: device: '0000:ff:0f.1': no node assigned on NUMA capable HW
[    8.263833] [Firmware Bug]: device: '0000:ff:0f.2': no node assigned on NUMA capable HW
[    8.279830] [Firmware Bug]: device: '0000:ff:0f.3': no node assigned on NUMA capable HW
[    8.295829] [Firmware Bug]: device: '0000:ff:0f.4': no node assigned on NUMA capable HW
[    8.311830] [Firmware Bug]: device: '0000:ff:0f.5': no node assigned on NUMA capable HW
[    8.327834] [Firmware Bug]: device: '0000:ff:10.0': no node assigned on NUMA capable HW
[    8.342830] [Firmware Bug]: device: '0000:ff:10.1': no node assigned on NUMA capable HW
[    8.358829] [Firmware Bug]: device: '0000:ff:10.2': no node assigned on NUMA capable HW
[    8.374830] [Firmware Bug]: device: '0000:ff:10.3': no node assigned on NUMA capable HW
[    8.390837] [Firmware Bug]: device: '0000:ff:10.4': no node assigned on NUMA capable HW
[    8.406831] [Firmware Bug]: device: '0000:ff:10.5': no node assigned on NUMA capable HW
[    8.421842] [Firmware Bug]: device: '0000:ff:10.6': no node assigned on NUMA capable HW
[    8.437830] [Firmware Bug]: device: '0000:ff:10.7': no node assigned on NUMA capable HW
[    8.453795] [Firmware Bug]: device: '0000:ff:13.0': no node assigned on NUMA capable HW
[    8.469794] [Firmware Bug]: device: '0000:ff:13.1': no node assigned on NUMA capable HW
[    8.485796] [Firmware Bug]: device: '0000:ff:13.4': no node assigned on NUMA capable HW
[    8.500797] [Firmware Bug]: device: '0000:ff:13.5': no node assigned on NUMA capable HW
[    8.516795] [Firmware Bug]: device: '0000:ff:13.6': no node assigned on NUMA capable HW
[    8.532794] [Firmware Bug]: device: '0000:ff:16.0': no node assigned on NUMA capable HW
[    8.548794] [Firmware Bug]: device: '0000:ff:16.1': no node assigned on NUMA capable HW
[    8.563777] [Firmware Bug]: device: '0000:ff:16.2': no node assigned on NUMA capable HW
[    8.668308] [Firmware Bug]: device: 'mc': no node assigned on NUMA capable HW
[    8.758913] [Firmware Bug]: device: 'lo': no node assigned on NUMA capable HW
[    8.767975] [Firmware Bug]: device: 'wmi_bus-PNP0C14:00': no node assigned on NUMA capable HW
[    9.065753] [Firmware Bug]: device: 'pnp0': no node assigned on NUMA capable HW
[   11.220722] [Firmware Bug]: device: 'cooling_device0': no node assigned on NUMA capable HW
[   11.240495] [Firmware Bug]: device: 'cooling_device1': no node assigned on NUMA capable HW
[   11.260546] [Firmware Bug]: device: 'cooling_device2': no node assigned on NUMA capable HW
[   11.280271] [Firmware Bug]: device: 'cooling_device3': no node assigned on NUMA capable HW
[   11.300384] [Firmware Bug]: device: 'cooling_device4': no node assigned on NUMA capable HW
[   11.320479] [Firmware Bug]: device: 'cooling_device5': no node assigned on NUMA capable HW
[   11.340206] [Firmware Bug]: device: 'cooling_device6': no node assigned on NUMA capable HW
[   11.360185] [Firmware Bug]: device: 'cooling_device7': no node assigned on NUMA capable HW
[   11.379938] [Firmware Bug]: device: 'cooling_device8': no node assigned on NUMA capable HW
[   11.399969] [Firmware Bug]: device: 'cooling_device9': no node assigned on NUMA capable HW
[   11.419789] [Firmware Bug]: device: 'cooling_device10': no node assigned on NUMA capable HW
[   11.440004] [Firmware Bug]: device: 'cooling_device11': no node assigned on NUMA capable HW
[   11.464025] [Firmware Bug]: device: 'cooling_device12': no node assigned on NUMA capable HW
[   11.484837] [Firmware Bug]: device: 'cooling_device13': no node assigned on NUMA capable HW
[   11.505052] [Firmware Bug]: device: 'cooling_device14': no node assigned on NUMA capable HW
[   11.524997] [Firmware Bug]: device: 'cooling_device15': no node assigned on NUMA capable HW
[   11.545174] [Firmware Bug]: device: 'cooling_device16': no node assigned on NUMA capable HW
[   11.565095] [Firmware Bug]: device: 'cooling_device17': no node assigned on NUMA capable HW
[   11.585328] [Firmware Bug]: device: 'cooling_device18': no node assigned on NUMA capable HW
[   11.605273] [Firmware Bug]: device: 'cooling_device19': no node assigned on NUMA capable HW
[   11.625487] [Firmware Bug]: device: 'cooling_device20': no node assigned on NUMA capable HW
[   11.645382] [Firmware Bug]: device: 'cooling_device21': no node assigned on NUMA capable HW
[   11.665511] [Firmware Bug]: device: 'cooling_device22': no node assigned on NUMA capable HW
[   11.685326] [Firmware Bug]: device: 'cooling_device23': no node assigned on NUMA capable HW
[   11.705528] [Firmware Bug]: device: 'cooling_device24': no node assigned on NUMA capable HW
[   11.725422] [Firmware Bug]: device: 'cooling_device25': no node assigned on NUMA capable HW
[   11.745623] [Firmware Bug]: device: 'cooling_device26': no node assigned on NUMA capable HW
[   11.765454] [Firmware Bug]: device: 'cooling_device27': no node assigned on NUMA capable HW
[   11.785643] [Firmware Bug]: device: 'cooling_device28': no node assigned on NUMA capable HW
[   11.805595] [Firmware Bug]: device: 'cooling_device29': no node assigned on NUMA capable HW
[   11.825786] [Firmware Bug]: device: 'cooling_device30': no node assigned on NUMA capable HW
[   11.845675] [Firmware Bug]: device: 'cooling_device31': no node assigned on NUMA capable HW
[   11.865875] [Firmware Bug]: device: 'cooling_device32': no node assigned on NUMA capable HW
[   11.885853] [Firmware Bug]: device: 'cooling_device33': no node assigned on NUMA capable HW
[   11.906039] [Firmware Bug]: device: 'cooling_device34': no node assigned on NUMA capable HW
[   11.925978] [Firmware Bug]: device: 'cooling_device35': no node assigned on NUMA capable HW
[   11.946160] [Firmware Bug]: device: 'cooling_device36': no node assigned on NUMA capable HW
[   11.966030] [Firmware Bug]: device: 'cooling_device37': no node assigned on NUMA capable HW
[   11.986297] [Firmware Bug]: device: 'cooling_device38': no node assigned on NUMA capable HW
[   12.006158] [Firmware Bug]: device: 'cooling_device39': no node assigned on NUMA capable HW
[   12.443956] [Firmware Bug]: device: 'ttm': no node assigned on NUMA capable HW
[   12.490234] [Firmware Bug]: device: 'vtcon1': no node assigned on NUMA capable HW
[   12.585094] [Firmware Bug]: device: 'vtcon0': no node assigned on NUMA capable HW
[   12.889374] [Firmware Bug]: device: 'loop0': no node assigned on NUMA capable HW
[   12.902623] [Firmware Bug]: device: 'loop1': no node assigned on NUMA capable HW
[   12.915193] [Firmware Bug]: device: 'loop2': no node assigned on NUMA capable HW
[   12.928095] [Firmware Bug]: device: 'loop3': no node assigned on NUMA capable HW
[   12.941188] [Firmware Bug]: device: 'loop4': no node assigned on NUMA capable HW
[   12.953799] [Firmware Bug]: device: 'loop5': no node assigned on NUMA capable HW
[   12.966731] [Firmware Bug]: device: 'loop6': no node assigned on NUMA capable HW
[   12.979575] [Firmware Bug]: device: 'loop7': no node assigned on NUMA capable HW
[   13.857948] [Firmware Bug]: device: 'mice': no node assigned on NUMA capable HW
[   13.997228] [Firmware Bug]: device: 'cooling_device40': no node assigned on NUMA capable HW
[   14.006845] [Firmware Bug]: device: 'thermal_zone0': no node assigned on NUMA capable HW
[   14.016649] [Firmware Bug]: device: 'thermal_zone1': no node assigned on NUMA capable HW
[   14.641660] [Firmware Bug]: device: 'timer': no node assigned on NUMA capable HW

The PCI bits indicate my BIOS is shit, I suspect 7f is socket-0 and ff
is socket-1, but I'm not sure how to tie that back up sanely, I _really_
don't know anything about PCI :/

The thermal crud is just that, crud that needs an overhaul. It creates a
'cooling_device' per CPU, but in the process loses all association to the
actual CPU number, and so we cannot sanely set the right node information,
even though thermal does use the devm_ APIs and thus setting the node
number is 'required'. Idem for the 'thermal_zone' ones: those are the
sockets, but they lose all relation to them in the process.

See how nice it is to be strict? You find bugs everywhere you look. Now
we get to fix them :-)

---
 arch/x86/kernel/cpu/mce/core.c |  1 +
 arch/x86/kernel/cpuid.c        |  1 +
 arch/x86/kernel/msr.c          |  1 +
 drivers/base/bus.c             |  4 +++-
 drivers/base/core.c            | 23 +++++++++++++++++++----
 drivers/base/cpu.c             |  4 +++-
 drivers/base/node.c            |  4 +++-
 drivers/char/mem.c             |  1 +
 drivers/char/misc.c            |  2 ++
 drivers/tty/tty_io.c           |  1 +
 drivers/tty/vt/vc_screen.c     |  3 +++
 drivers/tty/vt/vt.c            |  1 +
 include/linux/device.h         | 10 ++++++++--
 include/linux/numa.h           |  1 +
 kernel/events/core.c           |  1 +
 kernel/time/clockevents.c      |  2 +-
 kernel/time/clocksource.c      |  1 +
 kernel/workqueue.c             |  1 +
 mm/backing-dev.c               |  1 +
 19 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 743370ee4983..ce09413b7c39 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2098,6 +2098,7 @@ static void mce_enable_ce(void *all)
 static struct bus_type mce_subsys = {
 	.name		= "machinecheck",
 	.dev_name	= "machinecheck",
+	.no_devm	= true,
 };
 
 DEFINE_PER_CPU(struct device *, mce_device);
diff --git a/arch/x86/kernel/cpuid.c b/arch/x86/kernel/cpuid.c
index 3492aa36bf09..6d7d2a35e911 100644
--- a/arch/x86/kernel/cpuid.c
+++ b/arch/x86/kernel/cpuid.c
@@ -161,6 +161,7 @@ static int __init cpuid_init(void)
 		goto out_chrdev;
 	}
 	cpuid_class->devnode = cpuid_devnode;
+	cpuid_class->no_devm = true;
 
 	err = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/cpuid:online",
 				cpuid_device_create, cpuid_device_destroy);
diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c
index 3db2252b958d..e41add671596 100644
--- a/arch/x86/kernel/msr.c
+++ b/arch/x86/kernel/msr.c
@@ -210,6 +210,7 @@ static int __init msr_init(void)
 		goto out_chrdev;
 	}
 	msr_class->devnode = msr_devnode;
+	msr_class->no_devm = true;
 
 	err  = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/msr:online",
 				 msr_device_create, msr_device_destroy);
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index df3cac739813..96ee8c50ab8a 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -1142,7 +1142,9 @@ static int subsys_register(struct bus_type *subsys,
 	dev->groups = groups;
 	dev->release = system_root_device_release;
 
-	err = device_register(dev);
+	device_initialize(dev);
+	set_dev_node(dev, NUMA_INVALID_NODE);
+	err = device_add(dev);
 	if (err < 0)
 		goto err_dev_reg;
 
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 636058bbf48a..049f3c7dbdcb 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1670,7 +1670,7 @@ void device_initialize(struct device *dev)
 	spin_lock_init(&dev->devres_lock);
 	INIT_LIST_HEAD(&dev->devres_head);
 	device_pm_init(dev);
-	set_dev_node(dev, -1);
+	set_dev_node(dev, NUMA_NO_NODE);
 #ifdef CONFIG_GENERIC_MSI_IRQ
 	INIT_LIST_HEAD(&dev->msi_list);
 #endif
@@ -2056,9 +2056,24 @@ int device_add(struct device *dev)
 	if (kobj)
 		dev->kobj.parent = kobj;
 
-	/* use parent numa_node */
-	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
-		set_dev_node(dev, dev_to_node(parent));
+#ifdef CONFIG_NUMA
+	/* use parent node */
+	if (dev->numa_node == NUMA_NO_NODE && parent)
+		set_dev_node(dev, parent->numa_node);
+
+	if (dev->numa_node == NUMA_NO_NODE && dev->class && dev->class->no_devm)
+		set_dev_node(dev, NUMA_INVALID_NODE);
+
+	if (dev->numa_node == NUMA_NO_NODE && dev->bus && dev->bus->no_devm)
+		set_dev_node(dev, NUMA_INVALID_NODE);
+
+	/* verify 'real' devices have a node assigned on NUMA capable hardware */
+	if (nr_node_ids > 1 && dev->numa_node == NUMA_NO_NODE) {
+		pr_warn(FW_BUG "device: '%s': no node assigned on NUMA capable HW\n",
+				dev_name(dev));
+		set_dev_node(dev, 0); // assign 'random' valid node
+	}
+#endif
 
 	/* first, register with generic layer. */
 	/* we require the name to be set before, and pass NULL */
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index cc37511de866..6613aed4de17 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -381,7 +381,9 @@ int register_cpu(struct cpu *cpu, int num)
 	cpu->dev.groups = common_cpu_attr_groups;
 	if (cpu->hotpluggable)
 		cpu->dev.groups = hotplugable_cpu_attr_groups;
-	error = device_register(&cpu->dev);
+	device_initialize(&cpu->dev);
+	set_dev_node(&cpu->dev, cpu->node_id);
+	error = device_add(&cpu->dev);
 	if (error) {
 		put_device(&cpu->dev);
 		return error;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 75b7e6f6535b..153bf21cf7ce 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -616,7 +616,9 @@ static int register_node(struct node *node, int num)
 	node->dev.bus = &node_subsys;
 	node->dev.release = node_device_release;
 	node->dev.groups = node_dev_groups;
-	error = device_register(&node->dev);
+	device_initialize(&node->dev);
+	set_dev_node(&node->dev, num);
+	error = device_add(&node->dev);
 
 	if (error)
 		put_device(&node->dev);
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index b08dc50f9f26..848d704839f0 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -927,6 +927,7 @@ static int __init chr_dev_init(void)
 		return PTR_ERR(mem_class);
 
 	mem_class->devnode = mem_devnode;
+	mem_class->no_devm = true;
 	for (minor = 1; minor < ARRAY_SIZE(devlist); minor++) {
 		if (!devlist[minor].name)
 			continue;
diff --git a/drivers/char/misc.c b/drivers/char/misc.c
index f6a147427029..14ced3adfaad 100644
--- a/drivers/char/misc.c
+++ b/drivers/char/misc.c
@@ -274,6 +274,8 @@ static int __init misc_init(void)
 	if (IS_ERR(misc_class))
 		goto fail_remove;
 
+	misc_class->no_devm = true;
+
 	err = -EIO;
 	if (register_chrdev(MISC_MAJOR, "misc", &misc_fops))
 		goto fail_printk;
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 566728fbaf3c..28e4d531595f 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -3419,6 +3419,7 @@ static int __init tty_class_init(void)
 	if (IS_ERR(tty_class))
 		return PTR_ERR(tty_class);
 	tty_class->devnode = tty_devnode;
+	tty_class->no_devm = true;
 	return 0;
 }
 
diff --git a/drivers/tty/vt/vc_screen.c b/drivers/tty/vt/vc_screen.c
index 1f042346e722..b8fa9e6c0b71 100644
--- a/drivers/tty/vt/vc_screen.c
+++ b/drivers/tty/vt/vc_screen.c
@@ -735,6 +735,9 @@ int __init vcs_init(void)
 		panic("unable to get major %d for vcs device", VCS_MAJOR);
 	vc_class = class_create(THIS_MODULE, "vc");
 
+	if (vc_class)
+		vc_class->no_devm = true;
+
 	device_create(vc_class, NULL, MKDEV(VCS_MAJOR, 0), NULL, "vcs");
 	device_create(vc_class, NULL, MKDEV(VCS_MAJOR, 64), NULL, "vcsu");
 	device_create(vc_class, NULL, MKDEV(VCS_MAJOR, 128), NULL, "vcsa");
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 34aa39d1aed9..15add0f1b0d0 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -4138,6 +4138,7 @@ static int __init vtconsole_class_init(void)
 			PTR_ERR(vtconsole_class));
 		vtconsole_class = NULL;
 	}
+	vtconsole_class->no_devm = true;
 
 	/* Add system drivers to sysfs */
 	for (i = 0; i < MAX_NR_CON_DRIVER; i++) {
diff --git a/include/linux/device.h b/include/linux/device.h
index 6717adee33f0..6414788e642e 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -144,6 +144,7 @@ struct bus_type {
 	struct lock_class_key lock_key;
 
 	bool need_parent_lock;
+	bool no_devm;
 };
 
 extern int __must_check bus_register(struct bus_type *bus);
@@ -428,6 +429,8 @@ struct class {
 	const struct dev_pm_ops *pm;
 
 	struct subsys_private *p;
+
+	bool no_devm;
 };
 
 struct class_dev_iter {
@@ -1109,7 +1112,10 @@ int dev_set_name(struct device *dev, const char *name, ...);
 #ifdef CONFIG_NUMA
 static inline int dev_to_node(struct device *dev)
 {
-	return dev->numa_node;
+	int node = dev->numa_node;
+	if (WARN_ON(node == NUMA_INVALID_NODE))
+		node = 0; // 'random' valid node
+	return node;
 }
 static inline void set_dev_node(struct device *dev, int node)
 {
@@ -1118,7 +1124,7 @@ static inline void set_dev_node(struct device *dev, int node)
 #else
 static inline int dev_to_node(struct device *dev)
 {
-	return NUMA_NO_NODE;
+	return 0;
 }
 static inline void set_dev_node(struct device *dev, int node)
 {
diff --git a/include/linux/numa.h b/include/linux/numa.h
index 110b0e5d0fb0..7d8597f1d73a 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -12,5 +12,6 @@
 #define MAX_NUMNODES    (1 << NODES_SHIFT)
 
 #define	NUMA_NO_NODE	(-1)
+#define NUMA_INVALID_NODE (-2)
 
 #endif /* _LINUX_NUMA_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index caae4b7743b4..3bc081bfbb3f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9865,6 +9865,7 @@ static int pmu_bus_running;
 static struct bus_type pmu_bus = {
 	.name		= "event_source",
 	.dev_groups	= pmu_dev_groups,
+	.no_devm	= true,
 };
 
 static void pmu_dev_release(struct device *dev)
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index f5490222e134..27ce96638b96 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -663,6 +663,7 @@ void tick_cleanup_dead_cpu(int cpu)
 static struct bus_type clockevents_subsys = {
 	.name		= "clockevents",
 	.dev_name       = "clockevent",
+	.no_devm	= true,
 };
 
 static DEFINE_PER_CPU(struct device, tick_percpu_dev);
@@ -732,7 +733,6 @@ static struct tick_device *tick_get_tick_dev(struct device *dev)
 static __init int tick_broadcast_init_sysfs(void)
 {
 	int err = device_register(&tick_bc_dev);
-
 	if (!err)
 		err = device_create_file(&tick_bc_dev, &dev_attr_current_device);
 	return err;
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index fff5f64981c6..a6809106750d 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -1171,6 +1171,7 @@ ATTRIBUTE_GROUPS(clocksource);
 static struct bus_type clocksource_subsys = {
 	.name = "clocksource",
 	.dev_name = "clocksource",
+	.no_devm = true,
 };
 
 static struct device device_clocksource = {
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 601d61150b65..bd1e48239dfd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5513,6 +5513,7 @@ static struct device_attribute wq_sysfs_unbound_attrs[] = {
 static struct bus_type wq_subsys = {
 	.name				= "workqueue",
 	.dev_groups			= wq_sysfs_groups,
+	.no_devm			= true,
 };
 
 static ssize_t wq_unbound_cpumask_show(struct device *dev,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e8e89158adec..2f4473c74c5c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -224,6 +224,7 @@ static __init int bdi_class_init(void)
 		return PTR_ERR(bdi_class);
 
 	bdi_class->dev_groups = bdi_dev_groups;
+	bdi_class->no_devm = true;
 	bdi_debug_init();
 
 	return 0;
Peter Zijlstra Sept. 26, 2019, 9:05 a.m. UTC | #29
On Wed, Sep 25, 2019 at 11:45:26PM +0200, Peter Zijlstra wrote:
> [    7.149889] [Firmware Bug]: device: 'pci0000:7f': no node assigned on NUMA capable HW
> [    7.882888] [Firmware Bug]: device: 'pci0000:ff': no node assigned on NUMA capable HW

Going by the limited number of intel numa boxes I have, it looks like:

  socket = (~busid) >> (8-n)

where 'n' is the number of bits required to encode the largest socket
id, ie 1 for 2-socket and 2 for 4 socket.

For 8 socket systems we start using pci domains, and things get more
'interesting' :/
Peter Zijlstra Sept. 26, 2019, 12:10 p.m. UTC | #30
On Thu, Sep 26, 2019 at 11:05:59AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2019 at 11:45:26PM +0200, Peter Zijlstra wrote:
> > [    7.149889] [Firmware Bug]: device: 'pci0000:7f': no node assigned on NUMA capable HW
> > [    7.882888] [Firmware Bug]: device: 'pci0000:ff': no node assigned on NUMA capable HW
> 
> Going by the limited number of intel numa boxes I have, it looks like:
> 
>   socket = (~busid) >> (8-n)

Bah, I got my notes mixed up, it should be: busid >> (8-n)

> where 'n' is the number of bits required to encode the largest socket
> id, ie 1 for 2-socket and 2 for 4 socket.
> 
> For 8 socket systems we start using pci domains, and things get more
> 'interesting' :/
Yunsheng Lin Oct. 8, 2019, 8:38 a.m. UTC | #31
On 2019/9/25 18:41, Peter Zijlstra wrote:
> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
>> From the discussion above, It seems making the node_to_cpumask_map()
>> NUMA_NO_NODE aware is the most feasible way to move forwad.
> 
> That's still wrong.

Hi, Peter

It seems this discussion has gone around in circles.

From my understanding, NUMA_NO_NODE, which means no NUMA node preference,
is the state describing the node of a virtual device, or of a physical
device that has equal distance to all cpus.

We can be stricter when a device does have a nearer node, but we cannot
deny that a device may have no node preference or node affinity at all,
which also means the control or data buffers can be allocated at the node
where the process is running.

As you have proposed, making it -2 and having dev_to_node() warn when the
device does have a nearer node that was not set by the fw is a way to be
stricter.

But I think being stricter is not really relevant to NUMA_NO_NODE, because
we do need a state to describe a device that has equal distance to all
nodes, even if that is not physically scalable.

Any better suggestion to move this forward?

Robin Murphy Oct. 9, 2019, 12:25 p.m. UTC | #32
On 2019-10-08 9:38 am, Yunsheng Lin wrote:
> On 2019/9/25 18:41, Peter Zijlstra wrote:
>> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
>>>  From the discussion above, It seems making the node_to_cpumask_map()
>>> NUMA_NO_NODE aware is the most feasible way to move forwad.
>>
>> That's still wrong.
> 
> Hi, Peter
> 
> It seems this has trapped in the dead circle.
> 
>  From my understanding, NUMA_NO_NODE which means not node numa preference
> is the state to describe the node of virtual device or the physical device
> that has equal distance to all cpu.
> 
> We can be stricter if the device does have a nearer node, but we can not
> deny that a device does not have a node numa preference or node affinity,
> which also means the control or data buffer can be allocated at the node where
> the process is running.
> 
> As you has proposed, making it -2 and have dev_to_node() warn if the device does
> have a nearer node and not set by the fw is a way to be stricter.
> 
> But I think maybe being stricter is not really relevant to NUMA_NO_NODE, because
> we does need a state to describe the device that have equal distance to all node,
> even if it is not physically scalable.
> 
> Any better suggestion to move this forward?

FWIW (since this is in my inbox), it sounds like the fundamental issue 
is that NUMA_NO_NODE is conflated for at least two different purposes, 
so trying to sort that out would be a good first step. AFAICS we have 
genuine "don't care" cases like alloc_pages_node(), where if the 
producer says it doesn't matter then the consumer is free to make its 
own judgement on what to do, and fundamentally different "we expect this 
thing to have an affinity but it doesn't, so we can't say what's 
appropriate" cases which could really do with some separate indicator 
like "NUMA_INVALID_NODE".

The tricky part is then bestowed on the producers to decide whether they 
can downgrade "invalid" to "don't care". You can technically build 'a 
device' whose internal logic is distributed between nodes and thus 
appears to have equal affinity - interrupt controllers, for example, may 
have per-CPU or per-node interfaces that end up looking like that - so 
although it's unlikely it's not outright nonsensical. Similarly a 
'device' that's actually emulated behind a firmware call interface may 
well effectively have no real affinity.

Robin.
Yunsheng Lin Oct. 10, 2019, 6:07 a.m. UTC | #33
On 2019/10/9 20:25, Robin Murphy wrote:
> On 2019-10-08 9:38 am, Yunsheng Lin wrote:
>> On 2019/9/25 18:41, Peter Zijlstra wrote:
>>> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
>>>>  From the discussion above, It seems making the node_to_cpumask_map()
>>>> NUMA_NO_NODE aware is the most feasible way to move forwad.
>>>
>>> That's still wrong.
>>
>> Hi, Peter
>>
>> It seems this has trapped in the dead circle.
>>
>>  From my understanding, NUMA_NO_NODE which means not node numa preference
>> is the state to describe the node of virtual device or the physical device
>> that has equal distance to all cpu.
>>
>> We can be stricter if the device does have a nearer node, but we can not
>> deny that a device does not have a node numa preference or node affinity,
>> which also means the control or data buffer can be allocated at the node where
>> the process is running.
>>
>> As you has proposed, making it -2 and have dev_to_node() warn if the device does
>> have a nearer node and not set by the fw is a way to be stricter.
>>
>> But I think maybe being stricter is not really relevant to NUMA_NO_NODE, because
>> we does need a state to describe the device that have equal distance to all node,
>> even if it is not physically scalable.
>>
>> Any better suggestion to move this forward?
> 
> FWIW (since this is in my inbox), it sounds like the fundamental issue is that NUMA_NO_NODE is conflated for at least two different purposes, so trying to sort that out would be a good first step. AFAICS we have genuine "don't care" cases like alloc_pages_node(), where if the producer says it doesn't matter then the consumer is free to make its own judgement on what to do, and fundamentally different "we expect this thing to have an affinity but it doesn't, so we can't say what's appropriate" cases which could really do with some separate indicator like "NUMA_INVALID_NODE".
> 
> The tricky part is then bestowed on the producers to decide whether they can downgrade "invalid" to "don't care". You can technically build 'a device' whose internal logic is distributed between nodes and thus appears to have equal affinity - interrupt controllers, for example, may have per-CPU or per-node interfaces that end up looking like that - so although it's unlikely it's not outright nonsensical. Similarly a 'device' that's actually emulated behind a firmware call interface may well effectively have no real affinity.

We may set the node of a physical device to NUMA_INVALID_NODE when the fw
does not provide one.

But what do we do when alloc_pages_node() is called with nid being
NUMA_INVALID_NODE?

If we change the node to a default one (like node 0) when the device's
node is NUMA_INVALID_NODE in device_add(), how do we know that default
(like node 0) is the right one to choose?

From the previous discussion, the points below do not seem to have reached
consensus yet:
1) Do we need a state like NUMA_NO_NODE to describe a device that does not
   have any numa preference?

2) What do we do if the fw does not provide a node for the device? Should
   we guess and pick one for it, and how do we do the guessing? Or leave it
   as is and handle it as NUMA_NO_NODE?

The point of adding another state like NUMA_INVALID_NODE seems to be to
catch the case where the device does have a nearer node but the fw does
not provide one, and warn about it, while alloc_pages_node() still needs
to handle it as NUMA_NO_NODE?

If the above is true, then maybe we can move forward with that goal.

Thanks very much for the suggestion.

> 
> Robin.
Michal Hocko Oct. 10, 2019, 7:32 a.m. UTC | #34
On Thu 10-10-19 14:07:21, Yunsheng Lin wrote:
> On 2019/10/9 20:25, Robin Murphy wrote:
> > On 2019-10-08 9:38 am, Yunsheng Lin wrote:
> >> On 2019/9/25 18:41, Peter Zijlstra wrote:
> >>> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
> >>>>  From the discussion above, It seems making the node_to_cpumask_map()
> >>>> NUMA_NO_NODE aware is the most feasible way to move forwad.
> >>>
> >>> That's still wrong.
> >>
> >> Hi, Peter
> >>
> >> It seems this has trapped in the dead circle.
> >>
> >>  From my understanding, NUMA_NO_NODE which means not node numa preference
> >> is the state to describe the node of virtual device or the physical device
> >> that has equal distance to all cpu.
> >>
> >> We can be stricter if the device does have a nearer node, but we can not
> >> deny that a device does not have a node numa preference or node affinity,
> >> which also means the control or data buffer can be allocated at the node where
> >> the process is running.
> >>
> >> As you has proposed, making it -2 and have dev_to_node() warn if the device does
> >> have a nearer node and not set by the fw is a way to be stricter.
> >>
> >> But I think maybe being stricter is not really relevant to NUMA_NO_NODE, because
> >> we does need a state to describe the device that have equal distance to all node,
> >> even if it is not physically scalable.
> >>
> >> Any better suggestion to move this forward?
> > 
> > FWIW (since this is in my inbox), it sounds like the fundamental issue is that NUMA_NO_NODE is conflated for at least two different purposes, so trying to sort that out would be a good first step. AFAICS we have genuine "don't care" cases like alloc_pages_node(), where if the producer says it doesn't matter then the consumer is free to make its own judgement on what to do, and fundamentally different "we expect this thing to have an affinity but it doesn't, so we can't say what's appropriate" cases which could really do with some separate indicator like "NUMA_INVALID_NODE".
> > 
> > The tricky part is then bestowed on the producers to decide whether they can downgrade "invalid" to "don't care". You can technically build 'a device' whose internal logic is distributed between nodes and thus appears to have equal affinity - interrupt controllers, for example, may have per-CPU or per-node interfaces that end up looking like that - so although it's unlikely it's not outright nonsensical. Similarly a 'device' that's actually emulated behind a firmware call interface may well effectively have no real affinity.
> 
> We may set node of the physical device to NUMA_INVALID_NODE when fw does not
> provide one.
> 
> But what do we do about NUMA_INVALID_NODE when alloc_pages_node() is called
> with nid being NUMA_INVALID_NODE?

There is nothing sensible the allocator can do. The only point of
NUMA_INVALID_NODE would be to catch potential misconfigurations and
report them to users so they can complain to their HW/FW suppliers.

Pushing it to other subsystems doesn't make much sense IMHO because there
is nothing really actionable. Refusing an allocation altogether sounds
like a bad plan to me.
 
> If we change the node to default one(like node 0) when node of device is
> NUMA_INVALID_NODE in device_add(), how do we know the default one(like node 0)
> is the right one to choose?

Exactly. We cannot really assume any node in that situation.
 
> >From the privous disccusion, the below seems not get to consensus yet:
> 1) Do we need a state like NUMA_NO_NODE to describe that the device does not
>    have any numa preference?

This is the traditional meaning the MM subsystem uses.

> 2) What do we do if the fw does not provide a node for the device? Should
>    we guess and pick one for it and how do we do the guessing? Or leave it
>    as it is and handle it as NUMA_NO_NODE?

As already pointed out several times, picking an arbitrary node is rather
error prone. You can never assume topology. We used to assume that there
would always be a node 0, but that is not really the case (see 3e8589963773
("memcg: make it work on sparse non-0-node systems")). Nodes might also
come and go, so this might just lead to all sorts of subtle problems.

On the other hand, using NUMA_NO_NODE as "no preference" could at worst
lead to slightly sub-optimal performance.

I do agree with Peter that reporting a lack of affinity might be useful,
but we shouldn't really try to be clever and make up the affinity
willy-nilly.
Peter Zijlstra Oct. 10, 2019, 8:56 a.m. UTC | #35
On Wed, Oct 09, 2019 at 01:25:14PM +0100, Robin Murphy wrote:
> On 2019-10-08 9:38 am, Yunsheng Lin wrote:
> > On 2019/9/25 18:41, Peter Zijlstra wrote:
> > > On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
> > > >  From the discussion above, It seems making the node_to_cpumask_map()
> > > > NUMA_NO_NODE aware is the most feasible way to move forwad.
> > > 
> > > That's still wrong.
> > 
> > Hi, Peter
> > 
> > It seems this has trapped in the dead circle.
> > 
> >  From my understanding, NUMA_NO_NODE which means not node numa preference
> > is the state to describe the node of virtual device or the physical device
> > that has equal distance to all cpu.

So I _really_ don't believe in the equidistant physical device. Physics
just doesn't allow that. Or rather, you can build one, but then it'll be
so slow it doesn't matter.

The only possible option is equidistant to a _small_ number of nodes,
and if that is a reality, then we should look at that. So far however
it's purely been a hypothetical device.

> > We can be stricter if the device does have a nearer node, but we can not
> > deny that a device does not have a node numa preference or node affinity,
> > which also means the control or data buffer can be allocated at the node where
> > the process is running.
> > 
> > As you has proposed, making it -2 and have dev_to_node() warn if the device does
> > have a nearer node and not set by the fw is a way to be stricter.

Because it is 100% guaranteed (we have proof) that BIOS is shit and
doesn't set node affinity for devices that really should have it.

So we're trading a hypothetical shared device vs not reporting actual
BIOS bugs. That's no contest.

Worse, we have virtual devices that have clear node affinity without it
set.

So we're growing shit, allowing bugs, and what do we get in return? Warm
fuzzies is not it.

> > Any better suggestion to move this forward?
> 
> FWIW (since this is in my inbox), it sounds like the fundamental issue is
> that NUMA_NO_NODE is conflated for at least two different purposes, so
> trying to sort that out would be a good first step. AFAICS we have genuine
> "don't care" cases like alloc_pages_node(), where if the producer says it
> doesn't matter then the consumer is free to make its own judgement on what
> to do, and fundamentally different "we expect this thing to have an affinity
> but it doesn't, so we can't say what's appropriate" cases which could really
> do with some separate indicator like "NUMA_INVALID_NODE".

It could possibly be a 3-state scheme:

 - UNKNOWN; overridden by parent/bus/etc..
   ERROR when still UNKNOWN on register.

 - INVALID; ERROR on devm usage.
   for virtual devices / pure sysfs nodes

 - NO_NODE; may only be set on virtual devices (we can check against PCI
   bus etc..) when there really is no better option.

But I only want to see the NO_NODE crap at the end, after all other
possible avenues have been done.

> The tricky part is then bestowed on the producers to decide whether they can
> downgrade "invalid" to "don't care". You can technically build 'a device'
> whose internal logic is distributed between nodes and thus appears to have
> equal affinity - interrupt controllers, for example, may have per-CPU or
> per-node interfaces that end up looking like that - so although it's
> unlikely it's not outright nonsensical.

I'm thinking we should/do create per cpu/node devices for such
distributed stuff. For instance, we create per-cpu clockevent devices
(where appropriate).

> Similarly a 'device' that's actually emulated behind a firmware call
> interface may well effectively have no real affinity.

Emulated devices are typically slow as heck and should be avoided if at
all possible. I don't see NUMA affinity being important for them.
Yunsheng Lin Oct. 11, 2019, 3:27 a.m. UTC | #36
On 2019/10/10 15:32, Michal Hocko wrote:
> On Thu 10-10-19 14:07:21, Yunsheng Lin wrote:
>> On 2019/10/9 20:25, Robin Murphy wrote:
>>> On 2019-10-08 9:38 am, Yunsheng Lin wrote:
>>>> On 2019/9/25 18:41, Peter Zijlstra wrote:
>>>>> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
>>>>>>  From the discussion above, it seems making the node_to_cpumask_map()
>>>>>> NUMA_NO_NODE aware is the most feasible way to move forward.
>>>>>
>>>>> That's still wrong.
>>>>
>>>> Hi, Peter
>>>>
>>>> It seems this has been trapped in a dead circle.
>>>>
>>>>  From my understanding, NUMA_NO_NODE, which means no numa node preference,
>>>> is the state to describe the node of a virtual device or a physical device
>>>> that has equal distance to all cpus.
>>>>
>>>> We can be stricter if the device does have a nearer node, but we cannot
>>>> deny that a device may not have a numa node preference or node affinity,
>>>> which also means the control or data buffer can be allocated at the node
>>>> where the process is running.
>>>>
>>>> As you have proposed, making it -2 and having dev_to_node() warn if the device does
>>>> have a nearer node and not set by the fw is a way to be stricter.
>>>>
>>>> But I think maybe being stricter is not really relevant to NUMA_NO_NODE, because
>>>> we do need a state to describe a device that has equal distance to all nodes,
>>>> even if it is not physically scalable.
>>>>
>>>> Any better suggestion to move this forward?
>>>
>>> FWIW (since this is in my inbox), it sounds like the fundamental issue is
>>> that NUMA_NO_NODE is conflated for at least two different purposes, so
>>> trying to sort that out would be a good first step. AFAICS we have genuine
>>> "don't care" cases like alloc_pages_node(), where if the producer says it
>>> doesn't matter then the consumer is free to make its own judgement on what
>>> to do, and fundamentally different "we expect this thing to have an
>>> affinity but it doesn't, so we can't say what's appropriate" cases which
>>> could really do with some separate indicator like "NUMA_INVALID_NODE".
>>>
>>> The tricky part is then bestowed on the producers to decide whether they
>>> can downgrade "invalid" to "don't care". You can technically build 'a
>>> device' whose internal logic is distributed between nodes and thus appears
>>> to have equal affinity - interrupt controllers, for example, may have
>>> per-CPU or per-node interfaces that end up looking like that - so although
>>> it's unlikely it's not outright nonsensical. Similarly a 'device' that's
>>> actually emulated behind a firmware call interface may well effectively
>>> have no real affinity.
>>
>> We may set node of the physical device to NUMA_INVALID_NODE when fw does not
>> provide one.
>>
>> But what do we do about NUMA_INVALID_NODE when alloc_pages_node() is called
>> with nid being NUMA_INVALID_NODE?
> 
> There is nothing sensible the allocator can do. The only point of
> NUMA_INVALID_NODE would be to catch potential misconfiguration and
> report them to users so they can complain to their HW/FS suppliers.
> 
> Pushing it to other subsystems doesn't make much sense IMHO because there
> is nothing really actionable. Refusing an allocation altogether sounds
> like a bad plan to me.
>  
>> If we change the node to default one(like node 0) when node of device is
>> NUMA_INVALID_NODE in device_add(), how do we know the default one(like node 0)
>> is the right one to choose?
> 
> Exactly. We cannot really assume any node in that situation.
>  
>> >From the previous discussion, the below seems not to have reached consensus yet:
>> 1) Do we need a state like NUMA_NO_NODE to describe that the device does not
>>    have any numa preference?
> 
> This is a traditional meaning MM subsystem is using.
> 
>> 2) What do we do if the fw does not provide a node for the device? Should
>>    we guess and pick one for it and how do we do the guessing? Or leave it
>>    as it is and handle it as NUMA_NO_NODE?
> 
> As already pointed several times, picking any node is rather error
> prone. You can never assume topology. We used to assume that there
> would always be a node 0 but that is not really the case (see 3e8589963773
> ("memcg: make it work on sparse non-0-node systems")). Nodes might also
> come and go so this might just lead to all sorts of subtle problems.
> 
> On the other hand using NUMA_NO_NODE as no preference could only lead to
> slightly sub optimal performance.
> 
> I do agree with Peter that reporting a lack of affinity might be useful
> but we shouldn't really try to be clever and make up the affinity
> willy-nilly.

Ok, thanks for the clarification.

So it seems we need to do the below if I understand it correctly:
1. fix up, as much as possible, the node id of a device that has a clear
   node affinity but does not have it set.
2. catch the case where a device should have a nearer node but does not
   have one set, and warn about it.

But I failed to see why the above is related to making node_to_cpumask_map()
NUMA_NO_NODE aware?

There is a clear bug here too when node_to_cpumask_map() is not handling
NUMA_NO_NODE.

Maybe we can make node_to_cpumask_map() NUMA_NO_NODE aware first and then
deal with the above, because even after the above is handled, we still need to
handle NUMA_NO_NODE in node_to_cpumask_map(), and it may take time to deal with
the above when it is tricky to decide whether or not a device has a clear node.

.
>
Peter Zijlstra Oct. 11, 2019, 11:15 a.m. UTC | #37
On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> But I failed to see why the above is related to making node_to_cpumask_map()
> NUMA_NO_NODE aware?

Your initial bug is for hns3, which is a PCI device, which really _MUST_
have a node assigned.

It not having one, is a straight up bug. We must not silently accept
NO_NODE there, ever.
Yunsheng Lin Oct. 12, 2019, 6:17 a.m. UTC | #38
add pci and acpi maintainer
cc linux-pci@vger.kernel.org and linux-acpi@vger.kernel.org

On 2019/10/11 19:15, Peter Zijlstra wrote:
> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
>> But I failed to see why the above is related to making node_to_cpumask_map()
>> NUMA_NO_NODE aware?
> 
> Your initial bug is for hns3, which is a PCI device, which really _MUST_
> have a node assigned.
> 
> It not having one, is a straight up bug. We must not silently accept
> NO_NODE there, ever.
> 

I suppose by "not silently accept NO_NODE" you mean reporting a lack of
affinity when the node of a PCIe device is not set.

As Greg has asked about in [1]:
what is a user to do when the user sees the kernel reporting that?

We may tell the user to contact their vendor for info or updates about
that when they do not know their system well enough, but the
vendor may get away with this by quoting the ACPI spec, as the spec
considers this optional. Should the user believe this is indeed a
fw bug or a misreport from the kernel?

If this kind of reporting is common practice and will not cause any
misunderstanding, then maybe we can report that.

[1] https://lore.kernel.org/lkml/20190905055727.GB23826@kroah.com/

> .
>
Greg KH Oct. 12, 2019, 7:40 a.m. UTC | #39
On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> add pci and acpi maintainer
> cc linux-pci@vger.kernel.org and linux-acpi@vger.kernel.org
> 
> On 2019/10/11 19:15, Peter Zijlstra wrote:
> > On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> >> But I failed to see why the above is related to making node_to_cpumask_map()
> >> NUMA_NO_NODE aware?
> > 
> > Your initial bug is for hns3, which is a PCI device, which really _MUST_
> > have a node assigned.
> > 
> > It not having one, is a straight up bug. We must not silently accept
> > NO_NODE there, ever.
> > 
> 
> I suppose by "not silently accept NO_NODE" you mean reporting a lack of
> affinity when the node of a PCIe device is not set.

If the firmware of a pci device does not provide the node information,
then yes, warn about that.

> As Greg has asked about in [1]:
> what is a user to do when the user sees the kernel reporting that?
> 
> We may tell the user to contact their vendor for info or updates about
> that when they do not know their system well enough, but the
> vendor may get away with this by quoting the ACPI spec, as the spec
> considers this optional. Should the user believe this is indeed a
> fw bug or a misreport from the kernel?

Say it is a firmware bug, if it is a firmware bug, that's simple.

> If this kind of reporting is common practice and will not cause any
> misunderstanding, then maybe we can report that.

Yes, please do so, that's the only way those boxes are ever going to get
fixed.  And go add the test to the "firmware testing" tool that is based
on Linux that Intel has somewhere, to give vendors a chance to fix this
before they ship hardware.

This shouldn't be a big deal, we warn of other hardware bugs all the
time.

thanks,

greg k-h
Yunsheng Lin Oct. 12, 2019, 9:47 a.m. UTC | #40
On 2019/10/12 15:40, Greg KH wrote:
> On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
>> add pci and acpi maintainer
>> cc linux-pci@vger.kernel.org and linux-acpi@vger.kernel.org
>>
>> On 2019/10/11 19:15, Peter Zijlstra wrote:
>>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
>>>> But I failed to see why the above is related to making node_to_cpumask_map()
>>>> NUMA_NO_NODE aware?
>>>
>>> Your initial bug is for hns3, which is a PCI device, which really _MUST_
>>> have a node assigned.
>>>
>>> It not having one, is a straight up bug. We must not silently accept
>>> NO_NODE there, ever.
>>>
>>
>> I suppose by "not silently accept NO_NODE" you mean reporting a lack of
>> affinity when the node of a PCIe device is not set.
> 
> If the firmware of a pci device does not provide the node information,
> then yes, warn about that.
> 
>> As Greg has asked about in [1]:
>> what is a user to do when the user sees the kernel reporting that?
>>
>> We may tell the user to contact their vendor for info or updates about
>> that when they do not know their system well enough, but the
>> vendor may get away with this by quoting the ACPI spec, as the spec
>> considers this optional. Should the user believe this is indeed a
>> fw bug or a misreport from the kernel?
> 
> Say it is a firmware bug, if it is a firmware bug, that's simple.
> 
>> If this kind of reporting is common practice and will not cause any
>> misunderstanding, then maybe we can report that.
> 
> Yes, please do so, that's the only way those boxes are ever going to get
> fixed.  And go add the test to the "firmware testing" tool that is based
> on Linux that Intel has somewhere, to give vendors a chance to fix this
> before they ship hardware.
> 
> This shouldn't be a big deal, we warn of other hardware bugs all the
> time.

Ok, thanks for clarifying.

Will send a patch to catch the case where a PCIe device does not have its
numa node set and warn about it.

Maybe use dev->bus to verify if it is a pci device?

> 
> thanks,
> 
> greg k-h
> 
> .
>
Greg KH Oct. 12, 2019, 10:40 a.m. UTC | #41
On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
> On 2019/10/12 15:40, Greg KH wrote:
> > On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> >> add pci and acpi maintainer
> >> cc linux-pci@vger.kernel.org and linux-acpi@vger.kernel.org
> >>
> >> On 2019/10/11 19:15, Peter Zijlstra wrote:
> >>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> >>>> But I failed to see why the above is related to making node_to_cpumask_map()
> >>>> NUMA_NO_NODE aware?
> >>>
> >>> Your initial bug is for hns3, which is a PCI device, which really _MUST_
> >>> have a node assigned.
> >>>
> >>> It not having one, is a straight up bug. We must not silently accept
> >>> NO_NODE there, ever.
> >>>
> >>
> >> I suppose by "not silently accept NO_NODE" you mean reporting a lack of
> >> affinity when the node of a PCIe device is not set.
> > 
> > If the firmware of a pci device does not provide the node information,
> > then yes, warn about that.
> > 
> >> As Greg has asked about in [1]:
> >> what is a user to do when the user sees the kernel reporting that?
> >>
> >> We may tell the user to contact their vendor for info or updates about
> >> that when they do not know their system well enough, but the
> >> vendor may get away with this by quoting the ACPI spec, as the spec
> >> considers this optional. Should the user believe this is indeed a
> >> fw bug or a misreport from the kernel?
> > 
> > Say it is a firmware bug, if it is a firmware bug, that's simple.
> > 
> >> If this kind of reporting is common practice and will not cause any
> >> misunderstanding, then maybe we can report that.
> > 
> > Yes, please do so, that's the only way those boxes are ever going to get
> > fixed.  And go add the test to the "firmware testing" tool that is based
> > on Linux that Intel has somewhere, to give vendors a chance to fix this
> > before they ship hardware.
> > 
> > This shouldn't be a big deal, we warn of other hardware bugs all the
> > time.
> 
> Ok, thanks for clarifying.
> 
> Will send a patch to catch the case where a PCIe device does not have its
> numa node set and warn about it.
> 
> Maybe use dev->bus to verify if it is a pci device?

No, do that in the pci bus core code itself, when creating the devices
as that is when you know, or do not know, the numa node, right?

This can't be in the driver core only, as each bus type will have a
different way of determining what the node the device is on.  For some
reason, I thought the PCI core code already does this, right?

thanks,

greg k-h
Greg KH Oct. 12, 2019, 10:47 a.m. UTC | #42
On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
> On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
> > On 2019/10/12 15:40, Greg KH wrote:
> > > On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> > >> add pci and acpi maintainer
> > >> cc linux-pci@vger.kernel.org and linux-acpi@vger.kernel.org
> > >>
> > >> On 2019/10/11 19:15, Peter Zijlstra wrote:
> > >>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> > >>>> But I failed to see why the above is related to making node_to_cpumask_map()
> > >>>> NUMA_NO_NODE aware?
> > >>>
> > >>> Your initial bug is for hns3, which is a PCI device, which really _MUST_
> > >>> have a node assigned.
> > >>>
> > >>> It not having one, is a straight up bug. We must not silently accept
> > >>> NO_NODE there, ever.
> > >>>
> > >>
> > >> I suppose by "not silently accept NO_NODE" you mean reporting a lack of
> > >> affinity when the node of a PCIe device is not set.
> > > 
> > > If the firmware of a pci device does not provide the node information,
> > > then yes, warn about that.
> > > 
> > >> As Greg has asked about in [1]:
> > >> what is a user to do when the user sees the kernel reporting that?
> > >>
> > >> We may tell the user to contact their vendor for info or updates about
> > >> that when they do not know their system well enough, but the
> > >> vendor may get away with this by quoting the ACPI spec, as the spec
> > >> considers this optional. Should the user believe this is indeed a
> > >> fw bug or a misreport from the kernel?
> > > 
> > > Say it is a firmware bug, if it is a firmware bug, that's simple.
> > > 
> > >> If this kind of reporting is common practice and will not cause any
> > >> misunderstanding, then maybe we can report that.
> > > 
> > > Yes, please do so, that's the only way those boxes are ever going to get
> > > fixed.  And go add the test to the "firmware testing" tool that is based
> > > on Linux that Intel has somewhere, to give vendors a chance to fix this
> > > before they ship hardware.
> > > 
> > > This shouldn't be a big deal, we warn of other hardware bugs all the
> > > time.
> > 
> > Ok, thanks for clarifying.
> > 
> > Will send a patch to catch the case where a PCIe device does not have its
> > numa node set and warn about it.
> > 
> > Maybe use dev->bus to verify if it is a pci device?
> 
> No, do that in the pci bus core code itself, when creating the devices
> as that is when you know, or do not know, the numa node, right?
> 
> This can't be in the driver core only, as each bus type will have a
> different way of determining what the node the device is on.  For some
> reason, I thought the PCI core code already does this, right?

Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
thing...

Anyway, it looks like the pci core code does call set_dev_node() based
on the PCI bridge, so if that is set up properly, all should be fine.

If not, well, you have buggy firmware and you need to warn about that at
the time you are creating the bridge.  Look at the call to
pcibus_to_node() in pci_register_host_bridge().

And yes, you need to do this all on a per-bus-type basis, as has been
pointed out.  It's up to the bus to create the device and set this up
properly.
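
The check described above could be sketched as follows (plain userspace C,
not actual kernel code; NUMA_NO_NODE mirrors the kernel's value, while
struct fake_bridge and the helper are made up for illustration):

```c
#include <assert.h>
#include <stdio.h>

#define NUMA_NO_NODE (-1)	/* mirrors the kernel's value */

/* Hypothetical stand-in for a host bridge; not a kernel struct. */
struct fake_bridge {
	const char *name;
	int numa_node;		/* what pcibus_to_node() would report */
};

/* Return the node, warning when firmware left it unset. */
static int bridge_node_checked(const struct fake_bridge *b)
{
	if (b->numa_node == NUMA_NO_NODE)
		fprintf(stderr,
			"%s: no NUMA node provided by firmware (firmware bug?), performance may suffer\n",
			b->name);
	return b->numa_node;
}
```

In the real kernel this warning would sit near the pcibus_to_node() call in
pci_register_host_bridge(), so each bus type does its own detection at
device-creation time, as suggested above.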

thanks,

greg k-h

Patch
diff mbox series

diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
index 626ad01..c8a4b31 100644
--- a/arch/arm64/include/asm/numa.h
+++ b/arch/arm64/include/asm/numa.h
@@ -25,6 +25,9 @@  const struct cpumask *cpumask_of_node(int node);
 /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+	if (node == NUMA_NO_NODE)
+		return cpu_online_mask;
+
 	return node_to_cpumask_map[node];
 }
 #endif
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 4decf16..5ae7eea 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -46,6 +46,9 @@  EXPORT_SYMBOL(node_to_cpumask_map);
  */
 const struct cpumask *cpumask_of_node(int node)
 {
+	if (node == NUMA_NO_NODE)
+		return cpu_online_mask;
+
 	if (WARN_ON(node >= nr_node_ids))
 		return cpu_none_mask;
 
diff --git a/arch/mips/include/asm/mach-loongson64/topology.h b/arch/mips/include/asm/mach-loongson64/topology.h
index 7ff819a..e78daa6 100644
--- a/arch/mips/include/asm/mach-loongson64/topology.h
+++ b/arch/mips/include/asm/mach-loongson64/topology.h
@@ -5,7 +5,9 @@ 
 #ifdef CONFIG_NUMA
 
 #define cpu_to_node(cpu)	(cpu_logical_map(cpu) >> 2)
-#define cpumask_of_node(node)	(&__node_data[(node)]->cpumask)
+#define cpumask_of_node(node)	((node) == NUMA_NO_NODE ?		\
+				 cpu_online_mask :			\
+				 &__node_data[(node)]->cpumask)
 
 struct pci_bus;
 extern int pcibus_to_node(struct pci_bus *);
diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index cca406f..1bd2e73 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -78,6 +78,9 @@  static inline int cpu_to_node(int cpu)
 #define cpumask_of_node cpumask_of_node
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+	if (node == NUMA_NO_NODE)
+		return cpu_online_mask;
+
 	return &node_to_cpumask_map[node];
 }
 
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b14d23..7fa82e1 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -69,6 +69,9 @@  extern const struct cpumask *cpumask_of_node(int node);
 /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+	if (node == NUMA_NO_NODE)
+		return cpu_online_mask;
+
 	return node_to_cpumask_map[node];
 }
 #endif
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4123100e..9859acb 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -861,6 +861,9 @@  void numa_remove_cpu(int cpu)
  */
 const struct cpumask *cpumask_of_node(int node)
 {
+	if (node == NUMA_NO_NODE)
+		return cpu_online_mask;
+
 	if ((unsigned)node >= nr_node_ids) {
 		printk(KERN_WARNING
 			"cpumask_of_node(%d): (unsigned)node >= nr_node_ids(%u)\n",