PCI: Warn about host bridge device when its numa node is NO_NODE
diff mbox series

Message ID 1571467543-26125-1-git-send-email-linyunsheng@huawei.com
State Changes Requested
Headers show
Series
  • PCI: Warn about host bridge device when its numa node is NO_NODE
Related show

Commit Message

Yunsheng Lin Oct. 19, 2019, 6:45 a.m. UTC
As the disscusion in [1]:
A PCI device really _MUST_ have a node assigned. It is possible to
have a PCI bridge shared between two nodes, such that the PCI
devices have equidistance. But the moment you scale this out, you
either get devices that are 'local' to a package while having
multiple packages, or if you maintain a single bridge in a big
system, things become so slow it all doesn't matter anyway.
Assigning a node (one of the shared) is, in the generic ase of
multiple packages, the better solution over assigning all nodes.

As pci_device_add() will assign the pci device' node according to
the bus the device is on, which is decided by pcibus_to_node().
Currently different arch may implement the pcibus_to_node() based
on bus->sysdata or bus device' node, which has the same node as
the bridge device.

And for devices behind another bridge case, the child bus device
is setup with proper parent bus device and inherit its parent'
sysdata in pci_alloc_child_bus(), so the pcie device under the
child bus should have the same node as the parent bridge when
device_add() is called, which will set the node to its parent's
node when the child device' node is NUMA_NO_NODE.

So this patch only warns about the case when a host bridge device
is registered with a node of NO_NODE in pci_register_host_bridge().
And it only warns about that when there are more than one numa
nodes in the system.

[1] https://lore.kernel.org/lkml/1568724534-146242-1-git-send-email-linyunsheng@huawei.com/

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 drivers/pci/probe.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

hch@infradead.org Oct. 19, 2019, 8:34 a.m. UTC | #1
On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
> +	if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
> +		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");
> +

The whole idea of mentioning a BIOS in architeture indepent code doesn't
make sense at all.
Yunsheng Lin Oct. 21, 2019, 4:05 a.m. UTC | #2
On 2019/10/19 16:34, Christoph Hellwig wrote:
> On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
>> +	if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
>> +		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");
>> +
> 
> The whole idea of mentioning a BIOS in architeture indepent code doesn't
> make sense at all.

Mentioning the BIOS is to tell user what firmware is broken, so that
user can report this to their vendor by referring the specific firmware.

It seems we can specific the node through different ways(DT, ACPI, etc).

Is there a better name for mentioning instead of BIOS, or we should do
the checking and warning in the architeture dependent code?

Or maybe just remove the BIOS from the above log?

Thanks.

> 
> .
>
Robin Murphy Oct. 22, 2019, 1:55 p.m. UTC | #3
On 21/10/2019 05:05, Yunsheng Lin wrote:
> On 2019/10/19 16:34, Christoph Hellwig wrote:
>> On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
>>> +	if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
>>> +		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");
>>> +
>>
>> The whole idea of mentioning a BIOS in architeture indepent code doesn't
>> make sense at all.

[ Come to think of it, I'm sure an increasing number of x86 firmwares 
don't even implement a PC BIOS any more... ]

In all fairness, the server-class Arm-based machines I've come across so 
far do seem to consistently call their EFI firmware images "BIOS" 
despite the clear anachronism. At least the absurdity of conflating a 
system setup program with a semiconductor process seems to have mostly 
died out ;)

> Mentioning the BIOS is to tell user what firmware is broken, so that
> user can report this to their vendor by referring the specific firmware.
> 
> It seems we can specific the node through different ways(DT, ACPI, etc).
> 
> Is there a better name for mentioning instead of BIOS, or we should do
> the checking and warning in the architeture dependent code?
> 
> Or maybe just remove the BIOS from the above log?

Even though there may be some degree of historical convention hanging 
around on ACPI-based systems, that argument almost certainly doesn't 
hold for OF/FDT/etc. - the "[Firmware Bug]:" prefix is hopefully 
indicative enough, so I'd say just drop the "by BIOS" part.

Robin.
Bjorn Helgaas Oct. 22, 2019, 9:04 p.m. UTC | #4
On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
> As the disscusion in [1]:

We need to justify this patch right here in the commit log, not with a
pointer to a 50+ message email thread.

> A PCI device really _MUST_ have a node assigned. 

No, it's not really essential.  It's *nice* if we know the node
closest to a PCI device, but the system should function correctly even
if we don't.  The only problem is that it will be slower.

I think the underlying problem you're addressing is that:

  - NUMA_NO_NODE == -1,
  - dev_to_node(dev) may return NUMA_NO_NODE,
  - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and
  - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference

For example, on arm64, mips loongson, s390, and x86,
cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is
an invalid array index.

That problem can't be solved by emitting a warning, of course.  I
assume some variation of your "numa: make node_to_cpumask_map()
NUMA_NO_NODE aware" patch [a] will solve that problem.

[a] https://lore.kernel.org/linux-mips/1568535656-158979-1-git-send-email-linyunsheng@huawei.com/T/#u

It is probably a good idea to emit a warning about the performance
issue.

When I run your patch on qemu, I see this:

  ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
  acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
  acpi PNP0A08:00: _OSC: platform does not support [LTR]
  acpi PNP0A08:00: _OSC: OS now controls [PME AER PCIeCapability]
  PCI host bridge to bus 0000:00
  pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
  pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
  pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
  pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
  pci_bus 0000:00: root bus resource [mem 0x100000000-0x8ffffffff window]
  pci_bus 0000:00: root bus resource [bus 00-ff]
   pci0000:00: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.

I didn't debug it to see what's wrong with the " pci0000:00" string.
Ideally it would be connected with "acpi PNP0A08:00" since that's the
place where BIOS would make a fix but I suppose "pci_bus 0000:00"
would be adequate.

> It is possible to
> have a PCI bridge shared between two nodes, such that the PCI
> devices have equidistance. But the moment you scale this out, you
> either get devices that are 'local' to a package while having
> multiple packages, or if you maintain a single bridge in a big
> system, things become so slow it all doesn't matter anyway.
> Assigning a node (one of the shared) is, in the generic ase of
> multiple packages, the better solution over assigning all nodes.
> 
> As pci_device_add() will assign the pci device' node according to
> the bus the device is on, which is decided by pcibus_to_node().
> Currently different arch may implement the pcibus_to_node() based
> on bus->sysdata or bus device' node, which has the same node as
> the bridge device.
>
> And for devices behind another bridge case, the child bus device
> is setup with proper parent bus device and inherit its parent'
> sysdata in pci_alloc_child_bus(), so the pcie device under the
> child bus should have the same node as the parent bridge when
> device_add() is called, which will set the node to its parent's
> node when the child device' node is NUMA_NO_NODE.
> 
> So this patch only warns about the case when a host bridge device
> is registered with a node of NO_NODE in pci_register_host_bridge().
> And it only warns about that when there are more than one numa
> nodes in the system.


> [1] https://lore.kernel.org/lkml/1568724534-146242-1-git-send-email-linyunsheng@huawei.com/
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  drivers/pci/probe.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 3d5271a..22be96a 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -927,6 +927,9 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
>  	list_add_tail(&bus->node, &pci_root_buses);
>  	up_write(&pci_bus_sem);
>  
> +	if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
> +		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");
> +
>  	return 0;
>  
>  unregister:
Yunsheng Lin Oct. 23, 2019, 8:22 a.m. UTC | #5
On 2019/10/23 5:04, Bjorn Helgaas wrote:
> On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
>> As the disscusion in [1]:
> 
> We need to justify this patch right here in the commit log, not with a
> pointer to a 50+ message email thread.

Ok, thanks.

> 
>> A PCI device really _MUST_ have a node assigned. 
> 
> No, it's not really essential.  It's *nice* if we know the node
> closest to a PCI device, but the system should function correctly even
> if we don't.  The only problem is that it will be slower.

Ok, will try to mention the info you mention here in the commit log.

> 
> I think the underlying problem you're addressing is that:
> 
>   - NUMA_NO_NODE == -1,
>   - dev_to_node(dev) may return NUMA_NO_NODE,
>   - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and
>   - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference
> 
> For example, on arm64, mips loongson, s390, and x86,
> cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is
> an invalid array index.

The invalid array index of -1 is the underlying problem here when
cpumask_of_node(dev_to_node(dev)) is called and cpumask_of_node()
is not NUMA_NO_NODE aware yet.

In the "numa: make node_to_cpumask_map() NUMA_NO_NODE aware" thread
disscusion, it is requested that it is better to warn about the pcie
device without a node assigned by the firmware before making the
cpumask_of_node() NUMA_NO_NODE aware, so that the system with pci
devices of "NUMA_NO_NODE" node can be fixed by their vendor.

See: https://lore.kernel.org/lkml/20191011111539.GX2311@hirez.programming.kicks-ass.net/

> 
> That problem can't be solved by emitting a warning, of course.  I
> assume some variation of your "numa: make node_to_cpumask_map()
> NUMA_NO_NODE aware" patch [a] will solve that problem.
> 
> [a] https://lore.kernel.org/linux-mips/1568535656-158979-1-git-send-email-linyunsheng@huawei.com/T/#u
> 
> It is probably a good idea to emit a warning about the performance
> issue.
> 
> When I run your patch on qemu, I see this:
> 
>   ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
>   acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
>   acpi PNP0A08:00: _OSC: platform does not support [LTR]
>   acpi PNP0A08:00: _OSC: OS now controls [PME AER PCIeCapability]
>   PCI host bridge to bus 0000:00
>   pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
>   pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
>   pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
>   pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
>   pci_bus 0000:00: root bus resource [mem 0x100000000-0x8ffffffff window]
>   pci_bus 0000:00: root bus resource [bus 00-ff]
>    pci0000:00: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> 
> I didn't debug it to see what's wrong with the " pci0000:00" string.
> Ideally it would be connected with "acpi PNP0A08:00" since that's the
> place where BIOS would make a fix but I suppose "pci_bus 0000:00"
> would be adequate.

It seems the string at the beginning of dev_err() output is consisted
of dev_driver_string() and dev_name().

drivers/base/core.c:
const char *dev_driver_string(const struct device *dev)
{
	struct device_driver *drv;

	/* dev->driver can change to NULL underneath us because of unbinding,
	 * so be careful about accessing it.  dev->bus and dev->class should
	 * never change once they are set, so they don't need special care.
	 */
	drv = READ_ONCE(dev->driver);
	return drv ? drv->name :
			(dev->bus ? dev->bus->name :
			(dev->class ? dev->class->name : ""));
}

The bridge device does not have a driver, a bus or a class, so
dev_driver_string() will return "", that is why there is a extral
space at the beginning of the " pci0000:00". I am not familiar with
the pci, so not sure if this is a problem that the bridge device
does not have a driver, a bus or a class.

And the bus device has a class of pcibus_class, and name of
pcibus_class is "pci_bus".

So maybe change the warning to below:

if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE)
	dev_err(&bus->dev, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");

And it seems a pci device's parent will always set to the bridge
device in pci_setup_device(), and device_add() which will set the
node to its parent's when the child device' node is NUMA_NO_NODE,
maybe we can add the bridge device' node checking to make sure
the pci device really does not have a node assigned, as below:

if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE &&
    dev_to_node(bus->bridge) == NUMA_NO_NODE)
	dev_err(&bus->dev, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");


> 
>> It is possible to
>> have a PCI bridge shared between two nodes, such that the PCI
>> devices have equidistance. But the moment you scale this out, you
>> either get devices that are 'local' to a package while having
>> multiple packages, or if you maintain a single bridge in a big
>> system, things become so slow it all doesn't matter anyway.
>> Assigning a node (one of the shared) is, in the generic ase of
>> multiple packages, the better solution over assigning all nodes.
>>
>> As pci_device_add() will assign the pci device' node according to
>> the bus the device is on, which is decided by pcibus_to_node().
>> Currently different arch may implement the pcibus_to_node() based
>> on bus->sysdata or bus device' node, which has the same node as
>> the bridge device.
>>
>> And for devices behind another bridge case, the child bus device
>> is setup with proper parent bus device and inherit its parent'
>> sysdata in pci_alloc_child_bus(), so the pcie device under the
>> child bus should have the same node as the parent bridge when
>> device_add() is called, which will set the node to its parent's
>> node when the child device' node is NUMA_NO_NODE.
>>
>> So this patch only warns about the case when a host bridge device
>> is registered with a node of NO_NODE in pci_register_host_bridge().
>> And it only warns about that when there are more than one numa
>> nodes in the system.
> 
> 
>> [1] https://lore.kernel.org/lkml/1568724534-146242-1-git-send-email-linyunsheng@huawei.com/
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>>  drivers/pci/probe.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index 3d5271a..22be96a 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -927,6 +927,9 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
>>  	list_add_tail(&bus->node, &pci_root_buses);
>>  	up_write(&pci_bus_sem);
>>  
>> +	if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
>> +		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");
>> +
>>  	return 0;
>>  
>>  unregister:
> 
> .
>
Yunsheng Lin Oct. 23, 2019, 8:24 a.m. UTC | #6
On 2019/10/22 21:55, Robin Murphy wrote:
> On 21/10/2019 05:05, Yunsheng Lin wrote:
>> On 2019/10/19 16:34, Christoph Hellwig wrote:
>>> On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
>>>> +    if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
>>>> +        dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");
>>>> +
>>>
>>> The whole idea of mentioning a BIOS in architeture indepent code doesn't
>>> make sense at all.
> 
> [ Come to think of it, I'm sure an increasing number of x86 firmwares don't even implement a PC BIOS any more... ]
> 
> In all fairness, the server-class Arm-based machines I've come across so far do seem to consistently call their EFI firmware images "BIOS" despite the clear anachronism. At least the absurdity of conflating a system setup program with a semiconductor process seems to have mostly died out ;)
> 
>> Mentioning the BIOS is to tell user what firmware is broken, so that
>> user can report this to their vendor by referring the specific firmware.
>>
>> It seems we can specific the node through different ways(DT, ACPI, etc).
>>
>> Is there a better name for mentioning instead of BIOS, or we should do
>> the checking and warning in the architeture dependent code?
>>
>> Or maybe just remove the BIOS from the above log?
> 
> Even though there may be some degree of historical convention hanging around on ACPI-based systems, that argument almost certainly doesn't hold for OF/FDT/etc. - the "[Firmware Bug]:" prefix is hopefully indicative enough, so I'd say just drop the "by BIOS" part.

Will drop the "by BIOS" part if there is another version.
Tnanks for clarifying.

> 
> Robin.
> 
> .
>
Bjorn Helgaas Oct. 23, 2019, 5:10 p.m. UTC | #7
On Wed, Oct 23, 2019 at 04:22:43PM +0800, Yunsheng Lin wrote:
> On 2019/10/23 5:04, Bjorn Helgaas wrote:
> > On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:

> > I think the underlying problem you're addressing is that:
> > 
> >   - NUMA_NO_NODE == -1,
> >   - dev_to_node(dev) may return NUMA_NO_NODE,
> >   - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and
> >   - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference
> > 
> > For example, on arm64, mips loongson, s390, and x86,
> > cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is
> > an invalid array index.
> 
> The invalid array index of -1 is the underlying problem here when
> cpumask_of_node(dev_to_node(dev)) is called and cpumask_of_node()
> is not NUMA_NO_NODE aware yet.
> 
> In the "numa: make node_to_cpumask_map() NUMA_NO_NODE aware" thread
> disscusion, it is requested that it is better to warn about the pcie
> device without a node assigned by the firmware before making the
> cpumask_of_node() NUMA_NO_NODE aware, so that the system with pci
> devices of "NUMA_NO_NODE" node can be fixed by their vendor.
> 
> See: https://lore.kernel.org/lkml/20191011111539.GX2311@hirez.programming.kicks-ass.net/

Right.  We should warn if the NUMA node number would help us but DT or
the firmware didn't give us one.

But we can do that independently of any cpumask_of_node() changes.
There's no need to do one patch before the other.  Even if you make
cpumask_of_node() tolerate NUMA_NO_NODE, we'll still get the warning
because we're not actually changing any node assignments.

> So maybe change the warning to below:
> 
> if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE)
> 	dev_err(&bus->dev, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");

I think this is perfect and I don't see the need for the refinement
below:

> And it seems a pci device's parent will always set to the bridge
> device in pci_setup_device(), and device_add() which will set the
> node to its parent's when the child device' node is NUMA_NO_NODE,
> maybe we can add the bridge device' node checking to make sure
> the pci device really does not have a node assigned, as below:
> 
> if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE &&
>     dev_to_node(bus->bridge) == NUMA_NO_NODE)
> 	dev_err(&bus->dev, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");

Anyway, would the attached patch work for you?  I have it tentatively
queued up on pci/enumeration for v5.5.

> >> It is possible to
> >> have a PCI bridge shared between two nodes, such that the PCI
> >> devices have equidistance. But the moment you scale this out, you
> >> either get devices that are 'local' to a package while having
> >> multiple packages, or if you maintain a single bridge in a big
> >> system, things become so slow it all doesn't matter anyway.
> >> Assigning a node (one of the shared) is, in the generic ase of
> >> multiple packages, the better solution over assigning all nodes.
> >>
> >> As pci_device_add() will assign the pci device' node according to
> >> the bus the device is on, which is decided by pcibus_to_node().
> >> Currently different arch may implement the pcibus_to_node() based
> >> on bus->sysdata or bus device' node, which has the same node as
> >> the bridge device.
> >>
> >> And for devices behind another bridge case, the child bus device
> >> is setup with proper parent bus device and inherit its parent'
> >> sysdata in pci_alloc_child_bus(), so the pcie device under the
> >> child bus should have the same node as the parent bridge when
> >> device_add() is called, which will set the node to its parent's
> >> node when the child device' node is NUMA_NO_NODE.
> >>
> >> So this patch only warns about the case when a host bridge device
> >> is registered with a node of NO_NODE in pci_register_host_bridge().
> >> And it only warns about that when there are more than one numa
> >> nodes in the system.
> > 
> > 
> >> [1] https://lore.kernel.org/lkml/1568724534-146242-1-git-send-email-linyunsheng@huawei.com/
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> ---
> >>  drivers/pci/probe.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> >> index 3d5271a..22be96a 100644
> >> --- a/drivers/pci/probe.c
> >> +++ b/drivers/pci/probe.c
> >> @@ -927,6 +927,9 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
> >>  	list_add_tail(&bus->node, &pci_root_buses);
> >>  	up_write(&pci_bus_sem);
> >>  
> >> +	if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
> >> +		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");
> >> +
> >>  	return 0;
> >>  
> >>  unregister:

commit 8f8cf239c4f1
Author: Yunsheng Lin <linyunsheng@huawei.com>
Date:   Sat Oct 19 14:45:43 2019 +0800

    PCI: Warn if no host bridge NUMA node info
    
    In pci_call_probe(), we try to run driver probe functions on the node where
    the device is attached.  If we don't know which node the device is attached
    to, the driver will likely run on the wrong node.  This will still work,
    but performance will not be as good as it could be.
    
    On NUMA systems, warn if we don't know which node a PCI host bridge is
    attached to.  This is likely an indication that ACPI didn't supply a _PXM
    method or the DT didn't supply a "numa-node-id" property.
    
    [bhelgaas: commit log, check bus node]
    Link: https://lore.kernel.org/r/1571467543-26125-1-git-send-email-linyunsheng@huawei.com
    Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 3d5271a7a849..40259c38d66a 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -897,6 +897,9 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
 	else
 		pr_info("PCI host bridge to bus %s\n", name);
 
+	if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE)
+		dev_warn(&bus->dev, "Unknown NUMA node; performance will be reduced\n");
+
 	/* Add initial resources to the bus */
 	resource_list_for_each_entry_safe(window, n, &resources) {
 		list_move_tail(&window->node, &bridge->windows);
Michal Hocko Oct. 24, 2019, 9:20 a.m. UTC | #8
On Wed 23-10-19 12:10:39, Bjorn Helgaas wrote:
> On Wed, Oct 23, 2019 at 04:22:43PM +0800, Yunsheng Lin wrote:
> > On 2019/10/23 5:04, Bjorn Helgaas wrote:
> > > On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
> 
> > > I think the underlying problem you're addressing is that:
> > > 
> > >   - NUMA_NO_NODE == -1,
> > >   - dev_to_node(dev) may return NUMA_NO_NODE,
> > >   - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and
> > >   - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference
> > > 
> > > For example, on arm64, mips loongson, s390, and x86,
> > > cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is
> > > an invalid array index.
> > 
> > The invalid array index of -1 is the underlying problem here when
> > cpumask_of_node(dev_to_node(dev)) is called and cpumask_of_node()
> > is not NUMA_NO_NODE aware yet.
> > 
> > In the "numa: make node_to_cpumask_map() NUMA_NO_NODE aware" thread
> > disscusion, it is requested that it is better to warn about the pcie
> > device without a node assigned by the firmware before making the
> > cpumask_of_node() NUMA_NO_NODE aware, so that the system with pci
> > devices of "NUMA_NO_NODE" node can be fixed by their vendor.
> > 
> > See: https://lore.kernel.org/lkml/20191011111539.GX2311@hirez.programming.kicks-ass.net/
> 
> Right.  We should warn if the NUMA node number would help us but DT or
> the firmware didn't give us one.
> 
> But we can do that independently of any cpumask_of_node() changes.
> There's no need to do one patch before the other.  Even if you make
> cpumask_of_node() tolerate NUMA_NO_NODE, we'll still get the warning
> because we're not actually changing any node assignments.

Agreed. And this has been proposed previously I believe but my
understanding was that Petr was against that.
Yunsheng Lin Oct. 24, 2019, 9:39 a.m. UTC | #9
On 2019/10/24 1:10, Bjorn Helgaas wrote:
> On Wed, Oct 23, 2019 at 04:22:43PM +0800, Yunsheng Lin wrote:
>> On 2019/10/23 5:04, Bjorn Helgaas wrote:
>>> On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
>> And it seems a pci device's parent will always set to the bridge
>> device in pci_setup_device(), and device_add() which will set the
>> node to its parent's when the child device' node is NUMA_NO_NODE,
>> maybe we can add the bridge device' node checking to make sure
>> the pci device really does not have a node assigned, as below:
>>
>> if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE &&
>>     dev_to_node(bus->bridge) == NUMA_NO_NODE)
>> 	dev_err(&bus->dev, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");
> 
> Anyway, would the attached patch work for you?  I have it tentatively
> queued up on pci/enumeration for v5.5.
> 

Yes, thanks for making the change when applying.

>>>
>>>
>>>> [1] https://lore.kernel.org/lkml/1568724534-146242-1-git-send-email-linyunsheng@huawei.com/
>>>>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>>> ---
>>>>  drivers/pci/probe.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>>> index 3d5271a..22be96a 100644
>>>> --- a/drivers/pci/probe.c
>>>> +++ b/drivers/pci/probe.c
>>>> @@ -927,6 +927,9 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
>>>>  	list_add_tail(&bus->node, &pci_root_buses);
>>>>  	up_write(&pci_bus_sem);
>>>>  
>>>> +	if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
>>>> +		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");
>>>> +
>>>>  	return 0;
>>>>  
>>>>  unregister:
> 
> commit 8f8cf239c4f1
> Author: Yunsheng Lin <linyunsheng@huawei.com>
> Date:   Sat Oct 19 14:45:43 2019 +0800
> 
>     PCI: Warn if no host bridge NUMA node info
>     
>     In pci_call_probe(), we try to run driver probe functions on the node where
>     the device is attached.  If we don't know which node the device is attached
>     to, the driver will likely run on the wrong node.  This will still work,
>     but performance will not be as good as it could be.
>     
>     On NUMA systems, warn if we don't know which node a PCI host bridge is
>     attached to.  This is likely an indication that ACPI didn't supply a _PXM
>     method or the DT didn't supply a "numa-node-id" property.
>     
>     [bhelgaas: commit log, check bus node]
>     Link: https://lore.kernel.org/r/1571467543-26125-1-git-send-email-linyunsheng@huawei.com
>     Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 3d5271a7a849..40259c38d66a 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -897,6 +897,9 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
>  	else
>  		pr_info("PCI host bridge to bus %s\n", name);
>  
> +	if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE)
> +		dev_warn(&bus->dev, "Unknown NUMA node; performance will be reduced\n");
> +
>  	/* Add initial resources to the bus */
>  	resource_list_for_each_entry_safe(window, n, &resources) {
>  		list_move_tail(&window->node, &bridge->windows);
> 
> .
>
Robin Murphy Oct. 24, 2019, 10:16 a.m. UTC | #10
On 2019-10-23 6:10 pm, Bjorn Helgaas wrote:
> On Wed, Oct 23, 2019 at 04:22:43PM +0800, Yunsheng Lin wrote:
>> On 2019/10/23 5:04, Bjorn Helgaas wrote:
>>> On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
> 
>>> I think the underlying problem you're addressing is that:
>>>
>>>    - NUMA_NO_NODE == -1,
>>>    - dev_to_node(dev) may return NUMA_NO_NODE,
>>>    - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and
>>>    - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference
>>>
>>> For example, on arm64, mips loongson, s390, and x86,
>>> cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is
>>> an invalid array index.
>>
>> The invalid array index of -1 is the underlying problem here when
>> cpumask_of_node(dev_to_node(dev)) is called and cpumask_of_node()
>> is not NUMA_NO_NODE aware yet.
>>
>> In the "numa: make node_to_cpumask_map() NUMA_NO_NODE aware" thread
>> disscusion, it is requested that it is better to warn about the pcie
>> device without a node assigned by the firmware before making the
>> cpumask_of_node() NUMA_NO_NODE aware, so that the system with pci
>> devices of "NUMA_NO_NODE" node can be fixed by their vendor.
>>
>> See: https://lore.kernel.org/lkml/20191011111539.GX2311@hirez.programming.kicks-ass.net/
> 
> Right.  We should warn if the NUMA node number would help us but DT or
> the firmware didn't give us one.
> 
> But we can do that independently of any cpumask_of_node() changes.
> There's no need to do one patch before the other.  Even if you make
> cpumask_of_node() tolerate NUMA_NO_NODE, we'll still get the warning
> because we're not actually changing any node assignments.
> 
>> So maybe change the warning to below:
>>
>> if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE)
>> 	dev_err(&bus->dev, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");
> 
> I think this is perfect and I don't see the need for the refinement
> below:
> 
>> And it seems a pci device's parent will always set to the bridge
>> device in pci_setup_device(), and device_add() which will set the
>> node to its parent's when the child device' node is NUMA_NO_NODE,
>> maybe we can add the bridge device' node checking to make sure
>> the pci device really does not have a node assigned, as below:
>>
>> if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE &&
>>      dev_to_node(bus->bridge) == NUMA_NO_NODE)
>> 	dev_err(&bus->dev, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");
> 
> Anyway, would the attached patch work for you?  I have it tentatively
> queued up on pci/enumeration for v5.5.
> 
>>>> It is possible to
>>>> have a PCI bridge shared between two nodes, such that the PCI
>>>> devices have equidistance. But the moment you scale this out, you
>>>> either get devices that are 'local' to a package while having
>>>> multiple packages, or if you maintain a single bridge in a big
>>>> system, things become so slow it all doesn't matter anyway.
>>>> Assigning a node (one of the shared) is, in the generic ase of
>>>> multiple packages, the better solution over assigning all nodes.
>>>>
>>>> As pci_device_add() will assign the pci device' node according to
>>>> the bus the device is on, which is decided by pcibus_to_node().
>>>> Currently different arch may implement the pcibus_to_node() based
>>>> on bus->sysdata or bus device' node, which has the same node as
>>>> the bridge device.
>>>>
>>>> And for devices behind another bridge case, the child bus device
>>>> is setup with proper parent bus device and inherit its parent'
>>>> sysdata in pci_alloc_child_bus(), so the pcie device under the
>>>> child bus should have the same node as the parent bridge when
>>>> device_add() is called, which will set the node to its parent's
>>>> node when the child device' node is NUMA_NO_NODE.
>>>>
>>>> So this patch only warns about the case when a host bridge device
>>>> is registered with a node of NO_NODE in pci_register_host_bridge().
>>>> And it only warns about that when there are more than one numa
>>>> nodes in the system.
>>>
>>>
>>>> [1] https://lore.kernel.org/lkml/1568724534-146242-1-git-send-email-linyunsheng@huawei.com/
>>>>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>>> ---
>>>>   drivers/pci/probe.c | 3 +++
>>>>   1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>>> index 3d5271a..22be96a 100644
>>>> --- a/drivers/pci/probe.c
>>>> +++ b/drivers/pci/probe.c
>>>> @@ -927,6 +927,9 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
>>>>   	list_add_tail(&bus->node, &pci_root_buses);
>>>>   	up_write(&pci_bus_sem);
>>>>   
>>>> +	if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
>>>> +		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");
>>>> +
>>>>   	return 0;
>>>>   
>>>>   unregister:
> 
> commit 8f8cf239c4f1
> Author: Yunsheng Lin <linyunsheng@huawei.com>
> Date:   Sat Oct 19 14:45:43 2019 +0800
> 
>      PCI: Warn if no host bridge NUMA node info
>      
>      In pci_call_probe(), we try to run driver probe functions on the node where
>      the device is attached.  If we don't know which node the device is attached
>      to, the driver will likely run on the wrong node.  This will still work,
>      but performance will not be as good as it could be.

Is it guaranteed to be purely a performance issue? In other words, is 
there definitely no way a physical node could be disabled via 
idle/hotplug/etc. such that unattributed devices can silently disappear 
while still in use?

>      
>      On NUMA systems, warn if we don't know which node a PCI host bridge is
>      attached to.  This is likely an indication that ACPI didn't supply a _PXM
>      method or the DT didn't supply a "numa-node-id" property.
>      
>      [bhelgaas: commit log, check bus node]
>      Link: https://lore.kernel.org/r/1571467543-26125-1-git-send-email-linyunsheng@huawei.com
>      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 3d5271a7a849..40259c38d66a 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -897,6 +897,9 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
>   	else
>   		pr_info("PCI host bridge to bus %s\n", name);
>   
> +	if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE)
> +		dev_warn(&bus->dev, "Unknown NUMA node; performance will be reduced\n");

I think this still deserves the FW_BUG prefix.

Robin.

> +
>   	/* Add initial resources to the bus */
>   	resource_list_for_each_entry_safe(window, n, &resources) {
>   		list_move_tail(&window->node, &bridge->windows);
>
Bjorn Helgaas Oct. 24, 2019, 5:40 p.m. UTC | #11
On Thu, Oct 24, 2019 at 11:20:13AM +0200, Michal Hocko wrote:
> On Wed 23-10-19 12:10:39, Bjorn Helgaas wrote:
> > On Wed, Oct 23, 2019 at 04:22:43PM +0800, Yunsheng Lin wrote:
> > > On 2019/10/23 5:04, Bjorn Helgaas wrote:
> > > > On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
> > 
> > > > I think the underlying problem you're addressing is that:
> > > > 
> > > >   - NUMA_NO_NODE == -1,
> > > >   - dev_to_node(dev) may return NUMA_NO_NODE,
> > > >   - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and
> > > >   - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference
> > > > 
> > > > For example, on arm64, mips loongson, s390, and x86,
> > > > cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is
> > > > an invalid array index.
> > > 
> > > The invalid array index of -1 is the underlying problem here when
> > > cpumask_of_node(dev_to_node(dev)) is called and cpumask_of_node()
> > > is not NUMA_NO_NODE aware yet.
> > > 
> > > In the "numa: make node_to_cpumask_map() NUMA_NO_NODE aware" thread
> > > disscusion, it is requested that it is better to warn about the pcie
> > > device without a node assigned by the firmware before making the
> > > cpumask_of_node() NUMA_NO_NODE aware, so that the system with pci
> > > devices of "NUMA_NO_NODE" node can be fixed by their vendor.
> > > 
> > > See: https://lore.kernel.org/lkml/20191011111539.GX2311@hirez.programming.kicks-ass.net/
> > 
> > Right.  We should warn if the NUMA node number would help us but DT or
> > the firmware didn't give us one.
> > 
> > But we can do that independently of any cpumask_of_node() changes.
> > There's no need to do one patch before the other.  Even if you make
> > cpumask_of_node() tolerate NUMA_NO_NODE, we'll still get the warning
> > because we're not actually changing any node assignments.
> 
> Agreed. And this has been proposed previously I believe but my
> understanding was that Petr was against that.

I don't think Peter was opposed to a warning.  I assume you're
referring to [1], which is about how cpumask_of_node() should work.
That's not my area, so I don't have a strong opinion.  From that
discussion:

Yunsheng> From what I can see, the problem can be fixed in three
Yunsheng> place:
Yunsheng> 1. Make user dev_to_node return a valid node id
Yunsheng>    even when proximity domain is not set by bios(or node id
Yunsheng>    set by buggy bios is not valid), which may need info from
Yunsheng>    the numa system to make sure it will return a valid node.
Yunsheng>
Yunsheng> 2. User that call cpumask_of_node should ensure the node
Yunsheng>    id is valid before calling cpumask_of_node, and user also
Yunsheng>    need some info to make ensure node id is valid.
Yunsheng>
Yunsheng> 3. Make sure cpumask_of_node deal with invalid node id as
Yunsheng>    this patchset.

Peter> 1) because even it is not set, the device really does belong
Peter> to a node.  It is impossible a device will have magic
Peter> uniform access to memory when CPUs cannot.
Peter> 
Peter> 2) is already true today, cpumask_of_node() requires a valid
Peter> node_id.
Peter> 
Peter> 3) is just wrong and increases overhead for everyone.

I think Peter is advocating (1), i.e., if firmware doesn't tell us a
node ID, we just pick one.  We can certainly do that in *addition* to
adding a warning.  I'd like it to be a separate patch because I think
we want the warning no matter what so users have some clue that
performance could be better.

If we pick one, I guess we either assign some default, like node 0, to
everything; or we somehow pick a random node to assign.

FWIW, (3) might be wrong, but it is what alpha, ia64, mips (ip27),
powerpc, sparc do already.  It'd be nice if they were all the same one
way or the other.

[1] https://lore.kernel.org/linux-arm-kernel/20190831161247.GM2369@hirez.programming.kicks-ass.net/
Michal Hocko Oct. 25, 2019, 8:16 a.m. UTC | #12
On Thu 24-10-19 12:40:13, Bjorn Helgaas wrote:
> On Thu, Oct 24, 2019 at 11:20:13AM +0200, Michal Hocko wrote:
> > On Wed 23-10-19 12:10:39, Bjorn Helgaas wrote:
> > > On Wed, Oct 23, 2019 at 04:22:43PM +0800, Yunsheng Lin wrote:
> > > > On 2019/10/23 5:04, Bjorn Helgaas wrote:
> > > > > On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
> > > 
> > > > > I think the underlying problem you're addressing is that:
> > > > > 
> > > > >   - NUMA_NO_NODE == -1,
> > > > >   - dev_to_node(dev) may return NUMA_NO_NODE,
> > > > >   - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and
> > > > >   - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference
> > > > > 
> > > > > For example, on arm64, mips loongson, s390, and x86,
> > > > > cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is
> > > > > an invalid array index.
> > > > 
> > > > The invalid array index of -1 is the underlying problem here when
> > > > cpumask_of_node(dev_to_node(dev)) is called and cpumask_of_node()
> > > > is not NUMA_NO_NODE aware yet.
> > > > 
> > > > In the "numa: make node_to_cpumask_map() NUMA_NO_NODE aware" thread
> > > > disscusion, it is requested that it is better to warn about the pcie
> > > > device without a node assigned by the firmware before making the
> > > > cpumask_of_node() NUMA_NO_NODE aware, so that the system with pci
> > > > devices of "NUMA_NO_NODE" node can be fixed by their vendor.
> > > > 
> > > > See: https://lore.kernel.org/lkml/20191011111539.GX2311@hirez.programming.kicks-ass.net/
> > > 
> > > Right.  We should warn if the NUMA node number would help us but DT or
> > > the firmware didn't give us one.
> > > 
> > > But we can do that independently of any cpumask_of_node() changes.
> > > There's no need to do one patch before the other.  Even if you make
> > > cpumask_of_node() tolerate NUMA_NO_NODE, we'll still get the warning
> > > because we're not actually changing any node assignments.
> > 
> > Agreed. And this has been proposed previously I believe but my
> > understanding was that Petr was against that.
> 
> I don't think Peter was opposed to a warning.

Now, he was opposed to cpumask_of_node handling IIRC.

> I assume you're
> referring to [1], which is about how cpumask_of_node() should work.
> That's not my area, so I don't have a strong opinion.  From that
> discussion:
> 
> Yunsheng> From what I can see, the problem can be fixed in three
> Yunsheng> place:
> Yunsheng> 1. Make user dev_to_node return a valid node id
> Yunsheng>    even when proximity domain is not set by bios(or node id
> Yunsheng>    set by buggy bios is not valid), which may need info from
> Yunsheng>    the numa system to make sure it will return a valid node.
> Yunsheng>
> Yunsheng> 2. User that call cpumask_of_node should ensure the node
> Yunsheng>    id is valid before calling cpumask_of_node, and user also
> Yunsheng>    need some info to make ensure node id is valid.
> Yunsheng>
> Yunsheng> 3. Make sure cpumask_of_node deal with invalid node id as
> Yunsheng>    this patchset.
> 
> Peter> 1) because even it is not set, the device really does belong
> Peter> to a node.  It is impossible a device will have magic
> Peter> uniform access to memory when CPUs cannot.
> Peter> 
> Peter> 2) is already true today, cpumask_of_node() requires a valid
> Peter> node_id.
> Peter> 
> Peter> 3) is just wrong and increases overhead for everyone.
> 
> I think Peter is advocating (1), i.e., if firmware doesn't tell us a
> node ID, we just pick one.  We can certainly do that in *addition* to
> adding a warning.  I'd like it to be a separate patch because I think
> we want the warning no matter what so users have some clue that
> performance could be better.

Yes, those should be two different patches.

> If we pick one, I guess we either assign some default, like node 0, to
> everything; or we somehow pick a random node to assign.

I have tried to explain that picking a default node is tricky because
node 0 is not generally available and you never know when a node might
just disappear if the device is not bound to it.

I really do not see why the proposed online_cpu_mask for that case is
such a big deal. It will likely lead to suboptimal performance but what
do you expect from a suboptimal HW description. There is no question
that the proper node affinity should be set and the warning might really
help here but trying to be clever and find a replacement sounds like
potential for more subtly broken results than doing a straightforward
thing.

But I will just go silent now because I am just repeating myself.
Yunsheng Lin Oct. 25, 2019, 8:51 a.m. UTC | #13
On 2019/10/25 16:16, Michal Hocko wrote:
> On Thu 24-10-19 12:40:13, Bjorn Helgaas wrote:
>> On Thu, Oct 24, 2019 at 11:20:13AM +0200, Michal Hocko wrote:
>>> On Wed 23-10-19 12:10:39, Bjorn Helgaas wrote:
>>>> On Wed, Oct 23, 2019 at 04:22:43PM +0800, Yunsheng Lin wrote:
>>>>> On 2019/10/23 5:04, Bjorn Helgaas wrote:
>>>>>> On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
>>>>
>>>>>> I think the underlying problem you're addressing is that:
>>>>>>
>>>>>>   - NUMA_NO_NODE == -1,
>>>>>>   - dev_to_node(dev) may return NUMA_NO_NODE,
>>>>>>   - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and
>>>>>>   - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference
>>>>>>
>>>>>> For example, on arm64, mips loongson, s390, and x86,
>>>>>> cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is
>>>>>> an invalid array index.
>>>>>
>>>>> The invalid array index of -1 is the underlying problem here when
>>>>> cpumask_of_node(dev_to_node(dev)) is called and cpumask_of_node()
>>>>> is not NUMA_NO_NODE aware yet.
>>>>>
>>>>> In the "numa: make node_to_cpumask_map() NUMA_NO_NODE aware" thread
>>>>> disscusion, it is requested that it is better to warn about the pcie
>>>>> device without a node assigned by the firmware before making the
>>>>> cpumask_of_node() NUMA_NO_NODE aware, so that the system with pci
>>>>> devices of "NUMA_NO_NODE" node can be fixed by their vendor.
>>>>>
>>>>> See: https://lore.kernel.org/lkml/20191011111539.GX2311@hirez.programming.kicks-ass.net/
>>>>
>>>> Right.  We should warn if the NUMA node number would help us but DT or
>>>> the firmware didn't give us one.
>>>>
>>>> But we can do that independently of any cpumask_of_node() changes.
>>>> There's no need to do one patch before the other.  Even if you make
>>>> cpumask_of_node() tolerate NUMA_NO_NODE, we'll still get the warning
>>>> because we're not actually changing any node assignments.
>>>
>>> Agreed. And this has been proposed previously I believe but my
>>> understanding was that Petr was against that.
>>
>> I don't think Peter was opposed to a warning.
> 
> Now, he was opposed to cpumask_of_node handling IIRC.

That was my understanding too.

But I am not sure if Peter is still opposed to cpumask_of_node() after
the pcie device without node affinity is warned in this patch.


From previous discussion [1]:

>Yunsheng> But I failed to see why the above is related to making node_to_cpumask_map()
>Yunsheng> NUMA_NO_NODE aware?

Peter> Your initial bug is for hns3, which is a PCI device, which really _MUST_
Peter> have a node assigned.

Peter> It not having one, is a straight up bug. We must not silently accept
Peter> NO_NODE there, ever."

I think Peter is not opposed to cpumask_of_node() after this patch if I
understand it correctly.


[1] https://lore.kernel.org/lkml/20191011111539.GX2311@hirez.programming.kicks-ass.net/

> 
>> I assume you're
>> referring to [1], which is about how cpumask_of_node() should work.
>> That's not my area, so I don't have a strong opinion.  From that
>> discussion:
>>
>> Yunsheng> From what I can see, the problem can be fixed in three
>> Yunsheng> place:
>> Yunsheng> 1. Make user dev_to_node return a valid node id
>> Yunsheng>    even when proximity domain is not set by bios(or node id
>> Yunsheng>    set by buggy bios is not valid), which may need info from
>> Yunsheng>    the numa system to make sure it will return a valid node.
>> Yunsheng>
>> Yunsheng> 2. User that call cpumask_of_node should ensure the node
>> Yunsheng>    id is valid before calling cpumask_of_node, and user also
>> Yunsheng>    need some info to make ensure node id is valid.
>> Yunsheng>
>> Yunsheng> 3. Make sure cpumask_of_node deal with invalid node id as
>> Yunsheng>    this patchset.
>>
>> Peter> 1) because even it is not set, the device really does belong
>> Peter> to a node.  It is impossible a device will have magic
>> Peter> uniform access to memory when CPUs cannot.
>> Peter> 
>> Peter> 2) is already true today, cpumask_of_node() requires a valid
>> Peter> node_id.
>> Peter> 
>> Peter> 3) is just wrong and increases overhead for everyone.
>>
>> I think Peter is advocating (1), i.e., if firmware doesn't tell us a
>> node ID, we just pick one.  We can certainly do that in *addition* to
>> adding a warning.  I'd like it to be a separate patch because I think
>> we want the warning no matter what so users have some clue that
>> performance could be better.
> 
> Yes, those should be two different patches.
> 
>> If we pick one, I guess we either assign some default, like node 0, to
>> everything; or we somehow pick a random node to assign.
> 
> I have tried to explain that picking a default node is tricky because
> node 0 is not generally available and you never know when a node might
> just disappear if the device is not bound to it.

Maybe the example you already mentioned in previous discussion may help
here:

3e8589963773 ("memcg: make it work on sparse non-0-node systems")


> 
> I really do not see why the proposed online_cpu_mask for that case is
> such a big deal. It will likely lead to suboptimal performance but what
> do you expect from a suboptimal HW description. There is no question
> that the proper node affinity should be set and the warning might really
> help here but trying to be clever and find a replacement sounds like
> potential for more subtly broken results than doing a straightforward
> thing.
> 
> But I will just go silent now because I am just repeating myself.
>
Bjorn Helgaas Oct. 25, 2019, 12:51 p.m. UTC | #14
On Thu, Oct 24, 2019 at 11:16:41AM +0100, Robin Murphy wrote:
> On 2019-10-23 6:10 pm, Bjorn Helgaas wrote:

> >      PCI: Warn if no host bridge NUMA node info
> >      In pci_call_probe(), we try to run driver probe functions on the node where
> >      the device is attached.  If we don't know which node the device is attached
> >      to, the driver will likely run on the wrong node.  This will still work,
> >      but performance will not be as good as it could be.
> 
> Is it guaranteed to be purely a performance issue? In other words, is there
> definitely no way a physical node could be disabled via idle/hotplug/etc.
> such that unattributed devices can silently disappear while still in use?

I think so.  At least, if it's more than a performance issue, I have
no idea what sort of problem might happen or how to deal with it.

> > @@ -897,6 +897,9 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
> >   	else
> >   		pr_info("PCI host bridge to bus %s\n", name);
> > +	if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE)
> > +		dev_warn(&bus->dev, "Unknown NUMA node; performance will be reduced\n");
> 
> I think this still deserves the FW_BUG prefix.

Putting the warning here in pci_register_host_bridge() is convenient
for now but doesn't seem like the ideal place.

I'd rather have the warning at the point where we get the node number,
e.g., in pci_acpi_root_get_node() or of_node_to_nid(), where we would
know what's actually required by spec and we could point to the
specific ACPI device or DT device node that's broken.  Then I think
we'd have a better case for using FW_BUG.

I'm a little hesitant to use FW_BUG here in pci_register_host_bridge()
because we don't know where the node number was supposed to come from,
so we can't reliably determine that the lack of one is a bug.

Bjorn

Patch
diff mbox series

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 3d5271a..22be96a 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -927,6 +927,9 @@  static int pci_register_host_bridge(struct pci_host_bridge *bridge)
 	list_add_tail(&bus->node, &pci_root_buses);
 	up_write(&pci_bus_sem);
 
+	if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
+		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");
+
 	return 0;
 
 unregister: