[net-next,4/8] devlink: allow subports on devlink PCI ports

Message ID 20190226182436.23811-5-jakub.kicinski@netronome.com
State Changes Requested
Delegated to: David Miller
Series devlink: add PF and VF port flavours

Commit Message

Jakub Kicinski Feb. 26, 2019, 6:24 p.m. UTC
A PCI endpoint corresponds to a PCI device, but such a device
can have one or more logical device ports associated with it.
We need a way to distinguish those. Add a PCI subport to the
dumps and print the info in phys_port_name appropriately.

This is not equivalent to port splitting, there is no split
group. It's just a way of representing multiple netdevs on
a single PCI function.

Note that the multiport property pertains only to the PCI
function itself. A PF having multiple netdevs does not mean
that its VFs will also have multiple, or that VFs are
associated with any particular port of a multiport PF.

Example (bus 05 device has subports, bus 82 has only one port per
function):

$ devlink port
pci/0000:05:00.0/0: type eth netdev enp5s0np0 flavour physical
pci/0000:05:00.0/10000: type eth netdev enp5s0npf0s0 flavour pci_pf pf 0 subport 0
pci/0000:05:00.0/4: type eth netdev enp5s0np1 flavour physical
pci/0000:05:00.0/11000: type eth netdev enp5s0npf0s1 flavour pci_pf pf 0 subport 1
pci/0000:82:00.0/0: type eth netdev p4p1 flavour physical
pci/0000:82:00.0/10000: type eth netdev eth0 flavour pci_pf pf 0

$ devlink -jp port
{
    "port": {
        "pci/0000:05:00.0/0": {
            "type": "eth",
            "netdev": "enp5s0np0",
            "flavour": "physical"
        },
        "pci/0000:05:00.0/10000": {
            "type": "eth",
            "netdev": "enp5s0npf0s0",
            "flavour": "pci_pf",
            "pf": 0,
            "subport": 0
        },
        "pci/0000:05:00.0/4": {
            "type": "eth",
            "netdev": "enp5s0np1",
            "flavour": "physical"
        },
        "pci/0000:05:00.0/11000": {
            "type": "eth",
            "netdev": "enp5s0npf0s1",
            "flavour": "pci_pf",
            "pf": 0,
            "subport": 1
        },
        "pci/0000:82:00.0/0": {
            "type": "eth",
            "netdev": "p4p1",
            "flavour": "physical"
        },
        "pci/0000:82:00.0/10000": {
            "type": "eth",
            "netdev": "eth0",
            "flavour": "pci_pf",
            "pf": 0
        }
    }
}

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 .../net/ethernet/netronome/nfp/nfp_devlink.c  |  3 +-
 include/net/devlink.h                         |  9 ++++--
 include/uapi/linux/devlink.h                  |  1 +
 net/core/devlink.c                            | 29 ++++++++++++++++---
 4 files changed, 35 insertions(+), 7 deletions(-)
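
For reference, a minimal sketch of how a driver might use the extended
helper (the devlink_port_attrs_pci_set() signature is the one added by
this patch; the function name, the port index and the PF/subport values
below are assumptions chosen to mirror the enp5s0npf0s1 example above):

#include <net/devlink.h>

/* Minimal sketch, not part of the patch: register the second netdev of
 * a multiport PF as a devlink port.  The port index 11000 and the
 * pf/subport numbers are example values only.
 */
static int example_register_pf_subport(struct devlink *dl,
				       struct devlink_port *dl_port,
				       struct net_device *netdev)
{
	int err;

	err = devlink_port_register(dl, dl_port, 11000);
	if (err)
		return err;

	devlink_port_type_eth_set(dl_port, netdev);
	/* PF 0, vf_number ignored for a PF, multiport function, subport 1 */
	devlink_port_attrs_pci_set(dl_port, DEVLINK_PORT_FLAVOUR_PCI_PF,
				   0, 0, true, 1);
	return 0;
}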

Comments

Jiri Pirko Feb. 27, 2019, 12:37 p.m. UTC | #1
Tue, Feb 26, 2019 at 07:24:32PM CET, jakub.kicinski@netronome.com wrote:
>PCI endpoint corresponds to a PCI device, but such device
>can have one more more logical device ports associated with it.
>We need a way to distinguish those. Add a PCI subport in the
>dumps and print the info in phys_port_name appropriately.
>
>This is not equivalent to port splitting, there is no split
>group. It's just a way of representing multiple netdevs on
>a single PCI function.
>
>Note that the quality of being multiport pertains only to
>the PCI function itself. A PF having multiple netdevs does
>not mean that its VFs will also have multiple, or that VFs
>are associated with any particular port of a multiport VF.
>

We've been discussing the problem of subports (we call them "subfunctions"
or "SFs") for some time internally. It turned out this is probably a harder
task to model than it seems. Please prove me wrong.

The nature of a VF makes it a logically separate entity. It has a separate
PCI address and should therefore have a separate devlink instance.
You can pass it through to a VM; then the same devlink instance should be
created inside the VM and disappear from the host.

An SF (or subport) feels similar to that. Basically it is exactly the same
thing as a VF, only it resides under the PF PCI function.

That is why I think, for the sake of consistency, it should have a separate
devlink entity as well. The problem is correct sysfs modelling and the
devlink handle derived from that. Parav is working on a simple soft
bus for this purpose called "subbus". There is an RFC floating around on
the Mellanox internal mailing list; looks like it is time to send it
upstream.

Then each PF driver which has SFs would register subbus devices
for its SFs/subports, and they would be properly handled by bus
probe, with devlink, devlink port and netdev instances created.

Ccing Parav and Jason.
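
A purely hypothetical sketch of what such a soft-bus registration could
look like; no "subbus" exists upstream at this point, so every name and
field below is invented for illustration, and only the driver-model
calls (device_register() and friends) are the real kernel API:

#include <linux/device.h>
#include <linux/err.h>
#include <linux/slab.h>

/* Hypothetical illustration of the "subbus" idea; all names are invented. */
struct subbus_device {
	struct device dev;
	u32 sf_number;		/* which SF/subport of the parent PF */
};

static void subbus_dev_release(struct device *dev)
{
	kfree(container_of(dev, struct subbus_device, dev));
}

static struct bus_type subbus_type = {
	.name = "subbus",
};

/* A PF driver would call something like this for each SF/subport it
 * exposes; the SF driver then binds through the normal bus match/probe
 * path and can register its own devlink instance, devlink port and
 * netdev.
 */
static struct subbus_device *subbus_add_sf(struct device *parent, u32 sf)
{
	struct subbus_device *sfdev;
	int err;

	sfdev = kzalloc(sizeof(*sfdev), GFP_KERNEL);
	if (!sfdev)
		return ERR_PTR(-ENOMEM);

	sfdev->sf_number = sf;
	sfdev->dev.parent = parent;
	sfdev->dev.bus = &subbus_type;
	sfdev->dev.release = subbus_dev_release;
	dev_set_name(&sfdev->dev, "%s.sf%u", dev_name(parent), sf);

	err = device_register(&sfdev->dev);
	if (err) {
		put_device(&sfdev->dev);
		return ERR_PTR(err);
	}
	return sfdev;
}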
Jakub Kicinski Feb. 27, 2019, 6:30 p.m. UTC | #2
On Wed, 27 Feb 2019 13:37:53 +0100, Jiri Pirko wrote:
> Tue, Feb 26, 2019 at 07:24:32PM CET, jakub.kicinski@netronome.com wrote:
> >PCI endpoint corresponds to a PCI device, but such device
> >can have one more more logical device ports associated with it.
> >We need a way to distinguish those. Add a PCI subport in the
> >dumps and print the info in phys_port_name appropriately.
> >
> >This is not equivalent to port splitting, there is no split
> >group. It's just a way of representing multiple netdevs on
> >a single PCI function.
> >
> >Note that the quality of being multiport pertains only to
> >the PCI function itself. A PF having multiple netdevs does
> >not mean that its VFs will also have multiple, or that VFs
> >are associated with any particular port of a multiport VF.
> 
> We've been discussing the problem of subport (we call it "subfunction"
> or "SF") for some time internally. Turned out, this is probably harder
> task to model. Please prove me wrong.
> 
> The nature of VF makes it a logically separate entity. It has a separate
> PCI address, it should therefore have a separate devlink instance.
> You can pass it through to VM, then the same devlink instance should be
> created inside the VM and disappear from the host.

Depends what a devlink instance represents :/  On one hand you may want
to create an instance for a VF to allow it to spawn soft ports, on the
other you may want to group multiple functions together.

IOW if devlink instance is for an ASIC, there should be one per device
per host.  So if we start connecting multiple functions (PFs and/or VFs)
to one host we should probably introduce the notion of devlink aliases
or some such (so that multiple bus addresses can target the same
devlink instance).  Those less pipelined NICs can forward between
ports, but still want a function per port (otherwise user space
sometimes gets confused).  If we have multiple functions which are on
the same "switchid" they should have a single devlink instance if you
ask me.  That instance will have all the ports of the device.

You say disappear from the host - what do you mean?  Are you referring
to the VF port disappearing?  But on the switch the port is still
there, and you should show the subports on the PF side IMHO.  Devlink
ports should allow users to understand the topology of the switch.

Is spawning VMDq sub-instances the only thing we can think of that VMs
may want to do?  Are there any other uses?

> SF (or subport) feels similar to that. Basically it is exactly the same
> thing as VF, only does reside under PF PCI function.
> 
> That is why I think, for sake of consistency, it should have a separate
> devlink entity as well. The problem is correct sysfs modelling and
> devlink handle derived from that. Parav is working on a simple soft
> bus for this purpose called "subbus". There is a RFC floating around on
> Mellanox internal mailing list, looks like it is time to send it
> upstream.
> 
> Then each PF driver which have SFs would register subbus devices
> according to SFs/subports and they would be properly handled by bus
> probe, devlink and devlink port and netdev instances created.
> 
> Ccing Parav and Jason.

You guys come from the RDMA side of the world, with which I'm less
familiar, and the soft bus + spawning devices seems to be a popular
design there.  Could you describe the advantages of that model for 
the sake of the netdev-only folks? :)

Another term that gets thrown into the mix here is mediated devices,
right?  If you wanna pass the sub-spawn-soft-port to a VM.  Or run 
DPDK on some queues.

To state the obvious, AF_XDP and macvlan offload were our previous
answers to some of those use cases.  What is the forwarding model
for those subports?  Are we going to allow flower rules from VMs?
Is it going to be dst MAC only?  Or is the hypervisor going to forward
as it sees appropriate (OvS + "repr"/port netdev)?
Jiri Pirko Feb. 28, 2019, 8:56 a.m. UTC | #3
Wed, Feb 27, 2019 at 07:30:00PM CET, jakub.kicinski@netronome.com wrote:
>On Wed, 27 Feb 2019 13:37:53 +0100, Jiri Pirko wrote:
>> Tue, Feb 26, 2019 at 07:24:32PM CET, jakub.kicinski@netronome.com wrote:
>> >PCI endpoint corresponds to a PCI device, but such device
>> >can have one more more logical device ports associated with it.
>> >We need a way to distinguish those. Add a PCI subport in the
>> >dumps and print the info in phys_port_name appropriately.
>> >
>> >This is not equivalent to port splitting, there is no split
>> >group. It's just a way of representing multiple netdevs on
>> >a single PCI function.
>> >
>> >Note that the quality of being multiport pertains only to
>> >the PCI function itself. A PF having multiple netdevs does
>> >not mean that its VFs will also have multiple, or that VFs
>> >are associated with any particular port of a multiport VF.
>> 
>> We've been discussing the problem of subport (we call it "subfunction"
>> or "SF") for some time internally. Turned out, this is probably harder
>> task to model. Please prove me wrong.
>> 
>> The nature of VF makes it a logically separate entity. It has a separate
>> PCI address, it should therefore have a separate devlink instance.
>> You can pass it through to VM, then the same devlink instance should be
>> created inside the VM and disappear from the host.
>
>Depends what a devlink instance represents :/  On one hand you may want
>to create an instance for a VF to allow it to spawn soft ports, on the
>other you may want to group multiple functions together.
>
>IOW if devlink instance is for an ASIC, there should be one per device
>per host.  So if we start connecting multiple functions (PFs and/or VFs)
>to one host we should probably introduce the notion of devlink aliases
>or some such (so that multiple bus addresses can target the same

Hmm. Like a VF address -> PF address alias? It would be confusing to see
eswitch ports under a VF devlink instance... I probably did not get you
right.


>devlink instance).  Those less pipelined NICs can forward between
>ports, but still want a function per port (otherwise user space
>sometimes gets confused).  If we have multiple functions which are on
>the same "switchid" they should have a single devlink instance if you
>ask me.  That instance will have all the ports of the device.

Okay, that makes sense. But the question is, can the same devlink
instance contain ports that do not have a switchid?

I think it would be beneficial to have the switchid shown for devlink
ports too. Then it is clear that the devlink ports with the same
switchid belong to the same switch, and other ports under the same
devlink instance (like the PF itself) are separate, but still under the
same ASIC.


>
>You say disappear from the host - what do you mean.  Are you referring
>to the VF port disappearing?  But on the switch the port is still

No, the VF itself. The eswitch port will still be there on the host.


>there, and you should show the subports on the PF side IMHO.  Devlink
>ports should allow users to understand the topology of the switch.

What do you mean by "topology"?


>
>Is spawning VMDq sub-instances the only thing we can think of that VMs
>may want to do?  Are there any other uses?
>
>> SF (or subport) feels similar to that. Basically it is exactly the same
>> thing as VF, only does reside under PF PCI function.
>> 
>> That is why I think, for sake of consistency, it should have a separate
>> devlink entity as well. The problem is correct sysfs modelling and
>> devlink handle derived from that. Parav is working on a simple soft
>> bus for this purpose called "subbus". There is a RFC floating around on
>> Mellanox internal mailing list, looks like it is time to send it
>> upstream.
>> 
>> Then each PF driver which have SFs would register subbus devices
>> according to SFs/subports and they would be properly handled by bus
>> probe, devlink and devlink port and netdev instances created.
>> 
>> Ccing Parav and Jason.
>
>You guys come from the RDMA side of the world, with which I'm less
>familiar, and the soft bus + spawning devices seems to be a popular
>design there.  Could you describe the advantages of that model for 
>the sake of the netdev-only folks? :)

I'll try to draw some ascii art :)

>
>Another term that gets thrown into the mix here is mediated devices,
>right?  If you wanna pass the sub-spawn-soft-port to a VM.  Or run 
>DPDK on some queues.
>
>To state the obvious AF_XDP and macvlan offload were are previous
>answers to some of those use cases.  What is the forwarding model
>for those subports?  Are we going to allow flower rules from VMs?
>Is it going to be dst MAC only?  Or is the hypervisor going to forward
>as it sees appropriate (OvS + "repr"/port netdev)?
Jiri Pirko Feb. 28, 2019, 1:32 p.m. UTC | #4
Thu, Feb 28, 2019 at 09:56:24AM CET, jiri@resnulli.us wrote:

[...]


>>devlink instance).  Those less pipelined NICs can forward between
>>ports, but still want a function per port (otherwise user space
>>sometimes gets confused).  If we have multiple functions which are on
>>the same "switchid" they should have a single devlink instance if you
>>ask me.  That instance will have all the ports of the device.
>
>Okay, that makes sense. But the question it, can the same devlink
>instance contain ports that does not have "Switchid"?
>
>I think it would be beneficial to have the switchid shown for devlink
>ports too. Then it is clean that the devlink ports with the same
>switchid belong to the same switch, and other ports under the same
>devlink instance (like PF itself) is separate, but still under the same
>ASIC.

Working on this. Will send the patchset later today/tomorrow.

[...]
Jakub Kicinski Feb. 28, 2019, 4:24 p.m. UTC | #5
On Thu, 28 Feb 2019 09:56:24 +0100, Jiri Pirko wrote:
> Wed, Feb 27, 2019 at 07:30:00PM CET, jakub.kicinski@netronome.com wrote:
> >On Wed, 27 Feb 2019 13:37:53 +0100, Jiri Pirko wrote:  
> >> Tue, Feb 26, 2019 at 07:24:32PM CET, jakub.kicinski@netronome.com wrote:  
> >> >PCI endpoint corresponds to a PCI device, but such device
> >> >can have one more more logical device ports associated with it.
> >> >We need a way to distinguish those. Add a PCI subport in the
> >> >dumps and print the info in phys_port_name appropriately.
> >> >
> >> >This is not equivalent to port splitting, there is no split
> >> >group. It's just a way of representing multiple netdevs on
> >> >a single PCI function.
> >> >
> >> >Note that the quality of being multiport pertains only to
> >> >the PCI function itself. A PF having multiple netdevs does
> >> >not mean that its VFs will also have multiple, or that VFs
> >> >are associated with any particular port of a multiport VF.  
> >> 
> >> We've been discussing the problem of subport (we call it "subfunction"
> >> or "SF") for some time internally. Turned out, this is probably harder
> >> task to model. Please prove me wrong.
> >> 
> >> The nature of VF makes it a logically separate entity. It has a separate
> >> PCI address, it should therefore have a separate devlink instance.
> >> You can pass it through to VM, then the same devlink instance should be
> >> created inside the VM and disappear from the host.  
> >
> >Depends what a devlink instance represents :/  On one hand you may want
> >to create an instance for a VF to allow it to spawn soft ports, on the
> >other you may want to group multiple functions together.
> >
> >IOW if devlink instance is for an ASIC, there should be one per device
> >per host.  So if we start connecting multiple functions (PFs and/or VFs)
> >to one host we should probably introduce the notion of devlink aliases
> >or some such (so that multiple bus addresses can target the same  
> 
> Hmm. Like VF address -> PF address alias? That would be confusing to see
> eswitch ports under VF devlink instance... I probably did not get you
> right.

No eswitch ports under the VF, more the multi-PF case.  Bus addresses of
all PFs aliasing to the same devlink instance.

> >devlink instance).  Those less pipelined NICs can forward between
> >ports, but still want a function per port (otherwise user space
> >sometimes gets confused).  If we have multiple functions which are on
> >the same "switchid" they should have a single devlink instance if you
> >ask me.  That instance will have all the ports of the device.  
> 
> Okay, that makes sense. But the question it, can the same devlink
> instance contain ports that does not have "Switchid"?

No strong preference if the switchid is different.  To me devlink is an ASIC
instance; if the multiport card is constructed by copy-pasting the same
IP twice onto a die, and the ports really are completely separate, there
is no reason to require a single devlink instance.

> I think it would be beneficial to have the switchid shown for devlink
> ports too. Then it is clean that the devlink ports with the same
> switchid belong to the same switch, and other ports under the same
> devlink instance (like PF itself) is separate, but still under the same
> ASIC.

Sure, you mean in terms of UI - user space can do a link dump or get
that from sysfs, right?
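
(A small illustration of the sysfs option: the switch id of a port
netdev is exposed as a hex string in
/sys/class/net/<ifname>/phys_switch_id.  The interface name below is
just an example value taken from the commit message.)

#include <stdio.h>

/* Illustration only: read the switch id of a port netdev from sysfs. */
int main(void)
{
	char id[128];
	FILE *f = fopen("/sys/class/net/enp5s0np0/phys_switch_id", "r");

	if (!f) {
		perror("phys_switch_id");
		return 1;
	}
	if (fgets(id, sizeof(id), f))
		printf("switch id: %s", id);
	fclose(f);
	return 0;
}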

> >You say disappear from the host - what do you mean.  Are you referring
> >to the VF port disappearing?  But on the switch the port is still  
> 
> No, VF itself. eswitch port will be still there on the host.
> 
> 
> >there, and you should show the subports on the PF side IMHO.  Devlink
> >ports should allow users to understand the topology of the switch.  
> 
> What do you mean by "topology"?

Mostly which ports are part of the switch and what's their "flavour".
Also (less importantly) which host netdevs are "peers" of eswitch ports.

> >Is spawning VMDq sub-instances the only thing we can think of that VMs
> >may want to do?  Are there any other uses?
> >  
> >> SF (or subport) feels similar to that. Basically it is exactly the same
> >> thing as VF, only does reside under PF PCI function.
> >> 
> >> That is why I think, for sake of consistency, it should have a separate
> >> devlink entity as well. The problem is correct sysfs modelling and
> >> devlink handle derived from that. Parav is working on a simple soft
> >> bus for this purpose called "subbus". There is a RFC floating around on
> >> Mellanox internal mailing list, looks like it is time to send it
> >> upstream.
> >> 
> >> Then each PF driver which have SFs would register subbus devices
> >> according to SFs/subports and they would be properly handled by bus
> >> probe, devlink and devlink port and netdev instances created.
> >> 
> >> Ccing Parav and Jason.  
> >
> >You guys come from the RDMA side of the world, with which I'm less
> >familiar, and the soft bus + spawning devices seems to be a popular
> >design there.  Could you describe the advantages of that model for 
> >the sake of the netdev-only folks? :)  
> 
> I'll try to draw some ascii art :)

Yess :)

> >Another term that gets thrown into the mix here is mediated devices,
> >right?  If you wanna pass the sub-spawn-soft-port to a VM.  Or run 
> >DPDK on some queues.
> >
> >To state the obvious AF_XDP and macvlan offload were are previous
> >answers to some of those use cases.  What is the forwarding model
> >for those subports?  Are we going to allow flower rules from VMs?
> >Is it going to be dst MAC only?  Or is the hypervisor going to forward
> >as it sees appropriate (OvS + "repr"/port netdev)?
Jiri Pirko March 1, 2019, 7:25 a.m. UTC | #6
Thu, Feb 28, 2019 at 05:24:04PM CET, jakub.kicinski@netronome.com wrote:
>On Thu, 28 Feb 2019 09:56:24 +0100, Jiri Pirko wrote:
>> Wed, Feb 27, 2019 at 07:30:00PM CET, jakub.kicinski@netronome.com wrote:
>> >On Wed, 27 Feb 2019 13:37:53 +0100, Jiri Pirko wrote:  
>> >> Tue, Feb 26, 2019 at 07:24:32PM CET, jakub.kicinski@netronome.com wrote:  
>> >> >PCI endpoint corresponds to a PCI device, but such device
>> >> >can have one more more logical device ports associated with it.
>> >> >We need a way to distinguish those. Add a PCI subport in the
>> >> >dumps and print the info in phys_port_name appropriately.
>> >> >
>> >> >This is not equivalent to port splitting, there is no split
>> >> >group. It's just a way of representing multiple netdevs on
>> >> >a single PCI function.
>> >> >
>> >> >Note that the quality of being multiport pertains only to
>> >> >the PCI function itself. A PF having multiple netdevs does
>> >> >not mean that its VFs will also have multiple, or that VFs
>> >> >are associated with any particular port of a multiport VF.  
>> >> 
>> >> We've been discussing the problem of subport (we call it "subfunction"
>> >> or "SF") for some time internally. Turned out, this is probably harder
>> >> task to model. Please prove me wrong.
>> >> 
>> >> The nature of VF makes it a logically separate entity. It has a separate
>> >> PCI address, it should therefore have a separate devlink instance.
>> >> You can pass it through to VM, then the same devlink instance should be
>> >> created inside the VM and disappear from the host.  
>> >
>> >Depends what a devlink instance represents :/  On one hand you may want
>> >to create an instance for a VF to allow it to spawn soft ports, on the
>> >other you may want to group multiple functions together.
>> >
>> >IOW if devlink instance is for an ASIC, there should be one per device
>> >per host.  So if we start connecting multiple functions (PFs and/or VFs)
>> >to one host we should probably introduce the notion of devlink aliases
>> >or some such (so that multiple bus addresses can target the same  
>> 
>> Hmm. Like VF address -> PF address alias? That would be confusing to see
>> eswitch ports under VF devlink instance... I probably did not get you
>> right.
>
>No eswitch ports under VF, more in case of mutli-PF.  Bus addresses of
>all PFs aliasing to the same devlink instance.

The multi-PF aliasing makes sense to me.


>
>> >devlink instance).  Those less pipelined NICs can forward between
>> >ports, but still want a function per port (otherwise user space
>> >sometimes gets confused).  If we have multiple functions which are on
>> >the same "switchid" they should have a single devlink instance if you
>> >ask me.  That instance will have all the ports of the device.  
>> 
>> Okay, that makes sense. But the question it, can the same devlink
>> instance contain ports that does not have "Switchid"?
>
>No strong preference if switchid is different.  To me devlink is an ASIC
>instance, if the multiport card is constructed by copy-pasting the same
>IP twice onto a die, and the ports really are completely separate, there
>is no reason to require single devlink instance.

Okay.


>
>> I think it would be beneficial to have the switchid shown for devlink
>> ports too. Then it is clean that the devlink ports with the same
>> switchid belong to the same switch, and other ports under the same
>> devlink instance (like PF itself) is separate, but still under the same
>> ASIC.
>
>Sure, you mean in terms of UI - user space can do a link dump or get
>that from sysfs, right?

I'm thinking about moving it to devlink. I'll work on it more today.


>
>> >You say disappear from the host - what do you mean.  Are you referring
>> >to the VF port disappearing?  But on the switch the port is still  
>> 
>> No, VF itself. eswitch port will be still there on the host.
>> 
>> 
>> >there, and you should show the subports on the PF side IMHO.  Devlink
>> >ports should allow users to understand the topology of the switch.  
>> 
>> What do you mean by "topology"?
>
>Mostly which ports are part of the switch and what's their "flavour".
>Also (less importantly) which host netdevs are "peers" of eswitch ports.

Makes sense.


>
>> >Is spawning VMDq sub-instances the only thing we can think of that VMs
>> >may want to do?  Are there any other uses?
>> >  
>> >> SF (or subport) feels similar to that. Basically it is exactly the same
>> >> thing as VF, only does reside under PF PCI function.
>> >> 
>> >> That is why I think, for sake of consistency, it should have a separate
>> >> devlink entity as well. The problem is correct sysfs modelling and
>> >> devlink handle derived from that. Parav is working on a simple soft
>> >> bus for this purpose called "subbus". There is a RFC floating around on
>> >> Mellanox internal mailing list, looks like it is time to send it
>> >> upstream.
>> >> 
>> >> Then each PF driver which have SFs would register subbus devices
>> >> according to SFs/subports and they would be properly handled by bus
>> >> probe, devlink and devlink port and netdev instances created.
>> >> 
>> >> Ccing Parav and Jason.  
>> >
>> >You guys come from the RDMA side of the world, with which I'm less
>> >familiar, and the soft bus + spawning devices seems to be a popular
>> >design there.  Could you describe the advantages of that model for 
>> >the sake of the netdev-only folks? :)  
>> 
>> I'll try to draw some ascii art :)
>
>Yess :)
>
>> >Another term that gets thrown into the mix here is mediated devices,
>> >right?  If you wanna pass the sub-spawn-soft-port to a VM.  Or run 
>> >DPDK on some queues.
>> >
>> >To state the obvious AF_XDP and macvlan offload were are previous
>> >answers to some of those use cases.  What is the forwarding model
>> >for those subports?  Are we going to allow flower rules from VMs?
>> >Is it going to be dst MAC only?  Or is the hypervisor going to forward
>> >as it sees appropriate (OvS + "repr"/port netdev)?
Jakub Kicinski March 1, 2019, 4:04 p.m. UTC | #7
On Fri, 1 Mar 2019 08:25:57 +0100, Jiri Pirko wrote:
> >> I think it would be beneficial to have the switchid shown for devlink
> >> ports too. Then it is clean that the devlink ports with the same
> >> switchid belong to the same switch, and other ports under the same
> >> devlink instance (like PF itself) is separate, but still under the same
> >> ASIC.  
> >
> >Sure, you mean in terms of UI - user space can do a link dump or get
> >that from sysfs, right?  
> 
> I thinking about moving it to devlink. I'll work on it more today.

At the kernel level?  Are we sure this is appropriate?  I'd think
switchid is in essence a forwarding attribute, so it belongs with
netdevs, no?
Jiri Pirko March 1, 2019, 4:20 p.m. UTC | #8
Fri, Mar 01, 2019 at 05:04:01PM CET, jakub.kicinski@netronome.com wrote:
>On Fri, 1 Mar 2019 08:25:57 +0100, Jiri Pirko wrote:
>> >> I think it would be beneficial to have the switchid shown for devlink
>> >> ports too. Then it is clean that the devlink ports with the same
>> >> switchid belong to the same switch, and other ports under the same
>> >> devlink instance (like PF itself) is separate, but still under the same
>> >> ASIC.  
>> >
>> >Sure, you mean in terms of UI - user space can do a link dump or get
>> >that from sysfs, right?  
>> 
>> I thinking about moving it to devlink. I'll work on it more today.
>
>At the kernel level?  Are we sure this is appropriate?  I'd think
>switchid is in essence a forwarding attribute, so it belongs with
>netdevs, no?

It's an id which identifies ports which belong to the same switch.
Internally in the kernel it is used for forwarding purposes. But it is
exposed to the user too. I believe it is a nice way to see which devlink
ports belong to a switch and which do not. Please see the RFC I sent today.
Parav Pandit March 4, 2019, 5 a.m. UTC | #9
> -----Original Message-----
> From: Jiri Pirko <jiri@resnulli.us>
> Sent: Wednesday, February 27, 2019 6:38 AM
> To: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: davem@davemloft.net; oss-drivers@netronome.com;
> netdev@vger.kernel.org; Parav Pandit <parav@mellanox.com>; Jason
> Gunthorpe <jgg@mellanox.com>
> Subject: Re: [PATCH net-next 4/8] devlink: allow subports on devlink PCI
> ports
> 
> Tue, Feb 26, 2019 at 07:24:32PM CET, jakub.kicinski@netronome.com wrote:
> >PCI endpoint corresponds to a PCI device, but such device can have one
> >more more logical device ports associated with it.
> >We need a way to distinguish those. Add a PCI subport in the dumps and
> >print the info in phys_port_name appropriately.
> >
> >This is not equivalent to port splitting, there is no split group. It's
> >just a way of representing multiple netdevs on a single PCI function.
> >
> >Note that the quality of being multiport pertains only to the PCI
> >function itself. A PF having multiple netdevs does not mean that its
> >VFs will also have multiple, or that VFs are associated with any
> >particular port of a multiport VF.
> >
> 
> We've been discussing the problem of subport (we call it "subfunction"
> or "SF") for some time internally. Turned out, this is probably harder task to
> model. Please prove me wrong.
> 
> The nature of VF makes it a logically separate entity. It has a separate PCI
> address, it should therefore have a separate devlink instance.
> You can pass it through to VM, then the same devlink instance should be
> created inside the VM and disappear from the host.
> 
> SF (or subport) feels similar to that. Basically it is exactly the same thing as
> VF, only does reside under PF PCI function.
> 
> That is why I think, for sake of consistency, it should have a separate devlink
> entity as well. The problem is correct sysfs modelling and devlink handle
> derived from that. Parav is working on a simple soft bus for this purpose
> called "subbus". There is a RFC floating around on Mellanox internal mailing
> list, looks like it is time to send it upstream.
> 
> Then each PF driver which have SFs would register subbus devices according
> to SFs/subports and they would be properly handled by bus probe, devlink
> and devlink port and netdev instances created.
> 
> Ccing Parav and Jason.

Thanks Jiri for the heads up. I sent the RFC in thread [1].

[1] https://lkml.org/lkml/2019/3/1/19
Jason Gunthorpe March 4, 2019, 4:15 p.m. UTC | #10
On Wed, Feb 27, 2019 at 10:30:00AM -0800, Jakub Kicinski wrote:
> On Wed, 27 Feb 2019 13:37:53 +0100, Jiri Pirko wrote:
> > Tue, Feb 26, 2019 at 07:24:32PM CET, jakub.kicinski@netronome.com wrote:
> > >PCI endpoint corresponds to a PCI device, but such device
> > >can have one more more logical device ports associated with it.
> > >We need a way to distinguish those. Add a PCI subport in the
> > >dumps and print the info in phys_port_name appropriately.
> > >
> > >This is not equivalent to port splitting, there is no split
> > >group. It's just a way of representing multiple netdevs on
> > >a single PCI function.
> > >
> > >Note that the quality of being multiport pertains only to
> > >the PCI function itself. A PF having multiple netdevs does
> > >not mean that its VFs will also have multiple, or that VFs
> > >are associated with any particular port of a multiport VF.
> > 
> > We've been discussing the problem of subport (we call it "subfunction"
> > or "SF") for some time internally. Turned out, this is probably harder
> > task to model. Please prove me wrong.
> > 
> > The nature of VF makes it a logically separate entity. It has a separate
> > PCI address, it should therefore have a separate devlink instance.
> > You can pass it through to VM, then the same devlink instance should be
> > created inside the VM and disappear from the host.
> 
> Depends what a devlink instance represents :/  On one hand you may want
> to create an instance for a VF to allow it to spawn soft ports, on the
> other you may want to group multiple functions together.
> 
> IOW if devlink instance is for an ASIC, there should be one per device
> per host.  

Don't we already have devlink instances for every mlx5 physical port
and VF as they are unique PCI functions?

> You guys come from the RDMA side of the world, with which I'm less
> familiar, and the soft bus + spawning devices seems to be a popular
> design there.  Could you describe the advantages of that model for 
> the sake of the netdev-only folks? :)

I don't think we do this in RDMA at all yet, or maybe I'm not sure
what you are thinking of?

The forward-looking discussion is mainly about creating something like
macvlan that can be offloaded, so we have something like an 'rdma
offload HW context' for each 'rdma macvlan'.

.. and that 'rdma offload HW context' has all the same knobs as an
offload context for a VF that would normally be tied to a unique PCI
BDF (via devlink). But in this case there is no unique BDF.

From another angle this is broadly similar to the scalable IOV stuff,
but without placing requirements on the host IOMMU to implement it.

Jason
Jakub Kicinski March 5, 2019, 1:03 a.m. UTC | #11
On Mon, 4 Mar 2019 16:15:14 +0000, Jason Gunthorpe wrote:
> On Wed, Feb 27, 2019 at 10:30:00AM -0800, Jakub Kicinski wrote:
> > On Wed, 27 Feb 2019 13:37:53 +0100, Jiri Pirko wrote:  
> > > Tue, Feb 26, 2019 at 07:24:32PM CET, jakub.kicinski@netronome.com wrote:  
> > > >PCI endpoint corresponds to a PCI device, but such device
> > > >can have one more more logical device ports associated with it.
> > > >We need a way to distinguish those. Add a PCI subport in the
> > > >dumps and print the info in phys_port_name appropriately.
> > > >
> > > >This is not equivalent to port splitting, there is no split
> > > >group. It's just a way of representing multiple netdevs on
> > > >a single PCI function.
> > > >
> > > >Note that the quality of being multiport pertains only to
> > > >the PCI function itself. A PF having multiple netdevs does
> > > >not mean that its VFs will also have multiple, or that VFs
> > > >are associated with any particular port of a multiport VF.  
> > > 
> > > We've been discussing the problem of subport (we call it "subfunction"
> > > or "SF") for some time internally. Turned out, this is probably harder
> > > task to model. Please prove me wrong.
> > > 
> > > The nature of VF makes it a logically separate entity. It has a separate
> > > PCI address, it should therefore have a separate devlink instance.
> > > You can pass it through to VM, then the same devlink instance should be
> > > created inside the VM and disappear from the host.  
> > 
> > Depends what a devlink instance represents :/  On one hand you may want
> > to create an instance for a VF to allow it to spawn soft ports, on the
> > other you may want to group multiple functions together.
> > 
> > IOW if devlink instance is for an ASIC, there should be one per device
> > per host.    
> 
> Don't we already have devlink instances for every mlx5 physical port
> and VF as they are unique PCI functions?

That's a very NIC-centric view of the world, though.  Equating devlink
instances to ports, and further to PCI devices.  It's fundamentally
different from what switches and some NICs do, where all ports are under
a single devlink instance.

> > You guys come from the RDMA side of the world, with which I'm less
> > familiar, and the soft bus + spawning devices seems to be a popular
> > design there.  Could you describe the advantages of that model for 
> > the sake of the netdev-only folks? :)  
> 
> I don't think we do this in RDMA at all yet, or maybe I'm not sure
> what you are thinking of?

Mm.. I caught an Intel patch set recently which was talking about buses
and spawning devices.  It must have been a different kettle of fish.

> The forward looking discussion is mainly to create something like
> macvlan that can be offloaded, so we have something like a 'rdma
> offload HW context' for each 'rdma macvlan'
> 
> .. and that 'rdma offload HW context' has all the same knobs as an
> offload context for a VF that would normally be tied to a unique PCI
> BDF (via devlink). But in this case there is no unique BDF.
> 
> From another angle this is broadly similar to the scalable IOV stuff,
> but without placing requirements on the host IOMMU to implement it.

Understood, thanks for clarifying.  The question becomes how do we
square this SR-IOV world with the world of switches.  Is the hypervisor
supposed to know that the VM has partitioned its VF?
Jason Gunthorpe March 5, 2019, 1:30 a.m. UTC | #12
On Mon, Mar 04, 2019 at 05:03:20PM -0800, Jakub Kicinski wrote:

> > Don't we already have devlink instances for every mlx5 physical port
> > and VF as they are unique PCI functions?
> 
> That's a very NIC-centric view of the world, though.  Equating devlink
> instances to ports, and further to PCI devices.  Its fundamentally
> different from what switches and some NICs do, where all ports are under
> single devlink instance.

I think, as a practical matter, it is a bit hard to recombine an ASIC
that presents multiple PCI BDFs into a single SW object. It is tricky
to give stable labels to things, to leave gaps to allow for uncertain
discovery, to co-ordinate between multiple struct pci_device drivers'
probe functions, etc.

And at least with devlink, if you have an object layer that is broader
than a PCI BDF, how do the devlink commands work? Are all BDFs just an
alias for this hidden super object?

Do any drivers attempt to provide a single instance made up of merged
BDFs?

In other words, is a PCI BDF really the largest granularity that
devlink can address today?

At least in RDMA we have drivers doing all combinations of this:
multiple ports per BDF, one port per BDF, and one composite RDMA
device formed by combining multiple BDFs worth of ports together.

> > > You guys come from the RDMA side of the world, with which I'm less
> > > familiar, and the soft bus + spawning devices seems to be a popular
> > > design there.  Could you describe the advantages of that model for 
> > > the sake of the netdev-only folks? :)  
> > 
> > I don't think we do this in RDMA at all yet, or maybe I'm not sure
> > what you are thinking of?
> 
> Mm.. I caught an Intel patch set recently which was talking about buses
> and spawning devices.  It must have been a different kettle of fish.

That sounds like scalable iov..

Jason
Jakub Kicinski March 5, 2019, 2:11 a.m. UTC | #13
On Tue, 5 Mar 2019 01:30:19 +0000, Jason Gunthorpe wrote:
> On Mon, Mar 04, 2019 at 05:03:20PM -0800, Jakub Kicinski wrote:
> 
> > > Don't we already have devlink instances for every mlx5 physical port
> > > and VF as they are unique PCI functions?  
> > 
> > That's a very NIC-centric view of the world, though.  Equating devlink
> > instances to ports, and further to PCI devices.  Its fundamentally
> > different from what switches and some NICs do, where all ports are under
> > single devlink instance.  
> 
> I think, as a practical matter, it is a bit hard to recombine an asic
> that presents multiple PCI BDFs into a single SW object. It is tricky
> to give stable labels to things, to leave gaps to allow for uncertain
> discovey, to co-ordinate between multiple struct pci_device drivers
> probe functions, etc.

It is tricky indeed, hence my so far unsuccessful search for a stable
handle :/  One thing which would not make things easier tho, is if the
objects we use to model this scenario don't have clear meanings...

> And at least with devlink, if you have a object layer that is broader
> then PCI BDF, how do the devlink commands work? Are all BDFs just an
> alias for this hidden super object?

My thinking was that they'd alias.

> Do any drivers attempt to provide single instant made up of merged
> BDFs?

Not yet, but our NFP can do it.  NFP used to be single PF per host,
which made life easier, but the silicon team was persuaded to remove
that comfort :)

> In other words, is a PCI BDF really the largest granularity that
> devlink can address today?

Yes, DBDF is the largest today, _but_ most advanced devices (mlxsw, nfp)
have only one PF per host.  IOW we existed blissfully in a world where
devices either pipelined from port to PF or had only one PF.

> At least in RDMA we have drivers doing all combinations of this:
> multiple ports per BDF, one port per BDF, and one composite RDMA
> device formed by combining multiple BDFs worth of ports together.

Right, last but not least we have the case where there is one port but
multiple links (for NUMA, or just because 1 PCIe link can't really cope
with 200Gbps).  In that case which DBDF would the port go to? :(
Does all the internal info of the ASIC (health, regions, sbs) get registered
twice?

> > > > You guys come from the RDMA side of the world, with which I'm less
> > > > familiar, and the soft bus + spawning devices seems to be a popular
> > > > design there.  Could you describe the advantages of that model for 
> > > > the sake of the netdev-only folks? :)    
> > > 
> > > I don't think we do this in RDMA at all yet, or maybe I'm not sure
> > > what you are thinking of?  
> > 
> > Mm.. I caught an Intel patch set recently which was talking about buses
> > and spawning devices.  It must have been a different kettle of fish.  
> 
> That sounds like scalable iov..
> 
> Jason
Jason Gunthorpe March 5, 2019, 10:11 p.m. UTC | #14
On Mon, Mar 04, 2019 at 06:11:07PM -0800, Jakub Kicinski wrote:

> > At least in RDMA we have drivers doing all combinations of this:
> > multiple ports per BDF, one port per BDF, and one composite RDMA
> > device formed by combining multiple BDFs worth of ports together.
> 
> Right, last but not least we have the case where there is one port but
> multiple links (for NUMA, or just because 1 PCIe link can't really cope
> with 200Gbps).  In that case which DBDF would the port go to? :(
> Do all internal info of the ASIC (health, regions, sbs) get registered
> twice?

This I don't know; at least for RDMA this configuration gets confusing
very fast and devlink is the least of the worries...

Personally I would advocate for a master/slave kind of arrangement
where the master BDF has a different PCI DID from the slaves. devlink
and other kernel objects hang off the master.

The slave port is then only used to carry selected NUMA aware data
path traffic and doesn't show in devlink.

Jason

Patch

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
index 36976e37d162..bb07be4117a3 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
@@ -375,7 +375,8 @@  nfp_devlink_port_init_pci(struct devlink *devlink, struct nfp_port *port,
 {
 	devlink_port_type_eth_set(&port->dl_port, port->netdev);
 	devlink_port_attrs_pci_set(&port->dl_port, flavour,
-				   port->pf_id, port->vf_id);
+				   port->pf_id, port->vf_id,
+				   port->pf_split, port->pf_split_id);
 	return 0;
 }
 
diff --git a/include/net/devlink.h b/include/net/devlink.h
index b5376ef492f1..13e0a479c546 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -53,6 +53,8 @@  struct devlink_port_attrs {
 		struct {
 			u32 pf_number;
 			u32 vf_number;
+			bool multiport;
+			u32 subport_number;
 		} pci;
 	};
 };
@@ -580,7 +582,8 @@  void devlink_port_attrs_set(struct devlink_port *devlink_port,
 			    u32 split_subport_number);
 void devlink_port_attrs_pci_set(struct devlink_port *devlink_port,
 				enum devlink_port_flavour flavour,
-				u32 pf_number, u32 vf_number);
+				u32 pf_number, u32 vf_number,
+				bool multiport, u32 subport_number);
 int devlink_port_get_phys_port_name(struct devlink_port *devlink_port,
 				    char *name, size_t len);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
@@ -797,7 +800,9 @@  static inline void devlink_port_attrs_set(struct devlink_port *devlink_port,
 
 static inline void devlink_port_attrs_pci_set(struct devlink_port *devlink_port,
 					      enum devlink_port_flavour flavour,
-					      u32 pf_number, u32 vf_number)
+					      u32 pf_number, u32 vf_number,
+					      bool multiport,
+					      u32 subport_number)
 {
 }
 
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 9ce76d4f640d..417ae8233cce 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -336,6 +336,7 @@  enum devlink_attr {
 
 	DEVLINK_ATTR_PORT_PCI_PF_NUMBER,	/* u32 */
 	DEVLINK_ATTR_PORT_PCI_VF_NUMBER,	/* u32 */
+	DEVLINK_ATTR_PORT_PCI_SUBPORT,		/* u32 */
 
 	/* add new attributes above here, update the policy in devlink.c */
 
diff --git a/net/core/devlink.c b/net/core/devlink.c
index af177284830b..6cdf7e87d7fc 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -541,6 +541,11 @@  static int devlink_nl_port_attrs_put(struct sk_buff *msg,
 		if (nla_put_u32(msg, DEVLINK_ATTR_PORT_PCI_PF_NUMBER,
 				attrs->pci.pf_number))
 			return -EMSGSIZE;
+
+		if (attrs->pci.multiport &&
+		    nla_put_u32(msg, DEVLINK_ATTR_PORT_PCI_SUBPORT,
+				attrs->pci.subport_number))
+			return -EMSGSIZE;
 		return 0;
 	default:
 		return -EINVAL;
@@ -5449,10 +5454,13 @@  EXPORT_SYMBOL_GPL(devlink_port_attrs_set);
  *	@pf_number: PCI PF number, in multi-host mapping to hosts depends
  *	            on the platform
  *	@vf_number: PCI VF number within given PF (ignored for PF itself)
+ *	@multiport: PCI function has more than one logical port
+ *	@subport_number: index of the logical port within the PCI function
  */
 void devlink_port_attrs_pci_set(struct devlink_port *devlink_port,
 				enum devlink_port_flavour flavour,
-				u32 pf_number, u32 vf_number)
+				u32 pf_number, u32 vf_number,
+				bool multiport, u32 subport_number)
 {
 	struct devlink_port_attrs *attrs = &devlink_port->attrs;
 
@@ -5463,6 +5471,8 @@  void devlink_port_attrs_pci_set(struct devlink_port *devlink_port,
 	attrs->flavour = flavour;
 	attrs->pci.pf_number = pf_number;
 	attrs->pci.vf_number = vf_number;
+	attrs->pci.multiport = multiport;
+	attrs->pci.subport_number = subport_number;
 	devlink_port_notify(devlink_port, DEVLINK_CMD_PORT_NEW);
 }
 EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_set);
@@ -5492,11 +5502,22 @@  int devlink_port_get_phys_port_name(struct devlink_port *devlink_port,
 		WARN_ON(1);
 		return -EINVAL;
 	case DEVLINK_PORT_FLAVOUR_PCI_PF:
-		n = snprintf(name, len, "pf%u", attrs->pci.pf_number);
+		if (!attrs->pci.multiport)
+			n = snprintf(name, len, "pf%u", attrs->pci.pf_number);
+		else
+			n = snprintf(name, len, "pf%us%u", attrs->pci.pf_number,
+				     attrs->pci.subport_number);
 		break;
 	case DEVLINK_PORT_FLAVOUR_PCI_VF:
-		n = snprintf(name, len, "pf%uvf%u",
-			     attrs->pf_number, attrs->pci.vf_number);
+		if (!attrs->pci.multiport)
+			n = snprintf(name, len, "pf%uvf%u",
+				     attrs->pci.pf_number,
+				     attrs->pci.vf_number);
+		else
+			n = snprintf(name, len, "pf%uvf%us%u",
+				     attrs->pci.pf_number,
+				     attrs->pci.vf_number,
+				     attrs->pci.subport_number);
 		break;
 	}