[net-next,00/13] Add mlx5 subfunction support

Message ID 20201112192424.2742-1-parav@nvidia.com

Message

Parav Pandit Nov. 12, 2020, 7:24 p.m. UTC
Hi Dave, Jakub, Greg,

This series introduces support for mlx5 subfunction (SF).
A subfunction is a portion of a PCI device that supports multiple
classes of devices such as netdev, RDMA and more.

This patchset is based on Leon's series [3].
It is a third user of proposed auxiliary bus [4].

Subfunction support is discussed in detail in RFC [1] and [2].
RFC [1] and extension [2] describe requirements, design, and proposed
plumbing using devlink, auxiliary bus and sysfs for systemd/udev
support.

Patch summary:
--------------
Patches 1 to 6 prepare devlink:
Patch-1 prepares code to handle multiple port function attributes
Patch-2 introduces devlink pcisf port flavour similar to pcipf and pcivf
Patch-3 adds port add and delete driver callbacks
Patch-4 adds port function state get and set callbacks
Patch-5 refactors devlink to avoid using a global mutex
Patch-6 uses a refcount to allow creating a devlink instance from an
existing one
Patches 7 to 13 implement the mlx5 pieces for SF support.
Patch-7 adds SF auxiliary device
Patch-8 adds SF auxiliary driver
Patch-9 prepares eswitch to handle SF vport
Patch-10 adds eswitch helpers to add/remove SF vport
Patch-11 adds SF device configuration commands
Patch-12 implements devlink port add/del callbacks
Patch-13 implements devlink port function get/set callbacks

More on SF plumbing below.

overview:
--------
A subfunction can be created and deleted by a user using the devlink
port add/delete interface.

A subfunction can be configured using devlink port function attributes
before it is activated.

When a subfunction is activated, an auxiliary device is created for it.
A driver binds to the auxiliary device and creates the supported
class devices.
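
Once active, the auxiliary device can also be seen through sysfs; an
illustrative sketch (the device name and index depend on the system):

$ ls /sys/bus/auxiliary/devices/
mlx5_core.sf.4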

short example sequence:
-----------------------
Change device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Configure mac address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88

Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active

Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4

$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff

$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.28.1002 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.28.1002 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
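
Teardown is the reverse sequence; a rough sketch, reusing the same port
function name as above (the port can also be referred to by its devlink
port index):

$ devlink port function set ens2f0npf0sf88 state inactive
$ devlink port del ens2f0npf0sf88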

subfunction (SF) in detail:
---------------------------
- A sub-function is a portion of the PCI device which supports multiple
  classes of devices such as netdev, RDMA and more.
- A SF netdev has its own dedicated queues (txq, rxq).
- A SF RDMA device has its own QP1, GID table and other RDMA resources.
- A SF supports eswitch representation and tc offload, similar
  to existing PF and VF representors.
- The user must configure the eswitch to send/receive the SF's packets.
- A SF shares PCI level resources with other SFs and/or with its
  parent PCI function.
  For example, an SF shares IRQ vectors with other SFs and its
  PCI function.
  In the future it may have a dedicated IRQ vector per SF.
  A SF has a dedicated window in the PCI BAR space that is not shared
  with other SFs or the PF. This ensures that when a SF is assigned to
  an application, only that application can access the device resources.
- A SF's auxiliary device exposes an sfnum sysfs attribute. This will be
  used by systemd/udev to derive deterministic names for its netdev and
  RDMA device (a small read sketch follows this list).
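
An illustrative sketch of reading it (the auxiliary device name matches
the example above; the exact sysfs path is an assumption):

$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.4/sfnum
88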

[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://marc.info/?l=linux-netdev&m=158555928517777&w=2
[3] https://lists.linuxfoundation.org/pipermail/virtualization/2020-November/050473.html
[4] https://lore.kernel.org/linux-rdma/20201023003338.1285642-2-david.m.ertman@intel.com/

Parav Pandit (11):
  devlink: Prepare code to fill multiple port function attributes
  devlink: Introduce PCI SF port flavour and port attribute
  devlink: Support add and delete devlink port
  devlink: Support get and set state of port function
  devlink: Avoid global devlink mutex, use per instance reload lock
  devlink: Introduce devlink refcount to reduce scope of global
    devlink_mutex
  net/mlx5: SF, Add auxiliary device support
  net/mlx5: SF, Add auxiliary device driver
  net/mlx5: E-switch, Add eswitch helpers for SF vport
  net/mlx5: SF, Add port add delete functionality
  net/mlx5: SF, Port function state change support

Vu Pham (2):
  net/mlx5: E-switch, Prepare eswitch to handle SF vport
  net/mlx5: SF, Add SF configuration hardware commands

 .../net/ethernet/mellanox/mlx5/core/Kconfig   |  19 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   4 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   7 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   2 +-
 .../mellanox/mlx5/core/esw/acl/egress_ofld.c  |   2 +-
 .../mellanox/mlx5/core/esw/devlink_port.c     |  41 ++
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  46 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  82 +++
 .../mellanox/mlx5/core/eswitch_offloads.c     |  47 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |  48 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  10 +
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  20 +
 .../net/ethernet/mellanox/mlx5/core/sf/cmd.c  |  48 ++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.c  | 213 ++++++++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.h  |  68 +++
 .../mellanox/mlx5/core/sf/dev/driver.c        | 105 ++++
 .../net/ethernet/mellanox/mlx5/core/sf/priv.h |  14 +
 .../net/ethernet/mellanox/mlx5/core/sf/sf.c   | 498 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  59 +++
 .../net/ethernet/mellanox/mlx5/core/vport.c   |   3 +-
 include/linux/mlx5/driver.h                   |  12 +-
 include/net/devlink.h                         |  82 +++
 include/uapi/linux/devlink.h                  |  26 +
 net/core/devlink.c                            | 362 +++++++++++--
 25 files changed, 1754 insertions(+), 73 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h

Comments

Jakub Kicinski Nov. 16, 2020, 10:52 p.m. UTC | #1
On Thu, 12 Nov 2020 21:24:10 +0200 Parav Pandit wrote:
> This series introduces support for mlx5 subfunction (SF).
> A subfunction is a portion of a PCI device that supports multiple
> classes of devices such as netdev, RDMA and more.
> 
> This patchset is based on Leon's series [3].
> It is a third user of proposed auxiliary bus [4].
> 
> Subfunction support is discussed in detail in RFC [1] and [2].
> RFC [1] and extension [2] describes requirements, design, and proposed
> plumbing using devlink, auxiliary bus and sysfs for systemd/udev
> support.

So we're going to have two ways of adding subdevs? Via devlink and via
the new vdpa netlink thing?

Question number two - is this supposed to be ready to be applied to
net-next? It seems there is a conflict.

Also could you please wrap your code at 80 chars?

Thanks.
Saeed Mahameed Nov. 17, 2020, 12:06 a.m. UTC | #2
On Mon, 2020-11-16 at 14:52 -0800, Jakub Kicinski wrote:
> On Thu, 12 Nov 2020 21:24:10 +0200 Parav Pandit wrote:
> > This series introduces support for mlx5 subfunction (SF).
> > A subfunction is a portion of a PCI device that supports multiple
> > classes of devices such as netdev, RDMA and more.
> > 
> > This patchset is based on Leon's series [3].
> > It is a third user of proposed auxiliary bus [4].
> > 
> > Subfunction support is discussed in detail in RFC [1] and [2].
> > RFC [1] and extension [2] describes requirements, design, and
> > proposed
> > plumbing using devlink, auxiliary bus and sysfs for systemd/udev
> > support.
> 
> So we're going to have two ways of adding subdevs? Via devlink and
> via
> the new vdpa netlink thing?
> 

Via devlink you add the Sub-function bus device - think of it as
spawning a new VF - but it has no actual characteristics
(netdev/vdpa/rdma) "yet" until the admin decides to load an interface
on it via aux sysfs.

Basically devlink adds a new eswitch port (the SF port) and loading the
drivers and the interfaces is done via the auxbus subsystem only after
the SF is spawned by FW.


> Question number two - is this supposed to be ready to be applied to
> net-next? It seems there is a conflict.
> 

This series requires other mlx5 and auxbus infrastructure dependencies
that were already submitted by Leon 2-3 weeks ago and are pending Greg's
review. Once finalized, they will be merged into mlx5-next; then I will
ask you to pull mlx5-next, and only after that can you apply this series
cleanly to net-next. Sorry for the mess, but we had to move forward and
show how the auxdev subsystem is actually being used.

Leon's series:
https://patchwork.ozlabs.org/project/netdev/cover/20201101201542.2027568-1-leon@kernel.org/

> Also could you please wrap your code at 80 chars?
> 

I prefer not to do this in mlx5; in mlx5 we follow a 95 chars rule.
But if you insist :) .. 

Thanks,
Saeed.
Jakub Kicinski Nov. 17, 2020, 1:58 a.m. UTC | #3
On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote:
> On Mon, 2020-11-16 at 14:52 -0800, Jakub Kicinski wrote:
> > On Thu, 12 Nov 2020 21:24:10 +0200 Parav Pandit wrote:  
> > > This series introduces support for mlx5 subfunction (SF).
> > > A subfunction is a portion of a PCI device that supports multiple
> > > classes of devices such as netdev, RDMA and more.
> > > 
> > > This patchset is based on Leon's series [3].
> > > It is a third user of proposed auxiliary bus [4].
> > > 
> > > Subfunction support is discussed in detail in RFC [1] and [2].
> > > RFC [1] and extension [2] describes requirements, design, and
> > > proposed
> > > plumbing using devlink, auxiliary bus and sysfs for systemd/udev
> > > support.  
> > 
> > So we're going to have two ways of adding subdevs? Via devlink and
> > via the new vdpa netlink thing?
> 
> Via devlink you add the Sub-function bus device - think of it as
> spawning a new VF - but has no actual characteristics
> (netdev/vpda/rdma) "yet" until user admin decides to load an interface
> on it via aux sysfs.

By which you mean it doesn't get probed or the device type is not set
(IOW it can still become a block device or netdev depending on the vdpa
request)?

> Basically devlink adds a new eswitch port (the SF port) and loading the
> drivers and the interfaces is done via the auxbus subsystem only after
> the SF is spawned by FW.

But why?

Is this for the SmartNIC / bare metal case? The flow for spawning on
the local host gets highly convoluted.

> > Also could you please wrap your code at 80 chars?
> 
> I prefer no to do this in mlx5, in mlx5 we follow a 95 chars rule.
> But if you insist :) .. 

Oh yeah, I meant the devlink patches!
Parav Pandit Nov. 17, 2020, 4:08 a.m. UTC | #4
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Tuesday, November 17, 2020 7:28 AM
> 
> On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote:
> > > > Subfunction support is discussed in detail in RFC [1] and [2].
> > > > RFC [1] and extension [2] describes requirements, design, and
> > > > proposed plumbing using devlink, auxiliary bus and sysfs for
> > > > systemd/udev support.
> > >
> > > So we're going to have two ways of adding subdevs? Via devlink and
> > > via the new vdpa netlink thing?
Nop.
Subfunctions (subdevs) are added only one way,
i.e. via devlink port, as settled in RFC [1].

Just to refresh all our memory, we discussed and settled on the flow in [2];
RFC [1] followed this discussion.

The vdpa tool of [3] can add one or more vdpa device(s) on top of an already spawned PF, VF, or SF device.

> >
> > Via devlink you add the Sub-function bus device - think of it as
> > spawning a new VF - but has no actual characteristics
> > (netdev/vpda/rdma) "yet" until user admin decides to load an interface
> > on it via aux sysfs.
> 
> By which you mean it doesn't get probed or the device type is not set (IOW it can
> still become a block device or netdev depending on the vdpa request)?
> 
> > Basically devlink adds a new eswitch port (the SF port) and loading
> > the drivers and the interfaces is done via the auxbus subsystem only
> > after the SF is spawned by FW.
> 
> But why?
> 
> Is this for the SmartNIC / bare metal case? The flow for spawning on the local
> host gets highly convoluted.
> 
The flow of spawning for (a) the local host or (b) an external host controller from the smartnic is the same.

$ devlink port add..
[..]
Followed by
$ devlink port function set state...

The only change would be to specify the destination where to spawn it (controller number, pf, sf num etc.).
Please refer to the detailed examples in the individual patches.
Patches 12 and 13 mostly cover the complete view.

> > > Also could you please wrap your code at 80 chars?
> >
> > I prefer no to do this in mlx5, in mlx5 we follow a 95 chars rule.
> > But if you insist :) ..
> 
> Oh yeah, I meant the devlink patches!
May I ask why?
The past few devlink patches [4] followed the 100 chars rule. When did we revert back to 80?
If so, any pointers to the thread for 80? checkpatch.pl with --strict mode didn't complain when I prepared the patches.

[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://lore.kernel.org/netdev/20200324132044.GI20941@ziepe.ca/
[3] https://lists.linuxfoundation.org/pipermail/virtualization/2020-November/050623.html
[4] commits dc64cc7c6310, 77069ba2e3ad, a1e8ae907c8d, 2a916ecc4056, ba356c90985d
Jakub Kicinski Nov. 17, 2020, 5:11 p.m. UTC | #5
On Tue, 17 Nov 2020 04:08:57 +0000 Parav Pandit wrote:
> > On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote:  
> > > > > Subfunction support is discussed in detail in RFC [1] and [2].
> > > > > RFC [1] and extension [2] describes requirements, design, and
> > > > > proposed plumbing using devlink, auxiliary bus and sysfs for
> > > > > systemd/udev support.  
> > > >
> > > > So we're going to have two ways of adding subdevs? Via devlink and
> > > > via the new vdpa netlink thing?  
> Nop.
> Subfunctions (subdevs) are added only one way, 
> i.e. devlink port as settled in RFC [1].
> 
> Just to refresh all our memory, we discussed and settled on the flow
> in [2]; RFC [1] followed this discussion.
> 
> vdpa tool of [3] can add one or more vdpa device(s) on top of already
> spawned PF, VF, SF device.

Nack for the networking part of that. It'd basically be VMDq.

> > > Via devlink you add the Sub-function bus device - think of it as
> > > spawning a new VF - but has no actual characteristics
> > > (netdev/vpda/rdma) "yet" until user admin decides to load an
> > > interface on it via aux sysfs.  
> > 
> > By which you mean it doesn't get probed or the device type is not
> > set (IOW it can still become a block device or netdev depending on
> > the vdpa request)? 
> > > Basically devlink adds a new eswitch port (the SF port) and
> > > loading the drivers and the interfaces is done via the auxbus
> > > subsystem only after the SF is spawned by FW.  
> > 
> > But why?
> > 
> > Is this for the SmartNIC / bare metal case? The flow for spawning
> > on the local host gets highly convoluted.
>
> The flow of spawning for (a) local host or (b) for external host
> controller from smartnic is same.
> 
> $ devlink port add..
> [..]
> Followed by
> $ devlink port function set state...
> 
> Only change would be to specify the destination where to spawn it.
> (controller number, pf, sf num etc) Please refer to the detailed
> examples in individual patch. Patch 12 and 13 mostly covers the
> complete view.

Please share full examples of the workflow.

I'm asking how the vdpa API fits in with this, and you're showing me
the two devlink commands we already talked about in the past.
Jason Gunthorpe Nov. 17, 2020, 6:49 p.m. UTC | #6
On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:

> > Just to refresh all our memory, we discussed and settled on the flow
> > in [2]; RFC [1] followed this discussion.
> > 
> > vdpa tool of [3] can add one or more vdpa device(s) on top of already
> > spawned PF, VF, SF device.
> 
> Nack for the networking part of that. It'd basically be VMDq.

What are you NAK'ing? 

It is consistent with the multi-subsystem device sharing model we've
had for ages now.

The physical ethernet port is shared between multiple accelerator
subsystems. netdev gets its slice of traffic, so does RDMA, iSCSI,
VDPA, etc.

Jason
Parav Pandit Nov. 17, 2020, 6:50 p.m. UTC | #7
Hi Jakub,

> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Tuesday, November 17, 2020 10:41 PM
> 
> On Tue, 17 Nov 2020 04:08:57 +0000 Parav Pandit wrote:
> > > On Mon, 16 Nov 2020 16:06:02 -0800 Saeed Mahameed wrote:
> > > > > > Subfunction support is discussed in detail in RFC [1] and [2].
> > > > > > RFC [1] and extension [2] describes requirements, design, and
> > > > > > proposed plumbing using devlink, auxiliary bus and sysfs for
> > > > > > systemd/udev support.
> > > > >
> > > > > So we're going to have two ways of adding subdevs? Via devlink
> > > > > and via the new vdpa netlink thing?
> > Nop.
> > Subfunctions (subdevs) are added only one way, i.e. devlink port as
> > settled in RFC [1].
> >
> > Just to refresh all our memory, we discussed and settled on the flow
> > in [2]; RFC [1] followed this discussion.
> >
> > vdpa tool of [3] can add one or more vdpa device(s) on top of already
> > spawned PF, VF, SF device.
> 
> Nack for the networking part of that. It'd basically be VMDq.
> 
Can you please clarify which networking part you mean?
Which patches exactly in this patchset?

> > > > Via devlink you add the Sub-function bus device - think of it as
> > > > spawning a new VF - but has no actual characteristics
> > > > (netdev/vpda/rdma) "yet" until user admin decides to load an
> > > > interface on it via aux sysfs.
> > >
> > > By which you mean it doesn't get probed or the device type is not
> > > set (IOW it can still become a block device or netdev depending on
> > > the vdpa request)?
> > > > Basically devlink adds a new eswitch port (the SF port) and
> > > > loading the drivers and the interfaces is done via the auxbus
> > > > subsystem only after the SF is spawned by FW.
> > >
> > > But why?
> > >
> > > Is this for the SmartNIC / bare metal case? The flow for spawning on
> > > the local host gets highly convoluted.
> >
> > The flow of spawning for (a) local host or (b) for external host
> > controller from smartnic is same.
> >
> > $ devlink port add..
> > [..]
> > Followed by
> > $ devlink port function set state...
> >
> > Only change would be to specify the destination where to spawn it.
> > (controller number, pf, sf num etc) Please refer to the detailed
> > examples in individual patch. Patch 12 and 13 mostly covers the
> > complete view.
> 
> Please share full examples of the workflow.
> 
Please find the full example sequence below, taken from this cover letter and from the respective patches 12 and 13.

Change device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Configure mac address of the port function: (existing API).
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88

Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active

Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4

$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false

$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff

$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.28.1002 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.28.1002 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112

At this point, the vdpa tool of [1] can create one or more vdpa net devices on this subfunction device with the below sequence.

$ vdpa parentdev list
auxiliary/mlx5_core.sf.4
  supported_classes
    net

$ vdpa dev add parentdev auxiliary/mlx5_core.sf.4 type net name foo0

$ vdpa dev show foo0
foo0: parentdev auxiliary/mlx5_core.sf.4 type network parentdev vdpasim vendor_id 0 max_vqs 2 max_vq_size 256
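
For completeness, such a vdpa device can later be removed again; a
sketch, assuming the proposed vdpa tool in [1] supports removal:

$ vdpa dev del foo0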

> I'm asking how the vdpa API fits in with this, and you're showing me the two
> devlink commands we already talked about in the past.
Oh ok, sorry, my bad. I now understand your question about the relation of the vdpa commands to this.
Please look at the above example sequence, which covers the vdpa example as well.

[1] https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/
Jakub Kicinski Nov. 19, 2020, 2:14 a.m. UTC | #8
On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> 
> > > Just to refresh all our memory, we discussed and settled on the flow
> > > in [2]; RFC [1] followed this discussion.
> > > 
> > > vdpa tool of [3] can add one or more vdpa device(s) on top of already
> > > spawned PF, VF, SF device.  
> > 
> > Nack for the networking part of that. It'd basically be VMDq.  
> 
> What are you NAK'ing? 

Spawning multiple netdevs from one device by slicing up its queues.

> It is consistent with the multi-subsystem device sharing model we've
> had for ages now.
> 
> The physical ethernet port is shared between multiple accelerator
> subsystems. netdev gets its slice of traffic, so does RDMA, iSCSI,
> VDPA, etc.

Right, devices of other subsystems are fine, I don't care.

Sorry for not being crystal clear but quite frankly IDK what else can
be expected from me given the submissions have little to no context and
documentation. This comes up every damn time with the SF patches, I'm
tired of having to ask for a basic workflow.
Jakub Kicinski Nov. 19, 2020, 2:23 a.m. UTC | #9
On Tue, 17 Nov 2020 18:50:57 +0000 Parav Pandit wrote:
> At this point vdpa tool of [1] can create one or more vdpa net devices on this subfunction device in below sequence.
> 
> $ vdpa parentdev list
> auxiliary/mlx5_core.sf.4
>   supported_classes
>     net
> 
> $ vdpa dev add parentdev auxiliary/mlx5_core.sf.4 type net name foo0
> 
> $ vdpa dev show foo0
> foo0: parentdev auxiliary/mlx5_core.sf.4 type network parentdev vdpasim vendor_id 0 max_vqs 2 max_vq_size 256
> 
> > I'm asking how the vdpa API fits in with this, and you're showing me the two
> > devlink commands we already talked about in the past.  
> Oh ok, sorry, my bad. I understood your question now about relation of vdpa commands with this.
> Please look at the above example sequence that covers the vdpa example also.
> 
> [1] https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/

I think the biggest missing piece in my understanding is what's the
technical difference between an SF and a VDPA device.

Isn't a VDPA device an SF with a particular descriptor format for the
queues?
David Ahern Nov. 19, 2020, 4:35 a.m. UTC | #10
On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
>> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
>>
>>>> Just to refresh all our memory, we discussed and settled on the flow
>>>> in [2]; RFC [1] followed this discussion.
>>>>
>>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already
>>>> spawned PF, VF, SF device.  
>>>
>>> Nack for the networking part of that. It'd basically be VMDq.  
>>
>> What are you NAK'ing? 
> 
> Spawning multiple netdevs from one device by slicing up its queues.

Why do you object to that? Slicing up h/w resources for virtual whatever
has been common practice for a long time.
Saeed Mahameed Nov. 19, 2020, 5:57 a.m. UTC | #11
On Wed, 2020-11-18 at 21:35 -0700, David Ahern wrote:
> On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> > > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > > 
> > > > > Just to refresh all our memory, we discussed and settled on
> > > > > the flow
> > > > > in [2]; RFC [1] followed this discussion.
> > > > > 
> > > > > vdpa tool of [3] can add one or more vdpa device(s) on top of
> > > > > already
> > > > > spawned PF, VF, SF device.  
> > > > 
> > > > Nack for the networking part of that. It'd basically be VMDq.  
> > > 
> > > What are you NAK'ing? 
> > 
> > Spawning multiple netdevs from one device by slicing up its queues.
> 
> Why do you object to that? Slicing up h/w resources for virtual what
> ever has been common practice for a long time.
> 
> 

We are not slicing up any queues. From our HW and FW perspective SF ==
VF literally: a full blown HW slice (function), with isolated control
and data plane of its own. This is very different from VMDq and more
generic and secure. An SF device is exactly like a VF; it doesn't steal
or share any HW resources or control/data path with others. SF is
basically SRIOV done right.

This series has nothing to do with netdev; if you look at the list of
files Parav is touching, there is 0 change in our netdev stack :) ..
All Parav is doing is adding the API to create/destroy SFs and
representing the low level SF function to devlink as a device, just
like a VF.
Saeed Mahameed Nov. 19, 2020, 6:12 a.m. UTC | #12
On Wed, 2020-11-18 at 18:14 -0800, Jakub Kicinski wrote:
> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > 
> > It is consistent with the multi-subsystem device sharing model
> > we've
> > had for ages now.
> > 
> > The physical ethernet port is shared between multiple accelerator
> > subsystems. netdev gets its slice of traffic, so does RDMA, iSCSI,
> > VDPA, etc.

not just a slice of traffic, a whole HW domain.

> 
> Right, devices of other subsystems are fine, I don't care.
> 

But a netdev will be loaded on an SF automatically just through the
current driver design and modularity, since SF == VF and our netdev is
abstract and doesn't know if it runs on a PF/VF/SF .. We would literally
have to add code to not load a netdev on an SF. Why? :/

> Sorry for not being crystal clear but quite frankly IDK what else can
> be expected from me given the submissions have little to no context
> and
> documentation. This comes up every damn time with the SF patches, I'm
> tired of having to ask for a basic workflow.

From how this discussion is going, I think you are right; we need to
clarify what we are doing in more high level, simplified and generic
documentation to give some initial context. Parav, let's add the
missing documentation. We can also add some comments regarding how this
is very different from VMDq, but I would like to avoid that, since it
is different in almost every way :) ..
Saeed Mahameed Nov. 19, 2020, 6:22 a.m. UTC | #13
On Wed, 2020-11-18 at 18:23 -0800, Jakub Kicinski wrote:
> On Tue, 17 Nov 2020 18:50:57 +0000 Parav Pandit wrote:
> > At this point vdpa tool of [1] can create one or more vdpa net
> > devices on this subfunction device in below sequence.
> > 
> > $ vdpa parentdev list
> > auxiliary/mlx5_core.sf.4
> >   supported_classes
> >     net
> > 
> > $ vdpa dev add parentdev auxiliary/mlx5_core.sf.4 type net name
> > foo0
> > 
> > $ vdpa dev show foo0
> > foo0: parentdev auxiliary/mlx5_core.sf.4 type network parentdev
> > vdpasim vendor_id 0 max_vqs 2 max_vq_size 256
> > 
> > > I'm asking how the vdpa API fits in with this, and you're showing
> > > me the two
> > > devlink commands we already talked about in the past.  
> > Oh ok, sorry, my bad. I understood your question now about relation
> > of vdpa commands with this.
> > Please look at the above example sequence that covers the vdpa
> > example also.
> > 
> > [1] 
> > https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/
> 
> I think the biggest missing piece in my understanding is what's the
> technical difference between an SF and a VDPA device.
> 

Same difference as between a VF and netdev.
SF == VF, so a full HW function.
VDPA/RDMA/netdev/SCSI/nvme/etc.. are just interfaces (ULPs) sharing the
same functions as they always have; nothing new about this.

Today on a VF we load RDMA/VDPA/netdev interfaces.
An SF will do exactly the same and the ULPs will simply load; we don't
need to modify them.

> Isn't a VDPA device an SF with a particular descriptor format for the
> queues?
No :/, 
I hope the above answer clarifies things a bit.
SF is a device function that provides all kinds of queues.
Parav Pandit Nov. 19, 2020, 8:25 a.m. UTC | #14
> From: Saeed Mahameed <saeed@kernel.org>
> Sent: Thursday, November 19, 2020 11:42 AM
> 
> From how this discussion is going, i think you are right, we need to clarify
> what we are doing in a more high level simplified and generic documentation
> to give some initial context, Parav, let's add the missing documentation, we
> can also add some comments regarding how this is very different from
> VMDq, but i would like to avoid that, since it is different in almost every way:)

Sure, I will add Documentation/networking/subfunction.rst in v2 describing the subfunction details.
Jason Gunthorpe Nov. 19, 2020, 2 p.m. UTC | #15
On Wed, Nov 18, 2020 at 10:22:51PM -0800, Saeed Mahameed wrote:
> > I think the biggest missing piece in my understanding is what's the
> > technical difference between an SF and a VDPA device.
> 
> Same difference as between a VF and netdev.
> SF == VF, so a full HW function.
> VDPA/RDMA/netdev/SCSI/nvme/etc.. are just interfaces (ULPs) sharing the
> same functions as always been, nothing new about this.

All the implementation details are very different, but this white
paper from Intel goes into some detail on the basic elements and
rationale for the SF concept:

https://software.intel.com/content/dam/develop/public/us/en/documents/intel-scalable-io-virtualization-technical-specification.pdf

What we are calling a sub-function here is a close cousin to what
Intel calls an Assignable Device Interface. I expect to see other
drivers following this general pattern eventually.

A SF will eventually be assignable to a VM and the VM won't be able to
tell the difference between a VF or SF providing the assignable PCI
resources.

VDPA is also assignable to a guest, but the key difference between
mlx5's SF and VDPA is what guest driver binds to the virtual PCI
function. For a SF the guest will bind mlx5_core, for VDPA the guest
will bind virtio-net.

So, the driver stack for a VM using VDPA might be

 Physical device [pci] -> mlx5_core -> [aux] -> SF -> [aux] ->  mlx5_core -> [aux] -> mlx5_vdpa -> QEMU -> |VM| -> [pci] -> virtio_net

When Parav is talking about creating VDPA devices he means attaching
the VDPA accelerator subsystem to a mlx5_core, where ever that
mlx5_core might be attached to.

To your other remark:

> > What are you NAK'ing?
> Spawning multiple netdevs from one device by slicing up its queues.

This is a bit vague. In SRIOV a device spawns multiple netdevs for a
physical port by "slicing up its physical queues" - where do you see
the crossover between VMDq (bad) and SRIOV (ok)?

I thought the issue with VMDq was more on the horrid management to
configure the traffic splitting, not the actual splitting itself?

In classic SRIOV the traffic is split by a simple non-configurable HW
switch based on MAC address of the VF.

mlx5 already has the extended version of that idea, we can run in
switchdev mode and use switchdev to configure the HW switch. Now
configurable switchdev rules split the traffic for VFs.

This SF step replaces the VF in the above, but everything else is the
same. The switchdev still splits the traffic, it still ends up in same
nested netdev queue structure & RSS a VF/PF would use, etc, etc. No
queues are "stolen" to create the nested netdev.

From the driver perspective there is no significant difference between
sticking a netdev on a mlx5 VF or sticking a netdev on a mlx5 SF. A SF
netdev is not going in and doing deep surgery to the PF netdev to
steal queues or something.

Both VF and SF will be eventually assignable to guests, both can
support all the accelerator subsystems - VDPA, RDMA, etc. Both can
support netdev.

Compared to VMDq, I think it is really no comparison. SF/ADI is an
evolution of a SRIOV VF from something PCI-SIG controlled to something
device specific and lighter weight.

SF/ADI come with an architectural security boundary suitable for
assignment to an untrusted guest. It is not just a jumble of queues.

VMDq is .. not that.

Actually it has been one of the open debates in the virtualization
userspace world. The approach of using switchdev to control the traffic
splitting to VMs is elegant but many drivers are not following
this design. :(

Finally, in the mlx5 model VDPA is just an "application". It asks the
device to create a 'RDMA' raw ethernet packet QP that uses rings
formed in the virtio-net specification. We can create it in the kernel
using mlx5_vdpa, and we can create it in userspace through the RDMA
subsystem. Like any "RDMA" application it is contained by the security
boundary of the PF/VF/SF the mlx5_core is running on.

Jason
Jakub Kicinski Nov. 20, 2020, 1:29 a.m. UTC | #16
On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
> On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:  
> >> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> >>  
> >>>> Just to refresh all our memory, we discussed and settled on the flow
> >>>> in [2]; RFC [1] followed this discussion.
> >>>>
> >>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already
> >>>> spawned PF, VF, SF device.    
> >>>
> >>> Nack for the networking part of that. It'd basically be VMDq.    
> >>
> >> What are you NAK'ing?   
> > 
> > Spawning multiple netdevs from one device by slicing up its queues.  
> 
> Why do you object to that? Slicing up h/w resources for virtual what
> ever has been common practice for a long time.

My memory of the VMDq debate is hazy, let me rope in Alex into this.
I believe the argument was that we should offload software constructs,
not create HW-specific APIs which depend on HW availability and
implementation. So the path we took was offloading macvlan.
Jakub Kicinski Nov. 20, 2020, 1:31 a.m. UTC | #17
On Wed, 18 Nov 2020 21:57:57 -0800 Saeed Mahameed wrote:
> On Wed, 2020-11-18 at 21:35 -0700, David Ahern wrote:
> > On 11/18/20 7:14 PM, Jakub Kicinski wrote:  
> > > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:  
> > > > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > > >   
> > > > > > Just to refresh all our memory, we discussed and settled on
> > > > > > the flow
> > > > > > in [2]; RFC [1] followed this discussion.
> > > > > > 
> > > > > > vdpa tool of [3] can add one or more vdpa device(s) on top of
> > > > > > already
> > > > > > spawned PF, VF, SF device.    
> > > > > 
> > > > > Nack for the networking part of that. It'd basically be VMDq.    
> > > > 
> > > > What are you NAK'ing?   
> > > 
> > > Spawning multiple netdevs from one device by slicing up its queues.  
> > 
> > Why do you object to that? Slicing up h/w resources for virtual what
> > ever has been common practice for a long time.
> 
> We are not slicing up any queues, from our HW and FW perspective SF ==
> VF literally, a full blown HW slice (Function), with isolated control
> and data plane of its own, this is very different from VMDq and more
> generic and secure. an SF device is exactly like a VF, doesn't steal or
> share any HW resources or control/data path with others. SF is
> basically SRIOV done right.
> 
> this series has nothing to do with netdev, if you look at the list of
> files Parav is touching, there is 0 change in our netdev stack :) ..
> all Parav is doing is adding the API to create/destroy SFs and
> represents the low level SF function to devlink as a device, just
> like a VF.

Ack, the concern is about the vdpa, not SF. 
So not really this patch set.
Jakub Kicinski Nov. 20, 2020, 1:35 a.m. UTC | #18
On Wed, 18 Nov 2020 22:12:22 -0800 Saeed Mahameed wrote:
> > Right, devices of other subsystems are fine, I don't care.
> 
> But a netdev will be loaded on SF automatically just through the
> current driver design and modularity, since SF == VF and our netdev is
> abstract and doesn't know if it runs on a PF/VF/SF .. we literally have
> to add code to not load a netdev on a SF. why ? :/

A netdev is fine, but the examples so far don't make it clear (to me) 
if it's expected/supported to spawn _multiple_ netdevs from a single
"vdpa parentdev".
Parav Pandit Nov. 20, 2020, 3:34 a.m. UTC | #19
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, November 20, 2020 7:05 AM
> 
> On Wed, 18 Nov 2020 22:12:22 -0800 Saeed Mahameed wrote:
> > > Right, devices of other subsystems are fine, I don't care.
> >
> > But a netdev will be loaded on SF automatically just through the
> > current driver design and modularity, since SF == VF and our netdev is
> > abstract and doesn't know if it runs on a PF/VF/SF .. we literally
> > have to add code to not load a netdev on a SF. why ? :/
> 
> A netdev is fine, but the examples so far don't make it clear (to me) if it's
> expected/supported to spawn _multiple_ netdevs from a single "vdpa
> parentdev".
We do not create a netdev from the vdpa parentdev.
From a vdpa parentdev, only vdpa device(s) are created, each of which is a 'struct device' residing in /sys/bus/vdpa/<device>.
Currently such a vdpa device is already created on mlx5_vdpa.ko driver load; however, the user has no way to inspect, get stats of, or get/set features of this device, hence the vdpa tool.
Jakub Kicinski Nov. 20, 2020, 3:35 a.m. UTC | #20
On Thu, 19 Nov 2020 10:00:17 -0400 Jason Gunthorpe wrote:
> Finally, in the mlx5 model VDPA is just an "application". It asks the
> device to create a 'RDMA' raw ethernet packet QP that is uses rings
> formed in the virtio-net specification. We can create it in the kernel
> using mlx5_vdpa, and we can create it in userspace through the RDMA
> subsystem. Like any "RDMA" application it is contained by the security
> boundary of the PF/VF/SF the mlx5_core is running on.

Thanks for the write up!

The SF part is pretty clear to me, it is what it is. DPDK camp has been
pretty excited about ADI/PASID for a while now.


The part that's blurry to me is VDPA.

I was under the impression that for VDPA the device is supposed to
support native virtio 2.0 (or whatever the "HW friendly" spec was).

I believe that's what the early patches from Intel did.

You're saying it's a client application like any other - do I understand
it right that the hypervisor driver will be translating descriptors
between virtio and device-native then?

The vdpa parent is in the hypervisor correct?

Can a VDPA device have multiple children of the same type?

Why do we have a representor for a SF, if the interface is actually VDPA?
Block and net traffic can't reasonably be treated the same by the switch.

Also I'm confused how block device can bind to mlx5_core - in that case
I'm assuming the QP is bound 1:1 with a QP on the SmartNIC side, and
that QP is plugged into an appropriate backend?
Parav Pandit Nov. 20, 2020, 3:50 a.m. UTC | #21
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, November 20, 2020 9:05 AM
> 
> On Thu, 19 Nov 2020 10:00:17 -0400 Jason Gunthorpe wrote:
> > Finally, in the mlx5 model VDPA is just an "application". It asks the
> > device to create a 'RDMA' raw ethernet packet QP that is uses rings
> > formed in the virtio-net specification. We can create it in the kernel
> > using mlx5_vdpa, and we can create it in userspace through the RDMA
> > subsystem. Like any "RDMA" application it is contained by the security
> > boundary of the PF/VF/SF the mlx5_core is running on.
> 
> Thanks for the write up!
> 
> The SF part is pretty clear to me, it is what it is. DPDK camp has been pretty
> excited about ADI/PASID for a while now.
> 
> 
> The part that's blurry to me is VDPA.
> 
> I was under the impression that for VDPA the device is supposed to support
> native virtio 2.0 (or whatever the "HW friendly" spec was).
> 
> I believe that's what the early patches from Intel did.
> 
> You're saying it's a client application like any other - do I understand it right that
> the hypervisor driver will be translating descriptors between virtio and device-
> native then?
>
The mlx5 device supports virtio descriptors natively, so no translation is needed.
 
> The vdpa parent is in the hypervisor correct?
> 
Yep. 

> Can a VDPA device have multiple children of the same type?
>
I guess you mean the VDPA parentdev? If so, yes; however, at present we see only a one-to-one mapping of vdpa device and parentdev.
 
> Why do we have a representor for a SF, if the interface is actually VDPA?
Because vdpa is just one client out of multiple.
At the moment there is a one-to-one relation of a vdpa device to a SF/VF.

> Block and net traffic can't reasonably be treated the same by the switch.
> 
> Also I'm confused how block device can bind to mlx5_core - in that case I'm
> assuming the QP is bound 1:1 with a QP on the SmartNIC side, and that QP is
> plugged into an appropriate backend?
So far there isn't an mlx5_vdpa.ko for block or a plan to do block. But yes, in the future, block would need to bind to a QP in a backend in the smartnic.
Jason Gunthorpe Nov. 20, 2020, 4:16 p.m. UTC | #22
On Thu, Nov 19, 2020 at 07:35:26PM -0800, Jakub Kicinski wrote:
> On Thu, 19 Nov 2020 10:00:17 -0400 Jason Gunthorpe wrote:
> > Finally, in the mlx5 model VDPA is just an "application". It asks the
> > device to create a 'RDMA' raw ethernet packet QP that is uses rings
> > formed in the virtio-net specification. We can create it in the kernel
> > using mlx5_vdpa, and we can create it in userspace through the RDMA
> > subsystem. Like any "RDMA" application it is contained by the security
> > boundary of the PF/VF/SF the mlx5_core is running on.
> 
> Thanks for the write up!

No problem!

> The part that's blurry to me is VDPA.

Okay, I think I see where the gap is, I'm going to elaborate below so
we are clear.

> I was under the impression that for VDPA the device is supposed to
> support native virtio 2.0 (or whatever the "HW friendly" spec was).

I think VDPA covers a wide range of things.

The basic idea is that, starting with the all-SW virtio-net
implementation, we can move parts to HW. Each implementation will
probably be a little different here. The kernel vdpa subsystem is a
toolbox to mix the required emulation and HW capability to build a
virtio-net PCI interface.

The most key question to ask of any VDPA design is "what does the VDPA
FW do with the packet once the HW accelerator has parsed the
virtio-net descriptor?".

The VDPA world has refused to agree on this due to vendor squabbling,
but mlx5 has a clear answer:

 VDPA Tx generates an ethernet packet and sends it out the SF/VF port
 through a tunnel to the representor and then on to the switchdev.

Other VDPA designs have a different answer!!

This concept is so innate to how Mellanox views the world that it is
not surprising that the cover letters and patch descriptions don't
belabor this point much :)

I'm going to deep dive through this answer below. I think you'll see
this is the most sane and coherent architecture with the tools
available in netdev.. Mellanox thinks the VDPA world should
standardize on this design so we can have a standard control plane.

> You're saying it's a client application like any other - do I understand
> it right that the hypervisor driver will be translating descriptors
> between virtio and device-native then?

No, the hypervisor creates a QP and tells the HW that this QP's
descriptor format follows virtio-net. The QP processes those
descriptors in HW and generates ethernet packets.

A "client application like any other" means that the ethernet packets
VDPA forms are identical to the ones netdev or RDMA forms. They are
all delivered into the tunnel on the SF/VF to the representor and on
to the switch. See below

> The vdpa parent is in the hypervisor correct?
> 
> Can a VDPA device have multiple children of the same type?

I'm not sure parent/child are good words here.

The VDPA emulation runs in the hypervisor, and the virtio-net netdev
driver runs in the guest. The VDPA is attached to a switchdev port and
representor tunnel by virtue of its QPs being created under a SF/VF.

If we imagine a virtio-rdma, then you might have a SF/VF hosting both
VDPA and VDPA-RDMA which emulate two PCI devices assigned to a
VM. Both of these peer virtios would generate ethernet packets for TX
on the SF/VF port into the tunnel through the representor and to the
switch.

> Why do we have a representor for a SF, if the interface is actually VDPA?
> Block and net traffic can't reasonably be treated the same by the
> switch.

I think you are focusing on queues, the architecture at PF/SF/VF is
not queue based, it is packet based.

At the physical mlx5 the netdev has a switchdev. On that switch I can
create a *switch port*.

The switch port is composed of a representor and a SF/VF. They form a
tunnel for packets.

The representor is the hypervisor side of the tunnel and contains all
packets coming out of and into the SF/VF.

The SF/VF is the guest side of the tunnel and has a full NIC.

The SF/VF can be:
 - Used in the same OS as the switch
 - Assigned to a guest VM as a PCI device
 - Assigned to another processor in the SmartNIC case.

In all cases if I use a queue on a SF/VF to generate an ethernet
packet then that packet *always* goes into the tunnel to the
representor and goes into a switch. It is always contained by any
rules on the switch side. If the switch is set so the representor is
VLAN tagged then a queue on a SF/VF *cannot* escape the VLAN tag.

Similarly SF/VF cannot Rx any packets that are not sent into the
tunnel, meaning the switch controls what packets go into the
representor, through the tunnel and to the SF.

Yes, block and net traffic are all reduced to ethernet packets, sent
through the tunnel to the representor and treated by the switch. It is
no different than a physical switch. If there is to be some net/block
difference it has to be represented in the ethernet packets, eg with
vlan or something.

This is the fundamental security boundary of the architecture. The
SF/VF is a security domain and the only exchange of information from
that security domain to the hypervisor security domain is the tunnel
to the representor. The exchange across the boundary is only *packets*
not queues.

Essentially it exactly models the physical world. If I physically plug
in a NIC to a switch then the "representor" is the switch port in the
physical switch OS and the "SF/VF" is the NIC in the server.

The switch OS does not know or care what the NIC is doing. It does not
know or care if the NIC is doing VDPA, or if the packets are "block"
or "net" - they are all just packets by the time it gets to switching.

> Also I'm confused how block device can bind to mlx5_core - in that case
> I'm assuming the QP is bound 1:1 with a QP on the SmartNIC side, and
> that QP is plugged into an appropriate backend?

Every mlx5_core is a full multi-queue instance. It can have a huge
number of queues with no problems. Do not focus on the
queues. *queues* are irrelevant here.

Queues always have two ends. In this model one end is at the CPU and
the other is just ethernet packets. The purpose of the queue is to
convert CPU stuff into ethernet packets and vice versa. A mlx5 device
has a wide range of accelerators that can do all sorts of
transformations between CPU and packets built into the queues.

A queue can only be attached to a single mlx5_core, meaning all the
ethernet packets the queue sources/sinks must come from the PF/SF/VF
port. For SF/VF this port is connected to a tunnel to a representor to
the switch. Thus every queue has its packet side connected to the
switch.

However, the *queue* is an opaque detail of how the ethernet packets
are created from CPU data.

It doesn't matter if the queue is running VDPA, RDMA, netdev, or block
traffic - all of these things inherently result in ethernet packets,
and the hypervisor can't tell how the packet was created.

The architecture is *not* like virtio. virtio queues are individual
tunnels between hypervisor and guest.

This is the key detail: A VDPA queue is *not a tunnel*. It is a engine
to covert CPU data in virtio-net format to ethernet packets and
deliver those packet to the SF/VF end of the tunnel to the representor
and then to the switch. The tunnel is the SF/VF and representor
pairing, NOT the VDPA queue.

Looking at the logical life of a Tx packet from a VM doing VDPA:
 - VM's netdev builds the skb and writes a virtio-net formed descriptor
   to a send queue
 - VM triggers a doorbell via write to a BAR. In mlx5 this write goes
   to the device - qemu mmaps part of the device BAR to the guest
 - The HW begins processing a queue. The queue is in virtio-net format
   so it fetches the descriptor and now has the skb data
 - The HW forms the skb into an ethernet packet and delivers it to the
   representor through the tunnel, which immediately sends it to the
   HW switch. The VDPA QP in the SF/VF is now done.

 - In the switch the HW determines the packet is an exception. It
   applies RSS rules/etc and dynamically identifies on a per-packet
   basis what hypervisor queue the packet should be delivered to.
   This queue is in the hypervisor, and is in mlx5 native format.
 - The chosen hypervisor queue receives this packet and begins
   processing. It gets a receive buffer, writes the packet, and
   triggers an interrupt. This queue is now done.

 - hypervisor netdev now has the packet. It does the exception path
   in netdev and puts the SKB back on another queue for TX to the
   physical port. This queue is in mlx5 native format, the packet goes
   to the physical port.

It traversed three queues. The HW dynamically selected the hypervisor
queue the VDPA packet is delivered to based *entirely* on switch
rules. The originating queue only informs the switch of what SF/VF
(and thus switch port) generated the packet.

At no point does the hypervisor know the packet originated from a VDPA
QP.

The RX side is similar: each PF/SF/VF port has a selector that
chooses which queue each packet goes to. That chooses how the packet
is converted for the CPU. Each PF/SF/VF can have a huge number of
selectors, and SF/VFs source their packets from the logical tunnel
attached to a representor which receives packets from the switch.

The selector is how the cross subsystem sharing of the ethernet port
works, regardless of PF/SF/VF.

Again the hypervisor side has *no idea* what queue the packet will be
selected to when it delivers the packet to the representor side of the
tunnel.

Jason
Alexander H Duyck Nov. 20, 2020, 5:58 p.m. UTC | #23
On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
> > On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> > > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> > >> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
> > >>
> > >>>> Just to refresh all our memory, we discussed and settled on the flow
> > >>>> in [2]; RFC [1] followed this discussion.
> > >>>>
> > >>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already
> > >>>> spawned PF, VF, SF device.
> > >>>
> > >>> Nack for the networking part of that. It'd basically be VMDq.
> > >>
> > >> What are you NAK'ing?
> > >
> > > Spawning multiple netdevs from one device by slicing up its queues.
> >
> > Why do you object to that? Slicing up h/w resources for virtual what
> > ever has been common practice for a long time.
>
> My memory of the VMDq debate is hazy, let me rope in Alex into this.
> I believe the argument was that we should offload software constructs,
> not create HW-specific APIs which depend on HW availability and
> implementation. So the path we took was offloading macvlan.

I think it somewhat depends on the type of interface we are talking
about. What we were wanting to avoid was drivers spawning their own
unique VMDq netdevs and each having a different way of doing it. The
approach Intel went with was to use a MACVLAN offload to approach it.
Although I would imagine many would argue the approach is somewhat
dated and limiting since you cannot do many offloads on a MACVLAN
interface.

With the VDPA case I believe there is a set of predefined virtio
devices that are being emulated and presented so it isn't as if they
are creating a totally new interface for this.

What I would be interested in seeing is if there are any other vendors
that have reviewed this and sign off on this approach. What we don't
want to see is Nvidia/Mellanox do this one way, then Broadcom or Intel
come along later and have yet another way of doing this. We need an
interface and feature set that will work for everyone in terms of how
this will look going forward.
Samudrala, Sridhar Nov. 20, 2020, 7:04 p.m. UTC | #24
On 11/20/2020 9:58 AM, Alexander Duyck wrote:
> On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
>>> On 11/18/20 7:14 PM, Jakub Kicinski wrote:
>>>> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
>>>>> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
>>>>>
>>>>>>> Just to refresh all our memory, we discussed and settled on the flow
>>>>>>> in [2]; RFC [1] followed this discussion.
>>>>>>>
>>>>>>> vdpa tool of [3] can add one or more vdpa device(s) on top of already
>>>>>>> spawned PF, VF, SF device.
>>>>>>
>>>>>> Nack for the networking part of that. It'd basically be VMDq.
>>>>>
>>>>> What are you NAK'ing?
>>>>
>>>> Spawning multiple netdevs from one device by slicing up its queues.
>>>
>>> Why do you object to that? Slicing up h/w resources for virtual what
>>> ever has been common practice for a long time.
>>
>> My memory of the VMDq debate is hazy, let me rope in Alex into this.
>> I believe the argument was that we should offload software constructs,
>> not create HW-specific APIs which depend on HW availability and
>> implementation. So the path we took was offloading macvlan.
> 
> I think it somewhat depends on the type of interface we are talking
> about. What we were wanting to avoid was drivers spawning their own
> unique VMDq netdevs and each having a different way of doing it. The
> approach Intel went with was to use a MACVLAN offload to approach it.
> Although I would imagine many would argue the approach is somewhat
> dated and limiting since you cannot do many offloads on a MACVLAN
> interface.

Yes. We talked about this at netdev 0x14 and the limitations of macvlan 
based offloads.
https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces

Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI as a 
netdev for kernel containers. AF_XDP ZC in a container is one of the 
usecases this would address. Today we have to pass the entire PF/VF to a 
container to do AF_XDP.

Looks like the current model is to create a subfunction of a specific 
type on auxiliary bus, do some configuration to assign resources and 
then activate the subfunction.

> 
> With the VDPA case I believe there is a set of predefined virtio
> devices that are being emulated and presented so it isn't as if they
> are creating a totally new interface for this.
> 
> What I would be interested in seeing is if there are any other vendors
> that have reviewed this and sign off on this approach. What we don't
> want to see is Nivida/Mellanox do this one way, then Broadcom or Intel
> come along later and have yet another way of doing this. We need an
> interface and feature set that will work for everyone in terms of how
> this will look going forward.
Saeed Mahameed Nov. 23, 2020, 9:51 p.m. UTC | #25
On Fri, 2020-11-20 at 11:04 -0800, Samudrala, Sridhar wrote:
> 
> On 11/20/2020 9:58 AM, Alexander Duyck wrote:
> > On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org>
> > wrote:
> > > On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
> > > > On 11/18/20 7:14 PM, Jakub Kicinski wrote:
> > > > > On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
> > > > > > On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski
> > > > > > wrote:
> > > > > > 
> > > > > > > > Just to refresh all our memory, we discussed and
> > > > > > > > settled on the flow
> > > > > > > > in [2]; RFC [1] followed this discussion.
> > > > > > > > 
> > > > > > > > vdpa tool of [3] can add one or more vdpa device(s) on
> > > > > > > > top of already
> > > > > > > > spawned PF, VF, SF device.
> > > > > > > 
> > > > > > > Nack for the networking part of that. It'd basically be
> > > > > > > VMDq.
> > > > > > 
> > > > > > What are you NAK'ing?
> > > > > 
> > > > > Spawning multiple netdevs from one device by slicing up its
> > > > > queues.
> > > > 
> > > > Why do you object to that? Slicing up h/w resources for virtual
> > > > what
> > > > ever has been common practice for a long time.
> > > 
> > > My memory of the VMDq debate is hazy, let me rope in Alex into
> > > this.
> > > I believe the argument was that we should offload software
> > > constructs,
> > > not create HW-specific APIs which depend on HW availability and
> > > implementation. So the path we took was offloading macvlan.
> > 
> > I think it somewhat depends on the type of interface we are talking
> > about. What we were wanting to avoid was drivers spawning their own
> > unique VMDq netdevs and each having a different way of doing it. 

Agreed, but SF netdevs are not VMDq netdevs; they are available in the
switchdev model, where they correspond to a full-blown port (HW domain).

> > The
> > approach Intel went with was to use a MACVLAN offload to approach
> > it.
> > Although I would imagine many would argue the approach is somewhat
> > dated and limiting since you cannot do many offloads on a MACVLAN
> > interface.
> 
> Yes. We talked about this at netdev 0x14 and the limitations of
> macvlan 
> based offloads.
> https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces
> 
> Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI
> as a 

Exactly. Subfunctions are the most generic model to overcome any SW
model limitations, e.g. the macvtap offload. All HW vendors are already
creating netdevs on a given PF/VF; all we need is to model the SF and
all the rest stays the same. Most likely everything else comes for
free, as in the mlx5 model where the netdev/RDMA interfaces are
abstracted from the underlying HW: the same netdev loads on a PF/VF/SF
or even an embedded function!


> netdev for kernel containers. AF_XDP ZC in a container is one of the 
> usecase this would address. Today we have to pass the entire PF/VF to
> a 
> container to do AF_XDP.
> 

This will be supported out of the box, for free, with SFs.

> Looks like the current model is to create a subfunction of a
> specific 
> type on auxiliary bus, do some configuration to assign resources and 
> then activate the subfunction.
> 
> > With the VDPA case I believe there is a set of predefined virtio
> > devices that are being emulated and presented so it isn't as if
> > they
> > are creating a totally new interface for this.
> > 
> > What I would be interested in seeing is if there are any other
> > vendors
> > that have reviewed this and sign off on this approach. What we
> > don't
> > want to see is Nivida/Mellanox do this one way, then Broadcom or
> > Intel
> > come along later and have yet another way of doing this. We need an
> > interface and feature set that will work for everyone in terms of
> > how
> > this will look going forward.

Well, the vDPA interface was created by the virtio community, and
especially Red Hat; I am not sure Mellanox was even involved in the
initial development stages :-)

Anyway, historically speaking, vDPA was originally created for DPDK,
but the same API applies to device drivers that can deliver the same
set of queues and API while bypassing the whole DPDK stack. Enter
kernel vDPA, which was created to overcome some of the userspace
limitations and complexity and to leverage some of the kernel's great
features such as eBPF.

https://www.redhat.com/en/blog/introduction-vdpa-kernel-framework
Jason Wang Nov. 24, 2020, 7:01 a.m. UTC | #26
On 2020/11/21 上午3:04, Samudrala, Sridhar wrote:
>
>
> On 11/20/2020 9:58 AM, Alexander Duyck wrote:
>> On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>
>>> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
>>>> On 11/18/20 7:14 PM, Jakub Kicinski wrote:
>>>>> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
>>>>>> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
>>>>>>
>>>>>>>> Just to refresh all our memory, we discussed and settled on the 
>>>>>>>> flow
>>>>>>>> in [2]; RFC [1] followed this discussion.
>>>>>>>>
>>>>>>>> vdpa tool of [3] can add one or more vdpa device(s) on top of 
>>>>>>>> already
>>>>>>>> spawned PF, VF, SF device.
>>>>>>>
>>>>>>> Nack for the networking part of that. It'd basically be VMDq.
>>>>>>
>>>>>> What are you NAK'ing?
>>>>>
>>>>> Spawning multiple netdevs from one device by slicing up its queues.
>>>>
>>>> Why do you object to that? Slicing up h/w resources for virtual what
>>>> ever has been common practice for a long time.
>>>
>>> My memory of the VMDq debate is hazy, let me rope in Alex into this.
>>> I believe the argument was that we should offload software constructs,
>>> not create HW-specific APIs which depend on HW availability and
>>> implementation. So the path we took was offloading macvlan.
>>
>> I think it somewhat depends on the type of interface we are talking
>> about. What we were wanting to avoid was drivers spawning their own
>> unique VMDq netdevs and each having a different way of doing it. The
>> approach Intel went with was to use a MACVLAN offload to approach it.
>> Although I would imagine many would argue the approach is somewhat
>> dated and limiting since you cannot do many offloads on a MACVLAN
>> interface.
>
> Yes. We talked about this at netdev 0x14 and the limitations of 
> macvlan based offloads.
> https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces 
>
>
> Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI as 
> a netdev for kernel containers. AF_XDP ZC in a container is one of the 
> usecase this would address. Today we have to pass the entire PF/VF to 
> a container to do AF_XDP.
>
> Looks like the current model is to create a subfunction of a specific 
> type on auxiliary bus, do some configuration to assign resources and 
> then activate the subfunction.
>
>>
>> With the VDPA case I believe there is a set of predefined virtio
>> devices that are being emulated and presented so it isn't as if they
>> are creating a totally new interface for this.


vDPA doesn't have any limitation on how the device is created or
implemented. It could be predefined or created dynamically. vDPA leaves
all of that to the parent device with the help of a unified management
API [1]. E.g. it could be a PCI device (PF or VF), a sub-function or a
software-emulated device.
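
As a rough sketch of how that flow might look with the management tool
proposed in [1] (the mgmtdev name below is just illustrative, not final
syntax):

$ vdpa mgmtdev show
$ vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.4
$ vdpa dev del vdpa0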


>>
>> What I would be interested in seeing is if there are any other vendors
>> that have reviewed this and sign off on this approach.


For "this approach" do you mean vDPA subfucntion? My understanding is 
that it's totally vendor specific, vDPA subsystem don't want to be 
limited by a specific type of device.


>> What we don't
>> want to see is Nivida/Mellanox do this one way, then Broadcom or Intel
>> come along later and have yet another way of doing this. We need an
>> interface and feature set that will work for everyone in terms of how
>> this will look going forward.

For the feature set, it would be hard to force vendors to implement a
common set of features (we can have a recommended set), considering
that features can be negotiated. So the management interface is
expected to implement something like CPU clusters in order to ensure
migration compatibility, or QEMU can assist with a missing feature at
a performance cost.

Thanks
Jason Wang Nov. 24, 2020, 7:05 a.m. UTC | #27
On 2020/11/24 下午3:01, Jason Wang wrote:
>
> On 2020/11/21 上午3:04, Samudrala, Sridhar wrote:
>>
>>
>> On 11/20/2020 9:58 AM, Alexander Duyck wrote:
>>> On Thu, Nov 19, 2020 at 5:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>>
>>>> On Wed, 18 Nov 2020 21:35:29 -0700 David Ahern wrote:
>>>>> On 11/18/20 7:14 PM, Jakub Kicinski wrote:
>>>>>> On Tue, 17 Nov 2020 14:49:54 -0400 Jason Gunthorpe wrote:
>>>>>>> On Tue, Nov 17, 2020 at 09:11:20AM -0800, Jakub Kicinski wrote:
>>>>>>>
>>>>>>>>> Just to refresh all our memory, we discussed and settled on 
>>>>>>>>> the flow
>>>>>>>>> in [2]; RFC [1] followed this discussion.
>>>>>>>>>
>>>>>>>>> vdpa tool of [3] can add one or more vdpa device(s) on top of 
>>>>>>>>> already
>>>>>>>>> spawned PF, VF, SF device.
>>>>>>>>
>>>>>>>> Nack for the networking part of that. It'd basically be VMDq.
>>>>>>>
>>>>>>> What are you NAK'ing?
>>>>>>
>>>>>> Spawning multiple netdevs from one device by slicing up its queues.
>>>>>
>>>>> Why do you object to that? Slicing up h/w resources for virtual what
>>>>> ever has been common practice for a long time.
>>>>
>>>> My memory of the VMDq debate is hazy, let me rope in Alex into this.
>>>> I believe the argument was that we should offload software constructs,
>>>> not create HW-specific APIs which depend on HW availability and
>>>> implementation. So the path we took was offloading macvlan.
>>>
>>> I think it somewhat depends on the type of interface we are talking
>>> about. What we were wanting to avoid was drivers spawning their own
>>> unique VMDq netdevs and each having a different way of doing it. The
>>> approach Intel went with was to use a MACVLAN offload to approach it.
>>> Although I would imagine many would argue the approach is somewhat
>>> dated and limiting since you cannot do many offloads on a MACVLAN
>>> interface.
>>
>> Yes. We talked about this at netdev 0x14 and the limitations of 
>> macvlan based offloads.
>> https://netdevconf.info/0x14/session.html?talk-hardware-acceleration-of-container-networking-interfaces 
>>
>>
>> Subfunction seems to be a good model to expose VMDq VSI or SIOV ADI 
>> as a netdev for kernel containers. AF_XDP ZC in a container is one of 
>> the usecase this would address. Today we have to pass the entire 
>> PF/VF to a container to do AF_XDP.
>>
>> Looks like the current model is to create a subfunction of a specific 
>> type on auxiliary bus, do some configuration to assign resources and 
>> then activate the subfunction.
>>
>>>
>>> With the VDPA case I believe there is a set of predefined virtio
>>> devices that are being emulated and presented so it isn't as if they
>>> are creating a totally new interface for this.
>
>
> vDPA doesn't have any limitation of how the devices is created or 
> implemented. It could be predefined or created dynamically. vDPA 
> leaves all of those to the parent device with the help of a unified 
> management API[1]. E.g It could be a PCI device (PF or VF), 
> sub-function or  software emulated devices.


Missed the link: https://www.spinics.net/lists/netdev/msg699374.html.

Thanks


>
>
>>>
>>> What I would be interested in seeing is if there are any other vendors
>>> that have reviewed this and sign off on this approach.
>
>
> For "this approach" do you mean vDPA subfucntion? My understanding is 
> that it's totally vendor specific, vDPA subsystem don't want to be 
> limited by a specific type of device.
>
>
>>> What we don't
>>> want to see is Nivida/Mellanox do this one way, then Broadcom or Intel
>>> come along later and have yet another way of doing this. We need an
>>> interface and feature set that will work for everyone in terms of how
>>> this will look going forward.
>
> For feature set,  it would be hard to force (we can have a 
> recommendation set of features) vendors to implement a common set of 
> features consider they can be negotiated. So the management interface 
> is expected to implement features like cpu clusters in order to make 
> sure the migration compatibility, or qemu can assist for the missing 
> feature with performance lose.
>
> Thanks
>
>
David Ahern Nov. 25, 2020, 5:33 a.m. UTC | #28
On 11/18/20 10:57 PM, Saeed Mahameed wrote:
> 
> We are not slicing up any queues, from our HW and FW perspective SF ==
> VF literally, a full blown HW slice (Function), with isolated control
> and data plane of its own, this is very different from VMDq and more
> generic and secure. an SF device is exactly like a VF, doesn't steal or
> share any HW resources or control/data path with others. SF is
> basically SRIOV done right.

What does that mean with respect to mac filtering and ntuple rules?

Also, Tx is fairly easy to imagine, but how does hardware know how to
direct packets for the Rx path? As an example, consider 2 VMs or
containers with the same destination ip both using subfunction devices.
How does the nic know how to direct the ingress flows to the right
queues for the subfunction?
Parav Pandit Nov. 25, 2020, 6 a.m. UTC | #29
Hi David,

> From: David Ahern <dsahern@gmail.com>
> Sent: Wednesday, November 25, 2020 11:04 AM
> 
> On 11/18/20 10:57 PM, Saeed Mahameed wrote:
> >
> > We are not slicing up any queues, from our HW and FW perspective SF ==
> > VF literally, a full blown HW slice (Function), with isolated control
> > and data plane of its own, this is very different from VMDq and more
> > generic and secure. an SF device is exactly like a VF, doesn't steal
> > or share any HW resources or control/data path with others. SF is
> > basically SRIOV done right.
> 
> What does that mean with respect to mac filtering and ntuple rules?
> 
> Also, Tx is fairly easy to imagine, but how does hardware know how to direct
> packets for the Rx path? As an example, consider 2 VMs or containers with the
> same destination ip both using subfunction devices.
Since both VMs/containers have the same IP, it is better to place them in different L2 domains via VLAN, VXLAN, etc.
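
For illustration only (reusing the uplink and SF representor names from the cover letter, with vlan 10 as a hypothetical tenant vlan), a vlan-aware steering rule could look roughly like:

$ tc filter add dev ens2f0np0 parent ffff: protocol 802.1Q prio 1 flower vlan_id 10 dst_mac 00:00:00:00:88:88 action mirred egress redirect dev ens2f0npf0sf88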

> How does the nic know how to direct the ingress flows to the right queues for
> the subfunction?
> 
Rx steering occurs through tc filters via the representor netdev of the SF, exactly the same way as with VF representor netdevs.

When the devlink eswitch port is created, as shown in the example in the cover letter and also in patch-12, it creates the representor netdevice.
Below is a snippet of it.

Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Configure mac address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
                                                ^^^^^^^^^^^^^^
This is the representor netdevice; it is created by the port add command.
The name is set up by systemd/udev v245 and higher, utilizing the existing phys_port_name infrastructure that already exists for PF and VF representors.

Now the user can add a unicast rx tc rule, for example:

$ tc filter add dev ens2f0np0 parent ffff: prio 1 flower dst_mac 00:00:00:00:88:88 action mirred egress redirect dev ens2f0npf0sf88

I didn't cover this tc example in the cover letter, to keep it short,
but I had a one-line description, as below, in the 'detail' section of the cover letter.
Hope it helps.

- A SF supports eswitch representation and tc offload support similar
  to existing PF and VF representors.

The above portion answers how to forward the packet to the subfunction.
But how is it forwarded to the right rx queue out of multiple rx queues?
This is done by the RSS configuration and the number of channels, which the user sets via ethtool, just like for a VF or PF.
The driver defaults are similar to a VF, and the user can change them via ethtool.
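
For example (illustrative only, assuming the SF netdev is named p0sf88 as in the cover letter example), the channel count and RSS spread can be adjusted with the usual ethtool commands:

$ ethtool -L p0sf88 combined 4
$ ethtool -X p0sf88 equal 4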
David Ahern Nov. 25, 2020, 2:37 p.m. UTC | #30
On 11/24/20 11:00 PM, Parav Pandit wrote:
> Hi David,
> 
>> From: David Ahern <dsahern@gmail.com>
>> Sent: Wednesday, November 25, 2020 11:04 AM
>>
>> On 11/18/20 10:57 PM, Saeed Mahameed wrote:
>>>
>>> We are not slicing up any queues, from our HW and FW perspective SF ==
>>> VF literally, a full blown HW slice (Function), with isolated control
>>> and data plane of its own, this is very different from VMDq and more
>>> generic and secure. an SF device is exactly like a VF, doesn't steal
>>> or share any HW resources or control/data path with others. SF is
>>> basically SRIOV done right.
>>
>> What does that mean with respect to mac filtering and ntuple rules?
>>
>> Also, Tx is fairly easy to imagine, but how does hardware know how to direct
>> packets for the Rx path? As an example, consider 2 VMs or containers with the
>> same destination ip both using subfunction devices.
> Since both VM/containers are having same IP, it is better to place them in different L2 domains via vlan, vxlan etc.

ok, so relying on <vlan, dmac> pairs.

> 
>> How does the nic know how to direct the ingress flows to the right queues for
>> the subfunction?
>>
> Rx steering occurs through tc filters via representor netdev of SF.
> Exactly same way as VF representor netdev operation.
> 
> When devlink eswitch port is created as shown in example in cover letter, and also in patch-12, it creates the representor netdevice.
> Below is the snippet of it.
> 
> Add a devlink port of subfunction flavour:
> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> 
> Configure mac address of the port function:
> $ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
>                                                 ^^^^^^^^^^^^^^
> This is the representor netdevice. It is created by port add command.
> This name is setup by systemd/udev v245 and higher by utilizing the existing phys_port_name infrastructure already exists for PF and VF representors.

hardware ensures only packets with that dmac are sent to the subfunction
device.

> 
> Now user can add unicast rx tc rule for example,
> 
> $ tc filter add dev ens2f0np0 parent ffff: prio 1 flower dst_mac 00:00:00:00:88:88 action mirred egress redirect dev ens2f0npf0sf88
> 
> I didn't cover this tc example in cover letter, to keep it short.
> But I had a one line description as below in the 'detail' section of cover-letter.
> Hope it helps.
> 
> - A SF supports eswitch representation and tc offload support similar
>   to existing PF and VF representors.
> 
> Now above portion answers, how to forward the packet to subfunction.
> But how to forward to the right rx queue out of multiple rxqueues?
> This is done by the rss configuration done by the user, number of channels from ethtool.
> Just like VF and PF.
> The driver defaults are similar to VF, which user can change via ethtool.
> 

so users can add flow steering or drop rules to SF devices.
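
For instance, if I understand correctly, a drop rule reusing the addresses from your example (with 10.1.1.1 as a hypothetical source) would be something like:

$ tc filter add dev ens2f0np0 parent ffff: protocol ip prio 2 flower dst_mac 00:00:00:00:88:88 src_ip 10.1.1.1 action drop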

thanks,