
[RFC,net-next,8/8] net/mlx5: Add subdev driver to bind to subdev devices

Message ID 1551418672-12822-9-git-send-email-parav@mellanox.com
State RFC
Delegated to: David Miller
Series Introducing subdev bus and devlink extension

Commit Message

Parav Pandit March 1, 2019, 5:37 a.m. UTC
Add a subdev driver to probe the subdev devices and create fake
netdevice for it.

Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  8 +-
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |  3 +
 .../ethernet/mellanox/mlx5/core/subdev_driver.c    | 93 ++++++++++++++++++++++
 4 files changed, 104 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c

Comments

gregkh@linuxfoundation.org March 1, 2019, 7:21 a.m. UTC | #1
On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> Add a subdev driver to probe the subdev devices and create fake
> netdevice for it.

So I'm guessing here is the "meat" of the whole goal here?

You just want multiple netdevices per PCI device?  Why can't you do that
today in your PCI driver?

What problem are you trying to solve that others also are having that
requires all of this?

Adding a new bus type and subsystem is fine, but usually we want more
than just one user of it, as this does not really show how it is
exercised very well.  Ideally 3 users would be there as that is when it
proves itself that it is flexible enough.

Would just using the mfd subsystem work better for you?  That provides
core support for "multi-function" drivers/devices already.  What is
missing from that subsystem that does not work for you here?

thanks,

greg k-h
Parav Pandit March 1, 2019, 5:21 p.m. UTC | #2
> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Friday, March 1, 2019 1:22 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> michal.lkml@markovi.net; davem@davemloft.net; Jiri Pirko
> <jiri@mellanox.com>
> Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> subdev devices
> 
> On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> > Add a subdev driver to probe the subdev devices and create fake
> > netdevice for it.
> 
> So I'm guessing here is the "meat" of the whole goal here?
> 
> You just want multiple netdevices per PCI device?  Why can't you do that
> today in your PCI driver?
> 
Yes, but it is not just multiple netdevices.
Let me please elaborate in detail.

There is a switchdev mode of a PCI function for netdevices.
In this mode a given netdev has an additional control netdev (called a representor netdevice = rep-ndev).
This rep-ndev is attached to OVS for adding rules, offloads etc. using the standard tc and netfilter infra.
Currently this rep-ndev controls the switch side of the settings, but not the host side of the netdev.
So there is a discussion about creating another netdev or devlink port.

Additionally, this subdev optionally has an rdma device too.

And when we are in switchdev mode, this rdma dev has a similar rdma rep device for control.

In some cases we actually don't create a netdev at all, when it is in InfiniBand mode.
Here there is PCI device->rdma_device.

In another case, a given sub device for rdma is a dual-port device, having a netdevice for each port that can use the existing netdev->dev_port.

Creating 4 devices of two different classes using one iproute2/ip or iproute2/rdma command is a horrible thing to do.

If this sub device has to be a passthrough device, the ip link command will fail badly that day, because we would be creating some sub device which is not even a netdevice.

So iproute2/devlink, which works on bus+device, mainly PCI today, seems the right abstraction point for creating sub devices.
This also extends to the rich infrastructure that is already built for mapping ports of the device, health, register debug, etc.

Additionally, we don't want the mlx5 driver and other drivers to walk through their child devices (split logic in netdev and rdma) for power management.
The kernel core code does that well today, and we would like to leverage it through subdev bus or mfd pm callbacks.

So it is a lot more than just creating netdevices.

> What problem are you trying to solve that others also are having that
> requires all of this?
> 
> Adding a new bus type and subsystem is fine, but usually we want more
> than just one user of it, as this does not really show how it is exercised very
> well.
This subdev and devlink infrastructure solves this problem of creating smaller sub devices out of one PCI device.
Someone has to start.. :-)

To my knowledge, Netronome, Broadcom and Mellanox are actively using this devlink and switchdev infra today.
I added Jakub from Netronome; he is on the netdev mailing list, but I added him in CC to hear his feedback.

> Ideally 3 users would be there as that is when it proves itself that it is
> flexible enough.
> 

We were looking at drivers/visorbus to see if we could repurpose it, but its GUID device naming scheme is just not user friendly.
It has only a single s-Par user, whose guest drivers have been in staging for more than a year now, so it doesn't really fit well.

> Would just using the mfd subsystem work better for you?  That provides
> core support for "multi-function" drivers/devices already.  What is missing
> from that subsystem that does not work for you here?
> 
We were not aware of mfd until now. I took a very high-level look just now. It is a wrapper around platform devices and seems widely used.
Before the subdev proposal, Jason suggested an alternative: create platform devices and have a driver attach to them.

When I read the kernel documentation [1], it says "platform devices typically appear as autonomous entities".
Here, instead of being autonomous, they are under the user's control.
Platform devices probably don't disappear often in a live system, as opposed to subdevices, which are created and removed dynamically quite often.

Not sure whether using a platform device for this purpose would be an abuse or not.
So guidance on which direction to go, devlink->mfd (platform wrapper) or devlink->subdev, would obviously be a huge blessing.

[1] https://www.kernel.org/doc/Documentation/driver-model/platform.txt
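
For readers unfamiliar with MFD, here is a minimal sketch of what the MFD route under discussion could look like from the mlx5 PCI probe path; the cell names and helper functions below are illustrative assumptions, not code from this RFC or from mlx5:

#include <linux/kernel.h>
#include <linux/mfd/core.h>
#include <linux/pci.h>

/* Hypothetical cell names; an MFD cell is matched by name to a
 * platform driver registered under the same name.
 */
static const struct mfd_cell mlx5_sf_cells[] = {
	{ .name = "mlx5-sf-netdev" },
	{ .name = "mlx5-sf-rdma" },
};

/* Would be called from the parent PCI driver's probe path. */
static int mlx5_sf_spawn_children(struct pci_dev *pdev)
{
	return mfd_add_devices(&pdev->dev, PLATFORM_DEVID_AUTO,
			       mlx5_sf_cells, ARRAY_SIZE(mlx5_sf_cells),
			       NULL, 0, NULL);
}

/* And from its remove path. */
static void mlx5_sf_remove_children(struct pci_dev *pdev)
{
	mfd_remove_devices(&pdev->dev);
}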
Saeed Mahameed March 1, 2019, 10:12 p.m. UTC | #3
On Thu, 2019-02-28 at 23:37 -0600, Parav Pandit wrote:
> Add a subdev driver to probe the subdev devices and create fake
> netdevice for it.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/Makefile   |  2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/main.c     |  8 +-
>  .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |  3 +
>  .../ethernet/mellanox/mlx5/core/subdev_driver.c    | 93
> ++++++++++++++++++++++
>  4 files changed, 104 insertions(+), 2 deletions(-)
>  create mode 100644
> drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> index f218789..c8aeaf1 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> @@ -16,7 +16,7 @@ mlx5_core-y :=	main.o cmd.o debugfs.o fw.o
> eq.o uar.o pagealloc.o \
>  		transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
>  		fs_counters.o rl.o lag.o dev.o events.o wq.o lib/gid.o
> \
>  		lib/devcom.o diag/fs_tracepoint.o diag/fw_tracer.o
> -mlx5_core-$(CONFIG_SUBDEV) += subdev.o
> +mlx5_core-$(CONFIG_SUBDEV) += subdev.o subdev_driver.o
>  
>  #
>  # Netdev basic
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 5f8cf0d..7dfa8c4 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -1548,7 +1548,11 @@ static int __init init(void)
>  	mlx5e_init();
>  #endif
>  
> -	return 0;
> +	err = subdev_register_driver(&mlx5_subdev_driver);
> +	if (err)
> +		pci_unregister_driver(&mlx5_core_driver);
> +
> +	return err;
>  
>  err_debug:
>  	mlx5_unregister_debugfs();
> @@ -1557,6 +1561,8 @@ static int __init init(void)
>  
>  static void __exit cleanup(void)
>  {
> +	subdev_unregister_driver(&mlx5_subdev_driver);
> +
>  #ifdef CONFIG_MLX5_CORE_EN
>  	mlx5e_cleanup();
>  #endif
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> index 2a54148..1b733c7 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> @@ -41,12 +41,15 @@
>  #include <linux/ptp_clock_kernel.h>
>  #include <linux/mlx5/cq.h>
>  #include <linux/mlx5/fs.h>
> +#include <linux/subdev_bus.h>
>  
>  #define DRIVER_NAME "mlx5_core"
>  #define DRIVER_VERSION "5.0-0"
>  
>  extern uint mlx5_core_debug_mask;
>  
> +extern struct subdev_driver mlx5_subdev_driver;
> +
>  #define mlx5_core_dbg(__dev, format, ...)				
> \
>  	dev_dbg(&(__dev)->pdev->dev, "%s:%d:(pid %d): " format,		
> \
>  		 __func__, __LINE__, current->pid,			
> \
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> new file mode 100644
> index 0000000..880aa4f
> --- /dev/null
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2018-19 Mellanox Technologies
> +
> +#include <linux/module.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/subdev_bus.h>
> +#include <linux/subdev_ids.h>
> +#include <linux/etherdevice.h>
> +
> +struct mlx5_subdev_ndev {
> +	struct net_device ndev;
> +};
> +
> +static void mlx5_dma_test(struct device *dev)
> +{
> +	dma_addr_t pa;
> +	void *va;
> +
> +	va = dma_alloc_coherent(dev, 4096, &pa, GFP_KERNEL);
> +	if (va)
> +		dma_free_coherent(dev, 4096, va, pa);
> +}
> +
> +static struct net_device *ndev;
> +
> +static int mlx5e_subdev_open(struct net_device *netdev)
> +{
> +	return 0;
> +}
> +
> +static int mlx5e_subdev_close(struct net_device *netdev)
> +{
> +	return 0;
> +}
> +
> +static netdev_tx_t
> +mlx5e_subdev_xmit(struct sk_buff *skb, struct net_device *netdev)
> +{
> +	return NETDEV_TX_BUSY;
> +}
> +
> +const struct net_device_ops mlx5e_subdev_netdev_ops = {
> +	.ndo_open                = mlx5e_subdev_open,
> +	.ndo_stop                = mlx5e_subdev_close,
> +	.ndo_start_xmit          = mlx5e_subdev_xmit,
> +};
> +
> +static int mlx5_subdev_probe(struct device *dev)
> +{
> +	int err;
> +
> +	mlx5_dma_test(dev);

Hi Parav, can you please shed some light on how you plan to
communicate with the parent device (the pci_dev and its running driver
instance)? We will need to share some resources, such as IRQs/BARs/etc.,
and maybe some HW objects which are going to be managed by the
parent pci device driver.

Just allocating a dma buffer doesn't mean anything; the dma buffer is
just bound to the generic device.

> +	/* Only one device supported in rfc */
> +	if (ndev)
> +		return 0;
> +
> +	ndev = alloc_etherdev_mqs(sizeof(struct mlx5_subdev_ndev), 1,
> 1);
> +	if (!ndev)
> +		return -ENOMEM;
> +
> +	SET_NETDEV_DEV(ndev, dev);
> +	ndev->netdev_ops = &mlx5e_subdev_netdev_ops;
> +	err = register_netdev(ndev);
> +	if (err) {
> +		free_netdev(ndev);
> +		ndev = NULL;
> +	}
> +	return err;
> +}
> +
> +static int mlx5_subdev_remove(struct device *dev)
> +{
> +	if (ndev) {
> +		unregister_netdev(ndev);
> +		free_netdev(ndev);
> +		ndev = NULL;
> +	}
> +	return 0;
> +}
> +
> +static const struct subdev_id mlx5_subdev_id_table[] = {
> +	{ .vendor_id = SUBDEV_VENDOR_ID_MELLANOX,
> +	  .device_id = SUBDEV_DEVICE_ID_MELLANOX_SF },
> +	{ 0, }
> +};
> +MODULE_DEVICE_TABLE(subdev, mlx5_subdev_id_table);
> +
> +struct subdev_driver mlx5_subdev_driver = {
> +	.id_table = mlx5_subdev_id_table,
> +	.driver.name = "mlx5_subdev_driver",
> +	.driver.probe = mlx5_subdev_probe,
> +	.driver.remove = mlx5_subdev_remove,
> +};
Parav Pandit March 4, 2019, 4:45 p.m. UTC | #4
Hi Saeed,

> -----Original Message-----
> From: Saeed Mahameed
> Sent: Friday, March 1, 2019 4:12 PM
> To: Jiri Pirko <jiri@mellanox.com>; netdev@vger.kernel.org; linux-
> kernel@vger.kernel.org; Parav Pandit <parav@mellanox.com>;
> davem@davemloft.net; gregkh@linuxfoundation.org;
> michal.lkml@markovi.net
> Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> subdev devices
> 
> On Thu, 2019-02-28 at 23:37 -0600, Parav Pandit wrote:
> > Add a subdev driver to probe the subdev devices and create fake
> > netdevice for it.
> >
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/Makefile   |  2 +-
> >  drivers/net/ethernet/mellanox/mlx5/core/main.c     |  8 +-
> >  .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |  3 +
> >  .../ethernet/mellanox/mlx5/core/subdev_driver.c    | 93
> > ++++++++++++++++++++++
> >  4 files changed, 104 insertions(+), 2 deletions(-)  create mode
> > 100644 drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > index f218789..c8aeaf1 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> > @@ -16,7 +16,7 @@ mlx5_core-y :=	main.o cmd.o debugfs.o fw.o
> > eq.o uar.o pagealloc.o \
> >  		transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
> >  		fs_counters.o rl.o lag.o dev.o events.o wq.o lib/gid.o \
> >  		lib/devcom.o diag/fs_tracepoint.o diag/fw_tracer.o
> > -mlx5_core-$(CONFIG_SUBDEV) += subdev.o
> > +mlx5_core-$(CONFIG_SUBDEV) += subdev.o subdev_driver.o
> >
> >  #
> >  # Netdev basic
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > index 5f8cf0d..7dfa8c4 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > @@ -1548,7 +1548,11 @@ static int __init init(void)
> >  	mlx5e_init();
> >  #endif
> >
> > -	return 0;
> > +	err = subdev_register_driver(&mlx5_subdev_driver);
> > +	if (err)
> > +		pci_unregister_driver(&mlx5_core_driver);
> > +
> > +	return err;
> >
> >  err_debug:
> >  	mlx5_unregister_debugfs();
> > @@ -1557,6 +1561,8 @@ static int __init init(void)
> >
> >  static void __exit cleanup(void)
> >  {
> > +	subdev_unregister_driver(&mlx5_subdev_driver);
> > +
> >  #ifdef CONFIG_MLX5_CORE_EN
> >  	mlx5e_cleanup();
> >  #endif
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> > b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> > index 2a54148..1b733c7 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> > @@ -41,12 +41,15 @@
> >  #include <linux/ptp_clock_kernel.h>
> >  #include <linux/mlx5/cq.h>
> >  #include <linux/mlx5/fs.h>
> > +#include <linux/subdev_bus.h>
> >
> >  #define DRIVER_NAME "mlx5_core"
> >  #define DRIVER_VERSION "5.0-0"
> >
> >  extern uint mlx5_core_debug_mask;
> >
> > +extern struct subdev_driver mlx5_subdev_driver;
> > +
> >  #define mlx5_core_dbg(__dev, format, ...)
> > \
> >  	dev_dbg(&(__dev)->pdev->dev, "%s:%d:(pid %d): " format,
> 
> > \
> >  		 __func__, __LINE__, current->pid,
> > \
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> > new file mode 100644
> > index 0000000..880aa4f
> > --- /dev/null
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
> > @@ -0,0 +1,93 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// Copyright (c) 2018-19 Mellanox Technologies
> > +
> > +#include <linux/module.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/subdev_bus.h>
> > +#include <linux/subdev_ids.h>
> > +#include <linux/etherdevice.h>
> > +
> > +struct mlx5_subdev_ndev {
> > +	struct net_device ndev;
> > +};
> > +
> > +static void mlx5_dma_test(struct device *dev) {
> > +	dma_addr_t pa;
> > +	void *va;
> > +
> > +	va = dma_alloc_coherent(dev, 4096, &pa, GFP_KERNEL);
> > +	if (va)
> > +		dma_free_coherent(dev, 4096, va, pa); }
> > +
> > +static struct net_device *ndev;
> > +
> > +static int mlx5e_subdev_open(struct net_device *netdev) {
> > +	return 0;
> > +}
> > +
> > +static int mlx5e_subdev_close(struct net_device *netdev) {
> > +	return 0;
> > +}
> > +
> > +static netdev_tx_t
> > +mlx5e_subdev_xmit(struct sk_buff *skb, struct net_device *netdev) {
> > +	return NETDEV_TX_BUSY;
> > +}
> > +
> > +const struct net_device_ops mlx5e_subdev_netdev_ops = {
> > +	.ndo_open                = mlx5e_subdev_open,
> > +	.ndo_stop                = mlx5e_subdev_close,
> > +	.ndo_start_xmit          = mlx5e_subdev_xmit,
> > +};
> > +
> > +static int mlx5_subdev_probe(struct device *dev) {
> > +	int err;
> > +
> > +	mlx5_dma_test(dev);
> 
> Hi Parav, can you please shed some light on how do you plan to
> communicate with the parent device ? (pci_dev and its running driver
> instance), We will need to share some resources, such as IRQs/BARs/etc ..,
> and maybe some HW objects which are going to be managed by the parent
> pci device driver.
> 
Since the mlx5 driver works on its pci device, in mlx5_subdev_probe(struct device *device),
device->parent is the PCI device for the driver to use.
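
As a purely illustrative sketch (not part of this RFC), reaching the parent from the subdev probe could look roughly like the following, assuming dev->parent is the mlx5 PCI device as stated above and that the PCI drvdata holds the struct mlx5_core_dev, as mlx5's init_one() sets it today:

#include <linux/device.h>
#include <linux/pci.h>

struct mlx5_core_dev;	/* owned by the parent PCI driver; opaque here */

static int mlx5_subdev_probe_sketch(struct device *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev->parent);
	struct mlx5_core_dev *mdev = pci_get_drvdata(pdev);

	/* The parent driver owns the BARs, IRQ vectors and HW objects;
	 * the subdev driver would request slices of those through
	 * driver-internal APIs on mdev rather than touching them directly.
	 */
	return mdev ? 0 : -ENODEV;
}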

> Just allocating a dma buffer doesn't mean anything, the dma buffer is just
> bound to the generic device.
>

The dma buffer allocation is just there to make sure that the stack, the core and rdma ULPs can do dma allocations against the subdev in the same way as against the PCI device.
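
A minimal sketch of what that implies for the subdev core (an assumption, not code from this series): for dma_alloc_coherent() on the child to behave as it does on the PCI parent, the bus would have to propagate the parent's DMA configuration along these lines:

#include <linux/device.h>
#include <linux/dma-mapping.h>

/* Mirror the parent PCI device's DMA setup on the child so ULPs can keep
 * allocating against the subdev's struct device.
 */
static void subdev_inherit_dma_config(struct device *child,
				      struct device *parent)
{
	child->dma_mask = parent->dma_mask;
	child->coherent_dma_mask = parent->coherent_dma_mask;
	child->dma_parms = parent->dma_parms;
	set_dma_ops(child, get_dma_ops(parent));
}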

> > +	/* Only one device supported in rfc */
> > +	if (ndev)
> > +		return 0;
> > +
> > +	ndev = alloc_etherdev_mqs(sizeof(struct mlx5_subdev_ndev), 1,
> > 1);
> > +	if (!ndev)
> > +		return -ENOMEM;
> > +
> > +	SET_NETDEV_DEV(ndev, dev);
> > +	ndev->netdev_ops = &mlx5e_subdev_netdev_ops;
> > +	err = register_netdev(ndev);
> > +	if (err) {
> > +		free_netdev(ndev);
> > +		ndev = NULL;
> > +	}
> > +	return err;
> > +}
> > +
> > +static int mlx5_subdev_remove(struct device *dev) {
> > +	if (ndev) {
> > +		unregister_netdev(ndev);
> > +		free_netdev(ndev);
> > +		ndev = NULL;
> > +	}
> > +	return 0;
> > +}
> > +
> > +static const struct subdev_id mlx5_subdev_id_table[] = {
> > +	{ .vendor_id = SUBDEV_VENDOR_ID_MELLANOX,
> > +	  .device_id = SUBDEV_DEVICE_ID_MELLANOX_SF },
> > +	{ 0, }
> > +};
> > +MODULE_DEVICE_TABLE(subdev, mlx5_subdev_id_table);
> > +
> > +struct subdev_driver mlx5_subdev_driver = {
> > +	.id_table = mlx5_subdev_id_table,
> > +	.driver.name = "mlx5_subdev_driver",
> > +	.driver.probe = mlx5_subdev_probe,
> > +	.driver.remove = mlx5_subdev_remove, };
gregkh@linuxfoundation.org March 5, 2019, 7:13 a.m. UTC | #5
On Fri, Mar 01, 2019 at 05:21:13PM +0000, Parav Pandit wrote:
> 
> 
> > -----Original Message-----
> > From: Greg KH <gregkh@linuxfoundation.org>
> > Sent: Friday, March 1, 2019 1:22 AM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > michal.lkml@markovi.net; davem@davemloft.net; Jiri Pirko
> > <jiri@mellanox.com>
> > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> > subdev devices
> > 
> > On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> > > Add a subdev driver to probe the subdev devices and create fake
> > > netdevice for it.
> > 
> > So I'm guessing here is the "meat" of the whole goal here?
> > 
> > You just want multiple netdevices per PCI device?  Why can't you do that
> > today in your PCI driver?
> > 
> Yes, but it just not multiple netdevices.
> Let me please elaborate in detail.
> 
> There is a swichdev mode of a PCI function for netdevices.
> In this mode a given netdev has additional control netdev (called representor netdevice = rep-ndev).
> This rep-ndev is attached to OVS for adding rules, offloads etc using standard tc, netfilter infra.
> Currently this rep-ndev controls switch side of the settings, but not the host side of netdev.
> So there is discussion to create another netdev or devlink port..
> 
> Additionally this subdev has optional rdma device too.
> 
> And when we are in switchdev mode, this rdma dev has similar rdma rep device for control.
> 
> In some cases we actually don't create netdev when it is in InfiniBand mode.
> Here there is PCI device->rdma_device.
> 
> In other case, a given sub device for rdma is dual port device, having netdevice for each that can use existing netdev->dev_port.
> 
> Creating 4 devices of two different classes using one iproute2/ip or iproute2/rdma command is horrible thing to do.

Why is that?

> In case if this sub device has to be a passthrough device, ip link command will fail badly that day, because we are creating some sub device which is not even a netdevice.

But it is a network device, right?

> So iproute2/devlink which works on bus+device, mainly PCI today, seems right abstraction point to create sub devices.
> This also extends to map ports of the device, health, registers debug, etc rich infrastructure that is already built.
> 
> Additionally, we don't want mlx driver and other drivers to go through its child devices (split logic in netdev and rdma) for power management.

And how is power management going to work with your new devices?  All
you have here is a tiny shim around a driver bus; I do not see any new
functionality and, as others have said, no way to actually share or
split up the PCI resources.

> Kernel core code does that well today, that we like to leverage through subdev bus or mfd pm callbacks.
> 
> So it is lot more than just creating netdevices.

But that's all you are showing here :)

> > What problem are you trying to solve that others also are having that
> > requires all of this?
> > 
> > Adding a new bus type and subsystem is fine, but usually we want more
> > than just one user of it, as this does not really show how it is exercised very
> > well.
> This subdev and devlink infrastructure solves this problem of creating smaller sub devices out of one PCI device.
> Someone has to start.. :-)

That is what a mfd should allow you to do.

> To my knowledge, currently Netronome, Broadcom and Mellanox are actively using this devlink and switchdev infra today.

Where are they "using it"?  This patchset does not show that.

> > Ideally 3 users would be there as that is when it proves itself that it is
> > flexible enough.
> > 
> 
> We were looking at drivers/visorbus if we can repurpose it, but GUID device naming scheme is just not user friendly.

You can always change the naming scheme if needed.  But why isn't a GUID
ok?  It's very easy to reserve properly, and you do not need a central
naming "authority".

> > Would just using the mfd subsystem work better for you?  That provides
> > core support for "multi-function" drivers/devices already.  What is missing
> > from that subsystem that does not work for you here?
> > 
> We were not aware of mfd until now. I looked at very high level now. It's a wrapper to platform devices and seems widely use.
> Before subdev proposal, Jason suggested an alternative is to create platform devices and driver attach to it.
> 
> When I read kernel documentation [1], it says "platform devices typically appear as autonomous entities"
> Here instead of autonomy, it is in user's control.
> Platform devices probably don't disappear a lot in live system as opposed to subdevices which are created and removed dynamically a lot often.
> 
> Not sure if platform device is abuse for this purpose or not.

No, do not abuse a platform device.  You should be able to just use a
normal PCI device for this just fine, and if not, we should be able to
make the needed changes to mfd for that.

thanks,

greg k-h
Parav Pandit March 5, 2019, 5:57 p.m. UTC | #6
> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Tuesday, March 5, 2019 1:14 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> michal.lkml@markovi.net; davem@davemloft.net; Jiri Pirko
> <jiri@mellanox.com>; Jakub Kicinski <jakub.kicinski@netronome.com>
> Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> subdev devices
> 
> On Fri, Mar 01, 2019 at 05:21:13PM +0000, Parav Pandit wrote:
> >
> >
> > > -----Original Message-----
> > > From: Greg KH <gregkh@linuxfoundation.org>
> > > Sent: Friday, March 1, 2019 1:22 AM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > michal.lkml@markovi.net; davem@davemloft.net; Jiri Pirko
> > > <jiri@mellanox.com>
> > > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind
> > > to subdev devices
> > >
> > > On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> > > > Add a subdev driver to probe the subdev devices and create fake
> > > > netdevice for it.
> > >
> > > So I'm guessing here is the "meat" of the whole goal here?
> > >
> > > You just want multiple netdevices per PCI device?  Why can't you do
> > > that today in your PCI driver?
> > >
> > Yes, but it just not multiple netdevices.
> > Let me please elaborate in detail.
> >
> > There is a swichdev mode of a PCI function for netdevices.
> > In this mode a given netdev has additional control netdev (called
> representor netdevice = rep-ndev).
> > This rep-ndev is attached to OVS for adding rules, offloads etc using
> standard tc, netfilter infra.
> > Currently this rep-ndev controls switch side of the settings, but not the
> host side of netdev.
> > So there is discussion to create another netdev or devlink port..
> >
> > Additionally this subdev has optional rdma device too.
> >
> > And when we are in switchdev mode, this rdma dev has similar rdma rep
> device for control.
> >
> > In some cases we actually don't create netdev when it is in InfiniBand
> mode.
> > Here there is PCI device->rdma_device.
> >
> > In other case, a given sub device for rdma is dual port device, having
> netdevice for each that can use existing netdev->dev_port.
> >
> > Creating 4 devices of two different classes using one iproute2/ip or
> iproute2/rdma command is horrible thing to do.
> 
> Why is that?
> 
When the user creates the device, the user tool needs to return a handle to the device that got created.
Creating multiple devices doesn't make sense; I haven't seen any tool doing such a crazy thing.

> > In case if this sub device has to be a passthrough device, ip link command
> will fail badly that day, because we are creating some sub device which is not
> even a netdevice.
> 
> But it is a network device, right?
> 
When there is a passthrough subdevice, there won't be a netdevice created.
We don't want to create a passthrough subdevice using the iproute2/ip tool, which primarily works on netdevices.

> > So iproute2/devlink which works on bus+device, mainly PCI today, seems
> right abstraction point to create sub devices.
> > This also extends to map ports of the device, health, registers debug, etc
> rich infrastructure that is already built.
> >
> > Additionally, we don't want mlx driver and other drivers to go through its
> child devices (split logic in netdev and rdma) for power management.
> 
> And how is power management going to work with your new devices?  All
> you have here is a tiny shim around a driver bus, 
So subdevice power management is done before their parent's.
The vendor driver doesn't need to iterate over its child devices to suspend/resume them.
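
A hedged sketch of how a subdev driver could express that (none of these names are in the RFC): supply dev_pm_ops on the embedded struct device_driver, which the driver core invokes for children before suspending the parent:

#include <linux/pm.h>

static int mlx5_subdev_suspend(struct device *dev)
{
	/* quiesce only this sub-function's queues and interrupts here */
	return 0;
}

static int mlx5_subdev_resume(struct device *dev)
{
	/* re-arm only this sub-function here */
	return 0;
}

static SIMPLE_DEV_PM_OPS(mlx5_subdev_pm_ops,
			 mlx5_subdev_suspend, mlx5_subdev_resume);

It would then be wired up as .driver.pm = &mlx5_subdev_pm_ops in mlx5_subdev_driver, so the core suspends the children before the parent PCI device without the parent driver walking them.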

> I do not see any new
> functionality, and as others have said, no way to actually share, or split up,
> the PCI resources.
> 
The devlink tool's create command will be able to accept more parameters at device creation time to share and split PCI resources.
This is just the start of the development, and the RFC is to agree on the direction.
The devlink tool has parameter options that can be queried/set, and the existing infra will be used for granular device config.

> > Kernel core code does that well today, that we like to leverage through
> subdev bus or mfd pm callbacks.
> >
> > So it is lot more than just creating netdevices.
> 
> But that's all you are showing here :)
> 
The starting use case is netdev and rdma, but we don't want to create new tools a few months or a year later for passthrough mode or for different link layers, etc.

> > > What problem are you trying to solve that others also are having
> > > that requires all of this?
> > >
> > > Adding a new bus type and subsystem is fine, but usually we want
> > > more than just one user of it, as this does not really show how it
> > > is exercised very well.
> > This subdev and devlink infrastructure solves this problem of creating
> smaller sub devices out of one PCI device.
> > Someone has to start.. :-)
> 
> That is what a mfd should allow you to do.
> 
I did a cursory look at mfd.
It lacks removal of specific devices, but that is a small thing; it can be enhanced to remove a specific mfd device.

> > To my knowledge, currently Netronome, Broadcom and Mellanox are
> actively using this devlink and switchdev infra today.
> 
> Where are they "using it"?  This patchset does not show that.
> 
The devlink and switchdev mode for SRIOV is common among these vendors and more.
The code is in,
drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c
drivers/net/ethernet/netronome/nfp/nfp_net_main.c
drivers/net/ethernet/mellanox/mlx5/core/main.c

This patchset covers only mlx5, but other vendors who also intend to create subdevices will be able to reuse it.
This RFC doesn't cover other vendors.
Jakub and the netdev list are in CC. We are discussing this with Jakub in this patchset discussion.

> > > Ideally 3 users would be there as that is when it proves itself that
> > > it is flexible enough.
> > >
> >
> > We were looking at drivers/visorbus if we can repurpose it, but GUID
> device naming scheme is just not user friendly.
> 
> You can always change the naming scheme if needed.  But why isn't a GUID
> ok? 
I think it was ok.
A vendor/device id scheme seems more user friendly and is in the kernel's control; it also fits with the existing modpost tools.
A GUID could be used instead of a vendor/device id.
However, visorbus is tied to acpi, and its device life cycle is handled very differently, under a workqueue.
It is also meant for one vendor's s-Par devices.
Its guest drivers have been in staging without a clear roadmap for more than a year now.
So we do not want to depend on it; mfd or a dedicated bus seems a better fit.

> It's very easy to reserve properly, and you do not need a central naming
> "authority".
> 
> > > Would just using the mfd subsystem work better for you?  That
> > > provides core support for "multi-function" drivers/devices already.
> > > What is missing from that subsystem that does not work for you here?
> > >
> > We were not aware of mfd until now. I looked at very high level now. It's a
> wrapper to platform devices and seems widely use.
> > Before subdev proposal, Jason suggested an alternative is to create
> platform devices and driver attach to it.
> >
> > When I read kernel documentation [1], it says "platform devices typically
> appear as autonomous entities"
> > Here instead of autonomy, it is in user's control.
> > Platform devices probably don't disappear a lot in live system as opposed
> to subdevices which are created and removed dynamically a lot often.
> >
> > Not sure if platform device is abuse for this purpose or not.
> 
> No, do not abuse a platform device. 
Yes, that is my point: mfd devices are platform devices.
mfd creates platform devices, and to match them, platform_driver_register() has to be called to bind to them.
I do not currently know if we have the flexibility to say that instead of binding driver X, driver Y should be bound for platform devices.
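
To make the binding concrete, a hedged sketch under the assumption of a hypothetical cell named "mlx5-sf-netdev" (not from mlx5 or MFD): a platform driver binds to an MFD cell by matching the cell name, so the driver that ends up bound is chosen by whichever cell name the parent registers:

#include <linux/module.h>
#include <linux/platform_device.h>

static int mlx5_sf_netdev_probe(struct platform_device *pdev)
{
	/* pdev->dev.parent is the device that registered the MFD cells */
	return 0;
}

static int mlx5_sf_netdev_remove(struct platform_device *pdev)
{
	return 0;
}

static struct platform_driver mlx5_sf_netdev_driver = {
	.probe	= mlx5_sf_netdev_probe,
	.remove	= mlx5_sf_netdev_remove,
	.driver	= {
		.name = "mlx5-sf-netdev",	/* must match the cell name */
	},
};
module_platform_driver(mlx5_sf_netdev_driver);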

> You should be able to just use a normal
> PCI device for this just fine, and if not, we should be able to make the
> needed changes to mfd for that.
> 
Ok, so a parent pci device and mfd devices.
mfd seems to fit this use case.
Do you think the 'Platform devices' section in [1] is stale regarding the autonomy, host bridge, soc platform, etc. points?
Should we update the documentation to indicate that it can be used for non-autonomous, user-created devices, and for creating devices on top of a PCI parent device, etc.?

 [1] https://www.kernel.org/doc/Documentation/driver-model/platform.txt
gregkh@linuxfoundation.org March 5, 2019, 7:27 p.m. UTC | #7
On Tue, Mar 05, 2019 at 05:57:58PM +0000, Parav Pandit wrote:
> 
> 
> > -----Original Message-----
> > From: Greg KH <gregkh@linuxfoundation.org>
> > Sent: Tuesday, March 5, 2019 1:14 AM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > michal.lkml@markovi.net; davem@davemloft.net; Jiri Pirko
> > <jiri@mellanox.com>; Jakub Kicinski <jakub.kicinski@netronome.com>
> > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> > subdev devices
> > 
> > On Fri, Mar 01, 2019 at 05:21:13PM +0000, Parav Pandit wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Greg KH <gregkh@linuxfoundation.org>
> > > > Sent: Friday, March 1, 2019 1:22 AM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > michal.lkml@markovi.net; davem@davemloft.net; Jiri Pirko
> > > > <jiri@mellanox.com>
> > > > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind
> > > > to subdev devices
> > > >
> > > > On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> > > > > Add a subdev driver to probe the subdev devices and create fake
> > > > > netdevice for it.
> > > >
> > > > So I'm guessing here is the "meat" of the whole goal here?
> > > >
> > > > You just want multiple netdevices per PCI device?  Why can't you do
> > > > that today in your PCI driver?
> > > >
> > > Yes, but it just not multiple netdevices.
> > > Let me please elaborate in detail.
> > >
> > > There is a swichdev mode of a PCI function for netdevices.
> > > In this mode a given netdev has additional control netdev (called
> > representor netdevice = rep-ndev).
> > > This rep-ndev is attached to OVS for adding rules, offloads etc using
> > standard tc, netfilter infra.
> > > Currently this rep-ndev controls switch side of the settings, but not the
> > host side of netdev.
> > > So there is discussion to create another netdev or devlink port..
> > >
> > > Additionally this subdev has optional rdma device too.
> > >
> > > And when we are in switchdev mode, this rdma dev has similar rdma rep
> > device for control.
> > >
> > > In some cases we actually don't create netdev when it is in InfiniBand
> > mode.
> > > Here there is PCI device->rdma_device.
> > >
> > > In other case, a given sub device for rdma is dual port device, having
> > netdevice for each that can use existing netdev->dev_port.
> > >
> > > Creating 4 devices of two different classes using one iproute2/ip or
> > iproute2/rdma command is horrible thing to do.
> > 
> > Why is that?
> > 
> When user creates the device, user tool needs to return a device handle that got created.
> Creating multiple devices doesn't make sense. I haven't seen any tool doing such crazy thing.

And what do you mean by "device handle"?  All you get here is a sysfs
device tree.

> > > In case if this sub device has to be a passthrough device, ip link command
> > will fail badly that day, because we are creating some sub device which is not
> > even a netdevice.
> > 
> > But it is a network device, right?
> > 
> When there is passthrough subdevice, there won't be netdevice created.
> We don't want to create passthrough subdevice using iproute2/ip tool which primarily works on netdevices.

I don't know enough networking to claim anything here, so I'll ignore
this :)

> > > So iproute2/devlink which works on bus+device, mainly PCI today, seems
> > right abstraction point to create sub devices.
> > > This also extends to map ports of the device, health, registers debug, etc
> > rich infrastructure that is already built.
> > >
> > > Additionally, we don't want mlx driver and other drivers to go through its
> > child devices (split logic in netdev and rdma) for power management.
> > 
> > And how is power management going to work with your new devices?  All
> > you have here is a tiny shim around a driver bus, 
> So subdevices power management is done before their parent's.
> Vendor driver doesn't need to iterate its child devices to suspend/resume it.

True, so we can just autosuspend these "children" devices and the "vendor
driver" is not going to care?  You are going to care as you are talking
to the same PCI device.  This goes to the other question about "how are
you sharing PCI device resources?"

> > I do not see any new
> > functionality, and as others have said, no way to actually share, or split up,
> > the PCI resources.
> > 
> devlink tool create command will be able to accept more parameters during device creation time to share and split PCI resources.
> This is just the start of the development and RFC is to agree on direction.
> devlink tool has parameters options that can be queried/set and existing infra will be used for granular device config.

Pointers to this beast?

> > > Kernel core code does that well today, that we like to leverage through
> > subdev bus or mfd pm callbacks.
> > >
> > > So it is lot more than just creating netdevices.
> > 
> > But that's all you are showing here :)
> > 
> Starting use case is netdev and rdma, but we don't want to create new
> tools few months/a year later for passthrough mode or for different
> link layers etc.

And I don't want to see duplicated driver model code happen either,
which is why I point out the MFD layer :)

> > > > What problem are you trying to solve that others also are having
> > > > that requires all of this?
> > > >
> > > > Adding a new bus type and subsystem is fine, but usually we want
> > > > more than just one user of it, as this does not really show how it
> > > > is exercised very well.
> > > This subdev and devlink infrastructure solves this problem of creating
> > smaller sub devices out of one PCI device.
> > > Someone has to start.. :-)
> > 
> > That is what a mfd should allow you to do.
> > 
> I did cursory look at mfd.
> It lacks removing specific devices, but that is small. It can be
> enhanced to remove specific mfd device.

That should be easy enough, work with the MFD developers.  I think
something like that should work today as you can use USB devices with
MFD, right?

> > 
> > No, do not abuse a platform device. 
> Yes. that is my point mfd devices are platform devices.
> mfd creates platform devices. and to match to it, platfrom_register_driver() have to be called to bind to it.
> I do not know currently if we have the flexibility to say that instead of binding X driver, bind Y driver for platform devices.

try it :)

> > You should be able to just use a normal
> > PCI device for this just fine, and if not, we should be able to make the
> > needed changes to mfd for that.
> > 
> Ok. so parent pci device and mfd devices.
> mfd seems to fit this use case.
> Do you think 'Platform devices' section is stale in [1] for autonomy, host bridge, soc platform etc points?

Nope, they are still horrible things and I hate them :)

Maybe we should just make MFD create "virtual" devices (bare ones, no
need for platform stuff), and that would solve the issue of the platform
device bloat being dragged around everywhere.

> Should we update the documentation to indicate that it can be used for
> non-autonomous, user created devices and it can be used for creating
> devices on top of PCI parent device etc?

Nope, leave it alone please.

thanks,

greg k-h
Parav Pandit March 5, 2019, 9:37 p.m. UTC | #8
> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Tuesday, March 5, 2019 1:27 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> michal.lkml@markovi.net; davem@davemloft.net; Jiri Pirko
> <jiri@mellanox.com>; Jakub Kicinski <jakub.kicinski@netronome.com>
> Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> subdev devices
> 
> On Tue, Mar 05, 2019 at 05:57:58PM +0000, Parav Pandit wrote:
> >
> >
> > > -----Original Message-----
> > > From: Greg KH <gregkh@linuxfoundation.org>
> > > Sent: Tuesday, March 5, 2019 1:14 AM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > michal.lkml@markovi.net; davem@davemloft.net; Jiri Pirko
> > > <jiri@mellanox.com>; Jakub Kicinski <jakub.kicinski@netronome.com>
> > > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind
> > > to subdev devices
> > >
> > > On Fri, Mar 01, 2019 at 05:21:13PM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Greg KH <gregkh@linuxfoundation.org>
> > > > > Sent: Friday, March 1, 2019 1:22 AM
> > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > michal.lkml@markovi.net; davem@davemloft.net; Jiri Pirko
> > > > > <jiri@mellanox.com>
> > > > > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to
> > > > > bind to subdev devices
> > > > >
> > > > > On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> > > > > > Add a subdev driver to probe the subdev devices and create
> > > > > > fake netdevice for it.
> > > > >
> > > > > So I'm guessing here is the "meat" of the whole goal here?
> > > > >
> > > > > You just want multiple netdevices per PCI device?  Why can't you
> > > > > do that today in your PCI driver?
> > > > >
> > > > Yes, but it just not multiple netdevices.
> > > > Let me please elaborate in detail.
> > > >
> > > > There is a swichdev mode of a PCI function for netdevices.
> > > > In this mode a given netdev has additional control netdev (called
> > > representor netdevice = rep-ndev).
> > > > This rep-ndev is attached to OVS for adding rules, offloads etc
> > > > using
> > > standard tc, netfilter infra.
> > > > Currently this rep-ndev controls switch side of the settings, but
> > > > not the
> > > host side of netdev.
> > > > So there is discussion to create another netdev or devlink port..
> > > >
> > > > Additionally this subdev has optional rdma device too.
> > > >
> > > > And when we are in switchdev mode, this rdma dev has similar rdma
> > > > rep
> > > device for control.
> > > >
> > > > In some cases we actually don't create netdev when it is in
> > > > InfiniBand
> > > mode.
> > > > Here there is PCI device->rdma_device.
> > > >
> > > > In other case, a given sub device for rdma is dual port device,
> > > > having
> > > netdevice for each that can use existing netdev->dev_port.
> > > >
> > > > Creating 4 devices of two different classes using one iproute2/ip
> > > > or
> > > iproute2/rdma command is horrible thing to do.
> > >
> > > Why is that?
> > >
> > When user creates the device, user tool needs to return a device handle
> that got created.
> > Creating multiple devices doesn't make sense. I haven't seen any tool
> doing such crazy thing.
> 
> And what do you mean by "device handle"?  All you get here is a sysfs device
> tree.
> 
Subdev devices are created using the devlink tool, which works on a device handle.
A device handle is defined using the bus/device of a 'struct device'.
It is described in [1].
$ devlink dev add DEV creates a new devlink device instance and its backing 'struct device'.
This command returns the device handle = the new devlink instance's bus/name.
Patch 6 in the series returns the device handle.
Patch 6 is at [2], with an example in it where the sysfs name and the devlink name match each other.

> > > > In case if this sub device has to be a passthrough device, ip link
> > > > command
> > > will fail badly that day, because we are creating some sub device
> > > which is not even a netdevice.
> > >
> > > But it is a network device, right?
> > >
> > When there is passthrough subdevice, there won't be netdevice created.
> > We don't want to create passthrough subdevice using iproute2/ip tool
> which primarily works on netdevices.
> 
> I don't know enough networking to claim anything here, so I'll ignore this :)
> 
> > > > So iproute2/devlink which works on bus+device, mainly PCI today,
> > > > seems
> > > right abstraction point to create sub devices.
> > > > This also extends to map ports of the device, health, registers
> > > > debug, etc
> > > rich infrastructure that is already built.
> > > >
> > > > Additionally, we don't want mlx driver and other drivers to go
> > > > through its
> > > child devices (split logic in netdev and rdma) for power management.
> > >
> > > And how is power management going to work with your new devices?
> > > All you have here is a tiny shim around a driver bus,
> > So subdevices power management is done before their parent's.
> > Vendor driver doesn't need to iterate its child devices to suspend/resume
> it.
> 
> True, so we can just autosuspend these "children" device and the "vendor
> driver" is not going to care?  You are going to care as you are talking to the
> same PCI device.  
Oh, the vendor driver certainly cares.
The subdev vendor driver implements driver->pm callbacks that work on just a specific subdev.
Patch 2 in the series, at [3], implements the shim layer by connecting the core pm layer to the driver pm callbacks.

> This goes to the other question about "how are you
> sharing PCI device resources?"
> 
Currently it is an equal distribution among all subdevices.
But when an actual user arises asking for specific resource reservation, etc., we will add those parameters using the existing devlink infra [4].

> > > I do not see any new
> > > functionality, and as others have said, no way to actually share, or
> > > split up, the PCI resources.
> > >
> > devlink tool create command will be able to accept more parameters
> during device creation time to share and split PCI resources.
> > This is just the start of the development and RFC is to agree on direction.
> > devlink tool has parameters options that can be queried/set and existing
> infra will be used for granular device config.
> 
> Pointers to this beast?
> 
[1] and [4].

> > > > Kernel core code does that well today, that we like to leverage
> > > > through
> > > subdev bus or mfd pm callbacks.
> > > >
> > > > So it is lot more than just creating netdevices.
> > >
> > > But that's all you are showing here :)
> > >
> > Starting use case is netdev and rdma, but we don't want to create new
> > tools few months/a year later for passthrough mode or for different
> > link layers etc.
> 
> And I don't want to see duplicated driver model code happen either, which
> is why I point out the MFD layer :)
> 
Yes. Sure.

> > > > > What problem are you trying to solve that others also are having
> > > > > that requires all of this?
> > > > >
> > > > > Adding a new bus type and subsystem is fine, but usually we want
> > > > > more than just one user of it, as this does not really show how
> > > > > it is exercised very well.
> > > > This subdev and devlink infrastructure solves this problem of
> > > > creating
> > > smaller sub devices out of one PCI device.
> > > > Someone has to start.. :-)
> > >
> > > That is what a mfd should allow you to do.
> > >
> > I did cursory look at mfd.
> > It lacks removing specific devices, but that is small. It can be
> > enhanced to remove specific mfd device.
> 
> That should be easy enough, work with the MFD developers.  I think
> something like that should work today as you can use USB devices with MFD,
> right?
> 
> > >
> > > No, do not abuse a platform device.
> > Yes. that is my point mfd devices are platform devices.
> > mfd creates platform devices. and to match to it, platfrom_register_driver()
> have to be called to bind to it.
> > I do not know currently if we have the flexibility to say that instead of
> binding X driver, bind Y driver for platform devices.
> 
> try it :)
> 
> > > You should be able to just use a normal PCI device for this just
> > > fine, and if not, we should be able to make the needed changes to
> > > mfd for that.
> > >
> > Ok. so parent pci device and mfd devices.
> > mfd seems to fit this use case.
> > Do you think 'Platform devices' section is stale in [1] for autonomy, host
> bridge, soc platform etc points?
> 
> Nope, they are still horrible things and I hate them :)
> 
> Maybe we should just make MFD create "virtual" devices (bare ones, no
> need for platform stuff), and that would solve the issue of the platform
> device bloat being drug around everywhere.
> 
If you mean virtual MFD devices in /sys/devices/virtual/, then it becomes difficult to manage their life cycle using devlink, because a devlink handle = bus+device.
devlink will fail to work, and inventing a new tool and making it work together with devlink wouldn't work either.

A virtual device has bus=NULL.
An mfd device currently has bus_type=platform.

We still need to link the subdevice to the parent pci device for power management to work, right?
And also to see the right device hierarchy.
Don't you think the subdev bus is actually able to link all the pieces together:
devlink, sysfs, the core kernel, vendor drivers?

> > Should we update the documentation to indicate that it can be used for
> > non-autonomous, user created devices and it can be used for creating
> > devices on top of PCI parent device etc?
> 
> Nope, leave it alone please.
> 
> thanks,
> 
> greg k-h

[1] http://man7.org/linux/man-pages/man8/devlink-dev.8.html
[2] https://lore.kernel.org/patchwork/patch/1046995/
[3] https://lore.kernel.org/patchwork/patch/1046996/
[4] https://lore.kernel.org/patchwork/patch/959280/

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index f218789..c8aeaf1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -16,7 +16,7 @@  mlx5_core-y :=	main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
 		transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
 		fs_counters.o rl.o lag.o dev.o events.o wq.o lib/gid.o \
 		lib/devcom.o diag/fs_tracepoint.o diag/fw_tracer.o
-mlx5_core-$(CONFIG_SUBDEV) += subdev.o
+mlx5_core-$(CONFIG_SUBDEV) += subdev.o subdev_driver.o
 
 #
 # Netdev basic
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 5f8cf0d..7dfa8c4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1548,7 +1548,11 @@  static int __init init(void)
 	mlx5e_init();
 #endif
 
-	return 0;
+	err = subdev_register_driver(&mlx5_subdev_driver);
+	if (err)
+		pci_unregister_driver(&mlx5_core_driver);
+
+	return err;
 
 err_debug:
 	mlx5_unregister_debugfs();
@@ -1557,6 +1561,8 @@  static int __init init(void)
 
 static void __exit cleanup(void)
 {
+	subdev_unregister_driver(&mlx5_subdev_driver);
+
 #ifdef CONFIG_MLX5_CORE_EN
 	mlx5e_cleanup();
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 2a54148..1b733c7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -41,12 +41,15 @@ 
 #include <linux/ptp_clock_kernel.h>
 #include <linux/mlx5/cq.h>
 #include <linux/mlx5/fs.h>
+#include <linux/subdev_bus.h>
 
 #define DRIVER_NAME "mlx5_core"
 #define DRIVER_VERSION "5.0-0"
 
 extern uint mlx5_core_debug_mask;
 
+extern struct subdev_driver mlx5_subdev_driver;
+
 #define mlx5_core_dbg(__dev, format, ...)				\
 	dev_dbg(&(__dev)->pdev->dev, "%s:%d:(pid %d): " format,		\
 		 __func__, __LINE__, current->pid,			\
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
new file mode 100644
index 0000000..880aa4f
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/subdev_driver.c
@@ -0,0 +1,93 @@ 
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018-19 Mellanox Technologies
+
+#include <linux/module.h>
+#include <linux/dma-mapping.h>
+#include <linux/subdev_bus.h>
+#include <linux/subdev_ids.h>
+#include <linux/etherdevice.h>
+
+struct mlx5_subdev_ndev {
+	struct net_device ndev;
+};
+
+static void mlx5_dma_test(struct device *dev)
+{
+	dma_addr_t pa;
+	void *va;
+
+	va = dma_alloc_coherent(dev, 4096, &pa, GFP_KERNEL);
+	if (va)
+		dma_free_coherent(dev, 4096, va, pa);
+}
+
+static struct net_device *ndev;
+
+static int mlx5e_subdev_open(struct net_device *netdev)
+{
+	return 0;
+}
+
+static int mlx5e_subdev_close(struct net_device *netdev)
+{
+	return 0;
+}
+
+static netdev_tx_t
+mlx5e_subdev_xmit(struct sk_buff *skb, struct net_device *netdev)
+{
+	return NETDEV_TX_BUSY;
+}
+
+const struct net_device_ops mlx5e_subdev_netdev_ops = {
+	.ndo_open                = mlx5e_subdev_open,
+	.ndo_stop                = mlx5e_subdev_close,
+	.ndo_start_xmit          = mlx5e_subdev_xmit,
+};
+
+static int mlx5_subdev_probe(struct device *dev)
+{
+	int err;
+
+	mlx5_dma_test(dev);
+	/* Only one device supported in rfc */
+	if (ndev)
+		return 0;
+
+	ndev = alloc_etherdev_mqs(sizeof(struct mlx5_subdev_ndev), 1, 1);
+	if (!ndev)
+		return -ENOMEM;
+
+	SET_NETDEV_DEV(ndev, dev);
+	ndev->netdev_ops = &mlx5e_subdev_netdev_ops;
+	err = register_netdev(ndev);
+	if (err) {
+		free_netdev(ndev);
+		ndev = NULL;
+	}
+	return err;
+}
+
+static int mlx5_subdev_remove(struct device *dev)
+{
+	if (ndev) {
+		unregister_netdev(ndev);
+		free_netdev(ndev);
+		ndev = NULL;
+	}
+	return 0;
+}
+
+static const struct subdev_id mlx5_subdev_id_table[] = {
+	{ .vendor_id = SUBDEV_VENDOR_ID_MELLANOX,
+	  .device_id = SUBDEV_DEVICE_ID_MELLANOX_SF },
+	{ 0, }
+};
+MODULE_DEVICE_TABLE(subdev, mlx5_subdev_id_table);
+
+struct subdev_driver mlx5_subdev_driver = {
+	.id_table = mlx5_subdev_id_table,
+	.driver.name = "mlx5_subdev_driver",
+	.driver.probe = mlx5_subdev_probe,
+	.driver.remove = mlx5_subdev_remove,
+};