[mlx5-next,v5,1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs

Message ID 20210126085730.1165673-2-leon@kernel.org
State New
Series Dynamically assign MSI-X vectors count

Commit Message

Leon Romanovsky Jan. 26, 2021, 8:57 a.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

Extend PCI sysfs interface with a new callback that allows configure
the number of MSI-X vectors for specific SR-IO VF. This is needed
to optimize the performance of newly bound devices by allocating
the number of vectors based on the administrator knowledge of targeted VM.

This function is applicable for SR-IOV VF because such devices allocate
their MSI-X table before they will run on the VMs and HW can't guess the
right number of vectors, so the HW allocates them statically and equally.

1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
file will be seen for the VFs and it is writable as long as a driver is not
bounded to the VF.

The values accepted are:
 * > 0 - this will be number reported by the VF's MSI-X capability
 * < 0 - not valid
 * = 0 - will reset to the device default value

2) In order to make management easy, provide new read-only sysfs file that
returns a total number of possible to configure MSI-X vectors.

cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
  = 0 - feature is not supported
  > 0 - total number of MSI-X vectors to consume by the VFs

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 Documentation/ABI/testing/sysfs-bus-pci |  32 +++++
 drivers/pci/iov.c                       | 180 ++++++++++++++++++++++++
 drivers/pci/msi.c                       |  47 +++++++
 drivers/pci/pci.h                       |   4 +
 include/linux/pci.h                     |  10 ++
 5 files changed, 273 insertions(+)

--
2.29.2

Comments

Leon Romanovsky Feb. 2, 2021, 7:02 a.m. UTC | #1
On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Extend PCI sysfs interface with a new callback that allows configure
> the number of MSI-X vectors for specific SR-IO VF. This is needed
> to optimize the performance of newly bound devices by allocating
> the number of vectors based on the administrator knowledge of targeted VM.
>
> This function is applicable for SR-IOV VF because such devices allocate
> their MSI-X table before they will run on the VMs and HW can't guess the
> right number of vectors, so the HW allocates them statically and equally.
>
> 1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> file will be seen for the VFs and it is writable as long as a driver is not
> bounded to the VF.
>
> The values accepted are:
>  * > 0 - this will be number reported by the VF's MSI-X capability
>  * < 0 - not valid
>  * = 0 - will reset to the device default value
>
> 2) In order to make management easy, provide new read-only sysfs file that
> returns a total number of possible to configure MSI-X vectors.
>
> cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
>   = 0 - feature is not supported
>   > 0 - total number of MSI-X vectors to consume by the VFs
>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci |  32 +++++
>  drivers/pci/iov.c                       | 180 ++++++++++++++++++++++++
>  drivers/pci/msi.c                       |  47 +++++++
>  drivers/pci/pci.h                       |   4 +
>  include/linux/pci.h                     |  10 ++
>  5 files changed, 273 insertions(+)

Bjorn,

Can I please get your Acked-by on this so it will go through the
mlx5-next -> netdev submission flow?

Thanks
Bjorn Helgaas Feb. 2, 2021, 6:06 p.m. UTC | #2
On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Extend PCI sysfs interface with a new callback that allows configure
> the number of MSI-X vectors for specific SR-IO VF. This is needed
> to optimize the performance of newly bound devices by allocating
> the number of vectors based on the administrator knowledge of targeted VM.

s/configure/configuration of/
s/SR-IO/SR-IOV/
s/newly bound/VFs/ ?
s/VF/VFs/
s/knowledge of targeted VM/knowledge of the intended use of the VF/
(I'm not a VF expert, but I think they can be used even without VMs)

I'm reading between the lines here, but IIUC the point is that you
have a PF that supports a finite number of MSI-X vectors for use by
all the VFs, and this interface is to control the distribution of
those MSI-X vectors among the VFs.

> This function is applicable for SR-IOV VF because such devices allocate
> their MSI-X table before they will run on the VMs and HW can't guess the
> right number of vectors, so the HW allocates them statically and equally.

This is written in a way that suggests this is behavior required by
the PCIe spec.  If it is indeed related to something in the spec,
please cite it.

But I think this is actually something device-specific, not something
we can derive directly from the spec.  If that's the case, be clear
that we're addressing a device-specific need, and we're hoping that
this will be useful for other devices as well.

"such devices allocate their MSI-X table before they will run on the
VMs": Let's be specific here.  This MSI-X Table allocation apparently
doesn't happen when we set VF Enable in the PF, because these sysfs
files are attached to the VFs, which don't exist yet.  It's not the VF
driver binding, because that's a software construct.  What is the
hardware event that triggers the allocation?

Obviously the distribution among VFs can be changed after VF Enable is
set.  Maybe the distribution is dynamic, and the important point is
that it must be changed before the VF driver reads the Message Control
register for Table Size?

But that isn't the same as "devices allocating their MSI-X table
before being passed through to a VM," so it's confusing.  The
language about allocating the MSI-X table needs to be made precise
here and in the code comments below.

"before they will run on the VMs": Devices don't "run on VMs".  I
think the usual terminology is that a device may be "passed through to
a VM".

"HW allocates them statically and equally" sounds like a description
of some device-specific behavior (unless there's something in the spec
that requires this, in which case you should cite it).  It's OK if
this is device-specific; just don't pretend that it's generic if it's
not.

> 1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> file will be seen for the VFs and it is writable as long as a driver is not
> bounded to the VF.

"bound to the VF"

> The values accepted are:
>  * > 0 - this will be number reported by the VF's MSI-X capability

Specifically, I guess by Table Size in the VF's MSI-X Message Control
register?
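
(For reference, that is also where the current sriov_vf_msix_count_show()
value comes from: pci_msix_vec_count() just decodes Table Size from the
Message Control register, roughly

  int pci_msix_vec_count(struct pci_dev *dev)
  {
          u16 control;

          if (!dev->msix_cap)
                  return -EINVAL;

          pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control);
          return msix_table_size(control); /* (control & PCI_MSIX_FLAGS_QSIZE) + 1 */
  }

so presumably the N written here ends up as Table Size (N-1 encoded) in the
VF's Message Control register.)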

>  * < 0 - not valid
>  * = 0 - will reset to the device default value
> 
> 2) In order to make management easy, provide new read-only sysfs file that
> returns a total number of possible to configure MSI-X vectors.
> 
> cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
>   = 0 - feature is not supported
>   > 0 - total number of MSI-X vectors to consume by the VFs

"total number of MSI-X vectors available for distribution among the
VFs"?

> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci |  32 +++++
>  drivers/pci/iov.c                       | 180 ++++++++++++++++++++++++
>  drivers/pci/msi.c                       |  47 +++++++
>  drivers/pci/pci.h                       |   4 +
>  include/linux/pci.h                     |  10 ++
>  5 files changed, 273 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 25c9c39770c6..4d206ade5331 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -375,3 +375,35 @@ Description:
>  		The value comes from the PCI kernel device state and can be one
>  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
>  		The file is read only.
> +
> +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> +Date:		January 2021
> +Contact:	Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +		This file is associated with the SR-IOV VFs.
> +		It allows configuration of the number of MSI-X vectors for
> +		the VF. This is needed to optimize performance of newly bound
> +		devices by allocating the number of vectors based on the
> +		administrator knowledge of targeted VM.
> +
> +		The values accepted are:
> +		 * > 0 - this will be number reported by the VF's MSI-X
> +			 capability
> +		 * < 0 - not valid
> +		 * = 0 - will reset to the device default value
> +
> +		The file is writable if the PF is bound to a driver that
> +		set sriov_vf_total_msix > 0 and there is no driver bound
> +		to the VF.
> +
> +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> +Date:		January 2021
> +Contact:	Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +		This file is associated with the SR-IOV PFs.
> +		It returns a total number of possible to configure MSI-X
> +		vectors on the enabled VFs.
> +
> +		The values returned are:
> +		 * > 0 - this will be total number possible to consume by VFs,
> +		 * = 0 - feature is not supported

What does "vfs_overlay" mean?  "vfs" sounds like the Virtual File
System.

Do these filenames really need to contain both "sriov" and "vf"?

Should these be next to the existing SR-IOV sysfs files, i.e., in or
alongside sriov_dev_attr_group?

Hmmm, I see pci_enable_vfs_overlay() is called by the driver.  I don't
really like that because then we're dependent on drivers to maintain
the PCI sysfs hierarchy.  E.g., a driver might forget to call
pci_disable_vfs_overlay(), and then a future driver load will fail.

Maybe this could be done with .is_visible() functions that call driver
callbacks.

> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4afd4ee4f7f0..3e95f835eba5 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>  	return (dev->devfn + dev->sriov->offset +
>  		dev->sriov->stride * vf_id) & 0xff;
>  }
> +EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
> 
>  /*
>   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> @@ -157,6 +158,166 @@ int pci_iov_sysfs_link(struct pci_dev *dev,
>  	return rc;
>  }
> 
> +#ifdef CONFIG_PCI_MSI
> +static ssize_t sriov_vf_msix_count_show(struct device *dev,
> +					struct device_attribute *attr,
> +					char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	int count = pci_msix_vec_count(pdev);
> +
> +	if (count < 0)
> +		return count;
> +
> +	return sysfs_emit(buf, "%d\n", count);
> +}
> +
> +static ssize_t sriov_vf_msix_count_store(struct device *dev,
> +					 struct device_attribute *attr,
> +					 const char *buf, size_t count)
> +{
> +	struct pci_dev *vf_dev = to_pci_dev(dev);
> +	int val, ret;
> +
> +	ret = kstrtoint(buf, 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	ret = pci_vf_set_msix_vec_count(vf_dev, val);
> +	if (ret)
> +		return ret;
> +
> +	return count;
> +}
> +static DEVICE_ATTR_RW(sriov_vf_msix_count);
> +
> +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> +					struct device_attribute *attr,
> +					char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	return sysfs_emit(buf, "%u\n", pdev->sriov->vf_total_msix);
> +}
> +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> +#endif
> +
> +static struct attribute *sriov_pf_dev_attrs[] = {
> +#ifdef CONFIG_PCI_MSI
> +	&dev_attr_sriov_vf_total_msix.attr,
> +#endif
> +	NULL,
> +};
> +
> +static struct attribute *sriov_vf_dev_attrs[] = {
> +#ifdef CONFIG_PCI_MSI
> +	&dev_attr_sriov_vf_msix_count.attr,
> +#endif
> +	NULL,
> +};
> +
> +static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj,
> +					  struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (!pdev->msix_cap || !dev_is_pf(dev))
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> +					  struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (!pdev->msix_cap || dev_is_pf(dev))
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +static const struct attribute_group sriov_pf_dev_attr_group = {
> +	.attrs = sriov_pf_dev_attrs,
> +	.is_visible = sriov_pf_attrs_are_visible,
> +	.name = "vfs_overlay",
> +};
> +
> +static const struct attribute_group sriov_vf_dev_attr_group = {
> +	.attrs = sriov_vf_dev_attrs,
> +	.is_visible = sriov_vf_attrs_are_visible,
> +	.name = "vfs_overlay",
> +};
> +
> +int pci_enable_vfs_overlay(struct pci_dev *dev)
> +{
> +	struct pci_dev *virtfn;
> +	int id, ret;
> +
> +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> +		return 0;
> +
> +	ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> +	if (ret)
> +		return ret;
> +
> +	for (id = 0; id < dev->sriov->num_VFs; id++) {
> +		virtfn = pci_get_domain_bus_and_slot(
> +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> +			pci_iov_virtfn_devfn(dev, id));
> +
> +		if (!virtfn)
> +			continue;
> +
> +		ret = sysfs_create_group(&virtfn->dev.kobj,
> +					 &sriov_vf_dev_attr_group);
> +		if (ret)
> +			goto out;
> +	}
> +	return 0;
> +
> +out:
> +	while (id--) {
> +		virtfn = pci_get_domain_bus_and_slot(
> +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> +			pci_iov_virtfn_devfn(dev, id));
> +
> +		if (!virtfn)
> +			continue;
> +
> +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> +	}
> +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay);
> +
> +void pci_disable_vfs_overlay(struct pci_dev *dev)
> +{
> +	struct pci_dev *virtfn;
> +	int id;
> +
> +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> +		return;
> +
> +	id = dev->sriov->num_VFs;
> +	while (id--) {
> +		virtfn = pci_get_domain_bus_and_slot(
> +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> +			pci_iov_virtfn_devfn(dev, id));
> +
> +		if (!virtfn)
> +			continue;
> +
> +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> +	}
> +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> +}
> +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);

I'm not convinced all this sysfs wrangling is necessary.  If it is,
add a hint in a comment about why this is special and can't use
something like sriov_dev_attr_group.

>  int pci_iov_add_virtfn(struct pci_dev *dev, int id)
>  {
>  	int i;
> @@ -596,6 +757,7 @@ static void sriov_disable(struct pci_dev *dev)
>  		sysfs_remove_link(&dev->dev.kobj, "dep_link");
> 
>  	iov->num_VFs = 0;
> +	iov->vf_total_msix = 0;
>  	pci_iov_set_numvfs(dev, 0);
>  }
> 
> @@ -1054,6 +1216,24 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> 
> +/**
> + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> + * @dev: the PCI PF device
> + * @count: the total number of MSI-X vector to consume by the VFs
> + *
> + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> + * This interface is complimentary part of the pci_vf_set_msix_vec_count()

s/Sets the/Set the/
s/complimentary part of the/complementary to/

> + * that will be used to configure the required number on the VF.
> + */
> +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> +{
> +	if (!dev->is_physfn)
> +		return;
> +
> +	dev->sriov->vf_total_msix = count;

The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
mlx5, calls this, and all the PCI core does is hang onto the value and
expose it via sysfs.  I think I'd rather have a callback in struct
pci_driver and let the driver supply the value when needed.  I.e.,
sriov_vf_total_msix_show() would call the callback instead of looking
at pdev->sriov->vf_total_msix.
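
Something along these lines (untested sketch, callback name made up):

  /* in struct pci_driver, next to ->sriov_set_msix_vec_count(): */
  u32 (*sriov_get_vf_total_msix)(struct pci_dev *pf);

  static ssize_t sriov_vf_total_msix_show(struct device *dev,
                                          struct device_attribute *attr,
                                          char *buf)
  {
          struct pci_dev *pdev = to_pci_dev(dev);
          u32 count = 0;

          device_lock(dev);
          if (pdev->driver && pdev->driver->sriov_get_vf_total_msix)
                  count = pdev->driver->sriov_get_vf_total_msix(pdev);
          device_unlock(dev);

          return sysfs_emit(buf, "%u\n", count);
  }

Then the PCI core wouldn't have to carry any driver state and
pci_sriov_set_vf_total_msix() could go away.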

> +}
> +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> +
>  /**
>   * pci_sriov_configure_simple - helper to configure SR-IOV
>   * @dev: the PCI device
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 3162f88fe940..1022fe9e6efd 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -991,6 +991,53 @@ int pci_msix_vec_count(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL(pci_msix_vec_count);
> 
> +/**
> + * pci_vf_set_msix_vec_count - change the reported number of MSI-X vectors
> + * This function is applicable for SR-IOV VF because such devices allocate
> + * their MSI-X table before they will run on the VMs and HW can't guess the
> + * right number of vectors, so the HW allocates them statically and equally.
> + * @dev: VF device that is going to be changed
> + * @count: amount of MSI-X vectors

s/amount/number/

> + **/
> +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count)
> +{
> +	struct pci_dev *pdev = pci_physfn(dev);
> +	int ret;
> +
> +	if (count < 0)
> +		/*
> +		 * We don't support negative numbers for now,
> +		 * but maybe in the future it will make sense.
> +		 */

Drop the comment; I don't think it adds useful information.

> +		return -EINVAL;
> +
> +	device_lock(&pdev->dev);
> +	if (!pdev->driver) {
> +		ret = -EOPNOTSUPP;
> +		goto err_pdev;
> +	}
> +
> +	device_lock(&dev->dev);
> +	if (dev->driver) {
> +		/*
> +		 * Driver already probed this VF and configured itself
> +		 * based on previously configured (or default) MSI-X vector
> +		 * count. It is too late to change this field for this
> +		 * specific VF.
> +		 */
> +		ret = -EBUSY;
> +		goto err_dev;
> +	}
> +
> +	ret = pdev->driver->sriov_set_msix_vec_count(dev, count);

This looks like a NULL pointer dereference.

> +err_dev:
> +	device_unlock(&dev->dev);
> +err_pdev:
> +	device_unlock(&pdev->dev);
> +	return ret;
> +}
> +
>  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
>  			     int nvec, struct irq_affinity *affd, int flags)
>  {
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 5c59365092fa..2bd6560d91e2 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;
> 
>  #ifdef CONFIG_PCI_MSI
>  void pci_no_msi(void);
> +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count);
>  #else
>  static inline void pci_no_msi(void) { }
>  #endif
> @@ -326,6 +327,9 @@ struct pci_sriov {
>  	u16		subsystem_device; /* VF subsystem device */
>  	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
>  	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
> +	u32		vf_total_msix;  /* Total number of MSI-X vectors the VFs
> +					 * can consume
> +					 */

  * can consume */

Hopefully you can eliminate vf_total_msix altogether.

>  };
> 
>  /**
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index b32126d26997..24d118ad6e7b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -856,6 +856,8 @@ struct module;
>   *		e.g. drivers/net/e100.c.
>   * @sriov_configure: Optional driver callback to allow configuration of
>   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> + *              exposed by the sysfs "vf_msix_vec" entry.

"vf_msix_vec" is apparently stale?  There's no other reference in this
patch.

I think the important part is that this changes the number of vectors
advertised in the VF's MSI-X Message Control register, which will be
read when the VF driver enables MSI-X.

If that's true, why would we expose this via a sysfs file?  We can
easily read it via lspci.

>   * @err_handler: See Documentation/PCI/pci-error-recovery.rst
>   * @groups:	Sysfs attribute groups.
>   * @driver:	Driver model structure.
> @@ -871,6 +873,7 @@ struct pci_driver {
>  	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
>  	void (*shutdown)(struct pci_dev *dev);
>  	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> +	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
>  	const struct pci_error_handlers *err_handler;
>  	const struct attribute_group **groups;
>  	struct device_driver	driver;
> @@ -2059,6 +2062,9 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
>  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
>  int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> 
> +int pci_enable_vfs_overlay(struct pci_dev *dev);
> +void pci_disable_vfs_overlay(struct pci_dev *dev);
> +
>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>  void pci_disable_sriov(struct pci_dev *dev);
> 
> @@ -2072,6 +2078,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev);
>  int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
>  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
>  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
> +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count);
> 
>  /* Arch may override these (weak) */
>  int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
> @@ -2100,6 +2107,8 @@ static inline int pci_iov_add_virtfn(struct pci_dev *dev, int id)
>  }
>  static inline void pci_iov_remove_virtfn(struct pci_dev *dev,
>  					 int id) { }
> +static inline int pci_enable_vfs_overlay(struct pci_dev *dev) { return 0; }
> +static inline void pci_disable_vfs_overlay(struct pci_dev *dev) {}

s/{}/{ }/
Please make your code match the rest of the file, e.g., the very next line!

>  static inline void pci_disable_sriov(struct pci_dev *dev) { }
>  static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
>  static inline int pci_vfs_assigned(struct pci_dev *dev)
> @@ -2112,6 +2121,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
>  { return 0; }
>  static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
> +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}

Also here.  Unless removing the space would make this fit in 80
columns.

>  #endif
> 
>  #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
> --
> 2.29.2
>
Leon Romanovsky Feb. 2, 2021, 7:44 p.m. UTC | #3
On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Extend PCI sysfs interface with a new callback that allows configure
> > the number of MSI-X vectors for specific SR-IO VF. This is needed
> > to optimize the performance of newly bound devices by allocating
> > the number of vectors based on the administrator knowledge of targeted VM.
>
> s/configure/configuration of/
> s/SR-IO/SR-IOV/
> s/newly bound/VFs/ ?
> s/VF/VFs/
> s/knowledge of targeted VM/knowledge of the intended use of the VF/
> (I'm not a VF expert, but I think they can be used even without VMs)
>
> I'm reading between the lines here, but IIUC the point is that you
> have a PF that supports a finite number of MSI-X vectors for use by
> all the VFs, and this interface is to control the distribution of
> those MSI-X vectors among the VFs.

MSI-X vectors are a HW resource; all devices in the world have limitations here.

>
> > This function is applicable for SR-IOV VF because such devices allocate
> > their MSI-X table before they will run on the VMs and HW can't guess the
> > right number of vectors, so the HW allocates them statically and equally.
>
> This is written in a way that suggests this is behavior required by
> the PCIe spec.  If it is indeed related to something in the spec,
> please cite it.

The spec doesn't say it directly, but you will need to really hurt the brains
of your users if you decide to do it differently. You have one enable bit to
create all VFs at the same time, without any option to configure them in advance.

Of course, you can create some partition map, upload it to FW, and create them
from there.

>
> But I think this is actually something device-specific, not something
> we can derive directly from the spec.  If that's the case, be clear
> that we're addressing a device-specific need, and we're hoping that
> this will be useful for other devices as well.

I will add.

>
> "such devices allocate their MSI-X table before they will run on the
> VMs": Let's be specific here.  This MSI-X Table allocation apparently
> doesn't happen when we set VF Enable in the PF, because these sysfs
> files are attached to the VFs, which don't exist yet.  It's not the VF
> driver binding, because that's a software construct.  What is the
> hardware event that triggers the allocation?

Write of MSI-X vector count to the FW through PF.

>
> Obviously the distribution among VFs can be changed after VF Enable is
> set.  Maybe the distribution is dynamic, and the important point is
> that it must be changed before the VF driver reads the Message Control
> register for Table Size?

Yes

>
> But that isn't the same as "devices allocating their MSI-X table
> before being passed through to a VM," so it's confusing.  The
> language about allocating the MSI-X table needs to be made precise
> here and in the code comments below.
>
> "before they will run on the VMs": Devices don't "run on VMs".  I
> think the usual terminology is that a device may be "passed through to
> a VM".
>
> "HW allocates them statically and equally" sounds like a description
> of some device-specific behavior (unless there's something in the spec
> that requires this, in which case you should cite it).  It's OK if
> this is device-specific; just don't pretend that it's generic if it's
> not.

I will change.

>
> > 1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> > file will be seen for the VFs and it is writable as long as a driver is not
> > bounded to the VF.
>
> "bound to the VF"
>
> > The values accepted are:
> >  * > 0 - this will be number reported by the VF's MSI-X capability
>
> Specifically, I guess by Table Size in the VF's MSI-X Message Control
> register?

Right

>
> >  * < 0 - not valid
> >  * = 0 - will reset to the device default value
> >
> > 2) In order to make management easy, provide new read-only sysfs file that
> > returns a total number of possible to configure MSI-X vectors.
> >
> > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> >   = 0 - feature is not supported
> >   > 0 - total number of MSI-X vectors to consume by the VFs
>
> "total number of MSI-X vectors available for distribution among the
> VFs"?

Users need to be aware of how many vectors exist in the system.

>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-pci |  32 +++++
> >  drivers/pci/iov.c                       | 180 ++++++++++++++++++++++++
> >  drivers/pci/msi.c                       |  47 +++++++
> >  drivers/pci/pci.h                       |   4 +
> >  include/linux/pci.h                     |  10 ++
> >  5 files changed, 273 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > index 25c9c39770c6..4d206ade5331 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > @@ -375,3 +375,35 @@ Description:
> >  		The value comes from the PCI kernel device state and can be one
> >  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> >  		The file is read only.
> > +
> > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> > +Date:		January 2021
> > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > +Description:
> > +		This file is associated with the SR-IOV VFs.
> > +		It allows configuration of the number of MSI-X vectors for
> > +		the VF. This is needed to optimize performance of newly bound
> > +		devices by allocating the number of vectors based on the
> > +		administrator knowledge of targeted VM.
> > +
> > +		The values accepted are:
> > +		 * > 0 - this will be number reported by the VF's MSI-X
> > +			 capability
> > +		 * < 0 - not valid
> > +		 * = 0 - will reset to the device default value
> > +
> > +		The file is writable if the PF is bound to a driver that
> > +		set sriov_vf_total_msix > 0 and there is no driver bound
> > +		to the VF.
> > +
> > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > +Date:		January 2021
> > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > +Description:
> > +		This file is associated with the SR-IOV PFs.
> > +		It returns a total number of possible to configure MSI-X
> > +		vectors on the enabled VFs.
> > +
> > +		The values returned are:
> > +		 * > 0 - this will be total number possible to consume by VFs,
> > +		 * = 0 - feature is not supported
>
> What does "vfs_overlay" mean?  "vfs" sounds like the Virtual File
> System.

VFs == many VF?

>
> Do these filenames really need to contain both "sriov" and "vf"?

This is what I was asked at some point. In previous versions the name
was without "sriov".

>
> Should these be next to the existing SR-IOV sysfs files, i.e., in or
> alongside sriov_dev_attr_group?

This was a suggestion in previous versions. I didn't hear anyone
supporting it.

>
> Hmmm, I see pci_enable_vfs_overlay() is called by the driver.  I don't
> really like that because then we're dependent on drivers to maintain
> the PCI sysfs hierarchy.  E.g., a driver might forget to call
> pci_disable_vfs_overlay(), and then a future driver load will fail.

It is no different from any other kernel bug. I have a gazillion ways to
break the system with a crappy driver.

>
> Maybe this could be done with .is_visible() functions that call driver
> callbacks.

It was in previous versions too, but this solution allows the PF to
control the VFs overlay dynamically.

>
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 4afd4ee4f7f0..3e95f835eba5 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
> >  	return (dev->devfn + dev->sriov->offset +
> >  		dev->sriov->stride * vf_id) & 0xff;
> >  }
> > +EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
> >
> >  /*
> >   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> > @@ -157,6 +158,166 @@ int pci_iov_sysfs_link(struct pci_dev *dev,
> >  	return rc;
> >  }
> >
> > +#ifdef CONFIG_PCI_MSI
> > +static ssize_t sriov_vf_msix_count_show(struct device *dev,
> > +					struct device_attribute *attr,
> > +					char *buf)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	int count = pci_msix_vec_count(pdev);
> > +
> > +	if (count < 0)
> > +		return count;
> > +
> > +	return sysfs_emit(buf, "%d\n", count);
> > +}
> > +
> > +static ssize_t sriov_vf_msix_count_store(struct device *dev,
> > +					 struct device_attribute *attr,
> > +					 const char *buf, size_t count)
> > +{
> > +	struct pci_dev *vf_dev = to_pci_dev(dev);
> > +	int val, ret;
> > +
> > +	ret = kstrtoint(buf, 0, &val);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pci_vf_set_msix_vec_count(vf_dev, val);
> > +	if (ret)
> > +		return ret;
> > +
> > +	return count;
> > +}
> > +static DEVICE_ATTR_RW(sriov_vf_msix_count);
> > +
> > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > +					struct device_attribute *attr,
> > +					char *buf)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +	return sysfs_emit(buf, "%u\n", pdev->sriov->vf_total_msix);
> > +}
> > +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> > +#endif
> > +
> > +static struct attribute *sriov_pf_dev_attrs[] = {
> > +#ifdef CONFIG_PCI_MSI
> > +	&dev_attr_sriov_vf_total_msix.attr,
> > +#endif
> > +	NULL,
> > +};
> > +
> > +static struct attribute *sriov_vf_dev_attrs[] = {
> > +#ifdef CONFIG_PCI_MSI
> > +	&dev_attr_sriov_vf_msix_count.attr,
> > +#endif
> > +	NULL,
> > +};
> > +
> > +static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj,
> > +					  struct attribute *a, int n)
> > +{
> > +	struct device *dev = kobj_to_dev(kobj);
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +	if (!pdev->msix_cap || !dev_is_pf(dev))
> > +		return 0;
> > +
> > +	return a->mode;
> > +}
> > +
> > +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> > +					  struct attribute *a, int n)
> > +{
> > +	struct device *dev = kobj_to_dev(kobj);
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +	if (!pdev->msix_cap || dev_is_pf(dev))
> > +		return 0;
> > +
> > +	return a->mode;
> > +}
> > +
> > +static const struct attribute_group sriov_pf_dev_attr_group = {
> > +	.attrs = sriov_pf_dev_attrs,
> > +	.is_visible = sriov_pf_attrs_are_visible,
> > +	.name = "vfs_overlay",
> > +};
> > +
> > +static const struct attribute_group sriov_vf_dev_attr_group = {
> > +	.attrs = sriov_vf_dev_attrs,
> > +	.is_visible = sriov_vf_attrs_are_visible,
> > +	.name = "vfs_overlay",
> > +};
> > +
> > +int pci_enable_vfs_overlay(struct pci_dev *dev)
> > +{
> > +	struct pci_dev *virtfn;
> > +	int id, ret;
> > +
> > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > +		return 0;
> > +
> > +	ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > +	if (ret)
> > +		return ret;
> > +
> > +	for (id = 0; id < dev->sriov->num_VFs; id++) {
> > +		virtfn = pci_get_domain_bus_and_slot(
> > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > +			pci_iov_virtfn_devfn(dev, id));
> > +
> > +		if (!virtfn)
> > +			continue;
> > +
> > +		ret = sysfs_create_group(&virtfn->dev.kobj,
> > +					 &sriov_vf_dev_attr_group);
> > +		if (ret)
> > +			goto out;
> > +	}
> > +	return 0;
> > +
> > +out:
> > +	while (id--) {
> > +		virtfn = pci_get_domain_bus_and_slot(
> > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > +			pci_iov_virtfn_devfn(dev, id));
> > +
> > +		if (!virtfn)
> > +			continue;
> > +
> > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > +	}
> > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay);
> > +
> > +void pci_disable_vfs_overlay(struct pci_dev *dev)
> > +{
> > +	struct pci_dev *virtfn;
> > +	int id;
> > +
> > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > +		return;
> > +
> > +	id = dev->sriov->num_VFs;
> > +	while (id--) {
> > +		virtfn = pci_get_domain_bus_and_slot(
> > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > +			pci_iov_virtfn_devfn(dev, id));
> > +
> > +		if (!virtfn)
> > +			continue;
> > +
> > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > +	}
> > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > +}
> > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
>
> I'm not convinced all this sysfs wrangling is necessary.  If it is,
> add a hint in a comment about why this is special and can't use
> something like sriov_dev_attr_group.

This makes the overlay PF-driven. Alexander insisted on this flow.

>
> >  int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> >  {
> >  	int i;
> > @@ -596,6 +757,7 @@ static void sriov_disable(struct pci_dev *dev)
> >  		sysfs_remove_link(&dev->dev.kobj, "dep_link");
> >
> >  	iov->num_VFs = 0;
> > +	iov->vf_total_msix = 0;
> >  	pci_iov_set_numvfs(dev, 0);
> >  }
> >
> > @@ -1054,6 +1216,24 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> >
> > +/**
> > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> > + * @dev: the PCI PF device
> > + * @count: the total number of MSI-X vector to consume by the VFs
> > + *
> > + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> > + * This interface is complimentary part of the pci_vf_set_msix_vec_count()
>
> s/Sets the/Set the/
> s/complimentary part of the/complementary to/
>
> > + * that will be used to configure the required number on the VF.
> > + */
> > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> > +{
> > +	if (!dev->is_physfn)
> > +		return;
> > +
> > +	dev->sriov->vf_total_msix = count;
>
> The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
> mlx5, calls this, and all the PCI core does is hang onto the value and
> expose it via sysfs.  I think I'd rather have a callback in struct
> pci_driver and let the driver supply the value when needed.  I.e.,
> sriov_vf_total_msix_show() would call the callback instead of looking
> at pdev->sriov->vf_total_msix.

It will cause unnecessary locking to ensure that the driver doesn't
vanish during a sysfs read. I can change it, but I don't think that it
is the right decision.

>
> > +}
> > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> > +
> >  /**
> >   * pci_sriov_configure_simple - helper to configure SR-IOV
> >   * @dev: the PCI device
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 3162f88fe940..1022fe9e6efd 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -991,6 +991,53 @@ int pci_msix_vec_count(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL(pci_msix_vec_count);
> >
> > +/**
> > + * pci_vf_set_msix_vec_count - change the reported number of MSI-X vectors
> > + * This function is applicable for SR-IOV VF because such devices allocate
> > + * their MSI-X table before they will run on the VMs and HW can't guess the
> > + * right number of vectors, so the HW allocates them statically and equally.
> > + * @dev: VF device that is going to be changed
> > + * @count: amount of MSI-X vectors
>
> s/amount/number/
>
> > + **/
> > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count)
> > +{
> > +	struct pci_dev *pdev = pci_physfn(dev);
> > +	int ret;
> > +
> > +	if (count < 0)
> > +		/*
> > +		 * We don't support negative numbers for now,
> > +		 * but maybe in the future it will make sense.
> > +		 */
>
> Drop the comment; I don't think it adds useful information.
>
> > +		return -EINVAL;
> > +
> > +	device_lock(&pdev->dev);
> > +	if (!pdev->driver) {
> > +		ret = -EOPNOTSUPP;
> > +		goto err_pdev;
> > +	}
> > +
> > +	device_lock(&dev->dev);
> > +	if (dev->driver) {
> > +		/*
> > +		 * Driver already probed this VF and configured itself
> > +		 * based on previously configured (or default) MSI-X vector
> > +		 * count. It is too late to change this field for this
> > +		 * specific VF.
> > +		 */
> > +		ret = -EBUSY;
> > +		goto err_dev;
> > +	}
> > +
> > +	ret = pdev->driver->sriov_set_msix_vec_count(dev, count);
>
> This looks like a NULL pointer dereference.

In the current code it is impossible: pci_vf_set_msix_vec_count() is
called only for devices that support sriov_vf_msix_count, which is
possible only with ->sriov_set_msix_vec_count().

But I will add the check to be bulletproof for future extensions.

>
> > +err_dev:
> > +	device_unlock(&dev->dev);
> > +err_pdev:
> > +	device_unlock(&pdev->dev);
> > +	return ret;
> > +}
> > +
> >  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
> >  			     int nvec, struct irq_affinity *affd, int flags)
> >  {
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 5c59365092fa..2bd6560d91e2 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;
> >
> >  #ifdef CONFIG_PCI_MSI
> >  void pci_no_msi(void);
> > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count);
> >  #else
> >  static inline void pci_no_msi(void) { }
> >  #endif
> > @@ -326,6 +327,9 @@ struct pci_sriov {
> >  	u16		subsystem_device; /* VF subsystem device */
> >  	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
> >  	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
> > +	u32		vf_total_msix;  /* Total number of MSI-X vectors the VFs
> > +					 * can consume
> > +					 */
>
>   * can consume */
>
> Hopefully you can eliminate vf_total_msix altogether.

I think that will be worse from the flow point of view (extra locks), and from
the memory point of view if that is what worries you. This variable consumes
4 bytes; the pointer to the function will take 8 bytes.

>
> >  };
> >
> >  /**
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index b32126d26997..24d118ad6e7b 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -856,6 +856,8 @@ struct module;
> >   *		e.g. drivers/net/e100.c.
> >   * @sriov_configure: Optional driver callback to allow configuration of
> >   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> > + *              exposed by the sysfs "vf_msix_vec" entry.
>
> "vf_msix_vec" is apparently stale?  There's no other reference in this
> patch.
>
> I think the important part is that this changes the number of vectors
> advertised in the VF's MSI-X Message Control register, which will be
> read when the VF driver enables MSI-X.
>
> If that's true, why would we expose this via a sysfs file?  We can
> easily read it via lspci.

I did it for feature completeness; we don't need to read this sysfs file.

>
> >   * @err_handler: See Documentation/PCI/pci-error-recovery.rst
> >   * @groups:	Sysfs attribute groups.
> >   * @driver:	Driver model structure.
> > @@ -871,6 +873,7 @@ struct pci_driver {
> >  	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
> >  	void (*shutdown)(struct pci_dev *dev);
> >  	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> > +	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
> >  	const struct pci_error_handlers *err_handler;
> >  	const struct attribute_group **groups;
> >  	struct device_driver	driver;
> > @@ -2059,6 +2062,9 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
> >  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
> >  int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> >
> > +int pci_enable_vfs_overlay(struct pci_dev *dev);
> > +void pci_disable_vfs_overlay(struct pci_dev *dev);
> > +
> >  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
> >  void pci_disable_sriov(struct pci_dev *dev);
> >
> > @@ -2072,6 +2078,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev);
> >  int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
> >  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
> >  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
> > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count);
> >
> >  /* Arch may override these (weak) */
> >  int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
> > @@ -2100,6 +2107,8 @@ static inline int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> >  }
> >  static inline void pci_iov_remove_virtfn(struct pci_dev *dev,
> >  					 int id) { }
> > +static inline int pci_enable_vfs_overlay(struct pci_dev *dev) { return 0; }
> > +static inline void pci_disable_vfs_overlay(struct pci_dev *dev) {}
>
> s/{}/{ }/
> Please make your code match the rest of the file, e.g., the very next line!
>
> >  static inline void pci_disable_sriov(struct pci_dev *dev) { }
> >  static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
> >  static inline int pci_vfs_assigned(struct pci_dev *dev)
> > @@ -2112,6 +2121,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
> >  static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
> >  { return 0; }
> >  static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
> > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}
>
> Also here.  Unless removing the space would make this fit in 80
> columns.

Yes, it is 80 columns without space.

Thanks

>
> >  #endif
> >
> >  #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
> > --
> > 2.29.2
> >
Bjorn Helgaas Feb. 4, 2021, 12:10 a.m. UTC | #4
On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > Extend PCI sysfs interface with a new callback that allows
> > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > This is needed to optimize the performance of newly bound
> > > devices by allocating the number of vectors based on the
> > > administrator knowledge of targeted VM.
> >
> > I'm reading between the lines here, but IIUC the point is that you
> > have a PF that supports a finite number of MSI-X vectors for use
> > by all the VFs, and this interface is to control the distribution
> > of those MSI-X vectors among the VFs.
> 
> MSI-X vectors are a HW resource; all devices in the world have
> limitations here.
>
> > > This function is applicable for SR-IOV VF because such devices
> > > allocate their MSI-X table before they will run on the VMs and
> > > HW can't guess the right number of vectors, so the HW allocates
> > > them statically and equally.
> >
> > This is written in a way that suggests this is behavior required
> > by the PCIe spec.  If it is indeed related to something in the
> > spec, please cite it.
> 
> The spec doesn't say it directly, but you will need to really hurt the
> brains of your users if you decide to do it differently. You have one
> enable bit to create all VFs at the same time, without any option to
> configure them in advance.
> 
> Of course, you can create some partition map, upload it to FW, and
> create them from there.

Of course all devices have limitations.  But let's add some details
about the way *this* device works.  That will help avoid giving the
impression that this is the *only* way spec-conforming devices can
work.

> > "such devices allocate their MSI-X table before they will run on
> > the VMs": Let's be specific here.  This MSI-X Table allocation
> > apparently doesn't happen when we set VF Enable in the PF, because
> > these sysfs files are attached to the VFs, which don't exist yet.
> > It's not the VF driver binding, because that's a software
> > construct.  What is the hardware event that triggers the
> > allocation?
> 
> Write of MSI-X vector count to the FW through PF.

This is an example of something that is obviously specific to this
mlx5 device.  The Table Size field in Message Control is RO per spec,
and obviously firmware on the device is completely outside the scope
of the PCIe spec.

This commit log should describe how *this* device manages this
allocation and how the PF Table Size and the VF Table Sizes are
related.  Per PCIe, there is no necessary connection between them.

> > > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > >   = 0 - feature is not supported
> > >   > 0 - total number of MSI-X vectors to consume by the VFs
> >
> > "total number of MSI-X vectors available for distribution among the
> > VFs"?
> 
> Users need to be aware of how many vectors exist in the system.

Understood -- if there's an interface to influence the distribution of
vectors among VFs, one needs to know how many vectors there are to
work with.

My point was that "number of vectors to consume by VFs" is awkward
wording, so I suggested an alternative.

> > > +What:            /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> > > +Date:            January 2021
> > > +Contact: Leon Romanovsky <leonro@nvidia.com>
> > > +Description:
> > > +         This file is associated with the SR-IOV VFs.
> > > +         It allows configuration of the number of MSI-X vectors for
> > > +         the VF. This is needed to optimize performance of newly bound
> > > +         devices by allocating the number of vectors based on the
> > > +         administrator knowledge of targeted VM.
> > > +
> > > +         The values accepted are:
> > > +          * > 0 - this will be number reported by the VF's MSI-X
> > > +                  capability
> > > +          * < 0 - not valid
> > > +          * = 0 - will reset to the device default value
> > > +
> > > +         The file is writable if the PF is bound to a driver that
> > > +         set sriov_vf_total_msix > 0 and there is no driver bound
> > > +         to the VF.

Drivers don't actually set "sriov_vf_total_msix".  This should
probably say something like "the PF is bound to a driver that
implements ->sriov_set_msix_vec_count()."

> > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > +Date:		January 2021
> > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > +Description:
> > > +		This file is associated with the SR-IOV PFs.
> > > +		It returns a total number of possible to configure MSI-X
> > > +		vectors on the enabled VFs.
> > > +
> > > +		The values returned are:
> > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > +		 * = 0 - feature is not supported

Can you swap the order of these two files in the documentation?
sriov_vf_total_msix is associated with the PF and logically precedes
sriov_vf_msix_count, which is associated with VFs.

Not sure the "= 0" description is necessary here.  If the value
returned is the number of MSI-X vectors available for assignment to
VFs, "0" is a perfectly legitimate value.  It just means there are
none.  It doesn't need to be described separately.

> > Do these filenames really need to contain both "sriov" and "vf"?
> 
> This is what I was asked at some point. In previous versions the name
> was without "sriov".

We currently have:

  $ grep DEVICE_ATTR drivers/pci/iov.c
  static DEVICE_ATTR_RO(sriov_totalvfs);
  static DEVICE_ATTR_RW(sriov_numvfs);
  static DEVICE_ATTR_RO(sriov_offset);
  static DEVICE_ATTR_RO(sriov_stride);
  static DEVICE_ATTR_RO(sriov_vf_device);
  static DEVICE_ATTR_RW(sriov_drivers_autoprobe);

If we put them in a new "vfs_overlay" directory, it seems like
overkill to repeat the "vf" part, but I'm hoping the new files can end
up next to these existing files.  In that case, I think it makes sense
to include "sriov".  And it probably does make sense to include "vf"
as well.

> > Should these be next to the existing SR-IOV sysfs files, i.e., in or
> > alongside sriov_dev_attr_group?
> 
> This was a suggestion in previous versions. I didn't hear anyone
> supporting it.

Pointer to this discussion?  I'm not sure what value is added by a new
directory.

> > Hmmm, I see pci_enable_vfs_overlay() is called by the driver.  I don't
> > really like that because then we're dependent on drivers to maintain
> > the PCI sysfs hierarchy.  E.g., a driver might forget to call
> > pci_disable_vfs_overlay(), and then a future driver load will fail.
> > 
> > Maybe this could be done with .is_visible() functions that call driver
> > callbacks.
> 
> It was in previous versions too, but this solution allows the PF to
> control the VFs overlay dynamically.

See below; I think it might be possible to do this dynamically even
without pci_enable_vfs_overlay().

> > > +int pci_enable_vfs_overlay(struct pci_dev *dev)
> > > +{
> > > +	struct pci_dev *virtfn;
> > > +	int id, ret;
> > > +
> > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > +		return 0;
> > > +
> > > +	ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	for (id = 0; id < dev->sriov->num_VFs; id++) {
> > > +		virtfn = pci_get_domain_bus_and_slot(
> > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > +			pci_iov_virtfn_devfn(dev, id));
> > > +
> > > +		if (!virtfn)
> > > +			continue;
> > > +
> > > +		ret = sysfs_create_group(&virtfn->dev.kobj,
> > > +					 &sriov_vf_dev_attr_group);
> > > +		if (ret)
> > > +			goto out;
> > > +	}
> > > +	return 0;
> > > +
> > > +out:
> > > +	while (id--) {
> > > +		virtfn = pci_get_domain_bus_and_slot(
> > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > +			pci_iov_virtfn_devfn(dev, id));
> > > +
> > > +		if (!virtfn)
> > > +			continue;
> > > +
> > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > +	}
> > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay);
> > > +
> > > +void pci_disable_vfs_overlay(struct pci_dev *dev)
> > > +{
> > > +	struct pci_dev *virtfn;
> > > +	int id;
> > > +
> > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > +		return;
> > > +
> > > +	id = dev->sriov->num_VFs;
> > > +	while (id--) {
> > > +		virtfn = pci_get_domain_bus_and_slot(
> > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > +			pci_iov_virtfn_devfn(dev, id));
> > > +
> > > +		if (!virtfn)
> > > +			continue;
> > > +
> > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > +	}
> > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
> >
> > I'm not convinced all this sysfs wrangling is necessary.  If it is,
> > add a hint in a comment about why this is special and can't use
> > something like sriov_dev_attr_group.
> 
> This makes the overlay PF-driven. Alexander insisted on this flow.

If you're referring to [1], I think "insisted on this flow" might be
an overstatement of what Alex was looking for.  IIUC Alex wants the
sysfs files to be visible only when they're useful, i.e., when a
driver implements ->sriov_set_msix_vec_count().

That seems reasonable and also seems like something a smarter
.is_visible() function could accomplish without having drivers call
pci_enable_vfs_overlay(), e.g., maybe some variation of this:

  static umode_t sriov_vf_attrs_are_visible(...)
  {
    if (!pdev->msix_cap || dev_is_pf(dev))
      return 0;

    pf = pci_physfn(pdev);
    if (pf->driver && pf->driver->sriov_set_msix_vec_count)
      return a->mode;

    return 0;
  }

(Probably needs locking while we look at pf->driver, just like in
pci_vf_set_msix_vec_count().)
 
pci_enable_vfs_overlay() significantly complicates the code and it's
the sort of sysfs code we're trying to avoid, so if we really do need
it, please add a brief comment about *why* we have to do it that way.

> > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> > > +{
> > > +	if (!dev->is_physfn)
> > > +		return;
> > > +
> > > +	dev->sriov->vf_total_msix = count;
> >
> > The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
> > mlx5, calls this, and all the PCI core does is hang onto the value and
> > expose it via sysfs.  I think I'd rather have a callback in struct
> > pci_driver and let the driver supply the value when needed.  I.e.,
> > sriov_vf_total_msix_show() would call the callback instead of looking
> > at pdev->sriov->vf_total_msix.
> 
> It will cause unnecessary locking to ensure that the driver doesn't
> vanish during a sysfs read. I can change it, but I don't think that it
> is the right decision.

Doesn't sysfs already ensure the driver can't vanish while we're
executing a DEVICE_ATTR accessor?

> > > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count)
> > > +{
> > > +	struct pci_dev *pdev = pci_physfn(dev);
> > > +	int ret;
> > > +
> > > +	if (count < 0)
> > > +		/*
> > > +		 * We don't support negative numbers for now,
> > > +		 * but maybe in the future it will make sense.
> > > +		 */
> >
> > Drop the comment; I don't think it adds useful information.
> >
> > > +		return -EINVAL;
> > > +
> > > +	device_lock(&pdev->dev);
> > > +	if (!pdev->driver) {
> > > +		ret = -EOPNOTSUPP;
> > > +		goto err_pdev;
> > > +	}
> > > +
> > > +	device_lock(&dev->dev);
> > > +	if (dev->driver) {
> > > +		/*
> > > +		 * Driver already probed this VF and configured itself
> > > +		 * based on previously configured (or default) MSI-X vector
> > > +		 * count. It is too late to change this field for this
> > > +		 * specific VF.
> > > +		 */
> > > +		ret = -EBUSY;
> > > +		goto err_dev;
> > > +	}
> > > +
> > > +	ret = pdev->driver->sriov_set_msix_vec_count(dev, count);
> >
> > This looks like a NULL pointer dereference.
> 
> In the current code it is impossible: pci_vf_set_msix_vec_count() is
> called only for devices that support sriov_vf_msix_count, which is
> possible only with ->sriov_set_msix_vec_count().

OK.  I think you're right, but it takes quite a lot of analysis to
prove that right now.  If the .is_visible() function for
sriov_vf_msix_count checked whether the driver implements
->sriov_set_msix_vec_count(), it would be quite a bit easier,
and it might even help get rid of pci_enable_vfs_overlay().

Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's
no actual *reason* for it to be there other than the fact that it has
"msix" in the name.  It uses no MSI data structures.  Maybe it could
be folded into sriov_vf_msix_count_store(), which would make the
analysis even easier.
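
E.g., roughly (just a sketch, not tested):

  static ssize_t sriov_vf_msix_count_store(struct device *dev,
                                           struct device_attribute *attr,
                                           const char *buf, size_t count)
  {
          struct pci_dev *vf_dev = to_pci_dev(dev);
          struct pci_dev *pdev = pci_physfn(vf_dev);
          int val, ret;

          ret = kstrtoint(buf, 0, &val);
          if (ret)
                  return ret;

          if (val < 0)
                  return -EINVAL;

          device_lock(&pdev->dev);
          if (!pdev->driver || !pdev->driver->sriov_set_msix_vec_count) {
                  ret = -EOPNOTSUPP;
                  goto err_pdev;
          }

          device_lock(&vf_dev->dev);
          if (vf_dev->driver) {
                  /* VF driver already configured itself; too late to change */
                  ret = -EBUSY;
                  goto err_dev;
          }

          ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val);

  err_dev:
          device_unlock(&vf_dev->dev);
  err_pdev:
          device_unlock(&pdev->dev);

          return ret ? ret : count;
  }

with the NULL-callback check done right there, next to the sysfs file it
serves.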

> > > @@ -326,6 +327,9 @@ struct pci_sriov {
> > >  	u16		subsystem_device; /* VF subsystem device */
> > >  	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
> > >  	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
> > > +	u32		vf_total_msix;  /* Total number of MSI-X vectors the VFs
> > > +					 * can consume
> > > +					 */
> >
> >   * can consume */
> >
> > Hopefully you can eliminate vf_total_msix altogether.
> 
> I think that will be worse from the flow point of view (extra locks)
> and the memory if you are worried about it. This variable consumes 4
> bytes, the pointer to the function will take 8 bytes.

I'm not concerned about the space.  The function pointer is in struct
pci_driver, whereas the variable is in struct pci_sriov, so the
variable would likely consume more space because there are probably
more VFs than pci_drivers.

My bigger concern is that vf_total_msix is really *driver* state, not
PCI core state, and I'd prefer if the PCI core were not responsible
for it.

> > > +++ b/include/linux/pci.h
> > > @@ -856,6 +856,8 @@ struct module;
> > >   *		e.g. drivers/net/e100.c.
> > >   * @sriov_configure: Optional driver callback to allow configuration of
> > >   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> > > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> > > + *              exposed by the sysfs "vf_msix_vec" entry.
> >
> > "vf_msix_vec" is apparently stale?  There's no other reference in this
> > patch.
> >
> > I think the important part is that this changes the number of vectors
> > advertised in the VF's MSI-X Message Control register, which will be
> > read when the VF driver enables MSI-X.
> >
> > If that's true, why would we expose this via a sysfs file?  We can
> > easily read it via lspci.
> 
> I did it for feature complete, we don't need read of this sysfs.

If you don't need this sysfs file, just remove it.  If you do need it,
please update the "vf_msix_vec" so it's correct.  Also, clarify the
description so we can tell that we're changing the Table Size VFs will
see in their Message Control registers.
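
For example, the kernel-doc could read something like this (wording is only
a suggestion):

   * @sriov_set_msix_vec_count: PF driver callback to change the MSI-X Table
   *              Size that a VF will advertise in its MSI-X Message Control
   *              register, exposed via the VF's "sriov_vf_msix_count" sysfs
   *              file.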

> > > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}
> >
> > Also here.  Unless removing the space would make this fit in 80
> > columns.
> 
> Yes, it is 80 columns without space.

No, I don't think it is.  In an 80 column window, the result looks
like:

  static inline void pci_sriov_set_vf_total_msix(...) {
  }

Please change it so it looks like this so it matches the rest of the
file:

  static inline void pci_sriov_set_vf_total_msix(...)
  { }

Bjorn

[1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com
Leon Romanovsky Feb. 4, 2021, 3:50 p.m. UTC | #5
On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:
> On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > Extend PCI sysfs interface with a new callback that allows
> > > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > > This is needed to optimize the performance of newly bound
> > > > devices by allocating the number of vectors based on the
> > > > administrator knowledge of targeted VM.
> > >
> > > I'm reading between the lines here, but IIUC the point is that you
> > > have a PF that supports a finite number of MSI-X vectors for use
> > > by all the VFs, and this interface is to control the distribution
> > > of those MSI-X vectors among the VFs.
> >
> > The MSI-X is HW resource, all devices in the world have limitation
> > here.
> >
> > > > This function is applicable for SR-IOV VF because such devices
> > > > allocate their MSI-X table before they will run on the VMs and
> > > > HW can't guess the right number of vectors, so the HW allocates
> > > > them statically and equally.
> > >
> > > This is written in a way that suggests this is behavior required
> > > by the PCIe spec.  If it is indeed related to something in the
> > > spec, please cite it.
> >
> > Spec doesn't say it directly, but you will need to really hurt brain
> > of your users if you decide to do it differently. You have one
> > enable bit to create all VFs at the same time without any option to
> > configure them in advance.
> >
> > Of course, you can create some partition map, upload it to FW and
> > create from there.
>
> Of course all devices have limitations.  But let's add some details
> about the way *this* device works.  That will help avoid giving the
> impression that this is the *only* way spec-conforming devices can
> work.

Sure

>
> > > "such devices allocate their MSI-X table before they will run on
> > > the VMs": Let's be specific here.  This MSI-X Table allocation
> > > apparently doesn't happen when we set VF Enable in the PF, because
> > > these sysfs files are attached to the VFs, which don't exist yet.
> > > It's not the VF driver binding, because that's a software
> > > construct.  What is the hardware event that triggers the
> > > allocation?
> >
> > Write of MSI-X vector count to the FW through PF.
>
> This is an example of something that is obviously specific to this
> mlx5 device.  The Table Size field in Message Control is RO per spec,
> and obviously firmware on the device is completely outside the scope
> of the PCIe spec.
>
> This commit log should describe how *this* device manages this
> allocation and how the PF Table Size and the VF Table Sizes are
> related.  Per PCIe, there is no necessary connection between them.

There is no connection in mlx5 devices either. The PF is used as a vehicle
to access a VF that doesn't have a driver yet. From a "table size" perspective
they are completely independent, because the PF is already probed by its
driver and it is already too late to change it.

So the PF table size is static and can be changed only through a FW utility.

>
> > > > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > >   = 0 - feature is not supported
> > > >   > 0 - total number of MSI-X vectors to consume by the VFs
> > >
> > > "total number of MSI-X vectors available for distribution among the
> > > VFs"?
> >
> > Users need to be aware of how much vectors exist in the system.
>
> Understood -- if there's an interface to influence the distribution of
> vectors among VFs, one needs to know how many vectors there are to
> work with.
>
> My point was that "number of vectors to consume by VFs" is awkward
> wording, so I suggested an alternative.

Got it, thanks

>
> > > > +What:            /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> > > > +Date:            January 2021
> > > > +Contact: Leon Romanovsky <leonro@nvidia.com>
> > > > +Description:
> > > > +         This file is associated with the SR-IOV VFs.
> > > > +         It allows configuration of the number of MSI-X vectors for
> > > > +         the VF. This is needed to optimize performance of newly bound
> > > > +         devices by allocating the number of vectors based on the
> > > > +         administrator knowledge of targeted VM.
> > > > +
> > > > +         The values accepted are:
> > > > +          * > 0 - this will be number reported by the VF's MSI-X
> > > > +                  capability
> > > > +          * < 0 - not valid
> > > > +          * = 0 - will reset to the device default value
> > > > +
> > > > +         The file is writable if the PF is bound to a driver that
> > > > +         set sriov_vf_total_msix > 0 and there is no driver bound
> > > > +         to the VF.
>
> Drivers don't actually set "sriov_vf_total_msix".  This should
> probably say something like "the PF is bound to a driver that
> implements ->sriov_set_msix_vec_count()."

Sure, will change.

>
> > > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > > +Date:		January 2021
> > > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > > +Description:
> > > > +		This file is associated with the SR-IOV PFs.
> > > > +		It returns a total number of possible to configure MSI-X
> > > > +		vectors on the enabled VFs.
> > > > +
> > > > +		The values returned are:
> > > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > > +		 * = 0 - feature is not supported
>
> Can you swap the order of these two files in the documentation?
> sriov_vf_total_msix is associated with the PF and logically precedes
> sriov_vf_msix_count, which is associated with VFs.

I'll do.

>
> Not sure the "= 0" description is necessary here.  If the value
> returned is the number of MSI-X vectors available for assignment to
> VFs, "0" is a perfectly legitimate value.  It just means there are
> none.  It doesn't need to be described separately.

I wanted to help users and remove ambiguity. For example, mlx5 drivers
will always implement proper .set_...() callbacks, but for some devices
without the needed FW support the value will be 0. Instead of misleading
users with the false promise that the feature is supported when no vectors
are available, I decided to be more explicit. For users, 0 means: don't
try, it is not working.

>
> > > Do these filenames really need to contain both "sriov" and "vf"?
> >
> > This is what I was asked at some point. In previous versions the name
> > was without "sriov".
>
> We currently have:
>
>   $ grep DEVICE_ATTR drivers/pci/iov.c
>   static DEVICE_ATTR_RO(sriov_totalvfs);
>   static DEVICE_ATTR_RW(sriov_numvfs);
>   static DEVICE_ATTR_RO(sriov_offset);
>   static DEVICE_ATTR_RO(sriov_stride);
>   static DEVICE_ATTR_RO(sriov_vf_device);
>   static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
>
> If we put them in a new "vfs_overlay" directory, it seems like
> overkill to repeat the "vf" part, but I'm hoping the new files can end
> up next to these existing files.  In that case, I think it makes sense
> to include "sriov".  And it probably does make sense to include "vf"
> as well.

I put everything in a folder to group any possible future extensions.
Those extensions are applicable to SR-IOV VFs only; IMHO they deserve a
separate folder.

This is the link with the request to change name:
https://lore.kernel.org/linux-pci/CAKgT0UcJrSNMPAOoniRSnUn+wyRUkL62AfgR3-8QbAkak=pQ=w@mail.gmail.com/

Also, the _vf_ in sriov_vf_total_msix symbolizes that we are talking
explicitly about the VFs and not the whole device/PF.

>
> > > Should these be next to the existing SR-IOV sysfs files, i.e., in or
> > > alongside sriov_dev_attr_group?
> >
> > This was suggestion in previous versions. I didn't hear anyone
> > supporting it.
>
> Pointer to this discussion?  I'm not sure what value is added by a new
> directory.

In the link below, Alex talks about creating PF-driven sysfs entries that are
clearly visible to users as something other than PCI config space writes. The
overlay folder achieves that.

https://lore.kernel.org/linux-pci/CAKgT0UeYb5xz8iehE1Y0s-cyFbsy46bjF83BkA7qWZMkAOLR-g@mail.gmail.com/

and this is the resolution that proposed structure is OK:
https://lore.kernel.org/linux-pci/20210124190032.GD5038@unreal/

>
> > > Hmmm, I see pci_enable_vfs_overlay() is called by the driver.  I don't
> > > really like that because then we're dependent on drivers to maintain
> > > the PCI sysfs hierarchy.  E.g., a driver might forget to call
> > > pci_disable_vfs_overlay(), and then a future driver load will fail.
> > >
> > > Maybe this could be done with .is_visible() functions that call driver
> > > callbacks.
> >
> > It was in previous versions too, but this solution allows PF to control
> > the VFs overlay dynamically.
>
> See below; I think it might be possible to do this dynamically even
> without pci_enable_vfs_overlay().

The request was to have this overlay be PF driven. It means that if
the PF driver is removed, the folder should disappear. I tried to do it
with .is_visible() and found myself adding hooks in the generic PCI remove
flow. It was far from clean.

>
> > > > +int pci_enable_vfs_overlay(struct pci_dev *dev)
> > > > +{
> > > > +	struct pci_dev *virtfn;
> > > > +	int id, ret;
> > > > +
> > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > +		return 0;
> > > > +
> > > > +	ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	for (id = 0; id < dev->sriov->num_VFs; id++) {
> > > > +		virtfn = pci_get_domain_bus_and_slot(
> > > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > > +			pci_iov_virtfn_devfn(dev, id));
> > > > +
> > > > +		if (!virtfn)
> > > > +			continue;
> > > > +
> > > > +		ret = sysfs_create_group(&virtfn->dev.kobj,
> > > > +					 &sriov_vf_dev_attr_group);
> > > > +		if (ret)
> > > > +			goto out;
> > > > +	}
> > > > +	return 0;
> > > > +
> > > > +out:
> > > > +	while (id--) {
> > > > +		virtfn = pci_get_domain_bus_and_slot(
> > > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > > +			pci_iov_virtfn_devfn(dev, id));
> > > > +
> > > > +		if (!virtfn)
> > > > +			continue;
> > > > +
> > > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > > +	}
> > > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > > +	return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay);
> > > > +
> > > > +void pci_disable_vfs_overlay(struct pci_dev *dev)
> > > > +{
> > > > +	struct pci_dev *virtfn;
> > > > +	int id;
> > > > +
> > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > +		return;
> > > > +
> > > > +	id = dev->sriov->num_VFs;
> > > > +	while (id--) {
> > > > +		virtfn = pci_get_domain_bus_and_slot(
> > > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > > +			pci_iov_virtfn_devfn(dev, id));
> > > > +
> > > > +		if (!virtfn)
> > > > +			continue;
> > > > +
> > > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > > +	}
> > > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
> > >
> > > I'm not convinced all this sysfs wrangling is necessary.  If it is,
> > > add a hint in a comment about why this is special and can't use
> > > something like sriov_dev_attr_group.
> >
> > This makes the overlay to be PF-driven. Alexander insisted on this flow.
>
> If you're referring to [1], I think "insisted on this flow" might be
> an overstatement of what Alex was looking for.  IIUC Alex wants the
> sysfs files to be visible only when they're useful, i.e., when a
> driver implements ->sriov_set_msix_vec_count().

That is only one side of the request: the sysfs files shouldn't be visible
if the PF driver was removed, and should become visible again when it is
probed again.

>
> That seems reasonable and also seems like something a smarter
> .is_visible() function could accomplish without having drivers call
> pci_enable_vfs_overlay(), e.g., maybe some variation of this:
>
>   static umode_t sriov_vf_attrs_are_visible(...)
>   {
>     if (!pdev->msix_cap || dev_is_pf(dev))
>       return 0;
>
>     pf = pci_physfn(pdev);
>     if (pf->driver && pf->driver->sriov_set_msix_vec_count)
>       return a->mode;
>
>     return 0;
>   }

It doesn't work with the following flow:
1. load driver
2. disable autoprobe
3. echo to sriov_numvfs
.... <--- the sysfs files are created here, with sriov_vf_attrs_are_visible()
          evaluated once
4. unload driver
.... <--- the sysfs files still exist despite the PF driver being gone.

>
> (Probably needs locking while we look at pf->driver, just like in
> pci_vf_set_msix_vec_count().)
>
> pci_enable_vfs_overlay() significantly complicates the code and it's
> the sort of sysfs code we're trying to avoid, so if we really do need
> it, please add a brief comment about *why* we have to do it that way.

I will add.

>
> > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> > > > +{
> > > > +	if (!dev->is_physfn)
> > > > +		return;
> > > > +
> > > > +	dev->sriov->vf_total_msix = count;
> > >
> > > The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
> > > mlx5, calls this, and all the PCI core does is hang onto the value and
> > > expose it via sysfs.  I think I'd rather have a callback in struct
> > > pci_driver and let the driver supply the value when needed.  I.e.,
> > > sriov_vf_total_msix_show() would call the callback instead of looking
> > > at pdev->sriov->vf_total_msix.
> >
> > It will cause to unnecessary locking to ensure that driver doesn't
> > vanish during sysfs read. I can change, but don't think that it is right
> > decision.
>
> Doesn't sysfs already ensure the driver can't vanish while we're
> executing a DEVICE_ATTR accessor?

It does not; you can see it by adding a device_lock_held() check in any
.show attribute. See drivers/base/core.c: dev_attr_show(); it doesn't
do much. This is why pci_vf_set_msix_vec_count() takes the double lock.
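
For example, a quick way to see it on a lockdep-enabled kernel (a debugging
sketch only; "foo" is a made-up attribute, and lockdep_assert_held() warns
when the lock is *not* held):

  static ssize_t foo_show(struct device *dev, struct device_attribute *attr,
                          char *buf)
  {
    /*
     * This warns at runtime, because dev_attr_show() calls us without
     * taking the device lock first.
     */
    lockdep_assert_held(&dev->mutex);

    return sprintf(buf, "whatever\n");
  }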

>
> > > > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count)
> > > > +{
> > > > +	struct pci_dev *pdev = pci_physfn(dev);
> > > > +	int ret;
> > > > +
> > > > +	if (count < 0)
> > > > +		/*
> > > > +		 * We don't support negative numbers for now,
> > > > +		 * but maybe in the future it will make sense.
> > > > +		 */
> > >
> > > Drop the comment; I don't think it adds useful information.
> > >
> > > > +		return -EINVAL;
> > > > +
> > > > +	device_lock(&pdev->dev);
> > > > +	if (!pdev->driver) {
> > > > +		ret = -EOPNOTSUPP;
> > > > +		goto err_pdev;
> > > > +	}
> > > > +
> > > > +	device_lock(&dev->dev);
> > > > +	if (dev->driver) {
> > > > +		/*
> > > > +		 * Driver already probed this VF and configured itself
> > > > +		 * based on previously configured (or default) MSI-X vector
> > > > +		 * count. It is too late to change this field for this
> > > > +		 * specific VF.
> > > > +		 */
> > > > +		ret = -EBUSY;
> > > > +		goto err_dev;
> > > > +	}
> > > > +
> > > > +	ret = pdev->driver->sriov_set_msix_vec_count(dev, count);
> > >
> > > This looks like a NULL pointer dereference.
> >
> > In current code, it is impossible, the call to pci_vf_set_msix_vec_count()
> > will be only for devices that supports sriov_vf_msix_count which is
> > possible with ->sriov_set_msix_vec_count() only.
>
> OK.  I think you're right, but it takes quite a lot of analysis to
> prove that right now.  If the .is_visible() function for
> sriov_vf_msix_count checked whether the driver implements
> ->sriov_set_msix_vec_count(), it would be quite a bit easier,
> and it might even help get rid of pci_enable_vfs_overlay().
>
> Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's
> no actual *reason* for it to be there other than the fact that it has
> "msix" in the name.  It uses no MSI data structures.  Maybe it could
> be folded into sriov_vf_msix_count_store(), which would make the
> analysis even easier.

I put the _set_ command near the _get_ command, but I can move it to iov.c.

>
> > > > @@ -326,6 +327,9 @@ struct pci_sriov {
> > > >  	u16		subsystem_device; /* VF subsystem device */
> > > >  	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
> > > >  	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
> > > > +	u32		vf_total_msix;  /* Total number of MSI-X vectors the VFs
> > > > +					 * can consume
> > > > +					 */
> > >
> > >   * can consume */
> > >
> > > Hopefully you can eliminate vf_total_msix altogether.
> >
> > I think that will be worse from the flow point of view (extra locks)
> > and the memory if you are worried about it. This variable consumes 4
> > bytes, the pointer to the function will take 8 bytes.
>
> I'm not concerned about the space.  The function pointer is in struct
> pci_driver, whereas the variable is in struct pci_sriov, so the
> variable would likely consume more space because there are probably
> more VFs than pci_drivers.
>
> My bigger concern is that vf_total_msix is really *driver* state, not
> PCI core state, and I'd prefer if the PCI core were not responsible
> for it.

As I said above, I can do it, I just don't think that it is right.
Should I do it or not?

>
> > > > +++ b/include/linux/pci.h
> > > > @@ -856,6 +856,8 @@ struct module;
> > > >   *		e.g. drivers/net/e100.c.
> > > >   * @sriov_configure: Optional driver callback to allow configuration of
> > > >   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> > > > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> > > > + *              exposed by the sysfs "vf_msix_vec" entry.
> > >
> > > "vf_msix_vec" is apparently stale?  There's no other reference in this
> > > patch.
> > >
> > > I think the important part is that this changes the number of vectors
> > > advertised in the VF's MSI-X Message Control register, which will be
> > > read when the VF driver enables MSI-X.
> > >
> > > If that's true, why would we expose this via a sysfs file?  We can
> > > easily read it via lspci.
> >
> > I did it for feature complete, we don't need read of this sysfs.
>
> If you don't need this sysfs file, just remove it.  If you do need it,
> please update the "vf_msix_vec" so it's correct.  Also, clarify the
> description so we can tell that we're changing the Table Size VFs will
> see in their Message Control registers.

I'll do

>
> > > > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}
> > >
> > > Also here.  Unless removing the space would make this fit in 80
> > > columns.
> >
> > Yes, it is 80 columns without space.
>
> No, I don't think it is.  In an 80 column window, the result looks
> like:
>
>   static inline void pci_sriov_set_vf_total_msix(...) {
>   }
>
> Please change it so it looks like this so it matches the rest of the
> file:
>
>   static inline void pci_sriov_set_vf_total_msix(...)
>   { }

I counted back then, but sure, I will check again.

Thanks for your review.

>
> Bjorn
>
> [1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com
Bjorn Helgaas Feb. 4, 2021, 9:12 p.m. UTC | #6
On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote:
> On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:
> > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > Extend PCI sysfs interface with a new callback that allows
> > > > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > > > This is needed to optimize the performance of newly bound
> > > > > devices by allocating the number of vectors based on the
> > > > > administrator knowledge of targeted VM.
> > > >
> > > > I'm reading between the lines here, but IIUC the point is that you
> > > > have a PF that supports a finite number of MSI-X vectors for use
> > > > by all the VFs, and this interface is to control the distribution
> > > > of those MSI-X vectors among the VFs.

> > This commit log should describe how *this* device manages this
> > allocation and how the PF Table Size and the VF Table Sizes are
> > related.  Per PCIe, there is no necessary connection between them.
> 
> There is no connection in mlx5 devices either. PF is used as a vehicle
> to access VF that doesn't have driver yet. From "table size" perspective
> they completely independent, because PF already probed by driver and
> it is already too late to change it.
> 
> So PF table size is static and can be changed through FW utility only.

This is where a description of the device would be useful.

The fact that you need "sriov_vf_total_msix" to advertise how many
vectors are available and "sriov_vf_msix_count" to influence how they
are distributed across the VFs suggests that these Table Sizes are not
completely independent.

Can a VF have a bigger Table Size than the PF does?  Can all the VF
Table Sizes added together be bigger than the PF Table Size?  If VF A
has a larger Table Size, does that mean VF B must have a smaller Table
Size?

Obviously I do not understand the details about how this device works.
It would be helpful to have those details here.

Here's the sequence as I understand it:

  1) PF driver binds to PF
  2) PF driver enables VFs
  3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix
  4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF
  5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count
  6) VF driver binds to VF 
  7) VF reads MSI-X Message Control (Table Size)

Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read
different Table Sizes?  That would be a little weird.

I'm also a little concerned about doing 2 before 3 & 4.  That works
for mlx5 because it implements the Table Size adjustment in a way that
works *after* the VFs have been enabled.

But it seems conceivable that a device could implement vector
distribution in a way that would require the VF Table Sizes to be
fixed *before* enabling VFs.  That would be nice in the sense that the
VFs would be created "fully formed" and the VF Table Size would be
completely read-only as documented.

The other knob idea you mentioned at [2]:

  echo "0000:01:00.2 123" > sriov_vf_msix_count

would have the advantage of working for both cases.  That's definitely
more complicated, but at the same time, I would hate to carve a sysfs
interface into stone if it might not work for other devices.
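
For concreteness, a rough sketch of how such a PF-level knob could parse its
input (purely illustrative; the format, the locking, and the check that the
target really is one of this PF's VFs are all elided or assumed):

  static ssize_t sriov_vf_msix_count_store(struct device *dev,        /* PF */
                                           struct device_attribute *attr,
                                           const char *buf, size_t len)
  {
    struct pci_dev *pdev = to_pci_dev(dev);
    unsigned int domain, bus, slot, func;
    struct pci_dev *vf;
    int vec_count, ret;

    if (sscanf(buf, "%x:%x:%x.%x %d",
               &domain, &bus, &slot, &func, &vec_count) != 5)
      return -EINVAL;

    vf = pci_get_domain_bus_and_slot(domain, bus, PCI_DEVFN(slot, func));
    if (!vf)
      return -ENODEV;

    /* ... same driver presence/locking checks as the per-VF variant ... */
    ret = pdev->driver->sriov_set_msix_vec_count(vf, vec_count);
    pci_dev_put(vf);
    return ret ? ret : len;
  }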

> > > > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > > > +Date:		January 2021
> > > > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > > > +Description:
> > > > > +		This file is associated with the SR-IOV PFs.
> > > > > +		It returns a total number of possible to configure MSI-X
> > > > > +		vectors on the enabled VFs.
> > > > > +
> > > > > +		The values returned are:
> > > > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > > > +		 * = 0 - feature is not supported
> 
> > Not sure the "= 0" description is necessary here.  If the value
> > returned is the number of MSI-X vectors available for assignment to
> > VFs, "0" is a perfectly legitimate value.  It just means there are
> > none.  It doesn't need to be described separately.
> 
> I wanted to help users and remove ambiguity. For example, mlx5 drivers
> will always implement proper .set_...() callbacks but for some devices
> without needed FW support, the value will be 0. Instead of misleading
> users with wrong promise that feature supported but doesn't have
> available vectors, I decided to be more clear. For the users, 0 means, don't
> try, it is not working.

Oh, you mean "feature is not supported by the FIRMWARE on some mlx5
devices"?  I totally missed that; I thought you meant "not supported
by the PF driver."  Why not have the PF driver detect that the
firmware doesn't support the feature and just not expose the sysfs
files at all in that case?

> > If we put them in a new "vfs_overlay" directory, it seems like
> > overkill to repeat the "vf" part, but I'm hoping the new files can end
> > up next to these existing files.  In that case, I think it makes sense
> > to include "sriov".  And it probably does make sense to include "vf"
> > as well.
> 
> I put everything in folder to group any possible future extensions.
> Those extensions are applicable to SR-IOV VFs only, IMHO they deserve
> separate folder.

I'm not convinced (yet) that the possibility of future extensions is
enough justification for adding the "vfs_overlay" directory.  It
really complicates the code flow -- if we skipped the new directory,
I'm pretty sure we could make .is_visible() work, which would be a
major simplification.

And there's quite a bit of value in the new files being right next to
the existing sriov_* files.

> > > > > +void pci_disable_vfs_overlay(struct pci_dev *dev)
> > > > > +{
> > > > > +	struct pci_dev *virtfn;
> > > > > +	int id;
> > > > > +
> > > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > > +		return;
> > > > > +
> > > > > +	id = dev->sriov->num_VFs;
> > > > > +	while (id--) {
> > > > > +		virtfn = pci_get_domain_bus_and_slot(
> > > > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > > > +			pci_iov_virtfn_devfn(dev, id));
> > > > > +
> > > > > +		if (!virtfn)
> > > > > +			continue;
> > > > > +
> > > > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > > > +	}
> > > > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
> > > >
> > > > I'm not convinced all this sysfs wrangling is necessary.  If it is,
> > > > add a hint in a comment about why this is special and can't use
> > > > something like sriov_dev_attr_group.
> > >
> > > This makes the overlay to be PF-driven. Alexander insisted on this flow.
> >
> > If you're referring to [1], I think "insisted on this flow" might be
> > an overstatement of what Alex was looking for.  IIUC Alex wants the
> > sysfs files to be visible only when they're useful, i.e., when a
> > driver implements ->sriov_set_msix_vec_count().
> 
> It is only one side of the request, the sysfs files shouldn't be visible
> if PF driver was removed and visible again when its probed again.

I can't parse this, but it's probably related to the question below.

> > That seems reasonable and also seems like something a smarter
> > .is_visible() function could accomplish without having drivers call
> > pci_enable_vfs_overlay(), e.g., maybe some variation of this:
> >
> >   static umode_t sriov_vf_attrs_are_visible(...)
> >   {
> >     if (!pdev->msix_cap || dev_is_pf(dev))
> >       return 0;
> >
> >     pf = pci_physfn(pdev);
> >     if (pf->driver && pf->driver->sriov_set_msix_vec_count)
> >       return a->mode;
> >
> >     return 0;
> >   }
> 
> It doesn't work with the following flow:
> 1. load driver
> 2. disable autoprobe
> 3. echo to sriov_numvfs
> .... <--- you have this sriov_vf_attrs_are_visible() created
> 4. unload driver
> .... <--- sysfs still exists despite not having PF driver.

I missed your point here, sorry.  After unloading the PF driver,
"pf->driver" in the sketch above will be NULL, so the VF sysfs file
would not be visible.  Right?  Maybe it has to do with autoprobe?  I
didn't catch what the significance of disabling autoprobe was.

> > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> > > > > +{
> > > > > +	if (!dev->is_physfn)
> > > > > +		return;
> > > > > +
> > > > > +	dev->sriov->vf_total_msix = count;
> > > >
> > > > The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
> > > > mlx5, calls this, and all the PCI core does is hang onto the value and
> > > > expose it via sysfs.  I think I'd rather have a callback in struct
> > > > pci_driver and let the driver supply the value when needed.  I.e.,
> > > > sriov_vf_total_msix_show() would call the callback instead of looking
> > > > at pdev->sriov->vf_total_msix.
> > >
> > > It will cause to unnecessary locking to ensure that driver doesn't
> > > vanish during sysfs read. I can change, but don't think that it is right
> > > decision.
> >
> > Doesn't sysfs already ensure the driver can't vanish while we're
> > executing a DEVICE_ATTR accessor?
> 
> It is not, you can see it by adding device_lock_held() check in any
> .show attribute. See drivers/base/core.c: dev_attr_show(), it doesn't
> do much. This is why pci_vf_set_msix_vec_count() has double lock.

Aahh, right, I learned something today, thanks!  There are only a few
PCI sysfs attributes that reference the driver, and they do their own
locking.

I do think vf_total_msix is a bit of driver state related to
implementing this functionality and doesn't need to be in the PCI
core.  I think device locking is acceptable; it's very similar to what
is done in sriov_numvfs_store().  Doing the locking and calling a
driver callback makes it obvious that vf_total_msix is part of this PF
driver-specific functionality, not a generic part of the PCI core.

So let's give it a try.  If it turns out to be terrible, we can
revisit it.
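
For concreteness, a sketch of what that could look like (the callback name
sriov_get_vf_total_msix is only a placeholder here, not something this
series defines):

  static ssize_t sriov_vf_total_msix_show(struct device *dev,
                                          struct device_attribute *attr,
                                          char *buf)
  {
    struct pci_dev *pdev = to_pci_dev(dev);
    u32 total = 0;

    device_lock(dev);
    if (pdev->driver && pdev->driver->sriov_get_vf_total_msix)
      total = pdev->driver->sriov_get_vf_total_msix(pdev);
    device_unlock(dev);

    return sprintf(buf, "%u\n", total);
  }
  static DEVICE_ATTR_RO(sriov_vf_total_msix);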

> > Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's
> > no actual *reason* for it to be there other than the fact that it has
> > "msix" in the name.  It uses no MSI data structures.  Maybe it could
> > be folded into sriov_vf_msix_count_store(), which would make the
> > analysis even easier.
> 
> I put _set_ command near _get_ command, but I can move it to iov.c

You mean you put pci_vf_set_msix_vec_count() near
pci_msix_vec_count()?  That's *true*, but they are not analogues, and
one is not setting the value returned by the other.

pci_vf_set_msix_vec_count() is a completely magical thing that uses a
device-specific mechanism on a PF that happens to change what
pci_msix_vec_count() on a VF will return later.  I think this is more
related to SR-IOV than it is to MSI.

Bjorn

> > [1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com

[2] https://lore.kernel.org/linux-pci/20210118132800.GA4835@unreal/
Leon Romanovsky Feb. 5, 2021, 5:35 p.m. UTC | #7
On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote:
> On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote:
> > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:
> > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > >
> > > > > > Extend PCI sysfs interface with a new callback that allows
> > > > > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > > > > This is needed to optimize the performance of newly bound
> > > > > > devices by allocating the number of vectors based on the
> > > > > > administrator knowledge of targeted VM.
> > > > >
> > > > > I'm reading between the lines here, but IIUC the point is that you
> > > > > have a PF that supports a finite number of MSI-X vectors for use
> > > > > by all the VFs, and this interface is to control the distribution
> > > > > of those MSI-X vectors among the VFs.
>
> > > This commit log should describe how *this* device manages this
> > > allocation and how the PF Table Size and the VF Table Sizes are
> > > related.  Per PCIe, there is no necessary connection between them.
> >
> > There is no connection in mlx5 devices either. PF is used as a vehicle
> > to access VF that doesn't have driver yet. From "table size" perspective
> > they completely independent, because PF already probed by driver and
> > it is already too late to change it.
> >
> > So PF table size is static and can be changed through FW utility only.
>
> This is where description of the device would be useful.
>
> The fact that you need "sriov_vf_total_msix" to advertise how many
> vectors are available and "sriov_vf_msix_count" to influence how they
> are distributed across the VFs suggests that these Table Sizes are not
> completely independent.
>
> Can a VF have a bigger Table Size than the PF does?  Can all the VF
> Table Sizes added together be bigger than the PF Table Size?  If VF A
> has a larger Table Size, does that mean VF B must have a smaller Table
> Size?

VFs are completely independent devices and their table size can be
bigger than the PF's. FW has two pools, one for the PF and another for all
VFs. In real-world scenarios, every VF will have more MSI-X vectors than
the PF, and they will be distributed by orchestration software.

In theory, users can assign almost all vectors to one VF and leave the others
depleted. In practice, it is not possible: we ensure that all VFs start
with some sensible number of vectors, and FW rejects writes above the
maximum vector count.

>
> Obviously I do not understand the details about how this device works.
> It would be helpful to have those details here.
>
> Here's the sequence as I understand it:
>
>   1) PF driver binds to PF
>   2) PF driver enables VFs
>   3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix
>   4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF
>   5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count
>   6) VF driver binds to VF
>   7) VF reads MSI-X Message Control (Table Size)
>
> Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read
> different Table Sizes?  That would be a little weird.

Yes, this is the flow. I see it differently: I think it is actually a
good thing that the user writes a new MSI-X count and it is shown
immediately.

>
> I'm also a little concerned about doing 2 before 3 & 4.  That works
> for mlx5 because implements the Table Size adjustment in a way that
> works *after* the VFs have been enabled.

It is not related to mlx5, but to the PCI spec that requires us to
create all VFs at the same time. Before enabling VFs, they don't
exist.

>
> But it seems conceivable that a device could implement vector
> distribution in a way that would require the VF Table Sizes to be
> fixed *before* enabling VFs.  That would be nice in the sense that the
> VFs would be created "fully formed" and the VF Table Size would be
> completely read-only as documented.

That is not how SR-IOV is used in the real world. Cloud providers create many
VFs but don't assign those VFs right away. They do it after a customer
requests a VM with specific properties (number of CPUs), which translates to
a number of MSI-X vectors in our case.

All this is done while other VFs are already in use, so we can't destroy and
recreate them at that point in time.

>
> The other knob idea you mentioned at [2]:
>
>   echo "0000:01:00.2 123" > sriov_vf_msix_count

This knob doesn't always work if you have many devices, and it is a
nightmare to guess the "new" VF's BDF before it is claimed.

As an example, on my machine with two devices, the VFs get different BDFs
depending on which PF creates them:
[root@c ~]$ lspci |grep nox
08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
[root@c ~]# echo 2 > /sys/bus/pci/devices/0000\:08\:00.1/sriov_numvfs
[root@c ~]# lspci |grep nox
08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:01.2 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
08:01.3 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
[root@c ~]# echo 0 > /sys/bus/pci/devices/0000\:08\:00.1/sriov_numvfs
[root@c ~]# echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
[root@c ~]# lspci |grep nox
08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
08:00.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]

>
> would have the advantage of working for both cases.  That's definitely
> more complicated, but at the same time, I would hate to carve a sysfs
> interface into stone if it might not work for other devices.

As you can see above, it doesn't really solve the pre-creation configuration
flow. For such flows, devlink is probably the best choice.

>
> > > > > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > > > > +Date:		January 2021
> > > > > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > > > > +Description:
> > > > > > +		This file is associated with the SR-IOV PFs.
> > > > > > +		It returns a total number of possible to configure MSI-X
> > > > > > +		vectors on the enabled VFs.
> > > > > > +
> > > > > > +		The values returned are:
> > > > > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > > > > +		 * = 0 - feature is not supported
> >
> > > Not sure the "= 0" description is necessary here.  If the value
> > > returned is the number of MSI-X vectors available for assignment to
> > > VFs, "0" is a perfectly legitimate value.  It just means there are
> > > none.  It doesn't need to be described separately.
> >
> > I wanted to help users and remove ambiguity. For example, mlx5 drivers
> > will always implement proper .set_...() callbacks but for some devices
> > without needed FW support, the value will be 0. Instead of misleading
> > users with wrong promise that feature supported but doesn't have
> > available vectors, I decided to be more clear. For the users, 0 means, don't
> > try, it is not working.
>
> Oh, you mean "feature is not supported by the FIRMWARE on some mlx5
> devices"?  I totally missed that; I thought you meant "not supported
> by the PF driver."  Why not have the PF driver detect that the
> firmware doesn't support the feature and just not expose the sysfs
> files at all in that case?

I can do it and will do it, but this behavior is only possible because
mlx5 is flexible enough. I imagine that other devices won't be so flexible;
for example, they may decide to "close" this feature because of some other
SW limitation.

>
> > > If we put them in a new "vfs_overlay" directory, it seems like
> > > overkill to repeat the "vf" part, but I'm hoping the new files can end
> > > up next to these existing files.  In that case, I think it makes sense
> > > to include "sriov".  And it probably does make sense to include "vf"
> > > as well.
> >
> > I put everything in folder to group any possible future extensions.
> > Those extensions are applicable to SR-IOV VFs only, IMHO they deserve
> > separate folder.
>
> I'm not convinced (yet) that the possibility of future extensions is
> enough justification for adding the "vfs_overlay" directory.  It
> really complicates the code flow -- if we skipped the new directory,
> I'm pretty sure we could make .is_visible() work, which would be a
> major simplification.

Unfortunately not: is_visible() is not dynamic; it is called when the
device's sysfs is created, and sysfs never rescans or refreshes the files.
It means that all files in the vfs_overlay folder will still exist even
after the driver is removed from the PF.

Also, sysfs is created before the driver is probed.

>
> And there's quite a bit of value in the new files being right next to
> the existing sriov_* files.

I have no problem dropping the folder, I just need a clear request whether
I should.

>
> > > > > > +void pci_disable_vfs_overlay(struct pci_dev *dev)
> > > > > > +{
> > > > > > +	struct pci_dev *virtfn;
> > > > > > +	int id;
> > > > > > +
> > > > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > > > +		return;
> > > > > > +
> > > > > > +	id = dev->sriov->num_VFs;
> > > > > > +	while (id--) {
> > > > > > +		virtfn = pci_get_domain_bus_and_slot(
> > > > > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > > > > +			pci_iov_virtfn_devfn(dev, id));
> > > > > > +
> > > > > > +		if (!virtfn)
> > > > > > +			continue;
> > > > > > +
> > > > > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > > > > +	}
> > > > > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
> > > > >
> > > > > I'm not convinced all this sysfs wrangling is necessary.  If it is,
> > > > > add a hint in a comment about why this is special and can't use
> > > > > something like sriov_dev_attr_group.
> > > >
> > > > This makes the overlay to be PF-driven. Alexander insisted on this flow.
> > >
> > > If you're referring to [1], I think "insisted on this flow" might be
> > > an overstatement of what Alex was looking for.  IIUC Alex wants the
> > > sysfs files to be visible only when they're useful, i.e., when a
> > > driver implements ->sriov_set_msix_vec_count().
> >
> > It is only one side of the request, the sysfs files shouldn't be visible
> > if PF driver was removed and visible again when its probed again.
>
> I can't parse this, but it's probably related to the question below.
>
> > > That seems reasonable and also seems like something a smarter
> > > .is_visible() function could accomplish without having drivers call
> > > pci_enable_vfs_overlay(), e.g., maybe some variation of this:
> > >
> > >   static umode_t sriov_vf_attrs_are_visible(...)
> > >   {
> > >     if (!pdev->msix_cap || dev_is_pf(dev))
> > >       return 0;
> > >
> > >     pf = pci_physfn(pdev);
> > >     if (pf->driver && pf->driver->sriov_set_msix_vec_count)
> > >       return a->mode;
> > >
> > >     return 0;
> > >   }
> >
> > It doesn't work with the following flow:
> > 1. load driver
> > 2. disable autoprobe
> > 3. echo to sriov_numvfs
> > .... <--- you have this sriov_vf_attrs_are_visible() created
> > 4. unload driver
> > .... <--- sysfs still exists despite not having PF driver.
>
> I missed your point here, sorry.  After unloading the PF driver,
> "pf->driver" in the sketch above will be NULL, so the VF sysfs file
> would not be visible.  Right?  Maybe it has to do with autoprobe?  I
> didn't catch what the significance of disabling autoprobe was.

No, is_visible() is called on file creation only; see fs/sysfs/group.c:
create_files(). After you remove the driver, there is no sysfs file refresh.
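
Roughly, the logic is (paraphrased from memory, not the exact source):

  /* shape of fs/sysfs/group.c:create_files(), heavily simplified */
  static void create_files_sketch(struct kobject *kobj,
                                  const struct attribute_group *grp)
  {
    struct attribute **attr;
    int i;

    for (i = 0, attr = grp->attrs; *attr; attr++, i++) {
      umode_t mode = (*attr)->mode;

      if (grp->is_visible) {
        mode = grp->is_visible(kobj, *attr, i);
        if (!mode)
          continue;   /* the file is simply never created */
      }
      /* ... add the sysfs file for *attr with 'mode' ... */
    }
  }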

>
> > > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> > > > > > +{
> > > > > > +	if (!dev->is_physfn)
> > > > > > +		return;
> > > > > > +
> > > > > > +	dev->sriov->vf_total_msix = count;
> > > > >
> > > > > The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
> > > > > mlx5, calls this, and all the PCI core does is hang onto the value and
> > > > > expose it via sysfs.  I think I'd rather have a callback in struct
> > > > > pci_driver and let the driver supply the value when needed.  I.e.,
> > > > > sriov_vf_total_msix_show() would call the callback instead of looking
> > > > > at pdev->sriov->vf_total_msix.
> > > >
> > > > It will cause to unnecessary locking to ensure that driver doesn't
> > > > vanish during sysfs read. I can change, but don't think that it is right
> > > > decision.
> > >
> > > Doesn't sysfs already ensure the driver can't vanish while we're
> > > executing a DEVICE_ATTR accessor?
> >
> > It is not, you can see it by adding device_lock_held() check in any
> > .show attribute. See drivers/base/core.c: dev_attr_show(), it doesn't
> > do much. This is why pci_vf_set_msix_vec_count() has double lock.
>
> Aahh, right, I learned something today, thanks!  There are only a few
> PCI sysfs attributes that reference the driver, and they do their own
> locking.
>
> I do think vf_total_msix is a bit of driver state related to
> implementing this functionality and doesn't need to be in the PCI
> core.  I think device locking is acceptable; it's very similar to what
> is done in sriov_numvfs_store().  Doing the locking and calling a
> driver callback makes it obvious that vf_total_msix is part of this PF
> driver-specific functionality, not a generic part of the PCI core.
>
> So let's give it a try.  If it turns out to be terrible, we can
> revisit it.

No problem.

>
> > > Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's
> > > no actual *reason* for it to be there other than the fact that it has
> > > "msix" in the name.  It uses no MSI data structures.  Maybe it could
> > > be folded into sriov_vf_msix_count_store(), which would make the
> > > analysis even easier.
> >
> > I put _set_ command near _get_ command, but I can move it to iov.c
>
> You mean you put pci_vf_set_msix_vec_count() near
> pci_msix_vec_count()?  That's *true*, but they are not analogues, and
> one is not setting the value returned by the other.
>
> pci_vf_set_msix_vec_count() is a completely magical thing that uses a
> device-specific mechanism on a PF that happens to change what
> pci_msix_vec_count() on a VF will return later.  I think this is more
> related to SR-IOV than it is to MSI.

I will do.

>
> Bjorn
>
> > > [1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com
>
> [2] https://lore.kernel.org/linux-pci/20210118132800.GA4835@unreal/
Bjorn Helgaas Feb. 5, 2021, 10:57 p.m. UTC | #8
On Fri, Feb 05, 2021 at 07:35:47PM +0200, Leon Romanovsky wrote:
> On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote:
> > On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote:
> > > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:
> > > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> > > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > >
> > > > > > > Extend PCI sysfs interface with a new callback that allows
> > > > > > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > > > > > This is needed to optimize the performance of newly bound
> > > > > > > devices by allocating the number of vectors based on the
> > > > > > > administrator knowledge of targeted VM.
> > > > > >
> > > > > > I'm reading between the lines here, but IIUC the point is that you
> > > > > > have a PF that supports a finite number of MSI-X vectors for use
> > > > > > by all the VFs, and this interface is to control the distribution
> > > > > > of those MSI-X vectors among the VFs.
> >
> > > > This commit log should describe how *this* device manages this
> > > > allocation and how the PF Table Size and the VF Table Sizes are
> > > > related.  Per PCIe, there is no necessary connection between them.
> > >
> > > There is no connection in mlx5 devices either. PF is used as a vehicle
> > > to access VF that doesn't have driver yet. From "table size" perspective
> > > they completely independent, because PF already probed by driver and
> > > it is already too late to change it.
> > >
> > > So PF table size is static and can be changed through FW utility only.
> >
> > This is where description of the device would be useful.
> >
> > The fact that you need "sriov_vf_total_msix" to advertise how many
> > vectors are available and "sriov_vf_msix_count" to influence how they
> > are distributed across the VFs suggests that these Table Sizes are not
> > completely independent.
> >
> > Can a VF have a bigger Table Size than the PF does?  Can all the VF
> > Table Sizes added together be bigger than the PF Table Size?  If VF A
> > has a larger Table Size, does that mean VF B must have a smaller Table
> > Size?
> 
> VFs are completely independent devices and their table size can be
> bigger than PF. FW has two pools, one for PF and another for all VFs.
> In real world scenarios, every VF will have more MSI-X vectors than PF,
> which will be distributed by orchestration software.

Well, if the sum of all the VF Table Sizes cannot exceed the size of
the FW pool for VFs, I would say the VFs are not completely
independent.  Increasing the Table Size of one VF reduces it for other
VFs.

This is an essential detail because it's the whole reason behind this
interface, so sketching this out in the commit log will make this much
easier to understand.

> > Here's the sequence as I understand it:
> >
> >   1) PF driver binds to PF
> >   2) PF driver enables VFs
> >   3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix
> >   4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF
> >   5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count
> >   6) VF driver binds to VF
> >   7) VF reads MSI-X Message Control (Table Size)
> >
> > Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read
> > different Table Sizes?  That would be a little weird.
> 
> Yes, this is the flow. I think differently from you and think this
> is actual good thing that user writes new msix count and it is shown
> immediately.

Only weird because per spec Table Size is read-only and in this
scenario it changes, so it may be surprising, but probably not a huge
deal.

> > I'm also a little concerned about doing 2 before 3 & 4.  That works
> > for mlx5 because implements the Table Size adjustment in a way that
> > works *after* the VFs have been enabled.
> 
> It is not related to mlx5, but to the PCI spec that requires us to
> create all VFs at the same time. Before enabling VFs, they don't
> exist.

Yes.  I can imagine a PF driver that collects characteristics for the
desired VFs before enabling them, sort of like we already collect the
*number* of VFs.  But I think your argument for dynamic changes after
creation below is pretty compelling.

> > But it seems conceivable that a device could implement vector
> > distribution in a way that would require the VF Table Sizes to be
> > fixed *before* enabling VFs.  That would be nice in the sense that the
> > VFs would be created "fully formed" and the VF Table Size would be
> > completely read-only as documented.
> 
> It is not how SR-IOV is used in real world. Cloud providers create many
> VFs but don't assign those VFs yet. They do it after customer request
> VM with specific properties (number of CPUs) which means number of MSI-X
> vectors in our case.
> 
> All this is done when other VFs already in use and we can't destroy and
> recreate them at that point of time.

Makes sense, thanks for this insight.  The need to change the MSI-X
Table Size dynamically while other VFs are in use would also be useful
in the commit log because it really helps motivate this design.

> > > > > > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > > > > > +Date:		January 2021
> > > > > > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > > > > > +Description:
> > > > > > > +		This file is associated with the SR-IOV PFs.
> > > > > > > +		It returns a total number of possible to configure MSI-X
> > > > > > > +		vectors on the enabled VFs.
> > > > > > > +
> > > > > > > +		The values returned are:
> > > > > > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > > > > > +		 * = 0 - feature is not supported
> > >
> > > > Not sure the "= 0" description is necessary here.  If the value
> > > > returned is the number of MSI-X vectors available for assignment to
> > > > VFs, "0" is a perfectly legitimate value.  It just means there are
> > > > none.  It doesn't need to be described separately.
> > >
> > > I wanted to help users and remove ambiguity. For example, mlx5 drivers
> > > will always implement proper .set_...() callbacks but for some devices
> > > without needed FW support, the value will be 0. Instead of misleading
> > > users with wrong promise that feature supported but doesn't have
> > > available vectors, I decided to be more clear. For the users, 0 means, don't
> > > try, it is not working.
> >
> > Oh, you mean "feature is not supported by the FIRMWARE on some mlx5
> > devices"?  I totally missed that; I thought you meant "not supported
> > by the PF driver."  Why not have the PF driver detect that the
> > firmware doesn't support the feature and just not expose the sysfs
> > files at all in that case?
> 
> I can do it and will do it, but this behavior will be possible because
> mlx5 is flexible enough. I imagine that other device won't be so flexible,
> for example they will decide to "close" this feature because of other
> SW limitation.

What about something like this?

  This file contains the total number of MSI-X vectors available for
  assignment to all VFs associated with this PF.  It may be zero if
  the device doesn't support this functionality.

Side question: How does user-space use this?  This file contains a
constant, and there's no interface to learn how many vectors are still
available in the pool, right?

I guess the user just has to manage the values written to
.../VF<n>/sriov_vf_msix_count so the sum of all of them never exceeds
sriov_vf_total_msix?  If that user app crashes, I guess it can
reconstruct this state by reading all the
.../VF<n>/sriov_vf_msix_count files?
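
(If so, the whole flow could be as simple as this user-space sketch -- the
paths, the BDFs, and the readability of sriov_vf_msix_count are all
assumptions here; a real tool would discover the VFs through the PF's
virtfn* symlinks and could also read the Table Size from config space:)

  #include <stdio.h>

  static const char *pf = "/sys/bus/pci/devices/0000:08:00.0";
  static const char *vfs[] = {
    "/sys/bus/pci/devices/0000:08:00.2",
    "/sys/bus/pci/devices/0000:08:00.3",
  };

  int main(void)
  {
    char path[256];
    int total = 0, used = 0, val;
    unsigned int i;
    FILE *f;

    snprintf(path, sizeof(path), "%s/sriov_vf_total_msix", pf);
    f = fopen(path, "r");
    if (!f || fscanf(f, "%d", &total) != 1)
      return 1;
    fclose(f);

    /* rebuild current usage, e.g. after the management app restarts */
    for (i = 0; i < sizeof(vfs) / sizeof(vfs[0]); i++) {
      snprintf(path, sizeof(path), "%s/sriov_vf_msix_count", vfs[i]);
      f = fopen(path, "r");
      if (f && fscanf(f, "%d", &val) == 1)
        used += val;
      if (f)
        fclose(f);
    }

    printf("total %d, assigned %d, free %d\n", total, used, total - used);
    return 0;
  }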

> > > > If we put them in a new "vfs_overlay" directory, it seems like
> > > > overkill to repeat the "vf" part, but I'm hoping the new files can end
> > > > up next to these existing files.  In that case, I think it makes sense
> > > > to include "sriov".  And it probably does make sense to include "vf"
> > > > as well.
> > >
> > > I put everything in folder to group any possible future extensions.
> > > Those extensions are applicable to SR-IOV VFs only, IMHO they deserve
> > > separate folder.
> >
> > I'm not convinced (yet) that the possibility of future extensions is
> > enough justification for adding the "vfs_overlay" directory.  It
> > really complicates the code flow -- if we skipped the new directory,
> > I'm pretty sure we could make .is_visible() work, which would be a
> > major simplification.
> 
> Unfortunately not, is_visible() is not dynamic and called when device
> creates sysfs and it doesn't rescan it or refresh them. It means that
> all files from vfs_overlay folder will exist even driver is removed
> from PF.
> 
> Also sysfs is created before driver is probed.

Ah, both excellent points, that makes this much clearer to me, thank
you!

> > And there's quite a bit of value in the new files being right next to
> > the existing sriov_* files.
> 
> I have no problem to drop folder, just need clear request, should I.

I think we should drop the folder.  I missed the previous discussion
though, so if somebody has a good objection to dropping it, let me
know.

Dropping the folder has the potential advantage that we *could* decide
to expose these files always, and just have them be non-functional if
the device doesn't support assignment or the PF driver isn't bound.
Then I think we could use static sysfs attributes without
pci_enable_vfs_overlay().

Bjorn
Leon Romanovsky Feb. 6, 2021, 12:42 p.m. UTC | #9
On Fri, Feb 05, 2021 at 04:57:03PM -0600, Bjorn Helgaas wrote:
> On Fri, Feb 05, 2021 at 07:35:47PM +0200, Leon Romanovsky wrote:
> > On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote:
> > > On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote:
> > > > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> > > > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > > > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > > >
> > > > > > > > Extend PCI sysfs interface with a new callback that allows
> > > > > > > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > > > > > > This is needed to optimize the performance of newly bound
> > > > > > > > devices by allocating the number of vectors based on the
> > > > > > > > administrator knowledge of targeted VM.
> > > > > > >
> > > > > > > I'm reading between the lines here, but IIUC the point is that you
> > > > > > > have a PF that supports a finite number of MSI-X vectors for use
> > > > > > > by all the VFs, and this interface is to control the distribution
> > > > > > > of those MSI-X vectors among the VFs.
> > >
> > > > > This commit log should describe how *this* device manages this
> > > > > allocation and how the PF Table Size and the VF Table Sizes are
> > > > > related.  Per PCIe, there is no necessary connection between them.
> > > >
> > > > There is no connection in mlx5 devices either. PF is used as a vehicle
> > > > to access VF that doesn't have driver yet. From "table size" perspective
> > > > they completely independent, because PF already probed by driver and
> > > > it is already too late to change it.
> > > >
> > > > So PF table size is static and can be changed through FW utility only.
> > >
> > > This is where description of the device would be useful.
> > >
> > > The fact that you need "sriov_vf_total_msix" to advertise how many
> > > vectors are available and "sriov_vf_msix_count" to influence how they
> > > are distributed across the VFs suggests that these Table Sizes are not
> > > completely independent.
> > >
> > > Can a VF have a bigger Table Size than the PF does?  Can all the VF
> > > Table Sizes added together be bigger than the PF Table Size?  If VF A
> > > has a larger Table Size, does that mean VF B must have a smaller Table
> > > Size?
> >
> > VFs are completely independent devices and their table size can be
> > bigger than PF. FW has two pools, one for PF and another for all VFs.
> > In real world scenarios, every VF will have more MSI-X vectors than PF,
> > which will be distributed by orchestration software.
>
> Well, if the sum of all the VF Table Sizes cannot exceed the size of
> the FW pool for VFs, I would say the VFs are not completely
> independent.  Increasing the Table Size of one VF reduces it for other
> VFs.
>
> This is an essential detail because it's the whole reason behind this
> interface, so sketching this out in the commit log will make this much
> easier to understand.

I think it is already written, but I will recheck.

>
> > > Here's the sequence as I understand it:
> > >
> > >   1) PF driver binds to PF
> > >   2) PF driver enables VFs
> > >   3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix
> > >   4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF
> > >   5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count
> > >   6) VF driver binds to VF
> > >   7) VF reads MSI-X Message Control (Table Size)
> > >
> > > Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read
> > > different Table Sizes?  That would be a little weird.
> >
> > Yes, this is the flow. I think differently from you and think this
> > is actual good thing that user writes new msix count and it is shown
> > immediately.
>
> Only weird because per spec Table Size is read-only and in this
> scenario it changes, so it may be surprising, but probably not a huge
> deal.
>
> > > I'm also a little concerned about doing 2 before 3 & 4.  That works
> > > for mlx5 because implements the Table Size adjustment in a way that
> > > works *after* the VFs have been enabled.
> >
> > It is not related to mlx5, but to the PCI spec that requires us to
> > create all VFs at the same time. Before enabling VFs, they don't
> > exist.
>
> Yes.  I can imagine a PF driver that collects characteristics for the
> desired VFs before enabling them, sort of like we already collect the
> *number* of VFs.  But I think your argument for dynamic changes after
> creation below is pretty compelling.

IMHO, the best tool for such pre-creation configuration would be devlink.

>
> > > But it seems conceivable that a device could implement vector
> > > distribution in a way that would require the VF Table Sizes to be
> > > fixed *before* enabling VFs.  That would be nice in the sense that the
> > > VFs would be created "fully formed" and the VF Table Size would be
> > > completely read-only as documented.
> >
> > It is not how SR-IOV is used in real world. Cloud providers create many
> > VFs but don't assign those VFs yet. They do it after customer request
> > VM with specific properties (number of CPUs) which means number of MSI-X
> > vectors in our case.
> >
> > All this is done when other VFs already in use and we can't destroy and
> > recreate them at that point of time.
>
> Makes sense, thanks for this insight.  The need to change the MSI-X
> Table Size dynamically while other VFs are in use would also be useful
> in the commit log because it really helps motivate this design.

It is already written, but I will add more info.

>
> > > > > > > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > > > > > > +Date:		January 2021
> > > > > > > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > > > > > > +Description:
> > > > > > > > +		This file is associated with the SR-IOV PFs.
> > > > > > > > +		It returns a total number of possible to configure MSI-X
> > > > > > > > +		vectors on the enabled VFs.
> > > > > > > > +
> > > > > > > > +		The values returned are:
> > > > > > > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > > > > > > +		 * = 0 - feature is not supported
> > > >
> > > > > Not sure the "= 0" description is necessary here.  If the value
> > > > > returned is the number of MSI-X vectors available for assignment to
> > > > > VFs, "0" is a perfectly legitimate value.  It just means there are
> > > > > none.  It doesn't need to be described separately.
> > > >
> > > > I wanted to help users and remove ambiguity. For example, mlx5 drivers
> > > > will always implement proper .set_...() callbacks but for some devices
> > > > without needed FW support, the value will be 0. Instead of misleading
> > > > users with wrong promise that feature supported but doesn't have
> > > > available vectors, I decided to be more clear. For the users, 0 means, don't
> > > > try, it is not working.
> > >
> > > Oh, you mean "feature is not supported by the FIRMWARE on some mlx5
> > > devices"?  I totally missed that; I thought you meant "not supported
> > > by the PF driver."  Why not have the PF driver detect that the
> > > firmware doesn't support the feature and just not expose the sysfs
> > > files at all in that case?
> >
> > I can do it and will do it, but this behavior will be possible because
> > mlx5 is flexible enough. I imagine that other device won't be so flexible,
> > for example they will decide to "close" this feature because of other
> > SW limitation.
>
> What about something like this?
>
>   This file contains the total number of MSI-X vectors available for
>   assignment to all VFs associated with this PF.  It may be zero if
>   the device doesn't support this functionality.

Sounds good.

>
> Side question: How does user-space use this?  This file contains a
> constant, and there's no interface to learn how many vectors are still
> available in the pool, right?

Right, it is easy for the kernel implementation and easy for the users.
They don't need the kernel to report the exact utilized number.

Every VF has its vectors assigned from the beginning, and in our flow there
is no race between the kernel and user space to claim resources, so by the
time the orchestration software gets a chance to run, the system is already
in a steady state.

Access to the hypervisor with the ability to write to these sysfs files
requires root permissions, and the management software already tracks the
number of users, their CPUs and the number of vectors.

>
> I guess the user just has to manage the values written to
> .../VF<n>/sriov_vf_msix_count so the sum of all of them never exceeds
> sriov_vf_total_msix?  If that user app crashes, I guess it can
> reconstruct this state by reading all the
> .../VF<n>/sriov_vf_msix_count files?

Yes, if the orchestration software crashes on the hypervisor, it will simply
re-iterate over all the VFs from the beginning, roughly as in the sketch
below.
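
To make that concrete, here is a minimal user-space sketch of such an
orchestration step, assuming the vfs_overlay layout from this version of the
patch; the PCI addresses and the per-VF vector counts are made up for
illustration only:

#include <stdio.h>

/* Hypothetical PF/VF addresses and per-VF counts, for illustration only. */
static const char *pf = "/sys/bus/pci/devices/0000:08:00.0";
static const char *vfs[] = { "0000:08:00.2", "0000:08:00.3" };
static const int wanted[] = { 16, 8 };

static int read_int(const char *path)
{
	FILE *f = fopen(path, "r");
	int val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

static int write_int(const char *path, int val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", val);
	return fclose(f);
}

int main(void)
{
	char path[256];
	int total, used = 0, i;

	snprintf(path, sizeof(path), "%s/vfs_overlay/sriov_vf_total_msix", pf);
	total = read_int(path);
	if (total <= 0)
		return 1;	/* 0 means the device doesn't support this */

	for (i = 0; i < 2; i++) {
		snprintf(path, sizeof(path),
			 "/sys/bus/pci/devices/%s/vfs_overlay/sriov_vf_msix_count",
			 vfs[i]);
		if (used + wanted[i] > total)
			break;	/* never exceed the advertised total */
		if (write_int(path, wanted[i]))
			continue;	/* e.g. a driver is already bound to this VF */
		used += wanted[i];
	}

	/* After a crash, re-reading the same files rebuilds "used". */
	return 0;
}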

>
> > > > > If we put them in a new "vfs_overlay" directory, it seems like
> > > > > overkill to repeat the "vf" part, but I'm hoping the new files can end
> > > > > up next to these existing files.  In that case, I think it makes sense
> > > > > to include "sriov".  And it probably does make sense to include "vf"
> > > > > as well.
> > > >
> > > > I put everything in folder to group any possible future extensions.
> > > > Those extensions are applicable to SR-IOV VFs only, IMHO they deserve
> > > > separate folder.
> > >
> > > I'm not convinced (yet) that the possibility of future extensions is
> > > enough justification for adding the "vfs_overlay" directory.  It
> > > really complicates the code flow -- if we skipped the new directory,
> > > I'm pretty sure we could make .is_visible() work, which would be a
> > > major simplification.
> >
> > Unfortunately not, is_visible() is not dynamic and called when device
> > creates sysfs and it doesn't rescan it or refresh them. It means that
> > all files from vfs_overlay folder will exist even driver is removed
> > from PF.
> >
> > Also sysfs is created before driver is probed.
>
> Ah, both excellent points, that makes this much clearer to me, thank
> you!
>
> > > And there's quite a bit of value in the new files being right next to
> > > the existing sriov_* files.
> >
> > I have no problem to drop folder, just need clear request, should I.
>
> I think we should drop the folder.  I missed the previous discussion
> though, so if somebody has a good objection to dropping it, let me
> know.

Just to be clear, dropping the "vfs_overlay" folder won't remove any of the
complexity of pci_enable_vfs_overlay()/pci_disable_vfs_overlay(); it only
amounts to dropping the name attribute from the attribute_group.
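
For reference, this is roughly what the VF group from this patch would look
like without the folder (my reading of "dropping the folder", not tested):

static const struct attribute_group sriov_vf_dev_attr_group = {
	.attrs		= sriov_vf_dev_attrs,
	.is_visible	= sriov_vf_attrs_are_visible,
	/* no .name, so sriov_vf_msix_count lands next to the other sriov_* files */
};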

>
> Dropping the folder has the potential advantage that we *could* decide
> to expose these files always, and just have them be non-functional if
> the device doesn't support assignment or the PF driver isn't bound.
> Then I think we could use static sysfs attributes without
> pci_enable_vfs_overlay().

Such static initialization was used in versions v0..v3, but the request was
for the new files to be PF-driven, and that requires the enable/disable
callbacks.

They need to allow the flow where the PF driver is bound after the VF
devices have already been created, for example (a sketch of the PF-driver
hooks follows the flow below):
1. modprobe mlx5_core
2. echo 5 > .../sriov_numvf
3. rmmod mlx5_core
...
4. modprobe mlx5_core <- need to reattach to already existing VFs.
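
To show how a PF driver would drive this, here is a hedged sketch, not the
actual mlx5 code: the "foo" driver name and its firmware helpers are
hypothetical, only the pci_*() calls come from this patch:

#include <linux/module.h>
#include <linux/pci.h>

/* Hypothetical firmware helpers, stand-ins for device-specific code. */
static u32 foo_query_vf_msix_pool(struct pci_dev *pdev);
static int foo_fw_set_vf_msix(struct pci_dev *vf, int count);

static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* ... normal PF setup, firmware handshake, etc. ... */

	/* Advertise the size of the firmware pool dedicated to the VFs. */
	pci_sriov_set_vf_total_msix(pdev, foo_query_vf_msix_pool(pdev));

	/*
	 * Step 4 of the flow above: if the VFs already exist because a
	 * previous driver instance enabled them, recreate the overlay.
	 */
	return pci_enable_vfs_overlay(pdev);
}

static void foo_remove(struct pci_dev *pdev)
{
	pci_disable_vfs_overlay(pdev);
	/* ... */
}

static int foo_sriov_configure(struct pci_dev *pdev, int num_vfs)
{
	int ret;

	if (!num_vfs) {
		pci_disable_vfs_overlay(pdev);
		pci_disable_sriov(pdev);
		return 0;
	}

	ret = pci_enable_sriov(pdev, num_vfs);
	if (ret)
		return ret;

	/* Step 2 of the flow above: expose the per-VF writable files. */
	ret = pci_enable_vfs_overlay(pdev);
	return ret ? ret : num_vfs;
}

static int foo_set_msix_vec_count(struct pci_dev *vf, int count)
{
	/* Ask the device/firmware to resize this VF's MSI-X table. */
	return foo_fw_set_vf_msix(vf, count);
}

static struct pci_driver foo_driver = {
	.name			  = "foo",
	.probe			  = foo_probe,
	.remove			  = foo_remove,
	.sriov_configure	  = foo_sriov_configure,
	.sriov_set_msix_vec_count = foo_set_msix_vec_count,
};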

Thanks

>
> Bjorn
diff mbox series

Patch

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 25c9c39770c6..4d206ade5331 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -375,3 +375,35 @@  Description:
 		The value comes from the PCI kernel device state and can be one
 		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
 		The file is read only.
+
+What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
+Date:		January 2021
+Contact:	Leon Romanovsky <leonro@nvidia.com>
+Description:
+		This file is associated with the SR-IOV VFs.
+		It allows configuration of the number of MSI-X vectors for
+		the VF. This is needed to optimize performance of newly bound
+		devices by allocating the number of vectors based on the
+		administrator knowledge of targeted VM.
+
+		The values accepted are:
+		 * > 0 - this will be number reported by the VF's MSI-X
+			 capability
+		 * < 0 - not valid
+		 * = 0 - will reset to the device default value
+
+		The file is writable if the PF is bound to a driver that
+		set sriov_vf_total_msix > 0 and there is no driver bound
+		to the VF.
+
+What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
+Date:		January 2021
+Contact:	Leon Romanovsky <leonro@nvidia.com>
+Description:
+		This file is associated with the SR-IOV PFs.
+		It returns a total number of possible to configure MSI-X
+		vectors on the enabled VFs.
+
+		The values returned are:
+		 * > 0 - this will be total number possible to consume by VFs,
+		 * = 0 - feature is not supported
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4afd4ee4f7f0..3e95f835eba5 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -31,6 +31,7 @@  int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 	return (dev->devfn + dev->sriov->offset +
 		dev->sriov->stride * vf_id) & 0xff;
 }
+EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);

 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
@@ -157,6 +158,166 @@  int pci_iov_sysfs_link(struct pci_dev *dev,
 	return rc;
 }

+#ifdef CONFIG_PCI_MSI
+static ssize_t sriov_vf_msix_count_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	int count = pci_msix_vec_count(pdev);
+
+	if (count < 0)
+		return count;
+
+	return sysfs_emit(buf, "%d\n", count);
+}
+
+static ssize_t sriov_vf_msix_count_store(struct device *dev,
+					 struct device_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct pci_dev *vf_dev = to_pci_dev(dev);
+	int val, ret;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	ret = pci_vf_set_msix_vec_count(vf_dev, val);
+	if (ret)
+		return ret;
+
+	return count;
+}
+static DEVICE_ATTR_RW(sriov_vf_msix_count);
+
+static ssize_t sriov_vf_total_msix_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return sysfs_emit(buf, "%u\n", pdev->sriov->vf_total_msix);
+}
+static DEVICE_ATTR_RO(sriov_vf_total_msix);
+#endif
+
+static struct attribute *sriov_pf_dev_attrs[] = {
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_sriov_vf_total_msix.attr,
+#endif
+	NULL,
+};
+
+static struct attribute *sriov_vf_dev_attrs[] = {
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_sriov_vf_msix_count.attr,
+#endif
+	NULL,
+};
+
+static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj,
+					  struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	if (!pdev->msix_cap || !dev_is_pf(dev))
+		return 0;
+
+	return a->mode;
+}
+
+static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
+					  struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	if (!pdev->msix_cap || dev_is_pf(dev))
+		return 0;
+
+	return a->mode;
+}
+
+static const struct attribute_group sriov_pf_dev_attr_group = {
+	.attrs = sriov_pf_dev_attrs,
+	.is_visible = sriov_pf_attrs_are_visible,
+	.name = "vfs_overlay",
+};
+
+static const struct attribute_group sriov_vf_dev_attr_group = {
+	.attrs = sriov_vf_dev_attrs,
+	.is_visible = sriov_vf_attrs_are_visible,
+	.name = "vfs_overlay",
+};
+
+int pci_enable_vfs_overlay(struct pci_dev *dev)
+{
+	struct pci_dev *virtfn;
+	int id, ret;
+
+	if (!dev->is_physfn || !dev->sriov->num_VFs)
+		return 0;
+
+	ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
+	if (ret)
+		return ret;
+
+	for (id = 0; id < dev->sriov->num_VFs; id++) {
+		virtfn = pci_get_domain_bus_and_slot(
+			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
+			pci_iov_virtfn_devfn(dev, id));
+
+		if (!virtfn)
+			continue;
+
+		ret = sysfs_create_group(&virtfn->dev.kobj,
+					 &sriov_vf_dev_attr_group);
+		if (ret)
+			goto out;
+	}
+	return 0;
+
+out:
+	while (id--) {
+		virtfn = pci_get_domain_bus_and_slot(
+			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
+			pci_iov_virtfn_devfn(dev, id));
+
+		if (!virtfn)
+			continue;
+
+		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
+	}
+	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay);
+
+void pci_disable_vfs_overlay(struct pci_dev *dev)
+{
+	struct pci_dev *virtfn;
+	int id;
+
+	if (!dev->is_physfn || !dev->sriov->num_VFs)
+		return;
+
+	id = dev->sriov->num_VFs;
+	while (id--) {
+		virtfn = pci_get_domain_bus_and_slot(
+			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
+			pci_iov_virtfn_devfn(dev, id));
+
+		if (!virtfn)
+			continue;
+
+		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
+	}
+	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
+}
+EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
+
 int pci_iov_add_virtfn(struct pci_dev *dev, int id)
 {
 	int i;
@@ -596,6 +757,7 @@  static void sriov_disable(struct pci_dev *dev)
 		sysfs_remove_link(&dev->dev.kobj, "dep_link");

 	iov->num_VFs = 0;
+	iov->vf_total_msix = 0;
 	pci_iov_set_numvfs(dev, 0);
 }

@@ -1054,6 +1216,24 @@  int pci_sriov_get_totalvfs(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);

+/**
+ * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
+ * @dev: the PCI PF device
+ * @count: the total number of MSI-X vector to consume by the VFs
+ *
+ * Sets the number of MSI-X vectors that the VFs can consume.
+ * This interface is the complementary part of pci_vf_set_msix_vec_count(),
+ * which is used to configure the required number on the VF.
+ */
+void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
+{
+	if (!dev->is_physfn)
+		return;
+
+	dev->sriov->vf_total_msix = count;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
+
 /**
  * pci_sriov_configure_simple - helper to configure SR-IOV
  * @dev: the PCI device
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 3162f88fe940..1022fe9e6efd 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -991,6 +991,53 @@  int pci_msix_vec_count(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_msix_vec_count);

+/**
+ * pci_vf_set_msix_vec_count - change the reported number of MSI-X vectors
+ * This function is applicable for SR-IOV VF because such devices allocate
+ * their MSI-X table before they will run on the VMs and HW can't guess the
+ * right number of vectors, so the HW allocates them statically and equally.
+ * @dev: VF device that is going to be changed
+ * @count: amount of MSI-X vectors
+ **/
+int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count)
+{
+	struct pci_dev *pdev = pci_physfn(dev);
+	int ret;
+
+	if (count < 0)
+		/*
+		 * We don't support negative numbers for now,
+		 * but maybe in the future it will make sense.
+		 */
+		return -EINVAL;
+
+	device_lock(&pdev->dev);
+	if (!pdev->driver) {
+		ret = -EOPNOTSUPP;
+		goto err_pdev;
+	}
+
+	device_lock(&dev->dev);
+	if (dev->driver) {
+		/*
+		 * Driver already probed this VF and configured itself
+		 * based on previously configured (or default) MSI-X vector
+		 * count. It is too late to change this field for this
+		 * specific VF.
+		 */
+		ret = -EBUSY;
+		goto err_dev;
+	}
+
+	ret = pdev->driver->sriov_set_msix_vec_count(dev, count);
+
+err_dev:
+	device_unlock(&dev->dev);
+err_pdev:
+	device_unlock(&pdev->dev);
+	return ret;
+}
+
 static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
 			     int nvec, struct irq_affinity *affd, int flags)
 {
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5c59365092fa..2bd6560d91e2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -183,6 +183,7 @@  extern unsigned int pci_pm_d3hot_delay;

 #ifdef CONFIG_PCI_MSI
 void pci_no_msi(void);
+int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count);
 #else
 static inline void pci_no_msi(void) { }
 #endif
@@ -326,6 +327,9 @@  struct pci_sriov {
 	u16		subsystem_device; /* VF subsystem device */
 	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
 	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
+	u32		vf_total_msix;  /* Total number of MSI-X vectors the VFs
+					 * can consume
+					 */
 };

 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b32126d26997..24d118ad6e7b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -856,6 +856,8 @@  struct module;
  *		e.g. drivers/net/e100.c.
  * @sriov_configure: Optional driver callback to allow configuration of
  *		number of VFs to enable via sysfs "sriov_numvfs" file.
+ * @sriov_set_msix_vec_count: Driver callback to change the number of MSI-X
+ *              vectors exposed by the sysfs "sriov_vf_msix_count" entry.
  * @err_handler: See Documentation/PCI/pci-error-recovery.rst
  * @groups:	Sysfs attribute groups.
  * @driver:	Driver model structure.
@@ -871,6 +873,7 @@  struct pci_driver {
 	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
 	void (*shutdown)(struct pci_dev *dev);
 	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
+	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
 	const struct pci_error_handlers *err_handler;
 	const struct attribute_group **groups;
 	struct device_driver	driver;
@@ -2059,6 +2062,9 @@  void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
 int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
 int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);

+int pci_enable_vfs_overlay(struct pci_dev *dev);
+void pci_disable_vfs_overlay(struct pci_dev *dev);
+
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);

@@ -2072,6 +2078,7 @@  int pci_sriov_get_totalvfs(struct pci_dev *dev);
 int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
 resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
+void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count);

 /* Arch may override these (weak) */
 int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
@@ -2100,6 +2107,8 @@  static inline int pci_iov_add_virtfn(struct pci_dev *dev, int id)
 }
 static inline void pci_iov_remove_virtfn(struct pci_dev *dev,
 					 int id) { }
+static inline int pci_enable_vfs_overlay(struct pci_dev *dev) { return 0; }
+static inline void pci_disable_vfs_overlay(struct pci_dev *dev) {}
 static inline void pci_disable_sriov(struct pci_dev *dev) { }
 static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
 static inline int pci_vfs_assigned(struct pci_dev *dev)
@@ -2112,6 +2121,7 @@  static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
 { return 0; }
 static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
+static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}
 #endif

 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)