
[mlx5-next,v7,0/4] Dynamically assign MSI-X vectors count

Message ID 20210301075524.441609-1-leon@kernel.org
Series: Dynamically assign MSI-X vectors count

Message

Leon Romanovsky March 1, 2021, 7:55 a.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

@Alexander Duyck, please update me if I can add your ROB tag again
to the series, because you liked v6 more.

Thanks

---------------------------------------------------------------------------------
Changelog
v7:
 * Rebased on top of v5.12-rc1
 * More English fixes
 * Returned to the static sysfs creation model as implemented in v0/v1.
v6: https://lore.kernel.org/linux-pci/20210209133445.700225-1-leon@kernel.org
 * Patch 1:
   * English fixes
   * Moved pci_vf_set_msix_vec_count() from msi.c to iov.c
   * Embedded pci_vf_set_msix_vec_count() into sriov_vf_msix_count_store
   * Deleted sriov_vf_msix_count_show
   * Deleted vfs_overlay folder
   * Renamed functions *_vfs_overlay_* to be *_vf_overlay_*
   * Deleted is_supported and attribute_group because they confused people more
     than they helped.
   * Changed vf_total_msix to be a callback
 * Patch 3:
   * Fixed English as suggested by Bjorn
   * Added more explanations to the commit message
 * Patch 4:
   * Protected enable/disable with capability check
v5: https://lore.kernel.org/linux-pci/20210126085730.1165673-1-leon@kernel.org
 * Patch 1:
  * Added forgotten "inline" keyword when declaring empty functions.
v4: https://lore.kernel.org/linux-pci/20210124131119.558563-1-leon@kernel.org
 * Used sysfs_emit() instead of sprintf() in new sysfs entries.
 * Changed EXPORT_SYMBOL to be EXPORT_SYMBOL_GPL for pci_iov_virtfn_devfn().
 * Rewrote sysfs registration code to be driven by the PF that wants to
   enable the VF overlay, instead of creating it for all SR-IOV devices.
 * Grouped all such functionality under new "vfs_overlay" folder.
 * Combined two PCI patches into one.
v3: https://lore.kernel.org/linux-pci/20210117081548.1278992-1-leon@kernel.org
 * Renamed pci_set_msix_vec_count to be pci_vf_set_msix_vec_count.
 * Added VF msix_cap check to hide the sysfs entry if the device doesn't
   support MSI-X.
 * Changed "-" to be ":" in the mlx5 patch to silence CI warnings about missing
   kdoc description.
 * Split the error print in the mlx5 driver differently to avoid a
   checkpatch warning.
v2: https://lore.kernel.org/linux-pci/20210114103140.866141-1-leon@kernel.org
 * Patch 1:
  * Renamed vf_msix_vec sysfs knob to be sriov_vf_msix_count
  * Added PF and VF device locks during set MSI-X call to protect from parallel
    driver bind/unbind operations.
   * Removed extra checks when reading sriov_vf_msix, because users will
     be able to distinguish between supported/not supported by looking at
     the sriov_vf_total_msix count.
  * Changed all occurrences of "numb" to be "count"
   * Changed the returned error from EOPNOTSUPP to EBUSY if the user tries to
     set the MSI-X count after a driver is already bound to the VF.
  * Added extra comment in pci_set_msix_vec_count() to emphasize that driver
    should not be bound.
 * Patch 2:
  * Changed vf_total_msix from int to be u32 and updated function signatures
    accordingly.
  * Improved patch title
v1: https://lore.kernel.org/linux-pci/20210110150727.1965295-1-leon@kernel.org
 * Improved wording and commit messages of first PCI patch
 * Added extra PCI patch to provide total number of MSI-X vectors
 * Prohibited read of vf_msix_vec sysfs file if driver doesn't support write
 * Removed extra function definition in pci.h
v0: https://lore.kernel.org/linux-pci/20210103082440.34994-1-leon@kernel.org

--------------------------------------------------------------------
Hi,

The number of MSI-X vectors is a PCI property visible through lspci. The
field is read-only and is configured by the device.

The static assignment of MSI-X vectors doesn't allow the newly created VF
to be utilized well, because the device can't know in advance the future
load and configuration of the environment where that VF will be used.

The VFs are created on the hypervisor and forwarded to VMs that have
different properties (for example, the number of CPUs).

To overcome the inefficiency in the spread of such MSI-X vectors, we allow
the kernel to instruct the device with the needed number of such vectors,
before the VF is initialized and bound to a driver.

Before this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
        Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

Configuration script:
1. Start fresh
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
modprobe -q -r mlx5_ib mlx5_core
2. Ensure that the driver doesn't run and it is safe to change MSI-X
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_drivers_autoprobe
3. Load driver for the PF
modprobe mlx5_core
4. Configure one of the VFs with the new number
echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
echo 21 > /sys/bus/pci/devices/0000\:08\:00.2/sriov_vf_msix_count
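5. (Optional) Check the total MSI-X pool the PF can distribute, through the
   sriov_vf_total_msix attribute added by this series (assuming the attribute
   is exposed in the PF sysfs directory)
cat /sys/bus/pci/devices/0000\:08\:00.0/sriov_vf_total_msix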

After this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
        Capabilities: [9c] MSI-X: Enable- Count=21 Masked-

Thanks

Leon Romanovsky (4):
  PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs
  net/mlx5: Add dynamic MSI-X capabilities bits
  net/mlx5: Dynamically assign MSI-X vectors count
  net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks

 Documentation/ABI/testing/sysfs-bus-pci       |  29 +++++
 .../net/ethernet/mellanox/mlx5/core/main.c    |   6 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  12 +++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  73 +++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   |  48 ++++++++-
 drivers/pci/iov.c                             | 102 ++++++++++++++++--
 drivers/pci/pci-sysfs.c                       |   3 +-
 drivers/pci/pci.h                             |   3 +-
 include/linux/mlx5/mlx5_ifc.h                 |  11 +-
 include/linux/pci.h                           |   8 ++
 10 files changed, 284 insertions(+), 11 deletions(-)

--
2.29.2

Comments

Leon Romanovsky March 7, 2021, 8:11 a.m. UTC | #1
On Mon, Mar 01, 2021 at 09:55:20AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> @Alexander Duyck, please update me if I can add your ROB tag again
> to the series, because you liked v6 more.
>
> Thanks
>
> ---------------------------------------------------------------------------------
> Changelog
> v7:
>  * Rebase on top v5.12-rc1
>  * More english fixes
>  * Returned to static sysfs creation model as was implemented in v0/v1.

Bjorn, are we good here?

Thanks
Alexander H Duyck March 7, 2021, 6:55 p.m. UTC | #2
On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> From: Leon Romanovsky <leonro@nvidia.com>
>
> @Alexander Duyck, please update me if I can add your ROB tag again
> to the series, because you liked v6 more.
>
> Thanks
>
> ---------------------------------------------------------------------------------
> Changelog
> v7:
>  * Rebase on top v5.12-rc1
>  * More english fixes
>  * Returned to static sysfs creation model as was implemented in v0/v1.

Yeah, so I am not a fan of the series. The problem is there is only
one driver that supports this, all VFs are going to expose this sysfs,
and I don't know how likely it is that any others are going to
implement this functionality. I feel like you threw out all the
progress from v2-v6.

I really feel like the big issue is that this model is broken as you
have the VFs exposing sysfs interfaces that make use of the PFs to
actually implement. Greg's complaint was the PF pushing sysfs onto the
VFs. My complaint is VFs sysfs files operating on the PF. The trick is
to find a way to address both issues.

Maybe the compromise is to reach down into the IOV code and have it
register the sysfs interface at device creation time in something like
pci_iov_sysfs_link if the PF has the functionality present to support
it.
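
One possible shape of that suggestion, purely as an illustrative sketch (the
capability-check and attribute names below are assumptions, not code from
this series):

#include <linux/device.h>
#include <linux/pci.h>

/*
 * Sketch: create the per-VF attribute from the SR-IOV core only when the
 * bound PF driver implements the MSI-X count callback.
 * dev_attr_sriov_vf_msix_count refers to the attribute added in patch 1.
 */
static void pci_iov_add_vf_msix_attr(struct pci_dev *pf, struct pci_dev *vf)
{
	if (!pf->driver || !pf->driver->sriov_set_msix_vec_count)
		return;

	if (sysfs_create_file(&vf->dev.kobj,
			      &dev_attr_sriov_vf_msix_count.attr))
		dev_warn(&vf->dev, "failed to create sriov_vf_msix_count\n");
}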

Also we might want to double check that the PF cannot be unbound while
the VF is present. I know for a while there it was possible to remove
the PF driver while the VF was present. The Mellanox drivers may not
allow it but it might not hurt to look at taking a reference against
the PF driver if you are allocating the VF MSI-X configuration sysfs
file.
Leon Romanovsky March 7, 2021, 7:19 p.m. UTC | #3
On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > @Alexander Duyck, please update me if I can add your ROB tag again
> > to the series, because you liked v6 more.
> >
> > Thanks
> >
> > ---------------------------------------------------------------------------------
> > Changelog
> > v7:
> >  * Rebase on top v5.12-rc1
> >  * More english fixes
> >  * Returned to static sysfs creation model as was implemented in v0/v1.
>
> Yeah, so I am not a fan of the series. The problem is there is only
> one driver that supports this, all VFs are going to expose this sysfs,
> and I don't know how likely it is that any others are going to
> implement this functionality. I feel like you threw out all the
> progress from v2-v6.

I'm with you here and tried to present the rationale in v6 when I had
a discussion with Bjorn, so it is unfair to say "you threw out".

Bjorn expressed his preference, and no one came forward to support v6.

>
> I really feel like the big issue is that this model is broken as you
> have the VFs exposing sysfs interfaces that make use of the PFs to
> actually implement. Greg's complaint was the PF pushing sysfs onto the
> VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> to find a way to address both issues.

It is hard to say something meaningful about Greg's complaint; he was
added in the middle of the discussion without much chance to get the
full picture.

>
> Maybe the compromise is to reach down into the IOV code and have it
> register the sysfs interface at device creation time in something like
> pci_iov_sysfs_link if the PF has the functionality present to support
> it.

IMHO, it adds nothing.

>
> Also we might want to double check that the PF cannot be unbound while
> the VF is present. I know for a while there it was possible to remove
> the PF driver while the VF was present. The Mellanox drivers may not
> allow it but it might not hurt to look at taking a reference against
> the PF driver if you are allocating the VF MSI-X configuration sysfs
> file.

Right now, we always allocate these sysfs files regardless of whether the
PF supports the feature or not. The check is done during the write() call
to such a sysfs file, and at that phase we check for the existence of the
drivers. It greatly simplifies the creation phase.
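
A minimal sketch of that write-time flow, for illustration only (the callback
name, error codes and locking order here follow this discussion but are
assumptions, not the actual patch code):

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/pci.h>

static ssize_t sriov_vf_msix_count_store(struct device *dev,
					 struct device_attribute *attr,
					 const char *buf, size_t count)
{
	struct pci_dev *vf = to_pci_dev(dev);
	struct pci_dev *pf = pci_physfn(vf);
	int val, ret;

	ret = kstrtoint(buf, 0, &val);
	if (ret)
		return ret;

	device_lock(&pf->dev);
	if (!pf->driver || !pf->driver->sriov_set_msix_vec_count) {
		/* no PF driver bound, or it doesn't support the feature */
		ret = -EOPNOTSUPP;
		goto unlock_pf;
	}

	device_lock(&vf->dev);
	if (vf->driver) {
		/* refuse to change the MSI-X count while a VF driver is bound */
		ret = -EBUSY;
		goto unlock_vf;
	}

	ret = pf->driver->sriov_set_msix_vec_count(vf, val);

unlock_vf:
	device_unlock(&vf->dev);
unlock_pf:
	device_unlock(&pf->dev);
	return ret ? ret : count;
}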

Thanks
Alexander H Duyck March 8, 2021, 4:33 p.m. UTC | #4
On Sun, Mar 7, 2021 at 11:19 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > to the series, because you liked v6 more.
> > >
> > > Thanks
> > >
> > > ---------------------------------------------------------------------------------
> > > Changelog
> > > v7:
> > >  * Rebase on top v5.12-rc1
> > >  * More english fixes
> > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> >
> > Yeah, so I am not a fan of the series. The problem is there is only
> > one driver that supports this, all VFs are going to expose this sysfs,
> > and I don't know how likely it is that any others are going to
> > implement this functionality. I feel like you threw out all the
> > progress from v2-v6.
>
> I'm with you here and tried to present the rationale in v6 when had
> a discussion with Bjorn, so it is unfair to say "you threw out".
>
> Bjorn expressed his preference, and no one came forward to support v6.

Sorry, it wasn't my intention to be accusatory. I'm just not a fan of
going back to where we were with v1.

With that said, if it is what Bjorn wants then you are probably better
off going with that. However if that is the direction we are going in
then you should probably focus on getting his Reviewed-by or Ack since
he will ultimately be the maintainer for the code.

> >
> > I really feel like the big issue is that this model is broken as you
> > have the VFs exposing sysfs interfaces that make use of the PFs to
> > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > to find a way to address both issues.
>
> It is hard to say something meaningful about Greg's complain, he was
> added in the middle of the discussion without much chances to get full
> picture.

Right, but what I am getting at is that the underlying problem is that
you either have sysfs being pushed onto a remote device, or sysfs that
is having to call into another device. It's not exactly something we
have had precedent for enabling before, and either perspective seems a
bit ugly.

> >
> > Maybe the compromise is to reach down into the IOV code and have it
> > register the sysfs interface at device creation time in something like
> > pci_iov_sysfs_link if the PF has the functionality present to support
> > it.
>
> IMHO, it adds nothing.

My thought was to reduce clutter. As I mentioned before with this
patch set we are enabling sysfs for functionality that is currently
only exposed by one device. I'm not sure it will be used by many
others or not. Having these sysfs interfaces instantiated at probe
time or at creation time in the case of VFs was preferable to me.

> >
> > Also we might want to double check that the PF cannot be unbound while
> > the VF is present. I know for a while there it was possible to remove
> > the PF driver while the VF was present. The Mellanox drivers may not
> > allow it but it might not hurt to look at taking a reference against
> > the PF driver if you are allocating the VF MSI-X configuration sysfs
> > file.
>
> Right now, we always allocate these sysfs without relation if PF
> supports or not. The check is done during write() call to such sysfs
> and at that phase we check the existence of the drivers. It greatly
> simplifies creation phase.

Yeah, I see that. From what I can tell the locking looks correct to
keep things from breaking. For what we have it is probably good enough
to keep things from causing any issues. My concern was more about
preventing the driver from reloading if we only exposed these
interfaces if the PF driver supported them.
Leon Romanovsky March 8, 2021, 7:20 p.m. UTC | #5
On Mon, Mar 08, 2021 at 08:33:03AM -0800, Alexander Duyck wrote:
> On Sun, Mar 7, 2021 at 11:19 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > to the series, because you liked v6 more.
> > > >
> > > > Thanks
> > > >
> > > > ---------------------------------------------------------------------------------
> > > > Changelog
> > > > v7:
> > > >  * Rebase on top v5.12-rc1
> > > >  * More english fixes
> > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> > >
> > > Yeah, so I am not a fan of the series. The problem is there is only
> > > one driver that supports this, all VFs are going to expose this sysfs,
> > > and I don't know how likely it is that any others are going to
> > > implement this functionality. I feel like you threw out all the
> > > progress from v2-v6.
> >
> > I'm with you here and tried to present the rationale in v6 when had
> > a discussion with Bjorn, so it is unfair to say "you threw out".
> >
> > Bjorn expressed his preference, and no one came forward to support v6.
>
> Sorry, it wasn't my intention to be accusatory. I'm just not a fan of
> going back to where we were with v1.
>
> With that said, if it is what Bjorn wants then you are probably better
> off going with that. However if that is the direction we are going in
> then you should probably focus on getting his Reviewed-by or Ack since
> he will ultimately be the maintainer for the code.

I hope that he will do it soon.

>
> > >
> > > I really feel like the big issue is that this model is broken as you
> > > have the VFs exposing sysfs interfaces that make use of the PFs to
> > > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > > to find a way to address both issues.
> >
> > It is hard to say something meaningful about Greg's complain, he was
> > added in the middle of the discussion without much chances to get full
> > picture.
>
> Right, but what I am getting at is that the underlying problem is that
> you either have sysfs being pushed onto a remote device, or sysfs that
> is having to call into another device. It's not exactly something we
> have had precedent for enabling before, and either perspective seems a
> bit ugly.

I don't agree with the word "ugly", but that is not the point. The point
is that this interface is backed by the ecosystem and is a must for
proper SR-IOV utilization.

>
> > >
> > > Maybe the compromise is to reach down into the IOV code and have it
> > > register the sysfs interface at device creation time in something like
> > > pci_iov_sysfs_link if the PF has the functionality present to support
> > > it.
> >
> > IMHO, it adds nothing.
>
> My thought was to reduce clutter. As I mentioned before with this
> patch set we are enabling sysfs for functionality that is currently
> only exposed by one device. I'm not sure it will be used by many
> others or not. Having these sysfs interfaces instantiated at probe
> time or at creation time in the case of VFs was preferable to me.

I said that to Bjorn in v6: I expect up to 2-3 vendors to support
this knob. There are not many devices on the market that are comparable
to mlx5 in both complexity and adoption.

Thanks
Leon Romanovsky March 10, 2021, 5:58 a.m. UTC | #6
On Mon, Mar 01, 2021 at 09:55:20AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>

<...>

> Leon Romanovsky (4):
>   PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs
>   net/mlx5: Add dynamic MSI-X capabilities bits
>   net/mlx5: Dynamically assign MSI-X vectors count
>   net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks

Bjorn,

This is gentle reminder, can we please progress with this series?

Thanks

>
>  Documentation/ABI/testing/sysfs-bus-pci       |  29 +++++
>  .../net/ethernet/mellanox/mlx5/core/main.c    |   6 ++
>  .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  12 +++
>  .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  73 +++++++++++++
>  .../net/ethernet/mellanox/mlx5/core/sriov.c   |  48 ++++++++-
>  drivers/pci/iov.c                             | 102 ++++++++++++++++--
>  drivers/pci/pci-sysfs.c                       |   3 +-
>  drivers/pci/pci.h                             |   3 +-
>  include/linux/mlx5/mlx5_ifc.h                 |  11 +-
>  include/linux/pci.h                           |   8 ++
>  10 files changed, 284 insertions(+), 11 deletions(-)
>
> --
> 2.29.2
>
Bjorn Helgaas March 10, 2021, 7:09 p.m. UTC | #7
On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > @Alexander Duyck, please update me if I can add your ROB tag again
> > to the series, because you liked v6 more.
> >
> > Thanks
> >
> > ---------------------------------------------------------------------------------
> > Changelog
> > v7:
> >  * Rebase on top v5.12-rc1
> >  * More english fixes
> >  * Returned to static sysfs creation model as was implemented in v0/v1.
> 
> Yeah, so I am not a fan of the series. The problem is there is only
> one driver that supports this, all VFs are going to expose this sysfs,
> and I don't know how likely it is that any others are going to
> implement this functionality. I feel like you threw out all the
> progress from v2-v6.

pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static
sysfs files regardless of whether the PF driver was bound.

> I really feel like the big issue is that this model is broken as you
> have the VFs exposing sysfs interfaces that make use of the PFs to
> actually implement. Greg's complaint was the PF pushing sysfs onto the
> VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> to find a way to address both issues.
> 
> Maybe the compromise is to reach down into the IOV code and have it
> register the sysfs interface at device creation time in something like
> pci_iov_sysfs_link if the PF has the functionality present to support
> it.

IIUC there are two questions on the table:

  1) Should the sysfs files be visible only when a PF driver that
     supports MSI-X vector assignment is bound?

     I think this is a cosmetic issue.  The presence of the file is
     not a reliable signal to management software; it must always
     tolerate files that don't exist (e.g., on old kernels) or files
     that are visible but don't work (e.g., vectors may be exhausted).

     If we start with the files always being visible, we should be
     able to add smarts later to expose them only when the PF driver
     is bound.

     My concerns with pci_enable_vf_overlay() are that it uses a
     little more sysfs internals than I'd like (although there are
     many callers of sysfs_create_files()) and it uses
     pci_get_domain_bus_and_slot(), which is generally a hack and
     creates refcounting hassles.  Speaking of which, isn't v6 missing
     a pci_dev_put() to match the pci_get_domain_bus_and_slot()?

  2) Should a VF sysfs file use the PF to implement this?

     Can you elaborate on your idea here?  I guess
     pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
     VF, and you're thinking we could also make a "virtfnX_msix_count"
     in the PF directory?  That's a really interesting idea.
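
      For instance (hypothetical layout, extending the existing "virtfnX"
      links that pci_iov_sysfs_link() creates today):

        /sys/bus/pci/devices/0000:08:00.0/virtfn0            -> ../0000:08:00.2
        /sys/bus/pci/devices/0000:08:00.0/virtfn0_msix_count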

> Also we might want to double check that the PF cannot be unbound while
> the VF is present. I know for a while there it was possible to remove
> the PF driver while the VF was present. The Mellanox drivers may not
> allow it but it might not hurt to look at taking a reference against
> the PF driver if you are allocating the VF MSI-X configuration sysfs
> file.

Unbinding the PF driver will either remove the *_msix_count files or
make them stop working.  Is that a problem?  I'm not sure we should
add a magic link that prevents driver unbinding.  Seems like it would
be hard for users to figure out why the driver can't be removed.

Bjorn
Leon Romanovsky March 10, 2021, 8:10 p.m. UTC | #8
On Wed, Mar 10, 2021 at 01:09:06PM -0600, Bjorn Helgaas wrote:
> On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > to the series, because you liked v6 more.
> > >
> > > Thanks
> > >
> > > ---------------------------------------------------------------------------------
> > > Changelog
> > > v7:
> > >  * Rebase on top v5.12-rc1
> > >  * More english fixes
> > >  * Returned to static sysfs creation model as was implemented in v0/v1.

<...>

>   2) Should a VF sysfs file use the PF to implement this?
>
>      Can you elaborate on your idea here?  I guess
>      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
>      VF, and you're thinking we could also make a "virtfnX_msix_count"
>      in the PF directory?  That's a really interesting idea.

I want to remind you that we are talking about mlx5 devices that support
up to 255 VFs, and they are indeed used to their limits. So seeing 255
links of virtfnX_msix_count in the same directory looks quite unpleasant
to me.

Thanks
gregkh@linuxfoundation.org March 10, 2021, 8:21 p.m. UTC | #9
On Wed, Mar 10, 2021 at 10:10:41PM +0200, Leon Romanovsky wrote:
> On Wed, Mar 10, 2021 at 01:09:06PM -0600, Bjorn Helgaas wrote:
> > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > to the series, because you liked v6 more.
> > > >
> > > > Thanks
> > > >
> > > > ---------------------------------------------------------------------------------
> > > > Changelog
> > > > v7:
> > > >  * Rebase on top v5.12-rc1
> > > >  * More english fixes
> > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> 
> <...>
> 
> >   2) Should a VF sysfs file use the PF to implement this?
> >
> >      Can you elaborate on your idea here?  I guess
> >      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
> >      VF, and you're thinking we could also make a "virtfnX_msix_count"
> >      in the PF directory?  That's a really interesting idea.
> 
> I want to remind that we are talking about mlx5 devices that support
> upto 255 VFs and they indeed are used to their limits. So seeing 255
> links of virtfnX_msix_count in the same directory looks too much unpleasant
> to me.

255 files are nothing, if that's what the hardware supports, what is the
problem?  If it's "unpleasant", go complain to the hardware designers :)

greg k-h
Alexander H Duyck March 10, 2021, 11:34 p.m. UTC | #10
On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > to the series, because you liked v6 more.
> > >
> > > Thanks
> > >
> > > ---------------------------------------------------------------------------------
> > > Changelog
> > > v7:
> > >  * Rebase on top v5.12-rc1
> > >  * More english fixes
> > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> >
> > Yeah, so I am not a fan of the series. The problem is there is only
> > one driver that supports this, all VFs are going to expose this sysfs,
> > and I don't know how likely it is that any others are going to
> > implement this functionality. I feel like you threw out all the
> > progress from v2-v6.
>
> pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static
> sysfs files regardless of whether the PF driver was bound.
>
> > I really feel like the big issue is that this model is broken as you
> > have the VFs exposing sysfs interfaces that make use of the PFs to
> > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > to find a way to address both issues.
> >
> > Maybe the compromise is to reach down into the IOV code and have it
> > register the sysfs interface at device creation time in something like
> > pci_iov_sysfs_link if the PF has the functionality present to support
> > it.
>
> IIUC there are two questions on the table:
>
>   1) Should the sysfs files be visible only when a PF driver that
>      supports MSI-X vector assignment is bound?
>
>      I think this is a cosmetic issue.  The presence of the file is
>      not a reliable signal to management software; it must always
>      tolerate files that don't exist (e.g., on old kernels) or files
>      that are visible but don't work (e.g., vectors may be exhausted).
>
>      If we start with the files always being visible, we should be
>      able to add smarts later to expose them only when the PF driver
>      is bound.
>
>      My concerns with pci_enable_vf_overlay() are that it uses a
>      little more sysfs internals than I'd like (although there are
>      many callers of sysfs_create_files()) and it uses
>      pci_get_domain_bus_and_slot(), which is generally a hack and
>      creates refcounting hassles.  Speaking of which, isn't v6 missing
>      a pci_dev_put() to match the pci_get_domain_bus_and_slot()?

I'm not so much worried about management software as the fact that
this is a vendor specific implementation detail that is shaping how
the kernel interfaces are meant to work. Other than the mlx5 I don't
know if there are any other vendors really onboard with this sort of
solution.

In addition it still feels rather hacky to be modifying read-only PCIe
configuration space on the fly via a backdoor provided by the PF. It
almost feels like this should be some sort of quirk rather than a
standard feature for an SR-IOV VF.

>   2) Should a VF sysfs file use the PF to implement this?
>
>      Can you elaborate on your idea here?  I guess
>      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
>      VF, and you're thinking we could also make a "virtfnX_msix_count"
>      in the PF directory?  That's a really interesting idea.

I would honestly be more comfortable if the PF owned these files
instead of the VFs. One of the things I didn't like about this back
during the V1/2 days was the fact that it gave the impression that
MSI-X count was something that is meant to be edited. Since then I
think at least the naming was changed so that it implies that this is
only possible due to SR-IOV.

I also didn't like that it makes the VFs feel like they are port
representors rather than being actual PCIe devices. Having
functionality that only works when the VF driver is not loaded just
feels off. The VF sysfs directory feels like it is being used as a
subdirectory of the PF rather than being a device on its own.

> > Also we might want to double check that the PF cannot be unbound while
> > the VF is present. I know for a while there it was possible to remove
> > the PF driver while the VF was present. The Mellanox drivers may not
> > allow it but it might not hurt to look at taking a reference against
> > the PF driver if you are allocating the VF MSI-X configuration sysfs
> > file.
>
> Unbinding the PF driver will either remove the *_msix_count files or
> make them stop working.  Is that a problem?  I'm not sure we should
> add a magic link that prevents driver unbinding.  Seems like it would
> be hard for users to figure out why the driver can't be removed.

I checked it again, it will make the *_msix_count files stop working.
In order to guarantee it cannot happen in the middle of things though
we are sitting on the device locks for both the PF and the VF. I'm not
a fan of having to hold 2 locks while we perform a firmware operation
for one device, but I couldn't find anything where we would deadlock
so it should be fine.
Leon Romanovsky March 11, 2021, 8:37 a.m. UTC | #11
On Wed, Mar 10, 2021 at 09:21:55PM +0100, Greg Kroah-Hartman wrote:
> On Wed, Mar 10, 2021 at 10:10:41PM +0200, Leon Romanovsky wrote:
> > On Wed, Mar 10, 2021 at 01:09:06PM -0600, Bjorn Helgaas wrote:
> > > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > > to the series, because you liked v6 more.
> > > > >
> > > > > Thanks
> > > > >
> > > > > ---------------------------------------------------------------------------------
> > > > > Changelog
> > > > > v7:
> > > > >  * Rebase on top v5.12-rc1
> > > > >  * More english fixes
> > > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> >
> > <...>
> >
> > >   2) Should a VF sysfs file use the PF to implement this?
> > >
> > >      Can you elaborate on your idea here?  I guess
> > >      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
> > >      VF, and you're thinking we could also make a "virtfnX_msix_count"
> > >      in the PF directory?  That's a really interesting idea.
> >
> > I want to remind that we are talking about mlx5 devices that support
> > upto 255 VFs and they indeed are used to their limits. So seeing 255
> > links of virtfnX_msix_count in the same directory looks too much unpleasant
> > to me.
>
> 255 files are nothing, if that's what the hardware supports, what is the
> problem?  If it's "unpleasant", go complain to the hardware designers :)

It is 255 of the same file that every SR-IOV user will see in the
/sys/bus/pci/devices/*/ folder, unless we do dynamic creation of those files,
and that is something that Bjorn didn't like in v7.

So instead of complaining to the hardware designers, I will complain here.
I probably implemented all possible variants already. :)

Thanks

>
> greg k-h
Bjorn Helgaas March 11, 2021, 6:17 p.m. UTC | #12
On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > to the series, because you liked v6 more.
> > > >
> > > > Thanks
> > > >
> > > > ---------------------------------------------------------------------------------
> > > > Changelog
> > > > v7:
> > > >  * Rebase on top v5.12-rc1
> > > >  * More english fixes
> > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> > >
> > > Yeah, so I am not a fan of the series. The problem is there is only
> > > one driver that supports this, all VFs are going to expose this sysfs,
> > > and I don't know how likely it is that any others are going to
> > > implement this functionality. I feel like you threw out all the
> > > progress from v2-v6.
> >
> > pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static
> > sysfs files regardless of whether the PF driver was bound.
> >
> > > I really feel like the big issue is that this model is broken as you
> > > have the VFs exposing sysfs interfaces that make use of the PFs to
> > > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > > to find a way to address both issues.
> > >
> > > Maybe the compromise is to reach down into the IOV code and have it
> > > register the sysfs interface at device creation time in something like
> > > pci_iov_sysfs_link if the PF has the functionality present to support
> > > it.
> >
> > IIUC there are two questions on the table:
> >
> >   1) Should the sysfs files be visible only when a PF driver that
> >      supports MSI-X vector assignment is bound?
> >
> >      I think this is a cosmetic issue.  The presence of the file is
> >      not a reliable signal to management software; it must always
> >      tolerate files that don't exist (e.g., on old kernels) or files
> >      that are visible but don't work (e.g., vectors may be exhausted).
> >
> >      If we start with the files always being visible, we should be
> >      able to add smarts later to expose them only when the PF driver
> >      is bound.
> >
> >      My concerns with pci_enable_vf_overlay() are that it uses a
> >      little more sysfs internals than I'd like (although there are
> >      many callers of sysfs_create_files()) and it uses
> >      pci_get_domain_bus_and_slot(), which is generally a hack and
> >      creates refcounting hassles.  Speaking of which, isn't v6 missing
> >      a pci_dev_put() to match the pci_get_domain_bus_and_slot()?
> 
> I'm not so much worried about management software as the fact that
> this is a vendor specific implementation detail that is shaping how
> the kernel interfaces are meant to work. Other than the mlx5 I don't
> know if there are any other vendors really onboard with this sort of
> solution.

I know this is currently vendor-specific, but I thought the value
proposition of dynamic configuration of VFs for different clients
sounded compelling enough that other vendors would do something
similar.  But I'm not an SR-IOV guy and have no vendor insight, so
maybe that's not the case?

> In addition it still feels rather hacky to be modifying read-only PCIe
> configuration space on the fly via a backdoor provided by the PF. It
> almost feels like this should be some sort of quirk rather than a
> standard feature for an SR-IOV VF.

I agree, I'm not 100% comfortable with modifying the read-only Table
Size register.  Maybe there's another approach that would be better?
It *is* nice that the current approach doesn't require changes in the
VF driver.

> >   2) Should a VF sysfs file use the PF to implement this?
> >
> >      Can you elaborate on your idea here?  I guess
> >      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
> >      VF, and you're thinking we could also make a "virtfnX_msix_count"
> >      in the PF directory?  That's a really interesting idea.
> 
> I would honestly be more comfortable if the PF owned these files
> instead of the VFs. One of the things I didn't like about this back
> during the V1/2 days was the fact that it gave the impression that
> MSI-X count was something that is meant to be edited. Since then I
> think at least the naming was changed so that it implies that this is
> only possible due to SR-IOV.
> 
> I also didn't like that it makes the VFs feel like they are port
> representors rather than being actual PCIe devices. Having
> functionality that only works when the VF driver is not loaded just
> feels off. The VF sysfs directory feels like it is being used as a
> subdirectory of the PF rather than being a device on its own.

Moving "virtfnX_msix_count" to the PF seems like it would mitigate
this somewhat.  I don't know how to make this work while a VF driver
is bound without making the VF feel even less like a PCIe device,
i.e., we won't be able to use the standard MSI-X model.

> > > Also we might want to double check that the PF cannot be unbound while
> > > the VF is present. I know for a while there it was possible to remove
> > > the PF driver while the VF was present. The Mellanox drivers may not
> > > allow it but it might not hurt to look at taking a reference against
> > > the PF driver if you are allocating the VF MSI-X configuration sysfs
> > > file.
> >
> > Unbinding the PF driver will either remove the *_msix_count files or
> > make them stop working.  Is that a problem?  I'm not sure we should
> > add a magic link that prevents driver unbinding.  Seems like it would
> > be hard for users to figure out why the driver can't be removed.
> 
> I checked it again, it will make the *_msix_count files stop working.
> In order to guarantee it cannot happen in the middle of things though
> we are sitting on the device locks for both the PF and the VF. I'm not
> a fan of having to hold 2 locks while we perform a firmware operation
> for one device, but I couldn't find anything where we would deadlock
> so it should be fine.

I agree again, it's not ideal to hold two locks.  Is it possible we
could get by without the VF lock?  If we simply check whether a VF
driver is bound (without a lock), a VF driver bind can race with the
PF sriov_set_msix_vec_count().

If the VF driver bind wins, it reads the old Table Size.  If it reads
a too-small size, it won't use all the vectors.  If it reads a
too-large size, it will try to use too many vectors and some won't
work.  But the race would be caused by a bug in the management
software, and the consequence doesn't seem *terrible*.

Bjorn
Keith Busch March 11, 2021, 7:16 p.m. UTC | #13
On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > 
> > I'm not so much worried about management software as the fact that
> > this is a vendor specific implementation detail that is shaping how
> > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > know if there are any other vendors really onboard with this sort of
> > solution.
> 
> I know this is currently vendor-specific, but I thought the value
> proposition of dynamic configuration of VFs for different clients
> sounded compelling enough that other vendors would do something
> similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> maybe that's not the case?

NVMe has a similar feature defined by the standard where a PF controller can
dynamically assign MSIx vectors to VFs. The whole thing is managed in user
space with an ioctl, though. I guess we could wire up the driver to handle it
through this sysfs interface too, but I think the protocol specific tooling is
more appropriate for nvme.
Leon Romanovsky March 11, 2021, 7:17 p.m. UTC | #14
On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > > to the series, because you liked v6 more.
> > > > >
> > > > > Thanks
> > > > >
> > > > > ---------------------------------------------------------------------------------
> > > > > Changelog
> > > > > v7:
> > > > >  * Rebase on top v5.12-rc1
> > > > >  * More english fixes
> > > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> > > >
> > > > Yeah, so I am not a fan of the series. The problem is there is only
> > > > one driver that supports this, all VFs are going to expose this sysfs,
> > > > and I don't know how likely it is that any others are going to
> > > > implement this functionality. I feel like you threw out all the
> > > > progress from v2-v6.
> > >
> > > pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static
> > > sysfs files regardless of whether the PF driver was bound.
> > >
> > > > I really feel like the big issue is that this model is broken as you
> > > > have the VFs exposing sysfs interfaces that make use of the PFs to
> > > > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > > > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > > > to find a way to address both issues.
> > > >
> > > > Maybe the compromise is to reach down into the IOV code and have it
> > > > register the sysfs interface at device creation time in something like
> > > > pci_iov_sysfs_link if the PF has the functionality present to support
> > > > it.
> > >
> > > IIUC there are two questions on the table:
> > >
> > >   1) Should the sysfs files be visible only when a PF driver that
> > >      supports MSI-X vector assignment is bound?
> > >
> > >      I think this is a cosmetic issue.  The presence of the file is
> > >      not a reliable signal to management software; it must always
> > >      tolerate files that don't exist (e.g., on old kernels) or files
> > >      that are visible but don't work (e.g., vectors may be exhausted).
> > >
> > >      If we start with the files always being visible, we should be
> > >      able to add smarts later to expose them only when the PF driver
> > >      is bound.
> > >
> > >      My concerns with pci_enable_vf_overlay() are that it uses a
> > >      little more sysfs internals than I'd like (although there are
> > >      many callers of sysfs_create_files()) and it uses
> > >      pci_get_domain_bus_and_slot(), which is generally a hack and
> > >      creates refcounting hassles.  Speaking of which, isn't v6 missing
> > >      a pci_dev_put() to match the pci_get_domain_bus_and_slot()?
> >
> > I'm not so much worried about management software as the fact that
> > this is a vendor specific implementation detail that is shaping how
> > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > know if there are any other vendors really onboard with this sort of
> > solution.
>
> I know this is currently vendor-specific, but I thought the value
> proposition of dynamic configuration of VFs for different clients
> sounded compelling enough that other vendors would do something
> similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> maybe that's not the case?

It is not the case. Any vendor who wants to compete with Mellanox
devices in large-scale clouds will be asked to implement this feature.

You can't provide reliable VM service without it.

<...>

> > I checked it again, it will make the *_msix_count files stop working.
> > In order to guarantee it cannot happen in the middle of things though
> > we are sitting on the device locks for both the PF and the VF. I'm not
> > a fan of having to hold 2 locks while we perform a firmware operation
> > for one device, but I couldn't find anything where we would deadlock
> > so it should be fine.
>
> I agree again, it's not ideal to hold two locks.  Is it possible we
> could get by without the VF lock?  If we simply check whether a VF
> driver is bound (without a lock), a VF driver bind can race with the
> PF sriov_set_msix_vec_count().

We are holding a huge number of locks simultaneously. I don't understand
why two locks are suddenly such a big deal.

Alex said that my locking scheme is correct, and I think you also said that.
Why do we need to add races and turn a correct solution into a half-baked one?

>
> If the VF driver bind wins, it reads the old Table Size.  If it reads
> a too-small size, it won't use all the vectors.  If it reads a
> too-large size, it will try to use too many vectors and some won't
> work.  But the race would be caused by a bug in the management
> software, and the consequence doesn't seem *terrible*.

It will present an incorrect picture to the user/administrator, where the
lspci output won't be aligned with the real situation.

Thanks

>
> Bjorn
Leon Romanovsky March 11, 2021, 7:21 p.m. UTC | #15
On Thu, Mar 11, 2021 at 12:16:02PM -0700, Keith Busch wrote:
> On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > >
> > > I'm not so much worried about management software as the fact that
> > > this is a vendor specific implementation detail that is shaping how
> > > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > > know if there are any other vendors really onboard with this sort of
> > > solution.
> >
> > I know this is currently vendor-specific, but I thought the value
> > proposition of dynamic configuration of VFs for different clients
> > sounded compelling enough that other vendors would do something
> > similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> > maybe that's not the case?
>
> NVMe has a similar feature defined by the standard where a PF controller can
> dynamically assign MSIx vectors to VFs. The whole thing is managed in user
> space with an ioctl, though. I guess we could wire up the driver to handle it
> through this sysfs interface too, but I think the protocol specific tooling is
> more appropriate for nvme.

We need this MSI-X thing for IB and Ethernet devices too. This is why PCI
sysfs is the best place to put it, so all SR-IOV flavours will have a sane
way to configure it.

Thanks
Alexander H Duyck March 11, 2021, 7:37 p.m. UTC | #16
On Thu, Mar 11, 2021 at 10:17 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > > to the series, because you liked v6 more.
> > > > >
> > > > > Thanks
> > > > >
> > > > > ---------------------------------------------------------------------------------
> > > > > Changelog
> > > > > v7:
> > > > >  * Rebase on top v5.12-rc1
> > > > >  * More english fixes
> > > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> > > >
> > > > Yeah, so I am not a fan of the series. The problem is there is only
> > > > one driver that supports this, all VFs are going to expose this sysfs,
> > > > and I don't know how likely it is that any others are going to
> > > > implement this functionality. I feel like you threw out all the
> > > > progress from v2-v6.
> > >
> > > pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static
> > > sysfs files regardless of whether the PF driver was bound.
> > >
> > > > I really feel like the big issue is that this model is broken as you
> > > > have the VFs exposing sysfs interfaces that make use of the PFs to
> > > > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > > > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > > > to find a way to address both issues.
> > > >
> > > > Maybe the compromise is to reach down into the IOV code and have it
> > > > register the sysfs interface at device creation time in something like
> > > > pci_iov_sysfs_link if the PF has the functionality present to support
> > > > it.
> > >
> > > IIUC there are two questions on the table:
> > >
> > >   1) Should the sysfs files be visible only when a PF driver that
> > >      supports MSI-X vector assignment is bound?
> > >
> > >      I think this is a cosmetic issue.  The presence of the file is
> > >      not a reliable signal to management software; it must always
> > >      tolerate files that don't exist (e.g., on old kernels) or files
> > >      that are visible but don't work (e.g., vectors may be exhausted).
> > >
> > >      If we start with the files always being visible, we should be
> > >      able to add smarts later to expose them only when the PF driver
> > >      is bound.
> > >
> > >      My concerns with pci_enable_vf_overlay() are that it uses a
> > >      little more sysfs internals than I'd like (although there are
> > >      many callers of sysfs_create_files()) and it uses
> > >      pci_get_domain_bus_and_slot(), which is generally a hack and
> > >      creates refcounting hassles.  Speaking of which, isn't v6 missing
> > >      a pci_dev_put() to match the pci_get_domain_bus_and_slot()?
> >
> > I'm not so much worried about management software as the fact that
> > this is a vendor specific implementation detail that is shaping how
> > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > know if there are any other vendors really onboard with this sort of
> > solution.
>
> I know this is currently vendor-specific, but I thought the value
> proposition of dynamic configuration of VFs for different clients
> sounded compelling enough that other vendors would do something
> similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> maybe that's not the case?

The problem is there are multiple ways to deal with this issue. I have
worked on parts in the past that simply advertised a fixed table size
and then only allowed for configuring the number of vectors internally
so some vectors simply couldn't be unmasked. I don't know if that was
any better than this though. It is just yet another way to do this.

> > In addition it still feels rather hacky to be modifying read-only PCIe
> > configuration space on the fly via a backdoor provided by the PF. It
> > almost feels like this should be some sort of quirk rather than a
> > standard feature for an SR-IOV VF.
>
> I agree, I'm not 100% comfortable with modifying the read-only Table
> Size register.  Maybe there's another approach that would be better?
> It *is* nice that the current approach doesn't require changes in the
> VF driver.

Changes in the VF driver would be a non-starter since the VF driver
may be running in the guest. One thing that would have been nice is if
this hardware change could have been delayed until the VF went
through something like an FLR, where it would have had to unregister
all the MSI-X vectors and then reconfigure them after the reset. It
would have allowed us to get away from needing all the extra locking.
It would then just be something the PF would configure and apply to
the VF following a reset.

> > >   2) Should a VF sysfs file use the PF to implement this?
> > >
> > >      Can you elaborate on your idea here?  I guess
> > >      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
> > >      VF, and you're thinking we could also make a "virtfnX_msix_count"
> > >      in the PF directory?  That's a really interesting idea.
> >
> > I would honestly be more comfortable if the PF owned these files
> > instead of the VFs. One of the things I didn't like about this back
> > during the V1/2 days was the fact that it gave the impression that
> > MSI-X count was something that is meant to be edited. Since then I
> > think at least the naming was changed so that it implies that this is
> > only possible due to SR-IOV.
> >
> > I also didn't like that it makes the VFs feel like they are port
> > representors rather than being actual PCIe devices. Having
> > functionality that only works when the VF driver is not loaded just
> > feels off. The VF sysfs directory feels like it is being used as a
> > subdirectory of the PF rather than being a device on its own.
>
> Moving "virtfnX_msix_count" to the PF seems like it would mitigate
> this somewhat.  I don't know how to make this work while a VF driver
> is bound without making the VF feel even less like a PCIe device,
> i.e., we won't be able to use the standard MSI-X model.

Yeah, I actually do kind of like that idea. In addition it would
address one of the things I pointed out as an issue before, as you
could place the virtfn values and the total value in the same folder
so that it is all in one central spot, rather than having to walk all
over the sysfs hierarchy to check the setting for each VF when trying
to figure out how the vectors are currently distributed.

> > > > Also we might want to double check that the PF cannot be unbound while
> > > > the VF is present. I know for a while there it was possible to remove
> > > > the PF driver while the VF was present. The Mellanox drivers may not
> > > > allow it but it might not hurt to look at taking a reference against
> > > > the PF driver if you are allocating the VF MSI-X configuration sysfs
> > > > file.
> > >
> > > Unbinding the PF driver will either remove the *_msix_count files or
> > > make them stop working.  Is that a problem?  I'm not sure we should
> > > add a magic link that prevents driver unbinding.  Seems like it would
> > > be hard for users to figure out why the driver can't be removed.
> >
> > I checked it again, it will make the *_msix_count files stop working.
> > In order to guarantee it cannot happen in the middle of things though
> > we are sitting on the device locks for both the PF and the VF. I'm not
> > a fan of having to hold 2 locks while we perform a firmware operation
> > for one device, but I couldn't find anything where we would deadlock
> > so it should be fine.
>
> I agree again, it's not ideal to hold two locks.  Is it possible we
> could get by without the VF lock?  If we simply check whether a VF
> driver is bound (without a lock), a VF driver bind can race with the
> PF sriov_set_msix_vec_count().

I wonder if we couldn't do something like a reversible version of the
kill_device function to make it so that a VF is marked as being
blocked from being probed until we come through and release it.

> If the VF driver bind wins, it reads the old Table Size.  If it reads
> a too-small size, it won't use all the vectors.  If it reads a
> too-large size, it will try to use too many vectors and some won't
> work.  But the race would be caused by a bug in the management
> software, and the consequence doesn't seem *terrible*.

Based on the idea I just mentioned above, it might make sense to
have an option to mark a device as "stale" instead of "dead",
or maybe add an extra bit that flags it as only "mostly
dead" with a recovery expected soon.

Then the flow for this could be changed: we take the VF lock and
mark it as "stale" to prevent any driver binding, and then we can
release the VF lock. Next we would perform the PF operation telling it
to update the VF.  Then we spin on the VF waiting for the stale data
to be updated, and once that happens we can pop the indication that the
device is "stale", freeing it for use. If something comes along that
decides it wants to kill it, then it just clears the "stale" bit or
"mostly dead" flag, indicating that this is in fact "dead" and needs to
be unloaded, which breaks the loop.
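
Roughly, as a sketch (none of the "stale" helpers below exist today;
they are just placeholders for the idea):

/*
 * Sketch only: set_vf_stale(), clear_vf_stale(), pf_request_msix_update()
 * and wait_for_vf_msix_update() are hypothetical helpers used to show the
 * flow, not existing kernel interfaces.
 */
#include <linux/device.h>
#include <linux/pci.h>

void set_vf_stale(struct pci_dev *vf);
void clear_vf_stale(struct pci_dev *vf);
int pf_request_msix_update(struct pci_dev *pf, struct pci_dev *vf, int count);
int wait_for_vf_msix_update(struct pci_dev *vf, int count);

static int set_vf_msix_count_stale(struct pci_dev *pf, struct pci_dev *vf,
                                   int count)
{
        int ret;

        /* Hold the VF lock only long enough to mark it "stale". */
        device_lock(&vf->dev);
        if (vf->dev.driver) {
                device_unlock(&vf->dev);
                return -EBUSY;
        }
        set_vf_stale(vf);               /* blocks any driver from probing */
        device_unlock(&vf->dev);

        /* Ask the PF/firmware to update the VF, without holding the VF lock. */
        ret = pf_request_msix_update(pf, vf, count);

        /* Wait until the VF reflects the new Table Size, then release it. */
        if (!ret)
                ret = wait_for_vf_msix_update(vf, count);
        clear_vf_stale(vf);
        return ret;
}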

I feel like that might be a much better fit for the SR-IOV model as it
would be the PF directly disabling, modifying, and then restoring a VF
rather than requesting things from the VF itself.

Thoughts?
Leon Romanovsky March 11, 2021, 7:51 p.m. UTC | #17
On Thu, Mar 11, 2021 at 11:37:28AM -0800, Alexander Duyck wrote:
> On Thu, Mar 11, 2021 at 10:17 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > >
> > > > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > > > to the series, because you liked v6 more.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > ---------------------------------------------------------------------------------
> > > > > > Changelog
> > > > > > v7:
> > > > > >  * Rebase on top v5.12-rc1
> > > > > >  * More english fixes
> > > > > >  * Returned to static sysfs creation model as was implemented in v0/v1.

<...>

> > > representors rather than being actual PCIe devices. Having
> > > functionality that only works when the VF driver is not loaded just
> > > feels off. The VF sysfs directory feels like it is being used as a
> > > subdirectory of the PF rather than being a device on its own.
> >
> > Moving "virtfnX_msix_count" to the PF seems like it would mitigate
> > this somewhat.  I don't know how to make this work while a VF driver
> > is bound without making the VF feel even less like a PCIe device,
> > i.e., we won't be able to use the standard MSI-X model.
>
> Yeah, I actually do kind of like that idea. In addition it would
> address one of the things I pointed out as an issue before as you
> could place the virtfn values that the total value in the same folder
> so that it is all in one central spot rather than having to walk all
> over the sysfs hierarchy to check the setting for each VF when trying
> to figure out how the vectors are currently distributed.

The user binds a specific VF with a specific PCI ID to a VM, so instead of
changing the MSI-X table for that specific VF, he will need to translate
from virtfn25_msix_count to the PCI ID.

I also gave an example from my system, where I have many PFs and the VF
function numbers are not distributed nicely. On that system virtfn25_msix_count
won't translate to AA:BB:CC.25 but to something else.
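
Just to illustrate the translation (sketch only, using the existing
in-kernel helpers; the printed format is the usual domain:bus:device.function):

#include <linux/pci.h>

/*
 * The VF's bus/devfn is derived from the PF's First VF Offset and VF Stride,
 * so "virtfn25" can land on a different bus and an unrelated function number.
 */
static void print_vf_address(struct pci_dev *pf, int vf_id)
{
        int bus = pci_iov_virtfn_bus(pf, vf_id);
        int devfn = pci_iov_virtfn_devfn(pf, vf_id);

        pci_info(pf, "virtfn%d is %04x:%02x:%02x.%d\n", vf_id,
                 pci_domain_nr(pf->bus), bus, PCI_SLOT(devfn),
                 PCI_FUNC(devfn));
}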

Thanks
Alexander H Duyck March 11, 2021, 8:11 p.m. UTC | #18
On Thu, Mar 11, 2021 at 11:51 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Mar 11, 2021 at 11:37:28AM -0800, Alexander Duyck wrote:
> > On Thu, Mar 11, 2021 at 10:17 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > > On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > > > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > >
> > > > > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > > > > to the series, because you liked v6 more.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > ---------------------------------------------------------------------------------
> > > > > > > Changelog
> > > > > > > v7:
> > > > > > >  * Rebase on top v5.12-rc1
> > > > > > >  * More english fixes
> > > > > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
>
> <...>
>
> > > > representors rather than being actual PCIe devices. Having
> > > > functionality that only works when the VF driver is not loaded just
> > > > feels off. The VF sysfs directory feels like it is being used as a
> > > > subdirectory of the PF rather than being a device on its own.
> > >
> > > Moving "virtfnX_msix_count" to the PF seems like it would mitigate
> > > this somewhat.  I don't know how to make this work while a VF driver
> > > is bound without making the VF feel even less like a PCIe device,
> > > i.e., we won't be able to use the standard MSI-X model.
> >
> > Yeah, I actually do kind of like that idea. In addition it would
> > address one of the things I pointed out as an issue before as you
> > could place the virtfn values that the total value in the same folder
> > so that it is all in one central spot rather than having to walk all
> > over the sysfs hierarchy to check the setting for each VF when trying
> > to figure out how the vectors are currently distributed.
>
> User binds specific VF with specific PCI ID to VM, so instead of
> changing MSI-X table for that specific VF, he will need to translate
> from virtfn25_msix_count to PCI ID.

Wouldn't that just be a matter of changing the naming so that the PCI
ID was present in the virtfn name?

> I also gave an example of my system where I have many PFs and VFs
> function numbers are not distributed nicely. On that system virtfn25_msix_count
> won't translate to AA:BB:CC.25 but to something else.

That isn't too surprising since normally we only support 7 functions
per device. I am okay with not using the name virtfnX. If you wanted
to embed the bus, device, func in the naming scheme that would work
for me too.

Really, in general, a naming scheme that just uses a logical number has
probably never provided all that much value. There may be an argument
to be made for renaming the virtfn symlinks to include bus, device,
and function instead.
Jason Gunthorpe March 11, 2021, 8:19 p.m. UTC | #19
On Thu, Mar 11, 2021 at 11:37:28AM -0800, Alexander Duyck wrote:


> Then the flow for this could be changed where we take the VF lock and
> mark it as "stale" to prevent any driver binding and then we can
> release the VF lock. Next we would perform the PF operation telling it
> to update the VF.  Then we spin on the VF waiting for the stale data
> to be updated and once that happens we can pop the indication that the
> device is "stale" freeing it for use. 

I always get leery when people propose to open-code locking constructs
:\

There is already an existing lock to prevent probe(): the
device_lock() mutex on the VF. With no driver bound there is not much
issue with holding it over the HW activity.

This lock is normally held around the entire probe() and remove()
functions, which already involve huge amounts of HW activity.

We don't need to invent new locks and new complexity for something
that is trivially solved already.
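
Roughly this pattern (a sketch, not the exact code from the series;
pf_driver_set_msix_count() stands in for whatever callback the PF
driver provides):

#include <linux/device.h>
#include <linux/pci.h>

int pf_driver_set_msix_count(struct pci_dev *pf, struct pci_dev *vf, int count);

static int sriov_vf_msix_count_set(struct pci_dev *vf, int count)
{
        struct pci_dev *pf = pci_physfn(vf);
        int ret;

        /* device_lock() already serializes against probe()/remove(). */
        device_lock(&vf->dev);
        if (vf->dev.driver) {
                ret = -EBUSY;           /* a driver is already bound */
                goto out_vf;
        }

        device_lock(&pf->dev);
        ret = pf_driver_set_msix_count(pf, vf, count);
        device_unlock(&pf->dev);
out_vf:
        device_unlock(&vf->dev);
        return ret;
}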

Jason
Jason Gunthorpe March 11, 2021, 8:22 p.m. UTC | #20
On Thu, Mar 11, 2021 at 12:16:02PM -0700, Keith Busch wrote:
> On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > 
> > > I'm not so much worried about management software as the fact that
> > > this is a vendor specific implementation detail that is shaping how
> > > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > > know if there are any other vendors really onboard with this sort of
> > > solution.
> > 
> > I know this is currently vendor-specific, but I thought the value
> > proposition of dynamic configuration of VFs for different clients
> > sounded compelling enough that other vendors would do something
> > similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> > maybe that's not the case?
> 
> NVMe has a similar feature defined by the standard where a PF controller can
> dynamically assign MSIx vectors to VFs. The whole thing is managed in user
> space with an ioctl, though. I guess we could wire up the driver to handle it
> through this sysfs interface too, but I think the protocol specific tooling is
> more appropriate for nvme.

Really? Why not share a common uAPI?

Do you have a standards reference for this?

Jason
Jason Gunthorpe March 11, 2021, 8:31 p.m. UTC | #21
On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:

> > In addition it still feels rather hacky to be modifying read-only PCIe
> > configuration space on the fly via a backdoor provided by the PF. It
> > almost feels like this should be some sort of quirk rather than a
> > standard feature for an SR-IOV VF.
> 
> I agree, I'm not 100% comfortable with modifying the read-only Table
> Size register.  Maybe there's another approach that would be better?
> It *is* nice that the current approach doesn't require changes in the
> VF driver.

Right, that is the whole point here - if we wanted to change every
driver we could do that in some other way. Changing the table size
register is the only way to keep existing drivers working.

Jason
Keith Busch March 11, 2021, 8:50 p.m. UTC | #22
On Thu, Mar 11, 2021 at 04:22:34PM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 11, 2021 at 12:16:02PM -0700, Keith Busch wrote:
> > On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> > > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > > 
> > > > I'm not so much worried about management software as the fact that
> > > > this is a vendor specific implementation detail that is shaping how
> > > > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > > > know if there are any other vendors really onboard with this sort of
> > > > solution.
> > > 
> > > I know this is currently vendor-specific, but I thought the value
> > > proposition of dynamic configuration of VFs for different clients
> > > sounded compelling enough that other vendors would do something
> > > similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> > > maybe that's not the case?
> > 
> > NVMe has a similar feature defined by the standard where a PF controller can
> > dynamically assign MSIx vectors to VFs. The whole thing is managed in user
> > space with an ioctl, though. I guess we could wire up the driver to handle it
> > through this sysfs interface too, but I think the protocol specific tooling is
> > more appropriate for nvme.
> 
> Really? Why not share a common uAPI?

We associate interrupt vectors with other dynamically assigned
nvme-specific resources (IO queues), and these are not always allocated 1:1.
A common uAPI for MSI-X only gets us halfway to configuring the VFs for
that particular driver.
 
> Do you have a standards reference for this?

Yes, sections 5.22 and 8.5 from this spec:

  https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf

An example of open source tooling implementing this is nvme-cli's
"nvme virt-mgmt" command.
Jason Gunthorpe March 11, 2021, 9:44 p.m. UTC | #23
On Fri, Mar 12, 2021 at 05:50:34AM +0900, Keith Busch wrote:
> On Thu, Mar 11, 2021 at 04:22:34PM -0400, Jason Gunthorpe wrote:
> > On Thu, Mar 11, 2021 at 12:16:02PM -0700, Keith Busch wrote:
> > > On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> > > > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > > > 
> > > > > I'm not so much worried about management software as the fact that
> > > > > this is a vendor specific implementation detail that is shaping how
> > > > > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > > > > know if there are any other vendors really onboard with this sort of
> > > > > solution.
> > > > 
> > > > I know this is currently vendor-specific, but I thought the value
> > > > proposition of dynamic configuration of VFs for different clients
> > > > sounded compelling enough that other vendors would do something
> > > > similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> > > > maybe that's not the case?
> > > 
> > > NVMe has a similar feature defined by the standard where a PF controller can
> > > dynamically assign MSIx vectors to VFs. The whole thing is managed in user
> > > space with an ioctl, though. I guess we could wire up the driver to handle it
> > > through this sysfs interface too, but I think the protocol specific tooling is
> > > more appropriate for nvme.
> > 
> > Really? Why not share a common uAPI?
> 
> We associate interrupt vectors with other dynamically assigned nvme
> specific resources (IO queues), and these are not always allocated 1:1.

mlx5 is doing that too: the end driver gets to assign the MSI vector
to a CPU and then dynamically attach queues to it.

I'm not sure I get why nvme would want to link those two things, as the
CPU assignment and queue attach could happen in a VM while the MSI-X
configuration should be in the host?

> A common uAPI for MSIx only gets us half way to configuring the VFs for
> that particular driver.
>
> > Do you have a standards reference for this?
> 
> Yes, sections 5.22 and 8.5 from this spec:
> 
>   https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf
> 
> An example of open source tooling implementing this is nvme-cli's
> "nvme virt-mgmt" command.

Oh it is fascinating! 8.5.2 looks like exactly the same thing being
implemented here for mlx5, including changing the "Read only" config
space value.

Still confused why this shouldn't be the same API??

Jason
Alexander H Duyck March 11, 2021, 9:49 p.m. UTC | #24
On Thu, Mar 11, 2021 at 12:19 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Mar 11, 2021 at 11:37:28AM -0800, Alexander Duyck wrote:
>
>
> > Then the flow for this could be changed where we take the VF lock and
> > mark it as "stale" to prevent any driver binding and then we can
> > release the VF lock. Next we would perform the PF operation telling it
> > to update the VF.  Then we spin on the VF waiting for the stale data
> > to be updated and once that happens we can pop the indication that the
> > device is "stale" freeing it for use.
>
> I always get leary when people propose to open code locking constructs
> :\

I'm not suggesting we replace the lock. It is more about
revoking the VF. What we are doing is essentially rewriting the PCIe
config of the VF, so in my mind it makes sense to take sort of an RCU
approach where the old one is readable, but not something a new driver
can be bound to.

> There is already an existing lock to prevent probe() it is the
> device_lock() mutex on the VF. With no driver bound there is not much
> issue to hold it over the HW activity.

Yes. But sitting on those locks also has side effects, such as
preventing us from taking any other actions like disabling SR-IOV.
One concern I have is that somebody else may try to implement this in
the future without a synchronous setup, or worse yet with one that
takes a long time to process a request because of a slow controller.
In that case it would be preferable to just post the message to the PF
and then have the thread spin and wait on the VF to be updated, rather
than block on the PF while sitting on two locks.

> This lock is normally held around the entire probe() and remove()
> function which has huge amounts of HW activity already.

Yes, but usually that activity is time bound. You are usually reading
values and processing them in a timely fashion. In the case of probe
we even have cases where we have to defer because we don't want to
hold these locks for too long.

> We don't need to invent new locks and new complexity for something
> that is trivially solved already.

I am not wanting a new lock. What I am wanting is a way to mark the VF
as being stale/offline while we are performing the update. With that
we would be able to apply similar logic to any changes in the future.
Jason Gunthorpe March 11, 2021, 11:20 p.m. UTC | #25
On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > We don't need to invent new locks and new complexity for something
> > that is trivially solved already.
> 
> I am not wanting a new lock. What I am wanting is a way to mark the VF
> as being stale/offline while we are performing the update. With that
> we would be able to apply similar logic to any changes in the future.

I think we should hold off doing this until someone comes up with HW
that needs it. The response time here is microseconds, it is not worth
any complexity

Jason
Alexander H Duyck March 12, 2021, 2:53 a.m. UTC | #26
On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > We don't need to invent new locks and new complexity for something
> > > that is trivially solved already.
> >
> > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > as being stale/offline while we are performing the update. With that
> > we would be able to apply similar logic to any changes in the future.
>
> I think we should hold off doing this until someone comes up with HW
> that needs it. The response time here is microseconds, it is not worth
> any complexity

I disagree. Take a look at section 8.5.3 in the NVMe document that was
linked to earlier:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf

This is exactly what they are doing and I think it makes a ton of
sense. Basically the VF has to be taken "offline" before you are
allowed to start changing resources on it. It would consist of just
one extra sysfs file, and it has additional uses beyond the
configuration of MSI-X vectors.

We would just have to add one additional sysfs file, maybe modify the
"dead" device flag to be "offline", and we could make this work with
minimal changes to the patch set you already have. We could probably
toggle to "offline" while holding just the VF lock. To toggle the VF
back to being "online" we might need to take the PF device lock since
it is ultimately responsible for guaranteeing we have the resources.

Another way to think of this is that we are essentially pulling a
device back after we have already allocated the VFs and we are
reconfiguring it before pushing it back out for usage. Having a flag
that we could set on the VF device to say it is "under
construction"/modification/"not ready for use" would be quite useful I
would think.
Leon Romanovsky March 12, 2021, 6:32 a.m. UTC | #27
On Thu, Mar 11, 2021 at 06:53:16PM -0800, Alexander Duyck wrote:
> On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > > We don't need to invent new locks and new complexity for something
> > > > that is trivially solved already.
> > >
> > > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > > as being stale/offline while we are performing the update. With that
> > > we would be able to apply similar logic to any changes in the future.
> >
> > I think we should hold off doing this until someone comes up with HW
> > that needs it. The response time here is microseconds, it is not worth
> > any complexity

<...>

> Another way to think of this is that we are essentially pulling a
> device back after we have already allocated the VFs and we are
> reconfiguring it before pushing it back out for usage. Having a flag
> that we could set on the VF device to say it is "under
> construction"/modification/"not ready for use" would be quite useful I
> would think.

It is not a simple flag change but a change to the PCI state machine, which is
far more complex than holding two locks or the call to sysfs_create_file in
a loop that made Bjorn nervous.

I want to remind again that the suggestion here has nothing to do with
the real use case of SR-IOV capable devices in Linux.

The flow is (a rough userspace sketch follows below):
1. Disable SR-IOV driver autoprobe
2. Create as many VFs as possible
3. Wait for a request from the user to get a VM
4. Change the MSI-X table according to what was requested in item #3
5. Bind the ready-to-go VF to the VM
6. Inform the user that the VM is ready

The destroy flow includes VM destroy and unbind.
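
Roughly, in userspace terms (paths and the VF count are illustrative;
sriov_vf_msix_count is the attribute proposed by this series, the rest
already exist; error handling is trimmed):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void sysfs_write(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0)
                return;
        write(fd, val, strlen(val));
        close(fd);
}

int main(void)
{
        /* 1. Disable SR-IOV driver autoprobe on the PF. */
        sysfs_write("/sys/bus/pci/devices/0000:01:00.0/sriov_drivers_autoprobe", "0");
        /* 2. Create as many VFs as possible. */
        sysfs_write("/sys/bus/pci/devices/0000:01:00.0/sriov_numvfs", "255");
        /* 3. ... wait for a user request for a VM ... */
        /* 4. Give the chosen VF the MSI-X count the VM asked for. */
        sysfs_write("/sys/bus/pci/devices/0000:01:00.2/sriov_vf_msix_count", "32");
        /* 5. Bind the ready-to-go VF to vfio-pci so it can be handed to the VM. */
        sysfs_write("/sys/bus/pci/devices/0000:01:00.2/driver_override", "vfio-pci");
        sysfs_write("/sys/bus/pci/drivers_probe", "0000:01:00.2");
        /* 6. ... tell the user the VM is ready ... */
        return 0;
}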

Let's focus on solutions for real problems instead of trying to solve theoretical
cases that are not going to be tested and deployed.

Thanks
Jason Gunthorpe March 12, 2021, 1 p.m. UTC | #28
On Thu, Mar 11, 2021 at 06:53:16PM -0800, Alexander Duyck wrote:
> On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > > We don't need to invent new locks and new complexity for something
> > > > that is trivially solved already.
> > >
> > > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > > as being stale/offline while we are performing the update. With that
> > > we would be able to apply similar logic to any changes in the future.
> >
> > I think we should hold off doing this until someone comes up with HW
> > that needs it. The response time here is microseconds, it is not worth
> > any complexity
> 
> I disagree. Take a look at section 8.5.3 in the NVMe document that was
> linked to earlier:
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf
> 
> This is exactly what they are doing and I think it makes a ton of
> sense. Basically the VF has to be taken "offline" before you are

AFAIK this is internal to the NVMe command protocol, not something we
can expose generically to the OS. mlx5 has no protocol to "offline" an
already running VF, for instance.

The way Leon has it arranged, the online/offline scheme has no
relevance because there is no driver or guest attached to the VF to
see the online/offline transition.

I wonder if people actually do offline an NVMe VF from a hypervisor?
Seems pretty weird.

> Another way to think of this is that we are essentially pulling a
> device back after we have already allocated the VFs and we are
> reconfiguring it before pushing it back out for usage. Having a flag
> that we could set on the VF device to say it is "under
> construction"/modification/"not ready for use" would be quite useful I
> would think.

Well, yes, the whole SRIOV VF lifecycle is a pretty bad fit for the
modern world.

I'd rather not see a half-job on a lifecycle model by hacking in
random flags. It needs a proper comprehensive design.

Jason
Keith Busch March 12, 2021, 1:36 p.m. UTC | #29
On Fri, Mar 12, 2021 at 09:00:17AM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 11, 2021 at 06:53:16PM -0800, Alexander Duyck wrote:
> > On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > > > We don't need to invent new locks and new complexity for something
> > > > > that is trivially solved already.
> > > >
> > > > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > > > as being stale/offline while we are performing the update. With that
> > > > we would be able to apply similar logic to any changes in the future.
> > >
> > > I think we should hold off doing this until someone comes up with HW
> > > that needs it. The response time here is microseconds, it is not worth
> > > any complexity
> > 
> > I disagree. Take a look at section 8.5.3 in the NVMe document that was
> > linked to earlier:
> > https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf
> > 
> > This is exactly what they are doing and I think it makes a ton of
> > sense. Basically the VF has to be taken "offline" before you are
> 
> AFAIK this is internal to the NVMe command protocol, not something we
> can expose generically to the OS. mlx5 has no protocol to "offline" an
> already running VF, for instance.
> 
> The way Leon has it arranged that online/offline scheme has no
> relevance because there is no driver or guest attached to the VF to
> see the online/offline transition.
> 
> I wonder if people actually do offline a NVMe VF from a hypervisor?
> Seems pretty weird.

I agree, that would be weird. I'm pretty sure you can't modify these resources
once you attach the nvme VF to a guest. The resource allocation needs to happen
prior to that.
 
> > Another way to think of this is that we are essentially pulling a
> > device back after we have already allocated the VFs and we are
> > reconfiguring it before pushing it back out for usage. Having a flag
> > that we could set on the VF device to say it is "under
> > construction"/modification/"not ready for use" would be quite useful I
> > would think.
> 
> Well, yes, the whole SRIOV VF lifecycle is a pretty bad fit for the
> modern world.
> 
> I'd rather not see a half-job on a lifecycle model by hacking in
> random flags. It needs a proper comprehensive design.
> 
> Jason
Alexander H Duyck March 12, 2021, 4:59 p.m. UTC | #30
On Thu, Mar 11, 2021 at 10:32 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Mar 11, 2021 at 06:53:16PM -0800, Alexander Duyck wrote:
> > On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > > > We don't need to invent new locks and new complexity for something
> > > > > that is trivially solved already.
> > > >
> > > > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > > > as being stale/offline while we are performing the update. With that
> > > > we would be able to apply similar logic to any changes in the future.
> > >
> > > I think we should hold off doing this until someone comes up with HW
> > > that needs it. The response time here is microseconds, it is not worth
> > > any complexity
>
> <...>
>
> > Another way to think of this is that we are essentially pulling a
> > device back after we have already allocated the VFs and we are
> > reconfiguring it before pushing it back out for usage. Having a flag
> > that we could set on the VF device to say it is "under
> > construction"/modification/"not ready for use" would be quite useful I
> > would think.
>
> It is not simple flag change, but change of PCI state machine, which is
> far more complex than holding two locks or call to sysfs_create_file in
> the loop that made Bjorn nervous.
>
> I want to remind again that the suggestion here has nothing to do with
> the real use case of SR-IOV capable devices in the Linux.
>
> The flow is:
> 1. Disable SR-IOV driver autoprobe
> 2. Create as much as possible VFs
> 3. Wait for request from the user to get VM
> 4. Change MSI-X table according to requested in item #3
> 5. Bind ready to go VF to VM
> 6. Inform user about VM readiness
>
> The destroy flow includes VM destroy and unbind.
>
> Let's focus on solutions for real problems instead of trying to solve theoretical
> cases that are not going to be tested and deployed.
>
> Thanks

So part of the problem with this all along has been that you are only
focused on how you are going to use this and don't think about how
somebody else might need to use or implement it. In addition there are
a number of half measures even within your own flow. In reality if we
are thinking we are going to have to reconfigure every device it might
make sense to simply block the driver from being able to load until
you have configured it. Then the SR-IOV autoprobe would be redundant
since you could use something like the "offline" flag to avoid that.

If you are okay with step 1 where you are setting a flag to prevent
driver auto probing why is it so much more overhead to set a bit
blocking drivers from loading entirely while you are changing the
config space? Sitting on two locks and assuming a synchronous
operation is assuming a lot about the hardware and how this is going
to be used.

In addition it seems like the logic is that step 4 will always
succeed. What happens if, for example, you send the message to the
firmware and you don't get a response? Do you just say the request
failed and let the VF be used anyway? This is another reason why I would
be much more comfortable with the option to offline the device and
then tinker with it rather than hope that your operation can somehow
do everything in one shot.

In my mind step 4 really should be 4 steps.

1. Offline VF to reserve it for modification
2. Submit request for modification
3. Verify modification has occurred, reset if needed.
4. Online VF

Doing it in that order allows for handling many more scenarios
including those where perhaps step 2 actually consists of several
changes to support any future extensions that are needed. Splitting
steps 2 and 3 allows for an asynchronous event where you can wait if
firmware takes an excessively long time, or if step 2 somehow fails
you can then repeat or revert it to get back to a consistent state.
Lastly by splitting out the onlining step you can avoid potentially
releasing a broken VF to be reserved if there is some sort of
unrecoverable error between steps 2 and 3.
Jason Gunthorpe March 12, 2021, 5:03 p.m. UTC | #31
On Fri, Mar 12, 2021 at 08:59:38AM -0800, Alexander Duyck wrote:

> Lastly by splitting out the onlining step you can avoid potentially
> releasing a broken VF to be reserved if there is some sort of
> unrecoverable error between steps 2 and 3.

If the PF FW gets in a state where it can't respond to a trivial
command like this then *every* VF is broken. It's not a useful scenario
to focus on.

Jason
Leon Romanovsky March 12, 2021, 6:34 p.m. UTC | #32
On Fri, Mar 12, 2021 at 01:03:19PM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 12, 2021 at 08:59:38AM -0800, Alexander Duyck wrote:
>
> > Lastly by splitting out the onlining step you can avoid potentially
> > releasing a broken VF to be reserved if there is some sort of
> > unrecoverable error between steps 2 and 3.
>
> If the PF FW gets in a state that it can't respond to a trivial
> command like this then *every* VF is broken. It not a useful scenario
> to focus on.

Right, many VF initial configurations are done through the PF. It MUST work.

Thanks

>
> Jason
Leon Romanovsky March 12, 2021, 6:41 p.m. UTC | #33
On Fri, Mar 12, 2021 at 08:59:38AM -0800, Alexander Duyck wrote:
> On Thu, Mar 11, 2021 at 10:32 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Thu, Mar 11, 2021 at 06:53:16PM -0800, Alexander Duyck wrote:
> > > On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > > > > We don't need to invent new locks and new complexity for something
> > > > > > that is trivially solved already.
> > > > >
> > > > > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > > > > as being stale/offline while we are performing the update. With that
> > > > > we would be able to apply similar logic to any changes in the future.
> > > >
> > > > I think we should hold off doing this until someone comes up with HW
> > > > that needs it. The response time here is microseconds, it is not worth
> > > > any complexity
> >
> > <...>
> >
> > > Another way to think of this is that we are essentially pulling a
> > > device back after we have already allocated the VFs and we are
> > > reconfiguring it before pushing it back out for usage. Having a flag
> > > that we could set on the VF device to say it is "under
> > > construction"/modification/"not ready for use" would be quite useful I
> > > would think.
> >
> > It is not simple flag change, but change of PCI state machine, which is
> > far more complex than holding two locks or call to sysfs_create_file in
> > the loop that made Bjorn nervous.
> >
> > I want to remind again that the suggestion here has nothing to do with
> > the real use case of SR-IOV capable devices in the Linux.
> >
> > The flow is:
> > 1. Disable SR-IOV driver autoprobe
> > 2. Create as much as possible VFs
> > 3. Wait for request from the user to get VM
> > 4. Change MSI-X table according to requested in item #3
> > 5. Bind ready to go VF to VM
> > 6. Inform user about VM readiness
> >
> > The destroy flow includes VM destroy and unbind.
> >
> > Let's focus on solutions for real problems instead of trying to solve theoretical
> > cases that are not going to be tested and deployed.
> >
> > Thanks
>
> So part of the problem with this all along has been that you are only
> focused on how you are going to use this and don't think about how
> somebody else might need to use or implement it. In addition there are
> a number of half measures even within your own flow. In reality if we
> are thinking we are going to have to reconfigure every device it might
> make sense to simply block the driver from being able to load until
> you have configured it. Then the SR-IOV autoprobe would be redundant
> since you could use something like the "offline" flag to avoid that.
>
> If you are okay with step 1 where you are setting a flag to prevent
> driver auto probing why is it so much more overhead to set a bit
> blocking drivers from loading entirely while you are changing the
> config space? Sitting on two locks and assuming a synchronous
> operation is assuming a lot about the hardware and how this is going
> to be used.
>
> In addition it seems like the logic is that step 4 will always
> succeed. What happens if for example you send the message to the
> firmware and you don't get a response? Do you just say the request
> failed let the VF be used anyway? This is another reason why I would
> be much more comfortable with the option to offline the device and
> then tinker with it rather than hope that your operation can somehow
> do everything in one shot.
>
> In my mind step 4 really should be 4 steps.
>
> 1. Offline VF to reserve it for modification
> 2. Submit request for modification
> 3. Verify modification has occurred, reset if needed.
> 4. Online VF
>
> Doing it in that order allows for handling many more scenarios
> including those where perhaps step 2 actually consists of several
> changes to support any future extensions that are needed. Splitting
> step 2 and 3 allows for an asynchronous event where you can wait if
> firmware takes an excessively long time, or if step 2 somehow fails
> you can then repeat or revert it to get back to a consistent state.
> Lastly by splitting out the onlining step you can avoid potentially
> releasing a broken VF to be reserved if there is some sort of
> unrecoverable error between steps 2 and 3.

In all scenarios users need to disable autoprobe and unbind drivers.
This is actually the "offline" mode. Any SR-IOV capable HW that needs
this asynchronous probe will be able to extend the current mechanism.

However, I don't expect to see in the foreseeable future any new SR-IOV
player that provides a brand new high-speed, high-performance,
customizable SR-IOV card and needs asynchronous probe.

BTW, even NVMe, with the "offline" mode in its spec, implemented
the flow exactly like I'm proposing here.

Thanks
Bjorn Helgaas March 25, 2021, 5:21 p.m. UTC | #34
On Thu, Mar 11, 2021 at 05:44:24PM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 12, 2021 at 05:50:34AM +0900, Keith Busch wrote:
> > On Thu, Mar 11, 2021 at 04:22:34PM -0400, Jason Gunthorpe wrote:
> > > On Thu, Mar 11, 2021 at 12:16:02PM -0700, Keith Busch wrote:
> > > > On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> > > > > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > > > > 
> > > > > > I'm not so much worried about management software as the
> > > > > > fact that this is a vendor specific implementation detail
> > > > > > that is shaping how the kernel interfaces are meant to
> > > > > > work. Other than the mlx5 I don't know if there are any
> > > > > > other vendors really onboard with this sort of solution.
> > > > > 
> > > > > I know this is currently vendor-specific, but I thought the
> > > > > value proposition of dynamic configuration of VFs for
> > > > > different clients sounded compelling enough that other
> > > > > vendors would do something similar.  But I'm not an SR-IOV
> > > > > guy and have no vendor insight, so maybe that's not the
> > > > > case?
> > > > 
> > > > NVMe has a similar feature defined by the standard where a PF
> > > > controller can dynamically assign MSIx vectors to VFs. The
> > > > whole thing is managed in user space with an ioctl, though. I
> > > > guess we could wire up the driver to handle it through this
> > > > sysfs interface too, but I think the protocol specific tooling
> > > > is more appropriate for nvme.
> > > 
> > > Really? Why not share a common uAPI?
> > 
> > We associate interrupt vectors with other dynamically assigned
> > nvme specific resources (IO queues), and these are not always
> > allocated 1:1.
> 
> mlx5 is doing that too, the end driver gets to assign the MSI vector
> to a CPU and then dynamically attach queues to it.
> 
> I'm not sure I get why nvme would want to link those two things as
> the CPU assignment and queue attach could happen in a VM while the
> MSIX should be in the host?
> 
> > A common uAPI for MSIx only gets us half way to configuring the
> > VFs for that particular driver.
> >
> > > Do you have a standards reference for this?
> > 
> > Yes, sections 5.22 and 8.5 from this spec:
> > 
> >   https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf
> > 
> > An example of open source tooling implementing this is nvme-cli's
> > "nvme virt-mgmt" command.
> 
> Oh it is fascinating! 8.5.2 looks like exactly the same thing being
> implemented here for mlx5, including changing the "Read only" config
> space value
> 
> Still confused why this shouldn't be the same API??

NVMe and mlx5 have basically identical functionality in this respect.
Other devices and vendors will likely implement similar functionality.
It would be ideal if we had an interface generic enough to support
them all.

Is the mlx5 interface proposed here sufficient to support the NVMe
model?  I think it's close, but not quite, because the NVMe
"offline" state isn't explicitly visible in the mlx5 model.

I'd like to see an argument that nvme-cli *could* be implemented on
top of this.  nvme-cli uses an ioctl and we may not want to
reimplement it with a new interface, but if Leon's interface is
*capable* of supporting the NVMe model, it's a good clue that it may
also work for future devices.

If this isn't quite enough to support the NVMe model, what would it
take to get there?

Bjorn
Jason Gunthorpe March 25, 2021, 5:36 p.m. UTC | #35
On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:

> NVMe and mlx5 have basically identical functionality in this respect.
> Other devices and vendors will likely implement similar functionality.
> It would be ideal if we had an interface generic enough to support
> them all.
> 
> Is the mlx5 interface proposed here sufficient to support the NVMe
> model?  I think it's close, but not quite, because the the NVMe
> "offline" state isn't explicitly visible in the mlx5 model.

I thought Keith basically said "offline" wasn't really useful as a
distinct idea. It is an artifact of nvme being a standards body
divorced from the operating system.

In linux offline and no driver attached are the same thing, you'd
never want an API to make a nvme device with a driver attached offline
because it would break the driver.

So I think it is good as is (well one of the 8 versions anyhow).

Keith didn't go into detail why the queue allocations in nvme were any
different than the queue allocations in mlx5. I expect they can
probably work the same where the # of interrupts is an upper bound on
the # of CPUs that can get queues and the device, once instantiated,
could be configured for the number of queues to actually operate, if
it wants.

Jason
Bjorn Helgaas March 25, 2021, 6:20 p.m. UTC | #36
On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> 
> > NVMe and mlx5 have basically identical functionality in this respect.
> > Other devices and vendors will likely implement similar functionality.
> > It would be ideal if we had an interface generic enough to support
> > them all.
> > 
> > Is the mlx5 interface proposed here sufficient to support the NVMe
> > model?  I think it's close, but not quite, because the the NVMe
> > "offline" state isn't explicitly visible in the mlx5 model.
> 
> I thought Keith basically said "offline" wasn't really useful as a
> distinct idea. It is an artifact of nvme being a standards body
> divorced from the operating system.
> 
> In linux offline and no driver attached are the same thing, you'd
> never want an API to make a nvme device with a driver attached offline
> because it would break the driver.

I think the sticky part is that Linux driver attach is not visible to
the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
can only assign resources to a VF when the VF is offline, and the VF
is only usable when it is online.

For NVMe, software must ask the PF to make those online/offline
transitions via Secondary Controller Offline and Secondary Controller
Online commands [1].  How would this be integrated into this sysfs
interface?

> So I think it is good as is (well one of the 8 versions anyhow).
> 
> Keith didn't go into detail why the queue allocations in nvme were any
> different than the queue allocations in mlx5. I expect they can
> probably work the same where the # of interrupts is an upper bound on
> the # of CPUs that can get queues and the device, once instantiated,
> could be configured for the number of queues to actually operate, if
> it wants.

I don't really care about the queue allocations.  I don't think we
need to solve those here; we just need to make sure that what we do
here doesn't preclude NVMe queue allocations.

Bjorn

[1] NVMe 1.4a, sec 5.22
Jason Gunthorpe March 25, 2021, 6:28 p.m. UTC | #37
On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > 
> > > NVMe and mlx5 have basically identical functionality in this respect.
> > > Other devices and vendors will likely implement similar functionality.
> > > It would be ideal if we had an interface generic enough to support
> > > them all.
> > > 
> > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > model?  I think it's close, but not quite, because the the NVMe
> > > "offline" state isn't explicitly visible in the mlx5 model.
> > 
> > I thought Keith basically said "offline" wasn't really useful as a
> > distinct idea. It is an artifact of nvme being a standards body
> > divorced from the operating system.
> > 
> > In linux offline and no driver attached are the same thing, you'd
> > never want an API to make a nvme device with a driver attached offline
> > because it would break the driver.
> 
> I think the sticky part is that Linux driver attach is not visible to
> the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> can only assign resources to a VF when the VF is offline, and the VF
> is only usable when it is online.
> 
> For NVMe, software must ask the PF to make those online/offline
> transitions via Secondary Controller Offline and Secondary Controller
> Online commands [1].  How would this be integrated into this sysfs
> interface?

Either the NVMe PF driver tracks the driver attach state using a bus
notifier and mirrors it to the offline state, or it simply
offline/onlines as part of the sequence to program the MSI change.
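
The notifier variant would be something like this (sketch only:
nvme_sec_ctrl_online()/offline() are assumed wrappers around the spec's
Virtualization Management command, and a real implementation would also
check that the VF belongs to this PF):

#include <linux/device.h>
#include <linux/notifier.h>
#include <linux/pci.h>

int nvme_sec_ctrl_online(struct pci_dev *pf, struct pci_dev *vf);
int nvme_sec_ctrl_offline(struct pci_dev *pf, struct pci_dev *vf);

static int nvme_vf_bind_notify(struct notifier_block *nb,
                               unsigned long action, void *data)
{
        struct pci_dev *pdev = to_pci_dev(data);

        if (!pdev->is_virtfn)
                return NOTIFY_DONE;

        switch (action) {
        case BUS_NOTIFY_BIND_DRIVER:            /* about to probe */
                nvme_sec_ctrl_online(pci_physfn(pdev), pdev);
                break;
        case BUS_NOTIFY_UNBOUND_DRIVER:         /* driver detached */
                nvme_sec_ctrl_offline(pci_physfn(pdev), pdev);
                break;
        }
        return NOTIFY_OK;
}

static struct notifier_block nvme_vf_nb = {
        .notifier_call = nvme_vf_bind_notify,
};

/* Registered from the PF driver, e.g. in its sriov_configure() path: */
static int nvme_register_vf_notifier(void)
{
        return bus_register_notifier(&pci_bus_type, &nvme_vf_nb);
}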

I don't see why we need any additional modeling of this behavior. 

What would be the point of onlining a device without a driver?

Jason
Keith Busch March 25, 2021, 6:31 p.m. UTC | #38
On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> 
> > NVMe and mlx5 have basically identical functionality in this respect.
> > Other devices and vendors will likely implement similar functionality.
> > It would be ideal if we had an interface generic enough to support
> > them all.
> > 
> > Is the mlx5 interface proposed here sufficient to support the NVMe
> > model?  I think it's close, but not quite, because the the NVMe
> > "offline" state isn't explicitly visible in the mlx5 model.
> 
> I thought Keith basically said "offline" wasn't really useful as a
> distinct idea. It is an artifact of nvme being a standards body
> divorced from the operating system.

I think that was someone else who said that.

FWIW, the nvme "offline" state just means a driver can't use the nvme
capabilities of the device. You can bind a driver to it if you want, but
no IO will be possible, so it's fine if you bind your VF to something
like vfio prior to starting a VM, or just not have a driver bound to
anything during the initial resource assignment.
 
> In linux offline and no driver attached are the same thing, you'd
> never want an API to make a nvme device with a driver attached offline
> because it would break the driver.
> 
> So I think it is good as is (well one of the 8 versions anyhow).
> 
> Keith didn't go into detail why the queue allocations in nvme were any
> different than the queue allocations in mlx5. 

The NVMe IO queue resources are assignable just like the MSIx vectors.
But they're not always assigned 1:1. For example:

  NVMe has an admin queue that always requires an interrupt vector. Does
  the VM driver want this vector to share with the IO queues, or do we
  want a +1 vector for that queue? 

  Maybe the VM is going to use a user space polling driver, so now you
  don't even need MSIx vectors on the function assigned to that VM. You
  just need to assign the IO queue resources, and reserve the MSIx
  resources for another function.

  The Linux nvme driver allows a mix of poll + interrupt queues, so the
  user may want to allocate more IO queues than interrupts.

A kernel interface for assigning interrupt vectors gets us only halfway
to configuring the assignable resources.

> I expect they can probably work the same where the # of interrupts is
> an upper bound on the # of CPUs that can get queues and the device,
> once instantiated, could be configured for the number of queues to
> actually operate, if it wants.
Jason Gunthorpe March 25, 2021, 6:36 p.m. UTC | #39
On Fri, Mar 26, 2021 at 03:31:57AM +0900, Keith Busch wrote:

> The NVMe IO queue resources are assignable just like the MSIx vectors.
> But they're not always assigned 1:1. For example:

But this is all driver configuration; the driver might be running in
some VM.

It seems saner to have two kernel interfaces: one used in the
hypervisor on the PF to set up the MSI vector count, and a second
interface that becomes available after the driver is bound to
configure the driver, wherever/however that driver may be running.

I don't know how you could combine them in one step except for the
simple case of the driver running in the hypervisor?

Jason
Leon Romanovsky March 26, 2021, 6:44 a.m. UTC | #40
On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > > 
> > > > NVMe and mlx5 have basically identical functionality in this respect.
> > > > Other devices and vendors will likely implement similar functionality.
> > > > It would be ideal if we had an interface generic enough to support
> > > > them all.
> > > > 
> > > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > > model?  I think it's close, but not quite, because the the NVMe
> > > > "offline" state isn't explicitly visible in the mlx5 model.
> > > 
> > > I thought Keith basically said "offline" wasn't really useful as a
> > > distinct idea. It is an artifact of nvme being a standards body
> > > divorced from the operating system.
> > > 
> > > In linux offline and no driver attached are the same thing, you'd
> > > never want an API to make a nvme device with a driver attached offline
> > > because it would break the driver.
> > 
> > I think the sticky part is that Linux driver attach is not visible to
> > the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> > can only assign resources to a VF when the VF is offline, and the VF
> > is only usable when it is online.
> > 
> > For NVMe, software must ask the PF to make those online/offline
> > transitions via Secondary Controller Offline and Secondary Controller
> > Online commands [1].  How would this be integrated into this sysfs
> > interface?
> 
> Either the NVMe PF driver tracks the driver attach state using a bus
> notifier and mirrors it to the offline state, or it simply
> offline/onlines as part of the sequence to program the MSI change.
> 
> I don't see why we need any additional modeling of this behavior. 
> 
> What would be the point of onlining a device without a driver?

Agree, we should remember that we are talking about Linux kernel model
and implementation, where _no_driver_ means _offline_.

Thanks

> 
> Jason
Alexander H Duyck March 26, 2021, 4 p.m. UTC | #41
On Thu, Mar 25, 2021 at 11:44 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> > > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > > >
> > > > > NVMe and mlx5 have basically identical functionality in this respect.
> > > > > Other devices and vendors will likely implement similar functionality.
> > > > > It would be ideal if we had an interface generic enough to support
> > > > > them all.
> > > > >
> > > > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > > > model?  I think it's close, but not quite, because the the NVMe
> > > > > "offline" state isn't explicitly visible in the mlx5 model.
> > > >
> > > > I thought Keith basically said "offline" wasn't really useful as a
> > > > distinct idea. It is an artifact of nvme being a standards body
> > > > divorced from the operating system.
> > > >
> > > > In linux offline and no driver attached are the same thing, you'd
> > > > never want an API to make a nvme device with a driver attached offline
> > > > because it would break the driver.
> > >
> > > I think the sticky part is that Linux driver attach is not visible to
> > > the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> > > can only assign resources to a VF when the VF is offline, and the VF
> > > is only usable when it is online.
> > >
> > > For NVMe, software must ask the PF to make those online/offline
> > > transitions via Secondary Controller Offline and Secondary Controller
> > > Online commands [1].  How would this be integrated into this sysfs
> > > interface?
> >
> > Either the NVMe PF driver tracks the driver attach state using a bus
> > notifier and mirrors it to the offline state, or it simply
> > offline/onlines as part of the sequence to program the MSI change.
> >
> > I don't see why we need any additional modeling of this behavior.
> >
> > What would be the point of onlining a device without a driver?
>
> Agree, we should remember that we are talking about Linux kernel model
> and implementation, where _no_driver_ means _offline_.

The only means you have of guaranteeing the driver is "offline" is by
holding the device lock and checking it. So it is only really
useful for one operation, and then you have to release the lock. The
idea behind having an "offline" state would be to allow you to
aggregate multiple potential operations into a single change.

For example you would place the device offline, then change
interrupts, and then queues, and then you could online it again. The
kernel code could have something in place to prevent driver load on
"offline" devices. What it gives you is more of a transactional model
versus what you have right now which is more of a concurrent model.
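
From userspace the transactional model would look something like this
(purely hypothetical: the "sriov_vf_offline" attribute below does not
exist today, it is the kind of knob this model would add;
sriov_vf_msix_count is the file from this series):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void sysfs_write(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0)
                return;
        write(fd, val, strlen(val));
        close(fd);
}

int main(void)
{
        /* Take the VF out of service; driver load is refused while offline. */
        sysfs_write("/sys/bus/pci/devices/0000:01:00.2/sriov_vf_offline", "1");
        /* Aggregate several changes while the VF is quiesced. */
        sysfs_write("/sys/bus/pci/devices/0000:01:00.2/sriov_vf_msix_count", "32");
        /* ... other resource changes (queues, etc.) could go here ... */
        /* Bring the VF back; from here on it behaves like any other device. */
        sysfs_write("/sys/bus/pci/devices/0000:01:00.2/sriov_vf_offline", "0");
        return 0;
}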
Jason Gunthorpe March 26, 2021, 4:56 p.m. UTC | #42
On Fri, Mar 26, 2021 at 09:00:50AM -0700, Alexander Duyck wrote:
> On Thu, Mar 25, 2021 at 11:44 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> > > > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > > > >
> > > > > > NVMe and mlx5 have basically identical functionality in this respect.
> > > > > > Other devices and vendors will likely implement similar functionality.
> > > > > > It would be ideal if we had an interface generic enough to support
> > > > > > them all.
> > > > > >
> > > > > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > > > > model?  I think it's close, but not quite, because the the NVMe
> > > > > > "offline" state isn't explicitly visible in the mlx5 model.
> > > > >
> > > > > I thought Keith basically said "offline" wasn't really useful as a
> > > > > distinct idea. It is an artifact of nvme being a standards body
> > > > > divorced from the operating system.
> > > > >
> > > > > In linux offline and no driver attached are the same thing, you'd
> > > > > never want an API to make a nvme device with a driver attached offline
> > > > > because it would break the driver.
> > > >
> > > > I think the sticky part is that Linux driver attach is not visible to
> > > > the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> > > > can only assign resources to a VF when the VF is offline, and the VF
> > > > is only usable when it is online.
> > > >
> > > > For NVMe, software must ask the PF to make those online/offline
> > > > transitions via Secondary Controller Offline and Secondary Controller
> > > > Online commands [1].  How would this be integrated into this sysfs
> > > > interface?
> > >
> > > Either the NVMe PF driver tracks the driver attach state using a bus
> > > notifier and mirrors it to the offline state, or it simply
> > > offline/onlines as part of the sequence to program the MSI change.
> > >
> > > I don't see why we need any additional modeling of this behavior.
> > >
> > > What would be the point of onlining a device without a driver?
> >
> > Agree, we should remember that we are talking about Linux kernel model
> > and implementation, where _no_driver_ means _offline_.
> 
> The only means you have of guaranteeing the driver is "offline" is by
> holding on the device lock and checking it. So it is only really
> useful for one operation and then you have to release the lock. The
> idea behind having an "offline" state would be to allow you to
> aggregate multiple potential operations into a single change.

What we really want is a solution where the SRIOV device exists for the
HW but isn't registered yet as a pci_device. We have endless problems
with needing to configure SRIOV instances at the PF before they get
plugged into the kernel, and the no-driver-autoprobe business is such a
hack.

But that is a huge problem and not this series.

Jason
Bjorn Helgaas March 26, 2021, 5:08 p.m. UTC | #43
On Fri, Mar 26, 2021 at 09:00:50AM -0700, Alexander Duyck wrote:
> On Thu, Mar 25, 2021 at 11:44 PM Leon Romanovsky <leon@kernel.org> wrote:
> > On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> > > > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > > > >
> > > > > > NVMe and mlx5 have basically identical functionality in this respect.
> > > > > > Other devices and vendors will likely implement similar functionality.
> > > > > > It would be ideal if we had an interface generic enough to support
> > > > > > them all.
> > > > > >
> > > > > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > > > > model?  I think it's close, but not quite, because the the NVMe
> > > > > > "offline" state isn't explicitly visible in the mlx5 model.
> > > > >
> > > > > I thought Keith basically said "offline" wasn't really useful as a
> > > > > distinct idea. It is an artifact of nvme being a standards body
> > > > > divorced from the operating system.
> > > > >
> > > > > In linux offline and no driver attached are the same thing, you'd
> > > > > never want an API to make a nvme device with a driver attached offline
> > > > > because it would break the driver.
> > > >
> > > > I think the sticky part is that Linux driver attach is not visible to
> > > > the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> > > > can only assign resources to a VF when the VF is offline, and the VF
> > > > is only usable when it is online.
> > > >
> > > > For NVMe, software must ask the PF to make those online/offline
> > > > transitions via Secondary Controller Offline and Secondary Controller
> > > > Online commands [1].  How would this be integrated into this sysfs
> > > > interface?
> > >
> > > Either the NVMe PF driver tracks the driver attach state using a bus
> > > notifier and mirrors it to the offline state, or it simply
> > > offline/onlines as part of the sequence to program the MSI change.
> > >
> > > I don't see why we need any additional modeling of this behavior.
> > >
> > > What would be the point of onlining a device without a driver?
> >
> > Agree, we should remember that we are talking about Linux kernel model
> > and implementation, where _no_driver_ means _offline_.
> 
> The only means you have of guaranteeing the driver is "offline" is by
> holding on the device lock and checking it. So it is only really
> useful for one operation and then you have to release the lock. The
> idea behind having an "offline" state would be to allow you to
> aggregate multiple potential operations into a single change.
> 
> For example you would place the device offline, then change
> interrupts, and then queues, and then you could online it again. The
> kernel code could have something in place to prevent driver load on
> "offline" devices. What it gives you is more of a transactional model
> versus what you have right now which is more of a concurrent model.

Thanks, Alex.  Leon currently does enforce the "offline" situation by
holding the VF device lock while checking that it has no driver and
asking the PF to do the assignment.  I agree this is only useful for a
single operation.  Would the current series *prevent* a transactional
model from being added later if it turns out to be useful?  I think I
can imagine keeping the same sysfs files but changing the
implementation to check for the VF being offline, while adding
something new to control online/offline.
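
To make that concrete, the flow under discussion is roughly the
following sketch.  Names, capability checks, and error handling are
simplified, and sriov_set_msix_vec_count() just stands in for whatever
hook the PF driver provides, so treat it as illustrative rather than
the literal patch:

  /* Sketch only: the point is the locking order and the check that no
   * driver is bound to the VF before asking the PF driver to do the
   * assignment.
   */
  static ssize_t sriov_vf_msix_count_store(struct device *dev,
                                           struct device_attribute *attr,
                                           const char *buf, size_t count)
  {
          struct pci_dev *vf = to_pci_dev(dev);
          struct pci_dev *pf = pci_physfn(vf);
          int num_msix, ret;

          ret = kstrtoint(buf, 0, &num_msix);
          if (ret)
                  return ret;

          device_lock(&pf->dev);          /* serialize against PF remove */
          device_lock(&vf->dev);          /* serialize against VF bind */
          if (vf->dev.driver)
                  ret = -EBUSY;           /* "offline" == no driver bound */
          else if (!pf->driver || !pf->driver->sriov_set_msix_vec_count)
                  ret = -EOPNOTSUPP;
          else
                  ret = pf->driver->sriov_set_msix_vec_count(vf, num_msix);
          device_unlock(&vf->dev);
          device_unlock(&pf->dev);

          return ret ? ret : count;
  }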

I also want to resurrect your idea of associating
"sriov_vf_msix_count" with the PF instead of the VF.  I really like
that idea, and it better reflects the way both mlx5 and NVMe work.  I
don't think there was a major objection to it, but the discussion
seems to have petered out after your suggestion of putting the PCI
bus/device/function in the filename, which I also like [1].

Leon has implemented a ton of variations, but I don't think having all
the files in the PF directory was one of them.

Bjorn

[1] https://lore.kernel.org/r/CAKgT0Ue363fZEwqGUa1UAAYotUYH8QpEADW1U5yfNS7XkOLx0Q@mail.gmail.com
Jason Gunthorpe March 26, 2021, 5:12 p.m. UTC | #44
On Fri, Mar 26, 2021 at 12:08:31PM -0500, Bjorn Helgaas wrote:

> Leon has implemented a ton of variations, but I don't think having all
> the files in the PF directory was one of them.

If you promise this is the last of this bike painting adventure then
let's do it.

Jason
Keith Busch March 26, 2021, 5:29 p.m. UTC | #45
On Fri, Mar 26, 2021 at 12:08:31PM -0500, Bjorn Helgaas wrote:
> I also want to resurrect your idea of associating
> "sriov_vf_msix_count" with the PF instead of the VF.  I really like
> that idea, and it better reflects the way both mlx5 and NVMe work.

That is a better match for nvme: we can assign resources to VFs with
the PF's "VF Enable" set to '0', so configuring VFs without requiring
them be enumerated in sysfs is a plus. The VFs would also implicitly be
in the "offline" state for that condition and able to accept these
requests.
Jason Gunthorpe March 26, 2021, 5:31 p.m. UTC | #46
On Sat, Mar 27, 2021 at 02:29:00AM +0900, Keith Busch wrote:
> On Fri, Mar 26, 2021 at 12:08:31PM -0500, Bjorn Helgaas wrote:
> > I also want to resurrect your idea of associating
> > "sriov_vf_msix_count" with the PF instead of the VF.  I really like
> > that idea, and it better reflects the way both mlx5 and NVMe work.
> 
> That is a better match for nvme: we can assign resources to VFs with
> the PF's "VF Enable" set to '0', so configuring VFs without requiring
> them be enumerated in sysfs is a plus. 

If the VF is not in sysfs already in the normal place, why would it be
in the special configuration directory? Do you want the driver to
somehow provide the configuration directory content?

I'm confused about what you mean.

As I said to Alex, configuring things before they even get plugged in
sounds like the right direction.

Jason
Alexander H Duyck March 26, 2021, 6:50 p.m. UTC | #47
On Fri, Mar 26, 2021 at 10:08 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, Mar 26, 2021 at 09:00:50AM -0700, Alexander Duyck wrote:
> > On Thu, Mar 25, 2021 at 11:44 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> > > > > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > > > > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > > > > >
> > > > > > > NVMe and mlx5 have basically identical functionality in this respect.
> > > > > > > Other devices and vendors will likely implement similar functionality.
> > > > > > > It would be ideal if we had an interface generic enough to support
> > > > > > > them all.
> > > > > > >
> > > > > > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > > > > > model?  I think it's close, but not quite, because the NVMe
> > > > > > > "offline" state isn't explicitly visible in the mlx5 model.
> > > > > >
> > > > > > I thought Keith basically said "offline" wasn't really useful as a
> > > > > > distinct idea. It is an artifact of nvme being a standards body
> > > > > > divorced from the operating system.
> > > > > >
> > > > > > In linux offline and no driver attached are the same thing, you'd
> > > > > > never want an API to make a nvme device with a driver attached offline
> > > > > > because it would break the driver.
> > > > >
> > > > > I think the sticky part is that Linux driver attach is not visible to
> > > > > the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> > > > > can only assign resources to a VF when the VF is offline, and the VF
> > > > > is only usable when it is online.
> > > > >
> > > > > For NVMe, software must ask the PF to make those online/offline
> > > > > transitions via Secondary Controller Offline and Secondary Controller
> > > > > Online commands [1].  How would this be integrated into this sysfs
> > > > > interface?
> > > >
> > > > Either the NVMe PF driver tracks the driver attach state using a bus
> > > > notifier and mirrors it to the offline state, or it simply
> > > > offline/onlines as part of the sequence to program the MSI change.
> > > >
> > > > I don't see why we need any additional modeling of this behavior.
> > > >
> > > > What would be the point of onlining a device without a driver?
> > >
> > > Agree, we should remember that we are talking about Linux kernel model
> > > and implementation, where _no_driver_ means _offline_.
> >
> > The only means you have of guaranteeing the driver is "offline" is by
> > holding on the device lock and checking it. So it is only really
> > useful for one operation and then you have to release the lock. The
> > idea behind having an "offline" state would be to allow you to
> > aggregate multiple potential operations into a single change.
> >
> > For example you would place the device offline, then change
> > interrupts, and then queues, and then you could online it again. The
> > kernel code could have something in place to prevent driver load on
> > "offline" devices. What it gives you is more of a transactional model
> > versus what you have right now which is more of a concurrent model.
>
> Thanks, Alex.  Leon currently does enforce the "offline" situation by
> holding the VF device lock while checking that it has no driver and
> asking the PF to do the assignment.  I agree this is only useful for a
> single operation.  Would the current series *prevent* a transactional
> model from being added later if it turns out to be useful?  I think I
> can imagine keeping the same sysfs files but changing the
> implementation to check for the VF being offline, while adding
> something new to control online/offline.

My concern would be that we are defining the user space interface.
Once we have this working as a single operation, I could see us having
to support it that way going forward, as somebody will script something
not expecting an "offline" sysfs file, and the complaint would be that
we are breaking userspace if we require the use of an "offline" file.
So my preference would be to just do it that way now rather than wait,
as the behavior will be grandfathered in once we allow the operation
without it.

> I also want to resurrect your idea of associating
> "sriov_vf_msix_count" with the PF instead of the VF.  I really like
> that idea, and it better reflects the way both mlx5 and NVMe work.  I
> don't think there was a major objection to it, but the discussion
> seems to have petered out after your suggestion of putting the PCI
> bus/device/funcion in the filename, which I also like [1].
>
> Leon has implemented a ton of variations, but I don't think having all
> the files in the PF directory was one of them.
>
> Bjorn
>
> [1] https://lore.kernel.org/r/CAKgT0Ue363fZEwqGUa1UAAYotUYH8QpEADW1U5yfNS7XkOLx0Q@mail.gmail.com

I almost wonder if it wouldn't make sense to just partition this up to
handle flexible resources in the future. Maybe something like having
the directory set up such that you have "sriov_resources/msix/" and
then you could have individual files, with one for the total and the
rest with the VF BDF naming scheme. Then, if we have to, we could add
other subdirectories later to handle things like queues.
Jason Gunthorpe March 26, 2021, 7:01 p.m. UTC | #48
On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote:

> My concern would be that we are defining the user space interface.
> Once we have this working as a single operation I could see us having
> to support it that way going forward as somebody will script something
> not expecting an "offline" sysfs file, and the complaint would be that
> we are breaking userspace if we require the use of an "offline"
> file.

Well, we wouldn't do that. The semantic we define here is that the
msix_count interface 'auto-offlines' if that is what is required. If
we add some formal offline someday then 'auto-offline' would be a NOP
when the device is offline and do the same online/offline sequence as
today if it isn't.
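
Concretely, I'd imagine the NVMe PF callback doing something like this
sketch, where the nvme_pf_*() helpers are made-up names standing in for
the Secondary Controller Offline / Virtualization Management / Online
admin commands:

  /* Sketch only, with hypothetical helpers: "auto-offline" the VF
   * around the vector change and put it back the way it was.
   */
  static int nvme_sriov_set_msix_vec_count(struct pci_dev *vf, int count)
  {
          struct nvme_dev *ndev = pci_get_drvdata(pci_physfn(vf));
          bool was_online;
          int ret;

          ret = nvme_pf_offline_vf(ndev, vf, &was_online); /* NOP if already offline */
          if (ret)
                  return ret;

          ret = nvme_pf_assign_vf_msix(ndev, vf, count);

          if (was_online)
                  nvme_pf_online_vf(ndev, vf);             /* restore previous state */

          return ret;
  }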

> I almost wonder if it wouldn't make sense to just partition this up to
> handle flexible resources in the future. Maybe something like having
> the directory setup such that you have "sriov_resources/msix/" and

This is supposed to be about PCI properties; that is why we are doing
it in the PCI layer.

If you want to see something that handles non-PCI properties too then
Leon needs to make the whole thing general so the device driver can
give a list of properties it wants to configure and the core manages
the thing.

But at that point, why involve the PCI core in the first place? Just
put the complex configuration in the driver, use configfs or devlink
or nvmecli or whatever is appropriate.

And we are doing that too; there will also be pre-driver configuration
in devlink for *non-PCI* properties. *shrug*

Jason
Bjorn Helgaas March 26, 2021, 7:36 p.m. UTC | #49
On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote:

> I almost wonder if it wouldn't make sense to just partition this up to
> handle flexible resources in the future. Maybe something like having
> the directory setup such that you have "sriov_resources/msix/" and
> then you could have individual files with one for the total and the
> rest with the VF BDF naming scheme. Then if we have to, we could add
> other subdirectories in the future to handle things like queues in the
> future.

Subdirectories would be nice, but Greg KH said earlier in a different
context that there's an issue with them [1].  He went on to say tools
like udev would miss uevents for the subdirs [2].

I don't know whether that's really a problem in this case -- it
doesn't seem like we would care about uevents for files that do MSI-X
vector assignment.

[1] https://lore.kernel.org/linux-pci/20191121211017.GA854512@kroah.com/
[2] https://lore.kernel.org/linux-pci/20191124170207.GA2267252@kroah.com/
Leon Romanovsky March 27, 2021, 6 a.m. UTC | #50
On Fri, Mar 26, 2021 at 02:12:09PM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 12:08:31PM -0500, Bjorn Helgaas wrote:
> 
> > Leon has implemented a ton of variations, but I don't think having all
> > the files in the PF directory was one of them.
> 
> If you promise this is the last of this bike painting adventure then
> let's do it.

So Bjorn, can you promise?

Thanks

> 
> Jason
gregkh@linuxfoundation.org March 27, 2021, 12:38 p.m. UTC | #51
On Fri, Mar 26, 2021 at 02:36:31PM -0500, Bjorn Helgaas wrote:
> On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote:
> 
> > I almost wonder if it wouldn't make sense to just partition this up to
> > handle flexible resources in the future. Maybe something like having
> > the directory setup such that you have "sriov_resources/msix/" and
> > then you could have individual files with one for the total and the
> > rest with the VF BDF naming scheme. Then if we have to, we could add
> > other subdirectories in the future to handle things like queues in the
> > future.
> 
> Subdirectories would be nice, but Greg KH said earlier in a different
> context that there's an issue with them [1].  He went on to say tools
> like udev would miss uevents for the subdirs [2].
> 
> I don't know whether that's really a problem in this case -- it
> doesn't seem like we would care about uevents for files that do MSI-X
> vector assignment.
> 
> [1] https://lore.kernel.org/linux-pci/20191121211017.GA854512@kroah.com/
> [2] https://lore.kernel.org/linux-pci/20191124170207.GA2267252@kroah.com/

You can only go "one level deep" on subdirectories tied to a 'struct
device' and have userspace tools know they are still there.  You can do
that by giving an attribute group a "name" for the directory.

Anything more than that just gets very very messy very quickly and I do
not recommend doing that at all.
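
That is, something like this sketch, where the attribute and group
names are just examples:

  /* A named static attribute group gives you that one allowed
   * subdirectory level ("sriov/") without any manual kobject.
   */
  static ssize_t vf_total_msix_show(struct device *dev,
                                    struct device_attribute *attr, char *buf)
  {
          return sysfs_emit(buf, "%u\n", 0);      /* placeholder value */
  }
  static DEVICE_ATTR_RO(vf_total_msix);

  static struct attribute *sriov_attrs[] = {
          &dev_attr_vf_total_msix.attr,
          NULL,
  };

  static const struct attribute_group sriov_attr_group = {
          .name  = "sriov",       /* shows up as <device>/sriov/vf_total_msix */
          .attrs = sriov_attrs,
  };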

thanks,

greg k-h
Bjorn Helgaas March 30, 2021, 1:29 a.m. UTC | #52
On Fri, Mar 26, 2021 at 04:01:48PM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote:
> 
> > My concern would be that we are defining the user space interface.
> > Once we have this working as a single operation I could see us having
> > to support it that way going forward as somebody will script something
> > not expecting an "offline" sysfs file, and the complaint would be that
> > we are breaking userspace if we require the use of an "offline"
> > file.
> 
> Well, we wouldn't do that. The semantic we define here is that the
> msix_count interface 'auto-offlines' if that is what is required. If
> we add some formal offline someday then 'auto-offline' would be a NOP
> when the device is offline and do the same online/offline sequence as
> today if it isn't.

Alexander, Keith, any more thoughts on this?

I think I misunderstood Greg's subdirectory comment.  We already have
directories like this:

  /sys/bus/pci/devices/0000:01:00.0/link/
  /sys/bus/pci/devices/0000:01:00.0/msi_irqs/
  /sys/bus/pci/devices/0000:01:00.0/power/

and aspm_ctrl_attr_group (for "link") is nicely done with static
attributes.  So I think we could do something like this:

  /sys/bus/pci/devices/0000:01:00.0/   # PF directory
    sriov/                             # SR-IOV related stuff
      vf_total_msix
      vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
      ...
      vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF

And I think this could support the mlx5 model as well as the NVMe
model.

For NVMe, a write to vf_msix_count_* would have to auto-offline the VF
before asking the PF to assign the vectors, as Jason suggests above.
Before VF Enable is set, the vf_msix_count_* files wouldn't exist and
we wouldn't be able to assign vectors to VFs; IIUC that's a difference
from the NVMe interface, but maybe not a terrible one?

I'm not proposing changing nvme-cli to use this, but if the interface
is general enough to support both, that would be a good clue that it
might be able to support future devices with similar functionality.

Bjorn
Jason Gunthorpe March 30, 2021, 1:57 p.m. UTC | #53
On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:

> I think I misunderstood Greg's subdirectory comment.  We already have
> directories like this:

Yes, IIRC, Greg's remark applies if you have to start creating
directories with manual kobjects.

> and aspm_ctrl_attr_group (for "link") is nicely done with static
> attributes.  So I think we could do something like this:
> 
>   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
>     sriov/                             # SR-IOV related stuff
>       vf_total_msix
>       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
>       ...
>       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF

It looks a bit odd that it isn't a subdirectory, but this seems
reasonable.

> For NVMe, a write to vf_msix_count_* would have to auto-offline the VF
> before asking the PF to assign the vectors, as Jason suggests above.

It is also not awful if it returns EBUSY if the admin hasn't done
some device-specific offline sequence.

I'm just worried adding the idea of offline here is going to open a
huge can of worms in terms of defining what it means, and the very
next ask will be to start all VFs in offline mode. This would be some
weird overlap with the no-driver-autoprobing sysfs. We've been
thinking about this a lot here and there are no easy answers.

mlx5 sort of has an offline concept too, but we have been modeling it
in devlink, which is kind of like nvme-cli for networking.

Jason
Bjorn Helgaas March 30, 2021, 3 p.m. UTC | #54
On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> 
> > I think I misunderstood Greg's subdirectory comment.  We already have
> > directories like this:
> 
> Yes, IIRC, Greg's remark applies if you have to start creating
> directories with manual kobjects.
> 
> > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > attributes.  So I think we could do something like this:
> > 
> >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> >     sriov/                             # SR-IOV related stuff
> >       vf_total_msix
> >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> >       ...
> >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> 
> It looks a bit odd that it isn't a subdirectory, but this seems
> reasonable.

Sorry, I missed your point; you'll have to lay it out more explicitly.
I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
directory.  The full paths would be:

  /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
  /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
  ...

> > For NVMe, a write to vf_msix_count_* would have to auto-offline the VF
> > before asking the PF to assign the vectors, as Jason suggests above.
> 
> It is also not awful if it returns EBUSY if the admin hasn't done
> some device-specific offline sequence.

Agreed.  The concept of "offline" is not visible in this interface.

> I'm just worried adding the idea of offline here is going to open a
> huge can of worms in terms of defining what it means, and the very
> next ask will be to start all VFs in offline mode. This would be some
> weird overlap with the no-driver-autoprobing sysfs. We've been
> thinking about this alot here and there are not easy answers.

We haven't added any idea of offline in the sysfs interface.  I'm
only trying to figure out whether it would be possible to use this
interface on top of devices with an offline concept, e.g., NVMe.

> mlx5 sort of has an offline concept too, but we have been modeling it
> in devlink, which is kind of like nvme-cli for networking.
> 
> Jason
Keith Busch March 30, 2021, 6:10 p.m. UTC | #55
On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> On Fri, Mar 26, 2021 at 04:01:48PM -0300, Jason Gunthorpe wrote:
> > On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote:
> > 
> > > My concern would be that we are defining the user space interface.
> > > Once we have this working as a single operation I could see us having
> > > to support it that way going forward as somebody will script something
> > > not expecting an "offline" sysfs file, and the complaint would be that
> > > we are breaking userspace if we require the use of an "offline"
> > > file.
> > 
> > Well, we wouldn't do that. The semantic we define here is that the
> > msix_count interface 'auto-offlines' if that is what is required. If
> > we add some formal offline someday then 'auto-offline' would be a NOP
> > when the device is offline and do the same online/offline sequence as
> > today if it isn't.
> 
> Alexander, Keith, any more thoughts on this?
> 
> I think I misunderstood Greg's subdirectory comment.  We already have
> directories like this:
> 
>   /sys/bus/pci/devices/0000:01:00.0/link/
>   /sys/bus/pci/devices/0000:01:00.0/msi_irqs/
>   /sys/bus/pci/devices/0000:01:00.0/power/
> 
> and aspm_ctrl_attr_group (for "link") is nicely done with static
> attributes.  So I think we could do something like this:
> 
>   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
>     sriov/                             # SR-IOV related stuff
>       vf_total_msix
>       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
>       ...
>       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> 
> And I think this could support the mlx5 model as well as the NVMe
> model.
> 
> For NVMe, a write to vf_msix_count_* would have to auto-offline the VF
> before asking the PF to assign the vectors, as Jason suggests above.
> Before VF Enable is set, the vf_msix_count_* files wouldn't exist and
> we wouldn't be able to assign vectors to VFs; IIUC that's a difference
> from the NVMe interface, but maybe not a terrible one?

Yes, that's fine, nvme can handle this flow. It is a little easier to
avoid nvme user error if we could manipulate the counts prior to VF Enable,
but it's really not a problem this way either.

I think it's reasonable for nvme to subscribe to this interface, but I
will have to defer to someone with capable nvme devices to implement it.
Jason Gunthorpe March 30, 2021, 7:47 p.m. UTC | #56
On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > 
> > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > directories like this:
> > 
> > Yes, IIRC, Greg's remark applies if you have to start creating
> > directories with manual kobjects.
> > 
> > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > attributes.  So I think we could do something like this:
> > > 
> > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > >     sriov/                             # SR-IOV related stuff
> > >       vf_total_msix
> > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > >       ...
> > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > 
> > It looks a bit odd that it isn't a subdirectory, but this seems
> > reasonable.
> 
> Sorry, I missed your point; you'll have to lay it out more explicitly.
> I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> directory.  The full paths would be:
>
>   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
>   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
>   ...

Sorry, I was meaning what you first proposed:

   /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count

Which has the extra sub directory to organize the child VFs.

Keep in mind there are going to be a lot of VFs here, > 1k - so this
will be a huge directory.

Jason
Bjorn Helgaas March 30, 2021, 8:41 p.m. UTC | #57
On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > > 
> > > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > > directories like this:
> > > 
> > > Yes, IIRC, Greg's remark applies if you have to start creating
> > > directories with manual kobjects.
> > > 
> > > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > > attributes.  So I think we could do something like this:
> > > > 
> > > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > > >     sriov/                             # SR-IOV related stuff
> > > >       vf_total_msix
> > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > > >       ...
> > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > > 
> > > It looks a bit odd that it isn't a subdirectory, but this seems
> > > reasonable.
> > 
> > Sorry, I missed your point; you'll have to lay it out more explicitly.
> > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> > directory.  The full paths would be:
> >
> >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
> >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
> >   ...
> 
> Sorry, I was meaning what you first proposed:
> 
>    /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> 
> Which has the extra sub directory to organize the child VFs.
> 
> Keep in mind there is going to be alot of VFs here, > 1k - so this
> will be a huge directory.

With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain
1 + 1K files ("vf_total_msix" + 1 per VF).

With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
1 file and 1K subdirectories.

No real difference now, but if we add more files per VF, a BB:DD.F/
subdirectory would certainly be nicer.

I'm dense and don't fully understand Greg's subdirectory comment.

The VF will have its own "pci/devices/DDDD:BB:DD.F/" directory, so
adding sriov/BB:DD.F/ under the PF shouldn't affect any udev events or
rules for the VF.

I see "ATTR{power/control}" in lots of udev rules, so I guess udev
could manage a single subdirectory like "ATTR{sriov/vf_total_msix}".
I doubt it could do "ATTR{sriov/adm/vf_total_msix}" (with another
level) or "ATTR{sriov/BB:DD.F/vf_msix_count}" (with variable VF text
in the path).

But it doesn't seem like that level of control would be in a udev rule
anyway.  A PF udev rule might *start* a program to manage MSI-X
vectors, but such a program should be able to deal with whatever
directory structure we want.

If my uninformed udev speculation makes sense *and* we think there
will be more per-VF files later, I think I'm OK either way.

Bjorn
Jason Gunthorpe March 30, 2021, 10:43 p.m. UTC | #58
On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote:
> On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:
> > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > > > 
> > > > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > > > directories like this:
> > > > 
> > > > Yes, IIRC, Greg's remark applies if you have to start creating
> > > > directories with manual kobjects.
> > > > 
> > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > > > attributes.  So I think we could do something like this:
> > > > > 
> > > > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > > > >     sriov/                             # SR-IOV related stuff
> > > > >       vf_total_msix
> > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > > > >       ...
> > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > > > 
> > > > It looks a bit odd that it isn't a subdirectory, but this seems
> > > > reasonable.
> > > 
> > > Sorry, I missed your point; you'll have to lay it out more explicitly.
> > > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> > > directory.  The full paths would be:
> > >
> > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
> > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
> > >   ...
> > 
> > Sorry, I was meaning what you first proposed:
> > 
> >    /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> > 
> > Which has the extra sub directory to organize the child VFs.
> > 
> > Keep in mind there is going to be alot of VFs here, > 1k - so this
> > will be a huge directory.
> 
> With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain
> 1 + 1K files ("vf_total_msix" + 1 per VF).
> 
> With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> 1 file and 1K subdirectories.

The smallest directory size is with the current patch since it
re-uses the existing VF directory. Do we care about directory size at
the sysfs level?
 
> No real difference now, but if we add more files per VF, a BB:DD.F/
> subdirectory would certainly be nicer.

Hard to know if that will happen; there is a lot of 'pre-driver'
configuration, but it seems to mostly be living in other places.

If this is restricted to be only the generic PCI attributes (and I
think it should be) I'm having a hard time coming up with a future
extension.

> I'm dense and don't fully understand Greg's subdirectory comment.

I also don't know udev well enough. I've certainly seen drivers
creating extra subdirectories using kobjects.

> But it doesn't seem like that level of control would be in a udev rule
> anyway.  A PF udev rule might *start* a program to manage MSI-X
> vectors, but such a program should be able to deal with whatever
> directory structure we want.

Yes, I can't really see this being used from udev either. 

I assume there is also the usual race about triggering the uevent
before the subdirectories are created, but we have the
dev_set_uevent_suppress() thing now for that..

Jason
Leon Romanovsky March 31, 2021, 4:08 a.m. UTC | #59
On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote:
> On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:
> > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > > > 
> > > > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > > > directories like this:
> > > > 
> > > > Yes, IIRC, Greg's remark applies if you have to start creating
> > > > directories with manual kobjects.
> > > > 
> > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > > > attributes.  So I think we could do something like this:
> > > > > 
> > > > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > > > >     sriov/                             # SR-IOV related stuff
> > > > >       vf_total_msix
> > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > > > >       ...
> > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > > > 
> > > > It looks a bit odd that it isn't a subdirectory, but this seems
> > > > reasonable.
> > > 
> > > Sorry, I missed your point; you'll have to lay it out more explicitly.
> > > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> > > directory.  The full paths would be:
> > >
> > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
> > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
> > >   ...
> > 
> > Sorry, I was meaning what you first proposed:
> > 
> >    /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> > 
> > Which has the extra sub directory to organize the child VFs.
> > 
> > Keep in mind there is going to be alot of VFs here, > 1k - so this
> > will be a huge directory.
> 
> With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain
> 1 + 1K files ("vf_total_msix" + 1 per VF).
> 
> With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> 1 file and 1K subdirectories.

This is racy by design: in order to add a new file and create the BB:DD.F
directory, the VF will need to do it after or during its creation.
At PF creation time those BB:DD.F values are unknown to the PF.

The race here is that the PF/VF directory uevents have already been sent
but the new directory structure is not ready yet.

From a code perspective, we will need to add something similar to
pci_iov_sysfs_link() with the code that you didn't like in previous
variants (the one that messes with the sysfs_create_file API).

It does not look good for large SR-IOV systems with >1K VFs to have a
gazillion subdirectories inside the PF, while the more natural place to
see them is in the VF.

So I'm completely puzzled why you want to put these files on the PF and
not on the VF, as v0, v7 and v8 proposed.

Thanks
gregkh@linuxfoundation.org March 31, 2021, 6:38 a.m. UTC | #60
On Tue, Mar 30, 2021 at 07:43:41PM -0300, Jason Gunthorpe wrote:
> > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> > 1 file and 1K subdirectories.
> 
> The smallest directory sizes is with the current patch since it
> re-uses the existing VF directory. Do we care about directory size at
> the sysfs level?

No, that should not matter.

The "issue" here is that you "broke" the device chain here by adding a
random kobject to the directory tree: "BB:DD.F"

Again, devices are allowed to have attributes associated with them that
are _ONE_ subdirectory level deep.

So, to use your path above, this is allowed:
	0000:01:00.0/sriov/vf_msix_count

as these are sriov attributes for the 0000:01:00.0 device, but this is
not:
	0000:01:00.0/sriov/BB:DD.F/vf_msix_count
as you "threw" a random kobject called BB:DD.F into the middle.

If you want to have "BB:DD.F" in there, then it needs to be a real
struct device and _THEN_ it needs to point its parent to "0000:01:00.0",
another struct device, as "sriov" is NOT ANYTHING in the hierarchy here
at all.

Does that help?  The rules are:
	- Once you use a 'struct device', all subdirs below that device
	  are either an attribute group for that device or a child
	  device.
	- A struct device can NOT have an attribute group as a parent,
	  it can ONLY have another struct device as a parent.

If you break those rules, the kernel has the ability to get really
confused unless you are very careful, and userspace will be totally lost
as you can not do anything special there.

> > I'm dense and don't fully understand Greg's subdirectory comment.
> 
> I also don't know udev well enough. I've certainly seen drivers
> creating extra subdirectories using kobjects.

And those drivers are broken.  Please point them out to me and I will be
glad to go fix them.  Or tell their authors why they are broken :)

> > But it doesn't seem like that level of control would be in a udev rule
> > anyway.  A PF udev rule might *start* a program to manage MSI-X
> > vectors, but such a program should be able to deal with whatever
> > directory structure we want.
> 
> Yes, I can't really see this being used from udev either. 

It doesn't matter if you think it could be used, it _will_ be used as
you are exposing this stuff to userspace.

> I assume there is also the usual race about triggering the uevent
> before the subdirectories are created, but we have the
> dev_set_uevent_suppress() thing now for that..

Unless you are "pci bus code" you shouldn't be using that :)

thanks,

greg k-h
Jason Gunthorpe March 31, 2021, 12:19 p.m. UTC | #61
On Wed, Mar 31, 2021 at 08:38:39AM +0200, Greg Kroah-Hartman wrote:
> On Tue, Mar 30, 2021 at 07:43:41PM -0300, Jason Gunthorpe wrote:
> > > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> > > 1 file and 1K subdirectories.
> > 
> > The smallest directory sizes is with the current patch since it
> > re-uses the existing VF directory. Do we care about directory size at
> > the sysfs level?
> 
> No, that should not matter.
> 
> The "issue" here is that you "broke" the device chain here by adding a
> random kobject to the directory tree: "BB:DD.F"
> 
> Again, devices are allowed to have attributes associated with it to be
> _ONE_ subdirectory level deep.
> 
> So, to use your path above, this is allowed:
> 	0000:01:00.0/sriov/vf_msix_count
> 
> as these are sriov attributes for the 0000:01:00.0 device, but this is
> not:
> 	0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> as you "threw" a random kobject called BB:DD.F into the middle.
>
> If you want to have "BB:DD.F" in there, then it needs to be a real
> struct device and _THEN_ it needs to point its parent to "0000:01:00.0",
> another struct device, as "sriov" is NOT ANYTHING in the heirachy here
> at all.

It isn't a struct device object at all though, it is just organizing
attributes.

> Does that help?  The rules are:
> 	- Once you use a 'struct device', all subdirs below that device
> 	  are either an attribute group for that device or a child
> 	  device.
> 	- A struct device can NOT have an attribute group as a parent,
> 	  it can ONLY have another struct device as a parent.
> 
> If you break those rules, the kernel has the ability to get really
> confused unless you are very careful, and userspace will be totally lost
> as you can not do anything special there.

The kernel gets confused?

I'm not sure I understand why userspace gets confused. I can guess
udev has some issue, but everything else seems OK, it is just a path.

> > > I'm dense and don't fully understand Greg's subdirectory comment.
> > 
> > I also don't know udev well enough. I've certainly seen drivers
> > creating extra subdirectories using kobjects.
> 
> And those drivers are broken.  Please point them out to me and I will be
> glad to go fix them.  Or tell their authors why they are broken :)

How do you fix them? It is uAPI at this point so we can't change the
directory names. Can't make them struct devices (userspace would get
confused if we add *more* sysfs files)

Grep for kobject_init_and_add() under drivers/ and I think you get a
pretty good overview of the places.

Since it seems like kind of a big problem can we make this allowed
somehow?

> > > But it doesn't seem like that level of control would be in a udev rule
> > > anyway.  A PF udev rule might *start* a program to manage MSI-X
> > > vectors, but such a program should be able to deal with whatever
> > > directory structure we want.
> >
> > Yes, I can't really see this being used from udev either. 
> 
> It doesn't matter if you think it could be used, it _will_ be used as
> you are exposing this stuff to userspace.

Well, from what I understand, it won't be used because udev can't do
three-level-deep attributes, and if that hasn't been a problem in the
last 10 years for the existing places, it might not ever be needed in
udev at all.

> > I assume there is also the usual race about triggering the uevent
> > before the subdirectories are created, but we have the
> > dev_set_uevent_suppress() thing now for that..
> 
> Unless you are "pci bus code" you shouldn't be using that :)

There are over 40 users now.

Jason
gregkh@linuxfoundation.org March 31, 2021, 3:03 p.m. UTC | #62
On Wed, Mar 31, 2021 at 09:19:29AM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 31, 2021 at 08:38:39AM +0200, Greg Kroah-Hartman wrote:
> > On Tue, Mar 30, 2021 at 07:43:41PM -0300, Jason Gunthorpe wrote:
> > > > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> > > > 1 file and 1K subdirectories.
> > > 
> > > The smallest directory sizes is with the current patch since it
> > > re-uses the existing VF directory. Do we care about directory size at
> > > the sysfs level?
> > 
> > No, that should not matter.
> > 
> > The "issue" here is that you "broke" the device chain here by adding a
> > random kobject to the directory tree: "BB:DD.F"
> > 
> > Again, devices are allowed to have attributes associated with it to be
> > _ONE_ subdirectory level deep.
> > 
> > So, to use your path above, this is allowed:
> > 	0000:01:00.0/sriov/vf_msix_count
> > 
> > as these are sriov attributes for the 0000:01:00.0 device, but this is
> > not:
> > 	0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> > as you "threw" a random kobject called BB:DD.F into the middle.
> >
> > If you want to have "BB:DD.F" in there, then it needs to be a real
> > struct device and _THEN_ it needs to point its parent to "0000:01:00.0",
> > another struct device, as "sriov" is NOT ANYTHING in the heirachy here
> > at all.
> 
> It isn't a struct device object at all though, it just organizing
> attributes.

That's the point, it really is not.  You are forced to create a real
object for that subdirectory, and by doing so, you are "breaking" the
driver/device model.  As is evident by userspace not knowing what is
going on here.

> > Does that help?  The rules are:
> > 	- Once you use a 'struct device', all subdirs below that device
> > 	  are either an attribute group for that device or a child
> > 	  device.
> > 	- A struct device can NOT have an attribute group as a parent,
> > 	  it can ONLY have another struct device as a parent.
> > 
> > If you break those rules, the kernel has the ability to get really
> > confused unless you are very careful, and userspace will be totally lost
> > as you can not do anything special there.
> 
> The kernel gets confused?

Putting a kobject as a child of a struct device can easily cause
confusion as that is NOT what you should be doing.  Especially if you
then try to add a device to be a child of that kobject.  Now the kernel
core can not walk all devices in the correct order, and you run into lots
of other problems that you are lucky not to hit.

> I'm not sure I understand why userspace gets confused. I can guess
> udev has some issue, but everything else seems OK, it is just a path.

No, it is not a "path".

Again, here are the driver/device model rules:
	- once you have a "struct device", only "struct device" can be
	  children.
	- You are allowed to place attributes in subdirectories if you
	  want to make things cleaner.  Don't make me doubt giving that
	  type of permission to people by trying to abuse it by going
	  "lower" than one level.
	- If you have to represent something dynamic below a struct
	  device that is not an attribute, make it a real struct device.

And userspace "gets confused" because it thinks it can properly walk the
tree and get a record of all devices in the system.  When you put a
kobject in there just to make a new subdirectory, you break the
notification model and everything else.

Again, do not do that.

> > > > I'm dense and don't fully understand Greg's subdirectory comment.
> > > 
> > > I also don't know udev well enough. I've certainly seen drivers
> > > creating extra subdirectories using kobjects.
> > 
> > And those drivers are broken.  Please point them out to me and I will be
> > glad to go fix them.  Or tell their authors why they are broken :)
> 
> How do you fix them? It is uAPI at this point so we can't change the
> directory names. Can't make them struct devices (userspace would get
> confused if we add *more* sysfs files)

How would userspace get confused?  If anything it would suddenly "wake
up" and see these attributes properly.

> Grep for kobject_init_and_add() under drivers/ and I think you get a
> pretty good overview of the places.

Yes, lots of places where people are abusing things and it should not be
done.  Do not add to the mess by knowingly adding broken code please.

> Since it seems like kind of a big problem can we make this allowed
> somehow?

No, not at all.  Please do not do that.  I will look into the existing
users and try to see if I can fix them up.  Maybe start annoying people
by throwing warnings if you try to register a kobject as a child of a
device...

> > > > But it doesn't seem like that level of control would be in a udev rule
> > > > anyway.  A PF udev rule might *start* a program to manage MSI-X
> > > > vectors, but such a program should be able to deal with whatever
> > > > directory structure we want.
> > >
> > > Yes, I can't really see this being used from udev either. 
> > 
> > It doesn't matter if you think it could be used, it _will_ be used as
> > you are exposing this stuff to userspace.
> 
> Well, from what I understand, it wont be used because udev can't do
> three level deep attributes, and if that hasn't been a problem in that
> last 10 years for the existing places, it might not ever be needed in
> udev at all.

If userspace is not seeing these attributes then WHY CREATE THEM AT
ALL???

Seriously, what needs to see these in sysfs if not the tools that
we have today to use sysfs?  Are you wanting to create new tools instead
to handle these new attributes?  Maybe just do not create them in the
first place?

> > > I assume there is also the usual race about triggering the uevent
> > > before the subdirectories are created, but we have the
> > > dev_set_uevent_suppress() thing now for that..
> > 
> > Unless you are "pci bus code" you shouldn't be using that :)
> 
> There are over 40 users now.

Let's not add more if you do not have to.  PCI has not needed it so far,
no need to rush to add it here if at all possible please.

thanks,

greg k-h
Jason Gunthorpe March 31, 2021, 5:07 p.m. UTC | #63
On Wed, Mar 31, 2021 at 05:03:45PM +0200, Greg Kroah-Hartman wrote:
> > It isn't a struct device object at all though, it just organizing
> > attributes.
> 
> That's the point, it really is not.  You are forced to create a real
> object for that subdirectory, and by doing so, you are "breaking" the
> driver/device model.  As is evident by userspace not knowing what is
> going on here.

I'm still not really sure about what this means in practice..

I found a nested attribute in RDMA land so let's see how it behaves.
 
   /sys/class/infiniband/ibp0s9/ <-- This is a struct device/ib_device

Then we have 261 'attribute' files under a ports subdirectory, for
instance:

 /sys/class/infiniband/ibp0s9/ports/1/cm_tx_retries/dreq

Open/read works fine, and the specialty userspace that people built on
this has been working for a long time.

Does udev see the deeply nested attributes? Apparently yes:

$ udevadm info -a /sys/class/infiniband/ibp0s9
    ATTR{ports/1/cm_rx_duplicates/dreq}=="0"
    [..]

Given your remarks, I'm surprised, but it seems to work - I assume if
udevadm shows it then all the rules will work too.

Has udev become confused about what is a struct device? Looks like no:

$ udevadm info -a /sys/class/infiniband/ibp0s9/port
Unknown device "/sys/class/infiniband/ibp0s9/port": No such device

Can you give an example where things go wrong?

(and I inherited this RDMA stuff. In the last two years we moved it
 all to netlink and modern userspace largely no longer touches sysfs,
 but I can't break in-use uAPI)

> > > Does that help?  The rules are:
> > > 	- Once you use a 'struct device', all subdirs below that device
> > > 	  are either an attribute group for that device or a child
> > > 	  device.
> > > 	- A struct device can NOT have an attribute group as a parent,
> > > 	  it can ONLY have another struct device as a parent.
> > > 
> > > If you break those rules, the kernel has the ability to get really
> > > confused unless you are very careful, and userspace will be totally lost
> > > as you can not do anything special there.
> > 
> > The kernel gets confused?
> 
> Putting a kobject as a child of a struct device can easily cause
> confusion as that is NOT what you should be doing.  Especially if you
> then try to add a device to be a child of that kobject. 

That I've never seen. I've only seen people making extra levels of
directories for organizing attributes.

> > How do you fix them? It is uAPI at this point so we can't change the
> > directory names. Can't make them struct devices (userspace would get
> > confused if we add *more* sysfs files)
> 
> How would userspace get confused?  If anything it would suddenly "wake
> up" and see these attributes properly.

We are talking about specialty userspace that is designed to work with
the sysfs layout as-is. Not udev. The userspace does readdir() on some
of these subdirs - if you start adding random stuff you will break it.

> > Since it seems like kind of a big problem can we make this allowed
> > somehow?
> 
> No, not at all.  Please do not do that.  I will look into the existing
> users and try to see if I can fix them up.  Maybe start annoying people
> by throwing warnings if you try to register a kobject as a child of a
> device...

How does that mesh with our don't break userspace ideal?? :(

> > Well, from what I understand, it wont be used because udev can't do
> > three level deep attributes, and if that hasn't been a problem in that
> > last 10 years for the existing places, it might not ever be needed in
> > udev at all.
> 
> If userspace is not seeing these attributes then WHY CREATE THEM AT
> ALL???

*udev* is not the userspace! People expose sysfs attributes and then
make specialty userspace to consume them! I've seen it many times now.

> Seriously, what is needing to see these in sysfs if not the tools that
> we have today to use sysfs?  Are you wanting to create new tools instead
> to handle these new attributes?  Maybe just do not create them in the
> first place?

This advice is about 10 years too late :(

Regardless, let's not do deeply nested attributes here in PCI. They are
a PITA anyhow.

Jason
Bjorn Helgaas April 1, 2021, 1:23 a.m. UTC | #64
[+cc Rafael, in case you're interested in the driver core issue here]

On Wed, Mar 31, 2021 at 07:08:07AM +0300, Leon Romanovsky wrote:
> On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote:
> > On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> > > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > > > > 
> > > > > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > > > > directories like this:
> > > > > 
> > > > > Yes, IIRC, Greg's remark applies if you have to start creating
> > > > > directories with manual kobjects.
> > > > > 
> > > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > > > > attributes.  So I think we could do something like this:
> > > > > > 
> > > > > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > > > > >     sriov/                             # SR-IOV related stuff
> > > > > >       vf_total_msix
> > > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > > > > >       ...
> > > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > > > > 
> > > > > It looks a bit odd that it isn't a subdirectory, but this seems
> > > > > reasonable.
> > > > 
> > > > Sorry, I missed your point; you'll have to lay it out more explicitly.
> > > > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> > > > directory.  The full paths would be:
> > > >
> > > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
> > > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
> > > >   ...
> > > 
> > > Sorry, I was meaning what you first proposed:
> > > 
> > >    /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> > > 
> > > Which has the extra sub directory to organize the child VFs.
> > > 
> > > Keep in mind there is going to be alot of VFs here, > 1k - so this
> > > will be a huge directory.
> > 
> > With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain
> > 1 + 1K files ("vf_total_msix" + 1 per VF).
> > 
> > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> > 1 file and 1K subdirectories.
> 
> This is racy by design, in order to add new file and create BB:DD.F
> directory, the VF will need to do it after or during it's creation.
> During PF creation it is unknown to PF those BB:DD.F values.
> 
> The race here is due to the events of PF,VF directory already sent
> but new directory structure is not ready yet.
>
> From code perspective, we will need to add something similar to
> pci_iov_sysfs_link() with the code that you didn't like in previous
> variants (the one that messes with sysfs_create_file API).
> 
> It looks not good for large SR-IOV systems with >1K VFs with
> gazillion subdirectories inside PF, while the more natural is to see
> them in VF.
> 
> So I'm completely puzzled why you want to do these files on PF and
> not on VF as v0, v7 and v8 proposed.

On both mlx5 and NVMe, the "assign vectors to VF" functionality is
implemented by the PF, so I think it's reasonable to explore the idea
of "associate the vector assignment sysfs file with the PF."

Assume 1K VFs.  Either way we have >1K subdirectories of
/sys/devices/pci0000:00/.  I think we should avoid an extra
subdirectory level, so I think the choices on the table are:

Associate "vf_msix_count" with the PF:

  - /sys/.../<PF>/sriov/vf_total_msix    # all on PF

  - /sys/.../<PF>/sriov/vf_msix_count_BB:DD.F (1K of these).  Greg
    says the number of these is not a problem.

  - The "vf_total_msix" and "vf_msix_count_*" files are all related
    and are grouped together in PF/sriov/.

  - The "vf_msix_count_*" files operate directly on the PF.  Lock the
    PF for serialization, lookup and lock the VF to ensure no VF
    driver, call PF driver callback to assign vectors.

  - Requires special sysfs code to create/remove "vf_msix_count_*"
    files when setting/clearing VF Enable.  This code could create
    them only when the PF driver actually supports vector assignment.
    Unavoidable sysfs/uevent race, see below.

Associate "vf_msix_count" with the VF:

  - /sys/.../<PF>/sriov_vf_total_msix    # on PF

  - /sys/.../<VF>/sriov_vf_msix_count    # on each VF

  - The "sriov_vf_msix_count" files enter via the VF.  Lock the VF to
    ensure no VF driver, lookup and lock the PF for serialization,
    call PF driver callback to assign vectors.

  - Can be done with static sysfs attributes.  This means creating
    "sriov_vf_msix_count" *always*, even if PF driver doesn't support
    vector assignment.

IIUC, putting "vf_msix_count_*" under the PF involves a race.  When we
call device_add() for each new VF, it creates the VF sysfs directory
and emits the KOBJ_ADD uevent, but the "vf_msix_count_*" file doesn't
exist yet.  It can't be created before device_add() because the sysfs
directory doesn't exist.  If we create it after device_add(), the "add
VF" uevent has already been emitted, so userspace may consume it
before "vf_msix_count_*" is created.

  sriov_enable
    <set VF Enable>                     <-- VFs created on PCI
    sriov_add_vfs
      for (i = 0; i < num_vfs; i++) {
        pci_iov_add_virtfn
          pci_device_add
            device_initialize
            device_add
              device_add_attrs          <-- add VF sysfs attrs
              kobject_uevent(KOBJ_ADD)  <-- emit uevent
                                        <-- add "vf_msix_count_*" sysfs attr
          pci_iov_sysfs_link
          pci_bus_add_device
            pci_create_sysfs_dev_files
            device_attach
      }
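
The "special sysfs code" for the PF-associated variant would be
something along these lines (purely illustrative; the store function is
hypothetical and cleanup on VF removal is omitted), and by construction
it can only run at the point marked above, after KOBJ_ADD:

  /* Illustrative only: dynamically add one per-VF file under the PF's
   * "sriov" group once the VF exists.
   */
  struct vf_msix_attr {
          struct device_attribute dattr;
          char name[32];                  /* "vf_msix_count_BB:DD.F" */
  };

  static int pf_add_vf_msix_count_attr(struct pci_dev *pf, struct pci_dev *vf)
  {
          struct vf_msix_attr *a = kzalloc(sizeof(*a), GFP_KERNEL);

          if (!a)
                  return -ENOMEM;

          snprintf(a->name, sizeof(a->name), "vf_msix_count_%02x:%02x.%d",
                   vf->bus->number, PCI_SLOT(vf->devfn), PCI_FUNC(vf->devfn));
          sysfs_attr_init(&a->dattr.attr);
          a->dattr.attr.name = a->name;
          a->dattr.attr.mode = 0200;
          a->dattr.store = vf_msix_count_store;    /* hypothetical store */

          return sysfs_add_file_to_group(&pf->dev.kobj, &a->dattr.attr, "sriov");
  }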

Conceptually, I like having the "vf_total_msix" and "vf_msix_count_*"
files associated directly with the PF.  I think that's more natural
because they both operate directly on the PF.

But I don't like the race, and using static attributes seems much
cleaner implementation-wise.
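
For comparison, the static-attribute version is declared once and is
created by device_add() itself, before KOBJ_ADD is emitted, so the race
goes away.  Again just a sketch, not the actual patch:

  static ssize_t sriov_vf_msix_count_store(struct device *dev,
                                           struct device_attribute *attr,
                                           const char *buf, size_t count)
  {
          /* lock the VF, look up the PF, call the PF driver callback ... */
          return count;
  }
  static DEVICE_ATTR_WO(sriov_vf_msix_count);

  static struct attribute *sriov_vf_dev_attrs[] = {
          &dev_attr_sriov_vf_msix_count.attr,
          NULL,
  };

  static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
                                            struct attribute *a, int n)
  {
          struct device *dev = kobj_to_dev(kobj);

          /* hide the file on anything that is not a VF */
          return to_pci_dev(dev)->is_virtfn ? a->mode : 0;
  }

  static const struct attribute_group sriov_vf_dev_attr_group = {
          .attrs      = sriov_vf_dev_attrs,
          .is_visible = sriov_vf_attrs_are_visible,
  };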

Bjorn
Leon Romanovsky April 1, 2021, 11:49 a.m. UTC | #65
On Wed, Mar 31, 2021 at 08:23:40PM -0500, Bjorn Helgaas wrote:
> [+cc Rafael, in case you're interested in the driver core issue here]
> 
> On Wed, Mar 31, 2021 at 07:08:07AM +0300, Leon Romanovsky wrote:
> > On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote:
> > > On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> > > > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > > > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > > > > > 
> > > > > > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > > > > > directories like this:
> > > > > > 
> > > > > > Yes, IIRC, Greg's remark applies if you have to start creating
> > > > > > directories with manual kobjects.
> > > > > > 
> > > > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > > > > > attributes.  So I think we could do something like this:
> > > > > > > 
> > > > > > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > > > > > >     sriov/                             # SR-IOV related stuff
> > > > > > >       vf_total_msix
> > > > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > > > > > >       ...
> > > > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > > > > > 
> > > > > > It looks a bit odd that it isn't a subdirectory, but this seems
> > > > > > reasonable.
> > > > > 
> > > > > Sorry, I missed your point; you'll have to lay it out more explicitly.
> > > > > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> > > > > directory.  The full paths would be:
> > > > >
> > > > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
> > > > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
> > > > >   ...
> > > > 
> > > > Sorry, I was meaning what you first proposed:
> > > > 
> > > >    /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> > > > 
> > > > Which has the extra sub directory to organize the child VFs.
> > > > 
> > > > Keep in mind there is going to be alot of VFs here, > 1k - so this
> > > > will be a huge directory.
> > > 
> > > With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain
> > > 1 + 1K files ("vf_total_msix" + 1 per VF).
> > > 
> > > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> > > 1 file and 1K subdirectories.
> > 
> > This is racy by design: in order to add a new file and create the
> > BB:DD.F directory, the VF will need to do it after or during its
> > creation.  During PF creation, those BB:DD.F values are unknown to
> > the PF.
> > 
> > The race here is that the PF/VF directory events have already been
> > sent, but the new directory structure is not ready yet.
> >
> > From code perspective, we will need to add something similar to
> > pci_iov_sysfs_link() with the code that you didn't like in previous
> > variants (the one that messes with sysfs_create_file API).
> > 
> > It does not look good for large SR-IOV systems with >1K VFs to have a
> > gazillion subdirectories inside the PF, while it is more natural to
> > see them in the VF.
> > 
> > So I'm completely puzzled why you want to put these files on the PF
> > and not on the VF, as v0, v7 and v8 proposed.
> 
> On both mlx5 and NVMe, the "assign vectors to VF" functionality is
> implemented by the PF, so I think it's reasonable to explore the idea
> of "associate the vector assignment sysfs file with the PF."

As you said, that is our (Linux kernel) implementation, but from the
user-space perspective it looks different. The user "configures" a
specific VF and doesn't need to know that we route the request through
the PF.
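
As a hypothetical user-space illustration of that point (the helper
name and path are examples only), the admin writes to the VF's own
attribute and never has to name the PF:

  /* vf_bdf is e.g. "0000:01:00.2"; count is the decimal string to write. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static int set_vf_msix_count(const char *vf_bdf, const char *count)
  {
      char path[256];
      int fd;
      ssize_t n;

      snprintf(path, sizeof(path),
               "/sys/bus/pci/devices/%s/sriov_vf_msix_count", vf_bdf);
      fd = open(path, O_WRONLY);
      if (fd < 0)
          return -1;
      n = write(fd, count, strlen(count));
      close(fd);
      return n < 0 ? -1 : 0;
  }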

> 
> Assume 1K VFs.  Either way we have >1K subdirectories of
> /sys/devices/pci0000:00/.  I think we should avoid an extra
> subdirectory level, so I think the choices on the table are:

Right, we already have 1K subdirectories, and the PF will have 1K
virtfnX links anyway, so adding an extra 1K files doesn't look appealing
to me.

If someone later needs to add another parameter to this sriov overlay
directory, we will get yet another 1K files, and the same repeats for
every new parameter.

In the end, the point of this discussion is to make this sysfs feature
very general, and 1K files in one folder for every parameter is not nice
(IMHO). For example, three such parameters for 1K VFs would already mean
3K files in the PF's sriov/ directory.

In theory, the SR-IOV spec allows up to 64K VFs per PF.

> 
> Associate "vf_msix_count" with the PF:
> 
>   - /sys/.../<PF>/sriov/vf_total_msix    # all on PF
> 
>   - /sys/.../<PF>/sriov/vf_msix_count_BB:DD.F (1K of these).  Greg
>     says the number of these is not a problem.
> 
>   - The "vf_total_msix" and "vf_msix_count_*" files are all related
>     and are grouped together in PF/sriov/.
> 
>   - The "vf_msix_count_*" files operate directly on the PF.  Lock the
>     PF for serialization, lookup and lock the VF to ensure no VF
>     driver, call PF driver callback to assign vectors.
> 
>   - Requires special sysfs code to create/remove "vf_msix_count_*"
>     files when setting/clearing VF Enable.  This code could create
>     them only when the PF driver actually supports vector assignment.
>     Unavoidable sysfs/uevent race, see below.
> 
> Associate "vf_msix_count" with the VF:
> 
>   - /sys/.../<PF>/sriov_vf_total_msix    # on PF
> 
>   - /sys/.../<VF>/sriov_vf_msix_count    # on each VF
> 
>   - The "sriov_vf_msix_count" files enter via the VF.  Lock the VF to
>     ensure no VF driver, lookup and lock the PF for serialization,
>     call PF driver callback to assign vectors.
> 
>   - Can be done with static sysfs attributes.  This means creating
>     "sriov_vf_msix_count" *always*, even if PF driver doesn't support
>     vector assignment.

The same goes for the next parameter that someone will add.

> 
> IIUC, putting "vf_msix_count_*" under the PF involves a race.  When we
> call device_add() for each new VF, it creates the VF sysfs directory
> and emits the KOBJ_ADD uevent, but the "vf_msix_count_*" file doesn't
> exist yet.  It can't be created before device_add() because the sysfs
> directory doesn't exist.  If we create it after device_add(), the "add
> VF" uevent has already been emitted, so userspace may consume it
> before "vf_msix_count_*" is created.
> 
>   sriov_enable
>     <set VF Enable>                     <-- VFs created on PCI
>     sriov_add_vfs
>       for (i = 0; i < num_vfs; i++) {
>         pci_iov_add_virtfn
>           pci_device_add
>             device_initialize
>             device_add
>               device_add_attrs          <-- add VF sysfs attrs
>               kobject_uevent(KOBJ_ADD)  <-- emit uevent
>                                         <-- add "vf_msix_count_*" sysfs attr
>           pci_iov_sysfs_link
>           pci_bus_add_device
>             pci_create_sysfs_dev_files
>             device_attach
>       }
> 
> Conceptually, I like having the "vf_total_msix" and "vf_msix_count_*"
> files associated directly with the PF.  I think that's more natural
> because they both operate directly on the PF.

And this is where we disagree :)

> 
> But I don't like the race, and using static attributes seems much
> cleaner implementation-wise.

Right, so what should my next steps be in order not to miss this merge
window?

Thanks

> 
> Bjorn