diff mbox

[v2] PCI: IOV: read SRIOV_NUM_VF after enabling ARI

Message ID 1444317617-13399-1-git-send-email-benjamin.h.shelton@intel.com
State Accepted
Headers show

Commit Message

Ben Shelton Oct. 8, 2015, 3:20 p.m. UTC
For some SR-IOV devices, the number of available virtual functions increases
after enabling ARI.  Currently, SRIOV_NUM_VF is read and saved off before the
ARI control bit is enabled in SRIOV_CTRL.  This causes an issue when VFs are
enabled.

At device init, SRIOV_INITIAL_VF and SRIOV_NUM_VF are specified to contain the
number of available VFs for the device.  sriov_enable() does a sanity check
that SRIOV_INITIAL_VF is not greater than iov->total_VFs, the saved-off value
of SRIOV_NUM_VF.  Since the value of both SRIOV_INITIAL_VF and SRIOV_NUM_VF has
increased after enabling the ARI bit, the check fails, and the VFs cannot be
enabled.

To fix the issue, write SRIOV_CTRL first, and then read SRIOV_NUM_VF.

Signed-off-by: Ben Shelton <benjamin.h.shelton@intel.com>
---

Changes since v1:
 * Moved read of SRIOV_NUM_VF rather than re-reading it if ARI was enabled.

 drivers/pci/iov.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

Comments

Bjorn Helgaas Oct. 15, 2015, 5:58 p.m. UTC | #1
Hi Ben,

On Thu, Oct 08, 2015 at 10:20:17AM -0500, Ben Shelton wrote:
> For some SR-IOV devices, the number of available virtual functions increases
> after enabling ARI.  Currently, SRIOV_NUM_VF is read and saved off before the
> ARI control bit is enabled in SRIOV_CTRL.  This causes an issue when VFs are
> enabled.
> 
> At device init, SRIOV_INITIAL_VF and SRIOV_NUM_VF are specified to contain the
> number of available VFs for the device.  sriov_enable() does a sanity check
> that SRIOV_INITIAL_VF is not greater than iov->total_VFs, the saved-off value
> of SRIOV_NUM_VF.  Since the value of both SRIOV_INITIAL_VF and SRIOV_NUM_VF has
> increased after enabling the ARI bit, the check fails, and the VFs cannot be
> enabled.
> 
> To fix the issue, write SRIOV_CTRL first, and then read SRIOV_NUM_VF.

I think you mean PCI_SRIOV_TOTAL_VR (not NUM_VF), right?

This is interesting because the spec says TotalVFs is HwInit, which
means it's read-only, and it doesn't mention anything about it
changing when ARIis enabled.  I can see why it would change in that
case, so maybe this is just a goof in the spec.

Bjorn

> Signed-off-by: Ben Shelton <benjamin.h.shelton@intel.com>


> ---
> 
> Changes since v1:
>  * Moved read of SRIOV_NUM_VF rather than re-reading it if ARI was enabled.
> 
>  drivers/pci/iov.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index ee0ebff..0174044 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -399,10 +399,6 @@ static int sriov_init(struct pci_dev *dev, int pos)
>  		ssleep(1);
>  	}
>  
> -	pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, &total);
> -	if (!total)
> -		return 0;
> -
>  	ctrl = 0;
>  	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
>  		if (pdev->is_physfn)
> @@ -414,6 +410,11 @@ static int sriov_init(struct pci_dev *dev, int pos)
>  
>  found:
>  	pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
> +
> +	pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, &total);
> +	if (!total)
> +		return 0;
> +
>  	pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0);
>  	pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
>  	pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
> -- 
> 1.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander H Duyck Oct. 15, 2015, 8 p.m. UTC | #2
On 10/15/2015 10:58 AM, Bjorn Helgaas wrote:
> Hi Ben,
>
> On Thu, Oct 08, 2015 at 10:20:17AM -0500, Ben Shelton wrote:
>> For some SR-IOV devices, the number of available virtual functions increases
>> after enabling ARI.  Currently, SRIOV_NUM_VF is read and saved off before the
>> ARI control bit is enabled in SRIOV_CTRL.  This causes an issue when VFs are
>> enabled.
>>
>> At device init, SRIOV_INITIAL_VF and SRIOV_NUM_VF are specified to contain the
>> number of available VFs for the device.  sriov_enable() does a sanity check
>> that SRIOV_INITIAL_VF is not greater than iov->total_VFs, the saved-off value
>> of SRIOV_NUM_VF.  Since the value of both SRIOV_INITIAL_VF and SRIOV_NUM_VF has
>> increased after enabling the ARI bit, the check fails, and the VFs cannot be
>> enabled.
>>
>> To fix the issue, write SRIOV_CTRL first, and then read SRIOV_NUM_VF.
>
> I think you mean PCI_SRIOV_TOTAL_VR (not NUM_VF), right?
>
> This is interesting because the spec says TotalVFs is HwInit, which
> means it's read-only, and it doesn't mention anything about it
> changing when ARIis enabled.  I can see why it would change in that
> case, so maybe this is just a goof in the spec.
>
> Bjorn

I think it is supposed to be HwInit because changing the value can cause 
issues with resource allocation for the VFs.  Specifically if the number 
of VFs increases after the BIOS has come through and assigned MMIO 
resources it is possible that there may not be resources available.

I suspect we are going to end up having to quirk a number of devices in 
the future because of this as I can see this easily causing issues.

- Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Oct. 15, 2015, 9:36 p.m. UTC | #3
On Thu, Oct 15, 2015 at 01:00:55PM -0700, Alexander Duyck wrote:
> On 10/15/2015 10:58 AM, Bjorn Helgaas wrote:
> >Hi Ben,
> >
> >On Thu, Oct 08, 2015 at 10:20:17AM -0500, Ben Shelton wrote:
> >>For some SR-IOV devices, the number of available virtual functions increases
> >>after enabling ARI.  Currently, SRIOV_NUM_VF is read and saved off before the
> >>ARI control bit is enabled in SRIOV_CTRL.  This causes an issue when VFs are
> >>enabled.
> >>
> >>At device init, SRIOV_INITIAL_VF and SRIOV_NUM_VF are specified to contain the
> >>number of available VFs for the device.  sriov_enable() does a sanity check
> >>that SRIOV_INITIAL_VF is not greater than iov->total_VFs, the saved-off value
> >>of SRIOV_NUM_VF.  Since the value of both SRIOV_INITIAL_VF and SRIOV_NUM_VF has
> >>increased after enabling the ARI bit, the check fails, and the VFs cannot be
> >>enabled.
> >>
> >>To fix the issue, write SRIOV_CTRL first, and then read SRIOV_NUM_VF.
> >
> >I think you mean PCI_SRIOV_TOTAL_VR (not NUM_VF), right?
> >
> >This is interesting because the spec says TotalVFs is HwInit, which
> >means it's read-only, and it doesn't mention anything about it
> >changing when ARIis enabled.  I can see why it would change in that
> >case, so maybe this is just a goof in the spec.
> 
> I think it is supposed to be HwInit because changing the value can
> cause issues with resource allocation for the VFs.  Specifically if
> the number of VFs increases after the BIOS has come through and
> assigned MMIO resources it is possible that there may not be
> resources available.

Maybe, although sufficiently smart software could deal with that by
reassigning resources.  Theoretically, anyway.

> I suspect we are going to end up having to quirk a number of devices
> in the future because of this as I can see this easily causing
> issues.

I guess the issue if we made this change would be:

  - BIOS sees "ARI Capable Hierarchy" is zero
  - BIOS sees TotalVFs = X
  - BIOS allocates space for X VFs (size = "S * X")
  - Linux sets ARI Capable Hierarchy
  - Linux sees TotalVFs = X + Y
  - Linux reads SR-IOV BAR, computes size as "S * (X + Y)"
  - Linux tries to claim SR-IOV BAR, but fails because size is now too
    large to fit where BIOS put it

Right?  What sort of quirk would you envision?  Something to keep us
from increasing "total" beyond what it was before we turned on ARI
Capable?

What problem does this patch solve, Ben?  I assume you have devices
that do change TotalVFs when ARI is enabled, and you do want the new
value?

Or is the problem something like the following:

  - ...
  - Linux PCI core sees TotalVFs = X (saved as iov->total_VFs)
  - Linux sets ARI Capable Hierarchy
  - Device changes TotalVFs to X + Y (but PCI core doesn't notice)
  - Driver reads TotalVFs and sees X + Y
  - Driver attempts pci_enable_sriov(dev, X + Y), which fails because
    sriov_enable() sees "X + Y > iov->total_VFs"

I'm a little dubious about drivers reading the SRIOV capability
directly, so maybe this is a symptom of deeper problems.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander H Duyck Oct. 15, 2015, 10:14 p.m. UTC | #4
On 10/15/2015 02:36 PM, Bjorn Helgaas wrote:
> On Thu, Oct 15, 2015 at 01:00:55PM -0700, Alexander Duyck wrote:
>> On 10/15/2015 10:58 AM, Bjorn Helgaas wrote:
>>> Hi Ben,
>>>
>>> On Thu, Oct 08, 2015 at 10:20:17AM -0500, Ben Shelton wrote:
>>>> For some SR-IOV devices, the number of available virtual functions increases
>>>> after enabling ARI.  Currently, SRIOV_NUM_VF is read and saved off before the
>>>> ARI control bit is enabled in SRIOV_CTRL.  This causes an issue when VFs are
>>>> enabled.
>>>>
>>>> At device init, SRIOV_INITIAL_VF and SRIOV_NUM_VF are specified to contain the
>>>> number of available VFs for the device.  sriov_enable() does a sanity check
>>>> that SRIOV_INITIAL_VF is not greater than iov->total_VFs, the saved-off value
>>>> of SRIOV_NUM_VF.  Since the value of both SRIOV_INITIAL_VF and SRIOV_NUM_VF has
>>>> increased after enabling the ARI bit, the check fails, and the VFs cannot be
>>>> enabled.
>>>>
>>>> To fix the issue, write SRIOV_CTRL first, and then read SRIOV_NUM_VF.
>>> I think you mean PCI_SRIOV_TOTAL_VR (not NUM_VF), right?
>>>
>>> This is interesting because the spec says TotalVFs is HwInit, which
>>> means it's read-only, and it doesn't mention anything about it
>>> changing when ARIis enabled.  I can see why it would change in that
>>> case, so maybe this is just a goof in the spec.
>> I think it is supposed to be HwInit because changing the value can
>> cause issues with resource allocation for the VFs.  Specifically if
>> the number of VFs increases after the BIOS has come through and
>> assigned MMIO resources it is possible that there may not be
>> resources available.
> Maybe, although sufficiently smart software could deal with that by
> reassigning resources.  Theoretically, anyway.
>
>> I suspect we are going to end up having to quirk a number of devices
>> in the future because of this as I can see this easily causing
>> issues.
> I guess the issue if we made this change would be:
>
>    - BIOS sees "ARI Capable Hierarchy" is zero
>    - BIOS sees TotalVFs = X
>    - BIOS allocates space for X VFs (size = "S * X")
>    - Linux sets ARI Capable Hierarchy
>    - Linux sees TotalVFs = X + Y
>    - Linux reads SR-IOV BAR, computes size as "S * (X + Y)"
>    - Linux tries to claim SR-IOV BAR, but fails because size is now too
>      large to fit where BIOS put it
>
> Right?  What sort of quirk would you envision?  Something to keep us
> from increasing "total" beyond what it was before we turned on ARI
> Capable?

The thing we would have to do in such a situation is force a 
reallocation of the BARs in the SR-IOV area.  Maybe instead of adding a 
quirk we could just add code here so that if totalVFs increases after we 
set ARI we clear the BAR registers and force reallocation.  If I am not 
mistaken the reallocation for unassigned bars would take place after 
this code is run so it is probably the right place to do it.

> What problem does this patch solve, Ben?  I assume you have devices
> that do change TotalVFs when ARI is enabled, and you do want the new
> value?
>
> Or is the problem something like the following:
>
>    - ...
>    - Linux PCI core sees TotalVFs = X (saved as iov->total_VFs)
>    - Linux sets ARI Capable Hierarchy
>    - Device changes TotalVFs to X + Y (but PCI core doesn't notice)
>    - Driver reads TotalVFs and sees X + Y
>    - Driver attempts pci_enable_sriov(dev, X + Y), which fails because
>      sriov_enable() sees "X + Y > iov->total_VFs"
>
> I'm a little dubious about drivers reading the SRIOV capability
> directly, so maybe this is a symptom of deeper problems.

I don't think the issue is the drivers reading the SR-IOV config, it is 
likely the end users.  They will want to get full use of the device and 
they would see the config lists something like 64 VFs being available 
via lspci, but the kernel would have them capped at 7.

I think instead of just moving the read we should read it before and 
after.  If the value increases we should just drop the contents of the 
base address registers so that they can be reallocated now that the 
memory footprint has changed.

- Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Shelton Oct. 16, 2015, 4:56 p.m. UTC | #5
Hi Bjorn,

> What problem does this patch solve, Ben?  I assume you have devices
> that do change TotalVFs when ARI is enabled, and you do want the new
> value?
> 
> Or is the problem something like the following:
> 
>   - ...
>   - Linux PCI core sees TotalVFs = X (saved as iov->total_VFs)
>   - Linux sets ARI Capable Hierarchy
>   - Device changes TotalVFs to X + Y (but PCI core doesn't notice)
>   - Driver reads TotalVFs and sees X + Y
>   - Driver attempts pci_enable_sriov(dev, X + Y), which fails because
>     sriov_enable() sees "X + Y > iov->total_VFs"

Here's a short snippet from the databook for the PCI Express controller we're
using:

"Supports two sets of VF Stride, First VF Offset, InitialVFs, and TotalVFs
registers per PF—one each for ARI and non-ARI hierarchies. Selection is
performed by host software through the ARI Capable Hierarchy bit of the Control
register in the PF0 SR-IOV capability structure."

The values in InitialVFs and TotalVFs are HWinit for each set of registers.

So the issue this is intended to fix is the following:

- Linux PCI core sees TotalVFs = X (saved as iov->total_VFs).
- Linux sets ARI Capable Hierarchy.
- Device switches to exposing the second set of registers, where
  InitialVFs = TotalVFs = Y (where Y > X).
- User enables one or more VFs on the device, e.g. by writing a value to
  sriov_numvfs in the sysfs.
- Driver calls pci_enable_sriov() for the device, which then calls
  sriov_enable().  sriov_enable() reads InitialVFs (= Y) and then checks if it's
  greater than iov->total_VFs (= X).  Since Y > X, the comparison is true, so
  sriov_enable() fails out and returns -EIO.

> 
> I'm a little dubious about drivers reading the SRIOV capability
> directly, so maybe this is a symptom of deeper problems.

I agree that the driver should not be reading the capability directly, but from
what I understand, it's intended for the device itself to do this.  From the PCI
SR-IOV spec revision 1.1:

"ARI Capable Hierarchy is a hint to the Device that ARI has been enabled in the
Root Port or Switch Downstream Port immediately above the Device."

Ben

> 
> Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Oct. 16, 2015, 6:07 p.m. UTC | #6
Hi Ben,

Thanks for the detailed response!

On Fri, Oct 16, 2015 at 11:56:28AM -0500, Ben Shelton wrote:
> Hi Bjorn,
> 
> > What problem does this patch solve, Ben?  I assume you have devices
> > that do change TotalVFs when ARI is enabled, and you do want the new
> > value?
> > 
> > Or is the problem something like the following:
> > 
> >   - ...
> >   - Linux PCI core sees TotalVFs = X (saved as iov->total_VFs)
> >   - Linux sets ARI Capable Hierarchy
> >   - Device changes TotalVFs to X + Y (but PCI core doesn't notice)
> >   - Driver reads TotalVFs and sees X + Y
> >   - Driver attempts pci_enable_sriov(dev, X + Y), which fails because
> >     sriov_enable() sees "X + Y > iov->total_VFs"
> 
> Here's a short snippet from the databook for the PCI Express controller we're
> using:
> 
> "Supports two sets of VF Stride, First VF Offset, InitialVFs, and TotalVFs
> registers per PF—one each for ARI and non-ARI hierarchies. Selection is
> performed by host software through the ARI Capable Hierarchy bit of the Control
> register in the PF0 SR-IOV capability structure."
> 
> The values in InitialVFs and TotalVFs are HWinit for each set of registers.

The HwInit description says "bits are read-only after initialization
and can only be reset (for write-once by firmware) with 'Power Good
Reset'."  I don't see any provision for different values based on a
control register bit, so I think this device is actually out of spec.

We should be able to deal with it, so it's not that big a deal, but we
will have to keep it in mind and probably mention it in a comment in
the code.

> So the issue this is intended to fix is the following:
> 
> - Linux PCI core sees TotalVFs = X (saved as iov->total_VFs).
> - Linux sets ARI Capable Hierarchy.
> - Device switches to exposing the second set of registers, where
>   InitialVFs = TotalVFs = Y (where Y > X).
> - User enables one or more VFs on the device, e.g. by writing a value to
>   sriov_numvfs in the sysfs.
> - Driver calls pci_enable_sriov() for the device, which then calls
>   sriov_enable().  sriov_enable() reads InitialVFs (= Y) and then checks if it's
>   greater than iov->total_VFs (= X).  Since Y > X, the comparison is true, so
>   sriov_enable() fails out and returns -EIO.

I think there are two problems here:

  1) We should be reading some registers together to make sure we get
     consistent values.  For example, we always read VFOffset and
     VFStride immediately after writing NumVFs.  I think we should
     read InitialVFs and TotalVFs together.  I don't see the point of
     reading TotalVFs in sriov_init() and reading InitialVFs in
     sriov_enable().  If we read them both in sriov_init(), I don't
     think we'd have this problem of sriov_enable() returning -EIO.

  2) To work around this non-compliant device, we should read InitialVFs
     and TotalVFs after setting the ARI bit.

Ideally, I think this would be two patches: one to move the InitialVFs
read from sriov_enable() to sriov_init(), and a second to move the
pair from before setting ARI to after.

> > I'm a little dubious about drivers reading the SRIOV capability
> > directly, so maybe this is a symptom of deeper problems.
> 
> I agree that the driver should not be reading the capability directly, but from
> what I understand, it's intended for the device itself to do this.  From the PCI
> SR-IOV spec revision 1.1:
> 
> "ARI Capable Hierarchy is a hint to the Device that ARI has been enabled in the
> Root Port or Switch Downstream Port immediately above the Device."

Sure, of course, the device should behave differently based on how the
registers are programmed; that's the whole point of having writable
registers.  I think the particular case of the device changing HwInit
registers is out of spec, but changing things like VFOffset and
VFStride is completely expected.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..0174044 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -399,10 +399,6 @@  static int sriov_init(struct pci_dev *dev, int pos)
 		ssleep(1);
 	}
 
-	pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, &total);
-	if (!total)
-		return 0;
-
 	ctrl = 0;
 	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
 		if (pdev->is_physfn)
@@ -414,6 +410,11 @@  static int sriov_init(struct pci_dev *dev, int pos)
 
 found:
 	pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+
+	pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, &total);
+	if (!total)
+		return 0;
+
 	pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0);
 	pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
 	pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);