
pci: Use a bus-global mutex to protect VPD operations

Message ID 20150519000037.56109.68356.stgit@mdrustad-wks.jf.intel.com
State Awaiting Upstream

Commit Message

Rustad, Mark D May 19, 2015, midnight UTC
Some devices have a problem with concurrent VPD access to different
functions of the same physical device, so move the protecting mutex
from the pci_vpd structure to the pci_bus structure. There are a
number of reports on support sites for a variety of devices from
various vendors getting the "vpd r/w failed" message. This is likely
to at least fix some of them. Thanks to Shannon Nelson for helping
to come up with this approach.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/pci/access.c |   10 ++++------
 drivers/pci/probe.c  |    1 +
 2 files changed, 5 insertions(+), 6 deletions(-)

Comments

Alexander Duyck May 19, 2015, 5:55 p.m. UTC | #1
On 05/18/2015 05:00 PM, Mark D Rustad wrote:
> Some devices have a problem with concurrent VPD access to different
> functions of the same physical device, so move the protecting mutex
> from the pci_vpd structure to the pci_bus structure. There are a
> number of reports on support sites for a variety of devices from
> various vendors getting the "vpd r/w failed" message. This is likely
> to at least fix some of them. Thanks to Shannon Nelson for helping
> to come up with this approach.
>
> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
> Acked-by: Shannon Nelson <shannon.nelson@intel.com>
> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Instead of moving the mutex lock around you would be much better served 
by simply removing the duplicate VPD entries for a given device in a 
PCIe quirk.  Then you can save yourself the extra pain and effort of 
having to deal with serialized VPD accesses for a multifunction device.

The logic for the quirk should be fairly simple.
   1.  Scan for any other devices with VPD that share the same bus and 
device number.
   2.  If bdf is equal to us keep searching.
   3.  If bdf is less than our bdf we release our VPD area and set VPD 
pointer to NULL.

- Alex
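
A minimal sketch of the three-step quirk outlined above, assuming the 2015-era PCI core interfaces: the function name is made up, the PCI_ANY_ID match is far broader than a real quirk would use, and a real patch would free the VPD state through its release op instead of just clearing the pointer.

#include <linux/pci.h>

/* Hypothetical fixup: hide this function's VPD if a lower-numbered
 * function of the same physical device (same bus and device number)
 * already exposes one. */
static void quirk_drop_duplicate_vpd(struct pci_dev *dev)
{
	struct pci_dev *sibling;

	if (!dev->vpd)
		return;

	list_for_each_entry(sibling, &dev->bus->devices, bus_list) {
		if (sibling == dev || !sibling->vpd)
			continue;	/* step 2: skip ourselves and VPD-less functions */
		if (PCI_SLOT(sibling->devfn) != PCI_SLOT(dev->devfn))
			continue;	/* step 1: only functions of the same device */
		if (sibling->devfn < dev->devfn) {
			/* step 3: a lower function already owns the VPD */
			dev->vpd = NULL;	/* a real patch would release it properly */
			return;
		}
	}
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
			quirk_drop_duplicate_vpd);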
Rustad, Mark D May 19, 2015, 6:28 p.m. UTC | #2
> On May 19, 2015, at 10:55 AM, Alexander Duyck <alexander.h.duyck@redhat.com> wrote:
> 
> On 05/18/2015 05:00 PM, Mark D Rustad wrote:
>> Some devices have a problem with concurrent VPD access to different
>> functions of the same physical device, so move the protecting mutex
>> from the pci_vpd structure to the pci_bus structure. There are a
>> number of reports on support sites for a variety of devices from
>> various vendors getting the "vpd r/w failed" message. This is likely
>> to at least fix some of them. Thanks to Shannon Nelson for helping
>> to come up with this approach.
>> 
>> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
>> Acked-by: Shannon Nelson <shannon.nelson@intel.com>
>> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> 
> Instead of moving the mutex lock around you would be much better served by simply removing the duplicate VPD entries for a given device in a PCIe quirk.  Then you can save yourself the extra pain and effort of having to deal with serialized VPD accesses for a multifunction device.
> 
> The logic for the quirk should be fairly simple.
>  1.  Scan for any other devices with VPD that share the same bus and device number.
>  2.  If bdf is equal to us keep searching.
>  3.  If bdf is less than our bdf we release our VPD area and set VPD pointer to NULL.

I could do that. If this issue only affected Intel devices, I would be more inclined to consider something like that. I am avoiding discussing other vendors directly, so please go and do a Google search on "vpd r/w failed" and see if you really want to quirk all those devices. It isn't just Intel and it isn't just networking.

If after doing that you still feel that this isn't the best solution, I can go and cook up something much bigger (I already had two much bigger patches that I abandoned in favor of this approach). Bear in mind that the quirk code is dead weight in all the kernels.

As you said in another message, VPD is not that vital. Given that, I would think that some possible needless synchronization that resolves real problems for many devices should be acceptable. If it was synchronizing all config space or something that would be different, but this is just VPD.

--
Mark Rustad, Networking Division, Intel Corporation
Alexander Duyck May 19, 2015, 8:58 p.m. UTC | #3
On 05/19/2015 11:28 AM, Rustad, Mark D wrote:
>> On May 19, 2015, at 10:55 AM, Alexander Duyck <alexander.h.duyck@redhat.com> wrote:
>>
>> On 05/18/2015 05:00 PM, Mark D Rustad wrote:
>>> Some devices have a problem with concurrent VPD access to different
>>> functions of the same physical device, so move the protecting mutex
>>> from the pci_vpd structure to the pci_bus structure. There are a
>>> number of reports on support sites for a variety of devices from
>>> various vendors getting the "vpd r/w failed" message. This is likely
>>> to at least fix some of them. Thanks to Shannon Nelson for helping
>>> to come up with this approach.
>>>
>>> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
>>> Acked-by: Shannon Nelson <shannon.nelson@intel.com>
>>> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> Instead of moving the mutex lock around you would be much better served by simply removing the duplicate VPD entries for a given device in a PCIe quirk.  Then you can save yourself the extra pain and effort of having to deal with serialized VPD accesses for a multifunction device.
>>
>> The logic for the quirk should be fairly simple.
>>   1.  Scan for any other devices with VPD that share the same bus and device number.
>>   2.  If bdf is equal to us keep searching.
>>   3.  If bdf is less than our bdf we release our VPD area and set VPD pointer to NULL.
> I could do that. If this issue only affected Intel devices, I would be more inclined to consider something like that. I am avoiding discussing other vendors directly, so please go and do a Google search on "vpd r/w failed" and see if you really want to quirk all those devices. It isn't just Intel and it isn't just networking.

I think you are assuming too much.  Have you root caused the other 
devices?  All the "vpd r/w failed" means is that a given read or write 
timed out.  There are any number of possible reasons for that including 
VPD being unimplemented in firmware, a device falling off of the bus, 
and many others.  What you have is a common symptom; it doesn't mean it 
is a common ailment for all the other parts.

> If after doing that you still feel that this isn't the best solution, I can go and cook up something much bigger (I already had two much bigger patches that I abandoned in favor of this approach). Bear in mind that the quirk code is dead weight in all the kernels.

The quirk code is there to deal with buggy hardware implementations and 
that is what it sounds like you have in this case.  I highly doubt all 
of the other vendors implemented the same mistake.  If they did then 
maybe you should implement the quirk as a new function that can be 
used by any of the devices that implement VPD with the same limitation.

> As you said in another message, VPD is not that vital. Given that, I would think that some possible needless synchronization that resolves real problems for many devices should be acceptable. If it was synchronizing all config space or something that would be different, but this is just VPD.

The problem is it is needless synchronization, and it only resolves 
problems for some very specific devices.  The extra overhead for 
serializing what are already slow accesses can be very high, so I am 
against it.  For example, it will likely have unintended consequences on 
things like virtual machines where all direct-assigned devices end up on 
the same bus.

We know that all Intel Ethernet parts to date only implement one VPD 
area per EEPROM, and that all functions in a multi-function Ethernet 
part share the same EEPROM, so the fix is to present only one VPD area 
so that the data is still there, without the possibility of multiple 
access.  It seems like function 0 would be the obvious choice in that 
case, so if somebody wants to be able to do their inventory they can go 
through and pull inventory information from function 0.  Really this is 
how this should have probably been implemented in the hardware in the 
first place if they couldn't support multiple accesses.

- Alex
Rustad, Mark D May 19, 2015, 9:53 p.m. UTC | #4
> On May 19, 2015, at 1:58 PM, Alexander Duyck <alexander.h.duyck@redhat.com> wrote:
> 
> I think you are assuming too much.  Have you root caused the other devices?

It is possible that you are right, but I think it is much more likely that this is a common design problem with many devices.

> All the "vpd r/w failed" means is that a given read or write timed out.  There are any number of possible reasons for that including VPD being unimplemented in firmware, a device falling off of the bus, and many others.  What you have is a common symptom, it doesn't mean it is a common ailment for all the other parts.

Yes, but it sure looks like many PCIe designers have shared the VPD registers across functions. It isn't just that the area they address is shared; the register implementations are also shared across functions. That is the case with all the Intel devices (it is indicated in the datasheets), and I'm guessing that other designers have done the same thing. It is a guess at this point, but I think it is a very good one.

>> If after doing that you still feel that this isn't the best solution, I can go and cook up something much bigger (I already had two much bigger patches that I abandoned in favor of this approach). Bear in mind that the quirk code is dead weight in all the kernels.
> 
> The quirk code is there to deal with buggy hardware implementations and that is what it sounds like you have in this case.  I highly doubt all of the other vendors implemented the same mistake.  If they did then maybe you should implement the quirk as a new function that can be used by any of the devices that implement VPD with the same limitation.

Just the device table to list the affected devices, even if Intel is the only vendor with this issue, will be very much larger than this patch and every kernel will carry that weight.

>> As you said in another message, VPD is not that vital. Given that, I would think that some possible needless synchronization that resolves real problems for many devices should be acceptable. If it was synchronizing all config space or something that would be different, but this is just VPD.
> 
> The problem is it is needless synchronization, and it only resolves problems for some very specific devices.  The extra overhead for serializing what is already slow accesses can be very high so I am against it.

In any PCIe environment, it only serializes VPD access for one physical device's functions. That just isn't that bad.

> For example it will likely have unintended consequences on things like virtual machines where all direct assigned devices end up on the same bus.

Direct-assigned devices to virtual machines have problems here that I can't address. Hypervisors would have to be involved in resolving that problem, if it is important enough to do. At present I could only suggest that virtual machines should never access VPD at all. This is a problem. At least VFs, which are more commonly assigned to guests, do not have VPD.

Cutting off the VPD for non-zero functions will have consequences as well. Some may notice devices dropping out of their inventory. That can't be a good thing and could upset customers. I think that is a far larger problem with your approach. Arguably, that is a kernel interface change.

> We know that all Intel Ethernet parts to date only implement one VPD area per EEPROM, and that all functions in a multi-function Ethernet part share the same EEPROM, so the fix is to only present one VPD area so that the data is still there, without the ability of multiple access.  It seems like function 0 would be the obvious choice in that case so if somebody wants to be able to do their inventory they can go through and pull inventory information from function 0.  Really this is how this should have probably been implemented in the hardware in the first place if they couldn't support multiple accesses.

No argument from me there, but I don't have a time machine to go back and try to change the thinking on VPD. I'm sure it would take more gates to make the functions different, just as it would take more gates to do it right. Which is why I think the problem is as widespread as the Google results suggest.

Maybe I should send one of my big, ugly patches that only addresses the problem for Intel devices. This patch will look good in comparison.

Isn't getting rid of scary messages that upset customers worth something?

--
Mark Rustad, Networking Division, Intel Corporation
Jesse Brandeburg May 19, 2015, 11:01 p.m. UTC | #5
On Tue, 19 May 2015 10:55:03 -0700
Alexander Duyck <alexander.h.duyck@redhat.com> wrote:

> On 05/18/2015 05:00 PM, Mark D Rustad wrote:
> > Some devices have a problem with concurrent VPD access to different
> > functions of the same physical device, so move the protecting mutex
> > from the pci_vpd structure to the pci_bus structure. There are a
> > number of reports on support sites for a variety of devices from
> > various vendors getting the "vpd r/w failed" message. This is likely
> > to at least fix some of them. Thanks to Shannon Nelson for helping
> > to come up with this approach.
> >
> > Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
> > Acked-by: Shannon Nelson <shannon.nelson@intel.com>
> > Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> 
> Instead of moving the mutex lock around you would be much better served 
> by simply removing the duplicate VPD entries for a given device in a 
> PCIe quirk.  Then you can save yourself the extra pain and effort of 
> having to deal with serialized VPD accesses for a multifunction device.
> 
> The logic for the quirk should be fairly simple.
>    1.  Scan for any other devices with VPD that share the same bus and 
> device number.
>    2.  If bdf is equal to us keep searching.
>    3.  If bdf is less than our bdf we release our VPD area and set VPD 
> pointer to NULL.

But Alex if you do this you're violating the principle of least
surprise, not to mention changing a user-space interface which should
not be done.

Mark's solution is pretty graceful and solves the issue at heart, which
is that
1) several Intel chips have this issue
2) it appears that several other vendor's chips have this issue (or
similar) as well, but even if they don't Mark's fix will not change
their general operation, only make a small serializing effect when
multiple simultaneous reads are made.

This is a reasonably small fix, with a small kernel footprint, which
does not require changing user expectations or violating user-space
semantics that are already established, so I support it as is.
Alexander Duyck May 19, 2015, 11:19 p.m. UTC | #6
On 05/19/2015 02:53 PM, Rustad, Mark D wrote:
>> On May 19, 2015, at 1:58 PM, Alexander Duyck <alexander.h.duyck@redhat.com> wrote:
>>
>> I think you are assuming too much.  Have you root caused the other devices?
> It is possible that you are right, but I think it is much more likely that this is a common design problem with many devices.

Well, for example, can you explain to me how this can help a single-function 
device?  I see this issue reported with things such as the r8169, 
which last I knew was a single-function interface.  I fail to see how 
moving the lock would solve VPD issues reported on this part.

>> All the "vpd r/w failed" means is that a given read or write timed out.  There are any number of possible reasons for that including VPD being unimplemented in firmware, a device falling off of the bus, and many others.  What you have is a common symptom, it doesn't mean it is a common ailment for all the other parts.
> Yes, but it sure looks like many PCIe designers have shared the VPD registers across functions. It isn't just that the area they address is shared, it is that the register implementations are also shared across functions. That is the case with all the Intel devices (it is indicated in the datasheets) and I'm guessing that other designers have done the same thing. It is a guess at this point, but I think it is a very good one.

Hard to say unless you have also looked into a number of other devices 
to verify this.  For example, in the case of one of the HP parts out 
there that the VPD issue was reported for, they suggested installing 
their configuration tools and writing the firmware, as I am guessing they 
had some bad images floating around out there.  The problem is that it is a 
very common symptom; I have seen it myself on several boards, and I 
can guarantee you the issue wasn't something that would be solved by a 
shared lock on a single bus.

>
>>> If after doing that you still feel that this isn't the best solution, I can go and cook up something much bigger (I already had two much bigger patches that I abandoned in favor of this approach). Bear in mind that the quirk code is dead weight in all the kernels.
>> The quirk code is there to deal with buggy hardware implementations and that is what it sounds like you have in this case.  I highly doubt all of the other vendors implemented the same mistake.  If they did then maybe you should implement the quirk as a new function that can be used by any of the devices that implement VPD with the same limitation.
> Just the device table to list the affected devices, even if Intel is the only vendor with this issue, will be very much larger than this patch and every kernel will carry that weight.

The device table could be something like what you did in your other 
patch.  Anything that is Intel and uses Ethernet is affected currently.  
I think you might be overthinking it.  Don't try to fix other vendors' 
hardware unless you actually have a root cause for the issues you are 
seeing on it.  If you are aware of other vendors' hardware that has the 
same issue, you should call it out and Cc the maintainers for those 
drivers so that they can verify the fix on their hardware.

>>> As you said in another message, VPD is not that vital. Given that, I would think that some possible needless synchronization that resolves real problems for many devices should be acceptable. If it was synchronizing all config space or something that would be different, but this is just VPD.
>> The problem is it is needless synchronization, and it only resolves problems for some very specific devices.  The extra overhead for serializing what is already slow accesses can be very high so I am against it.
> In any PCIe environment, it only serializes VPD access for one physical device's functions. That just isn't that bad.

No, it serializes accesses on one bus.  So for example all of the 
devices on a root complex would be affected including things like LAN 
built into a motherboard.  The issue is that virtualization emulates 
that as the default configuration so in the case of virtualization you 
would be placing all of the devices on the same bus.

>> For example it will likely have unintended consequences on things like virtual machines where all direct assigned devices end up on the same bus.
> Direct-assigned devices to virtual machines have problems here that I can't address. Hypervisors would have to be involved in resolving that problem, if it is important enough to do. At present I could only suggest that virtual machines should never access VPD at all. This is a problem. At least VFs, which are more commonly assigned to guests, do not have VPD.

If needed a quirk could probably be added for KVM but that would be a 
separate patch.

> Cutting off the VPD for non-zero functions will have consequences as well. Some may notice devices dropping out of their inventory. That can't be a good thing and could upset customers. I think that is a far larger problem with your approach. Arguably, that is a kernel interface change.

The alternative if you don't want to free it would be to add a pointer 
and reference count to pci_vpd_pci22 so that you could share the mutex 
between multiple functions in the case of hardware that has this issue.  
I just don't want things being grouped by bus as there are scenarios, 
virtualization specifically, where that can get messy as you would be 
combining multiple unrelated devices under the same lock.
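
A rough shape for that alternative, purely as a sketch against the 2015-era drivers/pci/access.c: the shared structure, the "shared" member and the helper below are hypothetical additions, and a full patch would still need the allocation, kref_init() and release handling.

#include <linux/kref.h>
#include <linux/mutex.h>
#include <linux/pci.h>
#include "pci.h"	/* struct pci_vpd, for code living in drivers/pci/ */

struct pci_vpd_shared {			/* one instance per physical device */
	struct mutex	lock;
	struct kref	kref;
};

struct pci_vpd_pci22 {
	struct pci_vpd	base;
	struct pci_vpd_shared *shared;	/* replaces the per-function mutex */
	u16	flag;
	bool	busy;
	u8	cap;
};

/* At init time, borrow a sibling function's lock if one already exists;
 * the caller allocates and kref_init()s a fresh one when this returns NULL. */
static struct pci_vpd_shared *pci_vpd_shared_find(struct pci_dev *dev)
{
	struct pci_dev *fn;

	list_for_each_entry(fn, &dev->bus->devices, bus_list) {
		struct pci_vpd_pci22 *vpd;

		if (fn == dev || !fn->vpd ||
		    PCI_SLOT(fn->devfn) != PCI_SLOT(dev->devfn))
			continue;
		vpd = container_of(fn->vpd, struct pci_vpd_pci22, base);
		kref_get(&vpd->shared->kref);
		return vpd->shared;
	}
	return NULL;
}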

>> We know that all Intel Ethernet parts to date only implement one VPD area per EEPROM, and that all functions in a multi-function Ethernet part share the same EEPROM, so the fix is to only present one VPD area so that the data is still there, without the ability of multiple access.  It seems like function 0 would be the obvious choice in that case so if somebody wants to be able to do their inventory they can go through and pull inventory information from function 0.  Really this is how this should have probably been implemented in the hardware in the first place if they couldn't support multiple accesses.
> No argument from me there, but I don't have a time machine to go back and try to change the thinking on VPD. I'm sure it would take more gates to make the functions different, just as it would take more gates to do it right. Which is why I think the problem is as widespread as the Google results suggest.
>
> Maybe I should send one of my big, ugly patches that only addresses the problem for Intel devices. This patch will look good in comparison.
>
> Isn't getting rid of scary messages that upset customers worth something?

Getting rid of scary messages is one thing.  Hurting performance and 
adding unnecessary locking overhead is another.  The problem with your 
"fix" is that I feel you are applying far to wide of a net for this.  I 
would understand if you applied it to all Intel Ethernet hardware and 
grouped functions of the same device, but you are applying it to 
everything that supports VPD and you are doing it by bus.

The problem is specific to Intel Ethernet silicon.  Unless you can show 
me another part that has the same issue I would say you should probably 
just add a quirk for your part.

Also you might try taking a look at the function 
quirk_brcm_570x_limit_vpd().  It seems that there are parts that hang 
for example if you read past the end marker for VPD.  I would find it 
much more likely that this would be responsible for many of the hangs 
reported out there since the driver is doing something that the PCI spec 
does not define the behavior for.

- Alex
Alexander Duyck May 20, 2015, 12:07 a.m. UTC | #7
On 05/19/2015 04:01 PM, Jesse Brandeburg wrote:
> On Tue, 19 May 2015 10:55:03 -0700
> Alexander Duyck <alexander.h.duyck@redhat.com> wrote:
>
>> On 05/18/2015 05:00 PM, Mark D Rustad wrote:
>>> Some devices have a problem with concurrent VPD access to different
>>> functions of the same physical device, so move the protecting mutex
>>> from the pci_vpd structure to the pci_bus structure. There are a
>>> number of reports on support sites for a variety of devices from
>>> various vendors getting the "vpd r/w failed" message. This is likely
>>> to at least fix some of them. Thanks to Shannon Nelson for helping
>>> to come up with this approach.
>>>
>>> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
>>> Acked-by: Shannon Nelson <shannon.nelson@intel.com>
>>> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> Instead of moving the mutex lock around you would be much better served
>> by simply removing the duplicate VPD entries for a given device in a
>> PCIe quirk.  Then you can save yourself the extra pain and effort of
>> having to deal with serialized VPD accesses for a multifunction device.
>>
>> The logic for the quirk should be fairly simple.
>>     1.  Scan for any other devices with VPD that share the same bus and
>> device number.
>>     2.  If bdf is equal to us keep searching.
>>     3.  If bdf is less than our bdf we release our VPD area and set VPD
>> pointer to NULL.
> But Alex if you do this you're violating the principle of least
> surprise, not to mention changing a user-space interface which should
> not be done.

I'm willing to back off on dropping the VPD info for those functions 
entirely, but the lock should not be pushed to the bus.

> Mark's solution is pretty graceful and solves the issue at heart, which
> is that
> 1) several Intel chips have this issue
> 2) it appears that several other vendor's chips have this issue (or
> similar) as well, but even if they don't Mark's fix will not change
> their general operation, only make a small serializing effect when
> multiple simultaneous reads are made.

2 is based on a false premise.  The "vpd r/w failed" error is about as 
common as dev_watchdog().  Just because it presents with a similar 
symptom doesn't mean it is the same issue.

> This is a reasonably small fix, with a small kernel footprint, which
> does not require changing user expectations or violating user-space
> semantics that are already established, so I support it as is.

I am not against the shared lock approach, but the bus is the wrong 
place for this.  Sharing a bus does not mean that the devices are all 
part of the same chip; it only means they share a bus.  I would guess 
that this fix has not been tested with any LOM parts such as e1000e, or 
in a virtualization environment, as those would exhibit different 
behavior with this patch.  For example, does it make sense for an e1000e 
LOM to be joined at the hip with a SATA or USB controller?  They could 
all be from different manufacturers with different requirements.

If the bug is in Intel Ethernet with VPD then I would suggest tweaking 
the VPD logic and adding an Intel Ethernet PCI quirk.  It doesn't make 
sense to assume based on one common error message that all of creation 
has the same issue.

If anything I believe Mark's patches have revealed a bigger issue. That 
is the fact that the sysfs file is reading outside of the VPD area which 
the PCI spec doesn't have a defined behavior for.  I suspect this is the 
cause of a number of the issues being reported as Broadcom had to 
specifically quirk to prevent it, and I found one discussion that 
indicated something similar might be needed for Realtek.

- Alex
Rustad, Mark D May 20, 2015, 12:34 a.m. UTC | #8
> On May 19, 2015, at 5:07 PM, Alexander Duyck <alexander.h.duyck@redhat.com> wrote:
> 
> 
> 
> On 05/19/2015 04:01 PM, Jesse Brandeburg wrote:

>> But Alex if you do this you're violating the principle of least
>> surprise, not to mention changing a user-space interface which should
>> not be done.
> 
> I'm willing to back off on dropping the VPD info for those functions entirely, but the lock should not be pushed to the bus.

Yeah, I think suddenly dropping the VPD from non-0 functions would be disruptive.

>> Mark's solution is pretty graceful and solves the issue at heart, which
>> is that
>> 1) several Intel chips have this issue
>> 2) it appears that several other vendor's chips have this issue (or
>> similar) as well, but even if they don't Mark's fix will not change
>> their general operation, only make a small serializing effect when
>> multiple simultaneous reads are made.
> 
> 2 is based on a false premise.  The "vpd r/w failed" error is about as common as dev_watchdog().  Just because it presents with a similar symptom doesn't mean it is the same issue.

I don't know if it is false, but it is possible that other devices could have the same behavior. I didn't expect that it would fix them all by any means, but I figured there would be some fellow travelers.

> If the bug is in Intel Ethernet with VPD then I would suggest tweaking the VPD logic and adding an Intel Ethernet PCI quirk.  It doesn't make sense to assume based on one common error message that all of creation has the same issue.

> If anything I believe Mark's patches have revealed a bigger issue. That is the fact that the sysfs file is reading outside of the VPD area which the PCI spec doesn't have a defined behavior for.  I suspect this is the cause of a number of the issues being reported as Broadcom had to specifically quirk to prevent it, and I found one discussion that indicated something similar might be needed for Realtek.

It turns out that I missed something very important here - the state of the F bit. Because of how that works, and how the kernel knows what the last access was, it is vital to know which address/data registers are shared and which ones aren't. This is going to result in a much bigger fix. It will be necessary to positively know when this register sharing is happening. This will result in significant changes to the VPD code in order to model the behavior right. Essentially, devices with this issue will need to have the vpd pointer point to the same structure. That automatically fixes the locking issue. I will look into what can be done for KVM while I am at it. It will be a big device table, but that is unavoidable.

Doggone it. It seemed too good to be true yesterday and now I know that is because it is. So close. If only it weren't for VPD writes... I'm going to start over now.

--
Mark Rustad, Networking Division, Intel Corporation
Alexander Duyck May 20, 2015, 1:02 a.m. UTC | #9
On 05/19/2015 05:34 PM, Rustad, Mark D wrote:
>> On May 19, 2015, at 5:07 PM, Alexander Duyck <alexander.h.duyck@redhat.com> wrote:
>>
>>
>>
>> On 05/19/2015 04:01 PM, Jesse Brandeburg wrote:
>>> But Alex if you do this you're violating the principle of least
>>> surprise, not to mention changing a user-space interface which should
>>> not be done.
>> I'm willing to back off on dropping the VPD info for those functions entirely, but the lock should not be pushed to the bus.
> Yeah, I think suddenly dropping the VPD from non-0 functions would be disruptive.
>
>>> Mark's solution is pretty graceful and solves the issue at heart, which
>>> is that
>>> 1) several Intel chips have this issue
>>> 2) it appears that several other vendor's chips have this issue (or
>>> similar) as well, but even if they don't Mark's fix will not change
>>> their general operation, only make a small serializing effect when
>>> multiple simultaneous reads are made.
>> 2 is based on a false premise.  The "vpd r/w failed" error is about as common as dev_watchdog().  Just because it presents with a similar symptom doesn't mean it is the same issue.
> I don't know if it is false, but it is possible that other devices could have the same behavior. I didn't expect that it would fix them all by any means, but I figured there would be some fellow travelers.
>
>> If the bug is in Intel Ethernet with VPD then I would suggest tweaking the VPD logic and adding an Intel Ethernet PCI quirk.  It doesn't make sense to assume based on one common error message that all of creation has the same issue.
>> If anything I believe Mark's patches have revealed a bigger issue. That is the fact that the sysfs file is reading outside of the VPD area which the PCI spec doesn't have a defined behavior for.  I suspect this is the cause of a number of the issues being reported as Broadcom had to specifically quirk to prevent it, and I found one discussion that indicated something similar might be needed for Realtek.
> It turns out that I missed something very important here - the state of the F bit. Because of how that works, and how the kernel knows what the last access was, it is vital to know which address/data registers are shared and which ones aren't. This is going to result in a much bigger fix. It will be necessary to positively know when this register sharing is happening. This will result in significant changes to the VPD code in order to model the behavior right. Essentially, devices with this issue will need to have the vpd pointer point to the same structure. That automatically fixes the locking issue. I will look into what can be done for KVM while I am at it. It will be a big device table, but that is unavoidable.
>
> Doggone it. It seemed too good to be true yesterday and now I know that is because it is. So close. If only it weren't for VPD writes... I'm going to start over now.

My suspicion is that we have a number of bugs floating around out there 
like the Broadcom issue.  Specifically, one of the ones I found was that 
the r8169 seems to have a similar issue as called out in the email 
thread at http://permalink.gmane.org/gmane.linux.network/232260.  I'm 
wondering if we shouldn't add an initializer for the read/write 
functions that will go through and pull out the 3 or 4 headers from the 
VPD data needed to get the actual length.  Then it would lock down the 
VPD and save some serious time on reads since most devices don't have 
32K of VPD to read.

- Alex
Rustad, Mark D May 20, 2015, 4 p.m. UTC | #10
> On May 19, 2015, at 6:02 PM, Alexander Duyck <alexander.h.duyck@redhat.com> wrote:
> 
> My suspicion is that we have a number of bugs floating around out there like the Broadcom issue.  Specifically, one of the ones I found was that the r8169 seems to have a similar issue as called out in the email thread at http://permalink.gmane.org/gmane.linux.network/232260.  I'm wondering if we shouldn't add an initializer for the read/write functions that will go through and pull out the 3 or 4 headers from the VPD data needed to get the actual length.  Then it would lock down the VPD and save some serious time on reads since most devices don't have 32K of VPD to read.

That is interesting. I noticed that there are functions already present to find VPD tags. If the VPD were invalid, would this block its being read at all, or would it default to allow reading/writing anything? I don't know if there might be people using Linux to completely write the VPD area. Presumably your idea would prevent rewriting the VPD area to something larger.

--
Mark Rustad, Networking Division, Intel Corporation
Alexander H Duyck May 20, 2015, 9:26 p.m. UTC | #11
On 05/20/2015 09:00 AM, Rustad, Mark D wrote:
>> On May 19, 2015, at 6:02 PM, Alexander Duyck <alexander.h.duyck@redhat.com> wrote:
>>
>> My suspicion is that we have a number of bugs floating around out there like the Broadcom issue.  Specifically, one of the ones I found was that the r8169 seems to have a similar issue as called out in the email thread at http://permalink.gmane.org/gmane.linux.network/232260.  I'm wondering if we shouldn't add an initializer for the read/write functions that will go through and pull out the 3 or 4 headers from the VPD data needed to get the actual length.  Then it would lock down the VPD and save some serious time on reads since most devices don't have 32K of VPD to read.
> That is interesting. I noticed that there are functions already present to find VPD tags. If the VPD were invalid, would this block its being read at all, or would it default to allow reading/writing anything? I don't know if there might be people using Linux to completely write the VPD area. Presumably your idea would prevent rewriting the VPD area to something larger.

What we probably would need to do is split the vpd read/write functions 
up a bit further as it turns out some vendors are using it as a means of 
reading/writing the EEPROM for the device.  So we could have something 
like maybe a _raw version of the read/write and one that is intended for 
actually reading VPD.  The VPD one could call something to initialize a 
set of offsets for the read-only descriptor, the read-write descriptor, 
and the end descriptor.  If any read/write goes past the end descriptor 
you could then just return 0 for the read value or skip it for the write.

By my math that means only having to read at most 6 locations in order 
to fill in all the descriptor info, and then you could save significant 
time on VPD reads for all drivers because it would cut the 32K read 
down to something like 256 bytes, which is the more common VPD size.
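
For reference, that size probe could look something like the sketch below. pci_read_vpd() and the PCI_VPD_* tag helpers already exist in the kernel; the function name, the 32K cap macro and the choice to return -EIO when no End tag is found are made up for illustration.

#include <linux/pci.h>

#define PCI_VPD_MAX_SIZE	32768	/* mirrors PCI_VPD_PCI22_SIZE, which is local to access.c */

/* Walk the handful of resource tags once and report where the End tag
 * sits, so later reads can be clamped to the real VPD image instead of
 * blindly walking the full 32K window. */
static int pci_vpd_probe_size(struct pci_dev *dev, size_t *size)
{
	loff_t off = 0;
	u8 hdr[3];

	while (off < PCI_VPD_MAX_SIZE) {
		if (pci_read_vpd(dev, off, 1, hdr) != 1)
			return -EIO;

		if (hdr[0] & PCI_VPD_LRDT) {
			/* Large resource: 3-byte header with a 16-bit length */
			if (pci_read_vpd(dev, off, 3, hdr) != 3)
				return -EIO;
			off += PCI_VPD_LRDT_TAG_SIZE + pci_vpd_lrdt_size(hdr);
		} else {
			/* Small resource: length lives in the low 3 bits */
			if ((hdr[0] & ~PCI_VPD_SRDT_LEN_MASK) == PCI_VPD_SRDT_END) {
				*size = off + PCI_VPD_SRDT_TAG_SIZE;
				return 0;
			}
			off += PCI_VPD_SRDT_TAG_SIZE + pci_vpd_srdt_size(hdr);
		}
	}
	return -EIO;	/* no End tag found: leave the 32K default alone */
}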

- Alex
Bjorn Helgaas May 27, 2015, 5:27 p.m. UTC | #12
On Mon, May 18, 2015 at 05:00:37PM -0700, Mark D Rustad wrote:
> Some devices have a problem with concurrent VPD access to different
> functions of the same physical device, so move the protecting mutex
> from the pci_vpd structure to the pci_bus structure. There are a
> number of reports on support sites for a variety of devices from
> various vendors getting the "vpd r/w failed" message. This is likely
> to at least fix some of them. Thanks to Shannon Nelson for helping
> to come up with this approach.
> 
> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
> Acked-by: Shannon Nelson <shannon.nelson@intel.com>
> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

It sounds like there are real problems here that would be fixed by changing
the mutex strategy and limiting VPD read lengths, but that we don't quite
have consensus on how to solve them yet.

VPD is a bit of a tar pit.  We've talked about limiting VPD read length
before (see links below), which requires interpreting the VPD contents
(just the tags & sizes) the kernel.  I think I'm OK with doing that,
provided we should audit existing users and make sure we don't break them.

http://lkml.kernel.org/r/c67cd7f4-8d8f-48fc-a63c-dbdafde873a2@CMEXHTCAS1.ad.emulex.com
http://lkml.kernel.org/r/1f6d7b6c-7189-4fe3-926b-42609724cab9@CMEXHTCAS2.ad.emulex.com

I'd also like to include specifics, e.g., bugzillas with complete dmesg and
lspci information, for the problems we're fixing.

Bjorn


> ---
>  drivers/pci/access.c |   10 ++++------
>  drivers/pci/probe.c  |    1 +
>  2 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/access.c b/drivers/pci/access.c
> index d9b64a175990..6a1c8d6f95f1 100644
> --- a/drivers/pci/access.c
> +++ b/drivers/pci/access.c
> @@ -281,7 +281,6 @@ PCI_USER_WRITE_CONFIG(dword, u32)
>  
>  struct pci_vpd_pci22 {
>  	struct pci_vpd base;
> -	struct mutex lock;
>  	u16	flag;
>  	bool	busy;
>  	u8	cap;
> @@ -340,7 +339,7 @@ static ssize_t pci_vpd_pci22_read(struct pci_dev *dev, loff_t pos, size_t count,
>  	if (pos < 0 || pos > vpd->base.len || end > vpd->base.len)
>  		return -EINVAL;
>  
> -	if (mutex_lock_killable(&vpd->lock))
> +	if (mutex_lock_killable(&dev->bus->vpd_mutex))
>  		return -EINTR;
>  
>  	ret = pci_vpd_pci22_wait(dev);
> @@ -376,7 +375,7 @@ static ssize_t pci_vpd_pci22_read(struct pci_dev *dev, loff_t pos, size_t count,
>  		}
>  	}
>  out:
> -	mutex_unlock(&vpd->lock);
> +	mutex_unlock(&dev->bus->vpd_mutex);
>  	return ret ? ret : count;
>  }
>  
> @@ -392,7 +391,7 @@ static ssize_t pci_vpd_pci22_write(struct pci_dev *dev, loff_t pos, size_t count
>  	if (pos < 0 || (pos & 3) || (count & 3) || end > vpd->base.len)
>  		return -EINVAL;
>  
> -	if (mutex_lock_killable(&vpd->lock))
> +	if (mutex_lock_killable(&dev->bus->vpd_mutex))
>  		return -EINTR;
>  
>  	ret = pci_vpd_pci22_wait(dev);
> @@ -424,7 +423,7 @@ static ssize_t pci_vpd_pci22_write(struct pci_dev *dev, loff_t pos, size_t count
>  		pos += sizeof(u32);
>  	}
>  out:
> -	mutex_unlock(&vpd->lock);
> +	mutex_unlock(&dev->bus->vpd_mutex);
>  	return ret ? ret : count;
>  }
>  
> @@ -453,7 +452,6 @@ int pci_vpd_pci22_init(struct pci_dev *dev)
>  
>  	vpd->base.len = PCI_VPD_PCI22_SIZE;
>  	vpd->base.ops = &pci_vpd_pci22_ops;
> -	mutex_init(&vpd->lock);
>  	vpd->cap = cap;
>  	vpd->busy = false;
>  	dev->vpd = &vpd->base;
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 6675a7a1b9fc..40c2a5a751d0 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -494,6 +494,7 @@ static struct pci_bus *pci_alloc_bus(struct pci_bus *parent)
>  	INIT_LIST_HEAD(&b->devices);
>  	INIT_LIST_HEAD(&b->slots);
>  	INIT_LIST_HEAD(&b->resources);
> +	mutex_init(&b->vpd_mutex);
>  	b->max_bus_speed = PCI_SPEED_UNKNOWN;
>  	b->cur_bus_speed = PCI_SPEED_UNKNOWN;
>  #ifdef CONFIG_PCI_DOMAINS_GENERIC
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 353db8dc4c6e..f8a51d172255 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -454,6 +454,7 @@ struct pci_bus {
>  	struct msi_controller *msi;	/* MSI controller */
>  	void		*sysdata;	/* hook for sys-specific extension */
>  	struct proc_dir_entry *procdir;	/* directory entry in /proc/bus/pci */
> +	struct mutex	vpd_mutex;	/* bus-wide VPD access mutex */
>  
>  	unsigned char	number;		/* bus number */
>  	unsigned char	primary;	/* number of primary bridge */
>
Rustad, Mark D May 27, 2015, 7:11 p.m. UTC | #13
> On May 27, 2015, at 10:27 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> 
> It sounds like there are real problems here that would be fixed by changing
> the mutex strategy and limiting VPD read lengths, but that we don't quite
> have consensus on how to solve them yet.

I have a new pair of patches that I am getting internal feedback on. I
will be on vacation starting tomorrow and won't be back until Tuesday, so
I think it best to send them when I get back so that I will be available
to respond. I could be convinced to send them now.

> VPD is a bit of a tar pit.

It sure is.

> We've talked about limiting VPD read length
> before (see links below), which requires interpreting the VPD contents
> (just the tags & sizes) the kernel.  I think I'm OK with doing that,
> provided we should audit existing users and make sure we don't break them.
> 
> http://lkml.kernel.org/r/c67cd7f4-8d8f-48fc-a63c-dbdafde873a2@CMEXHTCAS1.ad.emulex.com
> http://lkml.kernel.org/r/1f6d7b6c-7189-4fe3-926b-42609724cab9@CMEXHTCAS2.ad.emulex.com
> 
> I'd also like to include specifics, e.g., bugzillas with complete dmesg and
> lspci information, for the problems we're fixing.

The issue I am addressing is for when the VPD Address/F and Data registers
are shared across multiple functions. My latest patch addresses this by
always performing the access through function 0. I added a dev_flags bit
to indicate when this should be done, so only devices that have a quirk
that sets that bit will get that treatment.
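
As a rough illustration of that scheme (the flag name and wrapper below are placeholders for whatever the follow-up patch actually defines; pci_get_slot(), PCI_SLOT(), PCI_DEVFN() and the static pci_vpd_pci22_read() in access.c are the existing pieces):

/* Placeholder for the dev_flags bit set by the per-device quirk. */
#define PCI_DEV_FLAGS_VPD_REF_F0	(1 << 7)

static ssize_t pci_vpd_f0_read(struct pci_dev *dev, loff_t pos, size_t count,
			       void *buf)
{
	struct pci_dev *f0;
	ssize_t ret;

	if (!(dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0))
		return pci_vpd_pci22_read(dev, pos, count, buf);

	/* Quirked device: the VPD Address/F and Data registers are shared,
	 * so route the access (and the mutex) through function 0. */
	f0 = pci_get_slot(dev->bus, PCI_DEVFN(PCI_SLOT(dev->devfn), 0));
	if (!f0)
		return -ENODEV;

	ret = pci_vpd_pci22_read(f0, pos, count, buf);
	pci_dev_put(f0);
	return ret;
}

The write path would mirror this.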

The patch I have still doesn't address the issue for direct-assigned
devices. I have no answer to that, but will point out that most guests
seem to use virtual machines that don't support extended config space
anyway. Only those that do support extended config space and access VPD
could be a source of trouble.

I have to say that I haven't actually reproduced the failure, but given
the design of the hardware it is clear that somehow a common mutex needs
to protect those shared registers.

--
Mark Rustad, Networking Division, Intel Corporation

Patch

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index d9b64a175990..6a1c8d6f95f1 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -281,7 +281,6 @@  PCI_USER_WRITE_CONFIG(dword, u32)
 
 struct pci_vpd_pci22 {
 	struct pci_vpd base;
-	struct mutex lock;
 	u16	flag;
 	bool	busy;
 	u8	cap;
@@ -340,7 +339,7 @@  static ssize_t pci_vpd_pci22_read(struct pci_dev *dev, loff_t pos, size_t count,
 	if (pos < 0 || pos > vpd->base.len || end > vpd->base.len)
 		return -EINVAL;
 
-	if (mutex_lock_killable(&vpd->lock))
+	if (mutex_lock_killable(&dev->bus->vpd_mutex))
 		return -EINTR;
 
 	ret = pci_vpd_pci22_wait(dev);
@@ -376,7 +375,7 @@  static ssize_t pci_vpd_pci22_read(struct pci_dev *dev, loff_t pos, size_t count,
 		}
 	}
 out:
-	mutex_unlock(&vpd->lock);
+	mutex_unlock(&dev->bus->vpd_mutex);
 	return ret ? ret : count;
 }
 
@@ -392,7 +391,7 @@  static ssize_t pci_vpd_pci22_write(struct pci_dev *dev, loff_t pos, size_t count
 	if (pos < 0 || (pos & 3) || (count & 3) || end > vpd->base.len)
 		return -EINVAL;
 
-	if (mutex_lock_killable(&vpd->lock))
+	if (mutex_lock_killable(&dev->bus->vpd_mutex))
 		return -EINTR;
 
 	ret = pci_vpd_pci22_wait(dev);
@@ -424,7 +423,7 @@  static ssize_t pci_vpd_pci22_write(struct pci_dev *dev, loff_t pos, size_t count
 		pos += sizeof(u32);
 	}
 out:
-	mutex_unlock(&vpd->lock);
+	mutex_unlock(&dev->bus->vpd_mutex);
 	return ret ? ret : count;
 }
 
@@ -453,7 +452,6 @@  int pci_vpd_pci22_init(struct pci_dev *dev)
 
 	vpd->base.len = PCI_VPD_PCI22_SIZE;
 	vpd->base.ops = &pci_vpd_pci22_ops;
-	mutex_init(&vpd->lock);
 	vpd->cap = cap;
 	vpd->busy = false;
 	dev->vpd = &vpd->base;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 6675a7a1b9fc..40c2a5a751d0 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -494,6 +494,7 @@  static struct pci_bus *pci_alloc_bus(struct pci_bus *parent)
 	INIT_LIST_HEAD(&b->devices);
 	INIT_LIST_HEAD(&b->slots);
 	INIT_LIST_HEAD(&b->resources);
+	mutex_init(&b->vpd_mutex);
 	b->max_bus_speed = PCI_SPEED_UNKNOWN;
 	b->cur_bus_speed = PCI_SPEED_UNKNOWN;
 #ifdef CONFIG_PCI_DOMAINS_GENERIC
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8dc4c6e..f8a51d172255 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -454,6 +454,7 @@  struct pci_bus {
 	struct msi_controller *msi;	/* MSI controller */
 	void		*sysdata;	/* hook for sys-specific extension */
 	struct proc_dir_entry *procdir;	/* directory entry in /proc/bus/pci */
+	struct mutex	vpd_mutex;	/* bus-wide VPD access mutex */
 
 	unsigned char	number;		/* bus number */
 	unsigned char	primary;	/* number of primary bridge */