[BUG] Bisected Problem with LSI PCI FC Adapter

Message ID CAE9FiQUXvuZo2bQSRg0vpuDVYrnkcXDu7KZBNp-YgXbKa2LQhw@mail.gmail.com
State Not Applicable

Commit Message

Yinghai Lu Sept. 11, 2014, 7:26 p.m. UTC
On Thu, Sep 11, 2014 at 10:30 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> [+cc linux-pci]
>
>
> On Thu, Sep 11, 2014 at 7:43 AM, Dirk Gouders <dirk@gouders.net> wrote:
>> Andreas Noever <andreas.noever@gmail.com> writes:
>>
>>> On Wed, Sep 3, 2014 at 2:47 PM, Dirk Gouders <dirk@gouders.net> wrote:
>>>> Andreas Noever <andreas.noever@gmail.com> writes:
>>>>
>>>>> On Wed, Sep 3, 2014 at 12:57 PM, Dirk Gouders <dirk@gouders.net> wrote:
>>>>>> On a Tyan VX50 (B4985) I ran into problems when updating the kernel: the
>>>>>> PCI FC Adapter is no longer recognized.
>>>>>
>>>>> Can you provide the output of lspci -vvv and the output of dmesg from
>>>>> a working boot? Which card is the one that is not recognized?
>>>>
>>>> Sure, the card that disappeared is:
>>>>
>>>> 0a:00.0 Fibre Channel: LSI Logic / Symbios Logic FC949ES Fibre Channel Adapter (rev 02)
>>>
>>> As far as I can tell the following is happening:
>>> The root bus resource window (advertised by the bios?) is too small:
>>> pci_bus 0000:00: root bus resource [bus 00-07]
>>> Previously we didn't really care. There is a resource conflict but we
>>> ignored it:
>>> pci_bus 0000:0a: busn_res: can not insert [bus 0a] under [bus 00-07]
>>> (conflicts with (null) [bus 00-07])
>>> With the patch we mark the bridge as broken and reassign the bus to 06:
>>> pci 0000:00:0e.0: bridge configuration invalid ([bus 0a-0a]), reconfiguring
>>> pci 0000:00:0e.0: PCI bridge to [bus 06-07]
>>> pci 0000:00:0e.0:   bridge window [io  0x3000-0x3fff]
>>> pci 0000:00:0e.0:   bridge window [mem 0xd4200000-0xd42fffff]
>>> pci_bus 0000:06: busn_res: [bus 06-07] end is updated to 06

> Thanks for following up on this.  It had fallen off my radar, so I
> opened https://bugzilla.kernel.org/show_bug.cgi?id=84281 to make sure
> I don't forget again.  Please continue the debug discussion here in
> email.

Two problems here:
1. This is an AMD two-node system. amd_bus.c tells us buses [00, 7f] belong to
the first socket, but _OSC says only [0,7] belongs to the first socket.

So solution (1):
According to Linus's principle, we should always trust HW more than firmware,
so should we just adjust the bus range from _OSC before we use it?
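
The mismatch in problem 1 can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `struct bus_range` and `bus_range_contains()` are hypothetical names, and the ranges are the ones quoted in this thread ([bus 00-07] from firmware, [00, 7f] from amd_bus.c, and the FC card's bridge at [bus 0a]).

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive range of PCI bus numbers, e.g. [bus 00-07]. Hypothetical type. */
struct bus_range {
	uint8_t start;
	uint8_t end;
};

/* True if 'child' is a valid range that lies entirely inside 'parent'. */
static bool bus_range_contains(struct bus_range parent, struct bus_range child)
{
	return child.start <= child.end &&
	       child.start >= parent.start &&
	       child.end <= parent.end;
}
```

With the firmware window {0x00, 0x07} the check rejects {0x0a, 0x0a}, which matches the "can not insert [bus 0a] under [bus 00-07]" message. With the wider hardware window {0x00, 0x7f} the same child range fits.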

2. After moving the LSI FC card from bus 0a to bus 07, the card refuses to respond.

During my testing with the pci busn allocation patchset, I found that if the
LSI Erie card is moved to a different bus, it refuses to respond. The only
thing that makes the LSI card respond again is resetting the pcie link. This
should be an LSI firmware bug.

Dirk, could you apply the attached patches and then use

echo 1 > /sys/bus/pci/devices/0000\:00\:0e.0/link_disable
echo 0 > /sys/bus/pci/devices/0000\:00\:0e.0/link_disable

to reset the link?

Solution (2)
To work around the problem, we could reset the pcie link after changing the
bus number in the pcie bridges?

Solution (3)
Or we just revert the offending 1820ffdccb9b4398 (PCI: Make sure
bus number resources stay within their parents bounds)?

Thanks

Yinghai

Comments

Dirk Gouders Sept. 11, 2014, 8:33 p.m. UTC | #1
Yinghai Lu <yinghai@kernel.org> writes:

> On Thu, Sep 11, 2014 at 10:30 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> [+cc linux-pci]
>>
>>
>> On Thu, Sep 11, 2014 at 7:43 AM, Dirk Gouders <dirk@gouders.net> wrote:
>>> Andreas Noever <andreas.noever@gmail.com> writes:
>>>
>>>> On Wed, Sep 3, 2014 at 2:47 PM, Dirk Gouders <dirk@gouders.net> wrote:
>>>>> Andreas Noever <andreas.noever@gmail.com> writes:
>>>>>
>>>>>> On Wed, Sep 3, 2014 at 12:57 PM, Dirk Gouders <dirk@gouders.net> wrote:
>>>>>>> On a Tyan VX50 (B4985) I ran into problems when updating the kernel: the
>>>>>>> PCI FC Adapter is no longer recognized.
>>>>>>
>>>>>> Can you provide the output of lspci -vvv and the output of dmesg from
>>>>>> a working boot? Which card is the one that is not recognized?
>>>>>
>>>>> Sure, the card that disappeared is:
>>>>>
>>>>> 0a:00.0 Fibre Channel: LSI Logic / Symbios Logic FC949ES Fibre Channel Adapter (rev 02)
>>>>
>>>> As far as I can tell the following is happening:
>>>> The root bus resource window (advertised by the bios?) is too small:
>>>> pci_bus 0000:00: root bus resource [bus 00-07]
>>>> Previously we didn't really care. There is a resource conflict but we
>>>> ignored it:
>>>> pci_bus 0000:0a: busn_res: can not insert [bus 0a] under [bus 00-07]
>>>> (conflicts with (null) [bus 00-07])
>>>> With the patch we mark the bridge as broken and reassign the bus to 06:
>>>> pci 0000:00:0e.0: bridge configuration invalid ([bus 0a-0a]), reconfiguring
>>>> pci 0000:00:0e.0: PCI bridge to [bus 06-07]
>>>> pci 0000:00:0e.0:   bridge window [io  0x3000-0x3fff]
>>>> pci 0000:00:0e.0:   bridge window [mem 0xd4200000-0xd42fffff]
>>>> pci_bus 0000:06: busn_res: [bus 06-07] end is updated to 06
>
>> Thanks for following up on this.  It had fallen off my radar, so I
>> opened https://bugzilla.kernel.org/show_bug.cgi?id=84281 to make sure
>> I don't forget again.  Please continue the debug discussion here in
>> email.
>
> Two problems here:
> 1. This is an AMD two-node system. amd_bus.c tells us buses [00, 7f] belong to
> the first socket, but _OSC says only [0,7] belongs to the first socket.
>
> So solution (1):
> According to Linus's principle, we should always trust HW more than firmware,
> so should we just adjust the bus range from _OSC before we use it?
>
> 2. After moving the LSI FC card from bus 0a to bus 07, the card refuses to respond.
>
> During my testing with the pci busn allocation patchset, I found that if the
> LSI Erie card is moved to a different bus, it refuses to respond. The only
> thing that makes the LSI card respond again is resetting the pcie link. This
> should be an LSI firmware bug.
>
> Dirk, could you apply the attached patches and then use
>
> echo 1 > /sys/bus/pci/devices/0000\:00\:0e.0/link_disable
> echo 0 > /sys/bus/pci/devices/0000\:00\:0e.0/link_disable
>
> to reset the link?

Thanks, Yinghai, I will apply them tomorrow and report.

What I was currently trying was to construct a test-environment so that
I do not need to do tests and diagnosis on a busy machine.

I noticed that this problem seems to start with the narrow Root
Bridge window (00-07) but every other machine that I had a look at,
starts with (00-ff), so those will not trigger my problem.

I thought I could perhaps try to shrink the window in
acpi_pci_root_add() to trigger the problem and that kind of works: it
triggers it but not exactly the same way, because it basically ends at
this code in pci_scan_bridge():

	if (max >= bus->busn_res.end) {
		dev_warn(&dev->dev, "can't allocate child bus %02x from %pR (pass %d)\n",
			 max, &bus->busn_res, pass);
		goto out;
	}

If this could work but I am just missing a small detail, I would be
glad to hear about it and do the first tests this way.  If it is
complete nonsense, I will just use the machine that triggers the problem
for the tests.
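
The condition in the quoted pci_scan_bridge() fragment can be modeled in userspace as follows. This is a minimal sketch that only mirrors the `max >= bus->busn_res.end` test; `struct bus_window` and `can_allocate_child_bus()` are hypothetical names and do not exist in the kernel.

```c
#include <stdbool.h>

/* Simplified stand-in for the bus-number resource of a parent bus. */
struct bus_window {
	unsigned int start;
	unsigned int end;
};

/*
 * Mirrors the quoted check: a new child bus number can only be handed out
 * while the highest bus number used so far ('max') is still below the end
 * of the parent's window. Otherwise the scan warns and bails out.
 */
static bool can_allocate_child_bus(const struct bus_window *parent,
				   unsigned int max)
{
	return max < parent->end;
}
```

With a window artificially shrunk to 00-07, the check fails as soon as `max` reaches 07, so the scan stops at the "can't allocate child bus" warning. With the usual 00-ff window the same `max` still leaves room.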

Dirk


> Solution (2)
> To work around the problem, we could reset the pcie link after changing the
> bus number in the pcie bridges?
>
> Solution (3)
> Or we just revert the offending 1820ffdccb9b4398 (PCI: Make sure
> bus number resources stay within their parents bounds)?
>
> Thanks
>
> Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Sept. 11, 2014, 8:42 p.m. UTC | #3
On Thu, Sep 11, 2014 at 2:33 PM, Dirk Gouders <dirk@gouders.net> wrote:
> What I was currently trying was to construct a test-environment so that
> I do not need to do tests and diagnosis on a busy machine.
>
> I noticed that this problem seems to start with the narrow Root
> Bridge window (00-07) but every other machine that I had a look at,
> starts with (00-ff), so those will not trigger my problem.
>
> I thought I could perhaps try to shrink the window in
> acpi_pci_root_add() to trigger the problem and that kind of works: it
> triggers it but not exactly the same way, because it basically ends at
> this code in pci_scan_bridge():
>
>         if (max >= bus->busn_res.end) {
>                 dev_warn(&dev->dev, "can't allocate child bus %02x from %pR (pass %d)\n",
>                          max, &bus->busn_res, pass);
>                 goto out;
>         }
>
> If this could work but I am just missing a small detail, I would be
> glad to hear about it and do the first tests this way.  If it is
> complete nonsense, I will just use the machine that triggers the problem
> for the tests.

I was about to suggest the same thing.  If the problem is related to
the bus number change, we should be able to force that to happen on a
different machine.  Your approach sounds good, so I'm guessing we just
need a tweak.

I would first double-check that the PCI adapters are identical,
including the firmware on the card.  Can you also include your patch
and the resulting dmesg (with debug enabled as before)?

Is the test machine itself similar to the failing one?  Do they have
the same BIOS version?  From [1], it looks like you already have the
latest BIOS on the failing machine.  It's interesting that the notes
for your version mention "Fixed a PCI Bus Number re-allocation error."

Bjorn

[1] http://www.tyan.com/support_download_bios.aspx?model=B.VX50B4985-E
Linus Torvalds Sept. 11, 2014, 8:42 p.m. UTC | #4
On Thu, Sep 11, 2014 at 12:26 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>
> This is amd two node systems. amd_bus.c tell us bus [00, 7f] is from
> first socket, but _OSC says only [0,7] is from first socket.

That might also explain why Dirk doesn't see it on his other machine.
The other machine doesn't have a buggy ACPI table bus limit.

So yeah, trusting actual hw more than the _OSC entry sounds like the
fix. I don't understand why hardware designers seem to think that ACPI
is a good model, when it would have been so much better to just
introduce hardware standards instead.

Oh well. I guess that boat sailed long long ago.

             Linus

Patch

Subject: [PATCH] PCI: Add link_disable in /sysfs for pcie device

Some PCIe cards from one vendor stop responding to scans from the bridge
after the bus number setting in the bridge device is changed.

A link disable/enable cycle on the PCIe root port is needed to recover them.

So expose the Link Disable bit of the PCIe Link Control register. We can use
 echo 1 > /sys/..../link_disable
 echo 0 > /sys/..../link_disable
to make the PCIe device respond to scans again.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 drivers/pci/pcie-sysfs.c |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

Index: linux-2.6/drivers/pci/pcie-sysfs.c
===================================================================
--- linux-2.6.orig/drivers/pci/pcie-sysfs.c
+++ linux-2.6/drivers/pci/pcie-sysfs.c
@@ -1,7 +1,35 @@ 
 #include <linux/kernel.h>
 #include <linux/pci.h>
 
+static ssize_t
+pcie_link_disable_show(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return sprintf(buf, "%u\n", pcie_link_disable_get(pdev));
+}
+static ssize_t
+pcie_link_disable_store(struct device *dev, struct device_attribute *attr,
+			const char *buf, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	unsigned long val;
+
+	if (kstrtoul(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	pcie_link_disable_set(pdev, val);
+
+	return count;
+}
+
+static struct device_attribute pcie_link_disable_attr =
+		__ATTR(pcie_link_disable, 0644,
+		       pcie_link_disable_show, pcie_link_disable_store);
+
 static struct attribute *pci_dev_pcie_dev_attrs[] = {
+	&pcie_link_disable_attr.attr,
 	NULL,
 };
 
@@ -14,6 +42,11 @@  static umode_t pci_dev_pcie_attrs_are_vi
 	if (!pci_is_pcie(pdev))
 		return 0;
 
+	if (a == &pcie_link_disable_attr.attr)
+		if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
+		    (pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM))
+			return 0;
+
 	return a->mode;
 }
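
The hunks above call pcie_link_disable_get() and pcie_link_disable_set(), which are not defined in this diff and presumably come from another of the attached patches. As a rough illustration only, the bit manipulation they would need can be modeled in userspace on a plain 16-bit copy of the Link Control register. The function names below are hypothetical; the one value taken from the kernel headers is the Link Disable bit, 0x0010 (PCI_EXP_LNKCTL_LD). A real in-kernel helper would read and write the register through the PCIe capability accessors rather than take the value as an argument. Note also that the attribute as declared with __ATTR(pcie_link_disable, ...) would be named pcie_link_disable in sysfs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Link Disable bit of the PCIe Link Control register (PCI_EXP_LNKCTL_LD). */
#define LNKCTL_LINK_DISABLE 0x0010u

/* Hypothetical model: report whether Link Disable is set in 'lnkctl'. */
static bool link_disable_get(uint16_t lnkctl)
{
	return (lnkctl & LNKCTL_LINK_DISABLE) != 0;
}

/* Hypothetical model: return 'lnkctl' with Link Disable set or cleared. */
static uint16_t link_disable_set(uint16_t lnkctl, bool disable)
{
	if (disable)
		return (uint16_t)(lnkctl | LNKCTL_LINK_DISABLE);
	return (uint16_t)(lnkctl & ~LNKCTL_LINK_DISABLE);
}
```

Writing 1 and then 0 to the sysfs attribute would correspond to calling the setter with `true` and then `false`, leaving all other Link Control bits unchanged.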