diff mbox

pci/aspm: Fix null pointer dereference during re-enumeration of PCI bus in pcie_aspm_init_link_state even though ASPM is off

Message ID 4cec62c2-218a-672b-8c12-d44e8df56aae@comsys.rwth-aachen.de
State Superseded
Headers show

Commit Message

Jan Rüth Jan. 3, 2017, 8:04 a.m. UTC
This patch fixes a null pointer dereference during PCI bus enumeration
when ASPM is off. On an IBM x3850 8664 this bug causes the kernel to
halt, this behavior did not appear in 3.10, so this is a regression.

pcie_aspm_sanity_check should only be called if ASPM is on.

Signed-off-by: Jan Rueth <rueth@comsys.rwth-aachen.de>
---
 drivers/pci/pcie/aspm.c | +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bjorn Helgaas Jan. 3, 2017, 8:53 p.m. UTC | #1
On Tue, Jan 03, 2017 at 09:04:36AM +0100, Jan Rüth wrote:
> This patch fixes a null pointer dereference during PCI bus enumeration
> when ASPM is off. On an IBM x3850 8664 this bug causes the kernel to
> halt, this behavior did not appear in 3.10, so this is a regression.
> 
> pcie_aspm_sanity_check should only be called if ASPM is on.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=187731

> Signed-off-by: Jan Rueth <rueth@comsys.rwth-aachen.de>
> ---
>  drivers/pci/pcie/aspm.c | +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index f981129..e758b56 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -552,11 +552,12 @@ static struct pcie_link_state
> *alloc_pcie_link_state(struct pci_dev *pdev)
>  void pcie_aspm_init_link_state(struct pci_dev *pdev)
>  {
>         struct pcie_link_state *link;
> -       int blacklist = !!pcie_aspm_sanity_check(pdev);
> -
> +       int blacklist;
>         if (!aspm_support_enabled)
>                 return;
> 
> +       blacklist = !!pcie_aspm_sanity_check(pdev);
> +
>         if (pdev->link_state)
>                 return;

For some reason this doesn't apply for me (complains about a malformed
patch), but I can apply it by hand easily enough.

However, this path is a little obtuse, and I'd like to understand
what's going on better.  We currently have:

  pci_scan_slot
    if (bus->self && nr)
      pcie_aspm_init_link_state(bus->self)
        pcie_aspm_sanity_check
	  list_for_each_entry(child, &pdev->subordinate->devices, ...)

I assume the null pointer is pdev->subordinate, so maybe we're calling
pcie_aspm_init_link_state() for a bridge where we weren't able to
create a child bus ("pdev->subordinate")?

I think it's legal to have a bridge device where we haven't set
pdev->subordinate, but it does seem unusual.

I don't see a log with the actual null pointer dereference in the
bugzilla.  Do you have any more details about which device blows up
here and why it's unusual?

Part of the reason I'm curious is that I'd like to identify the commit
that introduced the problem so we can figure out where the fix might
need to be backported.  I suspected b7206cbf024d ("PCI ASPM: support
partial aspm enablement") because it moved the
pcie_aspm_sanity_check() call before the checks for aspm_disabled, but
that's been around since 2.6 days, and the problem you're seeing
didn't happen until after 3.10.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Rüth Jan. 4, 2017, 11:41 a.m. UTC | #2
On 03/01/2017 21:53, Bjorn Helgaas wrote:
> On Tue, Jan 03, 2017 at 09:04:36AM +0100, Jan Rüth wrote:
>> This patch fixes a null pointer dereference during PCI bus enumeration
>> when ASPM is off. On an IBM x3850 8664 this bug causes the kernel to
>> halt, this behavior did not appear in 3.10, so this is a regression.
>>
>> pcie_aspm_sanity_check should only be called if ASPM is on.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=187731
> 
>> Signed-off-by: Jan Rueth <rueth@comsys.rwth-aachen.de>
>> ---
>>  drivers/pci/pcie/aspm.c | +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
>> index f981129..e758b56 100644
>> --- a/drivers/pci/pcie/aspm.c
>> +++ b/drivers/pci/pcie/aspm.c
>> @@ -552,11 +552,12 @@ static struct pcie_link_state
>> *alloc_pcie_link_state(struct pci_dev *pdev)
>>  void pcie_aspm_init_link_state(struct pci_dev *pdev)
>>  {
>>         struct pcie_link_state *link;
>> -       int blacklist = !!pcie_aspm_sanity_check(pdev);
>> -
>> +       int blacklist;
>>         if (!aspm_support_enabled)
>>                 return;
>>
>> +       blacklist = !!pcie_aspm_sanity_check(pdev);
>> +
>>         if (pdev->link_state)
>>                 return;
> 
> For some reason this doesn't apply for me (complains about a malformed
> patch), but I can apply it by hand easily enough.
> 
> However, this path is a little obtuse, and I'd like to understand
> what's going on better.  We currently have:
> 
>   pci_scan_slot
>     if (bus->self && nr)
>       pcie_aspm_init_link_state(bus->self)
>         pcie_aspm_sanity_check
> 	  list_for_each_entry(child, &pdev->subordinate->devices, ...)
> 
> I assume the null pointer is pdev->subordinate, so maybe we're calling
> pcie_aspm_init_link_state() for a bridge where we weren't able to
> create a child bus ("pdev->subordinate")?

At the time when I debugged it could trace it back to the loop but I
can't remember which part actually broke. But I can add some debug
output and crash it again.
> 
> I think it's legal to have a bridge device where we haven't set
> pdev->subordinate, but it does seem unusual.
> 
> I don't see a log with the actual null pointer dereference in the
> bugzilla.  Do you have any more details about which device blows up
> here and why it's unusual?

If you need it, I can boot an unpatched kernel again and post the actual
nullpointer dereference. The device in question is an ethernet card:
Intel Corporation 82576
Here is also a lspci -tv if that is of any help:

-+-[0000:19]---00.0-[1a-1d]----00.0-[1b-1d]--+-02.0-[1c]--+-00.0  Intel
Corporation 82576 Gigabit Network Connection
 |                                           |            \-00.1  Intel
Corporation 82576 Gigabit Network Connection
 |                                           \-04.0-[1d]--+-00.0  Intel
Corporation 82576 Gigabit Network Connection
 |                                                        \-00.1  Intel
Corporation 82576 Gigabit Network Connection
 +-[0000:14]---00.0-[15-18]--
 +-[0000:0f]---00.0-[10-13]--
 +-[0000:0a]---00.0-[0b-0e]--
 +-[0000:06]---00.0  IBM Calgary PCI-X Host Bridge
 +-[0000:02]---00.0  IBM Calgary PCI-X Host Bridge
 +-[0000:01]-+-00.0  IBM Calgary PCI-X Host Bridge
 |           +-01.0  Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet
 |           +-01.1  Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet
 |           \-02.0  Adaptec AAC-RAID
 \-[0000:00]-+-00.0  IBM Calgary PCI-X Host Bridge
             +-01.0  Advanced Micro Devices, Inc. [AMD/ATI] RV100
[Radeon 7000 / Radeon VE]
             +-03.0  NEC Corporation OHCI USB Controller
             +-03.1  NEC Corporation OHCI USB Controller
             +-03.2  NEC Corporation uPD72010x USB 2.0 Controller
             +-0f.0  Broadcom CSB6 South Bridge
             +-0f.1  Broadcom CSB6 RAID/IDE Controller
             \-0f.3  Broadcom GCLE-2 Host Bridge

> 
> Part of the reason I'm curious is that I'd like to identify the commit
> that introduced the problem so we can figure out where the fix might
> need to be backported.  I suspected b7206cbf024d ("PCI ASPM: support
> partial aspm enablement") because it moved the
> pcie_aspm_sanity_check() call before the checks for aspm_disabled, but
> that's been around since 2.6 days, and the problem you're seeing
> didn't happen until after 3.10.

The distribution I tested where it ran with a 3.10 kernel was a CentOS
(the machines were once certified for REHL so I figured there might be
something special). I'm not 100% sure anymore but I think I used a
CentOS 6.5 and updated the kernel to 3.10 there and it worked. (I think
I followed this guide:
http://bicofino.io/2014/10/25/install-kernel-3-dot-10-on-centos-6-dot-5/)
So I guess it could be that the patch that introduces the fix was just
not pulled into the CentOS 3.10 kernel?

> 
> Bjorn
> 

Best
 Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index f981129..e758b56 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -552,11 +552,12 @@  static struct pcie_link_state
*alloc_pcie_link_state(struct pci_dev *pdev)
 void pcie_aspm_init_link_state(struct pci_dev *pdev)
 {
        struct pcie_link_state *link;
-       int blacklist = !!pcie_aspm_sanity_check(pdev);
-
+       int blacklist;
        if (!aspm_support_enabled)
                return;

+       blacklist = !!pcie_aspm_sanity_check(pdev);
+
        if (pdev->link_state)
                return;