diff mbox series

PCI: shpchp: Fix probing logic inversion

Message ID 20180621164715.28160-1-marc.zyngier@arm.com
State Superseded
Delegated to: Bjorn Helgaas
Headers show
Series PCI: shpchp: Fix probing logic inversion | expand

Commit Message

Marc Zyngier June 21, 2018, 4:47 p.m. UTC
Until recently, shpc_probe() would bail out pretty early in the
absence of the SHPC capability. A logic change in the way the
driver now checks that capability makes it go and probe the
firmware anyway, with ugly consequences if the system is not
ACPI based (my arm64 ThunderX is DT driven, and explodes in
a spectacular way after getting a NULL root bridge from the
non-existent ACPI tables...).

Take this opportunity to move the call to shpchp_is_native()
back into shpc_probe(), making it clear that a non-ACPI system
is not expected to use this driver.

Fixes: 90cc0c3cc709 ("PCI: shpchp: Add shpchp_is_native()")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 drivers/pci/hotplug/acpi_pcihp.c  | 3 ---
 drivers/pci/hotplug/shpchp_core.c | 3 ++-
 2 files changed, 2 insertions(+), 4 deletions(-)

Comments

Mika Westerberg June 25, 2018, 12:52 p.m. UTC | #1
On Thu, Jun 21, 2018 at 05:47:15PM +0100, Marc Zyngier wrote:
> Until recently, shpc_probe() would bail out pretty early in the
> absence of the SHPC capability. A logic change in the way the
> driver now checks that capability makes it go and probe the
> firmware anyway, with ugly consequences if the system is not
> ACPI based (my arm64 ThunderX is DT driven, and explodes in
> a spectacular way after getting a NULL root bridge from the
> non-existent ACPI tables...).

Could you share log from the failure? I would like to understand a bit
better where it crashes and why.

> Take this opportunity to move the call to shpchp_is_native()
> back into shpc_probe(), making it clear that a non-ACPI system
> is not expected to use this driver.

It is fine to use SHPC in non-ACPI systems. However, in ACPI systems we
should negotiate whether it is the OS or the firmware who handles it.
Marc Zyngier June 25, 2018, 1:13 p.m. UTC | #2
On 25/06/18 13:52, Mika Westerberg wrote:
> On Thu, Jun 21, 2018 at 05:47:15PM +0100, Marc Zyngier wrote:
>> Until recently, shpc_probe() would bail out pretty early in the
>> absence of the SHPC capability. A logic change in the way the
>> driver now checks that capability makes it go and probe the
>> firmware anyway, with ugly consequences if the system is not
>> ACPI based (my arm64 ThunderX is DT driven, and explodes in
>> a spectacular way after getting a NULL root bridge from the
>> non-existent ACPI tables...).
> 
> Could you share log from the failure? I would like to understand a bit
> better where it crashes and why.

Here you go:

[   12.330017] pcieport 0008:1f:00.0: enabling device (0506 -> 0507)
[   12.336315] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
[   12.345104] Mem abort info:
[   12.347895]   ESR = 0x96000004
[   12.350945]   Exception class = DABT (current EL), IL = 32 bits
[   12.356860]   SET = 0, FnV = 0
[   12.359910]   EA = 0, S1PTW = 0
[   12.363046] Data abort info:
[   12.365922]   ISV = 0, ISS = 0x00000004
[   12.369748]   CM = 0, WnR = 0
[   12.372712] [0000000000000058] user address but active_mm is swapper
[   12.379063] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[   12.384625] Modules linked in:
[   12.387676] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 4.18.0-rc1-00006-ga2c5c6a64fb6 #10
[   12.395929] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
[   12.403408] Workqueue: events work_for_cpu_fn
[   12.407758] pstate: 60000005 (nZCv daif -PAN -UAO)
[   12.412544] pc : acpi_get_hp_hw_control_from_firmware+0xa0/0x2b0
[   12.418541] lr : acpi_get_hp_hw_control_from_firmware+0xa0/0x2b0
[   12.424537] sp : ffff00000a613c60
[   12.427841] x29: ffff00000a613c60 x28: 0000000000000000 
[   12.433147] x27: ffff800fbeb8d0b0 x26: ffff00000982a068 
[   12.438453] x25: 0000000000000000 x24: ffff000009feb898 
[   12.443759] x23: 0000000000000000 x22: ffff000009808000 
[   12.449064] x21: ffff0000098dee98 x20: ffff810fa60d9000 
[   12.454370] x19: 0000000000000000 x18: ffffffffffffffff 
[   12.459675] x17: 0000000000000e00 x16: 0000000000000020 
[   12.464981] x15: ffff000009808648 x14: ffff000089a31b8f 
[   12.470286] x13: 0000000000000000 x12: 0000000000000040 
[   12.475591] x11: 000000000000004c x10: 0000000000000a20 
[   12.480897] x9 : ffff00000a613d30 x8 : 0000000000000000 
[   12.486202] x7 : ffff000009808648 x6 : 0000000000000057 
[   12.491508] x5 : 0000000000000010 x4 : 000000000000001f 
[   12.496813] x3 : 0000000000000000 x2 : 0d68a3f9b2251000 
[   12.502118] x1 : 0000000000000000 x0 : 0000000000000000 
[   12.507425] Process kworker/0:0 (pid: 5, stack limit = 0x(____ptrval____))
[   12.514289] Call trace:
[   12.516727]  acpi_get_hp_hw_control_from_firmware+0xa0/0x2b0
[   12.522377]  shpc_probe+0x48/0x3b0
[   12.525772]  local_pci_probe+0x44/0xb0
[   12.529511]  work_for_cpu_fn+0x20/0x30
[   12.533252]  process_one_work+0x208/0x480
[   12.537251]  worker_thread+0x254/0x448
[   12.540992]  kthread+0x104/0x130
[   12.544212]  ret_from_fork+0x10/0x1c
[   12.547779] Code: f100427f 54000040 f85f8260 94009f63 (b9405800) 
[   12.553866] ---[ end trace 8ee5b8cc95cd4f02 ]---

>> Take this opportunity to move the call to shpchp_is_native()
>> back into shpc_probe(), making it clear that a non-ACPI system
>> is not expected to use this driver.
> 
> It is fine to use SHPC in non-ACPI systems. However, in ACPI systems we
> should negotiate whether it is the OS or the firmware who handles it.
Fair enough. In which case we should at the very least make sure
we can handle acpi_pci_find_root() returning NULL (which is what
generates the crash in my case).

Thanks,

	M.
Mika Westerberg June 25, 2018, 1:33 p.m. UTC | #3
On Mon, Jun 25, 2018 at 02:13:26PM +0100, Marc Zyngier wrote:
> On 25/06/18 13:52, Mika Westerberg wrote:
> > On Thu, Jun 21, 2018 at 05:47:15PM +0100, Marc Zyngier wrote:
> >> Until recently, shpc_probe() would bail out pretty early in the
> >> absence of the SHPC capability. A logic change in the way the
> >> driver now checks that capability makes it go and probe the
> >> firmware anyway, with ugly consequences if the system is not
> >> ACPI based (my arm64 ThunderX is DT driven, and explodes in
> >> a spectacular way after getting a NULL root bridge from the
> >> non-existent ACPI tables...).
> > 
> > Could you share log from the failure? I would like to understand a bit
> > better where it crashes and why.
> 
> Here you go:
> 
> [   12.330017] pcieport 0008:1f:00.0: enabling device (0506 -> 0507)
> [   12.336315] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
> [   12.345104] Mem abort info:
> [   12.347895]   ESR = 0x96000004
> [   12.350945]   Exception class = DABT (current EL), IL = 32 bits
> [   12.356860]   SET = 0, FnV = 0
> [   12.359910]   EA = 0, S1PTW = 0
> [   12.363046] Data abort info:
> [   12.365922]   ISV = 0, ISS = 0x00000004
> [   12.369748]   CM = 0, WnR = 0
> [   12.372712] [0000000000000058] user address but active_mm is swapper
> [   12.379063] Internal error: Oops: 96000004 [#1] PREEMPT SMP
> [   12.384625] Modules linked in:
> [   12.387676] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 4.18.0-rc1-00006-ga2c5c6a64fb6 #10
> [   12.395929] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
> [   12.403408] Workqueue: events work_for_cpu_fn
> [   12.407758] pstate: 60000005 (nZCv daif -PAN -UAO)
> [   12.412544] pc : acpi_get_hp_hw_control_from_firmware+0xa0/0x2b0
> [   12.418541] lr : acpi_get_hp_hw_control_from_firmware+0xa0/0x2b0
> [   12.424537] sp : ffff00000a613c60
> [   12.427841] x29: ffff00000a613c60 x28: 0000000000000000 
> [   12.433147] x27: ffff800fbeb8d0b0 x26: ffff00000982a068 
> [   12.438453] x25: 0000000000000000 x24: ffff000009feb898 
> [   12.443759] x23: 0000000000000000 x22: ffff000009808000 
> [   12.449064] x21: ffff0000098dee98 x20: ffff810fa60d9000 
> [   12.454370] x19: 0000000000000000 x18: ffffffffffffffff 
> [   12.459675] x17: 0000000000000e00 x16: 0000000000000020 
> [   12.464981] x15: ffff000009808648 x14: ffff000089a31b8f 
> [   12.470286] x13: 0000000000000000 x12: 0000000000000040 
> [   12.475591] x11: 000000000000004c x10: 0000000000000a20 
> [   12.480897] x9 : ffff00000a613d30 x8 : 0000000000000000 
> [   12.486202] x7 : ffff000009808648 x6 : 0000000000000057 
> [   12.491508] x5 : 0000000000000010 x4 : 000000000000001f 
> [   12.496813] x3 : 0000000000000000 x2 : 0d68a3f9b2251000 
> [   12.502118] x1 : 0000000000000000 x0 : 0000000000000000 
> [   12.507425] Process kworker/0:0 (pid: 5, stack limit = 0x(____ptrval____))
> [   12.514289] Call trace:
> [   12.516727]  acpi_get_hp_hw_control_from_firmware+0xa0/0x2b0
> [   12.522377]  shpc_probe+0x48/0x3b0
> [   12.525772]  local_pci_probe+0x44/0xb0
> [   12.529511]  work_for_cpu_fn+0x20/0x30
> [   12.533252]  process_one_work+0x208/0x480
> [   12.537251]  worker_thread+0x254/0x448
> [   12.540992]  kthread+0x104/0x130
> [   12.544212]  ret_from_fork+0x10/0x1c
> [   12.547779] Code: f100427f 54000040 f85f8260 94009f63 (b9405800) 
> [   12.553866] ---[ end trace 8ee5b8cc95cd4f02 ]---
> 
> >> Take this opportunity to move the call to shpchp_is_native()
> >> back into shpc_probe(), making it clear that a non-ACPI system
> >> is not expected to use this driver.
> > 
> > It is fine to use SHPC in non-ACPI systems. However, in ACPI systems we
> > should negotiate whether it is the OS or the firmware who handles it.
> Fair enough. In which case we should at the very least make sure
> we can handle acpi_pci_find_root() returning NULL (which is what
> generates the crash in my case).

Agreed. Can you try if the below patch fixes the issue for you?

diff --git a/drivers/pci/hotplug/acpi_pcihp.c b/drivers/pci/hotplug/acpi_pcihp.c
index 3979f89b250a..73bc92cf0594 100644
--- a/drivers/pci/hotplug/acpi_pcihp.c
+++ b/drivers/pci/hotplug/acpi_pcihp.c
@@ -88,6 +88,9 @@ int acpi_get_hp_hw_control_from_firmware(struct pci_dev *pdev)
 
 	/* If _OSC exists, we should not evaluate OSHP */
 	host = pci_find_host_bridge(pdev->bus);
+	if (!ACPI_HANDLE(&host->dev))
+		return 0;
+
 	root = acpi_pci_find_root(ACPI_HANDLE(&host->dev));
 	if (root->osc_support_set)
 		goto no_control;
Bjorn Helgaas June 25, 2018, 2:01 p.m. UTC | #4
Hi Marc,

Sorry we broke this!

On Thu, Jun 21, 2018 at 05:47:15PM +0100, Marc Zyngier wrote:
> Until recently, shpc_probe() would bail out pretty early in the
> absence of the SHPC capability. A logic change in the way the
> driver now checks that capability makes it go and probe the
> firmware anyway, with ugly consequences if the system is not
> ACPI based (my arm64 ThunderX is DT driven, and explodes in
> a spectacular way after getting a NULL root bridge from the
> non-existent ACPI tables...).
>
> Take this opportunity to move the call to shpchp_is_native()
> back into shpc_probe(), making it clear that a non-ACPI system
> is not expected to use this driver.

But a non-ACPI system *should* be able to use SHPC.

Here's my understanding of how it should work.  On an ACPI system,

  - If firmware has _OSC, the OS calls it to request permission to
    manage the SHPC.  If _OSC grants permission, it should also
    configure the hardware (interrupts, etc) to give the OS.
    
  - If there's no _OSC, shpchp assumes it's allowed to manage the
    SHPC, and it calls OSHP to configure the hardware appropriately.

On a non-ACPI system, shpchp assumes there's no firmware involved at
all, so it can manage the SHPC without doing anything special.

I see Mika has already posted a patch similar to the first one
below; I think either of those should fix the problem you're seeing.

The second is an attempt to clean things up so they make a
little more sense.

Bjorn


commit 7780896578a93fdec2b345def554355168cceee4
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Mon Jun 25 08:17:33 2018 -0500

    PCI: shpchp: Manage SHPC unconditionally on non-ACPI systems
    
    If CONFIG_ACPI=y but the current hardware/firmware platform doesn't support
    ACPI, acpi_get_hp_hw_control_from_firmware() is implemented and causes a
    NULL pointer dereference  because acpi_pci_find_root() returns NULL.
    
    If acpi_pci_find_root() returns NULL, it means there's no ACPI host bridge
    device (PNP0A03 or PNP0A08), and the OS is always allowed to manage the
    SHPC, so return success in that case.

diff --git a/drivers/pci/hotplug/acpi_pcihp.c b/drivers/pci/hotplug/acpi_pcihp.c
index 3979f89b250a..912d0de29caa 100644
--- a/drivers/pci/hotplug/acpi_pcihp.c
+++ b/drivers/pci/hotplug/acpi_pcihp.c
@@ -87,8 +87,17 @@ int acpi_get_hp_hw_control_from_firmware(struct pci_dev *pdev)
 		return 0;
 
 	/* If _OSC exists, we should not evaluate OSHP */
+
+	/*
+	 * If there's no ACPI host bridge (i.e., ACPI support is compiled
+	 * into the kernel but the hardware platform doesn't support ACPI),
+	 * there's nothing to do here.
+	 */
 	host = pci_find_host_bridge(pdev->bus);
 	root = acpi_pci_find_root(ACPI_HANDLE(&host->dev));
+	if (!root)
+		return 0;
+
 	if (root->osc_support_set)
 		goto no_control;
 

commit 82ca61ebf1e38bc05753d38103791f48f6a09d91
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Fri Jun 22 17:48:11 2018 -0500

    PCI: shpchp: Separate existence of SHPC and permission to use it
    
    The shpchp driver registers for all PCI bridge devices.  Its probe method
    should fail if either (1) the bridge doesn't have an SHPC or (2) the OS
    isn't allowed to use it.
    
    Separate these two tests into:
    
      - A new shpc_capable() that looks for the SHPC hardware and is applicable
        on all systems, and
    
      - A simplified acpi_get_hp_hw_control_from_firmware() that we call only
        when we already know an SHPC exists and there may be ACPI methods to
        either request permission to use it (_OSC) or transfer control to the
        OS (OSHP).

diff --git a/drivers/pci/hotplug/acpi_pcihp.c b/drivers/pci/hotplug/acpi_pcihp.c
index 912d0de29caa..bd7358bcb2ea 100644
--- a/drivers/pci/hotplug/acpi_pcihp.c
+++ b/drivers/pci/hotplug/acpi_pcihp.c
@@ -74,20 +74,6 @@ int acpi_get_hp_hw_control_from_firmware(struct pci_dev *pdev)
 	acpi_handle chandle, handle;
 	struct acpi_buffer string = { ACPI_ALLOCATE_BUFFER, NULL };
 
-	/*
-	 * Per PCI firmware specification, we should run the ACPI _OSC
-	 * method to get control of hotplug hardware before using it. If
-	 * an _OSC is missing, we look for an OSHP to do the same thing.
-	 * To handle different BIOS behavior, we look for _OSC on a root
-	 * bridge preferentially (according to PCI fw spec). Later for
-	 * OSHP within the scope of the hotplug controller and its parents,
-	 * up to the host bridge under which this controller exists.
-	 */
-	if (shpchp_is_native(pdev))
-		return 0;
-
-	/* If _OSC exists, we should not evaluate OSHP */
-
 	/*
 	 * If there's no ACPI host bridge (i.e., ACPI support is compiled
 	 * into the kernel but the hardware platform doesn't support ACPI),
@@ -98,9 +84,25 @@ int acpi_get_hp_hw_control_from_firmware(struct pci_dev *pdev)
 	if (!root)
 		return 0;
 
-	if (root->osc_support_set)
-		goto no_control;
+	/*
+	 * If _OSC exists, it determines whether we're allowed to manage
+	 * the SHPC.  We executed it while enumerating the host bridge.
+	 */
+	if (root->osc_support_set) {
+		if (host->native_shpc_hotplug)
+			return 0;
+		return -ENODEV;
+	}
 
+	/*
+	 * In the absence of _OSC, we're always allowed to manage the SHPC.
+	 * However, if an OSHP method is present, we must execute it so the
+	 * firmware can transfer control to the OS, e.g., direct interrupts
+	 * to the OS instead of to the firmware.
+	 *
+	 * N.B. The PCI Firmware Spec (r3.2, sec 4.8) does not endorse
+	 * searching up the ACPI hierarchy, so the loops below are suspect.
+	 */
 	handle = ACPI_HANDLE(&pdev->dev);
 	if (!handle) {
 		/*
@@ -129,7 +131,7 @@ int acpi_get_hp_hw_control_from_firmware(struct pci_dev *pdev)
 		if (ACPI_FAILURE(status))
 			break;
 	}
-no_control:
+
 	pci_info(pdev, "Cannot get control of SHPC hotplug\n");
 	kfree(string.pointer);
 	return -ENODEV;
diff --git a/drivers/pci/hotplug/shpchp_core.c b/drivers/pci/hotplug/shpchp_core.c
index e91be287f292..7eb44fdc4445 100644
--- a/drivers/pci/hotplug/shpchp_core.c
+++ b/drivers/pci/hotplug/shpchp_core.c
@@ -270,11 +270,30 @@ static int get_adapter_status(struct hotplug_slot *hotplug_slot, u8 *value)
 	return 0;
 }
 
+static bool shpc_capable(struct pci_dev *bridge)
+{
+	/*
+	 * It is assumed that AMD GOLAM chips support SHPC but they do not
+	 * have SHPC capability.
+	 */
+	if (bridge->vendor == PCI_VENDOR_ID_AMD &&
+	    bridge->device == PCI_DEVICE_ID_AMD_GOLAM_7450)
+		return true;
+
+	if (pci_find_capability(bridge, PCI_CAP_ID_SHPC))
+		return true;
+
+	return false;
+}
+
 static int shpc_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
 	int rc;
 	struct controller *ctrl;
 
+	if (!shpc_capable(pdev))
+		return -ENODEV;
+
 	if (acpi_get_hp_hw_control_from_firmware(pdev))
 		return -ENODEV;
 
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 65113b6eed14..b10dd6caeda1 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -408,17 +408,6 @@ bool shpchp_is_native(struct pci_dev *bridge)
 	if (!IS_ENABLED(CONFIG_HOTPLUG_PCI_SHPC))
 		return false;
 
-	/*
-	 * It is assumed that AMD GOLAM chips support SHPC but they do not
-	 * have SHPC capability.
-	 */
-	if (bridge->vendor == PCI_VENDOR_ID_AMD &&
-	    bridge->device == PCI_DEVICE_ID_AMD_GOLAM_7450)
-		return true;
-
-	if (!pci_find_capability(bridge, PCI_CAP_ID_SHPC))
-		return false;
-
 	host = pci_find_host_bridge(bridge->bus);
 	return host->native_shpc_hotplug;
 }
Marc Zyngier June 25, 2018, 4:18 p.m. UTC | #5
Bjorn, Mika,

On 25/06/18 15:01, Bjorn Helgaas wrote:
> Hi Marc,
> 
> Sorry we broke this!

No worries, gave me a chance to stare at something else! ;-)

> 
> On Thu, Jun 21, 2018 at 05:47:15PM +0100, Marc Zyngier wrote:
>> Until recently, shpc_probe() would bail out pretty early in the
>> absence of the SHPC capability. A logic change in the way the
>> driver now checks that capability makes it go and probe the
>> firmware anyway, with ugly consequences if the system is not
>> ACPI based (my arm64 ThunderX is DT driven, and explodes in
>> a spectacular way after getting a NULL root bridge from the
>> non-existent ACPI tables...).
>>
>> Take this opportunity to move the call to shpchp_is_native()
>> back into shpc_probe(), making it clear that a non-ACPI system
>> is not expected to use this driver.
> 
> But a non-ACPI system *should* be able to use SHPC.

Yeah, I only realized that once Mika pointed it out.

> Here's my understanding of how it should work.  On an ACPI system,
> 
>   - If firmware has _OSC, the OS calls it to request permission to
>     manage the SHPC.  If _OSC grants permission, it should also
>     configure the hardware (interrupts, etc) to give the OS.
>     
>   - If there's no _OSC, shpchp assumes it's allowed to manage the
>     SHPC, and it calls OSHP to configure the hardware appropriately.
> 
> On a non-ACPI system, shpchp assumes there's no firmware involved at
> all, so it can manage the SHPC without doing anything special.
> 
> I see Mika has already posted a patch similar to the first one
> below; I think either of those should fix the problem you're seeing.

Absolutely. With this early exit when root is NULL, my test box is back
up and running.

> The second is an attempt to clean things up so they make a
> little more sense.

I haven't tried that second patch yet, but from reading it, it
definitely helps making sense of this driver.

Thanks,

	M.
diff mbox series

Patch

diff --git a/drivers/pci/hotplug/acpi_pcihp.c b/drivers/pci/hotplug/acpi_pcihp.c
index 3979f89b250a..b57f29753a08 100644
--- a/drivers/pci/hotplug/acpi_pcihp.c
+++ b/drivers/pci/hotplug/acpi_pcihp.c
@@ -83,9 +83,6 @@  int acpi_get_hp_hw_control_from_firmware(struct pci_dev *pdev)
 	 * OSHP within the scope of the hotplug controller and its parents,
 	 * up to the host bridge under which this controller exists.
 	 */
-	if (shpchp_is_native(pdev))
-		return 0;
-
 	/* If _OSC exists, we should not evaluate OSHP */
 	host = pci_find_host_bridge(pdev->bus);
 	root = acpi_pci_find_root(ACPI_HANDLE(&host->dev));
diff --git a/drivers/pci/hotplug/shpchp_core.c b/drivers/pci/hotplug/shpchp_core.c
index e91be287f292..8902fe18a636 100644
--- a/drivers/pci/hotplug/shpchp_core.c
+++ b/drivers/pci/hotplug/shpchp_core.c
@@ -275,7 +275,8 @@  static int shpc_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	int rc;
 	struct controller *ctrl;
 
-	if (acpi_get_hp_hw_control_from_firmware(pdev))
+	if (!shpchp_is_native(pdev) ||
+	    acpi_get_hp_hw_control_from_firmware(pdev))
 		return -ENODEV;
 
 	ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);