PCI: Add quirk for Cavium Thunder-X2 PCIe erratum #173

Message ID 1517554846-16703-1-git-send-email-george.cherian@cavium.com
State Not Applicable
Headers show
Series
  • PCI: Add quirk for Cavium Thunder-X2 PCIe erratum #173
Related show

Commit Message

George Cherian Feb. 2, 2018, 7 a.m.
The PCIe Controller on Cavium ThunderX2 processors does not
respond to downstream CFG/ECFG cycles when root port is
in power management D3-hot state.

In our tests the above mentioned errata causes the following crash when
the downstream endpoint config space is accessed, while root port is in
D3 state.

[   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
[   12.783453] Internal error: : 96000610 [#1] SMP
[   12.787971] Modules linked in: aes_neon_blk ablk_helper cryptd
[   12.793799] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.8.0-32-generic #34
[   12.800659] Hardware name: Cavium Inc. Unknown/Unknown, BIOS 1.0 01/01/2018
[   12.807607] task: ffff808f346b8d80 task.stack: ffff808f346b4000
[   12.813518] PC is at pci_generic_config_read+0x5c/0xf0
[   12.818643] LR is at pci_generic_config_read+0x48/0xf0
[   12.823767] pc : [<ffff000008506f34>] lr : [<ffff000008506f20>] pstate: 204000c9
[   12.831148] sp : ffff808f346b7bf0
[   12.834449] x29: ffff808f346b7bf0 x28: ffff000008e2b848
[   12.839750] x27: ffff000008dc3070 x26: ffff000008d516c0
[   12.845050] x25: 0000000000000040 x24: ffff00000937a480
[   12.850351] x23: 000000000000006c x22: 0000000000000000
[   12.855651] x21: ffff808f346b7c84 x20: 0000000000000004
[   12.860951] x19: ffff808f31076000 x18: 0000000000000000
[   12.866251] x17: 000000001b3613e6 x16: 000000007f330457
[   12.871551] x15: 0000000067268ad7 x14: 000000005c6254ac
[   12.876851] x13: 00000000f1e100cb x12: 0000000000000030
[   12.882151] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
[   12.887452] x9 : ff656d6e626d686f x8 : 7f7f7f7f7f7f7f7f
[   12.892752] x7 : ffff808f310da108 x6 : 0000000000000000
[   12.898052] x5 : 0000000000000003 x4 : ffff808f3107a800
[   12.903352] x3 : 000000000030006c x2 : 0000000000000014
[   12.908652] x1 : ffff000020000000 x0 : ffff00002030006c
[   12.913952]
[   12.915431] Process swapper/0 (pid: 1, stack limit = 0xffff808f346b4020)
[   12.922118] Stack: (0xffff808f346b7bf0 to 0xffff808f346b8000)
[   12.927850] 7be0:                                   ffff808f346b7c30 ffff000008506e2c
[   12.935665] 7c00: ffff000009185000 000000000000006c ffff808f31076000 ffff808f346b7d14
[   12.943481] 7c20: 0000000000000000 ffff000008309488 ffff808f346b7c90 ffff0000085089f4
[   12.951296] 7c40: 0000000000000004 ffff808f310d4000 0000000000000000 ffff808f346b7d14
[   12.959111] 7c60: 0000000000000068 ffff000008dc3078 ffff000008d604c8 ffff0000085089d8
[   12.966927] 7c80: 0000000000000004 000000000004080b ffff808f346b7cd0 ffff000008513d28
[   12.974742] 7ca0: ffff000009185000 00000000ffffffe7 0000000000000001 ffff808f310d4000
[   12.982557] 7cc0: ffff0000092ae000 ffff808f310d4000 ffff808f346b7d20 ffff0000085142d4
[   12.990372] 7ce0: ffff808f310d4000 ffff808f310d4000 ffff000009214000 ffff808f310d40b0
[   12.998188] 7d00: ffff0000092ae000 ffff808f310d40b0 00000000092ae000 000000000004080b
[   13.006003] 7d20: ffff808f346b7d40 ffff000008518754 0000000000000000 ffff808f310d4000
[   13.013818] 7d40: ffff808f346b7d80 ffff000008d9a974 0000000000000000 ffff808f310d4000
[   13.021634] 7d60: ffff000008d9a93c 0000000000000000 ffff0000092ae000 000000000004080b
[   13.029449] 7d80: ffff808f346b7da0 ffff000008083b4c ffff000009185000 ffff808f346b4000
[   13.037264] 7da0: ffff808f346b7e30 ffff000008d60dfc 00000000000000f5 ffff000009185000
[   13.045079] 7dc0: ffff0000092ae000 0000000000000007 ffff0000092ae000 ffff000008dc3078
[   13.052895] 7de0: ffff000008d604c8 ffff000008d51600 ffff000008dc3070 ffff000008e2b720
[   13.060710] 7e00: ffff0000091a68d8 ffff000008c09678 0000000000000000 0000000700000007
[   13.068526] 7e20: 0000000000000000 000000000004080b ffff808f346b7ea0 ffff000008980d90
[   13.076342] 7e40: ffff000008980d78 0000000000000000 0000000000000000 0000000000000000
[   13.084157] 7e60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.091972] 7e80: 0000000000000000 0000000000000000 0000000000000000 000000000004080b
[   13.099788] 7ea0: 0000000000000000 ffff000008083690 ffff000008980d78 0000000000000000
[   13.107603] 7ec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.115418] 7ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.123233] 7f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.131048] 7f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.138864] 7f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.146679] 7f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.154494] 7f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.162309] 7fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.170125] 7fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000
[   13.177940] 7fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.185755] Call trace:
[   13.188190] Exception stack(0xffff808f346b7a00 to 0xffff808f346b7b30)
[   13.194616] 7a00: ffff808f31076000 0001000000000000 ffff808f346b7bf0 ffff000008506f34
[   13.202432] 7a20: 00000000204000c9 0000000000000003 ffff000009660000 0000000000000007
[   13.210247] 7a40: ffff000000000000 0000000000000000 ffff0000092c2d3b 414d48203a6d7665
[   13.218062] 7a60: 3a73727474612043 0000000000000024 0000000005f5e0ff 0000000000000006
[   13.225878] 7a80: 0000000000000007 ffff0000092c2d25 ffff808f346b7aa0 ffff00000849bc1c
[   13.233693] 7aa0: ffff808f346b7b20 ffff00000849c100 0000000000000000 000000000004080b
[   13.241508] 7ac0: ffff00002030006c ffff000020000000 0000000000000014 000000000030006c
[   13.249323] 7ae0: ffff808f3107a800 0000000000000003 0000000000000000 ffff808f310da108
[   13.257139] 7b00: 7f7f7f7f7f7f7f7f ff656d6e626d686f 7f7f7f7f7f7f7f7f 0101010101010101
[   13.264953] 7b20: 0000000000000030 00000000f1e100cb
[   13.269819] [<ffff000008506f34>] pci_generic_config_read+0x5c/0xf0
[   13.275987] [<ffff000008506e2c>] pci_bus_read_config_dword+0xb4/0xd8
[   13.282328] [<ffff0000085089f4>] pcie_capability_read_dword+0x64/0xb8
[   13.288757] [<ffff000008513d28>] __pci_dev_reset+0x90/0x328
[   13.294317] [<ffff0000085142d4>] pci_probe_reset_function+0x24/0x30
[   13.300571] [<ffff000008518754>] pci_create_sysfs_dev_files+0x18c/0x2a0
[   13.307173] [<ffff000008d9a974>] pci_sysfs_init+0x38/0x60
[   13.312560] [<ffff000008083b4c>] do_one_initcall+0x5c/0x170
[   13.318122] [<ffff000008d60dfc>] kernel_init_freeable+0x1c0/0x27c
[   13.324205] [<ffff000008980d90>] kernel_init+0x18/0x110
[   13.329416] [<ffff000008083690>] ret_from_fork+0x10/0x40
[   13.334716] Code: 7100069f 540003c0 71000a9f 54000240 (b9400001)
[   13.340805] ---[ end trace fc992038acd29ec3 ]---

Fix this by adding a quirk that prevents the root port from 
entering D3 state. This is seen on both Ax/Bx variants of the processor.

Signed-off-by: George Cherian <george.cherian@cavium.com>
---
 drivers/pci/quirks.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Jayachandran C Feb. 13, 2018, 6:23 a.m. | #1
On Fri, Feb 02, 2018 at 07:00:46AM +0000, George Cherian wrote:
> The PCIe Controller on Cavium ThunderX2 processors does not
> respond to downstream CFG/ECFG cycles when root port is
> in power management D3-hot state.
> 
> In our tests the above mentioned errata causes the following crash when
> the downstream endpoint config space is accessed, while root port is in
> D3 state.
> 
> [   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
> [   12.783453] Internal error: : 96000610 [#1] SMP
> [   12.787971] Modules linked in: aes_neon_blk ablk_helper cryptd
> [   12.793799] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.8.0-32-generic #34
> [   12.800659] Hardware name: Cavium Inc. Unknown/Unknown, BIOS 1.0 01/01/2018
> [   12.807607] task: ffff808f346b8d80 task.stack: ffff808f346b4000
> [   12.813518] PC is at pci_generic_config_read+0x5c/0xf0
> [   12.818643] LR is at pci_generic_config_read+0x48/0xf0
> [   12.823767] pc : [<ffff000008506f34>] lr : [<ffff000008506f20>] pstate: 204000c9
> [   12.831148] sp : ffff808f346b7bf0
> [   12.834449] x29: ffff808f346b7bf0 x28: ffff000008e2b848
> [   12.839750] x27: ffff000008dc3070 x26: ffff000008d516c0
> [   12.845050] x25: 0000000000000040 x24: ffff00000937a480
> [   12.850351] x23: 000000000000006c x22: 0000000000000000
> [   12.855651] x21: ffff808f346b7c84 x20: 0000000000000004
> [   12.860951] x19: ffff808f31076000 x18: 0000000000000000
> [   12.866251] x17: 000000001b3613e6 x16: 000000007f330457
> [   12.871551] x15: 0000000067268ad7 x14: 000000005c6254ac
> [   12.876851] x13: 00000000f1e100cb x12: 0000000000000030
> [   12.882151] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
> [   12.887452] x9 : ff656d6e626d686f x8 : 7f7f7f7f7f7f7f7f
> [   12.892752] x7 : ffff808f310da108 x6 : 0000000000000000
> [   12.898052] x5 : 0000000000000003 x4 : ffff808f3107a800
> [   12.903352] x3 : 000000000030006c x2 : 0000000000000014
> [   12.908652] x1 : ffff000020000000 x0 : ffff00002030006c
> [   12.913952]
> [   12.915431] Process swapper/0 (pid: 1, stack limit = 0xffff808f346b4020)
> [   12.922118] Stack: (0xffff808f346b7bf0 to 0xffff808f346b8000)
> [   12.927850] 7be0:                                   ffff808f346b7c30 ffff000008506e2c
[...]
> [   13.269819] [<ffff000008506f34>] pci_generic_config_read+0x5c/0xf0
> [   13.275987] [<ffff000008506e2c>] pci_bus_read_config_dword+0xb4/0xd8
> [   13.282328] [<ffff0000085089f4>] pcie_capability_read_dword+0x64/0xb8
> [   13.288757] [<ffff000008513d28>] __pci_dev_reset+0x90/0x328
> [   13.294317] [<ffff0000085142d4>] pci_probe_reset_function+0x24/0x30
> [   13.300571] [<ffff000008518754>] pci_create_sysfs_dev_files+0x18c/0x2a0
> [   13.307173] [<ffff000008d9a974>] pci_sysfs_init+0x38/0x60
> [   13.312560] [<ffff000008083b4c>] do_one_initcall+0x5c/0x170
> [   13.318122] [<ffff000008d60dfc>] kernel_init_freeable+0x1c0/0x27c
> [   13.324205] [<ffff000008980d90>] kernel_init+0x18/0x110
> [   13.329416] [<ffff000008083690>] ret_from_fork+0x10/0x40
> [   13.334716] Code: 7100069f 540003c0 71000a9f 54000240 (b9400001)
> [   13.340805] ---[ end trace fc992038acd29ec3 ]---
> 
> Fix this by adding a quirk that prevents the root port from 
> entering D3 state. This is seen on both Ax/Bx variants of the processor.
> 
> Signed-off-by: George Cherian <george.cherian@cavium.com>
> ---
>  drivers/pci/quirks.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 10684b1..2eb08a8 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -1154,6 +1154,18 @@ static void quirk_ide_samemode(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_10, quirk_ide_samemode);
>  
>  /*
> + * Cavium's Thunder-X2 Processors root port doesnot handle cfg/ecfg access to
> + * downstream properly if root port is put into D3
> + */

This comment can be fixed up a bit.

> +
> +static void quirk_no_rootport_d3(struct pci_dev *pdev)
> +{
> +	pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_BROADCOM, 0x9084, quirk_no_rootport_d3);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xaf84, quirk_no_rootport_d3);
> +
> +/*
>   * Some ATA devices break if put into D3
>   */

Bjorn, if you need an ack for ThunderX2:
Acked-by: Jayachandran C <jnair@caviumnetworks.com>

This fixes the crash seen on ThunderX2 with a few PCI cards. We had worked
around the crash earlier by passing "pcie_port_pm=off" on kernel command line.

JC.
Bjorn Helgaas Feb. 13, 2018, 3:09 p.m. | #2
[+cc Lorenzo]

On Fri, Feb 02, 2018 at 07:00:46AM +0000, George Cherian wrote:
> The PCIe Controller on Cavium ThunderX2 processors does not
> respond to downstream CFG/ECFG cycles when root port is
> in power management D3-hot state.

I think you're talking about the CPU initiating a config cycle to
a device below the root port, right?

This erratum isn't published anywhere where we could include a URL,
is it?

> In our tests the above mentioned errata causes the following crash when
> the downstream endpoint config space is accessed, while root port is in
> D3 state.
> 
> [   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000

This looks like another example of not being able to deal with error
responses to PCIe events, for example, the whole mess with drivers
checking whether the link is up before issuing a config access.

In that sense, this looks like a band-aid that avoids the issue by
preventing the use of D3, but doesn't fix the underlying problem.  If
we *could* deal nicely with this error, maybe we could use D3 on these
root ports.

So I'm not convinced yet that this is actually a PCIe erratum.  Does
the hardware actually violate the PCIe spec, or is this just a problem
with the kernel not knowing how to deal nicely with a PCIe error?

If you could include the erratum text and reference to the relevant
section of the PCIe spec, that would be useful.

> [   12.783453] Internal error: : 96000610 [#1] SMP
> [   12.787971] Modules linked in: aes_neon_blk ablk_helper cryptd
> [   12.793799] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.8.0-32-generic #34
> [   12.800659] Hardware name: Cavium Inc. Unknown/Unknown, BIOS 1.0 01/01/2018
> [   12.807607] task: ffff808f346b8d80 task.stack: ffff808f346b4000
> [   12.813518] PC is at pci_generic_config_read+0x5c/0xf0
> [   12.818643] LR is at pci_generic_config_read+0x48/0xf0
> [   12.823767] pc : [<ffff000008506f34>] lr : [<ffff000008506f20>] pstate: 204000c9
> [   12.831148] sp : ffff808f346b7bf0
> [   12.834449] x29: ffff808f346b7bf0 x28: ffff000008e2b848
> [   12.839750] x27: ffff000008dc3070 x26: ffff000008d516c0
> [   12.845050] x25: 0000000000000040 x24: ffff00000937a480
> [   12.850351] x23: 000000000000006c x22: 0000000000000000
> [   12.855651] x21: ffff808f346b7c84 x20: 0000000000000004
> [   12.860951] x19: ffff808f31076000 x18: 0000000000000000
> [   12.866251] x17: 000000001b3613e6 x16: 000000007f330457
> [   12.871551] x15: 0000000067268ad7 x14: 000000005c6254ac
> [   12.876851] x13: 00000000f1e100cb x12: 0000000000000030
> [   12.882151] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
> [   12.887452] x9 : ff656d6e626d686f x8 : 7f7f7f7f7f7f7f7f
> [   12.892752] x7 : ffff808f310da108 x6 : 0000000000000000
> [   12.898052] x5 : 0000000000000003 x4 : ffff808f3107a800
> [   12.903352] x3 : 000000000030006c x2 : 0000000000000014
> [   12.908652] x1 : ffff000020000000 x0 : ffff00002030006c
> [   12.913952]
> [   12.915431] Process swapper/0 (pid: 1, stack limit = 0xffff808f346b4020)
> [   12.922118] Stack: (0xffff808f346b7bf0 to 0xffff808f346b8000)
> [   12.927850] 7be0:                                   ffff808f346b7c30 ffff000008506e2c
> [   12.935665] 7c00: ffff000009185000 000000000000006c ffff808f31076000 ffff808f346b7d14
> [   12.943481] 7c20: 0000000000000000 ffff000008309488 ffff808f346b7c90 ffff0000085089f4
> [   12.951296] 7c40: 0000000000000004 ffff808f310d4000 0000000000000000 ffff808f346b7d14
> [   12.959111] 7c60: 0000000000000068 ffff000008dc3078 ffff000008d604c8 ffff0000085089d8
> [   12.966927] 7c80: 0000000000000004 000000000004080b ffff808f346b7cd0 ffff000008513d28
> [   12.974742] 7ca0: ffff000009185000 00000000ffffffe7 0000000000000001 ffff808f310d4000
> [   12.982557] 7cc0: ffff0000092ae000 ffff808f310d4000 ffff808f346b7d20 ffff0000085142d4
> [   12.990372] 7ce0: ffff808f310d4000 ffff808f310d4000 ffff000009214000 ffff808f310d40b0
> [   12.998188] 7d00: ffff0000092ae000 ffff808f310d40b0 00000000092ae000 000000000004080b
> [   13.006003] 7d20: ffff808f346b7d40 ffff000008518754 0000000000000000 ffff808f310d4000
> [   13.013818] 7d40: ffff808f346b7d80 ffff000008d9a974 0000000000000000 ffff808f310d4000
> [   13.021634] 7d60: ffff000008d9a93c 0000000000000000 ffff0000092ae000 000000000004080b
> [   13.029449] 7d80: ffff808f346b7da0 ffff000008083b4c ffff000009185000 ffff808f346b4000
> [   13.037264] 7da0: ffff808f346b7e30 ffff000008d60dfc 00000000000000f5 ffff000009185000
> [   13.045079] 7dc0: ffff0000092ae000 0000000000000007 ffff0000092ae000 ffff000008dc3078
> [   13.052895] 7de0: ffff000008d604c8 ffff000008d51600 ffff000008dc3070 ffff000008e2b720
> [   13.060710] 7e00: ffff0000091a68d8 ffff000008c09678 0000000000000000 0000000700000007
> [   13.068526] 7e20: 0000000000000000 000000000004080b ffff808f346b7ea0 ffff000008980d90
> [   13.076342] 7e40: ffff000008980d78 0000000000000000 0000000000000000 0000000000000000
> [   13.084157] 7e60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [   13.091972] 7e80: 0000000000000000 0000000000000000 0000000000000000 000000000004080b
> [   13.099788] 7ea0: 0000000000000000 ffff000008083690 ffff000008980d78 0000000000000000
> [   13.107603] 7ec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [   13.115418] 7ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [   13.123233] 7f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [   13.131048] 7f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [   13.138864] 7f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [   13.146679] 7f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [   13.154494] 7f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [   13.162309] 7fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [   13.170125] 7fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000
> [   13.177940] 7fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [   13.185755] Call trace:
> [   13.188190] Exception stack(0xffff808f346b7a00 to 0xffff808f346b7b30)
> [   13.194616] 7a00: ffff808f31076000 0001000000000000 ffff808f346b7bf0 ffff000008506f34
> [   13.202432] 7a20: 00000000204000c9 0000000000000003 ffff000009660000 0000000000000007
> [   13.210247] 7a40: ffff000000000000 0000000000000000 ffff0000092c2d3b 414d48203a6d7665
> [   13.218062] 7a60: 3a73727474612043 0000000000000024 0000000005f5e0ff 0000000000000006
> [   13.225878] 7a80: 0000000000000007 ffff0000092c2d25 ffff808f346b7aa0 ffff00000849bc1c
> [   13.233693] 7aa0: ffff808f346b7b20 ffff00000849c100 0000000000000000 000000000004080b
> [   13.241508] 7ac0: ffff00002030006c ffff000020000000 0000000000000014 000000000030006c
> [   13.249323] 7ae0: ffff808f3107a800 0000000000000003 0000000000000000 ffff808f310da108
> [   13.257139] 7b00: 7f7f7f7f7f7f7f7f ff656d6e626d686f 7f7f7f7f7f7f7f7f 0101010101010101
> [   13.264953] 7b20: 0000000000000030 00000000f1e100cb
> [   13.269819] [<ffff000008506f34>] pci_generic_config_read+0x5c/0xf0
> [   13.275987] [<ffff000008506e2c>] pci_bus_read_config_dword+0xb4/0xd8
> [   13.282328] [<ffff0000085089f4>] pcie_capability_read_dword+0x64/0xb8
> [   13.288757] [<ffff000008513d28>] __pci_dev_reset+0x90/0x328
> [   13.294317] [<ffff0000085142d4>] pci_probe_reset_function+0x24/0x30
> [   13.300571] [<ffff000008518754>] pci_create_sysfs_dev_files+0x18c/0x2a0
> [   13.307173] [<ffff000008d9a974>] pci_sysfs_init+0x38/0x60
> [   13.312560] [<ffff000008083b4c>] do_one_initcall+0x5c/0x170
> [   13.318122] [<ffff000008d60dfc>] kernel_init_freeable+0x1c0/0x27c
> [   13.324205] [<ffff000008980d90>] kernel_init+0x18/0x110
> [   13.329416] [<ffff000008083690>] ret_from_fork+0x10/0x40
> [   13.334716] Code: 7100069f 540003c0 71000a9f 54000240 (b9400001)
> [   13.340805] ---[ end trace fc992038acd29ec3 ]---

Most of this dmesg output is overkill -- I don't think the timestamps,
the register contents, or the hex stack dump will contribute to
understanding the issue or helping a user match a problem with this
patch.

> Fix this by adding a quirk that prevents the root port from 
> entering D3 state. This is seen on both Ax/Bx variants of the processor.
> 
> Signed-off-by: George Cherian <george.cherian@cavium.com>
> ---
>  drivers/pci/quirks.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 10684b1..2eb08a8 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -1154,6 +1154,18 @@ static void quirk_ide_samemode(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_10, quirk_ide_samemode);
>  
>  /*
> + * Cavium's Thunder-X2 Processors root port doesnot handle cfg/ecfg access to
> + * downstream properly if root port is put into D3
> + */
> +
> +static void quirk_no_rootport_d3(struct pci_dev *pdev)
> +{
> +	pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_BROADCOM, 0x9084, quirk_no_rootport_d3);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xaf84, quirk_no_rootport_d3);

I assume this is really only of interest if we have the Thunder-X2
host bridge driver in the kernel, right?  Could the quirk be moved to
that driver so it's not included in everybody's kernel?

> +
> +/*
>   * Some ATA devices break if put into D3
>   */
>  
> -- 
> 2.1.4
>
George Cherian Feb. 14, 2018, 11:28 a.m. | #3
Hi Bjorn,

Thanks for the review.

On 02/13/2018 08:39 PM, Bjorn Helgaas wrote:
> [+cc Lorenzo]
> 
> On Fri, Feb 02, 2018 at 07:00:46AM +0000, George Cherian wrote:
>> The PCIe Controller on Cavium ThunderX2 processors does not
>> respond to downstream CFG/ECFG cycles when root port is
>> in power management D3-hot state.
> 
> I think you're talking about the CPU initiating a config cycle to
> a device below the root port, right?
Yes
> 
> This erratum isn't published anywhere where we could include a URL,
> is it?
This erratum is available at support.cavium.com, You might need to 
register to the website to get hold of a copy.
> 
>> In our tests the above mentioned errata causes the following crash when
>> the downstream endpoint config space is accessed, while root port is in
>> D3 state.
>>
>> [   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
> 
> This looks like another example of not being able to deal with error
> responses to PCIe events, for example, the whole mess with drivers
> checking whether the link is up before issuing a config access.
> 
> In that sense, this looks like a band-aid that avoids the issue by
> preventing the use of D3, but doesn't fix the underlying problem.  If
> we *could* deal nicely with this error, maybe we could use D3 on these
> root ports.
> 
> So I'm not convinced yet that this is actually a PCIe erratum.  Does
> the hardware actually violate the PCIe spec, or is this just a problem
> with the kernel not knowing how to deal nicely with a PCIe error?
> 
This is not an issue with the way kernel handles the PCIe error.

> If you could include the erratum text and reference to the relevant
> section of the PCIe spec, that would be useful.
>
The relevant section of the PCIe Spec is the following Section 5.3 on 
wards (subsection  5.3.1.4.1)

Configuration and Message requests are the only TLPs accepted by a 
Function in the D3hot state. All other received Requests must be handled 
as Unsupported Requests, and all received Completions may optionally be 
handled as Unexpected Completions.

>> [   12.783453] Internal error: : 96000610 [#1] SMP
>> [   12.787971] Modules linked in: aes_neon_blk ablk_helper cryptd
>> [   12.793799] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.8.0-32-generic #34
>> [   12.800659] Hardware name: Cavium Inc. Unknown/Unknown, BIOS 1.0 01/01/2018
>> [   12.807607] task: ffff808f346b8d80 task.stack: ffff808f346b4000
>> [   12.813518] PC is at pci_generic_config_read+0x5c/0xf0
>> [   12.818643] LR is at pci_generic_config_read+0x48/0xf0
>> [   12.823767] pc : [<ffff000008506f34>] lr : [<ffff000008506f20>] pstate: 204000c9
>> [   12.831148] sp : ffff808f346b7bf0
>> [   12.834449] x29: ffff808f346b7bf0 x28: ffff000008e2b848
>> [   12.839750] x27: ffff000008dc3070 x26: ffff000008d516c0
>> [   12.845050] x25: 0000000000000040 x24: ffff00000937a480
>> [   12.850351] x23: 000000000000006c x22: 0000000000000000
>> [   12.855651] x21: ffff808f346b7c84 x20: 0000000000000004
>> [   12.860951] x19: ffff808f31076000 x18: 0000000000000000
>> [   12.866251] x17: 000000001b3613e6 x16: 000000007f330457
>> [   12.871551] x15: 0000000067268ad7 x14: 000000005c6254ac
>> [   12.876851] x13: 00000000f1e100cb x12: 0000000000000030
>> [   12.882151] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
>> [   12.887452] x9 : ff656d6e626d686f x8 : 7f7f7f7f7f7f7f7f
>> [   12.892752] x7 : ffff808f310da108 x6 : 0000000000000000
>> [   12.898052] x5 : 0000000000000003 x4 : ffff808f3107a800
>> [   12.903352] x3 : 000000000030006c x2 : 0000000000000014
>> [   12.908652] x1 : ffff000020000000 x0 : ffff00002030006c
>> [   12.913952]
>> [   12.915431] Process swapper/0 (pid: 1, stack limit = 0xffff808f346b4020)
>> [   12.922118] Stack: (0xffff808f346b7bf0 to 0xffff808f346b8000)
>> [   12.927850] 7be0:                                   ffff808f346b7c30 ffff000008506e2c
>> [   12.935665] 7c00: ffff000009185000 000000000000006c ffff808f31076000 ffff808f346b7d14
>> [   12.943481] 7c20: 0000000000000000 ffff000008309488 ffff808f346b7c90 ffff0000085089f4
>> [   12.951296] 7c40: 0000000000000004 ffff808f310d4000 0000000000000000 ffff808f346b7d14
>> [   12.959111] 7c60: 0000000000000068 ffff000008dc3078 ffff000008d604c8 ffff0000085089d8
>> [   12.966927] 7c80: 0000000000000004 000000000004080b ffff808f346b7cd0 ffff000008513d28
>> [   12.974742] 7ca0: ffff000009185000 00000000ffffffe7 0000000000000001 ffff808f310d4000
>> [   12.982557] 7cc0: ffff0000092ae000 ffff808f310d4000 ffff808f346b7d20 ffff0000085142d4
>> [   12.990372] 7ce0: ffff808f310d4000 ffff808f310d4000 ffff000009214000 ffff808f310d40b0
>> [   12.998188] 7d00: ffff0000092ae000 ffff808f310d40b0 00000000092ae000 000000000004080b
>> [   13.006003] 7d20: ffff808f346b7d40 ffff000008518754 0000000000000000 ffff808f310d4000
>> [   13.013818] 7d40: ffff808f346b7d80 ffff000008d9a974 0000000000000000 ffff808f310d4000
>> [   13.021634] 7d60: ffff000008d9a93c 0000000000000000 ffff0000092ae000 000000000004080b
>> [   13.029449] 7d80: ffff808f346b7da0 ffff000008083b4c ffff000009185000 ffff808f346b4000
>> [   13.037264] 7da0: ffff808f346b7e30 ffff000008d60dfc 00000000000000f5 ffff000009185000
>> [   13.045079] 7dc0: ffff0000092ae000 0000000000000007 ffff0000092ae000 ffff000008dc3078
>> [   13.052895] 7de0: ffff000008d604c8 ffff000008d51600 ffff000008dc3070 ffff000008e2b720
>> [   13.060710] 7e00: ffff0000091a68d8 ffff000008c09678 0000000000000000 0000000700000007
>> [   13.068526] 7e20: 0000000000000000 000000000004080b ffff808f346b7ea0 ffff000008980d90
>> [   13.076342] 7e40: ffff000008980d78 0000000000000000 0000000000000000 0000000000000000
>> [   13.084157] 7e60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [   13.091972] 7e80: 0000000000000000 0000000000000000 0000000000000000 000000000004080b
>> [   13.099788] 7ea0: 0000000000000000 ffff000008083690 ffff000008980d78 0000000000000000
>> [   13.107603] 7ec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [   13.115418] 7ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [   13.123233] 7f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [   13.131048] 7f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [   13.138864] 7f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [   13.146679] 7f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [   13.154494] 7f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [   13.162309] 7fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [   13.170125] 7fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000
>> [   13.177940] 7fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [   13.185755] Call trace:
>> [   13.188190] Exception stack(0xffff808f346b7a00 to 0xffff808f346b7b30)
>> [   13.194616] 7a00: ffff808f31076000 0001000000000000 ffff808f346b7bf0 ffff000008506f34
>> [   13.202432] 7a20: 00000000204000c9 0000000000000003 ffff000009660000 0000000000000007
>> [   13.210247] 7a40: ffff000000000000 0000000000000000 ffff0000092c2d3b 414d48203a6d7665
>> [   13.218062] 7a60: 3a73727474612043 0000000000000024 0000000005f5e0ff 0000000000000006
>> [   13.225878] 7a80: 0000000000000007 ffff0000092c2d25 ffff808f346b7aa0 ffff00000849bc1c
>> [   13.233693] 7aa0: ffff808f346b7b20 ffff00000849c100 0000000000000000 000000000004080b
>> [   13.241508] 7ac0: ffff00002030006c ffff000020000000 0000000000000014 000000000030006c
>> [   13.249323] 7ae0: ffff808f3107a800 0000000000000003 0000000000000000 ffff808f310da108
>> [   13.257139] 7b00: 7f7f7f7f7f7f7f7f ff656d6e626d686f 7f7f7f7f7f7f7f7f 0101010101010101
>> [   13.264953] 7b20: 0000000000000030 00000000f1e100cb
>> [   13.269819] [<ffff000008506f34>] pci_generic_config_read+0x5c/0xf0
>> [   13.275987] [<ffff000008506e2c>] pci_bus_read_config_dword+0xb4/0xd8
>> [   13.282328] [<ffff0000085089f4>] pcie_capability_read_dword+0x64/0xb8
>> [   13.288757] [<ffff000008513d28>] __pci_dev_reset+0x90/0x328
>> [   13.294317] [<ffff0000085142d4>] pci_probe_reset_function+0x24/0x30
>> [   13.300571] [<ffff000008518754>] pci_create_sysfs_dev_files+0x18c/0x2a0
>> [   13.307173] [<ffff000008d9a974>] pci_sysfs_init+0x38/0x60
>> [   13.312560] [<ffff000008083b4c>] do_one_initcall+0x5c/0x170
>> [   13.318122] [<ffff000008d60dfc>] kernel_init_freeable+0x1c0/0x27c
>> [   13.324205] [<ffff000008980d90>] kernel_init+0x18/0x110
>> [   13.329416] [<ffff000008083690>] ret_from_fork+0x10/0x40
>> [   13.334716] Code: 7100069f 540003c0 71000a9f 54000240 (b9400001)
>> [   13.340805] ---[ end trace fc992038acd29ec3 ]---
> 
> Most of this dmesg output is overkill -- I don't think the timestamps,
> the register contents, or the hex stack dump will contribute to
> understanding the issue or helping a user match a problem with this
> patch.
Yes I will update the commit log in a better way to capture the issue.
> 
>> Fix this by adding a quirk that prevents the root port from
>> entering D3 state. This is seen on both Ax/Bx variants of the processor.
>>
>> Signed-off-by: George Cherian <george.cherian@cavium.com>
>> ---
>>   drivers/pci/quirks.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>> index 10684b1..2eb08a8 100644
>> --- a/drivers/pci/quirks.c
>> +++ b/drivers/pci/quirks.c
>> @@ -1154,6 +1154,18 @@ static void quirk_ide_samemode(struct pci_dev *pdev)
>>   DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_10, quirk_ide_samemode);
>>   
>>   /*
>> + * Cavium's Thunder-X2 Processors root port doesnot handle cfg/ecfg access to
>> + * downstream properly if root port is put into D3
>> + */
>> +
>> +static void quirk_no_rootport_d3(struct pci_dev *pdev)
>> +{
>> +	pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
>> +}
>> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_BROADCOM, 0x9084, quirk_no_rootport_d3);
>> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xaf84, quirk_no_rootport_d3);
> 
> I assume this is really only of interest if we have the Thunder-X2
> host bridge driver in the kernel, right?  Could the quirk be moved to
> that driver so it's not included in everybody's kernel?
> 
We dont have a separate driver for  THUNDERX2 PCIhost. It uses the 
pci-host-generic driver. Instead I can have the changes to be under
#ifdef CONFIG_ARCH_THUNDER2
#endif

so that it is only included for CONFIG_ARCH_THUNDER2 enabled builds.

>> +
>> +/*
>>    * Some ATA devices break if put into D3
>>    */
>>   
>> -- 
>> 2.1.4
>>

Regards,
-George
Bjorn Helgaas Feb. 14, 2018, 8:16 p.m. | #4
[+cc Rafael, PM question below]

On Wed, Feb 14, 2018 at 04:58:08PM +0530, George Cherian wrote:
> On 02/13/2018 08:39 PM, Bjorn Helgaas wrote:
> >On Fri, Feb 02, 2018 at 07:00:46AM +0000, George Cherian wrote:
> >>The PCIe Controller on Cavium ThunderX2 processors does not
> >>respond to downstream CFG/ECFG cycles when root port is
> >>in power management D3-hot state.
> >
> >I think you're talking about the CPU initiating a config cycle to
> >a device below the root port, right?
> Yes

If a bridge, e.g., a Root Port in your case, is in D3hot, we should be
able to access config space of the bridge itself, but the secondary
bus will be in B2 or B3 and we won't be able to access config space
for any devices below the bridge.  This is true for *all* bridges, not
just this Cavium Root Port.

The PCIe r4.0, sec 5.3.1, implementation note seems relevant:

  When a Type 1 Function associated with a Switch/Root Port (a
  "virtual bridge") is in a non-D0 power state, it will emulate the
  behavior of a conventional PCI bridge in its handling of Memory,
  I/O, and Configuration Requests and Completions. ...  All Type 1
  Configuration Requests are terminated as Unsupported Requests, ...

> >This erratum isn't published anywhere where we could include a URL,
> >is it?
> This erratum is available at support.cavium.com, You might need to
> register to the website to get hold of a copy.

That appears to require an NDA, so that doesn't work for me.  Can you
just include the erratum text in the changelog?

> >>In our tests the above mentioned errata causes the following crash when
> >>the downstream endpoint config space is accessed, while root port is in
> >>D3 state.
> >>
> >>[   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
> >
> >This looks like another example of not being able to deal with error
> >responses to PCIe events, for example, the whole mess with drivers
> >checking whether the link is up before issuing a config access.
> >
> >In that sense, this looks like a band-aid that avoids the issue by
> >preventing the use of D3, but doesn't fix the underlying problem.  If
> >we *could* deal nicely with this error, maybe we could use D3 on these
> >root ports.
> >
> >So I'm not convinced yet that this is actually a PCIe erratum.  Does
> >the hardware actually violate the PCIe spec, or is this just a problem
> >with the kernel not knowing how to deal nicely with a PCIe error?
> >
> This is not an issue with the way kernel handles the PCIe error.
> 
> >If you could include the erratum text and reference to the relevant
> >section of the PCIe spec, that would be useful.
> >
> The relevant section of the PCIe Spec is the following Section 5.3
> on wards (subsection  5.3.1.4.1)
> 
> Configuration and Message requests are the only TLPs accepted by a
> Function in the D3hot state. All other received Requests must be
> handled as Unsupported Requests, and all received Completions may
> optionally be handled as Unexpected Completions.

This isn't exactly relevant because it says requests *other than*
config and message requests must be handled as Unsupported Requests,
and we're talking about a config request.  I think sec 5.3.1 is more
to the point: the Root Port sees a Type 1 Configuration Request that
would be forwarded to its secondary interface if the port were in D0,
and that request should be terminated as an Unsupported Request.

I think the question is how the Root Complex turns this Unsupported
Request into some signal to the CPU.  The implementation note in 2.3.2
might be relevant:

  Some system configuration software depends on reading a data value
  of all 1’s when a Configuration Read Request is terminated as an
  Unsupported Request, particularly when probing to determine the
  existence of a device in the system. A Root Complex intended for use
  with software that depends on a read-data value of all 1’s must
  synthesize this value when UR Completion Status is returned for a
  Configuration Read Request.

So maybe the erratum is that the RC was intended to synthesize all 1's
but it doesn't?

There are other cases that can result in Unsupported Request
completions, so my fear is that this change will take care of one such
case but leave others that will cause similar unhandled external
aborts.

> >>[   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000

> >>[   12.813518] PC is at pci_generic_config_read+0x5c/0xf0
> >>[   12.818643] LR is at pci_generic_config_read+0x48/0xf0
> >> ...
> >>[   13.269819] [<ffff000008506f34>] pci_generic_config_read+0x5c/0xf0
> >>[   13.275987] [<ffff000008506e2c>] pci_bus_read_config_dword+0xb4/0xd8
> >>[   13.282328] [<ffff0000085089f4>] pcie_capability_read_dword+0x64/0xb8
> >>[   13.288757] [<ffff000008513d28>] __pci_dev_reset+0x90/0x328
> >>[   13.294317] [<ffff0000085142d4>] pci_probe_reset_function+0x24/0x30
> >>[   13.300571] [<ffff000008518754>] pci_create_sysfs_dev_files+0x18c/0x2a0
> >>[   13.307173] [<ffff000008d9a974>] pci_sysfs_init+0x38/0x60
> >>[   13.312560] [<ffff000008083b4c>] do_one_initcall+0x5c/0x170
> >>[   13.318122] [<ffff000008d60dfc>] kernel_init_freeable+0x1c0/0x27c
> >>[   13.324205] [<ffff000008980d90>] kernel_init+0x18/0x110
> >>[   13.329416] [<ffff000008083690>] ret_from_fork+0x10/0x40

I should have asked this before: why are we even trying to do this
config read to a device that's in D3?  I assume this root port started
out in D0 because we apparently enumerated it successfully, but it
must have been put into D3 later?

The pci_probe_reset_function() function comment says "the PCI device
must be responsive to PCI config space in order to use this function." 
So apparently the caller should make sure the device is in at least
D3hot and any bridges leading to it are in D0.

If we're missing whatever it is that makes sure the target device is
reachable and responsive to config space, we likely have issues on
other systems as well.  On Cavium this causes the external abort, but
on other systems, we'd probably just get all 1's data back from the
config read and silently do the wrong thing in
pci_probe_reset_function().

I don't know how this runtime PM works, but maybe Rafael can help us
out.

> >>Fix this by adding a quirk that prevents the root port from
> >>entering D3 state. This is seen on both Ax/Bx variants of the processor.
> >>
> >>Signed-off-by: George Cherian <george.cherian@cavium.com>
> >>---
> >>  drivers/pci/quirks.c | 12 ++++++++++++
> >>  1 file changed, 12 insertions(+)
> >>
> >>diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> >>index 10684b1..2eb08a8 100644
> >>--- a/drivers/pci/quirks.c
> >>+++ b/drivers/pci/quirks.c
> >>@@ -1154,6 +1154,18 @@ static void quirk_ide_samemode(struct pci_dev *pdev)
> >>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_10, quirk_ide_samemode);
> >>  /*
> >>+ * Cavium's Thunder-X2 Processors root port doesnot handle cfg/ecfg access to
> >>+ * downstream properly if root port is put into D3
> >>+ */
> >>+
> >>+static void quirk_no_rootport_d3(struct pci_dev *pdev)
> >>+{
> >>+	pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
> >>+}
> >>+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_BROADCOM, 0x9084, quirk_no_rootport_d3);
> >>+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xaf84, quirk_no_rootport_d3);
> >
> >I assume this is really only of interest if we have the Thunder-X2
> >host bridge driver in the kernel, right?  Could the quirk be moved to
> >that driver so it's not included in everybody's kernel?
> >
> We dont have a separate driver for  THUNDERX2 PCIhost. It uses the
> pci-host-generic driver. Instead I can have the changes to be under
> #ifdef CONFIG_ARCH_THUNDER2
> #endif
> 
> so that it is only included for CONFIG_ARCH_THUNDER2 enabled builds.
Rafael J. Wysocki Feb. 15, 2018, 9:57 p.m. | #5
On Wednesday, February 14, 2018 9:16:53 PM CET Bjorn Helgaas wrote:
> [+cc Rafael, PM question below]

+Mika

> On Wed, Feb 14, 2018 at 04:58:08PM +0530, George Cherian wrote:
> > On 02/13/2018 08:39 PM, Bjorn Helgaas wrote:
> > >On Fri, Feb 02, 2018 at 07:00:46AM +0000, George Cherian wrote:
> > >>The PCIe Controller on Cavium ThunderX2 processors does not
> > >>respond to downstream CFG/ECFG cycles when root port is
> > >>in power management D3-hot state.
> > >
> > >I think you're talking about the CPU initiating a config cycle to
> > >a device below the root port, right?
> > Yes
> 
> If a bridge, e.g., a Root Port in your case, is in D3hot, we should be
> able to access config space of the bridge itself, but the secondary
> bus will be in B2 or B3 and we won't be able to access config space
> for any devices below the bridge.  This is true for *all* bridges, not
> just this Cavium Root Port.

Right.

But AFAICS config space reads from devices that aren't there (which
effectively is what happens if the bridge is in D3hot) are at least
expected to return all ones.

> The PCIe r4.0, sec 5.3.1, implementation note seems relevant:
> 
>   When a Type 1 Function associated with a Switch/Root Port (a
>   "virtual bridge") is in a non-D0 power state, it will emulate the
>   behavior of a conventional PCI bridge in its handling of Memory,
>   I/O, and Configuration Requests and Completions. ...  All Type 1
>   Configuration Requests are terminated as Unsupported Requests, ...
> 
> > >This erratum isn't published anywhere where we could include a URL,
> > >is it?
> > This erratum is available at support.cavium.com, You might need to
> > register to the website to get hold of a copy.
> 
> That appears to require an NDA, so that doesn't work for me.  Can you
> just include the erratum text in the changelog?
> 
> > >>In our tests the above mentioned errata causes the following crash when
> > >>the downstream endpoint config space is accessed, while root port is in
> > >>D3 state.
> > >>
> > >>[   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
> > >
> > >This looks like another example of not being able to deal with error
> > >responses to PCIe events, for example, the whole mess with drivers
> > >checking whether the link is up before issuing a config access.
> > >
> > >In that sense, this looks like a band-aid that avoids the issue by
> > >preventing the use of D3, but doesn't fix the underlying problem.  If
> > >we *could* deal nicely with this error, maybe we could use D3 on these
> > >root ports.
> > >
> > >So I'm not convinced yet that this is actually a PCIe erratum.  Does
> > >the hardware actually violate the PCIe spec, or is this just a problem
> > >with the kernel not knowing how to deal nicely with a PCIe error?
> > >
> > This is not an issue with the way kernel handles the PCIe error.
> > 
> > >If you could include the erratum text and reference to the relevant
> > >section of the PCIe spec, that would be useful.
> > >
> > The relevant section of the PCIe Spec is the following Section 5.3
> > on wards (subsection  5.3.1.4.1)
> > 
> > Configuration and Message requests are the only TLPs accepted by a
> > Function in the D3hot state. All other received Requests must be
> > handled as Unsupported Requests, and all received Completions may
> > optionally be handled as Unexpected Completions.
> 
> This isn't exactly relevant because it says requests *other than*
> config and message requests must be handled as Unsupported Requests,
> and we're talking about a config request.  I think sec 5.3.1 is more
> to the point: the Root Port sees a Type 1 Configuration Request that
> would be forwarded to its secondary interface if the port were in D0,
> and that request should be terminated as an Unsupported Request.
> 
> I think the question is how the Root Complex turns this Unsupported
> Request into some signal to the CPU.  The implementation note in 2.3.2
> might be relevant:
> 
>   Some system configuration software depends on reading a data value
>   of all 1’s when a Configuration Read Request is terminated as an
>   Unsupported Request, particularly when probing to determine the
>   existence of a device in the system. A Root Complex intended for use
>   with software that depends on a read-data value of all 1’s must
>   synthesize this value when UR Completion Status is returned for a
>   Configuration Read Request.
> 
> So maybe the erratum is that the RC was intended to synthesize all 1's
> but it doesn't?
> 
> There are other cases that can result in Unsupported Request
> completions, so my fear is that this change will take care of one such
> case but leave others that will cause similar unhandled external
> aborts.
> 
> > >>[   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
> 
> > >>[   12.813518] PC is at pci_generic_config_read+0x5c/0xf0
> > >>[   12.818643] LR is at pci_generic_config_read+0x48/0xf0
> > >> ...
> > >>[   13.269819] [<ffff000008506f34>] pci_generic_config_read+0x5c/0xf0
> > >>[   13.275987] [<ffff000008506e2c>] pci_bus_read_config_dword+0xb4/0xd8
> > >>[   13.282328] [<ffff0000085089f4>] pcie_capability_read_dword+0x64/0xb8
> > >>[   13.288757] [<ffff000008513d28>] __pci_dev_reset+0x90/0x328
> > >>[   13.294317] [<ffff0000085142d4>] pci_probe_reset_function+0x24/0x30
> > >>[   13.300571] [<ffff000008518754>] pci_create_sysfs_dev_files+0x18c/0x2a0
> > >>[   13.307173] [<ffff000008d9a974>] pci_sysfs_init+0x38/0x60
> > >>[   13.312560] [<ffff000008083b4c>] do_one_initcall+0x5c/0x170
> > >>[   13.318122] [<ffff000008d60dfc>] kernel_init_freeable+0x1c0/0x27c
> > >>[   13.324205] [<ffff000008980d90>] kernel_init+0x18/0x110
> > >>[   13.329416] [<ffff000008083690>] ret_from_fork+0x10/0x40
> 
> I should have asked this before: why are we even trying to do this
> config read to a device that's in D3?  I assume this root port started
> out in D0 because we apparently enumerated it successfully, but it
> must have been put into D3 later?
> 
> The pci_probe_reset_function() function comment says "the PCI device
> must be responsive to PCI config space in order to use this function." 
> So apparently the caller should make sure the device is in at least
> D3hot and any bridges leading to it are in D0.
> 
> If we're missing whatever it is that makes sure the target device is
> reachable and responsive to config space, we likely have issues on
> other systems as well.  On Cavium this causes the external abort, but
> on other systems, we'd probably just get all 1's data back from the
> config read and silently do the wrong thing in
> pci_probe_reset_function().
> 
> I don't know how this runtime PM works, but maybe Rafael can help us
> out.

I'm not sure what the question is to be honest.

Unused ports are autosuspended after 100ms as per pcie_portdrv_probe().

Now, it looks like they should be resumed in the code path in question,
so this case seems to have been overlooked.

Mika, what do you think?
Bjorn Helgaas Feb. 15, 2018, 11:39 p.m. | #6
On Thu, Feb 15, 2018 at 10:57:25PM +0100, Rafael J. Wysocki wrote:
> On Wednesday, February 14, 2018 9:16:53 PM CET Bjorn Helgaas wrote:
> > On Wed, Feb 14, 2018 at 04:58:08PM +0530, George Cherian wrote:
> > > On 02/13/2018 08:39 PM, Bjorn Helgaas wrote:
> > > >On Fri, Feb 02, 2018 at 07:00:46AM +0000, George Cherian wrote:
> > > >>The PCIe Controller on Cavium ThunderX2 processors does not
> > > >>respond to downstream CFG/ECFG cycles when root port is
> > > >>in power management D3-hot state.
> > > >
> > > >I think you're talking about the CPU initiating a config cycle to
> > > >a device below the root port, right?
> > > Yes
> > 
> > If a bridge, e.g., a Root Port in your case, is in D3hot, we should be
> > able to access config space of the bridge itself, but the secondary
> > bus will be in B2 or B3 and we won't be able to access config space
> > for any devices below the bridge.  This is true for *all* bridges, not
> > just this Cavium Root Port.
> 
> Right.
> 
> But AFAICS config space reads from devices that aren't there (which
> effectively is what happens if the bridge is in D3hot) are at least
> expected to return all ones.

Yes.  AIUI, the PCIe spec doesn't actually *require* all ones in this
case, but the sec 2.3.2 implementation note says hardware intended for
use with software (like Linux) that assumes the all ones behavior
should synthesize all ones.  So I'm speculating that this Cavium
device is supposed to synthesize all ones, but does not.

But from the discussion below, it sounds like this may have helped
uncover a more serious Linux bug, i.e., we don't resume a device
before trying to use it.

> > The PCIe r4.0, sec 5.3.1, implementation note seems relevant:
> > 
> >   When a Type 1 Function associated with a Switch/Root Port (a
> >   "virtual bridge") is in a non-D0 power state, it will emulate the
> >   behavior of a conventional PCI bridge in its handling of Memory,
> >   I/O, and Configuration Requests and Completions. ...  All Type 1
> >   Configuration Requests are terminated as Unsupported Requests, ...
> > 
> > > >This erratum isn't published anywhere where we could include a URL,
> > > >is it?
> > > This erratum is available at support.cavium.com, You might need to
> > > register to the website to get hold of a copy.
> > 
> > That appears to require an NDA, so that doesn't work for me.  Can you
> > just include the erratum text in the changelog?
> > 
> > > >>In our tests the above mentioned errata causes the following crash when
> > > >>the downstream endpoint config space is accessed, while root port is in
> > > >>D3 state.
> > > >>
> > > >>[   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
> > > >
> > > >This looks like another example of not being able to deal with error
> > > >responses to PCIe events, for example, the whole mess with drivers
> > > >checking whether the link is up before issuing a config access.
> > > >
> > > >In that sense, this looks like a band-aid that avoids the issue by
> > > >preventing the use of D3, but doesn't fix the underlying problem.  If
> > > >we *could* deal nicely with this error, maybe we could use D3 on these
> > > >root ports.
> > > >
> > > >So I'm not convinced yet that this is actually a PCIe erratum.  Does
> > > >the hardware actually violate the PCIe spec, or is this just a problem
> > > >with the kernel not knowing how to deal nicely with a PCIe error?
> > > >
> > > This is not an issue with the way kernel handles the PCIe error.
> > > 
> > > >If you could include the erratum text and reference to the relevant
> > > >section of the PCIe spec, that would be useful.
> > > >
> > > The relevant section of the PCIe Spec is the following Section 5.3
> > > on wards (subsection  5.3.1.4.1)
> > > 
> > > Configuration and Message requests are the only TLPs accepted by a
> > > Function in the D3hot state. All other received Requests must be
> > > handled as Unsupported Requests, and all received Completions may
> > > optionally be handled as Unexpected Completions.
> > 
> > This isn't exactly relevant because it says requests *other than*
> > config and message requests must be handled as Unsupported Requests,
> > and we're talking about a config request.  I think sec 5.3.1 is more
> > to the point: the Root Port sees a Type 1 Configuration Request that
> > would be forwarded to its secondary interface if the port were in D0,
> > and that request should be terminated as an Unsupported Request.
> > 
> > I think the question is how the Root Complex turns this Unsupported
> > Request into some signal to the CPU.  The implementation note in 2.3.2
> > might be relevant:
> > 
> >   Some system configuration software depends on reading a data value
> >   of all 1’s when a Configuration Read Request is terminated as an
> >   Unsupported Request, particularly when probing to determine the
> >   existence of a device in the system. A Root Complex intended for use
> >   with software that depends on a read-data value of all 1’s must
> >   synthesize this value when UR Completion Status is returned for a
> >   Configuration Read Request.
> > 
> > So maybe the erratum is that the RC was intended to synthesize all 1's
> > but it doesn't?
> > 
> > There are other cases that can result in Unsupported Request
> > completions, so my fear is that this change will take care of one such
> > case but leave others that will cause similar unhandled external
> > aborts.
> > 
> > > >>[   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
> > 
> > > >>[   12.813518] PC is at pci_generic_config_read+0x5c/0xf0
> > > >>[   12.818643] LR is at pci_generic_config_read+0x48/0xf0
> > > >> ...
> > > >>[   13.269819] [<ffff000008506f34>] pci_generic_config_read+0x5c/0xf0
> > > >>[   13.275987] [<ffff000008506e2c>] pci_bus_read_config_dword+0xb4/0xd8
> > > >>[   13.282328] [<ffff0000085089f4>] pcie_capability_read_dword+0x64/0xb8
> > > >>[   13.288757] [<ffff000008513d28>] __pci_dev_reset+0x90/0x328
> > > >>[   13.294317] [<ffff0000085142d4>] pci_probe_reset_function+0x24/0x30
> > > >>[   13.300571] [<ffff000008518754>] pci_create_sysfs_dev_files+0x18c/0x2a0
> > > >>[   13.307173] [<ffff000008d9a974>] pci_sysfs_init+0x38/0x60
> > > >>[   13.312560] [<ffff000008083b4c>] do_one_initcall+0x5c/0x170
> > > >>[   13.318122] [<ffff000008d60dfc>] kernel_init_freeable+0x1c0/0x27c
> > > >>[   13.324205] [<ffff000008980d90>] kernel_init+0x18/0x110
> > > >>[   13.329416] [<ffff000008083690>] ret_from_fork+0x10/0x40
> > 
> > I should have asked this before: why are we even trying to do this
> > config read to a device that's in D3?  I assume this root port started
> > out in D0 because we apparently enumerated it successfully, but it
> > must have been put into D3 later?
> > 
> > The pci_probe_reset_function() function comment says "the PCI device
> > must be responsive to PCI config space in order to use this function." 
> > So apparently the caller should make sure the device is in at least
> > D3hot and any bridges leading to it are in D0.
> > 
> > If we're missing whatever it is that makes sure the target device is
> > reachable and responsive to config space, we likely have issues on
> > other systems as well.  On Cavium this causes the external abort, but
> > on other systems, we'd probably just get all 1's data back from the
> > config read and silently do the wrong thing in
> > pci_probe_reset_function().
> > 
> > I don't know how this runtime PM works, but maybe Rafael can help us
> > out.
> 
> I'm not sure what the question is to be honest.

My questions are basically "What does the PCI core need to do to make
sure a device is in D0 before it operates on it?  And where do we need
to do that?"

> Unused ports are autosuspended after 100ms as per pcie_portdrv_probe().

So I guess this is only done for PCIe bridges, not for conventional
PCI bridges.  Is this because autosuspend depends on a PCIe-only
feature?

Here's the path we're in now:

  pcie_portdrv_init                        # device_initcall (level 6)
    pci_register_driver(&pcie_portdriver)
      ...
        pcie_portdrv_probe                 # pcie_portdriver.probe
          pm_runtime_set_autosuspend_delay(&dev->dev, 100)

  pci_sysfs_init                           # late_initcall (level 7)
    for_each_pci_dev(dev)
      pci_create_sysfs_dev_files(dev)
        pci_create_capabilities_sysfs(dev)
          pci_probe_reset_function(dev)
            ...
              pcie_capability_read_dword   <-- config read causing abort

I guess it's conceivable that more than 100ms elapsed between
pcie_portdrv_probe() and pci_probe_reset_function(), and obviously we
can't depend on any particular timing.

> Now, it looks like they should be resumed in the code path in question,
> so this case seems to have been overlooked.

How do we identify which code paths need to resume the device, and
what do we need to do to resume it?

I'm kind of scared about this because most hardware just returns all
ones in this case and we might never notice.

Bjorn
Rafael J. Wysocki Feb. 16, 2018, 12:40 p.m. | #7
On Friday, February 16, 2018 12:39:00 AM CET Bjorn Helgaas wrote:
> On Thu, Feb 15, 2018 at 10:57:25PM +0100, Rafael J. Wysocki wrote:
> > On Wednesday, February 14, 2018 9:16:53 PM CET Bjorn Helgaas wrote:
> > > On Wed, Feb 14, 2018 at 04:58:08PM +0530, George Cherian wrote:
> > > > On 02/13/2018 08:39 PM, Bjorn Helgaas wrote:
> > > > >On Fri, Feb 02, 2018 at 07:00:46AM +0000, George Cherian wrote:
> > > > >>The PCIe Controller on Cavium ThunderX2 processors does not
> > > > >>respond to downstream CFG/ECFG cycles when root port is
> > > > >>in power management D3-hot state.
> > > > >
> > > > >I think you're talking about the CPU initiating a config cycle to
> > > > >a device below the root port, right?
> > > > Yes
> > > 
> > > If a bridge, e.g., a Root Port in your case, is in D3hot, we should be
> > > able to access config space of the bridge itself, but the secondary
> > > bus will be in B2 or B3 and we won't be able to access config space
> > > for any devices below the bridge.  This is true for *all* bridges, not
> > > just this Cavium Root Port.
> > 
> > Right.
> > 
> > But AFAICS config space reads from devices that aren't there (which
> > effectively is what happens if the bridge is in D3hot) are at least
> > expected to return all ones.
> 
> Yes.  AIUI, the PCIe spec doesn't actually *require* all ones in this
> case, but the sec 2.3.2 implementation note says hardware intended for
> use with software (like Linux) that assumes the all ones behavior
> should synthesize all ones.  So I'm speculating that this Cavium
> device is supposed to synthesize all ones, but does not.
> 
> But from the discussion below, it sounds like this may have helped
> uncover a more serious Linux bug, i.e., we don't resume a device
> before trying to use it.
> 
> > > The PCIe r4.0, sec 5.3.1, implementation note seems relevant:
> > > 
> > >   When a Type 1 Function associated with a Switch/Root Port (a
> > >   "virtual bridge") is in a non-D0 power state, it will emulate the
> > >   behavior of a conventional PCI bridge in its handling of Memory,
> > >   I/O, and Configuration Requests and Completions. ...  All Type 1
> > >   Configuration Requests are terminated as Unsupported Requests, ...
> > > 
> > > > >This erratum isn't published anywhere where we could include a URL,
> > > > >is it?
> > > > This erratum is available at support.cavium.com, You might need to
> > > > register to the website to get hold of a copy.
> > > 
> > > That appears to require an NDA, so that doesn't work for me.  Can you
> > > just include the erratum text in the changelog?
> > > 
> > > > >>In our tests the above mentioned errata causes the following crash when
> > > > >>the downstream endpoint config space is accessed, while root port is in
> > > > >>D3 state.
> > > > >>
> > > > >>[   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
> > > > >
> > > > >This looks like another example of not being able to deal with error
> > > > >responses to PCIe events, for example, the whole mess with drivers
> > > > >checking whether the link is up before issuing a config access.
> > > > >
> > > > >In that sense, this looks like a band-aid that avoids the issue by
> > > > >preventing the use of D3, but doesn't fix the underlying problem.  If
> > > > >we *could* deal nicely with this error, maybe we could use D3 on these
> > > > >root ports.
> > > > >
> > > > >So I'm not convinced yet that this is actually a PCIe erratum.  Does
> > > > >the hardware actually violate the PCIe spec, or is this just a problem
> > > > >with the kernel not knowing how to deal nicely with a PCIe error?
> > > > >
> > > > This is not an issue with the way kernel handles the PCIe error.
> > > > 
> > > > >If you could include the erratum text and reference to the relevant
> > > > >section of the PCIe spec, that would be useful.
> > > > >
> > > > The relevant section of the PCIe Spec is the following Section 5.3
> > > > on wards (subsection  5.3.1.4.1)
> > > > 
> > > > Configuration and Message requests are the only TLPs accepted by a
> > > > Function in the D3hot state. All other received Requests must be
> > > > handled as Unsupported Requests, and all received Completions may
> > > > optionally be handled as Unexpected Completions.
> > > 
> > > This isn't exactly relevant because it says requests *other than*
> > > config and message requests must be handled as Unsupported Requests,
> > > and we're talking about a config request.  I think sec 5.3.1 is more
> > > to the point: the Root Port sees a Type 1 Configuration Request that
> > > would be forwarded to its secondary interface if the port were in D0,
> > > and that request should be terminated as an Unsupported Request.
> > > 
> > > I think the question is how the Root Complex turns this Unsupported
> > > Request into some signal to the CPU.  The implementation note in 2.3.2
> > > might be relevant:
> > > 
> > >   Some system configuration software depends on reading a data value
> > >   of all 1’s when a Configuration Read Request is terminated as an
> > >   Unsupported Request, particularly when probing to determine the
> > >   existence of a device in the system. A Root Complex intended for use
> > >   with software that depends on a read-data value of all 1’s must
> > >   synthesize this value when UR Completion Status is returned for a
> > >   Configuration Read Request.
> > > 
> > > So maybe the erratum is that the RC was intended to synthesize all 1's
> > > but it doesn't?
> > > 
> > > There are other cases that can result in Unsupported Request
> > > completions, so my fear is that this change will take care of one such
> > > case but leave others that will cause similar unhandled external
> > > aborts.
> > > 
> > > > >>[   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
> > > 
> > > > >>[   12.813518] PC is at pci_generic_config_read+0x5c/0xf0
> > > > >>[   12.818643] LR is at pci_generic_config_read+0x48/0xf0
> > > > >> ...
> > > > >>[   13.269819] [<ffff000008506f34>] pci_generic_config_read+0x5c/0xf0
> > > > >>[   13.275987] [<ffff000008506e2c>] pci_bus_read_config_dword+0xb4/0xd8
> > > > >>[   13.282328] [<ffff0000085089f4>] pcie_capability_read_dword+0x64/0xb8
> > > > >>[   13.288757] [<ffff000008513d28>] __pci_dev_reset+0x90/0x328
> > > > >>[   13.294317] [<ffff0000085142d4>] pci_probe_reset_function+0x24/0x30
> > > > >>[   13.300571] [<ffff000008518754>] pci_create_sysfs_dev_files+0x18c/0x2a0
> > > > >>[   13.307173] [<ffff000008d9a974>] pci_sysfs_init+0x38/0x60
> > > > >>[   13.312560] [<ffff000008083b4c>] do_one_initcall+0x5c/0x170
> > > > >>[   13.318122] [<ffff000008d60dfc>] kernel_init_freeable+0x1c0/0x27c
> > > > >>[   13.324205] [<ffff000008980d90>] kernel_init+0x18/0x110
> > > > >>[   13.329416] [<ffff000008083690>] ret_from_fork+0x10/0x40
> > > 
> > > I should have asked this before: why are we even trying to do this
> > > config read to a device that's in D3?  I assume this root port started
> > > out in D0 because we apparently enumerated it successfully, but it
> > > must have been put into D3 later?
> > > 
> > > The pci_probe_reset_function() function comment says "the PCI device
> > > must be responsive to PCI config space in order to use this function." 
> > > So apparently the caller should make sure the device is in at least
> > > D3hot and any bridges leading to it are in D0.
> > > 
> > > If we're missing whatever it is that makes sure the target device is
> > > reachable and responsive to config space, we likely have issues on
> > > other systems as well.  On Cavium this causes the external abort, but
> > > on other systems, we'd probably just get all 1's data back from the
> > > config read and silently do the wrong thing in
> > > pci_probe_reset_function().
> > > 
> > > I don't know how this runtime PM works, but maybe Rafael can help us
> > > out.
> > 
> > I'm not sure what the question is to be honest.
> 
> My questions are basically "What does the PCI core need to do to make
> sure a device is in D0 before it operates on it?  And where do we need
> to do that?"
> 
> > Unused ports are autosuspended after 100ms as per pcie_portdrv_probe().
> 
> So I guess this is only done for PCIe bridges, not for conventional
> PCI bridges.  Is this because autosuspend depends on a PCIe-only
> feature?

Yes, that's PCIe-only.

> Here's the path we're in now:
> 
>   pcie_portdrv_init                        # device_initcall (level 6)
>     pci_register_driver(&pcie_portdriver)
>       ...
>         pcie_portdrv_probe                 # pcie_portdriver.probe
>           pm_runtime_set_autosuspend_delay(&dev->dev, 100)
> 
>   pci_sysfs_init                           # late_initcall (level 7)
>     for_each_pci_dev(dev)
>       pci_create_sysfs_dev_files(dev)
>         pci_create_capabilities_sysfs(dev)
>           pci_probe_reset_function(dev)
>             ...
>               pcie_capability_read_dword   <-- config read causing abort
> 
> I guess it's conceivable that more than 100ms elapsed between
> pcie_portdrv_probe() and pci_probe_reset_function(), and obviously we
> can't depend on any particular timing.

Right.

> > Now, it looks like they should be resumed in the code path in question,
> > so this case seems to have been overlooked.
> 
> How do we identify which code paths need to resume the device, and
> what do we need to do to resume it?

The "suspended" metastate is generally defined as "the device may not be
accessible now" and for the majority of bus types it is unsafe to attempt
to access suspended devices.

That's not strictly the case for PCI, but IMO PCI devices should be resumed
before accessing them too because of the below. :-)

pm_runtime_get_sync() resumes the device and prevents it from suspending
again until pm_runtime_put() (or equivalent) is called on it.

And, of course, resuming the target device itself will cause all of the
bridges above it to resume (at least that should happen by design).

> I'm kind of scared about this because most hardware just returns all
> ones in this case and we might never notice.

That's a valid concern and not just related to bridges.  Devices in D3cold
look like they were not present.
Bjorn Helgaas Feb. 16, 2018, 8:34 p.m. | #8
On Fri, Feb 16, 2018 at 01:40:37PM +0100, Rafael J. Wysocki wrote:
> On Friday, February 16, 2018 12:39:00 AM CET Bjorn Helgaas wrote:
> > On Thu, Feb 15, 2018 at 10:57:25PM +0100, Rafael J. Wysocki wrote:
> > > On Wednesday, February 14, 2018 9:16:53 PM CET Bjorn Helgaas wrote:
> > > > On Wed, Feb 14, 2018 at 04:58:08PM +0530, George Cherian wrote:
> > > > > On 02/13/2018 08:39 PM, Bjorn Helgaas wrote:
> > > > > >On Fri, Feb 02, 2018 at 07:00:46AM +0000, George Cherian wrote:

<snip>
> > > > > >>[   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
> > > > 
> > > > > >>[   12.813518] PC is at pci_generic_config_read+0x5c/0xf0
> > > > > >>[   12.818643] LR is at pci_generic_config_read+0x48/0xf0
> > > > > >> ...
> > > > > >>[   13.269819] [<ffff000008506f34>] pci_generic_config_read+0x5c/0xf0
> > > > > >>[   13.275987] [<ffff000008506e2c>] pci_bus_read_config_dword+0xb4/0xd8
> > > > > >>[   13.282328] [<ffff0000085089f4>] pcie_capability_read_dword+0x64/0xb8
> > > > > >>[   13.288757] [<ffff000008513d28>] __pci_dev_reset+0x90/0x328
> > > > > >>[   13.294317] [<ffff0000085142d4>] pci_probe_reset_function+0x24/0x30
> > > > > >>[   13.300571] [<ffff000008518754>] pci_create_sysfs_dev_files+0x18c/0x2a0
> > > > > >>[   13.307173] [<ffff000008d9a974>] pci_sysfs_init+0x38/0x60
> > > > > >>[   13.312560] [<ffff000008083b4c>] do_one_initcall+0x5c/0x170
> > > > > >>[   13.318122] [<ffff000008d60dfc>] kernel_init_freeable+0x1c0/0x27c
> > > > > >>[   13.324205] [<ffff000008980d90>] kernel_init+0x18/0x110
> > > > > >>[   13.329416] [<ffff000008083690>] ret_from_fork+0x10/0x40
> > > > 
> > > > I should have asked this before: why are we even trying to do this
> > > > config read to a device that's in D3?  I assume this root port started
> > > > out in D0 because we apparently enumerated it successfully, but it
> > > > must have been put into D3 later?
> > > > 
> > > > The pci_probe_reset_function() function comment says "the PCI device
> > > > must be responsive to PCI config space in order to use this function." 
> > > > So apparently the caller should make sure the device is in at least
> > > > D3hot and any bridges leading to it are in D0.
> > > > 
> > > > If we're missing whatever it is that makes sure the target device is
> > > > reachable and responsive to config space, we likely have issues on
> > > > other systems as well.  On Cavium this causes the external abort, but
> > > > on other systems, we'd probably just get all 1's data back from the
> > > > config read and silently do the wrong thing in
> > > > pci_probe_reset_function().
> > > > 
> > > > I don't know how this runtime PM works, but maybe Rafael can help us
> > > > out.
> > > 
> > > I'm not sure what the question is to be honest.
> > 
> > My questions are basically "What does the PCI core need to do to make
> > sure a device is in D0 before it operates on it?  And where do we need
> > to do that?"
> > 
> > > Unused ports are autosuspended after 100ms as per pcie_portdrv_probe().
> > 
> > So I guess this is only done for PCIe bridges, not for conventional
> > PCI bridges.  Is this because autosuspend depends on a PCIe-only
> > feature?
> 
> Yes, that's PCIe-only.

What PCIe feature does this depend on?

I see the PCIe test in pci_bridge_d3_possible(), added by 9d26d3a8f1b0
("PCI: Put PCIe ports into D3 during suspend"), but unfortunately that
commit doesn't say *why* we only do this for PCIe, or only for BIOS
dates of 2015 or newer.

> > Here's the path we're in now:
> > 
> >   pcie_portdrv_init                        # device_initcall (level 6)
> >     pci_register_driver(&pcie_portdriver)
> >       ...
> >         pcie_portdrv_probe                 # pcie_portdriver.probe
> >           pm_runtime_set_autosuspend_delay(&dev->dev, 100)
> > 
> >   pci_sysfs_init                           # late_initcall (level 7)
> >     for_each_pci_dev(dev)
> >       pci_create_sysfs_dev_files(dev)
> >         pci_create_capabilities_sysfs(dev)
> >           pci_probe_reset_function(dev)
> >             ...
> >               pcie_capability_read_dword   <-- config read causing abort
> > 
> > I guess it's conceivable that more than 100ms elapsed between
> > pcie_portdrv_probe() and pci_probe_reset_function(), and obviously we
> > can't depend on any particular timing.
> 
> Right.

George, can you find out which specific devices are involved here?

My working assumption is that as for_each_pci_dev() iterates through
all the devices, we call pci_probe_reset_function() for a root port
that happens to be in D3hot (this should work fine because the root
port's config space should be accessible), then we call
pci_probe_reset_function() for a device *below* the root port.  That
doesn't work because the root port's link is down and and devices
below it are not accessible.  The root port should treat this as an
Unsupported Request.

> > > Now, it looks like they should be resumed in the code path in question,
> > > so this case seems to have been overlooked.
> > 
> > How do we identify which code paths need to resume the device, and
> > what do we need to do to resume it?
> 
> The "suspended" metastate is generally defined as "the device may not be
> accessible now" and for the majority of bus types it is unsafe to attempt
> to access suspended devices.
> 
> That's not strictly the case for PCI, but IMO PCI devices should be resumed
> before accessing them too because of the below. :-)
> 
> pm_runtime_get_sync() resumes the device and prevents it from suspending
> again until pm_runtime_put() (or equivalent) is called on it.

How do we identify the places that need to call pm_runtime_get_sync()?

PCI devices will respond to config accesses while in D3hot, but not to
memory or I/O accesses.  I think that distinction is probably too
detailed and the general rule should be "the device must be in D0
before you access it."

I guess generally the driver is in control, and if it suspends the
device, it has to be smart enough to resume it before touching it.

This particular pci_probe_reset_function() case is a problem because
it's the PCI *core* doing the access while the device may already be
claimed by a driver.  I think it's generally taboo for the core to do
this, and I have a patch to move this access so it happens before a
driver can claim the device.

We also have the case where userspace accesses a device via sysfs.  I
guess all those cases probably need to wake up the device.  It looks
like pci_config_pm_runtime_get() already does that for config
accesses.

But I think other entry points, e.g., the following, should have the
same problem and I don't see anything there:

  pci_read_legacy_io()
  pci_write_legacy_io()
  pci_mmap_legacy_mem()
  pci_mmap_legacy_io()
  pci_mmap_resource()
  pci_resource_io()
  pci_read_rom()
  reset_store()

Plus probably the ASPM control files, VPD, maybe others?

Bjorn
Rafael J. Wysocki Feb. 19, 2018, 11:21 a.m. | #9
On Friday, February 16, 2018 9:34:34 PM CET Bjorn Helgaas wrote:
> On Fri, Feb 16, 2018 at 01:40:37PM +0100, Rafael J. Wysocki wrote:
> > On Friday, February 16, 2018 12:39:00 AM CET Bjorn Helgaas wrote:
> > > On Thu, Feb 15, 2018 at 10:57:25PM +0100, Rafael J. Wysocki wrote:
> > > > On Wednesday, February 14, 2018 9:16:53 PM CET Bjorn Helgaas wrote:
> > > > > On Wed, Feb 14, 2018 at 04:58:08PM +0530, George Cherian wrote:
> > > > > > On 02/13/2018 08:39 PM, Bjorn Helgaas wrote:
> > > > > > >On Fri, Feb 02, 2018 at 07:00:46AM +0000, George Cherian wrote:
> 
> <snip>
> > > > > > >>[   12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000
> > > > > 
> > > > > > >>[   12.813518] PC is at pci_generic_config_read+0x5c/0xf0
> > > > > > >>[   12.818643] LR is at pci_generic_config_read+0x48/0xf0
> > > > > > >> ...
> > > > > > >>[   13.269819] [<ffff000008506f34>] pci_generic_config_read+0x5c/0xf0
> > > > > > >>[   13.275987] [<ffff000008506e2c>] pci_bus_read_config_dword+0xb4/0xd8
> > > > > > >>[   13.282328] [<ffff0000085089f4>] pcie_capability_read_dword+0x64/0xb8
> > > > > > >>[   13.288757] [<ffff000008513d28>] __pci_dev_reset+0x90/0x328
> > > > > > >>[   13.294317] [<ffff0000085142d4>] pci_probe_reset_function+0x24/0x30
> > > > > > >>[   13.300571] [<ffff000008518754>] pci_create_sysfs_dev_files+0x18c/0x2a0
> > > > > > >>[   13.307173] [<ffff000008d9a974>] pci_sysfs_init+0x38/0x60
> > > > > > >>[   13.312560] [<ffff000008083b4c>] do_one_initcall+0x5c/0x170
> > > > > > >>[   13.318122] [<ffff000008d60dfc>] kernel_init_freeable+0x1c0/0x27c
> > > > > > >>[   13.324205] [<ffff000008980d90>] kernel_init+0x18/0x110
> > > > > > >>[   13.329416] [<ffff000008083690>] ret_from_fork+0x10/0x40
> > > > > 
> > > > > I should have asked this before: why are we even trying to do this
> > > > > config read to a device that's in D3?  I assume this root port started
> > > > > out in D0 because we apparently enumerated it successfully, but it
> > > > > must have been put into D3 later?
> > > > > 
> > > > > The pci_probe_reset_function() function comment says "the PCI device
> > > > > must be responsive to PCI config space in order to use this function." 
> > > > > So apparently the caller should make sure the device is in at least
> > > > > D3hot and any bridges leading to it are in D0.
> > > > > 
> > > > > If we're missing whatever it is that makes sure the target device is
> > > > > reachable and responsive to config space, we likely have issues on
> > > > > other systems as well.  On Cavium this causes the external abort, but
> > > > > on other systems, we'd probably just get all 1's data back from the
> > > > > config read and silently do the wrong thing in
> > > > > pci_probe_reset_function().
> > > > > 
> > > > > I don't know how this runtime PM works, but maybe Rafael can help us
> > > > > out.
> > > > 
> > > > I'm not sure what the question is to be honest.
> > > 
> > > My questions are basically "What does the PCI core need to do to make
> > > sure a device is in D0 before it operates on it?  And where do we need
> > > to do that?"
> > > 
> > > > Unused ports are autosuspended after 100ms as per pcie_portdrv_probe().
> > > 
> > > So I guess this is only done for PCIe bridges, not for conventional
> > > PCI bridges.  Is this because autosuspend depends on a PCIe-only
> > > feature?
> > 
> > Yes, that's PCIe-only.
> 
> What PCIe feature does this depend on?

It doesn't depend on any PCIe feature in particular, but we don't need to do it
for PCI-to-PCI bridges in general.

> I see the PCIe test in pci_bridge_d3_possible(), added by 9d26d3a8f1b0
> ("PCI: Put PCIe ports into D3 during suspend"), but unfortunately that
> commit doesn't say *why* we only do this for PCIe, or only for BIOS
> dates of 2015 or newer.

The reason why we need this is related to deep low-power states on new
SoCs that only can be entered if PCIe root ports are in D3 (that is, if the
root ports are not in D3, the whole SoC cannot enter some of its deep
low-power states and that may exted to package C-states for processors).

That also is the reason why we do it for PCIe only.

The reason for the cut-off date is because we know it for a fact that PCIe
port PM doesn't work on some older platforms due to hardware issues and,
again, it only really needs to be done on recent SoCs.

I believe that all of the above was mentioned during discussions on the patches
that ended up as commit 9d26d3a8f1b0 and so on.

> > > Here's the path we're in now:
> > > 
> > >   pcie_portdrv_init                        # device_initcall (level 6)
> > >     pci_register_driver(&pcie_portdriver)
> > >       ...
> > >         pcie_portdrv_probe                 # pcie_portdriver.probe
> > >           pm_runtime_set_autosuspend_delay(&dev->dev, 100)
> > > 
> > >   pci_sysfs_init                           # late_initcall (level 7)
> > >     for_each_pci_dev(dev)
> > >       pci_create_sysfs_dev_files(dev)
> > >         pci_create_capabilities_sysfs(dev)
> > >           pci_probe_reset_function(dev)
> > >             ...
> > >               pcie_capability_read_dword   <-- config read causing abort
> > > 
> > > I guess it's conceivable that more than 100ms elapsed between
> > > pcie_portdrv_probe() and pci_probe_reset_function(), and obviously we
> > > can't depend on any particular timing.
> > 
> > Right.
> 
> George, can you find out which specific devices are involved here?
> 
> My working assumption is that as for_each_pci_dev() iterates through
> all the devices, we call pci_probe_reset_function() for a root port
> that happens to be in D3hot (this should work fine because the root
> port's config space should be accessible), then we call
> pci_probe_reset_function() for a device *below* the root port.  That
> doesn't work because the root port's link is down and and devices
> below it are not accessible.  The root port should treat this as an
> Unsupported Request.
> 
> > > > Now, it looks like they should be resumed in the code path in question,
> > > > so this case seems to have been overlooked.
> > > 
> > > How do we identify which code paths need to resume the device, and
> > > what do we need to do to resume it?
> > 
> > The "suspended" metastate is generally defined as "the device may not be
> > accessible now" and for the majority of bus types it is unsafe to attempt
> > to access suspended devices.
> > 
> > That's not strictly the case for PCI, but IMO PCI devices should be resumed
> > before accessing them too because of the below. :-)
> > 
> > pm_runtime_get_sync() resumes the device and prevents it from suspending
> > again until pm_runtime_put() (or equivalent) is called on it.
> 
> How do we identify the places that need to call pm_runtime_get_sync()?
> 
> PCI devices will respond to config accesses while in D3hot, but not to
> memory or I/O accesses.  I think that distinction is probably too
> detailed and the general rule should be "the device must be in D0
> before you access it."

Right, but in some cases we access config spaces for things like PME
status changes and similar (and AML may be accessing them too for similar
reasons) and if the request is dropped on the floor silently then, well,
so be it.

> I guess generally the driver is in control, and if it suspends the
> device, it has to be smart enough to resume it before touching it.

Not necessarily.

Say, user space may want to access it too via the bus type's sysfs.

That's why we have pci_config_pm_runtime_get(), for instance.

> This particular pci_probe_reset_function() case is a problem because
> it's the PCI *core* doing the access while the device may already be
> claimed by a driver.  I think it's generally taboo for the core to do
> this, and I have a patch to move this access so it happens before a
> driver can claim the device.

The core may do accesses like that IMO (especially reads, not writes), but
indeed this particular case looks a bit over the top.

> We also have the case where userspace accesses a device via sysfs.  I
> guess all those cases probably need to wake up the device.  It looks
> like pci_config_pm_runtime_get() already does that for config
> accesses.

Right.

> But I think other entry points, e.g., the following, should have the
> same problem and I don't see anything there:
> 
>   pci_read_legacy_io()
>   pci_write_legacy_io()
>   pci_mmap_legacy_mem()
>   pci_mmap_legacy_io()
>   pci_mmap_resource()
>   pci_resource_io()
>   pci_read_rom()
>   reset_store()
> 
> Plus probably the ASPM control files, VPD, maybe others?

Generally speaking, every code path going directly to the config space
(below the driver) needs to take the possibility that the device may be in
D3cold into account.

Whether or not this means resuming the device upfront depends on the purpose
of the access I think.
Jon Masters Feb. 19, 2018, 7:08 p.m. | #10
Hi Bjorn, Rafael, others,

On 02/15/2018 06:39 PM, Bjorn Helgaas wrote:
> On Thu, Feb 15, 2018 at 10:57:25PM +0100, Rafael J. Wysocki wrote:
>> On Wednesday, February 14, 2018 9:16:53 PM CET Bjorn Helgaas wrote:
>>> On Wed, Feb 14, 2018 at 04:58:08PM +0530, George Cherian wrote:
>>>> On 02/13/2018 08:39 PM, Bjorn Helgaas wrote:
>>>>> On Fri, Feb 02, 2018 at 07:00:46AM +0000, George Cherian wrote:
>>>>>> The PCIe Controller on Cavium ThunderX2 processors does not
>>>>>> respond to downstream CFG/ECFG cycles when root port is
>>>>>> in power management D3-hot state.
>>>>>
>>>>> I think you're talking about the CPU initiating a config cycle to
>>>>> a device below the root port, right?
>>>> Yes
>>>
>>> If a bridge, e.g., a Root Port in your case, is in D3hot, we should be
>>> able to access config space of the bridge itself, but the secondary
>>> bus will be in B2 or B3 and we won't be able to access config space
>>> for any devices below the bridge.  This is true for *all* bridges, not
>>> just this Cavium Root Port.
>>
>> Right.
>>
>> But AFAICS config space reads from devices that aren't there (which
>> effectively is what happens if the bridge is in D3hot) are at least
>> expected to return all ones.
> 
> Yes.  AIUI, the PCIe spec doesn't actually *require* all ones

Indeed. This was my reading of the spec last year when I originally
discovered this bug (and suggested the temporary bandaid of the runtime
kernel parameter to disable pm for the port). I've seen this on certain
Cavium ThunderX2 systems in specific configurations, but in my debug
sessions it seemed that the problem was we're expecting all 1s and we
don't get those, so we then ultimately get an SError and go to lunch.

<snip>

> But from the discussion below, it sounds like this may have helped
> uncover a more serious Linux bug, i.e., we don't resume a device
> before trying to use it.

I suspected this too, but didn't get chance to followup. I had expected
the above would have been posted many months ago.

Jon.
Bjorn Helgaas Feb. 19, 2018, 8:36 p.m. | #11
On Mon, Feb 19, 2018 at 12:21:56PM +0100, Rafael J. Wysocki wrote:
> On Friday, February 16, 2018 9:34:34 PM CET Bjorn Helgaas wrote:
> > On Fri, Feb 16, 2018 at 01:40:37PM +0100, Rafael J. Wysocki wrote:
> > > On Friday, February 16, 2018 12:39:00 AM CET Bjorn Helgaas wrote:
<snip>

> > > > My questions are basically "What does the PCI core need to do
> > > > to make sure a device is in D0 before it operates on it?  And
> > > > where do we need to do that?"
> > > > 
> > > > > Unused ports are autosuspended after 100ms as per
> > > > > pcie_portdrv_probe().
> > > > 
> > > > So I guess this is only done for PCIe bridges, not for
> > > > conventional PCI bridges.  Is this because autosuspend depends
> > > > on a PCIe-only feature?
> > > 
> > > Yes, that's PCIe-only.
> > 
> > What PCIe feature does this depend on?
> 
> It doesn't depend on any PCIe feature in particular, but we don't
> need to do it for PCI-to-PCI bridges in general.
> 
> > I see the PCIe test in pci_bridge_d3_possible(), added by
> > 9d26d3a8f1b0 ("PCI: Put PCIe ports into D3 during suspend"), but
> > unfortunately that commit doesn't say *why* we only do this for
> > PCIe, or only for BIOS dates of 2015 or newer.
> 
> The reason why we need this is related to deep low-power states on
> new SoCs that only can be entered if PCIe root ports are in D3 (that
> is, if the root ports are not in D3, the whole SoC cannot enter some
> of its deep low-power states and that may exted to package C-states
> for processors).
> 
> That also is the reason why we do it for PCIe only.
> 
> The reason for the cut-off date is because we know it for a fact
> that PCIe port PM doesn't work on some older platforms due to
> hardware issues and, again, it only really needs to be done on
> recent SoCs.
> 
> I believe that all of the above was mentioned during discussions on
> the patches that ended up as commit 9d26d3a8f1b0 and so on.

OK.  I should have made sure more of this was captured in the code
comments.  The code doesn't match the spec and unless you happen to
remember all that history, you'd have no idea why, which makes
maintenance a bit of a hassle.  I'll propose a patch to add some
comments.

Bjorn
Lukas Wunner Feb. 20, 2018, 1:54 a.m. | #12
On Mon, Feb 19, 2018 at 12:21:56PM +0100, Rafael J. Wysocki wrote:
> On Friday, February 16, 2018 9:34:34 PM CET Bjorn Helgaas wrote:
> > On Fri, Feb 16, 2018 at 01:40:37PM +0100, Rafael J. Wysocki wrote:
> > > On Friday, February 16, 2018 12:39:00 AM CET Bjorn Helgaas wrote:
> > > > On Thu, Feb 15, 2018 at 10:57:25PM +0100, Rafael J. Wysocki wrote:
> > > > > On Wednesday, February 14, 2018 9:16:53 PM CET Bjorn Helgaas wrote:
> > > > > > I don't know how this runtime PM works, but maybe Rafael can help
> > > > > > us out.

This has nothing to do with runtime PM AFAICS.

The device seems to be in D3hot on boot, is that correct?
The PCI core assumes that unbound devices remain in D0
(see comments in pci_pm_runtime_resume() / pci_pm_runtime_suspend()).


> > > > > I'm not sure what the question is to be honest.
> > > > 
> > > > My questions are basically "What does the PCI core need to do to make
> > > > sure a device is in D0 before it operates on it?  And where do we need
> > > > to do that?"

When scanning the bus and discovering the device is not in D0,
call pci_power_up().  This could probably go into pci_init_pm().
Once a driver binds to it, it may choose to runtime suspend it to
D3hot again.

Just an idea anyway.

Thanks,

Lukas
Rafael J. Wysocki Feb. 20, 2018, 10:58 a.m. | #13
On Tuesday, February 20, 2018 2:54:33 AM CET Lukas Wunner wrote:
> On Mon, Feb 19, 2018 at 12:21:56PM +0100, Rafael J. Wysocki wrote:
> > On Friday, February 16, 2018 9:34:34 PM CET Bjorn Helgaas wrote:
> > > On Fri, Feb 16, 2018 at 01:40:37PM +0100, Rafael J. Wysocki wrote:
> > > > On Friday, February 16, 2018 12:39:00 AM CET Bjorn Helgaas wrote:
> > > > > On Thu, Feb 15, 2018 at 10:57:25PM +0100, Rafael J. Wysocki wrote:
> > > > > > On Wednesday, February 14, 2018 9:16:53 PM CET Bjorn Helgaas wrote:
> > > > > > > I don't know how this runtime PM works, but maybe Rafael can help
> > > > > > > us out.
> 
> This has nothing to do with runtime PM AFAICS.
> 
> The device seems to be in D3hot on boot, is that correct?
> The PCI core assumes that unbound devices remain in D0
> (see comments in pci_pm_runtime_resume() / pci_pm_runtime_suspend()).

Which may not be the case when user space fiddles with the runtime PM
sysfs for them IIRC.
Bjorn Helgaas Feb. 20, 2018, 7 p.m. | #14
[+cc Huang]

On Tue, Feb 20, 2018 at 02:54:33AM +0100, Lukas Wunner wrote:
> On Mon, Feb 19, 2018 at 12:21:56PM +0100, Rafael J. Wysocki wrote:
> > On Friday, February 16, 2018 9:34:34 PM CET Bjorn Helgaas wrote:
> > > On Fri, Feb 16, 2018 at 01:40:37PM +0100, Rafael J. Wysocki wrote:
> > > > On Friday, February 16, 2018 12:39:00 AM CET Bjorn Helgaas wrote:
> > > > > On Thu, Feb 15, 2018 at 10:57:25PM +0100, Rafael J. Wysocki wrote:
> > > > > > On Wednesday, February 14, 2018 9:16:53 PM CET Bjorn Helgaas wrote:
> > > > > > > I don't know how this runtime PM works, but maybe Rafael can help
> > > > > > > us out.
> 
> This has nothing to do with runtime PM AFAICS.
> 
> The device seems to be in D3hot on boot, is that correct?

I actually don't know for sure what device caused the issue or what
state it was in.  George, if you could clarify that, it would be
helpful.

My assumption (possibly incorrect) is that the issue happens when we
read config space of an endpoint behind a PCIe bridge that is in D3.
The PCIe portdrv binds to the bridge before the pci_sysfs_init()
late_initcall happens, and it enables runtime PM on the bridge, so
it's possible the bridge is in D3 before pci_sysfs_init() happens.

> The PCI core assumes that unbound devices remain in D0
> (see comments in pci_pm_runtime_resume() / pci_pm_runtime_suspend()).

That comment was added by 967577b06241 ("PCI/PM: Keep runtime PM
enabled for unbound PCI devices"), apparently because "some devices do
not work again after being put into D3".  That commit mentions
https://bugzilla.kernel.org/show_bug.cgi?id=48201 but I don't think
it's completely convincing about the requirement for keeping things in
D0.

Even if it's true that some devices don't work after being in D3, that
doesn't seem like sufficient reason to keep *all* devices on all
systems in D0.  We would normally use quirks to address device issues
like this.

> > > > > > I'm not sure what the question is to be honest.
> > > > > 
> > > > > My questions are basically "What does the PCI core need to do to make
> > > > > sure a device is in D0 before it operates on it?  And where do we need
> > > > > to do that?"
> 
> When scanning the bus and discovering the device is not in D0,
> call pci_power_up().  This could probably go into pci_init_pm().
> Once a driver binds to it, it may choose to runtime suspend it to
> D3hot again.

s/pci_init_pm/pci_pm_init/

That's an interesting idea.  I hadn't thought of the case where a
device is not in D0 when we enumerate it.  We *do* make sure bridges
are in D0 before scanning behind them (pci_scan_bridge_extend() calls
pm_runtime_get_sync()).

Since enumeration should only perform config accesses, and devices
must respond to config accesses even when in D3hot, I guess I'm not
convinced that we need to power up everything (other than bridges)
during enumeration.

Bjorn
George Cherian Feb. 21, 2018, 9:28 a.m. | #15
Hi Bjorn,


On 02/21/2018 12:30 AM, Bjorn Helgaas wrote:
> [+cc Huang]
> 
> On Tue, Feb 20, 2018 at 02:54:33AM +0100, Lukas Wunner wrote:
>> On Mon, Feb 19, 2018 at 12:21:56PM +0100, Rafael J. Wysocki wrote:
>>> On Friday, February 16, 2018 9:34:34 PM CET Bjorn Helgaas wrote:
>>>> On Fri, Feb 16, 2018 at 01:40:37PM +0100, Rafael J. Wysocki wrote:
>>>>> On Friday, February 16, 2018 12:39:00 AM CET Bjorn Helgaas wrote:
>>>>>> On Thu, Feb 15, 2018 at 10:57:25PM +0100, Rafael J. Wysocki wrote:
>>>>>>> On Wednesday, February 14, 2018 9:16:53 PM CET Bjorn Helgaas wrote:
>>>>>>>> I don't know how this runtime PM works, but maybe Rafael can help
>>>>>>>> us out.
>>
>> This has nothing to do with runtime PM AFAICS.
>>
>> The device seems to be in D3hot on boot, is that correct?
> 
> I actually don't know for sure what device caused the issue or what
> state it was in.  George, if you could clarify that, it would be
> helpful.
> 
I will explain the setup used
To the Cavium ThunderX RC the following PLX device is connected.
PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 
GT/s) Switch
There is no device connected downstream to the PLX switch.

AFAIU the pcie_port driver probes PLX and enters autosuspend after 100ms 
since pci_bridge_d3_possible() returns true.

And later pci_sysfs_init() ends up doing a config access of PLX which 
fails with a "synchronous external abort"

> My assumption (possibly incorrect) is that the issue happens when we
> read config space of an endpoint behind a PCIe bridge that is in D3.
> The PCIe portdrv binds to the bridge before the pci_sysfs_init()
> late_initcall happens, and it enables runtime PM on the bridge, so
> it's possible the bridge is in D3 before pci_sysfs_init() happens.
> 
>> The PCI core assumes that unbound devices remain in D0
>> (see comments in pci_pm_runtime_resume() / pci_pm_runtime_suspend()).
> 
> That comment was added by 967577b06241 ("PCI/PM: Keep runtime PM
> enabled for unbound PCI devices"), apparently because "some devices do
> not work again after being put into D3".  That commit mentions
> https://bugzilla.kernel.org/show_bug.cgi?id=48201 but I don't think
> it's completely convincing about the requirement for keeping things in
> D0.
> 
> Even if it's true that some devices don't work after being in D3, that
> doesn't seem like sufficient reason to keep *all* devices on all
> systems in D0.  We would normally use quirks to address device issues
> like this.
> 
>>>>>>> I'm not sure what the question is to be honest.
>>>>>>
>>>>>> My questions are basically "What does the PCI core need to do to make
>>>>>> sure a device is in D0 before it operates on it?  And where do we need
>>>>>> to do that?"
>>
>> When scanning the bus and discovering the device is not in D0,
>> call pci_power_up().  This could probably go into pci_init_pm().
>> Once a driver binds to it, it may choose to runtime suspend it to
>> D3hot again.
> 
> s/pci_init_pm/pci_pm_init/
> 
> That's an interesting idea.  I hadn't thought of the case where a
> device is not in D0 when we enumerate it.  We *do* make sure bridges
> are in D0 before scanning behind them (pci_scan_bridge_extend() calls
> pm_runtime_get_sync()).
> 
> Since enumeration should only perform config accesses, and devices
> must respond to config accesses even when in D3hot, I guess I'm not
> convinced that we need to power up everything (other than bridges)
> during enumeration.
> 
> Bjorn
>
Lukas Wunner Feb. 21, 2018, 9:54 a.m. | #16
On Wed, Feb 21, 2018 at 02:58:13PM +0530, George Cherian wrote:
> I will explain the setup used
> To the Cavium ThunderX RC the following PLX device is connected.
> PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s)
> Switch
> There is no device connected downstream to the PLX switch.
> 
> AFAIU the pcie_port driver probes PLX and enters autosuspend after 100ms
> since pci_bridge_d3_possible() returns true.
> 
> And later pci_sysfs_init() ends up doing a config access of PLX which fails
> with a "synchronous external abort"

Then you're missing a pci_config_pm_runtime_get() in pci_sysfs_init() or
further down in the call stack, rather than a quirk which just papers
over the issue.

Thanks,

Lukas
George Cherian Feb. 21, 2018, 10:55 a.m. | #17
Hi Lukas,

On 02/21/2018 03:24 PM, Lukas Wunner wrote:
> On Wed, Feb 21, 2018 at 02:58:13PM +0530, George Cherian wrote:
>> I will explain the setup used
>> To the Cavium ThunderX RC the following PLX device is connected.
>> PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s)
>> Switch
>> There is no device connected downstream to the PLX switch.
>>
>> AFAIU the pcie_port driver probes PLX and enters autosuspend after 100ms
>> since pci_bridge_d3_possible() returns true.
>>
>> And later pci_sysfs_init() ends up doing a config access of PLX which fails
>> with a "synchronous external abort"
> 
> Then you're missing a pci_config_pm_runtime_get() in pci_sysfs_init() or
> further down in the call stack, rather than a quirk which just papers
> over the issue.

I have found another configuration where this fails.
Following is the configuration
1) Connected a PCIe Intel i40 card under the root port.
2) unbind the i40 driver and bind with vfio-pci driver.
3) Run lspci in a loop. "lspci -s xx:xx.xx -vvv"

I get the same synchronous external abort.
In this case the vfio-pci driver probe it moves the device (i40) to
D3hot provided disable_idle_d3 is not set. lspci tries to do
the config_access which fails with synchronous external abort when
the root port transitions to D3hot.
> 
> Thanks,
> 
> Lukas
> 
Regards,
-George
Bjorn Helgaas Feb. 21, 2018, 11:20 p.m. | #18
On Wed, Feb 21, 2018 at 04:25:08PM +0530, George Cherian wrote:
> On 02/21/2018 03:24 PM, Lukas Wunner wrote:
> > On Wed, Feb 21, 2018 at 02:58:13PM +0530, George Cherian wrote:
> > > I will explain the setup used
> > > To the Cavium ThunderX RC the following PLX device is connected.
> > > PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s)
> > > Switch
> > > There is no device connected downstream to the PLX switch.
> > > 
> > > AFAIU the pcie_port driver probes PLX and enters autosuspend after 100ms
> > > since pci_bridge_d3_possible() returns true.
> > > 
> > > And later pci_sysfs_init() ends up doing a config access of PLX which fails
> > > with a "synchronous external abort"

Thanks for the details!

This one *should* be fixed by this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=pci/virtualization&id=bf6c089ee2ac67eb22c0ff0ac9cc7f9ccd619d90

Any chance you could try that out?

> > Then you're missing a pci_config_pm_runtime_get() in pci_sysfs_init() or
> > further down in the call stack, rather than a quirk which just papers
> > over the issue.
> 
> I have found another configuration where this fails.
> Following is the configuration
> 1) Connected a PCIe Intel i40 card under the root port.
> 2) unbind the i40 driver and bind with vfio-pci driver.
> 3) Run lspci in a loop. "lspci -s xx:xx.xx -vvv"
> 
> I get the same synchronous external abort.
> In this case the vfio-pci driver probe it moves the device (i40) to
> D3hot provided disable_idle_d3 is not set. lspci tries to do
> the config_access which fails with synchronous external abort when
> the root port transitions to D3hot.

This one sounds like we're missing something in this path:

  pci_read_config
    pci_config_pm_runtime_get
      if (parent)
        pm_runtime_get_sync
          __pm_runtime_resume(dev, RPM_GET_PUT)
            rpm_resume

It *looks* like rpm_resume() should resume parent devices, i.e., the
root port, but I don't know that code at all.  Maybe Rafael or Lukas
could confirm that?

pci_config_pm_runtime_get() knows that config space is always
accessible unless the device is in D3cold, so if the target device is
in D3hot, it will leave it there.  I assume that if/when rpm_resume()
resumes the parent bridges, it will resume them all the way to D0.

I'm *really* glad you're finding these issues, because on most
platforms we would just silently read invalid data (all ones) and the
caller would have no idea what's going wrong.

Bjorn
Lukas Wunner Feb. 22, 2018, 8:35 a.m. | #19
On Wed, Feb 21, 2018 at 05:20:40PM -0600, Bjorn Helgaas wrote:
> On Wed, Feb 21, 2018 at 04:25:08PM +0530, George Cherian wrote:
> > I have found another configuration where this fails.
> > Following is the configuration
> > 1) Connected a PCIe Intel i40 card under the root port.
> > 2) unbind the i40 driver and bind with vfio-pci driver.
> > 3) Run lspci in a loop. "lspci -s xx:xx.xx -vvv"
> > 
> > I get the same synchronous external abort.
> > In this case the vfio-pci driver probe it moves the device (i40) to
> > D3hot provided disable_idle_d3 is not set. lspci tries to do
> > the config_access which fails with synchronous external abort when
> > the root port transitions to D3hot.
> 
> It *looks* like rpm_resume() should resume parent devices, i.e., the
> root port, but I don't know that code at all.

It does that unless the ignore_children flag is set, which only few
drivers do.

Sounds to me like another missing pci_config_pm_runtime_get() somewhere
but it's hard to tell without a stacktrace.

Thanks,

Lukas
Rafael J. Wysocki Feb. 22, 2018, 10:19 a.m. | #20
On Thursday, February 22, 2018 9:35:43 AM CET Lukas Wunner wrote:
> On Wed, Feb 21, 2018 at 05:20:40PM -0600, Bjorn Helgaas wrote:
> > On Wed, Feb 21, 2018 at 04:25:08PM +0530, George Cherian wrote:
> > > I have found another configuration where this fails.
> > > Following is the configuration
> > > 1) Connected a PCIe Intel i40 card under the root port.
> > > 2) unbind the i40 driver and bind with vfio-pci driver.
> > > 3) Run lspci in a loop. "lspci -s xx:xx.xx -vvv"
> > > 
> > > I get the same synchronous external abort.
> > > In this case the vfio-pci driver probe it moves the device (i40) to
> > > D3hot provided disable_idle_d3 is not set. lspci tries to do
> > > the config_access which fails with synchronous external abort when
> > > the root port transitions to D3hot.
> > 
> > It *looks* like rpm_resume() should resume parent devices, i.e., the
> > root port, but I don't know that code at all.
> 
> It does that unless the ignore_children flag is set, which only few
> drivers do.

Right.

And the PCIe port driver doesn't do that in particular, so PCIe ports
should be resumed by the resume of devices below them.

Thanks,
Rafael
George Cherian Feb. 22, 2018, 1:13 p.m. | #21
Hi Bjorn,

On 02/22/2018 04:50 AM, Bjorn Helgaas wrote:
> On Wed, Feb 21, 2018 at 04:25:08PM +0530, George Cherian wrote:
>> On 02/21/2018 03:24 PM, Lukas Wunner wrote:
>>> On Wed, Feb 21, 2018 at 02:58:13PM +0530, George Cherian wrote:
>>>> I will explain the setup used
>>>> To the Cavium ThunderX RC the following PLX device is connected.
>>>> PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s)
>>>> Switch
>>>> There is no device connected downstream to the PLX switch.
>>>>
>>>> AFAIU the pcie_port driver probes PLX and enters autosuspend after 100ms
>>>> since pci_bridge_d3_possible() returns true.
>>>>
>>>> And later pci_sysfs_init() ends up doing a config access of PLX which fails
>>>> with a "synchronous external abort"
> 
> Thanks for the details!
> 
> This one *should* be fixed by this patch:
> https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=pci/virtualization&id=bf6c089ee2ac67eb22c0ff0ac9cc7f9ccd619d90
> 
> Any chance you could try that out?

I did try your patch and it works fine on the above failing setup.
> 
>>> Then you're missing a pci_config_pm_runtime_get() in pci_sysfs_init() or
>>> further down in the call stack, rather than a quirk which just papers
>>> over the issue.
>>
>> I have found another configuration where this fails.
>> Following is the configuration
>> 1) Connected a PCIe Intel i40 card under the root port.
>> 2) unbind the i40 driver and bind with vfio-pci driver.
>> 3) Run lspci in a loop. "lspci -s xx:xx.xx -vvv"
>>
>> I get the same synchronous external abort.
>> In this case the vfio-pci driver probe it moves the device (i40) to
>> D3hot provided disable_idle_d3 is not set. lspci tries to do
>> the config_access which fails with synchronous external abort when
>> the root port transitions to D3hot.
> 
> This one sounds like we're missing something in this path:
> 
>    pci_read_config
>      pci_config_pm_runtime_get
>        if (parent)
>          pm_runtime_get_sync
>            __pm_runtime_resume(dev, RPM_GET_PUT)
>              rpm_resume
> 
> It *looks* like rpm_resume() should resume parent devices, i.e., the
> root port, but I don't know that code at all.  Maybe Rafael or Lukas
> could confirm that?
> 
> pci_config_pm_runtime_get() knows that config space is always
> accessible unless the device is in D3cold, so if the target device is
> in D3hot, it will leave it there.  I assume that if/when rpm_resume()
> resumes the parent bridges, it will resume them all the way to D0.

the stack trace for this issue looks like this
[<ffff00000851bbfc>] pci_generic_config_read+0x5c/0xf0
[<ffff00000851c6e4>] pci_user_read_config_dword+0x84/0x110
[<ffff00000851cda8>] pci_vpd_read+0x100/0x208
[<ffff00000851bee8>] pci_read_vpd+0x50/0x68
[<ffff00000852d6c0>] read_vpd_attr+0x60/0x80
[<ffff00000833b224>] sysfs_kf_bin_read+0x6c/0xa8
[<ffff00000833a674>] kernfs_fop_read+0xa4/0x1c8
[<ffff0000082a6238>] __vfs_read+0x60/0x170
[<ffff0000082a63d4>] vfs_read+0x8c/0x148
[<ffff0000082a6c64>] SyS_pread64+0xbc/0xd8

I have tried adding pci_config_pm_runtime_get/put pair inside 
pci_vpd_read(), which I guess might be needed, in case the device goes
to D3cold. But having said that it didnt fix the problem in our platform.

> 
> I'm *really* glad you're finding these issues, because on most
> platforms we would just silently read invalid data (all ones) and the
> caller would have no idea what's going wrong.
> 
> Bjorn
> 

Regards,
-George
Bjorn Helgaas Feb. 22, 2018, 3:09 p.m. | #22
On Thu, Feb 22, 2018 at 06:43:34PM +0530, George Cherian wrote:
> On 02/22/2018 04:50 AM, Bjorn Helgaas wrote:
> > On Wed, Feb 21, 2018 at 04:25:08PM +0530, George Cherian wrote:
> > > On 02/21/2018 03:24 PM, Lukas Wunner wrote:
> > > > On Wed, Feb 21, 2018 at 02:58:13PM +0530, George Cherian wrote:
> > > > > I will explain the setup used
> > > > > To the Cavium ThunderX RC the following PLX device is connected.
> > > > > PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express
> > > > > Gen 3 (8.0 GT/s) Switch
> > > > > There is no device connected downstream to the PLX switch.
> > > > > 
> > > > > AFAIU the pcie_port driver probes PLX and enters autosuspend
> > > > > after 100ms since pci_bridge_d3_possible() returns true.
> > > > > 
> > > > > And later pci_sysfs_init() ends up doing a config access of
> > > > > PLX which fails with a "synchronous external abort"

> > 
> > Thanks for the details!
> > 
> > This one *should* be fixed by this patch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=pci/virtualization&id=bf6c089ee2ac67eb22c0ff0ac9cc7f9ccd619d90
> > 
> > Any chance you could try that out?
> 
> I did try your patch and it works fine on the above failing setup.

Thanks for testing it!

> > > I have found another configuration where this fails.
> > > Following is the configuration
> > > 1) Connected a PCIe Intel i40 card under the root port.
> > > 2) unbind the i40 driver and bind with vfio-pci driver.
> > > 3) Run lspci in a loop. "lspci -s xx:xx.xx -vvv"
> > > 
> > > I get the same synchronous external abort.
> > > In this case the vfio-pci driver probe it moves the device (i40) to
> > > D3hot provided disable_idle_d3 is not set. lspci tries to do
> > > the config_access which fails with synchronous external abort when
> > > the root port transitions to D3hot.

<snip>
> the stack trace for this issue looks like this
> [<ffff00000851bbfc>] pci_generic_config_read+0x5c/0xf0
> [<ffff00000851c6e4>] pci_user_read_config_dword+0x84/0x110
> [<ffff00000851cda8>] pci_vpd_read+0x100/0x208
> [<ffff00000851bee8>] pci_read_vpd+0x50/0x68
> [<ffff00000852d6c0>] read_vpd_attr+0x60/0x80
> [<ffff00000833b224>] sysfs_kf_bin_read+0x6c/0xa8
> [<ffff00000833a674>] kernfs_fop_read+0xa4/0x1c8
> [<ffff0000082a6238>] __vfs_read+0x60/0x170
> [<ffff0000082a63d4>] vfs_read+0x8c/0x148
> [<ffff0000082a6c64>] SyS_pread64+0xbc/0xd8
> 
> I have tried adding pci_config_pm_runtime_get/put pair inside
> pci_vpd_read(), which I guess might be needed, in case the device goes
> to D3cold. But having said that it didnt fix the problem in our platform.

Your original patch avoids this problem by setting PCI_DEV_FLAGS_NO_D3
on the root port, so it seems like this must be somehow related to the
root port's state.

I assume this VPD read is on the i40 device, right?  Since you're
still seeing the problem even after calling
pci_config_pm_runtime_get(), I assume the root port is still not in
D0.  Can you add a little more instrumentation to read PCI_PM_CTRL and
PCI_PM_PPB_EXTENSIONS for the root port and PCI_PM_CTRL for the i40
device right after you call pci_config_pm_runtime_get()?

I don't see anything obviously different between the pci_read_config()
path and the pci_vpd_read() path except for the
pci_config_pm_runtime_get() call that you've already added.  I guess
you could try using setpci instead of lspci to see if the failure only
happens in the pci_vpd_read() path.  I assume that will be the case
because lspci probably does config reads before it does the VPD read,
and those initial config reads seemed to work OK.

The VPD path does do config writes in addition to config reads.  Maybe
there's something special about writes, although I don't know what
that would be.  You can tell I'm running out of ideas here :)

Bjorn

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 10684b1..2eb08a8 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -1154,6 +1154,18 @@  static void quirk_ide_samemode(struct pci_dev *pdev)
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_10, quirk_ide_samemode);
 
 /*
+ * Cavium's Thunder-X2 Processors root port doesnot handle cfg/ecfg access to
+ * downstream properly if root port is put into D3
+ */
+
+static void quirk_no_rootport_d3(struct pci_dev *pdev)
+{
+	pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_BROADCOM, 0x9084, quirk_no_rootport_d3);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CAVIUM, 0xaf84, quirk_no_rootport_d3);
+
+/*
  * Some ATA devices break if put into D3
  */