[focal:linux-azure,bionic:linux-azure-4.15,0/5] Fix kdump Over Network

Message ID 20201007211640.60573-1-kelsey.skunberg@canonical.com

Message

Kelsey Skunberg Oct. 7, 2020, 9:16 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1883261

[Impact]

Microsoft would like to request two kdump-related fixes in all releases
supported on Azure. The two commits are:

c81992e7f4aa1 ("PCI: hv: Retry PCI bus D0 entry on invalid device
state")
83cc3508ffaa6 ("PCI: hv: Fix the PCI HyperV probe failure path
to release resource properly")

These are in the virtual PCI driver for Hyper-V. The customer-visible
symptom is that the network is not functional in the kdump kernel, so
the dump file must be stored on the local disk and cannot be written
over the network.

The problem only occurs when Accelerated Networking is enabled. It’s a
relatively obscure scenario, which is why the problem has not surfaced
before now. But we have an important customer who wants the
“dump-file-over-the-network” functionality to work.

For bionic/linux-azure-4.15, the following additional patch needs to be
backported first to allow the requested patches to apply cleanly:

a8e37506e79a ("PCI: hv: Reorganize the code in preparation of
hibernation")

[Test Case]

- Apply requested patches and boot into updated kernel
- Verify Accelerated Networking is enabled
- Set up kdump
- Configure kdump to use SSH
- Test the crash dump mechanism and verify the kernel crash dump appears
  on the selected remote server

Further details for setting up kdump through testing can be found here:
https://ubuntu.com/server/docs/kernel-crash-dump
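
For the "Verify Accelerated Networking is enabled" step, one possible
in-guest check is sketched below in Python. It assumes that the
Accelerated Networking VF appears as a network interface bound to a
Mellanox mlx4/mlx5 PCI driver, which is not stated in this thread. The
function name accelerated_nics and the driver prefixes are illustrative
only and are not part of the patches or of kdump-tools.

```python
import os


def accelerated_nics(sysfs_net="/sys/class/net",
                     vf_driver_prefixes=("mlx4", "mlx5")):
    """Return interfaces whose PCI driver name starts with an assumed VF prefix."""
    found = []
    for name in sorted(os.listdir(sysfs_net)):
        # On Linux, <iface>/device/driver is a symlink to the bound driver.
        driver_link = os.path.join(sysfs_net, name, "device", "driver")
        if not os.path.islink(driver_link):
            continue  # purely virtual interfaces such as lo have no driver link
        driver = os.path.basename(os.readlink(driver_link))
        if driver.startswith(vf_driver_prefixes):
            found.append(name)
    return found
```

Calling accelerated_nics() with no arguments inspects the running
system. An empty list would suggest that no VF is present, in which case
the kdump-over-network failure described above should not reproduce.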

[Regression Potential]

Patches are only targeted to azure kernels.

Patches are designed to release allocated resources remaining after
error cases in hv_pci_probe() or PCI devices not being shut down
properly. If those resources are still not correctly released, then
entering D0 state in kdump kernel could continue to fail.
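
The release-then-retry idea can be modelled in a few lines of Python.
This is a toy model and not the driver code. The class HostBus, the
function probe, the -EPROTO error value and the single retry are all
assumptions made for illustration; the actual logic is in the two
commits listed under [Impact].

```python
import errno


class HostBus:
    """Toy stand-in for the host side of a Hyper-V virtual PCI bus."""

    def __init__(self, stale_resources):
        # True models a device the crashed kernel never shut down cleanly.
        self.stale_resources = stale_resources

    def enter_d0(self):
        # The modelled host rejects D0 entry while old resources are held.
        return -errno.EPROTO if self.stale_resources else 0

    def bus_exit(self):
        # Ask the modelled host to release what the previous kernel left.
        self.stale_resources = False
        return 0


def probe(bus):
    """Try D0 entry; on -EPROTO release host resources and retry once."""
    retried = False
    while True:
        ret = bus.enter_d0()
        if ret == -errno.EPROTO and not retried:
            retried = True
            if bus.bus_exit() == 0:
                continue  # resources released, try D0 entry again
        return ret
```

In this model probe(HostBus(stale_resources=True)) succeeds on the
second attempt. If bus_exit() did not actually free the resources, the
second attempt would fail as well, which is the residual risk described
above.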

A regression could show up as a fault while freeing resources, or as a
continued failure to enter D0 state in the kdump kernel even after all
resources have been released.

Build & boot tested. Verified kdump works as intended over SSH after
patches are applied.

Both 5.4 and 4.15 test kernels were sent to Microsoft. Both kernels
were signed off on and verified to resolve the problem.


Changes for Bionic/linux-azure-4.15:


Dexuan Cui (1):
  PCI: hv: Reorganize the code in preparation of hibernation

Wei Hu (2):
  PCI: hv: Fix the PCI HyperV probe failure path to release resource
    properly
  PCI: hv: Retry PCI bus D0 entry on invalid device state

 drivers/pci/host/pci-hyperv.c | 101 +++++++++++++++++++++++++++-------
 1 file changed, 81 insertions(+), 20 deletions(-)


Changes for Focal/linux-azure:

Wei Hu (2):
  PCI: hv: Fix the PCI HyperV probe failure path to release resource
    properly
  PCI: hv: Retry PCI bus D0 entry on invalid device state

 drivers/pci/controller/pci-hyperv.c | 60 ++++++++++++++++++++++++++---
 1 file changed, 54 insertions(+), 6 deletions(-)

--
2.25.1

Comments

Stefan Bader Oct. 8, 2020, 9:33 a.m. UTC | #1
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Colin Ian King Oct. 8, 2020, 11:28 a.m. UTC | #2

Thanks Kelsey; backports look good to me, good test case and results, I
think the regression potential vs benefit looks sane, so..

Acked-by: Colin Ian King <colin.king@canonical.com>
Ian May Oct. 9, 2020, 8:42 p.m. UTC | #3
Applied to Bionic/azure-4.15-next

Thanks!
Ian

Ian May Oct. 9, 2020, 9:15 p.m. UTC | #4
Applied to Focal/azure

Thanks,
Ian
