[SRU,j,l/linux-azure,0/1] Fix kernel panic when removing GPU

Message ID 20231206173636.163055-1-ioanna-maria.alifieraki@canonical.com
Series [SRU,lunar/linux-azure,1/1] Revert "PCI: hv: Use async probing to reduce boot time"

Message

Ioanna Alifieraki Dec. 6, 2023, 5:36 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/2042568

SRU Justification

[Description]

On an Azure VM with a Tesla GPU, it was noticed that removing the GPU
from the PCI bus crashes the VM. If the NVIDIA drivers are loaded, the
machine does not crash immediately. Instead, the removal hangs and the
machine crashes on reboot.

This is related to bug [1].
The bug reported in [1] concerns another driver, but the root cause is
the same. It is still under investigation whether this is a bug in PCI
itself or in how various drivers use PCI.

For this case, we have identified that reverting commit [2] prevents
the kernel crashes.

Azure has requested that this commit be reverted, at least for the
time being. The commit is not upstream, so it only needs to be
reverted from the Ubuntu kernels.

[Test Case]

On an Azure VM with a GPU:

# echo '1' > /sys/bus/pci/devices/0001:00:00.0/remove

where '0001:00:00.0' is the PCI address of the GPU.
The VM will crash.
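
The PCI address differs between VMs. The following is a minimal sketch for listing candidate addresses, assuming the GPU is an NVIDIA device (PCI vendor ID 0x10de) and that sysfs exposes a `vendor` attribute for each device. The function name `list_nvidia_pci` is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print the PCI address of every device whose
# vendor ID is 0x10de (NVIDIA). The optional first argument overrides
# the sysfs directory to scan, which is useful for dry runs.
list_nvidia_pci() {
    root="${1:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip entries that do not expose a readable vendor attribute.
        [ -r "$dev/vendor" ] || continue
        if [ "$(cat "$dev/vendor")" = "0x10de" ]; then
            # The directory name is the PCI address, e.g. 0001:00:00.0.
            basename "$dev"
        fi
    done
}
```

Running `list_nvidia_pci` with no arguments scans /sys/bus/pci/devices. Each address it prints can be substituted into the `remove` path shown above.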

[Where things could go wrong]

The commit to be reverted was included in a patchset to address bugs
https://bugs.launchpad.net/bugs/2023071 and 
https://bugs.launchpad.net/bugs/2023594

However, this commit only reduces boot time, so reverting it should
not introduce any regressions. The side effect will be an increase in
boot time.

[Other]

Only Ubuntu Azure kernels are affected:

- Jammy 5.15
- Lunar 6.2

Focal is also affected since it uses the 5.15 kernel.
This commit does not appear in the Mantic 6.5 kernel.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215515
[2] https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/jammy/commit/?id=75af0c10b370

Comments

Manuel Diewald Dec. 6, 2023, 6:12 p.m. UTC | #1
I think it is usually a good idea to include at least a one-liner
describing why the commit is reverted in the commit message.

Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Tim Gardner Dec. 7, 2023, 7:23 p.m. UTC | #2
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Tim Gardner Dec. 7, 2023, 7:35 p.m. UTC | #3
Applied to j/l linux-azure:master-next. Thanks.

-rtg