Message ID | 20231206173636.163055-1-ioanna-maria.alifieraki@canonical.com |
---|---|
Series | [SRU,lunar/linux-azure,1/1] Revert "PCI: hv: Use async probing to reduce boot time" |
On Wed, Dec 06, 2023 at 07:36:34PM +0200, Ioanna Alifieraki wrote:
> BugLink: https://bugs.launchpad.net/bugs/2042568
>
> SRU Justification
>
> [Description]
>
> On an Azure VM with a Tesla GPU, it was noticed that removing the GPU
> from the PCI bus crashes the VM. If the NVIDIA drivers are loaded, the
> machine does not crash immediately. Instead, the removal hangs and the
> machine crashes on reboot.
>
> This is related to bug [1].
> The bug reported in [1] concerns another driver, but the root cause is
> the same. It is still being investigated whether this is a bug in PCI
> or a bug in how various drivers use PCI.
>
> For this case we have identified that removing commit [2] prevents
> the kernel crashes.
>
> Azure has requested that this commit be reverted, at least for the time
> being. This commit is not upstream, so it only needs to be reverted
> from Ubuntu kernels.
>
> [Test Case]
>
> On an Azure VM with a GPU:
>
> # echo '1' > /sys/bus/pci/devices/0001:00:00.0/remove
>
> where '0001:00:00.0' is the PCI address of the GPU.
> The VM will crash.
>
> [Where things could go wrong]
>
> The commit to be reverted was included in a patchset to address bugs
> https://bugs.launchpad.net/bugs/2023071 and
> https://bugs.launchpad.net/bugs/2023594
>
> However, this commit only reduces boot time, and reverting it should
> not introduce any regressions. The side effect will be an increase in
> boot time.
>
> [Other]
>
> Only Ubuntu Azure kernels are affected:
>
> - Jammy 5.15
> - Lunar 6.2
>
> Focal is also affected, since it uses the 5.15 kernel.
> This commit does not appear in the Mantic 6.5 kernel.
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=215515
> [2] https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/jammy/commit/?id=75af0c10b370

I think it is usually a good idea to include at least a one-liner in the commit message describing why the commit is being reverted.

Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
On 12/6/23 10:36, Ioanna Alifieraki wrote:
> BugLink: https://bugs.launchpad.net/bugs/2042568
> [...]

Acked-by: Tim Gardner <tim.gardner@canonical.com>
On 12/6/23 10:36 AM, Ioanna Alifieraki wrote:
> BugLink: https://bugs.launchpad.net/bugs/2042568
> [...]

Applied to j/l linux-azure:master-next. Thanks.

-rtg
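For anyone reproducing the test case quoted in this thread, the GPU's PCI address will differ from one VM to another. The following is only a rough sketch of one way to find it, assuming the GPU is an NVIDIA device (PCI vendor ID 0x10de) and that sysfs exposes a `vendor` file per device. The function names `find_nvidia_pci_addrs` and `print_remove_cmds` are made up for illustration and are not part of the SRU. The script only prints the removal command; it does not run it, since on an affected kernel the removal is expected to crash the VM.

```shell
#!/bin/sh
# List PCI addresses of NVIDIA devices under a sysfs-style tree.
# $1: directory holding one subdirectory per PCI device
#     (defaults to /sys/bus/pci/devices).
find_nvidia_pci_addrs() {
    root="${1:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip entries without a readable vendor file.
        [ -r "$dev/vendor" ] || continue
        # 0x10de is the PCI vendor ID assigned to NVIDIA.
        if [ "$(cat "$dev/vendor")" = "0x10de" ]; then
            basename "$dev"
        fi
    done
}

# Print (do NOT execute) the removal command for each matching device.
print_remove_cmds() {
    root="${1:-/sys/bus/pci/devices}"
    find_nvidia_pci_addrs "$root" | while read -r addr; do
        echo "echo 1 > $root/$addr/remove"
    done
}
```

Calling `print_remove_cmds` with no argument on the VM lists the candidate commands; running one of them as root is then the actual test from the SRU justification.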