Message ID: 20200722134702.899157-1-andrea.righi@canonical.com
Series: [E] UBUNTU: SAUCE: xen-netfront: fix potential deadlock in xennet_remove()
Forgot to mention the bug link.

BugLink: https://bugs.launchpad.net/bugs/1888510

Sorry about that.

-Andrea
On 22/07/2020 14:47, Andrea Righi wrote:
> [Impact]
>
> During our AWS testing we were experiencing deadlocks on hibernate
> across all Xen instance types. The trace was showing that the system
> was stuck in xennet_remove():
>
> [  358.109087] Freezing of tasks failed after 20.006 seconds (1 tasks refusing to freeze, wq_busy=0):
> [  358.115102] modprobe        D    0  4892   4833 0x00004004
> [  358.115104] Call Trace:
> [  358.115112]  __schedule+0x2a8/0x670
> [  358.115115]  schedule+0x33/0xa0
> [  358.115118]  xennet_remove+0x1f0/0x230 [xen_netfront]
> [  358.115121]  ? wait_woken+0x80/0x80
> [  358.115124]  xenbus_dev_remove+0x51/0xa0
> [  358.115126]  device_release_driver_internal+0xe0/0x1b0
> [  358.115127]  driver_detach+0x49/0x90
> [  358.115129]  bus_remove_driver+0x59/0xd0
> [  358.115131]  driver_unregister+0x2c/0x40
> [  358.115132]  xenbus_unregister_driver+0x12/0x20
> [  358.115134]  netif_exit+0x10/0x7aa [xen_netfront]
> [  358.115137]  __x64_sys_delete_module+0x146/0x290
> [  358.115140]  do_syscall_64+0x5a/0x130
> [  358.115142]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> This prevented hibernation from completing.
>
> The reason for this problem is a race condition in xennet_remove():
> the system reads the current state of the bus, requests a state change
> to "Closing", and waits for the state to change to "Closing". However,
> if the state becomes "Closed" between reading the state and requesting
> the state change, we are stuck forever, because the state will never
> change from "Closed" back to "Closing".
>
> [Test case]
>
> Create any Xen-based instance in AWS, hibernate/resume multiple times.
> Sometimes the system gets stuck (hung task timeout) and hibernation
> fails.
>
> [Fix]
>
> Prevent the deadlock by changing the wait condition to also check for
> state == Closed.
>
> This is also an upstream bug; I posted a patch to the LKML and I'm
> waiting for review / feedback:
> https://lore.kernel.org/lkml/20200722065211.GA841369@xps-13/T/#u
>
> This patch is not applied upstream, but both our tests and the tests
> performed by Amazon show positive results after applying this fix (the
> deadlock doesn't seem to happen anymore).
>
> [Regression potential]
>
> Minimal: this change affects only Xen, and more specifically only the
> xen-netfront driver.

Adding the timeout is sensible. It has good test results. Add in the
bug link and we're good to go.

Acked-by: Colin Ian King <colin.king@canonical.com>
On 22.07.20 15:47, Andrea Righi wrote:
> [Impact]
>
> [...]

Beside the BugLink, which we can add later, there is a different target
source in the bug report vs. the submission: the bug report is against
linux-aws, while the submission would be against the main kernel.

Also note that Eoan is EOL, so anything submitted for 5.3 needs to be
handled with extra care (this is probably less for Andrea than for
whoever is going to apply).

-Stefan
On Thu, Jul 23, 2020 at 09:15:17AM +0200, Stefan Bader wrote:
> On 22.07.20 15:47, Andrea Righi wrote:
> > [Impact]
> >
> > [...]
>
> Beside the BugLink which we can add later, there is a different target
> source in the bug report vs. the submission. The bug report is against
> linux-aws, the submission would be against the main kernel.
>
> Also note that Eoan is EOL, so anything submitted for 5.3 needs to be
> handled with extra care (this is probably less for Andrea than for
> whoever is going to apply).
>
> -Stefan

NACK-ing this patch. I'll send a new version targeting the proper
kernels and adding the BugLink.

Thanks!
-Andrea