[SRU,E/F,0/1] xen-netfront: fix potential deadlock in xennet_remove()

Message ID 20200722134702.899157-1-andrea.righi@canonical.com
Series [E] UBUNTU: SAUCE: xen-netfront: fix potential deadlock in xennet_remove()

Message

Andrea Righi July 22, 2020, 1:47 p.m. UTC
[Impact]

During our AWS testing we were experiencing deadlocks on hibernate
across all Xen instance types. The trace was showing that the system was
stuck in xennet_remove():

[ 358.109087] Freezing of tasks failed after 20.006 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 358.115102] modprobe D 0 4892 4833 0x00004004
[ 358.115104] Call Trace:
[ 358.115112] __schedule+0x2a8/0x670
[ 358.115115] schedule+0x33/0xa0
[ 358.115118] xennet_remove+0x1f0/0x230 [xen_netfront]
[ 358.115121] ? wait_woken+0x80/0x80
[ 358.115124] xenbus_dev_remove+0x51/0xa0
[ 358.115126] device_release_driver_internal+0xe0/0x1b0
[ 358.115127] driver_detach+0x49/0x90
[ 358.115129] bus_remove_driver+0x59/0xd0
[ 358.115131] driver_unregister+0x2c/0x40
[ 358.115132] xenbus_unregister_driver+0x12/0x20
[ 358.115134] netif_exit+0x10/0x7aa [xen_netfront]
[ 358.115137] __x64_sys_delete_module+0x146/0x290
[ 358.115140] do_syscall_64+0x5a/0x130
[ 358.115142] entry_SYSCALL_64_after_hwframe+0x44/0xa9

This prevented hibernation from completing.

The problem is a race condition in xennet_remove(): the code reads the
current state of the bus, requests a state change to "Closing", and
then waits for the state to become "Closing". However, if the state
switches to "Closed" between the initial read and the state-change
request, the wait blocks forever, because the state will never go from
"Closed" back to "Closing".
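
For illustration, a minimal sketch of the racy close sequence
(simplified from the xen-netfront remove path; identifiers such as
module_wq follow the upstream driver, but details are elided and this
is not the exact code):

	/* Simplified sketch of the racy close sequence; not the
	 * exact driver code.
	 */
	static int xennet_remove(struct xenbus_device *dev)
	{
		struct netfront_info *info = dev_get_drvdata(&dev->dev);

		if (xenbus_read_driver_state(dev->otherend) !=
		    XenbusStateClosed) {
			xenbus_switch_state(dev, XenbusStateClosing);
			/*
			 * Race window: if the backend jumps straight to
			 * Closed between the read above and this wait,
			 * the condition never becomes true and the task
			 * sleeps forever.
			 */
			wait_event(module_wq,
				   xenbus_read_driver_state(dev->otherend) ==
				   XenbusStateClosing);
			/* ... then switch to Closed and wait again ... */
		}

		xennet_disconnect_backend(info);
		/* ... */
		return 0;
	}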

[Test case]

Create any Xen-based instance in AWS and hibernate/resume multiple
times. Sometimes the system gets stuck (hung task timeout) and
hibernation fails.

[Fix]

Prevent the deadlock by changing the wait condition to also check for
state == Closed.
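
Conceptually, the fixed wait becomes something like this (a sketch
based on the description above; the actual patch may differ in
detail):

	wait_event(module_wq,
		   xenbus_read_driver_state(dev->otherend) ==
		   XenbusStateClosing ||
		   xenbus_read_driver_state(dev->otherend) ==
		   XenbusStateClosed);

With this condition, if the backend has already reached "Closed" when
the frontend starts waiting, the wait completes immediately instead of
blocking forever.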

This is also an upstream bug; I posted a patch to the LKML and I'm
waiting for review / feedback:
https://lore.kernel.org/lkml/20200722065211.GA841369@xps-13/T/#u

This patch is not applied upstream yet, but both our tests and the
tests performed by Amazon show positive results with this fix applied
(the deadlock no longer seems to occur).

[Regression potential]

Minimal: this change affects only Xen, and more specifically only the
xen-netfront driver.

Comments

Andrea Righi July 22, 2020, 1:51 p.m. UTC | #1
Forgot to mention the bug link.

BugLink: https://bugs.launchpad.net/bugs/1888510

Sorry about that.

-Andrea
Colin Ian King July 22, 2020, 1:55 p.m. UTC | #2
On 22/07/2020 14:47, Andrea Righi wrote:
> [...]

Adding the timeout is sensible, and the patch has good test results.
Add in the bug link and we're good to go.

Acked-by: Colin Ian King <colin.king@canonical.com>
Stefan Bader July 23, 2020, 7:15 a.m. UTC | #3
On 22.07.20 15:47, Andrea Righi wrote:
> [...]

Besides the BugLink, which we can add later, the target source differs
between the bug report and the submission: the bug report is against
linux-aws, while the submission is against the main kernel.

Also note that Eoan is EOL, so anything submitted for 5.3 needs to be
handled with extra care (this is less a concern for Andrea than for
whoever is going to apply it).

-Stefan
Andrea Righi July 23, 2020, 1:06 p.m. UTC | #4
On Thu, Jul 23, 2020 at 09:15:17AM +0200, Stefan Bader wrote:
> On 22.07.20 15:47, Andrea Righi wrote:
> > [...]
>
> Besides the BugLink, which we can add later, the target source differs
> between the bug report and the submission: the bug report is against
> linux-aws, while the submission is against the main kernel.
> 
> Also note that Eoan is EOL, so anything submitted for 5.3 needs to be
> handled with extra care (this is less a concern for Andrea than for
> whoever is going to apply it).
> 
> -Stefan
> 

NACK-ing this patch. I'll send a new version targeting the proper
kernels and adding the BugLink.

Thanks!
-Andrea