Message ID | 20201023203943.2528465-1-marcelo.cerri@canonical.com |
---|---|
State | New |
Headers | show |
Series | [focal:linux-azure,1/1] Drivers: hv: vmbus: hibernation: do not hang forever in vmbus_bus_resume() | expand |
On 23/10/2020 21:39, Marcelo Henrique Cerri wrote: > From: Dexuan Cui <decui@microsoft.com> > > BugLink: https://bugs.launchpad.net/bugs/1894895 > > After we Stop and later Start a VM that uses Accelerated Networking (NIC > SR-IOV), currently the VF vmbus device's Instance GUID can change, so after > vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer() can not find > the original vmbus channel of the VF, and hence we can't complete() > vmbus_connection.ready_for_resume_event in check_ready_for_resume_event(), > and the VM hangs in vmbus_bus_resume() forever. > > Fix the issue by adding a timeout, so the resuming can still succeed, and > the saved state is not lost, and according to my test, the user can disable > Accelerated Networking and then will be able to SSH into the VM for > further recovery. Also prevent the VM in question from suspending again. > > The host will be fixed so in future the Instance GUID will stay the same > across hibernation. > > Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels") > Signed-off-by: Dexuan Cui <decui@microsoft.com> > Reviewed-by: Michael Kelley <mikelley@microsoft.com> > Link: https://lore.kernel.org/r/20200905025555.45614-1-decui@microsoft.com > Signed-off-by: Wei Liu <wei.liu@kernel.org> > (cherry picked from commit 19873eec7e13fda140a0ebc75d6664e57c00bfb1) > Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> > --- > > That's a clean cherry-pick from upstream and I was able to reproduce > the issue and confirm it fixes the problem as described in the bug. > > 5.8 doesn't need the fix because it was already applied via upstream > stable updates. > > --- > drivers/hv/vmbus_drv.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c > index 42fa92b005df..847c652ad00e 100644 > --- a/drivers/hv/vmbus_drv.c > +++ b/drivers/hv/vmbus_drv.c > @@ -2230,7 +2230,10 @@ static int vmbus_bus_suspend(struct device *dev) > if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0) > wait_for_completion(&vmbus_connection.ready_for_suspend_event); > > - WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0); > + if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) { > + pr_err("Can not suspend due to a previous failed resuming\n"); I'd prefer "previous failed resume", but that a nit pick. > + return -EBUSY; > + } > > mutex_lock(&vmbus_connection.channel_mutex); > > @@ -2304,7 +2307,9 @@ static int vmbus_bus_resume(struct device *dev) > > vmbus_request_offers(); > > - wait_for_completion(&vmbus_connection.ready_for_resume_event); > + if (wait_for_completion_timeout( > + &vmbus_connection.ready_for_resume_event, 10 * HZ) == 0) > + pr_err("Some vmbus device is missing after suspending?\n"); > > /* Reset the event for the next suspend. */ > reinit_completion(&vmbus_connection.ready_for_suspend_event); > Clean cherry pick. Looks good to me. Thanks Marcelo Acked-by: Colin Ian King <colin.king@canonical.com>
On 23.10.20 22:39, Marcelo Henrique Cerri wrote: > From: Dexuan Cui <decui@microsoft.com> > > BugLink: https://bugs.launchpad.net/bugs/1894895 > > After we Stop and later Start a VM that uses Accelerated Networking (NIC > SR-IOV), currently the VF vmbus device's Instance GUID can change, so after > vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer() can not find > the original vmbus channel of the VF, and hence we can't complete() > vmbus_connection.ready_for_resume_event in check_ready_for_resume_event(), > and the VM hangs in vmbus_bus_resume() forever. > > Fix the issue by adding a timeout, so the resuming can still succeed, and > the saved state is not lost, and according to my test, the user can disable > Accelerated Networking and then will be able to SSH into the VM for > further recovery. Also prevent the VM in question from suspending again. > > The host will be fixed so in future the Instance GUID will stay the same > across hibernation. > > Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels") > Signed-off-by: Dexuan Cui <decui@microsoft.com> > Reviewed-by: Michael Kelley <mikelley@microsoft.com> > Link: https://lore.kernel.org/r/20200905025555.45614-1-decui@microsoft.com > Signed-off-by: Wei Liu <wei.liu@kernel.org> > (cherry picked from commit 19873eec7e13fda140a0ebc75d6664e57c00bfb1) > Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> > --- > > That's a clean cherry-pick from upstream and I was able to reproduce > the issue and confirm it fixes the problem as described in the bug. > > 5.8 doesn't need the fix because it was already applied via upstream > stable updates. > > --- > drivers/hv/vmbus_drv.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c > index 42fa92b005df..847c652ad00e 100644 > --- a/drivers/hv/vmbus_drv.c > +++ b/drivers/hv/vmbus_drv.c > @@ -2230,7 +2230,10 @@ static int vmbus_bus_suspend(struct device *dev) > if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0) > wait_for_completion(&vmbus_connection.ready_for_suspend_event); > > - WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0); > + if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) { > + pr_err("Can not suspend due to a previous failed resuming\n"); > + return -EBUSY; > + } > > mutex_lock(&vmbus_connection.channel_mutex); > > @@ -2304,7 +2307,9 @@ static int vmbus_bus_resume(struct device *dev) > > vmbus_request_offers(); > > - wait_for_completion(&vmbus_connection.ready_for_resume_event); > + if (wait_for_completion_timeout( > + &vmbus_connection.ready_for_resume_event, 10 * HZ) == 0) > + pr_err("Some vmbus device is missing after suspending?\n"); > > /* Reset the event for the next suspend. */ > reinit_completion(&vmbus_connection.ready_for_suspend_event); >
Applied to Azure Focal/master-next Thanks! Ian On 2020-10-23 17:39:43 , Marcelo Henrique Cerri wrote: > From: Dexuan Cui <decui@microsoft.com> > > BugLink: https://bugs.launchpad.net/bugs/1894895 > > After we Stop and later Start a VM that uses Accelerated Networking (NIC > SR-IOV), currently the VF vmbus device's Instance GUID can change, so after > vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer() can not find > the original vmbus channel of the VF, and hence we can't complete() > vmbus_connection.ready_for_resume_event in check_ready_for_resume_event(), > and the VM hangs in vmbus_bus_resume() forever. > > Fix the issue by adding a timeout, so the resuming can still succeed, and > the saved state is not lost, and according to my test, the user can disable > Accelerated Networking and then will be able to SSH into the VM for > further recovery. Also prevent the VM in question from suspending again. > > The host will be fixed so in future the Instance GUID will stay the same > across hibernation. > > Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels") > Signed-off-by: Dexuan Cui <decui@microsoft.com> > Reviewed-by: Michael Kelley <mikelley@microsoft.com> > Link: https://lore.kernel.org/r/20200905025555.45614-1-decui@microsoft.com > Signed-off-by: Wei Liu <wei.liu@kernel.org> > (cherry picked from commit 19873eec7e13fda140a0ebc75d6664e57c00bfb1) > Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> > --- > > That's a clean cherry-pick from upstream and I was able to reproduce > the issue and confirm it fixes the problem as described in the bug. > > 5.8 doesn't need the fix because it was already applied via upstream > stable updates. > > --- > drivers/hv/vmbus_drv.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c > index 42fa92b005df..847c652ad00e 100644 > --- a/drivers/hv/vmbus_drv.c > +++ b/drivers/hv/vmbus_drv.c > @@ -2230,7 +2230,10 @@ static int vmbus_bus_suspend(struct device *dev) > if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0) > wait_for_completion(&vmbus_connection.ready_for_suspend_event); > > - WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0); > + if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) { > + pr_err("Can not suspend due to a previous failed resuming\n"); > + return -EBUSY; > + } > > mutex_lock(&vmbus_connection.channel_mutex); > > @@ -2304,7 +2307,9 @@ static int vmbus_bus_resume(struct device *dev) > > vmbus_request_offers(); > > - wait_for_completion(&vmbus_connection.ready_for_resume_event); > + if (wait_for_completion_timeout( > + &vmbus_connection.ready_for_resume_event, 10 * HZ) == 0) > + pr_err("Some vmbus device is missing after suspending?\n"); > > /* Reset the event for the next suspend. */ > reinit_completion(&vmbus_connection.ready_for_suspend_event); > -- > 2.25.1 > > > -- > kernel-team mailing list > kernel-team@lists.ubuntu.com > https://lists.ubuntu.com/mailman/listinfo/kernel-team
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index 42fa92b005df..847c652ad00e 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -2230,7 +2230,10 @@ static int vmbus_bus_suspend(struct device *dev) if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0) wait_for_completion(&vmbus_connection.ready_for_suspend_event); - WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0); + if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) { + pr_err("Can not suspend due to a previous failed resuming\n"); + return -EBUSY; + } mutex_lock(&vmbus_connection.channel_mutex); @@ -2304,7 +2307,9 @@ static int vmbus_bus_resume(struct device *dev) vmbus_request_offers(); - wait_for_completion(&vmbus_connection.ready_for_resume_event); + if (wait_for_completion_timeout( + &vmbus_connection.ready_for_resume_event, 10 * HZ) == 0) + pr_err("Some vmbus device is missing after suspending?\n"); /* Reset the event for the next suspend. */ reinit_completion(&vmbus_connection.ready_for_suspend_event);