diff mbox series

[focal:linux-azure,1/1] Drivers: hv: vmbus: hibernation: do not hang forever in vmbus_bus_resume()

Message ID 20201023203943.2528465-1-marcelo.cerri@canonical.com
State New
Headers show
Series [focal:linux-azure,1/1] Drivers: hv: vmbus: hibernation: do not hang forever in vmbus_bus_resume() | expand

Commit Message

Marcelo Henrique Cerri Oct. 23, 2020, 8:39 p.m. UTC
From: Dexuan Cui <decui@microsoft.com>

BugLink: https://bugs.launchpad.net/bugs/1894895

After we Stop and later Start a VM that uses Accelerated Networking (NIC
SR-IOV), currently the VF vmbus device's Instance GUID can change, so after
vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer() can not find
the original vmbus channel of the VF, and hence we can't complete()
vmbus_connection.ready_for_resume_event in check_ready_for_resume_event(),
and the VM hangs in vmbus_bus_resume() forever.

Fix the issue by adding a timeout, so the resuming can still succeed, and
the saved state is not lost, and according to my test, the user can disable
Accelerated Networking and then will be able to SSH into the VM for
further recovery. Also prevent the VM in question from suspending again.

The host will be fixed so in future the Instance GUID will stay the same
across hibernation.

Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20200905025555.45614-1-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
(cherry picked from commit 19873eec7e13fda140a0ebc75d6664e57c00bfb1)
Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
---

That's a clean cherry-pick from upstream and I was able to reproduce
the issue and confirm it fixes the problem as described in the bug.

5.8 doesn't need the fix because it was already applied via upstream
stable updates.

---
 drivers/hv/vmbus_drv.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Colin Ian King Oct. 23, 2020, 9:59 p.m. UTC | #1
On 23/10/2020 21:39, Marcelo Henrique Cerri wrote:
> From: Dexuan Cui <decui@microsoft.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1894895
> 
> After we Stop and later Start a VM that uses Accelerated Networking (NIC
> SR-IOV), currently the VF vmbus device's Instance GUID can change, so after
> vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer() can not find
> the original vmbus channel of the VF, and hence we can't complete()
> vmbus_connection.ready_for_resume_event in check_ready_for_resume_event(),
> and the VM hangs in vmbus_bus_resume() forever.
> 
> Fix the issue by adding a timeout, so the resuming can still succeed, and
> the saved state is not lost, and according to my test, the user can disable
> Accelerated Networking and then will be able to SSH into the VM for
> further recovery. Also prevent the VM in question from suspending again.
> 
> The host will be fixed so in future the Instance GUID will stay the same
> across hibernation.
> 
> Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Reviewed-by: Michael Kelley <mikelley@microsoft.com>
> Link: https://lore.kernel.org/r/20200905025555.45614-1-decui@microsoft.com
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> (cherry picked from commit 19873eec7e13fda140a0ebc75d6664e57c00bfb1)
> Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
> ---
> 
> That's a clean cherry-pick from upstream and I was able to reproduce
> the issue and confirm it fixes the problem as described in the bug.
> 
> 5.8 doesn't need the fix because it was already applied via upstream
> stable updates.
> 
> ---
>  drivers/hv/vmbus_drv.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 42fa92b005df..847c652ad00e 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2230,7 +2230,10 @@ static int vmbus_bus_suspend(struct device *dev)
>  	if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0)
>  		wait_for_completion(&vmbus_connection.ready_for_suspend_event);
>  
> -	WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0);
> +	if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) {
> +		pr_err("Can not suspend due to a previous failed resuming\n");

I'd prefer "previous failed resume", but that a nit pick.

> +		return -EBUSY;
> +	}
>  
>  	mutex_lock(&vmbus_connection.channel_mutex);
>  
> @@ -2304,7 +2307,9 @@ static int vmbus_bus_resume(struct device *dev)
>  
>  	vmbus_request_offers();
>  
> -	wait_for_completion(&vmbus_connection.ready_for_resume_event);
> +	if (wait_for_completion_timeout(
> +		&vmbus_connection.ready_for_resume_event, 10 * HZ) == 0)
> +		pr_err("Some vmbus device is missing after suspending?\n");
>  
>  	/* Reset the event for the next suspend. */
>  	reinit_completion(&vmbus_connection.ready_for_suspend_event);
> 

Clean cherry pick. Looks good to me. Thanks Marcelo

Acked-by: Colin Ian King <colin.king@canonical.com>
Stefan Bader Oct. 26, 2020, 8:18 a.m. UTC | #2
On 23.10.20 22:39, Marcelo Henrique Cerri wrote:
> From: Dexuan Cui <decui@microsoft.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1894895
> 
> After we Stop and later Start a VM that uses Accelerated Networking (NIC
> SR-IOV), currently the VF vmbus device's Instance GUID can change, so after
> vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer() can not find
> the original vmbus channel of the VF, and hence we can't complete()
> vmbus_connection.ready_for_resume_event in check_ready_for_resume_event(),
> and the VM hangs in vmbus_bus_resume() forever.
> 
> Fix the issue by adding a timeout, so the resuming can still succeed, and
> the saved state is not lost, and according to my test, the user can disable
> Accelerated Networking and then will be able to SSH into the VM for
> further recovery. Also prevent the VM in question from suspending again.
> 
> The host will be fixed so in future the Instance GUID will stay the same
> across hibernation.
> 
> Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Reviewed-by: Michael Kelley <mikelley@microsoft.com>
> Link: https://lore.kernel.org/r/20200905025555.45614-1-decui@microsoft.com
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> (cherry picked from commit 19873eec7e13fda140a0ebc75d6664e57c00bfb1)
> Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
> ---
> 
> That's a clean cherry-pick from upstream and I was able to reproduce
> the issue and confirm it fixes the problem as described in the bug.
> 
> 5.8 doesn't need the fix because it was already applied via upstream
> stable updates.
> 
> ---
>  drivers/hv/vmbus_drv.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 42fa92b005df..847c652ad00e 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2230,7 +2230,10 @@ static int vmbus_bus_suspend(struct device *dev)
>  	if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0)
>  		wait_for_completion(&vmbus_connection.ready_for_suspend_event);
>  
> -	WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0);
> +	if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) {
> +		pr_err("Can not suspend due to a previous failed resuming\n");
> +		return -EBUSY;
> +	}
>  
>  	mutex_lock(&vmbus_connection.channel_mutex);
>  
> @@ -2304,7 +2307,9 @@ static int vmbus_bus_resume(struct device *dev)
>  
>  	vmbus_request_offers();
>  
> -	wait_for_completion(&vmbus_connection.ready_for_resume_event);
> +	if (wait_for_completion_timeout(
> +		&vmbus_connection.ready_for_resume_event, 10 * HZ) == 0)
> +		pr_err("Some vmbus device is missing after suspending?\n");
>  
>  	/* Reset the event for the next suspend. */
>  	reinit_completion(&vmbus_connection.ready_for_suspend_event);
>
Ian May Oct. 26, 2020, 6:30 p.m. UTC | #3
Applied to Azure Focal/master-next

Thanks!
Ian

On 2020-10-23 17:39:43 , Marcelo Henrique Cerri wrote:
> From: Dexuan Cui <decui@microsoft.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1894895
> 
> After we Stop and later Start a VM that uses Accelerated Networking (NIC
> SR-IOV), currently the VF vmbus device's Instance GUID can change, so after
> vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer() can not find
> the original vmbus channel of the VF, and hence we can't complete()
> vmbus_connection.ready_for_resume_event in check_ready_for_resume_event(),
> and the VM hangs in vmbus_bus_resume() forever.
> 
> Fix the issue by adding a timeout, so the resuming can still succeed, and
> the saved state is not lost, and according to my test, the user can disable
> Accelerated Networking and then will be able to SSH into the VM for
> further recovery. Also prevent the VM in question from suspending again.
> 
> The host will be fixed so in future the Instance GUID will stay the same
> across hibernation.
> 
> Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Reviewed-by: Michael Kelley <mikelley@microsoft.com>
> Link: https://lore.kernel.org/r/20200905025555.45614-1-decui@microsoft.com
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> (cherry picked from commit 19873eec7e13fda140a0ebc75d6664e57c00bfb1)
> Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
> ---
> 
> That's a clean cherry-pick from upstream and I was able to reproduce
> the issue and confirm it fixes the problem as described in the bug.
> 
> 5.8 doesn't need the fix because it was already applied via upstream
> stable updates.
> 
> ---
>  drivers/hv/vmbus_drv.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 42fa92b005df..847c652ad00e 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2230,7 +2230,10 @@ static int vmbus_bus_suspend(struct device *dev)
>  	if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0)
>  		wait_for_completion(&vmbus_connection.ready_for_suspend_event);
>  
> -	WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0);
> +	if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) {
> +		pr_err("Can not suspend due to a previous failed resuming\n");
> +		return -EBUSY;
> +	}
>  
>  	mutex_lock(&vmbus_connection.channel_mutex);
>  
> @@ -2304,7 +2307,9 @@ static int vmbus_bus_resume(struct device *dev)
>  
>  	vmbus_request_offers();
>  
> -	wait_for_completion(&vmbus_connection.ready_for_resume_event);
> +	if (wait_for_completion_timeout(
> +		&vmbus_connection.ready_for_resume_event, 10 * HZ) == 0)
> +		pr_err("Some vmbus device is missing after suspending?\n");
>  
>  	/* Reset the event for the next suspend. */
>  	reinit_completion(&vmbus_connection.ready_for_suspend_event);
> -- 
> 2.25.1
> 
> 
> -- 
> kernel-team mailing list
> kernel-team@lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
diff mbox series

Patch

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 42fa92b005df..847c652ad00e 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2230,7 +2230,10 @@  static int vmbus_bus_suspend(struct device *dev)
 	if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0)
 		wait_for_completion(&vmbus_connection.ready_for_suspend_event);
 
-	WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0);
+	if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) {
+		pr_err("Can not suspend due to a previous failed resuming\n");
+		return -EBUSY;
+	}
 
 	mutex_lock(&vmbus_connection.channel_mutex);
 
@@ -2304,7 +2307,9 @@  static int vmbus_bus_resume(struct device *dev)
 
 	vmbus_request_offers();
 
-	wait_for_completion(&vmbus_connection.ready_for_resume_event);
+	if (wait_for_completion_timeout(
+		&vmbus_connection.ready_for_resume_event, 10 * HZ) == 0)
+		pr_err("Some vmbus device is missing after suspending?\n");
 
 	/* Reset the event for the next suspend. */
 	reinit_completion(&vmbus_connection.ready_for_suspend_event);