[SRU,F/aws,v3,0/6] aws: proper fix for c5.18xlarge hibernation issues

Message ID 20210519151513.309935-1-andrea.righi@canonical.com

Message

Andrea Righi May 19, 2021, 3:15 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1920944

[Impact]

In LP: #1918694 we applied a fix and a workaround to solve the
hibernation issues on c5.18xlarge. The workaround was in the form of a
SAUCE patch:

  "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"

It looks like we can replace this workaround with a proper fix, by
applying this patch:

http://next.patchew.org/Linux/20210414123544.1060604-1-vkuznets@redhat.com/

[Test plan]

Create a c5.18xlarge instance, run the memory stress test script (the
same script that we use to stress test hibernation), trigger the
hibernate event, then trigger the resume event. Repeat the cycle a
couple of times: the problem is very likely to show up.

[Fix]

Replace "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"
with:

http://next.patchew.org/Linux/20210414123544.1060604-1-vkuznets@redhat.com/

The fix has been tested extensively in the AWS infrastructure with
positive results.

[Where problems could occur]

The new code introduced by the fix can also be executed when a CPU is
taken offline, so we may see regressions in KVM CPU hotplug.
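
One way to look for hotplug regressions is to take a CPU offline and bring
it back online repeatedly through sysfs. The sketch below is not part of
the test plan above; it only illustrates the idea. The function names are
hypothetical, and the base directory is a parameter so the helper can be
tried against a scratch directory first. On a real guest the base would be
/sys/devices/system/cpu, the writes need root, and the boot CPU may not
expose an "online" file, so a secondary CPU such as cpu1 is the usual target.

```c
#include <stdio.h>

/*
 * Write "0" or "1" to <base>/cpu<N>/online.
 * Assumption: base is "/sys/devices/system/cpu" on a real system.
 * Returns 0 on success, -1 on any failure.
 */
static int set_cpu_online(const char *base, int cpu, int online)
{
    char path[256];
    FILE *f;
    int n = snprintf(path, sizeof(path), "%s/cpu%d/online", base, cpu);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;              /* path did not fit in the buffer */

    f = fopen(path, "w");
    if (!f)
        return -1;              /* no such CPU, or no permission */

    if (fputs(online ? "1" : "0", f) == EOF) {
        fclose(f);
        return -1;
    }
    /* The kernel reports hotplug errors when the write is flushed. */
    return fclose(f) == 0 ? 0 : -1;
}

/*
 * Offline and then online one CPU 'rounds' times.
 * Returns the number of failed writes (0 means every step succeeded).
 */
static int cycle_cpu(const char *base, int cpu, int rounds)
{
    int failures = 0;

    for (int i = 0; i < rounds; i++) {
        if (set_cpu_online(base, cpu, 0) != 0)
            failures++;
        if (set_cpu_online(base, cpu, 1) != 0)
            failures++;
    }
    return failures;
}
```

For example, cycle_cpu("/sys/devices/system/cpu", 1, 10) run as root would
toggle cpu1 ten times; checking dmesg afterwards for warnings is left to
the tester.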

----------------------------------------------------------------
Changelog (v2 -> v3):
 - updated backported / signed-off lines with the right upstream info
   (thanks Guilherme!)

NOTE: backport activity was minimal; it only required some context
adjustments for the changes to apply properly.

Andrea Righi (1):
      Revert "UBUNTU: SAUCE: aws: kvm: double the size of hv_clock_boot"

Vitaly Kuznetsov (5):
      x86/kvm: Fix pr_info() for async PF setup/teardown
      x86/kvm: Teardown PV features on boot CPU as well
      x86/kvm: Disable kvmclock on all CPUs on shutdown
      x86/kvm: Disable all PV features on crash
      x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()

 arch/x86/include/asm/kvm_para.h |   9 ++----
 arch/x86/kernel/kvm.c           | 113 ++++++++++++++++++++++++++++++++++++++++++++----------------------
 arch/x86/kernel/kvmclock.c      |  28 ++---------------
 3 files changed, 79 insertions(+), 71 deletions(-)

Comments

Guilherme G. Piccoli May 19, 2021, 4:23 p.m. UTC | #1
Thanks a bunch Andrea, looks great to me:

Acked-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Tim Gardner May 19, 2021, 6:23 p.m. UTC | #2
Acked-by: Tim Gardner <tim.gardner@canonical.com>

pr_info() exists in focal/linux-aws. I'm curious why you didn't preserve
it in patch 2/6?

Andrea Righi May 19, 2021, 7:33 p.m. UTC | #3
On Wed, May 19, 2021 at 12:23:22PM -0600, Tim Gardner wrote:
> Acked-by: Tim Gardner <tim.gardner@canonical.com>
> 
> pr_info() exists in focal/linux-aws. I'm curious why you didn't preserve it
> in patch 2/6 ?

Good point, I could have used pr_info(), but the upstream patch changes
one pr_info() into another pr_info(), while the code in focal has a
printk() there, so I thought it was more consistent to keep the printk()
and change only the text, as the upstream patch does.

-Andrea
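
For readers unfamiliar with the two spellings: in the kernel, pr_info(fmt, ...)
is a macro that expands to printk(KERN_INFO pr_fmt(fmt), ...), so with the
default pr_fmt() both forms produce the same message. The userspace sketch
below mimics that relationship. Everything in it is a stand-in: printk() is
a small printf wrapper, KERN_INFO is written as "<6>" only so that it is
printable, and the message text and helper names are hypothetical rather
than taken from the patch.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Stand-in: the real KERN_INFO is an SOH byte followed by '6'. */
#define KERN_INFO "<6>"

/* Kernel files may redefine pr_fmt() to add a prefix; default is a no-op. */
#ifndef pr_fmt
#define pr_fmt(fmt) fmt
#endif

/* Hypothetical capture buffer so the emitted text can be compared. */
static char last_msg[256];

/* Userspace stand-in for the kernel's printk(). */
static int printk(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(last_msg, sizeof(last_msg), fmt, ap);
    va_end(ap);
    fputs(last_msg, stdout);
    return n;
}

/* Same shape as the kernel macro (##__VA_ARGS__ is a GNU extension). */
#define pr_info(fmt, ...) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)

/* Old style: explicit log level, as in the code being backported to. */
static void report_with_printk(int cpu)
{
    printk(KERN_INFO "example message for cpu %d\n", cpu);
}

/* New style: the log level is supplied by the macro. */
static void report_with_pr_info(int cpu)
{
    pr_info("example message for cpu %d\n", cpu);
}

/* Returns 1 when both spellings emit byte-identical text. */
static int same_output(int cpu)
{
    char first[sizeof(last_msg)];

    report_with_printk(cpu);
    strcpy(first, last_msg);
    report_with_pr_info(cpu);
    return strcmp(first, last_msg) == 0;
}
```

The two only diverge when a file defines its own pr_fmt() prefix, which is
one reason a backport may prefer to leave an existing printk() call alone
and change only the message text.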

Kelsey Skunberg May 28, 2021, 11:48 p.m. UTC | #4
Applied to F/aws. Thank you!

-Kelsey
