[V3,07/22] cpr

Message ID 1620390320-301716-8-git-send-email-steven.sistare@oracle.com
State New
Series Live Update

Commit Message

Steven Sistare May 7, 2021, 12:25 p.m. UTC
Provide the cprsave and cprload functions for live update.  These save and
restore VM state, with minimal guest pause time, so that qemu may be updated
to a new version in between.

cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
/usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
paused state and waits for the cprload command.

To use the restart mode, qemu must be started with the memfd-alloc machine
option.  The memfd's are saved to the environment and kept open across exec,
after which they are found from the environment and re-mmap'd.  Hence guest
ram is preserved in place, albeit with new virtual addresses in the qemu
process.  The caller resumes the guest by calling cprload, which loads
state from the file.  If the VM was running at cprsave time, then VM
execution resumes.  cprsave supports any type of guest image and block
device, but the caller must not modify guest block devices between cprsave
and cprload.
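
As a rough illustration of the memfd handoff described above, here is a
minimal standalone sketch. It is not the code from this series. The names
DEMO_FD_PREFIX, publish_memfd and remap_memfd are hypothetical, and the
sketch simulates the post-exec lookup inside a single process instead of
actually calling exec. It assumes Linux with glibc 2.27 or later for
memfd_create().

```c
#define _GNU_SOURCE             /* for memfd_create() */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical prefix; the series has its own FD_PREFIX in qemu/env.h. */
#define DEMO_FD_PREFIX "DEMO_FD_"

static void demo_key(char *key, size_t size, const char *name)
{
    snprintf(key, size, DEMO_FD_PREFIX "%s", name);
}

/* Create a memfd of len bytes and record its number in the environment. */
static int publish_memfd(const char *name, size_t len)
{
    char key[128], val[16];
    int fd, flags;

    assert(name != NULL && len > 0);
    fd = memfd_create(name, MFD_CLOEXEC);
    if (fd < 0) {
        return -1;
    }
    /* Size the file, then clear close-on-exec so the fd survives exec. */
    flags = fcntl(fd, F_GETFD);
    if (ftruncate(fd, (off_t)len) < 0 || flags < 0 ||
        fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0) {
        close(fd);
        return -1;
    }
    demo_key(key, sizeof(key), name);
    snprintf(val, sizeof(val), "%d", fd);
    if (setenv(key, val, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* What the new process would do: find the fd by name and map it again. */
static void *remap_memfd(const char *name, size_t len)
{
    char key[128];
    const char *val;
    void *p;

    demo_key(key, sizeof(key), name);
    val = getenv(key);
    if (val == NULL) {
        return NULL;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, atoi(val), 0);
    return p == MAP_FAILED ? NULL : p;
}
```

The second mapping may land at a different virtual address than the first,
but it is backed by the same pages, which is the property the restart mode
relies on.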

For the reboot mode, cprsave saves state and exits qemu, and the caller is
allowed to update the host kernel and system software and reboot.  The
caller resumes the guest by running qemu with the same arguments as the
original process and calling cprload.  To use this mode, guest ram must be
mapped to a persistent shared memory file such as /dev/dax0.0 or /dev/shm
PKRAM.

The reboot mode supports vfio devices if the caller suspends the guest
instead of stopping the VM, such as by issuing guest-suspend-ram to the
qemu guest agent.  The guest drivers' suspend methods flush outstanding
requests and re-initialize the devices, and thus there is no device state
to save and restore.

The restart mode supports vfio devices in a subsequent patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h   |  17 +++++
 include/sysemu/runstate.h |   1 +
 migration/cpr.c           | 188 ++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build     |   1 +
 migration/savevm.h        |   2 +
 softmmu/physmem.c         |   6 +-
 softmmu/runstate.c        |  21 +++++-
 softmmu/vl.c              |   6 ++
 8 files changed, 240 insertions(+), 2 deletions(-)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr.c

Comments

Stefan Hajnoczi May 12, 2021, 4:19 p.m. UTC | #1
On Fri, May 07, 2021 at 05:25:05AM -0700, Steve Sistare wrote:
> To use the restart mode, qemu must be started with the memfd-alloc machine
> option.  The memfd's are saved to the environment and kept open across exec,
> after which they are found from the environment and re-mmap'd.  Hence guest
> ram is preserved in place, albeit with new virtual addresses in the qemu
> process.  The caller resumes the guest by calling cprload, which loads
> state from the file.  If the VM was running at cprsave time, then VM
> execution resumes.  cprsave supports any type of guest image and block
> device, but the caller must not modify guest block devices between cprsave
> and cprload.

Does QEMU's existing -object memory-backend-file on tmpfs or hugetlbfs
achieve the same thing?

Steven Sistare May 13, 2021, 8:21 p.m. UTC | #2
On 5/12/2021 12:19 PM, Stefan Hajnoczi wrote:
> On Fri, May 07, 2021 at 05:25:05AM -0700, Steve Sistare wrote:
>> To use the restart mode, qemu must be started with the memfd-alloc machine
>> option.  The memfd's are saved to the environment and kept open across exec,
>> after which they are found from the environment and re-mmap'd.  Hence guest
>> ram is preserved in place, albeit with new virtual addresses in the qemu
>> process.  The caller resumes the guest by calling cprload, which loads
>> state from the file.  If the VM was running at cprsave time, then VM
>> execution resumes.  cprsave supports any type of guest image and block
>> device, but the caller must not modify guest block devices between cprsave
>> and cprload.
> 
> Does QEMU's existing -object memory-backend-file on tmpfs or hugetlbfs
> achieve the same thing?

Not quite.  Various secondary anonymous memory objects are allocated via ram_block_add
and must be preserved, such as these on x86_64.  
  vga.vram
  pc.ram
  pc.bios
  pc.rom
  vga.rom
  rom@etc/acpi/tables
  rom@etc/table-loader
  rom@etc/acpi/rsdp

Even the read-only areas must be preserved rather than recreated from files in the updated
qemu, as their contents may have changed.

- Steve

Stefan Hajnoczi May 14, 2021, 11:28 a.m. UTC | #3
On Thu, May 13, 2021 at 04:21:02PM -0400, Steven Sistare wrote:
> On 5/12/2021 12:19 PM, Stefan Hajnoczi wrote:
> > On Fri, May 07, 2021 at 05:25:05AM -0700, Steve Sistare wrote:
> >> To use the restart mode, qemu must be started with the memfd-alloc machine
> >> option.  The memfd's are saved to the environment and kept open across exec,
> >> after which they are found from the environment and re-mmap'd.  Hence guest
> >> ram is preserved in place, albeit with new virtual addresses in the qemu
> >> process.  The caller resumes the guest by calling cprload, which loads
> >> state from the file.  If the VM was running at cprsave time, then VM
> >> execution resumes.  cprsave supports any type of guest image and block
> >> device, but the caller must not modify guest block devices between cprsave
> >> and cprload.
> > 
> > Does QEMU's existing -object memory-backend-file on tmpfs or hugetlbfs
> > achieve the same thing?
> 
> Not quite.  Various secondary anonymous memory objects are allocated via ram_block_add
> and must be preserved, such as these on x86_64.  
>   vga.vram
>   pc.ram
>   pc.bios
>   pc.rom
>   vga.rom
>   rom@etc/acpi/tables
>   rom@etc/table-loader
>   rom@etc/acpi/rsdp
> 
> Even the read-only areas must be preserved rather than recreated from files in the updated
> qemu, as their contents may have changed.

Migration knows how to save/load these RAM blocks. Only pc.ram is
significant in size so I'm not sure it's worth special-casing the
others?

Stefan

Steven Sistare May 14, 2021, 3:14 p.m. UTC | #4
On 5/14/2021 7:28 AM, Stefan Hajnoczi wrote:
> On Thu, May 13, 2021 at 04:21:02PM -0400, Steven Sistare wrote:
>> On 5/12/2021 12:19 PM, Stefan Hajnoczi wrote:
>>> On Fri, May 07, 2021 at 05:25:05AM -0700, Steve Sistare wrote:
>>>> To use the restart mode, qemu must be started with the memfd-alloc machine
>>>> option.  The memfd's are saved to the environment and kept open across exec,
>>>> after which they are found from the environment and re-mmap'd.  Hence guest
>>>> ram is preserved in place, albeit with new virtual addresses in the qemu
>>>> process.  The caller resumes the guest by calling cprload, which loads
>>>> state from the file.  If the VM was running at cprsave time, then VM
>>>> execution resumes.  cprsave supports any type of guest image and block
>>>> device, but the caller must not modify guest block devices between cprsave
>>>> and cprload.
>>>
>>> Does QEMU's existing -object memory-backend-file on tmpfs or hugetlbfs
>>> achieve the same thing?
>>
>> Not quite.  Various secondary anonymous memory objects are allocated via ram_block_add
>> and must be preserved, such as these on x86_64.  
>>   vga.vram
>>   pc.ram
>>   pc.bios
>>   pc.rom
>>   vga.rom
>>   rom@etc/acpi/tables
>>   rom@etc/table-loader
>>   rom@etc/acpi/rsdp
>>
>> Even the read-only areas must be preserved rather than recreated from files in the updated
>> qemu, as their contents may have changed.
> 
> Migration knows how to save/load these RAM blocks. Only pc.ram is
> significant in size so I'm not sure it's worth special-casing the
> others?

Some of these are mapped for vfio dma as a consequence of the normal memory region callback to
consumers code.  We get conflict errors vs those existing vfio mappings if they are recreated 
and remapped in the new process.  The memfd option is a simple and robust solution to that issue.

- Steve

Stefan Hajnoczi May 18, 2021, 1:42 p.m. UTC | #5
On Fri, May 14, 2021 at 11:14:44AM -0400, Steven Sistare wrote:
> On 5/14/2021 7:28 AM, Stefan Hajnoczi wrote:
> > On Thu, May 13, 2021 at 04:21:02PM -0400, Steven Sistare wrote:
> >> On 5/12/2021 12:19 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 07, 2021 at 05:25:05AM -0700, Steve Sistare wrote:
> >>>> To use the restart mode, qemu must be started with the memfd-alloc machine
> >>>> option.  The memfd's are saved to the environment and kept open across exec,
> >>>> after which they are found from the environment and re-mmap'd.  Hence guest
> >>>> ram is preserved in place, albeit with new virtual addresses in the qemu
> >>>> process.  The caller resumes the guest by calling cprload, which loads
> >>>> state from the file.  If the VM was running at cprsave time, then VM
> >>>> execution resumes.  cprsave supports any type of guest image and block
> >>>> device, but the caller must not modify guest block devices between cprsave
> >>>> and cprload.
> >>>
> >>> Does QEMU's existing -object memory-backend-file on tmpfs or hugetlbfs
> >>> achieve the same thing?
> >>
> >> Not quite.  Various secondary anonymous memory objects are allocated via ram_block_add
> >> and must be preserved, such as these on x86_64.  
> >>   vga.vram
> >>   pc.ram
> >>   pc.bios
> >>   pc.rom
> >>   vga.rom
> >>   rom@etc/acpi/tables
> >>   rom@etc/table-loader
> >>   rom@etc/acpi/rsdp
> >>
> >> Even the read-only areas must be preserved rather than recreated from files in the updated
> >> qemu, as their contents may have changed.
> > 
> > Migration knows how to save/load these RAM blocks. Only pc.ram is
> > significant in size so I'm not sure it's worth special-casing the
> > others?
> 
> Some of these are mapped for vfio dma as a consequence of the normal memory region callback to
> consumers code.  We get conflict errors vs those existing vfio mappings if they are recreated 
> and remapped in the new process.  The memfd option is a simple and robust solution to that issue.

Okay, if the VFIO device DMAs to them then they need to stay alive. Live
migration cannot copy their contents since they could be DMAed to at any
time and we'd copy stale data.

Stefan

Patch

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
new file mode 100644
index 0000000..42dec4e
--- /dev/null
+++ b/include/migration/cpr.h
@@ -0,0 +1,17 @@ 
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef MIGRATION_CPR_H
+#define MIGRATION_CPR_H
+
+#include "qapi/qapi-types-cpr.h"
+
+bool cpr_active(void);
+void cprsave(const char *file, CprMode mode, Error **errp);
+void cprload(const char *file, Error **errp);
+
+#endif
diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index 50c84af..d69dc2d 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -51,6 +51,7 @@  void qemu_system_reset_request(ShutdownCause reason);
 void qemu_system_suspend_request(void);
 void qemu_register_suspend_notifier(Notifier *notifier);
 bool qemu_wakeup_suspend_enabled(void);
+void qemu_system_start_on_wake_request(void);
 void qemu_system_wakeup_request(WakeupReason reason, Error **errp);
 void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
 void qemu_register_wakeup_notifier(Notifier *notifier);
diff --git a/migration/cpr.c b/migration/cpr.c
new file mode 100644
index 0000000..e0da1cf
--- /dev/null
+++ b/migration/cpr.c
@@ -0,0 +1,188 @@ 
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "monitor/monitor.h"
+#include "migration.h"
+#include "migration/snapshot.h"
+#include "chardev/char.h"
+#include "migration/misc.h"
+#include "migration/cpr.h"
+#include "migration/global_state.h"
+#include "qemu-file-channel.h"
+#include "qemu-file.h"
+#include "savevm.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qerror.h"
+#include "qemu/error-report.h"
+#include "io/channel-buffer.h"
+#include "io/channel-file.h"
+#include "sysemu/cpu-timers.h"
+#include "sysemu/runstate.h"
+#include "sysemu/runstate-action.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/replay.h"
+#include "sysemu/xen.h"
+#include "hw/vfio/vfio-common.h"
+#include "hw/virtio/vhost.h"
+#include "qemu/env.h"
+
+static bool cpr_is_active;
+
+bool cpr_active(void)
+{
+    return cpr_is_active;
+}
+
+QEMUFile *qf_file_open(const char *path, int flags, int mode,
+                       const char *name, Error **errp)
+{
+    QIOChannelFile *fioc;
+    QIOChannel *ioc;
+    QEMUFile *f;
+
+    if (flags & O_RDWR) {
+        error_setg(errp, "qf_file_open %s: O_RDWR not supported", path);
+        return NULL;
+    }
+
+    fioc = qio_channel_file_new_path(path, flags, mode, errp);
+    if (!fioc) {
+        return NULL;
+    }
+
+    ioc = QIO_CHANNEL(fioc);
+    qio_channel_set_name(ioc, name);
+    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
+                             qemu_fopen_channel_input(ioc);
+    object_unref(OBJECT(fioc));
+    return f;
+}
+
+static int preserve_fd(const char *name, const char *val, void *handle)
+{
+    qemu_clr_cloexec(atoi(val));
+    return 0;
+}
+
+void cprsave(const char *file, CprMode mode, Error **errp)
+{
+    int ret = 0;
+    QEMUFile *f;
+    int saved_vm_running = runstate_is_running();
+    bool restart = (mode == CPR_MODE_RESTART);
+    bool reboot = (mode == CPR_MODE_REBOOT);
+
+    if (reboot && qemu_ram_volatile(errp)) {
+        return;
+    }
+
+    if (restart && xen_enabled()) {
+        error_setg(errp, "xen does not support cprsave restart");
+        return;
+    }
+
+    if (migrate_colo_enabled()) {
+        error_setg(errp, "cprsave does not support x-colo");
+        return;
+    }
+
+    if (replay_mode != REPLAY_MODE_NONE) {
+        error_setg(errp, "cprsave does not support replay");
+        return;
+    }
+
+    f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, "cprsave", errp);
+    if (!f) {
+        return;
+    }
+
+    ret = global_state_store();
+    if (ret) {
+        error_setg(errp, "Error saving global state");
+        qemu_fclose(f);
+        return;
+    }
+    if (runstate_check(RUN_STATE_SUSPENDED)) {
+        /* Update timers_state before saving.  Suspend did not do so. */
+        cpu_disable_ticks();
+    }
+    vm_stop(RUN_STATE_SAVE_VM);
+
+    cpr_is_active = true;
+    ret = qemu_save_device_state(f);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_setg(errp, QERR_IO_ERROR);
+        goto err;
+    }
+
+    if (ret < 0) {
+        if (!*errp) {
+            error_setg(errp, "Error %d while saving VM state", ret);
+        }
+        goto err;
+    }
+
+    if (reboot) {
+        shutdown_action = SHUTDOWN_ACTION_POWEROFF;
+        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+    } else if (restart) {
+        walkenv(FD_PREFIX, preserve_fd, 0);
+        setenv("QEMU_START_FREEZE", "", 1);
+        qemu_system_exec_request();
+    }
+    goto done;
+
+err:
+    if (saved_vm_running) {
+        vm_start();
+    }
+done:
+    cpr_is_active = false;
+    return;
+}
+
+void cprload(const char *file, Error **errp)
+{
+    QEMUFile *f;
+    int ret;
+    RunState state;
+
+    if (runstate_is_running()) {
+        error_setg(errp, "cprload called for a running VM");
+        return;
+    }
+
+    f = qf_file_open(file, O_RDONLY, 0, "cprload", errp);
+    if (!f) {
+        return;
+    }
+
+    if (qemu_get_be32(f) != QEMU_VM_FILE_MAGIC ||
+        qemu_get_be32(f) != QEMU_VM_FILE_VERSION) {
+        error_setg(errp, "%s is not a vmstate file", file);
+        qemu_fclose(f);
+        return;
+    }
+    ret = qemu_load_device_state(f);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_setg(errp, "Error %d while loading VM state", ret);
+        return;
+    }
+
+    state = global_state_get_runstate();
+    if (state == RUN_STATE_RUNNING) {
+        vm_start();
+    } else {
+        runstate_set(state);
+        if (runstate_check(RUN_STATE_SUSPENDED)) {
+            qemu_system_start_on_wake_request();
+        }
+    }
+}
diff --git a/migration/meson.build b/migration/meson.build
index 3ecedce..c756374 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -15,6 +15,7 @@  softmmu_ss.add(files(
   'channel.c',
   'colo-failover.c',
   'colo.c',
+  'cpr.c',
   'exec.c',
   'fd.c',
   'global_state.c',
diff --git a/migration/savevm.h b/migration/savevm.h
index 6461342..ce5d710 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -67,5 +67,7 @@  int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
         bool in_postcopy, bool inactivate_disks);
+QEMUFile *qf_file_open(const char *path, int flags, int mode,
+                       const char *name, Error **errp);
 
 #endif
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 695aa10..b79f408 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -68,6 +68,7 @@ 
 #include "qemu/pmem.h"
 
 #include "qemu/memfd.h"
+#include "qemu/env.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -1957,7 +1958,7 @@  static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
         } else {
             name = memory_region_name(new_block->mr);
             if (ms->memfd_alloc) {
-                int mfd = -1;          /* placeholder until next patch */
+                int mfd = getenv_fd(name);
                 mr->align = QEMU_VMALLOC_ALIGN;
                 if (mfd < 0) {
                     mfd = qemu_memfd_create(name, maxlen + mr->align,
@@ -1965,7 +1966,9 @@  static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
                     if (mfd < 0) {
                         return;
                     }
+                    setenv_fd(name, mfd);
                 }
+                qemu_clr_cloexec(mfd);
                 new_block->flags |= RAM_SHARED;
                 addr = file_ram_alloc(new_block, maxlen, mfd,
                                       false, false, 0, errp);
@@ -2214,6 +2217,7 @@  void qemu_ram_free(RAMBlock *block)
     }
 
     qemu_mutex_lock_ramlist();
+    unsetenv_fd(memory_region_name(block->mr));
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
     /* Write list before version */
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index bea7513..07952cc 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -115,6 +115,8 @@  static const RunStateTransition runstate_transitions_def[] = {
     { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING },
     { RUN_STATE_PRELAUNCH, RUN_STATE_FINISH_MIGRATE },
     { RUN_STATE_PRELAUNCH, RUN_STATE_INMIGRATE },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_SUSPENDED },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_PAUSED },
 
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PAUSED },
@@ -334,6 +336,7 @@  void vm_state_notify(bool running, RunState state)
     }
 }
 
+static bool start_on_wake_requested;
 static ShutdownCause reset_requested;
 static ShutdownCause shutdown_requested;
 static int shutdown_signal;
@@ -567,6 +570,11 @@  void qemu_register_suspend_notifier(Notifier *notifier)
     notifier_list_add(&suspend_notifiers, notifier);
 }
 
+void qemu_system_start_on_wake_request(void)
+{
+    start_on_wake_requested = true;
+}
+
 void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
 {
     trace_system_wakeup_request(reason);
@@ -579,7 +587,18 @@  void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
     if (!(wakeup_reason_mask & (1 << reason))) {
         return;
     }
-    runstate_set(RUN_STATE_RUNNING);
+
+    /*
+     * Must call vm_start if it has never been called, to invoke the state
+     * change callbacks for the first time.
+     */
+    if (start_on_wake_requested) {
+        start_on_wake_requested = false;
+        vm_start();
+    } else {
+        runstate_set(RUN_STATE_RUNNING);
+    }
+
     wakeup_reason = reason;
     qemu_notify_event();
 }
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 04ab752..4654693 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -3510,6 +3510,12 @@  void qemu_init(int argc, char **argv, char **envp)
      */
     loc_set_none();
 
+    /* Equivalent to -S, but no need for parent to modify argv. */
+    if (getenv("QEMU_START_FREEZE")) {
+        unsetenv("QEMU_START_FREEZE");
+        autostart = 0;
+    }
+
     qemu_validate_options();
     qemu_process_sugar_options();
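
The softmmu/vl.c hunk above treats QEMU_START_FREEZE as a one-shot flag:
cprsave sets it before the exec request, and the new process clears it on
startup so that it is not inherited again by a later exec. A minimal
standalone sketch of that pattern follows. consume_env_flag is a
hypothetical helper name for illustration and does not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return true once if the named variable is set, and remove it from the
 * environment so the flag is not seen again.  Hypothetical helper, not
 * part of the patch.
 */
static bool consume_env_flag(const char *name)
{
    assert(name != NULL);
    if (getenv(name) == NULL) {
        return false;
    }
    unsetenv(name);
    return true;
}
```

In the patch itself the equivalent check sets autostart = 0, which has the
same effect as passing -S without the parent having to modify argv.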