diff mbox

[15/46] Rework loadvm path for subloops

Message ID 1404495717-4239-16-git-send-email-dgilbert@redhat.com
State New
Headers show

Commit Message

Dr. David Alan Gilbert July 4, 2014, 5:41 p.m. UTC
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Postcopy needs to have two migration streams loading concurrently;
one from memory (with the device state) and the other from the fd
with the memory transactions.

Split the core of qemu_loadvm_state out so we can use it for both.

Allow the inner loadvm loop to quit and signal whether the parent
should.

loadvm_handlers is made static since it's lifetime is greater
than the outer qemu_loadvm_state.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 savevm.c | 136 +++++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 84 insertions(+), 52 deletions(-)

Comments

Paolo Bonzini July 5, 2014, 10:26 a.m. UTC | #1
Il 04/07/2014 19:41, Dr. David Alan Gilbert (git) ha scritto:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Postcopy needs to have two migration streams loading concurrently;
> one from memory (with the device state) and the other from the fd
> with the memory transactions.

Can you explain this?

I would have though the order is

     precopy RAM and everything
     prepare postcopy RAM ("sent && dirty" bitmap)
     finish precopy non-RAM
     finish devices
     postcopy RAM

Why do you need to have all the packaging stuff and a separate 
memory-based migration stream for devices?  I'm sure I'm missing 
something. :)

Paolo
Dr. David Alan Gilbert July 7, 2014, 2:35 p.m. UTC | #2
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> Il 04/07/2014 19:41, Dr. David Alan Gilbert (git) ha scritto:
> >From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> >Postcopy needs to have two migration streams loading concurrently;
> >one from memory (with the device state) and the other from the fd
> >with the memory transactions.
> 
> Can you explain this?
> 
> I would have though the order is
> 
>     precopy RAM and everything
>     prepare postcopy RAM ("sent && dirty" bitmap)
>     finish precopy non-RAM
>     finish devices
>     postcopy RAM
> 
> Why do you need to have all the packaging stuff and a separate memory-based
> migration stream for devices?  I'm sure I'm missing something. :)

The thing you're missing is the details of 'finish devices'.
The device emulation may access guest memory as part of loading it's
state, so you can't successfully complete 'finish devices' without
having the 'postcopy RAM' available to provide pages.
Thus you need to be able to start up 'postcopy RAM' before 'finish devices'
has completed, and you can't do that if 'finish devices' is still stuffing
data down the fd.

Now, if hypothetically you had:
  1) A migration format that let you separate out device state so that you
could load all the state of the device off the fd without calling the device
IO code.
  2) All devices were good and didn't touch guest memory while loading their
state.

then you could avoid this complexity.  However, if you look at how Stefan's
BER code tried to do 1 (which I don't do in my way of doing it), it was by
using the same trick of stuffing the device data into a dummy memory file
to find out the size of the data.   And I'm not convinced (2) will happen
this century.

> Paolo

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini July 7, 2014, 2:53 p.m. UTC | #3
Il 07/07/2014 16:35, Dr. David Alan Gilbert ha scritto:
> * Paolo Bonzini (pbonzini@redhat.com) wrote:
>> Il 04/07/2014 19:41, Dr. David Alan Gilbert (git) ha scritto:
>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>
>>> Postcopy needs to have two migration streams loading concurrently;
>>> one from memory (with the device state) and the other from the fd
>>> with the memory transactions.
>>
>> Can you explain this?
>>
>> I would have though the order is
>>
>>     precopy RAM and everything
>>     prepare postcopy RAM ("sent && dirty" bitmap)
>>     finish precopy non-RAM
>>     finish devices
>>     postcopy RAM
>>
>> Why do you need to have all the packaging stuff and a separate memory-based
>> migration stream for devices?  I'm sure I'm missing something. :)
>
> The thing you're missing is the details of 'finish devices'.
> The device emulation may access guest memory as part of loading it's
> state, so you can't successfully complete 'finish devices' without
> having the 'postcopy RAM' available to provide pages.

I see.  Can you document the flow (preferrably as a reply to this email 
_and_ in docs/ when you send v2 of the code :))?

 From my cursory read of the code it is something like this on the source:

     finish precopy non-RAM
     start RAM postcopy
     for each device
         pack up data
         send it to destination

and on the destination:

     while source sends packet
         pick up packet atomically
         pass the packet to device loader
             (while the loader works, userfaultfd does background magic)

But something is missing still, either some kind of ack is needed 
between device data sends or userfaultfd needs to be able to process 
device data packets.

Paolo

> Thus you need to be able to start up 'postcopy RAM' before 'finish devices'
> has completed, and you can't do that if 'finish devices' is still stuffing
> data down the fd.
>
> Now, if hypothetically you had:
>   1) A migration format that let you separate out device state so that you
> could load all the state of the device off the fd without calling the device
> IO code.
>   2) All devices were good and didn't touch guest memory while loading their
> state.
>
> then you could avoid this complexity.  However, if you look at how Stefan's
> BER code tried to do 1 (which I don't do in my way of doing it), it was by
> using the same trick of stuffing the device data into a dummy memory file
> to find out the size of the data.   And I'm not convinced (2) will happen
> this century.
>
>> Paolo
>
> Dave
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Dr. David Alan Gilbert July 7, 2014, 3:04 p.m. UTC | #4
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> Il 07/07/2014 16:35, Dr. David Alan Gilbert ha scritto:
> >* Paolo Bonzini (pbonzini@redhat.com) wrote:
> >>Il 04/07/2014 19:41, Dr. David Alan Gilbert (git) ha scritto:
> >>>From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >>>
> >>>Postcopy needs to have two migration streams loading concurrently;
> >>>one from memory (with the device state) and the other from the fd
> >>>with the memory transactions.
> >>
> >>Can you explain this?
> >>
> >>I would have though the order is
> >>
> >>    precopy RAM and everything
> >>    prepare postcopy RAM ("sent && dirty" bitmap)
> >>    finish precopy non-RAM
> >>    finish devices
> >>    postcopy RAM
> >>
> >>Why do you need to have all the packaging stuff and a separate memory-based
> >>migration stream for devices?  I'm sure I'm missing something. :)
> >
> >The thing you're missing is the details of 'finish devices'.
> >The device emulation may access guest memory as part of loading it's
> >state, so you can't successfully complete 'finish devices' without
> >having the 'postcopy RAM' available to provide pages.
> 
> I see.  Can you document the flow (preferrably as a reply to this email
> _and_ in docs/ when you send v2 of the code :))?

Yep, will do; I need to go through and check through it before I write the
detail reply.

Dave

> 
> From my cursory read of the code it is something like this on the source:
> 
>     finish precopy non-RAM
>     start RAM postcopy
>     for each device
>         pack up data
>         send it to destination
> 
> and on the destination:
> 
>     while source sends packet
>         pick up packet atomically
>         pass the packet to device loader
>             (while the loader works, userfaultfd does background magic)
> 
> But something is missing still, either some kind of ack is needed between
> device data sends or userfaultfd needs to be able to process device data
> packets.
> 
> Paolo
> 
> >Thus you need to be able to start up 'postcopy RAM' before 'finish devices'
> >has completed, and you can't do that if 'finish devices' is still stuffing
> >data down the fd.
> >
> >Now, if hypothetically you had:
> >  1) A migration format that let you separate out device state so that you
> >could load all the state of the device off the fd without calling the device
> >IO code.
> >  2) All devices were good and didn't touch guest memory while loading their
> >state.
> >
> >then you could avoid this complexity.  However, if you look at how Stefan's
> >BER code tried to do 1 (which I don't do in my way of doing it), it was by
> >using the same trick of stuffing the device data into a dummy memory file
> >to find out the size of the data.   And I'm not convinced (2) will happen
> >this century.
> >
> >>Paolo
> >
> >Dave
> >--
> >Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Dr. David Alan Gilbert July 16, 2014, 9:25 a.m. UTC | #5
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> Il 07/07/2014 16:35, Dr. David Alan Gilbert ha scritto:
> >* Paolo Bonzini (pbonzini@redhat.com) wrote:
> >>Il 04/07/2014 19:41, Dr. David Alan Gilbert (git) ha scritto:
> >>>From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >>>
> >>>Postcopy needs to have two migration streams loading concurrently;
> >>>one from memory (with the device state) and the other from the fd
> >>>with the memory transactions.
> >>
> >>Can you explain this?
> >>
> >>I would have though the order is
> >>
> >>    precopy RAM and everything
> >>    prepare postcopy RAM ("sent && dirty" bitmap)
> >>    finish precopy non-RAM
> >>    finish devices
> >>    postcopy RAM
> >>
> >>Why do you need to have all the packaging stuff and a separate memory-based
> >>migration stream for devices?  I'm sure I'm missing something. :)
> >
> >The thing you're missing is the details of 'finish devices'.
> >The device emulation may access guest memory as part of loading it's
> >state, so you can't successfully complete 'finish devices' without
> >having the 'postcopy RAM' available to provide pages.
> 
> I see.  Can you document the flow (preferrably as a reply to this email
> _and_ in docs/ when you send v2 of the code :))?

I thought I documented enough in the docs/migration.txt stuff in the last
patch (see the 'Postcopy states' section); however lets see if I the following
is better:

----
Postcopy stream

Loading of device data may cause the device emulation to access guest RAM
that may trigger faults that have to be resolved by the source, as such
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream completely
before the device load begins to free the stream up.  This is acheived by
'packaging' the device data into a blob that's read in one go.

Source behaviour

Until postcopy is entered the migration stream is identical to normal postcopy,
except for the addition of a 'postcopy advise' command at the beginning to
let the destination know that postcopy might happen.  When postcopy starts
the source sends the page discard data and then forms the 'package' containing:

   Command: 'postcopy ram listen'
   The device state
      A series of sections, identical to the precopy streams device state stream
      containing everything except postcopiable devices (i.e. RAM)
   Command: 'postcopy ram run'

The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
contents are formatted in the same way as the main migration stream.

Destination behaviour

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

------------------------------------------------------------------------------
                        1      2   3     4 5                      6   7
main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
thread                             |       |
                                   |     (page request)   
                                   |        \___
                                   v            \
listen thread:                     --- page -- page -- page -- page -- page --

                                   a   b        c
------------------------------------------------------------------------------

On receipt of CMD_PACKAGED (1)
   All the data associated with the package - the ( ... ) section in the
diagram - is read into memory (into a QEMUSizedBuffer), and the main thread
recurses into qemu_loadvm_state_main to process the contents of the package (2)
which contains commands (3,6) and devices (4...)

On receipt of 'postcopy ram listen' - 3 -(i.e. the 1st command in the package)
a new thread (a) is started that takes over servicing the migration stream,
while the main thread carries on loading the package.   It loads normal
background page data (b) but if during a device load a fault happens (5) the
returned page (c) is loaded by the listen thread allowing the main threads
device load to carry on.

The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
CPUs start running.
At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
and is no longer used by migration, while the listen thread carries
on servicing page data until the end of migration.

----

Is that any better?

Dave
P.S. I know of at least one bug in this code at the moment, it happens
on a VM that doesn't have many dirty pages where all the pages are
transmitted, and hence the listen thread finishes, before the main thread
gets to 'run'.

> From my cursory read of the code it is something like this on the source:
> 
>     finish precopy non-RAM
>     start RAM postcopy
>     for each device
>         pack up data
>         send it to destination
> 
> and on the destination:
> 
>     while source sends packet
>         pick up packet atomically
>         pass the packet to device loader
>             (while the loader works, userfaultfd does background magic)
> 
> But something is missing still, either some kind of ack is needed between
> device data sends or userfaultfd needs to be able to process device data
> packets.
> 
> Paolo
> 
> >Thus you need to be able to start up 'postcopy RAM' before 'finish devices'
> >has completed, and you can't do that if 'finish devices' is still stuffing
> >data down the fd.
> >
> >Now, if hypothetically you had:
> >  1) A migration format that let you separate out device state so that you
> >could load all the state of the device off the fd without calling the device
> >IO code.
> >  2) All devices were good and didn't touch guest memory while loading their
> >state.
> >
> >then you could avoid this complexity.  However, if you look at how Stefan's
> >BER code tried to do 1 (which I don't do in my way of doing it), it was by
> >using the same trick of stuffing the device data into a dummy memory file
> >to find out the size of the data.   And I'm not convinced (2) will happen
> >this century.
> >
> >>Paolo
> >
> >Dave
> >--
> >Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
diff mbox

Patch

diff --git a/savevm.c b/savevm.c
index 662a910..c277d77 100644
--- a/savevm.c
+++ b/savevm.c
@@ -915,6 +915,26 @@  static SaveStateEntry *find_se(const char *idstr, int instance_id)
     return NULL;
 }
 
+/* These are ORable flags */
+const int LOADVM_EXITCODE_QUITLOOP     =  1;
+const int LOADVM_EXITCODE_QUITPARENT   =  2;
+const int LOADVM_EXITCODE_KEEPHANDLERS =  4;
+
+typedef struct LoadStateEntry {
+    QLIST_ENTRY(LoadStateEntry) entry;
+    SaveStateEntry *se;
+    int section_id;
+    int version_id;
+} LoadStateEntry;
+
+typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
+
+static LoadStateEntry_Head loadvm_handlers =
+ QLIST_HEAD_INITIALIZER(loadvm_handlers);
+
+static int qemu_loadvm_state_main(QEMUFile *f,
+                                  LoadStateEntry_Head *loadvm_handlers);
+
 static int loadvm_process_command_simple_lencheck(const char *name,
                                                   unsigned int actual,
                                                   unsigned int expected)
@@ -930,8 +950,11 @@  static int loadvm_process_command_simple_lencheck(const char *name,
 
 /* Process an incoming 'QEMU_VM_COMMAND'
  * -ve return on error (will issue error message)
+ * 0   just a normal return
+ * 1   All good, but exit the loop
  */
-static int loadvm_process_command(QEMUFile *f)
+static int loadvm_process_command(QEMUFile *f,
+                                  LoadStateEntry_Head *loadvm_handlers)
 {
     MigrationIncomingState *mis = f->mis;
     uint16_t com;
@@ -981,39 +1004,13 @@  static int loadvm_process_command(QEMUFile *f)
     return 0;
 }
 
-typedef struct LoadStateEntry {
-    QLIST_ENTRY(LoadStateEntry) entry;
-    SaveStateEntry *se;
-    int section_id;
-    int version_id;
-} LoadStateEntry;
-
-int qemu_loadvm_state(QEMUFile *f)
+static int qemu_loadvm_state_main(QEMUFile *f,
+                                  LoadStateEntry_Head *loadvm_handlers)
 {
-    QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
-        QLIST_HEAD_INITIALIZER(loadvm_handlers);
-    LoadStateEntry *le, *new_le;
+    LoadStateEntry *le;
     uint8_t section_type;
-    unsigned int v;
     int ret;
-
-    if (qemu_savevm_state_blocked(NULL)) {
-        return -EINVAL;
-    }
-
-    v = qemu_get_be32(f);
-    if (v != QEMU_VM_FILE_MAGIC) {
-        return -EINVAL;
-    }
-
-    v = qemu_get_be32(f);
-    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-        error_report("SaveVM v2 format is obsolete and don't work anymore");
-        return -ENOTSUP;
-    }
-    if (v != QEMU_VM_FILE_VERSION) {
-        return -ENOTSUP;
-    }
+    int exitcode = 0;
 
     while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
         uint32_t instance_id, version_id, section_id;
@@ -1042,16 +1039,14 @@  int qemu_loadvm_state(QEMUFile *f)
             if (se == NULL) {
                 error_report("Unknown savevm section or instance '%s' %d",
                              idstr, instance_id);
-                ret = -EINVAL;
-                goto out;
+                return -EINVAL;
             }
 
             /* Validate version */
             if (version_id > se->version_id) {
                 error_report("savevm: unsupported version %d for '%s' v%d",
                         version_id, idstr, se->version_id);
-                ret = -EINVAL;
-                goto out;
+                return -EINVAL;
             }
 
             /* Add entry */
@@ -1060,14 +1055,14 @@  int qemu_loadvm_state(QEMUFile *f)
             le->se = se;
             le->section_id = section_id;
             le->version_id = version_id;
-            QLIST_INSERT_HEAD(&loadvm_handlers, le, entry);
+            QLIST_INSERT_HEAD(loadvm_handlers, le, entry);
 
             ret = vmstate_load(f, le->se, le->version_id);
             if (ret < 0) {
                 error_report("qemu: error while loading state for"
                              "instance 0x%x of device '%s'",
                              instance_id, idstr);
-                goto out;
+                return ret;
             }
             break;
         case QEMU_VM_SECTION_PART:
@@ -1075,47 +1070,84 @@  int qemu_loadvm_state(QEMUFile *f)
             section_id = qemu_get_be32(f);
 
             DPRINTF("QEMU_VM_SECTION_PART/END entry for id=%d", section_id);
-            QLIST_FOREACH(le, &loadvm_handlers, entry) {
+            QLIST_FOREACH(le, loadvm_handlers, entry) {
                 if (le->section_id == section_id) {
                     break;
                 }
             }
             if (le == NULL) {
                 error_report("Unknown savevm section %d", section_id);
-                ret = -EINVAL;
-                goto out;
+                return -EINVAL;
             }
 
             ret = vmstate_load(f, le->se, le->version_id);
             if (ret < 0) {
                 error_report("qemu: error while loading state section"
                              " id %d (%s)", section_id, le->se->idstr);
-                goto out;
+                return ret;
             }
             DPRINTF("QEMU_VM_SECTION_PART/END done for id=%d", section_id);
             break;
         case QEMU_VM_COMMAND:
-            ret = loadvm_process_command(f);
-            if (ret < 0) {
-                goto out;
+            ret = loadvm_process_command(f, loadvm_handlers);
+            DPRINTF("%s QEMU_VM_COMMAND ret: %d", __func__, ret);
+            if ((ret < 0) || (ret & LOADVM_EXITCODE_QUITLOOP)) {
+                return ret;
             }
+            exitcode |= ret; /* Lets us pass flags up to the parent */
             break;
         default:
             error_report("Unknown savevm section type %d", section_type);
-            ret = -EINVAL;
-            goto out;
+            return -EINVAL;
         }
     }
     DPRINTF("qemu_loadvm_state loop: exited loop");
 
-    cpu_synchronize_all_post_init();
+    if (exitcode & LOADVM_EXITCODE_QUITPARENT) {
+        DPRINTF("loadvm_handlers_state_main: End of loop with QUITPARENT");
+        exitcode &= ~LOADVM_EXITCODE_QUITPARENT;
+        exitcode &= LOADVM_EXITCODE_QUITLOOP;
+    }
+
+    return exitcode;
+}
+
+int qemu_loadvm_state(QEMUFile *f)
+{
+    LoadStateEntry *le, *new_le;
+    unsigned int v;
+    int ret;
+
+    if (qemu_savevm_state_blocked(NULL)) {
+        return -EINVAL;
+    }
+
+    v = qemu_get_be32(f);
+    if (v != QEMU_VM_FILE_MAGIC) {
+        return -EINVAL;
+    }
 
-    ret = 0;
+    v = qemu_get_be32(f);
+    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+        error_report("SaveVM v2 format is obsolete and don't work anymore");
+        return -ENOTSUP;
+    }
+    if (v != QEMU_VM_FILE_VERSION) {
+        return -ENOTSUP;
+    }
+
+    QLIST_INIT(&loadvm_handlers);
+    ret = qemu_loadvm_state_main(f, &loadvm_handlers);
 
-out:
-    QLIST_FOREACH_SAFE(le, &loadvm_handlers, entry, new_le) {
-        QLIST_REMOVE(le, entry);
-        g_free(le);
+    if (ret == 0) {
+        cpu_synchronize_all_post_init();
+    }
+
+    if ((ret < 0) || !(ret & LOADVM_EXITCODE_KEEPHANDLERS)) {
+        QLIST_FOREACH_SAFE(le, &loadvm_handlers, entry, new_le) {
+            QLIST_REMOVE(le, entry);
+            g_free(le);
+        }
     }
 
     if (ret == 0) {