diff mbox

[v7,01/42] Start documenting how postcopy works.

Message ID 1434450415-11339-2-git-send-email-dgilbert@redhat.com
State New
Headers show

Commit Message

Dr. David Alan Gilbert June 16, 2015, 10:26 a.m. UTC
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 docs/migration.txt | 167 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)

Comments

Juan Quintela June 17, 2015, 11:42 a.m. UTC | #1
"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  docs/migration.txt | 167 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 167 insertions(+)
>
> diff --git a/docs/migration.txt b/docs/migration.txt
> index f6df4be..b4b93d1 100644
> --- a/docs/migration.txt
> +++ b/docs/migration.txt
> @@ -291,3 +291,170 @@ save/send this state when we are in the middle of a pio operation
>  (that is what ide_drive_pio_state_needed() checks).  If DRQ_STAT is
>  not enabled, the values on that fields are garbage and don't need to
>  be sent.
> +
> += Return path =
> +
> +In most migration scenarios there is only a single data path that runs
> +from the source VM to the destination, typically along a single fd (although
> +possibly with another fd or similar for some fast way of throwing pages across).
> +
> +However, some uses need two way communication; in particular the Postcopy destination

This line is a bit long O:-)

In general, we are too near to the 80 columns limit.
Dr. David Alan Gilbert June 17, 2015, 12:30 p.m. UTC | #2
* Juan Quintela (quintela@redhat.com) wrote:
> "Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> > [...]
> > += Return path =
> > +
> > +In most migration scenarios there is only a single data path that runs
> > +from the source VM to the destination, typically along a single fd (although
> > +possibly with another fd or similar for some fast way of throwing pages across).
> > +
> > +However, some uses need two way communication; in particular the Postcopy destination
> 
> This line is a bit long O:-)
> 
> In general, we are too near to the 80 columns limit.

Thanks, fixed (interesting that checkpatch doesn't seem to moan about
text files).

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Li, Liang Z June 18, 2015, 7:50 a.m. UTC | #3
> [...]
> +
> += Return path =
> +
> +In most migration scenarios there is only a single data path that runs
> +from the source VM to the destination, typically along a single fd (although
> +possibly with another fd or similar for some fast way of throwing pages across).
> +
> +However, some uses need two way communication; in particular the Postcopy destination
> +needs to be able to request pages on demand from the source.
> +
> +For these scenarios there is a 'return path' from the destination to
> +the source;
> +qemu_file_get_return_path(QEMUFile* fwdpath) gives the QEMUFile* for
> +the return path.
> +
> +  Source side
> +     Forward path - written by migration thread
> +     Return path  - opened by main thread, read by return-path thread
> +
> +  Destination side
> +     Forward path - read by main thread
> +     Return path  - opened by main thread, written by main thread AND postcopy
> +                    thread (protected by rp_mutex)
> +
> += Postcopy =
> +'Postcopy' migration is a way to deal with migrations that refuse to converge;
> +its plus side is that there is an upper bound on the amount of migration traffic
> +and time it takes, the down side is that during the postcopy phase, a failure of
> +*either* side or the network connection causes the guest to be lost.

Hi David,

Do you have any idea or plan for dealing with a failure that happens during
the postcopy phase?

Losing the guest is too frightening for a cloud provider. We had a discussion
with Alibaba; they said that they can't use the postcopy feature unless there
is a mechanism to get the guest back.

Liang
Dr. David Alan Gilbert June 18, 2015, 8:10 a.m. UTC | #4
* Li, Liang Z (liang.z.li@intel.com) wrote:
> > [...]
> > += Postcopy =
> > +'Postcopy' migration is a way to deal with migrations that refuse to converge;
> > +its plus side is that there is an upper bound on the amount of migration traffic
> > +and time it takes, the down side is that during the postcopy phase, a failure of
> > +*either* side or the network connection causes the guest to be lost.
> 
> Hi David,
> 
> Do you have any idea or plan for dealing with a failure that happens during
> the postcopy phase?
> 
> Losing the guest is too frightening for a cloud provider. We had a discussion
> with Alibaba; they said that they can't use the postcopy feature unless there
> is a mechanism to get the guest back.

The VM memory image is still on the source host, so you can restart the
source; however, that's not safe, because once the destination has started
running it is sending out packets and also modifying the block storage.
If you restarted the source at that point, what block and net state could
you accept being visible?

Dave

> 
> Liang
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini June 18, 2015, 8:28 a.m. UTC | #5
On 18/06/2015 09:50, Li, Liang Z wrote:
> Do you have any idea or plan for dealing with a failure that happens
> during the postcopy phase?
> 
> Losing the guest is too frightening for a cloud provider. We had a
> discussion with Alibaba; they said that they can't use the postcopy
> feature unless there is a mechanism to get the guest back.

There's no solution to this problem, except for rollback to a previous
snapshot.

To give an idea, one intended use case for postcopy is datacenter
evacuation within 30 minutes of a tsunami alert.  That's not a case where
you care much about losing guests to network failures.

Why is there no solution?  Let's look at one of the best surveys on
migration,
http://courses.cs.vt.edu/~cs5204/fall05-kafura/Papers/Migration/ProcessMigration.pdf
(warning, 59 pages!):

  [3.2] If only part of the task state is transferred to another node,
  the task can start executing sooner, and the initial migration costs
  are lower.

  [3.4] Fault resilience can be improved in several ways. The impact of
  failures during migration can be reduced by maintaining process state
  on both the source and destination sites until the destination site
  instance is successfully promoted to a regular process and the source
  node is informed about this.

  [3.5] Migration algorithms should avoid linear dependencies on the
  amount of state to be transferred. For example, the eager data
  transfer strategy has costs proportional to the address space size

"Pre"copy means "start copying *before* promoting the destination to be
the primary host" and it has such a linear dependency on the amount of
state to be transferred. "Post"copy means "delay some copying to *after*
promoting the destination to be the primary host".

So we have:

                           Precopy            Postcopy
   3.2 Performance            - (1)             - (2)
   3.4 Fault resilience       +                 -
   3.5 Scalability            -                 +

      (1) smaller impact, longer freeze time
      (2) larger impact, extremely short freeze time

Postcopy can also limit the length of the non-resilient phase, by
starting with a precopy phase and only switching to postcopy after some
time.  Then you have:

                           Precopy        Hybrid      Postcopy
   3.2 Performance            - (1)          + (3)        - (2)
   3.4 Fault resilience       +              -            --
   3.5 Scalability            -              +            +

      (3) intermediate impact, extremely short freeze time

but there is still going to be a phase where migration is not resilient
to network faults.

Cloud operators can use a combination of precopy and postcopy.  For
example, I would not use postcopy for mass migration when doing
host updates, but it can be used as a last resort before a scheduled
downtime.

For example, say you're doing a rolling update and you want it complete
by next Sunday.  90% of the guests are shut down by the customers or can
be migrated successfully with precopy.  The others do not converge and
their SLA does not let you throttle them to complete precopy migration.

You then tell your customers that either they shut down and restart their
instances before Saturday 8:00 PM, or they might be shut down forcibly.
Then for customers who haven't rebooted you can do postcopy---you have
alerted them that something might go wrong.  So even though postcopy would
not be a first choice, it can still help cloud operators.

Paolo
Dr. David Alan Gilbert June 19, 2015, 5:52 p.m. UTC | #6
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 18/06/2015 09:50, Li, Liang Z wrote:
> > Do you have any idea or plan for dealing with a failure that happens
> > during the postcopy phase?
> > 
> > Losing the guest is too frightening for a cloud provider. We had a
> > discussion with Alibaba; they said that they can't use the postcopy
> > feature unless there is a mechanism to get the guest back.
> 
> There's no solution to this problem, except for rollback to a previous
> snapshot.

Yes, and you might be able to avoid some of the pain if you COW'd the disk
data on the destination until the migration had finished; that would allow
you to restart the source VM in the state prior to postcopy starting,
although the network's view of it is going to be very messy.

> To give an idea, one intended use case for postcopy is datacenter
> evacuation within 30 minutes of a tsunami alert.  That's not a case
> where you care much about losing guests to network failures.

Well, you have to make a call as to what your best option is; you could
always shut the VM down and boot it up fresh in your new safe data centre.
Your preference is determined by your confidence that your VM would boot
back up safely, how long that would take, your confidence in the network
during the migration period, and the pain of knowing what will happen if
you explicitly shut the VM down.

> Cloud operators can use a combination of precopy and postcopy.  For
> example, I would not use postcopy for mass migration when doing
> host updates, but it can be used as a last resort before a scheduled
> downtime.
> 
> For example, say you're doing a rolling update and you want it complete
> by next Sunday.  90% of the guests are shut down by the customers or can
> be migrated successfully with precopy.  The others do not converge and
> their SLA does not let you throttle them to complete precopy migration.

Indeed, the interface lets you do that pretty easily: as long as you have
enabled postcopy, migration starts in precopy mode and is fully recoverable
until you issue 'migrate_start_postcopy', which might be after it has tried
'n' times and you can see that the workload you have isn't going to converge.
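
(As a management-side sketch of that flow; every name below is invented
for illustration, none of it is a real QEMU or libvirt API:)

    #include <stdbool.h>
    #include <unistd.h>

    /* Assumed wrappers around the monitor commands discussed in this
     * thread; the names are hypothetical. */
    extern void monitor_set_capability(const char *cap, bool on);
    extern void monitor_migrate(const char *uri);
    extern bool migration_completed(void);
    extern void monitor_start_postcopy(void);

    void migrate_with_postcopy_fallback(const char *uri, int max_tries)
    {
        /* The capability must be set before migration starts. */
        monitor_set_capability("x-postcopy-ram", true);
        monitor_migrate(uri);

        for (int i = 0; i < max_tries; i++) {
            if (migration_completed()) {
                return;             /* precopy converged on its own */
            }
            sleep(1);               /* fully recoverable up to here */
        }

        /* Not converging: switch.  From this point on a failure of
         * either host or the network loses the guest. */
        monitor_start_postcopy();
    }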

Dave

> [...]
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Yang Hongyang June 26, 2015, 6:46 a.m. UTC | #7
Hi Dave,

On 06/16/2015 06:26 PM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
[...]
> += Postcopy =
> +'Postcopy' migration is a way to deal with migrations that refuse to converge;
> +its plus side is that there is an upper bound on the amount of migration traffic
> +and time it takes, the down side is that during the postcopy phase, a failure of
> +*either* side or the network connection causes the guest to be lost.
> +
> +In postcopy the destination CPUs are started before all the memory has been
> +transferred, and accesses to pages that are yet to be transferred cause
> +a fault that's translated by QEMU into a request to the source QEMU.

I have an immature idea:
Can we keep a source RAM cache on the destination QEMU, instead of requesting
pages from the source QEMU?  That is:
  - When start_postcopy is issued, the source is paused and __opens another
    socket (maybe another migration thread)__ to send the remaining dirty
    pages to the destination; at the same time, the destination starts
    running and caches the remaining pages.
  - When a page fault occurs, first look up the page in the cache; if it has
    not yet been received, request it from the source QEMU.
  - Once the remaining dirty pages are transferred, the source QEMU can go.

The existing postcopy mechanism does not need to be changed; just add the
remaining-page transfer mechanism and the RAM cache.

I don't know whether this is feasible and whether it would improve postcopy;
what do you think?
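
(A tiny sketch of the proposed fault handling, purely to illustrate the
idea; all of the names here are invented:)

    #include <stdbool.h>
    #include <stdint.h>

    extern bool cache_lookup_and_place(uint64_t page);    /* assumed */
    extern void request_page_from_source(uint64_t page);  /* assumed */

    /* On a fault, try the local cache of not-yet-placed pages first,
     * and only fall back to asking the source for pages that have not
     * arrived yet. */
    static void handle_missing_page(uint64_t page)
    {
        if (cache_lookup_and_place(page)) {
            return;                   /* page was already in the cache */
        }
        request_page_from_source(page);
    }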

> +
> +Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
> +doesn't finish in a given time the switch is made to postcopy.
> +
> +=== Enabling postcopy ===
> +
> +To enable postcopy (prior to the start of migration):
> +
> +migrate_set_capability x-postcopy-ram on
> +
> +The migration will still start in precopy mode, however issuing:
> +
> +migrate_start_postcopy
> +
> +will now cause the transition from precopy to postcopy.
> +It can be issued immediately after migration is started or any
> +time later on.  Issuing it after the end of a migration is harmless.
> +
> +=== Postcopy device transfer ===
> +
> +Loading of device data may cause the device emulation to access guest RAM
> +that may trigger faults that have to be resolved by the source, as such
> +the migration stream has to be able to respond with page data *during* the
> +device load, and hence the device data has to be read from the stream completely
> +before the device load begins to free the stream up.  This is achieved by
> +'packaging' the device data into a blob that's read in one go.
> +
> +Source behaviour
> +
> +Until postcopy is entered the migration stream is identical to normal
> +precopy, except for the addition of a 'postcopy advise' command at
> +the beginning, to tell the destination that postcopy might happen.
> +When postcopy starts the source sends the page discard data and then
> +forms the 'package' containing:
> +
> +   Command: 'postcopy listen'
> +   The device state
> +      A series of sections, identical to the precopy stream's device state stream
> +      containing everything except postcopiable devices (i.e. RAM)
> +   Command: 'postcopy run'
> +
> +The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
> +contents are formatted in the same way as the main migration stream.
> +
> +Destination behaviour
> +
> +Initially the destination looks the same as precopy, with a single thread
> +reading the migration stream; the 'postcopy advise' and 'discard' commands
> +are processed to change the way RAM is managed, but don't affect the stream
> +processing.
> +
> +------------------------------------------------------------------------------
> +                        1      2   3     4 5                      6   7
> +main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
> +thread                             |       |
> +                                   |     (page request)
> +                                   |        \___
> +                                   v            \
> +listen thread:                     --- page -- page -- page -- page -- page --
> +
> +                                   a   b        c
> +------------------------------------------------------------------------------
> +
> +On receipt of CMD_PACKAGED (1)
> +   All the data associated with the package - the ( ... ) section in the
> +diagram - is read into memory (into a QEMUSizedBuffer), and the main thread
> +recurses into qemu_loadvm_state_main to process the contents of the package (2)
> +which contains commands (3,6) and devices (4...)
> +
> +On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)
> +a new thread (a) is started that takes over servicing the migration stream,
> +while the main thread carries on loading the package.   It loads normal
> +background page data (b) but if during a device load a fault happens (5) the
> +returned page (c) is loaded by the listen thread, allowing the main thread's
> +device load to carry on.
> +
> +The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
> +CPUs start running.
> +At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
> +and is no longer used by migration, while the listen thread carries
> +on servicing page data until the end of migration.
> +
> +=== Postcopy states ===
> +
> +Postcopy moves through a series of states (see postcopy_state) from
> +ADVISE->LISTEN->RUNNING->END
> +
> +  Advise: Set at the start of migration if postcopy is enabled, even
> +          if it hasn't had the start command; here the destination
> +          checks that its OS has the support needed for postcopy, and performs
> +          setup to ensure the RAM mappings are suitable for later postcopy.
> +          (Triggered by reception of POSTCOPY_ADVISE command)
> +
> +  Listen: The first command in the package, POSTCOPY_LISTEN, switches
> +          the destination state to Listen, and starts a new thread
> +          (the 'listen thread') which takes over the job of receiving
> +          pages off the migration stream, while the main thread carries
> +          on processing the blob.  With this thread able to process page
> +          reception, the destination now 'sensitises' the RAM to detect
> +          any access to missing pages (on Linux using the 'userfault'
> +          system).
> +
> +  Running: POSTCOPY_RUN causes the destination to synchronise all
> +          state and start the CPUs and IO devices running.  The main
> +          thread now finishes processing the migration package and
> +          now carries on as it would for normal precopy migration
> +          (although it can't do the cleanup it would do as it
> +          finishes a normal migration).
> +
> +  End: The listen thread can now quit and perform the cleanup of migration
> +          state; the migration is now complete.
> +
> +=== Source side page maps ===
> +
> +The source side keeps two bitmaps during postcopy; 'the migration bitmap'
> +and 'sent map'.  The 'migration bitmap' is basically the same as in
> +the precopy case, and holds a bit to indicate that a page is 'dirty' -
> +i.e. needs sending.  During the precopy phase this is updated as the CPU
> +dirties pages, however during postcopy the CPUs are stopped and nothing
> +should dirty anything any more.
> +
> +The 'sent map' is used for the transition to postcopy. It is a bitmap that
> +has a bit set whenever a page is sent to the destination, however during
> +the transition to postcopy mode it is masked against the migration bitmap
> +(sentmap &= migrationbitmap) to generate a bitmap recording pages that
> +have previously been sent but are now dirty again.  This masked
> +sentmap is sent to the destination which discards those now dirty pages
> +before starting the CPUs.
> +
> +Note that the contents of the sentmap are sacrificed during the calculation
> +of the discard set and thus aren't valid once in postcopy.  The dirtymap
> +is still valid and is used to ensure that no page is sent more than once.  Any
> +request for a page that has already been sent is ignored.  Duplicate requests
> +such as this can happen as a page is sent at about the same time the
> +destination accesses it.
>
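
(The sentmap masking described at the end of the quoted text, written out
as a few lines of illustrative C; the array names follow the text, nothing
here is taken from the QEMU sources:)

    #include <stddef.h>
    #include <stdint.h>

    /* After this loop, 'sentmap' records exactly the pages that were
     * sent earlier but are dirty again - the discard set that gets
     * transmitted to the destination.  As the text notes, the original
     * contents of 'sentmap' are sacrificed. */
    static void build_discard_set(uint64_t *sentmap,
                                  const uint64_t *migration_bitmap,
                                  size_t words)
    {
        for (size_t i = 0; i < words; i++) {
            sentmap[i] &= migration_bitmap[i];
        }
    }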
Zhanghailiang June 26, 2015, 7:53 a.m. UTC | #8
On 2015/6/26 14:46, Yang Hongyang wrote:
> Hi Dave,
>
> On 06/16/2015 06:26 PM, Dr. David Alan Gilbert (git) wrote:
>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>
> [...]
>> += Postcopy =
>> +'Postcopy' migration is a way to deal with migrations that refuse to converge;
>> +its plus side is that there is an upper bound on the amount of migration traffic
>> +and time it takes, the down side is that during the postcopy phase, a failure of
>> +*either* side or the network connection causes the guest to be lost.
>> +
>> +In postcopy the destination CPUs are started before all the memory has been
>> +transferred, and accesses to pages that are yet to be transferred cause
>> +a fault that's translated by QEMU into a request to the source QEMU.
>
> I have an immature idea:
> Can we keep a source RAM cache on the destination QEMU, instead of
> requesting pages from the source QEMU?  That is:
>   - When start_postcopy is issued, the source is paused and __opens another
>     socket (maybe another migration thread)__ to send the remaining dirty
>     pages to the destination; at the same time, the destination starts
>     running and caches the remaining pages.

Er, it seems that the current implementation is just like what you described,
except for the RAM cache: after the switch to postcopy mode, the source side
sends the remaining dirty pages as in precopy.  It does not need any cache at
all; it just places the dirty pages where they will be accessed.
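
(For reference, on Linux this 'placing' is roughly what the 'userfault'
mechanism mentioned in the patch allows; a minimal sketch, assuming an
already-registered userfaultfd and omitting error handling:)

    #include <linux/userfaultfd.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Copy an incoming page straight into guest RAM: UFFDIO_COPY fills
     * the missing page atomically and wakes any vCPU blocked faulting
     * on it, which is why no separate cache is needed. */
    static int place_page(int uffd, void *guest_addr, void *incoming,
                          size_t pagesize)
    {
        struct uffdio_copy copy;

        memset(&copy, 0, sizeof(copy));
        copy.dst = (uintptr_t)guest_addr;   /* faulting guest address */
        copy.src = (uintptr_t)incoming;     /* page from the stream */
        copy.len = pagesize;
        return ioctl(uffd, UFFDIO_COPY, &copy);
    }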

> [...]
Yang Hongyang June 26, 2015, 8 a.m. UTC | #9
On 06/26/2015 03:53 PM, zhanghailiang wrote:
> On 2015/6/26 14:46, Yang Hongyang wrote:
>> [...]
>
> Er, it seems that the current implementation is just like what you described,
> except for the RAM cache: after the switch to postcopy mode, the source side
> sends the remaining dirty pages as in precopy.  It does not need any cache at
> all; it just places the dirty pages where they will be accessed.

I haven't looked into the implementation in detail, but if that is the case,
I think it should be documented here... or in the [Source behaviour] section
below.

> [...]
Dr. David Alan Gilbert June 26, 2015, 8:10 a.m. UTC | #10
* Yang Hongyang (yanghy@cn.fujitsu.com) wrote:
> 
> 
> On 06/26/2015 03:53 PM, zhanghailiang wrote:
> >On 2015/6/26 14:46, Yang Hongyang wrote:
> >>[...]
> >
> >Er, it seems that the current implementation is just like what you
> >described, except for the RAM cache: after the switch to postcopy mode,
> >the source side sends the remaining dirty pages as in precopy.  It does
> >not need any cache at all; it just places the dirty pages where they
> >will be accessed.

Yes, zhanghailiang is correct; the source keeps sending other pages without
being asked, but when asked it sends the requested pages immediately, and the
'cache' is just the main memory from which the destination is working.

However, the idea of using a separate socket is one that we have been thinking
about; one of the problems is that urgently requested pages get delayed behind
the background page transfer, and that increases the latency; a separate
socket should fix that.

> I haven't looked into the implementation in detail, but if that is the
> case, I think it should be documented here... or in the [Source behaviour]
> section below.

Yes, I can add to the documentation; I've added the following text:
  
  During postcopy the source scans the list of dirty pages and sends them
  to the destination without being requested (in much the same way as precopy),
  however when a page request is received from the destination the dirty page
  scanning restarts from the requested location.  This causes requested pages
  to be sent quickly, and also causes pages directly after the requested page
  to be sent quickly in the hope that those pages are likely to be requested
  by the destination soon.
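
(Roughly, that behaves like the following sketch; the names here are made
up for illustration, not the actual QEMU code:)

    #include <stdbool.h>
    #include <stddef.h>

    #define NR_PAGES 4096
    static bool dirty[NR_PAGES];  /* stand-in for the migration bitmap */
    static size_t cursor;         /* where the background scan resumes */

    /* Called off the return path: restart the scan at the requested
     * page so it goes out next; a request for an already-sent (clean)
     * page is simply ignored. */
    void on_page_request(size_t page)
    {
        if (dirty[page]) {
            cursor = page;
        }
    }

    /* Background transfer: send the next dirty page at or after the
     * cursor, wrapping around; pages just after a requested page are
     * therefore sent soon after it. */
    bool send_next_dirty_page(void)
    {
        for (size_t i = 0; i < NR_PAGES; i++) {
            size_t page = (cursor + i) % NR_PAGES;
            if (dirty[page]) {
                dirty[page] = false;
                /* send_page(page) would transmit it here */
                cursor = page + 1;
                return true;
            }
        }
        return false;             /* nothing left to send */
    }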
  
Dave

> [...]
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Yang Hongyang June 26, 2015, 8:19 a.m. UTC | #11
On 06/26/2015 04:10 PM, Dr. David Alan Gilbert wrote:
> * Yang Hongyang (yanghy@cn.fujitsu.com) wrote:
>>
>> On 06/26/2015 03:53 PM, zhanghailiang wrote:
>>> On 2015/6/26 14:46, Yang Hongyang wrote:
>>>> [...]
>>>
>>> Er, it seems that the current implementation is just like what you
>>> described, except for the RAM cache: after the switch to postcopy mode,
>>> the source side sends the remaining dirty pages as in precopy.  It does
>>> not need any cache at all; it just places the dirty pages where they
>>> will be accessed.
>
> Yes, zhanghailiang is correct; the source keeps sending other pages without
> being asked, but when asked it sends the requested pages immediately, and the
> 'cache' is just the main memory from which the destination is working.
>
> However, the idea of using a separate socket is one that we have been thinking
> about; one of the problems is that urgently requested pages get delayed behind
> the background page transfer, and that increases the latency; a separate
> socket should fix that.

That would be better.

>
>> I haven't looked into the implementation in detail, but if that is the
>> case, I think it should be documented here... or in the [Source behaviour]
>> section below.
>
> Yes, I can add to the documentation; I've added the following text:
>
>    During postcopy the source scans the list of dirty pages and sends them
>    to the destination without being requested (in much the same way as precopy),
>    however when a page request is received from the destination the dirty page
>    scanning restarts from the requested location.  This causes requested pages
>    to be sent quickly, and also causes pages directly after the requested page
>    to be sent quickly in the hope that those pages are likely to be requested
>    by the destination soon.

Looks clearer to me now :)

>
> Dave
>
>>>
>>>>   - When the page fault occured, first lookup the page in the CACHE, if it is not
>>>>     yet received, request to the source QEMU.
>>>>   - Once the remaining dirty pages are transfered, the source QEMU can go now.
>>>>
>>>> The existing postcopy mechanism does not need to be changed, just add the
>>>> remaining page transfer mechanism, and the RAM cache.
>>>>
>>>> I don't know if it is feasible and whether it will bring improvement to the
>>>> postcopy, what do you think?
>>>>
>>>>> +
>>>>> +Postcopy can be combined with precopy (i.e. normal migration) so that if
>>>>> precopy
>>>>> +doesn't finish in a given time the switch is made to postcopy.
>>>>> +
>>>>> +=== Enabling postcopy ===
>>>>> +
>>>>> +To enable postcopy (prior to the start of migration):
>>>>> +
>>>>> +migrate_set_capability x-postcopy-ram on
>>>>> +
>>>>> +The migration will still start in precopy mode, however issuing:
>>>>> +
>>>>> +migrate_start_postcopy
>>>>> +
>>>>> +will now cause the transition from precopy to postcopy.
>>>>> +It can be issued immediately after migration is started or any
>>>>> +time later on.  Issuing it after the end of a migration is harmless.
>>>>> +
>>>>> +=== Postcopy device transfer ===
>>>>> +
>>>>> +Loading of device data may cause the device emulation to access guest RAM
>>>>> +that may trigger faults that have to be resolved by the source, as such
>>>>> +the migration stream has to be able to respond with page data *during* the
>>>>> +device load, and hence the device data has to be read from the stream
>>>>> completely
>>>>> +before the device load begins to free the stream up.  This is achieved by
>>>>> +'packaging' the device data into a blob that's read in one go.
>>>>> +
>>>>> +Source behaviour
>>>>> +
>>>>> +Until postcopy is entered the migration stream is identical to normal
>>>>> +precopy, except for the addition of a 'postcopy advise' command at
>>>>> +the beginning, to tell the destination that postcopy might happen.
>>>>> +When postcopy starts the source sends the page discard data and then
>>>>> +forms the 'package' containing:
>>>>> +
>>>>> +   Command: 'postcopy listen'
>>>>> +   The device state
>>>>> +      A series of sections, identical to the precopy streams device state
>>>>> stream
>>>>> +      containing everything except postcopiable devices (i.e. RAM)
>>>>> +   Command: 'postcopy run'
>>>>> +
>>>>> +The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
>>>>> +contents are formatted in the same way as the main migration stream.
>>>>> +
>>>>> +Destination behaviour
>>>>> +
>>>>> +Initially the destination looks the same as precopy, with a single thread
>>>>> +reading the migration stream; the 'postcopy advise' and 'discard' commands
>>>>> +are processed to change the way RAM is managed, but don't affect the stream
>>>>> +processing.
>>>>> +
>>>>> +------------------------------------------------------------------------------
>>>>> +                        1      2   3     4 5                      6   7
>>>>> +main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
>>>>> +thread                             |       |
>>>>> +                                   |     (page request)
>>>>> +                                   |        \___
>>>>> +                                   v            \
>>>>> +listen thread:                     --- page -- page -- page -- page -- page --
>>>>> +
>>>>> +                                   a   b        c
>>>>> +------------------------------------------------------------------------------
>>>>> +
>>>>> +On receipt of CMD_PACKAGED (1)
>>>>> +   All the data associated with the package - the ( ... ) section in the
>>>>> +diagram - is read into memory (into a QEMUSizedBuffer), and the main thread
>>>>> +recurses into qemu_loadvm_state_main to process the contents of the package (2)
>>>>> +which contains commands (3,6) and devices (4...)
>>>>> +
>>>>> +On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)
>>>>> +a new thread (a) is started that takes over servicing the migration stream,
>>>>> +while the main thread carries on loading the package.  It loads normal
>>>>> +background page data (b), but if a fault happens during a device load (5) the
>>>>> +returned page (c) is loaded by the listen thread, allowing the main thread's
>>>>> +device load to carry on.
>>>>> +
>>>>> +The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
>>>>> +CPUs start running.
>>>>> +At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
>>>>> +and is no longer used by migration, while the listen thread carries
>>>>> +on servicing page data until the end of migration.
>>>>> +
>>>>> +=== Postcopy states ===
>>>>> +
>>>>> +Postcopy moves through a series of states (see postcopy_state) from
>>>>> +ADVISE->LISTEN->RUNNING->END
>>>>> +
>>>>> +  Advise: Set at the start of migration if postcopy is enabled, even
>>>>> +          if it hasn't had the start command; here the destination
>>>>> +          checks that its OS has the support needed for postcopy, and performs
>>>>> +          setup to ensure the RAM mappings are suitable for later postcopy.
>>>>> +          (Triggered by reception of POSTCOPY_ADVISE command)
>>>>> +
>>>>> +  Listen: The first command in the package, POSTCOPY_LISTEN, switches
>>>>> +          the destination state to Listen, and starts a new thread
>>>>> +          (the 'listen thread') which takes over the job of receiving
>>>>> +          pages off the migration stream, while the main thread carries
>>>>> +          on processing the blob.  With this thread able to process page
>>>>> +          reception, the destination now 'sensitises' the RAM to detect
>>>>> +          any access to missing pages (on Linux using the 'userfault'
>>>>> +          system).
>>>>> +
>>>>> +  Running: POSTCOPY_RUN causes the destination to synchronise all
>>>>> +          state and start the CPUs and IO devices running.  The main
>>>>> +          thread now finishes processing the migration package and
>>>>> +          now carries on as it would for normal precopy migration
>>>>> +          (although it can't do the cleanup it would do as it
>>>>> +          finishes a normal migration).
>>>>> +
>>>>> +  End: The listen thread can now quit and perform the cleanup of migration
>>>>> +          state; the migration is now complete.
>>>>> +
>>>>> +=== Source side page maps ===
>>>>> +
>>>>> +The source side keeps two bitmaps during postcopy; 'the migration bitmap'
>>>>> +and 'sent map'.  The 'migration bitmap' is basically the same as in
>>>>> +the precopy case, and holds a bit to indicate that a page is 'dirty' -
>>>>> +i.e. needs sending.  During the precopy phase this is updated as the CPU
>>>>> +dirties pages, however during postcopy the CPUs are stopped and nothing
>>>>> +should dirty anything any more.
>>>>> +
>>>>> +The 'sent map' is used for the transition to postcopy. It is a bitmap that
>>>>> +has a bit set whenever a page is sent to the destination, however during
>>>>> +the transition to postcopy mode it is masked against the migration bitmap
>>>>> +(sentmap &= migrationbitmap) to generate a bitmap recording pages that
>>>>> +have previously been sent but are now dirty again.  This masked
>>>>> +sentmap is sent to the destination which discards those now dirty pages
>>>>> +before starting the CPUs.
>>>>> +
>>>>> +Note that the contents of the sentmap are sacrificed during the calculation
>>>>> +of the discard set and thus aren't valid once in postcopy.  The dirtymap
>>>>> +is still valid and is used to ensure that no page is sent more than once.  Any
>>>>> +request for a page that has already been sent is ignored.  Duplicate requests
>>>>> +such as this can happen as a page is sent at about the same time the
>>>>> +destination accesses it.
>>>>>
>>>>
>>>
>>>
>>> .
>>>
>>
>> --
>> Thanks,
>> Yang.
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> .
>
Amit Shah Aug. 4, 2015, 5:20 a.m. UTC | #12
On (Tue) 16 Jun 2015 [11:26:14], Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Amit Shah <amit.shah@redhat.com>

A few minor comments:

> ---
>  docs/migration.txt | 167 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 167 insertions(+)
> 
> diff --git a/docs/migration.txt b/docs/migration.txt
> index f6df4be..b4b93d1 100644
> --- a/docs/migration.txt
> +++ b/docs/migration.txt
> @@ -291,3 +291,170 @@ save/send this state when we are in the middle of a pio operation
>  (that is what ide_drive_pio_state_needed() checks).  If DRQ_STAT is
>  not enabled, the values on that fields are garbage and don't need to
>  be sent.
> +
> += Return path =
> +
> +In most migration scenarios there is only a single data path that runs
> +from the source VM to the destination, typically along a single fd (although
> +possibly with another fd or similar for some fast way of throwing pages across).
> +
> +However, some uses need two way communication; in particular the Postcopy destination
> +needs to be able to request pages on demand from the source.
> +
> +For these scenarios there is a 'return path' from the destination to the source;
> +qemu_file_get_return_path(QEMUFile* fwdpath) gives the QEMUFile* for the return
> +path.
> +
> +  Source side
> +     Forward path - written by migration thread
> +     Return path  - opened by main thread, read by return-path thread
> +
> +  Destination side
> +     Forward path - read by main thread
> +     Return path  - opened by main thread, written by main thread AND postcopy
> +                    thread (protected by rp_mutex)
> +
> += Postcopy =
> +'Postcopy' migration is a way to deal with migrations that refuse to converge;

(or take too long to converge)

> +its plus side is that there is an upper bound on the amount of migration traffic
> +and time it takes; the down side is that during the postcopy phase, a failure of
> +*either* side or the network connection causes the guest to be lost.
> +
> +In postcopy the destination CPUs are started before all the memory has been
> +transferred, and accesses to pages that are yet to be transferred cause
> +a fault that's translated by QEMU into a request to the source QEMU.
> +
> +Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
> +doesn't finish in a given time the switch is made to postcopy.
> +
> +=== Enabling postcopy ===
> +
> +To enable postcopy (prior to the start of migration):

How about this instead:

"To enable postcopy, issue this command ont he monitor prior to the
start of migration:"

Otherwise, it's ambiguous whether there is some way to enable this
after a precopy migration has started.

> +
> +migrate_set_capability x-postcopy-ram on
> +
> +The migration will still start in precopy mode, however issuing:

"A future migration will then start in precopy mode.  However,
issuing:"

?

> +
> +migrate_start_postcopy
> +
> +will now cause the transition from precopy to postcopy.
> +It can be issued immediately after migration is started or any
> +time later on.  Issuing it after the end of a migration is harmless.
> +
> +=== Postcopy device transfer ===
> +
> +Loading of device data may cause the device emulation to access guest RAM
> +that may trigger faults that have to be resolved by the source, as such
> +the migration stream has to be able to respond with page data *during* the
> +device load, and hence the device data has to be read from the stream completely
> +before the device load begins to free the stream up.  This is achieved by
> +'packaging' the device data into a blob that's read in one go.
> +
> +Source behaviour
> +
> +Until postcopy is entered the migration stream is identical to normal
> +precopy, except for the addition of a 'postcopy advise' command at
> +the beginning, to tell the destination that postcopy might happen.
> +When postcopy starts the source sends the page discard data and then
> +forms the 'package' containing:
> +
> +   Command: 'postcopy listen'
> +   The device state
> +      A series of sections, identical to the precopy stream's device state
> +      stream, containing everything except postcopiable devices (i.e. RAM)
> +   Command: 'postcopy run'
> +
> +The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
> +contents are formatted in the same way as the main migration stream.
> +
> +Destination behaviour
> +
> +Initially the destination looks the same as precopy, with a single thread
> +reading the migration stream; the 'postcopy advise' and 'discard' commands
> +are processed to change the way RAM is managed, but don't affect the stream
> +processing.
> +
> +------------------------------------------------------------------------------
> +                        1      2   3     4 5                      6   7
> +main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
> +thread                             |       |
> +                                   |     (page request)
> +                                   |        \___
> +                                   v            \
> +listen thread:                     --- page -- page -- page -- page -- page --
> +
> +                                   a   b        c
> +------------------------------------------------------------------------------
> +
> +On receipt of CMD_PACKAGED (1)
> +   All the data associated with the package - the ( ... ) section in the
> +diagram - is read into memory (into a QEMUSizedBuffer), and the main thread
> +recurses into qemu_loadvm_state_main to process the contents of the package (2)
> +which contains commands (3,6) and devices (4...)
> +
> +On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)
> +a new thread (a) is started that takes over servicing the migration stream,
> +while the main thread carries on loading the package.  It loads normal
> +background page data (b), but if a fault happens during a device load (5) the
> +returned page (c) is loaded by the listen thread, allowing the main thread's
> +device load to carry on.
> +
> +The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
> +CPUs start running.
> +At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
> +and is no longer used by migration, while the listen thread carries
> +on servicing page data until the end of migration.
> +
> +=== Postcopy states ===
> +
> +Postcopy moves through a series of states (see postcopy_state) from
> +ADVISE->LISTEN->RUNNING->END
> +
> +  Advise: Set at the start of migration if postcopy is enabled, even
> +          if it hasn't had the start command; here the destination
> +          checks that its OS has the support needed for postcopy, and performs
> +          setup to ensure the RAM mappings are suitable for later postcopy.
> +          (Triggered by reception of POSTCOPY_ADVISE command)

Adding:

"This gives the destination a chance to fail early if postcopy is not
possible."

?

> +
> +  Listen: The first command in the package, POSTCOPY_LISTEN, switches
> +          the destination state to Listen, and starts a new thread
> +          (the 'listen thread') which takes over the job of receiving
> +          pages off the migration stream, while the main thread carries
> +          on processing the blob.  With this thread able to process page
> +          reception, the destination now 'sensitises' the RAM to detect
> +          any access to missing pages (on Linux using the 'userfault'
> +          system).
> +
> +  Running: POSTCOPY_RUN causes the destination to synchronise all
> +          state and start the CPUs and IO devices running.  The main
> +          thread now finishes processing the migration package and
> +          now carries on as it would for normal precopy migration
> +          (although it can't do the cleanup it would do as it
> +          finishes a normal migration).

indentation went off a bit



		Amit
Dr. David Alan Gilbert Aug. 5, 2015, 12:21 p.m. UTC | #13
* Amit Shah (amit.shah@redhat.com) wrote:
> On (Tue) 16 Jun 2015 [11:26:14], Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Reviewed-by: Amit Shah <amit.shah@redhat.com>
> 
> A few minor comments:
> 
> > ---
> >  docs/migration.txt | 167 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 167 insertions(+)
> > 
> > diff --git a/docs/migration.txt b/docs/migration.txt
> > index f6df4be..b4b93d1 100644
> > --- a/docs/migration.txt
> > +++ b/docs/migration.txt
> > @@ -291,3 +291,170 @@ save/send this state when we are in the middle of a pio operation
> >  (that is what ide_drive_pio_state_needed() checks).  If DRQ_STAT is
> >  not enabled, the values on that fields are garbage and don't need to
> >  be sent.
> > +
> > += Return path =
> > +
> > +In most migration scenarios there is only a single data path that runs
> > +from the source VM to the destination, typically along a single fd (although
> > +possibly with another fd or similar for some fast way of throwing pages across).
> > +
> > +However, some uses need two way communication; in particular the Postcopy destination
> > +needs to be able to request pages on demand from the source.
> > +
> > +For these scenarios there is a 'return path' from the destination to the source;
> > +qemu_file_get_return_path(QEMUFile* fwdpath) gives the QEMUFile* for the return
> > +path.
> > +
> > +  Source side
> > +     Forward path - written by migration thread
> > +     Return path  - opened by main thread, read by return-path thread
> > +
> > +  Destination side
> > +     Forward path - read by main thread
> > +     Return path  - opened by main thread, written by main thread AND postcopy
> > +                    thread (protected by rp_mutex)
> > +
> > += Postcopy =
> > +'Postcopy' migration is a way to deal with migrations that refuse to converge;
> 
> (or take too long to converge)

Added.

> 
> > +its plus side is that there is an upper bound on the amount of migration traffic
> > +and time it takes; the down side is that during the postcopy phase, a failure of
> > +*either* side or the network connection causes the guest to be lost.
> > +
> > +In postcopy the destination CPUs are started before all the memory has been
> > +transferred, and accesses to pages that are yet to be transferred cause
> > +a fault that's translated by QEMU into a request to the source QEMU.
> > +
> > +Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
> > +doesn't finish in a given time the switch is made to postcopy.
> > +
> > +=== Enabling postcopy ===
> > +
> > +To enable postcopy (prior to the start of migration):
> 
> How about this instead:
> 
> "To enable postcopy, issue this command ont he monitor prior to the
> start of migration:"
> 
> Otherwise, it's ambiguous whether there is some way to enable this
> after a precopy migration has started.

Done.

> > +
> > +migrate_set_capability x-postcopy-ram on
> > +
> > +The migration will still start in precopy mode, however issuing:
> 
> "A future migration will then start in precopy mode.  However,
> issuing:"
> 
> ?

Ah yes, I see it's ambiguous because it doesn't say you still need
to do the normal migration stuff to start migration;

I've changed it to:

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

migrate_start_postcopy

will now cause the transition from precopy to postcopy.

> > +
> > +migrate_start_postcopy
> > +
> > +will now cause the transition from precopy to postcopy.
> > +It can be issued immediately after migration is started or any
> > +time later on.  Issuing it after the end of a migration is harmless.
> > +
> > +=== Postcopy device transfer ===
> > +
> > +Loading of device data may cause the device emulation to access guest RAM
> > +that may trigger faults that have to be resolved by the source, as such
> > +the migration stream has to be able to respond with page data *during* the
> > +device load, and hence the device data has to be read from the stream completely
> > +before the device load begins to free the stream up.  This is achieved by
> > +'packaging' the device data into a blob that's read in one go.
> > +
> > +Source behaviour
> > +
> > +Until postcopy is entered the migration stream is identical to normal
> > +precopy, except for the addition of a 'postcopy advise' command at
> > +the beginning, to tell the destination that postcopy might happen.
> > +When postcopy starts the source sends the page discard data and then
> > +forms the 'package' containing:
> > +
> > +   Command: 'postcopy listen'
> > +   The device state
> > +      A series of sections, identical to the precopy stream's device state
> > +      stream, containing everything except postcopiable devices (i.e. RAM)
> > +   Command: 'postcopy run'
> > +
> > +The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
> > +contents are formatted in the same way as the main migration stream.
> > +
> > +Destination behaviour
> > +
> > +Initially the destination looks the same as precopy, with a single thread
> > +reading the migration stream; the 'postcopy advise' and 'discard' commands
> > +are processed to change the way RAM is managed, but don't affect the stream
> > +processing.
> > +
> > +------------------------------------------------------------------------------
> > +                        1      2   3     4 5                      6   7
> > +main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
> > +thread                             |       |
> > +                                   |     (page request)
> > +                                   |        \___
> > +                                   v            \
> > +listen thread:                     --- page -- page -- page -- page -- page --
> > +
> > +                                   a   b        c
> > +------------------------------------------------------------------------------
> > +
> > +On receipt of CMD_PACKAGED (1)
> > +   All the data associated with the package - the ( ... ) section in the
> > +diagram - is read into memory (into a QEMUSizedBuffer), and the main thread
> > +recurses into qemu_loadvm_state_main to process the contents of the package (2)
> > +which contains commands (3,6) and devices (4...)
> > +
> > +On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)
> > +a new thread (a) is started that takes over servicing the migration stream,
> > +while the main thread carries on loading the package.  It loads normal
> > +background page data (b), but if a fault happens during a device load (5) the
> > +returned page (c) is loaded by the listen thread, allowing the main thread's
> > +device load to carry on.
> > +
> > +The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
> > +CPUs start running.
> > +At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
> > +and is no longer used by migration, while the listen thread carries
> > +on servicing page data until the end of migration.
> > +
> > +=== Postcopy states ===
> > +
> > +Postcopy moves through a series of states (see postcopy_state) from
> > +ADVISE->LISTEN->RUNNING->END
> > +
> > +  Advise: Set at the start of migration if postcopy is enabled, even
> > +          if it hasn't had the start command; here the destination
> > +          checks that its OS has the support needed for postcopy, and performs
> > +          setup to ensure the RAM mappings are suitable for later postcopy.
> > +          (Triggered by reception of POSTCOPY_ADVISE command)
> 
> Adding:
> 
> "This gives the destination a chance to fail early if postcopy is not
> possible."
> 
> ?

I added:
 "The destination will fail early in migration at this point if the
  required OS support is not present.  "


> > +
> > +  Listen: The first command in the package, POSTCOPY_LISTEN, switches
> > +          the destination state to Listen, and starts a new thread
> > +          (the 'listen thread') which takes over the job of receiving
> > +          pages off the migration stream, while the main thread carries
> > +          on processing the blob.  With this thread able to process page
> > +          reception, the destination now 'sensitises' the RAM to detect
> > +          any access to missing pages (on Linux using the 'userfault'
> > +          system).
> > +
> > +  Running: POSTCOPY_RUN causes the destination to synchronise all
> > +          state and start the CPUs and IO devices running.  The main
> > +          thread now finishes processing the migration package and
> > +          now carries on as it would for normal precopy migration
> > +          (although it can't do the cleanup it would do as it
> > +          finishes a normal migration).
> 
> indentation went off a bit

Fixed.

Thanks,

Dave

> 
> 
> 		Amit
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Patch

diff --git a/docs/migration.txt b/docs/migration.txt
index f6df4be..b4b93d1 100644
--- a/docs/migration.txt
+++ b/docs/migration.txt
@@ -291,3 +291,170 @@  save/send this state when we are in the middle of a pio operation
 (that is what ide_drive_pio_state_needed() checks).  If DRQ_STAT is
 not enabled, the values on that fields are garbage and don't need to
 be sent.
+
+= Return path =
+
+In most migration scenarios there is only a single data path that runs
+from the source VM to the destination, typically along a single fd (although
+possibly with another fd or similar for some fast way of throwing pages across).
+
+However, some uses need two way communication; in particular the Postcopy destination
+needs to be able to request pages on demand from the source.
+
+For these scenarios there is a 'return path' from the destination to the source;
+qemu_file_get_return_path(QEMUFile* fwdpath) gives the QEMUFile* for the return
+path.
+
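+As an illustrative sketch of using it (only qemu_file_get_return_path()
+is the real API here; the message layout and mutex usage are simplified),
+a destination-side sender might do:
+
+    QEMUFile *rp = qemu_file_get_return_path(from_src_file);
+    if (rp) {
+        qemu_mutex_lock(&rp_mutex);        /* serialise the two writers */
+        qemu_put_be16(rp, request_type);   /* e.g. a page request */
+        qemu_put_be64(rp, start_address);
+        qemu_fflush(rp);
+        qemu_mutex_unlock(&rp_mutex);
+    }
+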
+  Source side
+     Forward path - written by migration thread
+     Return path  - opened by main thread, read by return-path thread
+
+  Destination side
+     Forward path - read by main thread
+     Return path  - opened by main thread, written by main thread AND postcopy
+                    thread (protected by rp_mutex)
+
+= Postcopy =
+'Postcopy' migration is a way to deal with migrations that refuse to converge;
+its plus side is that there is an upper bound on the amount of migration traffic
+and time it takes; the down side is that during the postcopy phase, a failure of
+*either* side or the network connection causes the guest to be lost.
+
+In postcopy the destination CPUs are started before all the memory has been
+transferred, and accesses to pages that are yet to be transferred cause
+a fault that's translated by QEMU into a request to the source QEMU.
+
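+As a rough sketch of that translation (using the Linux userfaultfd API;
+the request helper is hypothetical), the destination's fault-handling
+thread does something like:
+
+    struct uffd_msg msg;
+    read(userfault_fd, &msg, sizeof(msg));    /* blocks until a fault */
+    if (msg.event == UFFD_EVENT_PAGEFAULT) {
+        /* translate the faulting host address into a RAMBlock/offset
+         * and ask the source for that page over the return path */
+        request_page_from_source(msg.arg.pagefault.address);
+    }
+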
+Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
+doesn't finish in a given time the switch is made to postcopy.
+
+=== Enabling postcopy ===
+
+To enable postcopy (prior to the start of migration):
+
+migrate_set_capability x-postcopy-ram on
+
+The migration will still start in precopy mode; however, issuing:
+
+migrate_start_postcopy
+
+will now cause the transition from precopy to postcopy.
+It can be issued immediately after migration is started or any
+time later on.  Issuing it after the end of a migration is harmless.
+
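+A typical monitor session therefore looks something like this (the
+destination URI is just an example):
+
+(qemu) migrate_set_capability x-postcopy-ram on
+(qemu) migrate -d tcp:destination.example.com:4444
+(qemu) migrate_start_postcopy
+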
+=== Postcopy device transfer ===
+
+Loading of device data may cause the device emulation to access guest RAM
+that may trigger faults that have to be resolved by the source; as such,
+the migration stream has to be able to respond with page data *during* the
+device load, and hence the device data has to be read from the stream
+completely before the device load begins, to free the stream up.  This is
+achieved by 'packaging' the device data into a blob that's read in one go.
+
+Source behaviour
+
+Until postcopy is entered the migration stream is identical to normal
+precopy, except for the addition of a 'postcopy advise' command at
+the beginning, to tell the destination that postcopy might happen.
+When postcopy starts the source sends the page discard data and then
+forms the 'package' containing:
+
+   Command: 'postcopy listen'
+   The device state
+      A series of sections, identical to the precopy stream's device state
+      stream, containing everything except postcopiable devices (i.e. RAM)
+   Command: 'postcopy run'
+
+The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
+contents are formatted in the same way as the main migration stream.
+
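+As a sketch of how the source builds it (the qsb/bufopen helpers are the
+in-memory buffer API; the send_* names are illustrative):
+
+    QEMUSizedBuffer *qsb = qsb_create(NULL, 0);
+    QEMUFile *pkg = qemu_bufopen("w", qsb);   /* in-memory stream */
+    send_postcopy_listen_command(pkg);
+    save_device_state(pkg);                   /* everything except RAM */
+    send_postcopy_run_command(pkg);
+    /* ship the whole buffer as the data of a single CMD_PACKAGED */
+    send_packaged_command(to_dst_file, qsb);
+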
+Destination behaviour
+
+Initially the destination looks the same as precopy, with a single thread
+reading the migration stream; the 'postcopy advise' and 'discard' commands
+are processed to change the way RAM is managed, but don't affect the stream
+processing.
+
+------------------------------------------------------------------------------
+                        1      2   3     4 5                      6   7
+main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
+thread                             |       |
+                                   |     (page request)
+                                   |        \___
+                                   v            \
+listen thread:                     --- page -- page -- page -- page -- page --
+
+                                   a   b        c
+------------------------------------------------------------------------------
+
+On receipt of CMD_PACKAGED (1)
+   All the data associated with the package - the ( ... ) section in the
+diagram - is read into memory (into a QEMUSizedBuffer), and the main thread
+recurses into qemu_loadvm_state_main to process the contents of the package (2)
+which contains commands (3,6) and devices (4...)
+
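+In code terms this is roughly (qemu_loadvm_state_main is the function
+named above; the buffer-open helper is hypothetical):
+
+    length = qemu_get_be32(f);                /* size of the package */
+    qemu_get_buffer(f, buffer, length);       /* read it all in one go */
+    QEMUFile *pkg = open_buffer_as_qemufile(buffer, length);
+    qemu_loadvm_state_main(pkg, mis);   /* recurse: LISTEN, devices, RUN */
+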
+On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)
+a new thread (a) is started that takes over servicing the migration stream,
+while the main thread carries on loading the package.  It loads normal
+background page data (b), but if a fault happens during a device load (5) the
+returned page (c) is loaded by the listen thread, allowing the main thread's
+device load to carry on.
+
+The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
+CPUs start running.
+At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
+and is no longer used by migration, while the listen thread carries
+on servicing page data until the end of migration.
+
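+For reference, the listen thread places each received page into the
+'sensitised' RAM with an atomic copy that also wakes any vCPU waiting on
+that page; on Linux that is roughly:
+
+    struct uffdio_copy copy = {
+        .dst = (uint64_t)host_page_address,
+        .src = (uint64_t)received_page_buffer,
+        .len = page_size,
+    };
+    ioctl(userfault_fd, UFFDIO_COPY, &copy);
+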
+=== Postcopy states ===
+
+Postcopy moves through a series of states (see postcopy_state) from
+ADVISE->LISTEN->RUNNING->END
+
+  Advise: Set at the start of migration if postcopy is enabled, even
+          if it hasn't had the start command; here the destination
+          checks that its OS has the support needed for postcopy, and performs
+          setup to ensure the RAM mappings are suitable for later postcopy.
+          (Triggered by reception of POSTCOPY_ADVISE command)
+
+  Listen: The first command in the package, POSTCOPY_LISTEN, switches
+          the destination state to Listen, and starts a new thread
+          (the 'listen thread') which takes over the job of receiving
+          pages off the migration stream, while the main thread carries
+          on processing the blob.  With this thread able to process page
+          reception, the destination now 'sensitises' the RAM to detect
+          any access to missing pages (on Linux using the 'userfault'
+          system; see the sketch below the state list).
+
+  Running: POSTCOPY_RUN causes the destination to synchronise all
+          state and start the CPUs and IO devices running.  The main
+          thread now finishes processing the migration package and
+          now carries on as it would for normal precopy migration
+          (although it can't do the cleanup it would do as it
+          finishes a normal migration).
+
+  End: The listen thread can now quit and perform the cleanup of migration
+          state; the migration is now complete.
+
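+The 'sensitising' step mentioned in the Listen state amounts to
+registering each RAM region with a userfaultfd, along these lines:
+
+    struct uffdio_register reg = {
+        .range = { .start = (uint64_t)host_addr, .len = region_len },
+        .mode  = UFFDIO_REGISTER_MODE_MISSING, /* fault on missing pages */
+    };
+    ioctl(userfault_fd, UFFDIO_REGISTER, &reg);
+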
+=== Source side page maps ===
+
+The source side keeps two bitmaps during postcopy: the 'migration bitmap'
+and the 'sent map'.  The 'migration bitmap' is basically the same as in
+the precopy case, and holds a bit to indicate that a page is 'dirty' -
+i.e. needs sending.  During the precopy phase this is updated as the CPU
+dirties pages, however during postcopy the CPUs are stopped and nothing
+should dirty anything any more.
+
+The 'sent map' is used for the transition to postcopy. It is a bitmap that
+has a bit set whenever a page is sent to the destination, however during
+the transition to postcopy mode it is masked against the migration bitmap
+(sentmap &= migrationbitmap) to generate a bitmap recording pages that
+have previously been sent but are now dirty again.  This masked
+sentmap is sent to the destination which discards those now dirty pages
+before starting the CPUs.
+
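+In code terms the masking step is just (bitmap_and() is QEMU's bitmap
+helper; the variable names are illustrative):
+
+    /* keep only pages that were sent AND are dirty again */
+    bitmap_and(sentmap, sentmap, migration_bitmap, ram_pages);
+    /* sentmap now marks the pages the destination must discard */
+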
+Note that the contents of the sentmap are sacrificed during the calculation
+of the discard set and thus aren't valid once in postcopy.  The dirtymap
+(i.e. the migration bitmap) is still valid and is used to ensure that no
+page is sent more than once.  Any
+request for a page that has already been sent is ignored.  Duplicate requests
+such as this can happen as a page is sent at about the same time the
+destination accesses it.