
[v4,10/19] migration: Postcopy preemption enablement

Message ID 20220331150857.74406-11-peterx@redhat.com
State New
Series migration: Postcopy Preemption

Commit Message

Peter Xu March 31, 2022, 3:08 p.m. UTC
This patch enables the postcopy-preempt feature.

It contains two major changes to the migration logic:

(1) Postcopy requests are now sent via a different socket from precopy
    background migration stream, so as to be isolated from very high page
    request delays.

(2) For huge page enabled hosts: when there are postcopy requests, they can
    now interrupt the sending of a huge host page midway on the src QEMU.

After this patch, we'll live migrate a VM with two channels for postcopy: (1)
PRECOPY channel, which is the default channel that transfers background pages;
and (2) POSTCOPY channel, which only transfers requested pages.

There's no strict rule on which channel to use, e.g., if a requested page is
already being transferred on the precopy channel, then we will keep using the
same precopy channel to transfer the page even if it's explicitly requested.
In 99% of the cases we'll prioritize the channels, so requested pages are sent
via the postcopy channel as long as possible.
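
In miniature, that prioritization rule reads like the standalone C sketch
below (illustrative only; the real logic is postcopy_preempt_choose_channel()
in the diff further down):

    #include <stdbool.h>

    /* Which channel carries the page we're about to send. */
    enum { RAM_CHANNEL_PRECOPY, RAM_CHANNEL_POSTCOPY };

    /* Requested (faulted-on) pages prefer the POSTCOPY channel; everything
     * else, including a huge page already in flight, stays on PRECOPY. */
    static int channel_for_page(bool postcopy_requested)
    {
        return postcopy_requested ? RAM_CHANNEL_POSTCOPY : RAM_CHANNEL_PRECOPY;
    }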

On the source QEMU, when we find a postcopy request, we'll interrupt the
PRECOPY channel sending process and quickly switch to the POSTCOPY channel.
After we have serviced all the high-priority postcopy pages, we'll switch back
to the PRECOPY channel and continue sending the interrupted huge page.
There's no new thread introduced on the src QEMU.
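
In rough, self-contained C, the flow above looks like the following (a
condensed model with stubbed types, illustration only; the real code lives in
ram_save_host_page() and ram_find_and_save_block() in the diff below):

    #include <stdbool.h>

    /* Stand-ins for QEMU's RAMState/PageSearchStatus; illustration only. */
    typedef struct {
        unsigned long page;        /* current target page within the huge page */
        bool postcopy_requested;   /* this page came from a dest page fault */
    } PageSearchStatus;

    typedef struct {
        unsigned long preempted_page;  /* where precopy got interrupted */
        bool preempted;                /* a preemption awaits restoring */
        bool has_postcopy_request;     /* dest faults queued */
    } RAMState;

    /* Send one host huge page, breaking out early when a postcopy request
     * arrives: cache the position so precopy can resume from it later. */
    static void save_host_page(RAMState *rs, PageSearchStatus *pss,
                               unsigned long huge_page_end)
    {
        while (pss->page < huge_page_end) {
            if (rs->has_postcopy_request && !pss->postcopy_requested) {
                rs->preempted_page = pss->page;  /* remember where we were */
                rs->preempted = true;            /* go service POSTCOPY channel */
                return;
            }
            /* ...send pss->page on the current channel... */
            pss->page++;
        }
    }

    /* Once the urgent queue drains, resume the interrupted huge page. */
    static void pick_next_page(RAMState *rs, PageSearchStatus *pss)
    {
        if (rs->preempted) {
            pss->page = rs->preempted_page;
            pss->postcopy_requested = false;  /* back to background precopy */
            rs->preempted = false;
        }
        /* ...otherwise scan the dirty bitmap... */
    }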

On the destination QEMU, one new thread is introduced to receive page data from
the postcopy-specific socket (done in the preparation patch).

This patch has a side effect: previously, after sending a postcopy page, we
assumed the guest would access the following pages, so we kept sending from
there.  Now that's changed.  Instead of going on from a postcopy requested
page, we'll go back and continue sending the precopy huge page (which may have
been interrupted by a postcopy request, so part of the huge page may already
have been sent).

Whether that's a problem is debatable, because "assuming the guest will
continue to access the next page" may not really suit when huge pages are
used, especially if the huge page is large (e.g. 1GB pages, where a single
huge page spans 262,144 4KB target pages).  So that locality hint is pretty
much meaningless when huge pages are used.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c  |   2 +
 migration/migration.h  |   2 +-
 migration/ram.c        | 250 +++++++++++++++++++++++++++++++++++++++--
 migration/trace-events |   7 ++
 4 files changed, 252 insertions(+), 9 deletions(-)

Comments

Daniel P. Berrangé April 20, 2022, 11:05 a.m. UTC | #1
On Thu, Mar 31, 2022 at 11:08:48AM -0400, Peter Xu wrote:
> This patch enables postcopy-preempt feature.
> 
> It contains two major changes to the migration logic:
> 
> (1) Postcopy requests are now sent via a different socket from precopy
>     background migration stream, so as to be isolated from very high page
>     request delays.
> 
> (2) For huge page enabled hosts: when there's postcopy requests, they can now
>     intercept a partial sending of huge host pages on src QEMU.
> 
> After this patch, we'll live migrate a VM with two channels for postcopy: (1)
> PRECOPY channel, which is the default channel that transfers background pages;
> and (2) POSTCOPY channel, which only transfers requested pages.
> 
> There's no strict rule of which channel to use, e.g., if a requested page is
> already being transferred on precopy channel, then we will keep using the same
> precopy channel to transfer the page even if it's explicitly requested.  In 99%
> of the cases we'll prioritize the channels so we send requested page via the
> postcopy channel as long as possible.
> 
> On the source QEMU, when we found a postcopy request, we'll interrupt the
> PRECOPY channel sending process and quickly switch to the POSTCOPY channel.
> After we serviced all the high priority postcopy pages, we'll switch back to
> PRECOPY channel so that we'll continue to send the interrupted huge page again.
> There's no new thread introduced on src QEMU.

Implicit in this approach is that the delay in sending postcopy
OOB pages is from the pending socket buffers the kernel already
has, and not any delay caused by the QEMU sending thread being
busy doing other stuff.

Is there any scenario in which the QEMU sending thread is stalled
in sendmsg() with a 1GB huge page waiting for the kernel to
get space in the socket outgoing buffer ?

With regards,
Daniel
Peter Xu April 20, 2022, 7:39 p.m. UTC | #2
On Wed, Apr 20, 2022 at 12:05:24PM +0100, Daniel P. Berrangé wrote:
> On Thu, Mar 31, 2022 at 11:08:48AM -0400, Peter Xu wrote:
> > This patch enables postcopy-preempt feature.
> > 
> > It contains two major changes to the migration logic:
> > 
> > (1) Postcopy requests are now sent via a different socket from precopy
> >     background migration stream, so as to be isolated from very high page
> >     request delays.
> > 
> > (2) For huge page enabled hosts: when there's postcopy requests, they can now
> >     intercept a partial sending of huge host pages on src QEMU.
> > 
> > After this patch, we'll live migrate a VM with two channels for postcopy: (1)
> > PRECOPY channel, which is the default channel that transfers background pages;
> > and (2) POSTCOPY channel, which only transfers requested pages.
> > 
> > There's no strict rule of which channel to use, e.g., if a requested page is
> > already being transferred on precopy channel, then we will keep using the same
> > precopy channel to transfer the page even if it's explicitly requested.  In 99%
> > of the cases we'll prioritize the channels so we send requested page via the
> > postcopy channel as long as possible.
> > 
> > On the source QEMU, when we found a postcopy request, we'll interrupt the
> > PRECOPY channel sending process and quickly switch to the POSTCOPY channel.
> > After we serviced all the high priority postcopy pages, we'll switch back to
> > PRECOPY channel so that we'll continue to send the interrupted huge page again.
> > There's no new thread introduced on src QEMU.
> 
> Implicit in this approach is that the delay in sending postcopy
> OOB pages is from the pending socket buffers the kernel already
> has, and not any delay caused by the QEMU sending thread being
> busy doing other stuff.

Yes.

> 
> Is there any scenario in which the QEMU sending thread is stalled
> in sendmsg() with a 1GB huge page waiting for the kernel to
> get space in the socket outgoing buffer ?

Another yes.

It doesn't necessarily have to be during the sending of a 1GB huge page; the
guest can be using small pages, and IMHO we could get stuck in sendmsg() for a
precopy small page while there are actually postcopy requests in the queue.

We can't solve this as long as we keep using a single thread for sending
pages.

This patchset doesn't solve this issue yet.  It's actually the chunk discussed
and mentioned in the cover letter too, in the section "Avoid precopy write()
blocks postcopy", as a TODO item.

Logically, in the future we could try to create two or more sender threads so
that postcopy pages can use a separate sender thread.
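
As a purely hypothetical sketch of that direction (illustrative names, not
QEMU APIs): a dedicated sender thread drains an urgent-page queue, so a
sendmsg() blocked on the precopy socket can no longer delay postcopy pages:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical urgent-page queue shared with the migration thread. */
    typedef struct PageReq { uint64_t offset; struct PageReq *next; } PageReq;

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static PageReq *queue;
    static bool quitting;

    static void *postcopy_sender_thread(void *opaque)
    {
        (void)opaque;
        pthread_mutex_lock(&lock);
        while (!quitting) {
            while (queue) {
                PageReq *req = queue;
                queue = req->next;
                pthread_mutex_unlock(&lock);
                /* Here the thread would write the page to the postcopy
                 * socket; it may block in sendmsg() without ever stalling
                 * the precopy sender. */
                free(req);
                pthread_mutex_lock(&lock);
            }
            pthread_cond_wait(&cond, &lock);
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }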

Note that this change will _not_ require an interface change, either in the
qemu cmdline or in the migration protocol, because this patchset should have
handled all of the migration protocol already even for that.  If it works
well, we could get a pure speedup with further shrunk latency when preempt
mode is enabled, compared to before.

The other thing is I never measured such an effect, so I can't tell how it
would perform in the end.  We need more work on top if we'd like to pursue it,
mostly on doing proper synchronization among the senders.

Thanks,
manish.mishra May 11, 2022, 3:54 p.m. UTC | #3
On 31/03/22 8:38 pm, Peter Xu wrote:

LGTM

> This patch enables postcopy-preempt feature.
>
> It contains two major changes to the migration logic:
>
> (1) Postcopy requests are now sent via a different socket from precopy
>      background migration stream, so as to be isolated from very high page
>      request delays.
>
> (2) For huge page enabled hosts: when there's postcopy requests, they can now
>      intercept a partial sending of huge host pages on src QEMU.
>
> After this patch, we'll live migrate a VM with two channels for postcopy: (1)
> PRECOPY channel, which is the default channel that transfers background pages;
> and (2) POSTCOPY channel, which only transfers requested pages.
>
> There's no strict rule of which channel to use, e.g., if a requested page is
> already being transferred on precopy channel, then we will keep using the same
> precopy channel to transfer the page even if it's explicitly requested.  In 99%
> of the cases we'll prioritize the channels so we send requested page via the
> postcopy channel as long as possible.
>
> On the source QEMU, when we found a postcopy request, we'll interrupt the
> PRECOPY channel sending process and quickly switch to the POSTCOPY channel.
> After we serviced all the high priority postcopy pages, we'll switch back to
> PRECOPY channel so that we'll continue to send the interrupted huge page again.
> There's no new thread introduced on src QEMU.
>
> On the destination QEMU, one new thread is introduced to receive page data from
> the postcopy specific socket (done in the preparation patch).
>
> This patch has a side effect: after sending postcopy pages, previously we'll
> assume the guest will access follow up pages so we'll keep sending from there.
> Now it's changed.  Instead of going on with a postcopy requested page, we'll go
> back and continue sending the precopy huge page (which can be intercepted by a
> postcopy request so the huge page can be sent partially before).
>
> Whether that's a problem is debatable, because "assuming the guest will
> continue to access the next page" may not really suite when huge pages are
> used, especially if the huge page is large (e.g. 1GB pages).  So that locality
> hint is much meaningless if huge pages are used.
>
> Reviewed-by: Dr. David Alan Gilbert<dgilbert@redhat.com>
> Signed-off-by: Peter Xu<peterx@redhat.com>
> ---
>   migration/migration.c  |   2 +
>   migration/migration.h  |   2 +-
>   migration/ram.c        | 250 +++++++++++++++++++++++++++++++++++++++--
>   migration/trace-events |   7 ++
>   4 files changed, 252 insertions(+), 9 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 01b882494d..56d54c186b 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3158,6 +3158,8 @@ static int postcopy_start(MigrationState *ms)
>                                 MIGRATION_STATUS_FAILED);
>       }
>   
> +    trace_postcopy_preempt_enabled(migrate_postcopy_preempt());
> +
>       return ret;
>   
>   fail_closefb:
> diff --git a/migration/migration.h b/migration/migration.h
> index caa910d956..b8aacfe3af 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -68,7 +68,7 @@ typedef struct {
>   struct MigrationIncomingState {
>       QEMUFile *from_src_file;
>       /* Previously received RAM's RAMBlock pointer */
> -    RAMBlock *last_recv_block;
> +    RAMBlock *last_recv_block[RAM_CHANNEL_MAX];
>       /* A hook to allow cleanup at the end of incoming migration */
>       void *transport_data;
>       void (*transport_cleanup)(void *data);
> diff --git a/migration/ram.c b/migration/ram.c
> index c7ea1d9215..518d511874 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -295,6 +295,20 @@ struct RAMSrcPageRequest {
>       QSIMPLEQ_ENTRY(RAMSrcPageRequest) next_req;
>   };
>   
> +typedef struct {
> +    /*
> +     * Cached ramblock/offset values if preempted.  They're only meaningful if
> +     * preempted==true below.
> +     */
> +    RAMBlock *ram_block;
> +    unsigned long ram_page;
> +    /*
> +     * Whether a postcopy preemption just happened.  Will be reset after
> +     * precopy recovered to background migration.
> +     */
> +    bool preempted;
> +} PostcopyPreemptState;
> +
>   /* State of RAM for migration */
>   struct RAMState {
>       /* QEMUFile used for this migration */
> @@ -349,6 +363,14 @@ struct RAMState {
>       /* Queue of outstanding page requests from the destination */
>       QemuMutex src_page_req_mutex;
>       QSIMPLEQ_HEAD(, RAMSrcPageRequest) src_page_requests;
> +
> +    /* Postcopy preemption informations */
> +    PostcopyPreemptState postcopy_preempt_state;
> +    /*
> +     * Current channel we're using on src VM.  Only valid if postcopy-preempt
> +     * is enabled.
> +     */
> +    unsigned int postcopy_channel;
>   };
>   typedef struct RAMState RAMState;
>   
> @@ -356,6 +378,11 @@ static RAMState *ram_state;
>   
>   static NotifierWithReturnList precopy_notifier_list;
>   
> +static void postcopy_preempt_reset(RAMState *rs)
> +{
> +    memset(&rs->postcopy_preempt_state, 0, sizeof(PostcopyPreemptState));
> +}
> +
>   /* Whether postcopy has queued requests? */
>   static bool postcopy_has_request(RAMState *rs)
>   {
> @@ -1947,6 +1974,55 @@ void ram_write_tracking_stop(void)
>   }
>   #endif /* defined(__linux__) */
>   
> +/*
> + * Check whether two addr/offset of the ramblock falls onto the same host huge
> + * page.  Returns true if so, false otherwise.
> + */
> +static bool offset_on_same_huge_page(RAMBlock *rb, uint64_t addr1,
> +                                     uint64_t addr2)
> +{
> +    size_t page_size = qemu_ram_pagesize(rb);
> +
> +    addr1 = ROUND_DOWN(addr1, page_size);
> +    addr2 = ROUND_DOWN(addr2, page_size);
> +
> +    return addr1 == addr2;
> +}
> +
> +/*
> + * Whether a previous preempted precopy huge page contains current requested
> + * page?  Returns true if so, false otherwise.
> + *
> + * This should really happen very rarely, because it means when we were sending
> + * during background migration for postcopy we're sending exactly the page that
> + * some vcpu got faulted on on dest node.  When it happens, we probably don't
> + * need to do much but drop the request, because we know right after we restore
> + * the precopy stream it'll be serviced.  It'll slightly affect the order of
> + * postcopy requests to be serviced (e.g. it'll be the same as we move current
> + * request to the end of the queue) but it shouldn't be a big deal.  The most
> + * imporant thing is we can _never_ try to send a partial-sent huge page on the
> + * POSTCOPY channel again, otherwise that huge page will got "split brain" on
> + * two channels (PRECOPY, POSTCOPY).
> + */
> +static bool postcopy_preempted_contains(RAMState *rs, RAMBlock *block,
> +                                        ram_addr_t offset)
> +{
> +    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
> +
> +    /* No preemption at all? */
> +    if (!state->preempted) {
> +        return false;
> +    }
> +
> +    /* Not even the same ramblock? */
> +    if (state->ram_block != block) {
> +        return false;
> +    }
> +
> +    return offset_on_same_huge_page(block, offset,
> +                                    state->ram_page << TARGET_PAGE_BITS);
> +}
> +
>   /**
>    * get_queued_page: unqueue a page from the postcopy requests
>    *
> @@ -1962,9 +2038,17 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>       RAMBlock  *block;
>       ram_addr_t offset;
>   
> +again:
>       block = unqueue_page(rs, &offset);
>   
> -    if (!block) {
> +    if (block) {
> +        /* See comment above postcopy_preempted_contains() */
> +        if (postcopy_preempted_contains(rs, block, offset)) {
> +            trace_postcopy_preempt_hit(block->idstr, offset);
> +            /* This request is dropped */
> +            goto again;
> +        }
If we continuously keep getting new post-copy requests, is it possible this
case can starve a post-copy request which is in precopy preemption?
> +    } else {
>           /*
>            * Poll write faults too if background snapshot is enabled; that's
>            * when we have vcpus got blocked by the write protected pages.
> @@ -2180,6 +2264,117 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
>       return ram_save_page(rs, pss);
>   }
>   
> +static bool postcopy_needs_preempt(RAMState *rs, PageSearchStatus *pss)
> +{
> +    /* Not enabled eager preempt?  Then never do that. */
> +    if (!migrate_postcopy_preempt()) {
> +        return false;
> +    }
> +
> +    /* If the ramblock we're sending is a small page?  Never bother. */
> +    if (qemu_ram_pagesize(pss->block) == TARGET_PAGE_SIZE) {
> +        return false;
> +    }
> +
> +    /* Not in postcopy at all? */
> +    if (!migration_in_postcopy()) {
> +        return false;
> +    }
> +
> +    /*
> +     * If we're already handling a postcopy request, don't preempt as this page
> +     * has got the same high priority.
> +     */
> +    if (pss->postcopy_requested) {
> +        return false;
> +    }
> +
> +    /* If there's postcopy requests, then check it up! */
> +    return postcopy_has_request(rs);
> +}
> +
> +/* Returns true if we preempted precopy, false otherwise */
> +static void postcopy_do_preempt(RAMState *rs, PageSearchStatus *pss)
> +{
> +    PostcopyPreemptState *p_state = &rs->postcopy_preempt_state;
> +
> +    trace_postcopy_preempt_triggered(pss->block->idstr, pss->page);
> +
> +    /*
> +     * Time to preempt precopy. Cache current PSS into preempt state, so that
> +     * after handling the postcopy pages we can recover to it.  We need to do
> +     * so because the dest VM will have partial of the precopy huge page kept
> +     * over in its tmp huge page caches; better move on with it when we can.
> +     */
> +    p_state->ram_block = pss->block;
> +    p_state->ram_page = pss->page;
> +    p_state->preempted = true;
> +}
> +
> +/* Whether we're preempted by a postcopy request during sending a huge page */
> +static bool postcopy_preempt_triggered(RAMState *rs)
> +{
> +    return rs->postcopy_preempt_state.preempted;
> +}
> +
> +static void postcopy_preempt_restore(RAMState *rs, PageSearchStatus *pss)
> +{
> +    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
> +
> +    assert(state->preempted);
> +
> +    pss->block = state->ram_block;
> +    pss->page = state->ram_page;
> +    /* This is not a postcopy request but restoring previous precopy */
> +    pss->postcopy_requested = false;
> +
> +    trace_postcopy_preempt_restored(pss->block->idstr, pss->page);
> +
> +    /* Reset preempt state, most importantly, set preempted==false */
> +    postcopy_preempt_reset(rs);
> +}
> +
> +static void postcopy_preempt_choose_channel(RAMState *rs, PageSearchStatus *pss)
> +{
> +    MigrationState *s = migrate_get_current();
> +    unsigned int channel;
> +    QEMUFile *next;
> +
> +    channel = pss->postcopy_requested ?
> +        RAM_CHANNEL_POSTCOPY : RAM_CHANNEL_PRECOPY;
> +
> +    if (channel != rs->postcopy_channel) {
> +        if (channel == RAM_CHANNEL_PRECOPY) {
> +            next = s->to_dst_file;
> +        } else {
> +            next = s->postcopy_qemufile_src;
> +        }
> +        /* Update and cache the current channel */
> +        rs->f = next;
> +        rs->postcopy_channel = channel;
> +
> +        /*
> +         * If channel switched, reset last_sent_block since the old sent block
> +         * may not be on the same channel.
> +         */
> +        rs->last_sent_block = NULL;
> +
> +        trace_postcopy_preempt_switch_channel(channel);
> +    }
> +
> +    trace_postcopy_preempt_send_host_page(pss->block->idstr, pss->page);
> +}
> +
> +/* We need to make sure rs->f always points to the default channel elsewhere */
> +static void postcopy_preempt_reset_channel(RAMState *rs)
> +{
> +    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
> +        rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
> +        rs->f = migrate_get_current()->to_dst_file;
> +        trace_postcopy_preempt_reset_channel();
> +    }
> +}
> +
>   /**
>    * ram_save_host_page: save a whole host page
>    *
> @@ -2211,7 +2406,16 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
>           return 0;
>       }
>   
> +    if (migrate_postcopy_preempt() && migration_in_postcopy()) {

I see why there is only one extra channel: multiFD is not supported for
postcopy.  Peter, any particular reason for that?

It must be very bad without multiFD; we have seen we cannot utilise a NIC
higher than 10 Gbps without multiFD.  If it is something in the TODO list, can
we help with that?

> +        postcopy_preempt_choose_channel(rs, pss);
> +    }
> +
>       do {
> +        if (postcopy_needs_preempt(rs, pss)) {
> +            postcopy_do_preempt(rs, pss);
> +            break;
> +        }
> +
>           /* Check the pages is dirty and if it is send it */
>           if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
>               tmppages = ram_save_target_page(rs, pss);
> @@ -2235,6 +2439,19 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
>       /* The offset we leave with is the min boundary of host page and block */
>       pss->page = MIN(pss->page, hostpage_boundary);
>   
> +    /*
> +     * When with postcopy preempt mode, flush the data as soon as possible for
> +     * postcopy requests, because we've already sent a whole huge page, so the
> +     * dst node should already have enough resource to atomically filling in
> +     * the current missing page.
> +     *
> +     * More importantly, when using separate postcopy channel, we must do
> +     * explicit flush or it won't flush until the buffer is full.
> +     */
> +    if (migrate_postcopy_preempt() && pss->postcopy_requested) {
> +        qemu_fflush(rs->f);
> +    }
> +
>       res = ram_save_release_protection(rs, pss, start_page);
>       return (res < 0 ? res : pages);
>   }
> @@ -2276,8 +2493,17 @@ static int ram_find_and_save_block(RAMState *rs)
>           found = get_queued_page(rs, &pss);
>   
>           if (!found) {
> -            /* priority queue empty, so just search for something dirty */
> -            found = find_dirty_block(rs, &pss, &again);
> +            /*
> +             * Recover previous precopy ramblock/offset if postcopy has
> +             * preempted precopy.  Otherwise find the next dirty bit.
> +             */
> +            if (postcopy_preempt_triggered(rs)) {
> +                postcopy_preempt_restore(rs, &pss);
> +                found = true;
> +            } else {
> +                /* priority queue empty, so just search for something dirty */
> +                found = find_dirty_block(rs, &pss, &again);
> +            }
>           }
>   
>           if (found) {
> @@ -2405,6 +2631,8 @@ static void ram_state_reset(RAMState *rs)
>       rs->last_page = 0;
>       rs->last_version = ram_list.version;
>       rs->xbzrle_enabled = false;
> +    postcopy_preempt_reset(rs);
> +    rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
>   }
>   
>   #define MAX_WAIT 50 /* ms, half buffered_file limit */
> @@ -3043,6 +3271,8 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>       }
>       qemu_mutex_unlock(&rs->bitmap_mutex);
>   
> +    postcopy_preempt_reset_channel(rs);
> +
>       /*
>        * Must occur before EOS (or any QEMUFile operation)
>        * because of RDMA protocol.
> @@ -3112,6 +3342,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>           ram_control_after_iterate(f, RAM_CONTROL_FINISH);
>       }
>   
> +    postcopy_preempt_reset_channel(rs);
> +
>       if (ret >= 0) {
>           multifd_send_sync_main(rs->f);
>           qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> @@ -3194,11 +3426,13 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
>    * @mis: the migration incoming state pointer
>    * @f: QEMUFile where to read the data from
>    * @flags: Page flags (mostly to see if it's a continuation of previous block)
> + * @channel: the channel we're using
>    */
>   static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
> -                                              QEMUFile *f, int flags)
> +                                              QEMUFile *f, int flags,
> +                                              int channel)
>   {
> -    RAMBlock *block = mis->last_recv_block;
> +    RAMBlock *block = mis->last_recv_block[channel];
>       char id[256];
>       uint8_t len;
>   
> @@ -3225,7 +3459,7 @@ static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
>           return NULL;
>       }
>   
> -    mis->last_recv_block = block;
> +    mis->last_recv_block[channel] = block;
>   
>       return block;
>   }
> @@ -3679,7 +3913,7 @@ int ram_load_postcopy(QEMUFile *f, int channel)
>           trace_ram_load_postcopy_loop(channel, (uint64_t)addr, flags);
>           if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
>                        RAM_SAVE_FLAG_COMPRESS_PAGE)) {
> -            block = ram_block_from_stream(mis, f, flags);
> +            block = ram_block_from_stream(mis, f, flags, channel);
>               if (!block) {
>                   ret = -EINVAL;
>                   break;
> @@ -3930,7 +4164,7 @@ static int ram_load_precopy(QEMUFile *f)
>   
>           if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
>                        RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
> -            RAMBlock *block = ram_block_from_stream(mis, f, flags);
> +            RAMBlock *block = ram_block_from_stream(mis, f, flags, RAM_CHANNEL_PRECOPY);
>   
>               host = host_from_ram_block_offset(block, addr);
>               /*
> diff --git a/migration/trace-events b/migration/trace-events
> index 1f932782d9..f92793b5f5 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -111,6 +111,12 @@ ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRI
>   ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
>   ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
>   unqueue_page(char *block, uint64_t offset, bool dirty) "ramblock '%s' offset 0x%"PRIx64" dirty %d"
> +postcopy_preempt_triggered(char *str, unsigned long page) "during sending ramblock %s offset 0x%lx"
> +postcopy_preempt_restored(char *str, unsigned long page) "ramblock %s offset 0x%lx"
> +postcopy_preempt_hit(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
> +postcopy_preempt_send_host_page(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
> +postcopy_preempt_switch_channel(int channel) "%d"
> +postcopy_preempt_reset_channel(void) ""
>   
>   # multifd.c
>   multifd_new_send_channel_async(uint8_t id) "channel %u"
> @@ -176,6 +182,7 @@ migration_thread_low_pending(uint64_t pending) "%" PRIu64
>   migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
>   process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>   process_incoming_migration_co_postcopy_end_main(void) ""
> +postcopy_preempt_enabled(bool value) "%d"
>   
>   # channel.c
>   migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
Peter Xu May 12, 2022, 4:22 p.m. UTC | #4
Hi, Manish,

On Wed, May 11, 2022 at 09:24:28PM +0530, manish.mishra wrote:
> > @@ -1962,9 +2038,17 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
> >       RAMBlock  *block;
> >       ram_addr_t offset;
> > +again:
> >       block = unqueue_page(rs, &offset);
> > -    if (!block) {
> > +    if (block) {
> > +        /* See comment above postcopy_preempted_contains() */
> > +        if (postcopy_preempted_contains(rs, block, offset)) {
> > +            trace_postcopy_preempt_hit(block->idstr, offset);
> > +            /* This request is dropped */
> > +            goto again;
> > +        }
> If we continuosly keep on getting new post-copy request. Is it possible this
> case can starve post-copy request which is in precopy preemtion?

I didn't fully get your thoughts, could you elaborate?

Here we're checking against the case where the postcopy requested page is
exactly the one that we have preempted in previous precopy sessions.  If
true, we drop this postcopy page and continue with the rest.

When there are no postcopy requests pending, we'll continue with the precopy
page, which is exactly the request we've dropped.

Why we did this is explained in the comment above postcopy_preempted_contains();
quoting from there:

/*
 * This should really happen very rarely, because it means when we were sending
 * during background migration for postcopy we're sending exactly the page that
 * some vcpu got faulted on on dest node.  When it happens, we probably don't
 * need to do much but drop the request, because we know right after we restore
 * the precopy stream it'll be serviced.  It'll slightly affect the order of
 * postcopy requests to be serviced (e.g. it'll be the same as we move current
 * request to the end of the queue) but it shouldn't be a big deal.  The most
 * imporant thing is we can _never_ try to send a partial-sent huge page on the
 * POSTCOPY channel again, otherwise that huge page will got "split brain" on
 * two channels (PRECOPY, POSTCOPY).
 */

[...]

> > @@ -2211,7 +2406,16 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
> >           return 0;
> >       }
> > +    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
> 
> I see why there is only one extra channel, multiFD is not supported for
> postcopy. Peter, Any particular reason for that.

We used one channel not because multifd is not enabled - if you read into
the series, the channels are separately managed because they're servicing
different goals.  It's because I don't really know whether multiple channels
would be necessary: postcopy requests should not form the major path through
which pages are sent; they're kind of a fast path.

One of the major goals of this series is to avoid interruptions made to
postcopy urgent pages due to the sending of precopy pages.  One extra channel
already serviced that well, so I just stopped there for the initial version.
I actually raised that question myself too in the cover letter, in the todo
section; I think we can always evaluate the possibility of that in the future
without major reworks (but we may need another parameter to specify the
number of threads, just like multifd).

> 
> As it must be very bad without multiFD, we have seen we can not utilise NIC
> higher than 10 Gbps without multiFD. If it
> 
> is something in TODO can we help with that?

Yes, that should be on Juan's todo list (in the cc list as well), and
AFAICT he'll be happy if anyone would like to take items out of the list.
We can further discuss it somewhere.

One thing to mention is that I suspect the thread models will still need to
be separate even if multifd joins the equation.  I mean, IMHO multifd threads
take chunks of pages and send things in bulk, while if you read into this
series, postcopy preempt threads send pages one by one and asap.  The former
cares about throughput and the latter about latency.  When we design the mix
of postcopy+multifd it'll be great if we also keep this in mind, so hopefully
it'll make postcopy+multifd+preempt easier in the end.

Thanks,
manish.mishra May 13, 2022, 6:53 p.m. UTC | #5
On 12/05/22 9:52 pm, Peter Xu wrote:
> Hi, Manish,
>
> On Wed, May 11, 2022 at 09:24:28PM +0530, manish.mishra wrote:
>>> @@ -1962,9 +2038,17 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>>>        RAMBlock  *block;
>>>        ram_addr_t offset;
>>> +again:
>>>        block = unqueue_page(rs, &offset);
>>> -    if (!block) {
>>> +    if (block) {
>>> +        /* See comment above postcopy_preempted_contains() */
>>> +        if (postcopy_preempted_contains(rs, block, offset)) {
>>> +            trace_postcopy_preempt_hit(block->idstr, offset);
>>> +            /* This request is dropped */
>>> +            goto again;
>>> +        }
>> If we continuosly keep on getting new post-copy request. Is it possible this
>> case can starve post-copy request which is in precopy preemtion?
> I didn't fully get your thoughts, could you elaborate?
>
> Here we're checking against the case where the postcopy requested page is
> exactly the one that we have preempted in previous precopy sessions.  If
> true, we drop this postcopy page and continue with the rest.
>
> When there'll be no postcopy requests pending then we'll continue with the
> precopy page, which is exactly the request we've dropped.
>
> Why we did this is actually in comment above postcopy_preempted_contains(),
> and quotting from there:
>
> /*
>   * This should really happen very rarely, because it means when we were sending
>   * during background migration for postcopy we're sending exactly the page that
>   * some vcpu got faulted on on dest node.  When it happens, we probably don't
>   * need to do much but drop the request, because we know right after we restore
>   * the precopy stream it'll be serviced.  It'll slightly affect the order of
>   * postcopy requests to be serviced (e.g. it'll be the same as we move current
>   * request to the end of the queue) but it shouldn't be a big deal.  The most
>   * imporant thing is we can _never_ try to send a partial-sent huge page on the
>   * POSTCOPY channel again, otherwise that huge page will got "split brain" on
>   * two channels (PRECOPY, POSTCOPY).
>   */
>
> [...]

Hi Peter, what I meant here is that we go back to precopy sending only when
there is no post-copy request left, so if there is some workload which is
continuously generating new post-copy fault requests, it may take very long
before we resume on the precopy channel.

So basically the precopy channel may have a post-copy request pending for a
very long time in this case?  Earlier, as it was FCFS, there was a guarantee
that a post-copy request would be served after a fixed amount of time.

>>> @@ -2211,7 +2406,16 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
>>>            return 0;
>>>        }
>>> +    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
>> I see why there is only one extra channel, multiFD is not supported for
>> postcopy. Peter, Any particular reason for that.
> We used one channel not because multifd is not enabled - if you read into
> the series the channels are separately managed because they're servicing
> different goals.  It's because I don't really know whether multiple
> channels would be necessary, because postcopy requests should not be the
> major channel that pages will be sent, kind of a fast-path.
>
> One of the major goal of this series is to avoid interruptions made to
> postcopy urgent pages due to sending of precopy pages.  One extra channel
> already serviced it well, so I just stopped there as the initial version.
> I actually raised that question myself too in the cover letter in the todo
> section, I think we can always evaluate the possibility of that in the
> future without major reworks (but we may need another parameter to specify
> the num of threads just like multifd).

> because postcopy requests should not be the major channel that pages will
> be sent, kind of a fast-path.

Yes, agree Peter, but in the worst-case scenario it is possible we may have
to transfer the full memory of the VM via post-copy requests?  So in that
case we may require a higher number of threads.  But agreed, there cannot be
a binding with the number of multiFD channels, as multiFD uses a 256KB buffer
size while here we may have to use 4KB in the small-page case, so there can
be a big difference in throughput limits.  Also, the smaller the buffer size,
the higher the CPU usage, so it needs to be decided carefully.
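
A rough back-of-the-envelope for the buffer-size point (illustrative numbers,
not measurements from this thread):

    10 Gbps ≈ 1.25 GB/s of payload
    1.25 GB/s / 256 KB per write ≈   4,800 writes/s  (multiFD-sized buffers)
    1.25 GB/s /   4 KB per write ≈ 305,000 writes/s  (per-page sends)

i.e. per-page sends need roughly 64x more syscalls for the same bandwidth.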

>> As it must be very bad without multiFD, we have seen we can not utilise NIC
>> higher than 10 Gbps without multiFD. If it
>>
>> is something in TODO can we help with that?
> Yes, that should be on Juan's todo list (in the cc list as well), and
> AFAICT he'll be happy if anyone would like to take items out of the list.
> We can further discuss it somewhere.
>
> One thing to mention is that I suspect the thread models will still need to
> be separate even if multifd joins the equation.  I mean, IMHO multifd
> threads take chunks of pages and send things in bulk, while if you read
> into this series postcopy preempt threads send page one by one and asap.
> The former cares on throughput and latter cares latency.  When we design
> the mix of postcopy+multifd it'll be great we also keep this in mind so
> hopefully it'll make postcopy+multifd+preempt easier at last.
yes, got it, thanks
>
> Thanks,
>
Peter Xu May 13, 2022, 7:31 p.m. UTC | #6
On Sat, May 14, 2022 at 12:23:44AM +0530, manish.mishra wrote:
> 
> On 12/05/22 9:52 pm, Peter Xu wrote:
> > Hi, Manish,
> > 
> > On Wed, May 11, 2022 at 09:24:28PM +0530, manish.mishra wrote:
> > > > @@ -1962,9 +2038,17 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
> > > >        RAMBlock  *block;
> > > >        ram_addr_t offset;
> > > > +again:
> > > >        block = unqueue_page(rs, &offset);
> > > > -    if (!block) {
> > > > +    if (block) {
> > > > +        /* See comment above postcopy_preempted_contains() */
> > > > +        if (postcopy_preempted_contains(rs, block, offset)) {
> > > > +            trace_postcopy_preempt_hit(block->idstr, offset);
> > > > +            /* This request is dropped */
> > > > +            goto again;
> > > > +        }
> > > If we continuosly keep on getting new post-copy request. Is it possible this
> > > case can starve post-copy request which is in precopy preemtion?
> > I didn't fully get your thoughts, could you elaborate?
> > 
> > Here we're checking against the case where the postcopy requested page is
> > exactly the one that we have preempted in previous precopy sessions.  If
> > true, we drop this postcopy page and continue with the rest.
> > 
> > When there'll be no postcopy requests pending then we'll continue with the
> > precopy page, which is exactly the request we've dropped.
> > 
> > Why we did this is actually in comment above postcopy_preempted_contains(),
> > and quotting from there:
> > 
> > /*
> >   * This should really happen very rarely, because it means when we were sending
> >   * during background migration for postcopy we're sending exactly the page that
> >   * some vcpu got faulted on on dest node.  When it happens, we probably don't
> >   * need to do much but drop the request, because we know right after we restore
> >   * the precopy stream it'll be serviced.  It'll slightly affect the order of
> >   * postcopy requests to be serviced (e.g. it'll be the same as we move current
> >   * request to the end of the queue) but it shouldn't be a big deal.  The most
> >   * imporant thing is we can _never_ try to send a partial-sent huge page on the
> >   * POSTCOPY channel again, otherwise that huge page will got "split brain" on
> >   * two channels (PRECOPY, POSTCOPY).
> >   */
> > 
> > [...]
> 
> Hi Peter, what i meant here is that as we go to precopy sending only when
> there is
> 
> no post-copy request left so if there is some workload which is continuosly
> generating
> 
> new post-copy fault request, It may take very long before we resume on
> precopy channel.
> 
> So basically precopy channel may have a post-copy request pending for very
> long in
> 
> this case? Earlier as it was FCFS there was a guarantee a post-copy request
> will be
> 
> served after a fixed amount of time.

Ah, that's a good point.

In that case maybe what we want to do is to restore this preemption
immediately using postcopy_preempt_restore(); however, we may also want to
make sure this huge page won't get preempted by any other postcopy pages.

One thing in my mind to do this is to add one more field to the pss
structure; we could call it pss->urgent.

Previously we only had pss->postcopy_requested, showing whether a request
came from postcopy and whether we should send the page via the postcopy
preempt channel.  We also use that as a hint so we will never preempt a huge
page when postcopy_requested is set.  Now we probably want to separate that
out into pss->urgent, so postcopy_requested will behave mostly like before
(along with the new pss->urgent set to 1 for all postcopy pages), except that
we may want to also set pss->urgent to 1 for this very special case even for
precopy pages, so that we won't preempt this page either.
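
A minimal sketch of that idea (simplified stand-in types; this is the
proposal being discussed, not code from this patch):

    #include <stdbool.h>

    /* Stand-ins for QEMU's structures, illustration only. */
    typedef struct {
        bool postcopy_requested;  /* send via the POSTCOPY channel */
        bool urgent;              /* proposed: never preempt this page */
    } PageSearchStatus;

    typedef struct { int queued_requests; } RAMState;

    static bool postcopy_has_request(RAMState *rs)
    {
        return rs->queued_requests > 0;
    }

    /* With pss->urgent split out, a restored precopy huge page that contains
     * a faulted address can be marked urgent (and sent on the PRECOPY
     * channel) without being preempted again by other postcopy requests. */
    static bool postcopy_needs_preempt(RAMState *rs, PageSearchStatus *pss)
    {
        if (pss->urgent) {
            return false;
        }
        return postcopy_has_request(rs);
    }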

I'm thinking maybe it's not wise to directly change the patch when I repost.
My current plan is to add one more patch at the end, so I won't easily throw
away Dave's R-b; meanwhile, hopefully that will make review easier.  We could
consider squashing that patch in if we commit the whole thing, or we can even
keep them separate as a further optimization.

> 
> > > > @@ -2211,7 +2406,16 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
> > > >            return 0;
> > > >        }
> > > > +    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
> > > I see why there is only one extra channel, multiFD is not supported for
> > > postcopy. Peter, Any particular reason for that.
> > We used one channel not because multifd is not enabled - if you read into
> > the series the channels are separately managed because they're servicing
> > different goals.  It's because I don't really know whether multiple
> > channels would be necessary, because postcopy requests should not be the
> > major channel that pages will be sent, kind of a fast-path.
> > 
> > One of the major goal of this series is to avoid interruptions made to
> > postcopy urgent pages due to sending of precopy pages.  One extra channel
> > already serviced it well, so I just stopped there as the initial version.
> > I actually raised that question myself too in the cover letter in the todo
> > section, I think we can always evaluate the possibility of that in the
> > future without major reworks (but we may need another parameter to specify
> > the num of threads just like multifd).
> 
> >because postcopy requests should not be the major channel that pages will
> be sent, kind of a fast-path.
> 
> Yes, agree Peter, but in worst case scenario it is possible we may have to
> transfer full memory of VM
> 
> by post-copy requests? So in that case we may require higher number of
> threads. But agree there can not be
> 
> be binding with number of mutliFD channels as multiFD uses 256KB buffer size
> but here we may have to 4k
> 
> in small page case so there can be big diff in throughput limits. Also
> smaller the buffer size much higher will
> 
> be cpu usage so it needs to be decided carefully.

Right, and I see your point here.

It's just non-trivial to gain both throughput and low latency, imho.  But
maybe you have a good point in that it also means, with preemption mode on
and with an extremely busy VM, we could render multifd useless even if we
support both multifd+preempt in the future.

But anyway - let's think more of it and let's solve problems one by one.

The worst case is we'll have low bandwidth for such a migration, but it still
keeps relatively good responsiveness on dest page faults for now.

Thanks,

Patch

diff --git a/migration/migration.c b/migration/migration.c
index 01b882494d..56d54c186b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3158,6 +3158,8 @@  static int postcopy_start(MigrationState *ms)
                               MIGRATION_STATUS_FAILED);
     }
 
+    trace_postcopy_preempt_enabled(migrate_postcopy_preempt());
+
     return ret;
 
 fail_closefb:
diff --git a/migration/migration.h b/migration/migration.h
index caa910d956..b8aacfe3af 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -68,7 +68,7 @@  typedef struct {
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
     /* Previously received RAM's RAMBlock pointer */
-    RAMBlock *last_recv_block;
+    RAMBlock *last_recv_block[RAM_CHANNEL_MAX];
     /* A hook to allow cleanup at the end of incoming migration */
     void *transport_data;
     void (*transport_cleanup)(void *data);
diff --git a/migration/ram.c b/migration/ram.c
index c7ea1d9215..518d511874 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -295,6 +295,20 @@  struct RAMSrcPageRequest {
     QSIMPLEQ_ENTRY(RAMSrcPageRequest) next_req;
 };
 
+typedef struct {
+    /*
+     * Cached ramblock/offset values if preempted.  They're only meaningful if
+     * preempted==true below.
+     */
+    RAMBlock *ram_block;
+    unsigned long ram_page;
+    /*
+     * Whether a postcopy preemption just happened.  Will be reset after
+     * precopy recovered to background migration.
+     */
+    bool preempted;
+} PostcopyPreemptState;
+
 /* State of RAM for migration */
 struct RAMState {
     /* QEMUFile used for this migration */
@@ -349,6 +363,14 @@  struct RAMState {
     /* Queue of outstanding page requests from the destination */
     QemuMutex src_page_req_mutex;
     QSIMPLEQ_HEAD(, RAMSrcPageRequest) src_page_requests;
+
+    /* Postcopy preemption informations */
+    PostcopyPreemptState postcopy_preempt_state;
+    /*
+     * Current channel we're using on src VM.  Only valid if postcopy-preempt
+     * is enabled.
+     */
+    unsigned int postcopy_channel;
 };
 typedef struct RAMState RAMState;
 
@@ -356,6 +378,11 @@  static RAMState *ram_state;
 
 static NotifierWithReturnList precopy_notifier_list;
 
+static void postcopy_preempt_reset(RAMState *rs)
+{
+    memset(&rs->postcopy_preempt_state, 0, sizeof(PostcopyPreemptState));
+}
+
 /* Whether postcopy has queued requests? */
 static bool postcopy_has_request(RAMState *rs)
 {
@@ -1947,6 +1974,55 @@  void ram_write_tracking_stop(void)
 }
 #endif /* defined(__linux__) */
 
+/*
+ * Check whether two addr/offset of the ramblock falls onto the same host huge
+ * page.  Returns true if so, false otherwise.
+ */
+static bool offset_on_same_huge_page(RAMBlock *rb, uint64_t addr1,
+                                     uint64_t addr2)
+{
+    size_t page_size = qemu_ram_pagesize(rb);
+
+    addr1 = ROUND_DOWN(addr1, page_size);
+    addr2 = ROUND_DOWN(addr2, page_size);
+
+    return addr1 == addr2;
+}
+
+/*
+ * Whether a previous preempted precopy huge page contains current requested
+ * page?  Returns true if so, false otherwise.
+ *
+ * This should really happen very rarely, because it means when we were sending
+ * during background migration for postcopy we're sending exactly the page that
+ * some vcpu got faulted on on dest node.  When it happens, we probably don't
+ * need to do much but drop the request, because we know right after we restore
+ * the precopy stream it'll be serviced.  It'll slightly affect the order of
+ * postcopy requests to be serviced (e.g. it'll be the same as we move current
+ * request to the end of the queue) but it shouldn't be a big deal.  The most
+ * imporant thing is we can _never_ try to send a partial-sent huge page on the
+ * POSTCOPY channel again, otherwise that huge page will got "split brain" on
+ * two channels (PRECOPY, POSTCOPY).
+ */
+static bool postcopy_preempted_contains(RAMState *rs, RAMBlock *block,
+                                        ram_addr_t offset)
+{
+    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
+
+    /* No preemption at all? */
+    if (!state->preempted) {
+        return false;
+    }
+
+    /* Not even the same ramblock? */
+    if (state->ram_block != block) {
+        return false;
+    }
+
+    return offset_on_same_huge_page(block, offset,
+                                    state->ram_page << TARGET_PAGE_BITS);
+}
+
 /**
  * get_queued_page: unqueue a page from the postcopy requests
  *
@@ -1962,9 +2038,17 @@  static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
     RAMBlock  *block;
     ram_addr_t offset;
 
+again:
     block = unqueue_page(rs, &offset);
 
-    if (!block) {
+    if (block) {
+        /* See comment above postcopy_preempted_contains() */
+        if (postcopy_preempted_contains(rs, block, offset)) {
+            trace_postcopy_preempt_hit(block->idstr, offset);
+            /* This request is dropped */
+            goto again;
+        }
+    } else {
         /*
          * Poll write faults too if background snapshot is enabled; that's
          * when we have vcpus got blocked by the write protected pages.
@@ -2180,6 +2264,117 @@  static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
     return ram_save_page(rs, pss);
 }
 
+static bool postcopy_needs_preempt(RAMState *rs, PageSearchStatus *pss)
+{
+    /* Not enabled eager preempt?  Then never do that. */
+    if (!migrate_postcopy_preempt()) {
+        return false;
+    }
+
+    /* If the ramblock we're sending is a small page?  Never bother. */
+    if (qemu_ram_pagesize(pss->block) == TARGET_PAGE_SIZE) {
+        return false;
+    }
+
+    /* Not in postcopy at all? */
+    if (!migration_in_postcopy()) {
+        return false;
+    }
+
+    /*
+     * If we're already handling a postcopy request, don't preempt as this page
+     * has got the same high priority.
+     */
+    if (pss->postcopy_requested) {
+        return false;
+    }
+
+    /* If there's postcopy requests, then check it up! */
+    return postcopy_has_request(rs);
+}
+
+/* Returns true if we preempted precopy, false otherwise */
+static void postcopy_do_preempt(RAMState *rs, PageSearchStatus *pss)
+{
+    PostcopyPreemptState *p_state = &rs->postcopy_preempt_state;
+
+    trace_postcopy_preempt_triggered(pss->block->idstr, pss->page);
+
+    /*
+     * Time to preempt precopy. Cache current PSS into preempt state, so that
+     * after handling the postcopy pages we can recover to it.  We need to do
+     * so because the dest VM will have partial of the precopy huge page kept
+     * over in its tmp huge page caches; better move on with it when we can.
+     */
+    p_state->ram_block = pss->block;
+    p_state->ram_page = pss->page;
+    p_state->preempted = true;
+}
+
+/* Whether we're preempted by a postcopy request during sending a huge page */
+static bool postcopy_preempt_triggered(RAMState *rs)
+{
+    return rs->postcopy_preempt_state.preempted;
+}
+
+static void postcopy_preempt_restore(RAMState *rs, PageSearchStatus *pss)
+{
+    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
+
+    assert(state->preempted);
+
+    pss->block = state->ram_block;
+    pss->page = state->ram_page;
+    /* This is not a postcopy request but restoring previous precopy */
+    pss->postcopy_requested = false;
+
+    trace_postcopy_preempt_restored(pss->block->idstr, pss->page);
+
+    /* Reset preempt state, most importantly, set preempted==false */
+    postcopy_preempt_reset(rs);
+}
+
+static void postcopy_preempt_choose_channel(RAMState *rs, PageSearchStatus *pss)
+{
+    MigrationState *s = migrate_get_current();
+    unsigned int channel;
+    QEMUFile *next;
+
+    channel = pss->postcopy_requested ?
+        RAM_CHANNEL_POSTCOPY : RAM_CHANNEL_PRECOPY;
+
+    if (channel != rs->postcopy_channel) {
+        if (channel == RAM_CHANNEL_PRECOPY) {
+            next = s->to_dst_file;
+        } else {
+            next = s->postcopy_qemufile_src;
+        }
+        /* Update and cache the current channel */
+        rs->f = next;
+        rs->postcopy_channel = channel;
+
+        /*
+         * If channel switched, reset last_sent_block since the old sent block
+         * may not be on the same channel.
+         */
+        rs->last_sent_block = NULL;
+
+        trace_postcopy_preempt_switch_channel(channel);
+    }
+
+    trace_postcopy_preempt_send_host_page(pss->block->idstr, pss->page);
+}
+
+/* We need to make sure rs->f always points to the default channel elsewhere */
+static void postcopy_preempt_reset_channel(RAMState *rs)
+{
+    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
+        rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
+        rs->f = migrate_get_current()->to_dst_file;
+        trace_postcopy_preempt_reset_channel();
+    }
+}
+
 /**
  * ram_save_host_page: save a whole host page
  *
@@ -2211,7 +2406,16 @@  static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
         return 0;
     }
 
+    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
+        postcopy_preempt_choose_channel(rs, pss);
+    }
+
     do {
+        if (postcopy_needs_preempt(rs, pss)) {
+            postcopy_do_preempt(rs, pss);
+            break;
+        }
+
         /* Check the pages is dirty and if it is send it */
         if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
             tmppages = ram_save_target_page(rs, pss);
@@ -2235,6 +2439,19 @@  static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
     /* The offset we leave with is the min boundary of host page and block */
     pss->page = MIN(pss->page, hostpage_boundary);
 
+    /*
+     * When with postcopy preempt mode, flush the data as soon as possible for
+     * postcopy requests, because we've already sent a whole huge page, so the
+     * dst node should already have enough resource to atomically filling in
+     * the current missing page.
+     *
+     * More importantly, when using separate postcopy channel, we must do
+     * explicit flush or it won't flush until the buffer is full.
+     */
+    if (migrate_postcopy_preempt() && pss->postcopy_requested) {
+        qemu_fflush(rs->f);
+    }
+
     res = ram_save_release_protection(rs, pss, start_page);
     return (res < 0 ? res : pages);
 }
@@ -2276,8 +2493,17 @@  static int ram_find_and_save_block(RAMState *rs)
         found = get_queued_page(rs, &pss);
 
         if (!found) {
-            /* priority queue empty, so just search for something dirty */
-            found = find_dirty_block(rs, &pss, &again);
+            /*
+             * Recover previous precopy ramblock/offset if postcopy has
+             * preempted precopy.  Otherwise find the next dirty bit.
+             */
+            if (postcopy_preempt_triggered(rs)) {
+                postcopy_preempt_restore(rs, &pss);
+                found = true;
+            } else {
+                /* priority queue empty, so just search for something dirty */
+                found = find_dirty_block(rs, &pss, &again);
+            }
         }
 
         if (found) {
@@ -2405,6 +2631,8 @@  static void ram_state_reset(RAMState *rs)
     rs->last_page = 0;
     rs->last_version = ram_list.version;
     rs->xbzrle_enabled = false;
+    postcopy_preempt_reset(rs);
+    rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
@@ -3043,6 +3271,8 @@  static int ram_save_iterate(QEMUFile *f, void *opaque)
     }
     qemu_mutex_unlock(&rs->bitmap_mutex);
 
+    postcopy_preempt_reset_channel(rs);
+
     /*
      * Must occur before EOS (or any QEMUFile operation)
      * because of RDMA protocol.
@@ -3112,6 +3342,8 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
         ram_control_after_iterate(f, RAM_CONTROL_FINISH);
     }
 
+    postcopy_preempt_reset_channel(rs);
+
     if (ret >= 0) {
         multifd_send_sync_main(rs->f);
         qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
@@ -3194,11 +3426,13 @@  static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
  * @mis: the migration incoming state pointer
  * @f: QEMUFile where to read the data from
  * @flags: Page flags (mostly to see if it's a continuation of previous block)
+ * @channel: the channel we're using
  */
 static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
-                                              QEMUFile *f, int flags)
+                                              QEMUFile *f, int flags,
+                                              int channel)
 {
-    RAMBlock *block = mis->last_recv_block;
+    RAMBlock *block = mis->last_recv_block[channel];
     char id[256];
     uint8_t len;
 
@@ -3225,7 +3459,7 @@  static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
         return NULL;
     }
 
-    mis->last_recv_block = block;
+    mis->last_recv_block[channel] = block;
 
     return block;
 }
@@ -3679,7 +3913,7 @@  int ram_load_postcopy(QEMUFile *f, int channel)
         trace_ram_load_postcopy_loop(channel, (uint64_t)addr, flags);
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE)) {
-            block = ram_block_from_stream(mis, f, flags);
+            block = ram_block_from_stream(mis, f, flags, channel);
             if (!block) {
                 ret = -EINVAL;
                 break;
@@ -3930,7 +4164,7 @@  static int ram_load_precopy(QEMUFile *f)
 
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
-            RAMBlock *block = ram_block_from_stream(mis, f, flags);
+            RAMBlock *block = ram_block_from_stream(mis, f, flags, RAM_CHANNEL_PRECOPY);
 
             host = host_from_ram_block_offset(block, addr);
             /*
diff --git a/migration/trace-events b/migration/trace-events
index 1f932782d9..f92793b5f5 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -111,6 +111,12 @@  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRI
 ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
 ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
 unqueue_page(char *block, uint64_t offset, bool dirty) "ramblock '%s' offset 0x%"PRIx64" dirty %d"
+postcopy_preempt_triggered(char *str, unsigned long page) "during sending ramblock %s offset 0x%lx"
+postcopy_preempt_restored(char *str, unsigned long page) "ramblock %s offset 0x%lx"
+postcopy_preempt_hit(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
+postcopy_preempt_send_host_page(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
+postcopy_preempt_switch_channel(int channel) "%d"
+postcopy_preempt_reset_channel(void) ""
 
 # multifd.c
 multifd_new_send_channel_async(uint8_t id) "channel %u"
@@ -176,6 +182,7 @@  migration_thread_low_pending(uint64_t pending) "%" PRIu64
 migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
 process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
 process_incoming_migration_co_postcopy_end_main(void) ""
+postcopy_preempt_enabled(bool value) "%d"
 
 # channel.c
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"