
[v8,32/54] Postcopy: Maintain sentmap and calculate discard

Message ID 1443515898-3594-33-git-send-email-dgilbert@redhat.com
State New

Commit Message

Dr. David Alan Gilbert Sept. 29, 2015, 8:37 a.m. UTC
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Where postcopy is preceded by a period of precopy, the destination will
have received pages that may have been dirtied on the source after the
page was sent.  The destination must throw these pages away before
starting its CPUs.

Maintain a 'sentmap' of pages that have already been sent.
Calculate list of sent & dirty pages
Provide helpers on the destination side to discard these.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
---
 include/migration/migration.h    |  12 +++
 include/migration/postcopy-ram.h |  35 +++++++
 include/qemu/typedefs.h          |   1 +
 migration/migration.c            |   1 +
 migration/postcopy-ram.c         | 129 +++++++++++++++++++++++
 migration/ram.c                  | 218 ++++++++++++++++++++++++++++++++++++++-
 migration/savevm.c               |   2 -
 trace-events                     |   5 +
 8 files changed, 396 insertions(+), 7 deletions(-)

Comments

Juan Quintela Oct. 21, 2015, 11:17 a.m. UTC | #1
"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Where postcopy is preceded by a period of precopy, the destination will
> have received pages that may have been dirtied on the source after the
> page was sent.  The destination must throw these pages away before
> starting its CPUs.
>
> Maintain a 'sentmap' of pages that have already been sent.
> Calculate list of sent & dirty pages
> Provide helpers on the destination side to discard these.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Amit Shah <amit.shah@redhat.com>


Hi

>      /* Flag set once the migration has been asked to enter postcopy */
>      bool start_postcopy;


This is from a previous patch, but ....

Change the name of the variable or the comment?  From the comment it
should be "in_postcopy", no?


> +
> +    /* bitmap of pages that have been sent at least once
> +     * only maintained and used in postcopy at the moment
> +     * where it's used to send the dirtymap at the start
> +     * of the postcopy phase
> +     */
> +    unsigned long *sentmap;
> };

I *think* that patch would be easier if you put this one inside
migration_bitmap_rcu.  If you put it there, you could do the following:

> +                if (ms->sentmap) {
> +                    set_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap);
> +                }

And you wouldn't have to change all callers to have an ram_addr_abs
address parameter, right?


> +struct PostcopyDiscardState {
> +    const char *name;

It is not obvious to me what name means here.  I assume it's the ram
block name; change it to block_name or ramblock?


> + * returns: 0 on success.
> + */
> +int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> +                               size_t length)
> +{
> +    trace_postcopy_ram_discard_range(start, length);
> +    if (madvise(start, length, MADV_DONTNEED)) {
> +        error_report("%s MADV_DONTNEED: %s", __func__, strerror(errno));
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
>  #else
>  /* No target OS support, stubs just fail */
>  bool postcopy_ram_supported_by_host(void)
> @@ -153,5 +192,95 @@ bool postcopy_ram_supported_by_host(void)
>      return false;
>  }
>  
> +int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> +                               size_t length)
> +{
> +    assert(0);

I will assume that just returning -1 would work here.

But yes, I understand that this code shouldn't be reached ...

> +}
>  #endif
>  
> +/* ------------------------------------------------------------------------- */
> +
> +/**
> + * postcopy_discard_send_init: Called at the start of each RAMBlock before
> + *   asking to discard individual ranges.
> + *
> + * @ms: The current migration state.
> + * @offset: the bitmap offset of the named RAMBlock in the migration
> + *   bitmap.
> + * @name: RAMBlock that discards will operate on.
> + *
> + * returns: a new PDS.
> + */
> +PostcopyDiscardState *postcopy_discard_send_init(MigrationState *ms,
> +                                                 unsigned long offset,
> +                                                 const char *name)
> +{
> +    PostcopyDiscardState *res = g_try_malloc(sizeof(PostcopyDiscardState));

Why are we using g_try_malloc here instead of g_malloc()?  Even
g_malloc0()?

Especially since we don't check whether res is NULL on return.  Please change.


> +
> +    if (res) {
> +        res->name = name;
> +        res->cur_entry = 0;
> +        res->nsentwords = 0;
> +        res->nsentcmds = 0;

With the zero variant, these three can be removed.

> +        res->offset = offset;
> +    }
> +
> +    return res;
> +}

> -/* Called with rcu_read_lock() to protect migration_bitmap */
> +/* Called with rcu_read_lock() to protect migration_bitmap
> + * mr: The region to search for dirty pages in

Haha, you forgot to update the comment when you moved the function to
use ram blocks O:-)


> @@ -662,6 +672,24 @@ static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>  }
>  
>  /**
> + * ram_find_block_by_id: Find a ramblock by name.
> + *
> + * Returns: The RAMBlock with matching ID, or NULL.
> + */
> +static RAMBlock *ram_find_block_by_id(const char *id)
> +{
> +    RAMBlock *block;
> +
> +    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> +        if (!strcmp(id, block->idstr)) {
> +            return block;
> +        }
> +    }
> +
> +    return NULL;
> +}

We don't have this function already.....

Once here, could we split it in its own patch and use it in ram_load?


                QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
                    if (!strncmp(id, block->idstr, sizeof(id))) {
                        if (length != block->used_length) {
                            Error *local_err = NULL;

                            ret = qemu_ram_resize(block->offset, length, &local_err);
                            if (local_err) {
                                error_report_err(local_err);
                            }
                        }
                        ram_control_load_hook(f, RAM_CONTROL_BLOCK_REG,
                                              block->idstr);
                        break;
                    }
                }

                if (!block) {
                    error_report("Unknown ramblock \"%s\", cannot "
                                 "accept migration", id);
                    ret = -EINVAL;
                }


We could also use it in:

host_from_stream_offset


> +/* **** functions for postcopy ***** */
> +
> +/*
> + * Callback from postcopy_each_ram_send_discard for each RAMBlock
> + * start,end: Indexes into the bitmap for the first and last bit
> + *            representing the named block
> + */
> +static int postcopy_send_discard_bm_ram(MigrationState *ms,
> +                                        PostcopyDiscardState *pds,
> +                                        unsigned long start, unsigned long end)
> +{
> +    unsigned long current;
> +
> +    for (current = start; current <= end; ) {
> +        unsigned long set = find_next_bit(ms->sentmap, end + 1, current);
> +
> +        if (set <= end) {
> +            unsigned long zero = find_next_zero_bit(ms->sentmap,
> +                                                    end + 1, set + 1);
> +
> +            if (zero > end) {
> +                zero = end + 1;
> +            }
> +            postcopy_discard_send_range(ms, pds, set, zero - 1);
> +            current = zero + 1;
> +        } else {
> +            current = set;
> +        }
> +    }

I think I understood the logic here in the end...

But if we change the meaning of postcopy_discard_send_range() from
(begin, end) to (begin, length), I think everything becomes clearer, no?

        if (set <= end) {
            unsigned long zero = find_next_zero_bit(ms->sentmap,
                                                    end + 1, set + 1);
            unsigned long length;

            if (zero > end) {
                length = end - set;
            } else {
                length = zero - 1 - set;
                current = zero + 1;
            }
            postcopy_discard_send_range(ms, pds, set, length);
        } else {
            current = set;
        }
    }

I would claim that if we call one 'zero', the other should be called
'one'.  Or change to set/unset, but that is just me.  Yes, I haven't
tested, and it is possible that there is an off-by-one somewhere...


Looking at postcopy_each_ram_send_discard, I even think that it would be
a good idea to pass length to this function.
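The run-extraction being discussed can be tried standalone; below is a minimal sketch of the (begin, end) form of the scan over a single-word bitmap, with toy stand-ins for find_next_bit()/find_next_zero_bit() (all names here are illustrative, not QEMU's):

```c
#include <assert.h>
#include <limits.h>

/* Toy stand-in for find_next_bit()/find_next_zero_bit() over one word:
 * returns the first index >= 'from' holding 'value', or 'size' if none. */
static unsigned long next_bit(unsigned long map, unsigned long size,
                              unsigned long from, int value)
{
    unsigned long i;
    for (i = from; i < size; i++) {
        if (((map >> i) & 1UL) == (unsigned long)value) {
            return i;
        }
    }
    return size; /* like the kernel helpers: 'size' means not found */
}

/* Collect runs of set bits within [start, end] (inclusive) as
 * (first, last) pairs, mirroring the (begin, end) convention in the
 * patch.  Returns the number of runs written into firsts[]/lasts[]. */
static int scan_runs(unsigned long map, unsigned long start,
                     unsigned long end,
                     unsigned long *firsts, unsigned long *lasts)
{
    unsigned long current = start;
    int n = 0;

    while (current <= end) {
        unsigned long set = next_bit(map, end + 1, current, 1);

        if (set > end) {
            break; /* no more set bits in range */
        }
        unsigned long zero = next_bit(map, end + 1, set + 1, 0);
        if (zero > end) {
            zero = end + 1; /* run extends to the end of the range */
        }
        firsts[n] = set;
        lasts[n] = zero - 1;
        n++;
        current = zero + 1;
    }
    return n;
}
```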

> +/*
> + * Transmit the set of pages to be discarded after precopy to the target
> + * these are pages that:
> + *     a) Have been previously transmitted but are now dirty again
> + *     b) Pages that have never been transmitted, this ensures that
> + *        any pages on the destination that have been mapped by background
> + *        tasks get discarded (transparent huge pages is the specific concern)
> + * Hopefully this is pretty sparse
> + */
> +int ram_postcopy_send_discard_bitmap(MigrationState *ms)
> +{
> +    int ret;
> +
> +    rcu_read_lock();
> +
> +    /* This should be our last sync, the src is now paused */
> +    migration_bitmap_sync();
> +
> +    /*
> +     * Update the sentmap to be sentmap = ~sentmap | dirty
> +     */
> +    bitmap_complement(ms->sentmap, ms->sentmap,
> +               last_ram_offset() >> TARGET_PAGE_BITS);
> +
> +    bitmap_or(ms->sentmap, ms->sentmap, migration_bitmap,
> +               last_ram_offset() >> TARGET_PAGE_BITS);

These bitmaps are really big; I don't know how long these operations
would take here, but I think that we can avoid (at least) the
bitmap_complement.  We can change the bitmap name to notsentbitmap,
initialize it to all ones and clear a bit each time that we send a page, no?

We can also do the bitmap_or() at migration_bitmap_sync() time; at that
point we shouldn't be on the critical path, right?
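That equivalence can be checked on a toy one-word bitmap: complementing a sentmap and OR-ing in the dirty map gives the same discard set as keeping a 'not sent' map that starts all-ones and has bits cleared on send (a sketch with made-up maps, not QEMU code):

```c
#include <assert.h>

/* One-word toy bitmaps: bit n == page n. */
typedef unsigned long map_t;

/* Variant 1, as in the patch: track sent pages, then
 * discard = ~sentmap | dirty, masked to the pages in use. */
static map_t discard_from_sentmap(map_t sentmap, map_t dirty, int npages)
{
    map_t mask = (npages >= 64) ? ~0UL : ((1UL << npages) - 1);
    return (~sentmap | dirty) & mask;
}

/* Variant 2, the proposal: start with all pages 'unsent' and clear a
 * bit on send, so no complement pass is needed at the end. */
static map_t discard_from_unsentmap(map_t unsentmap, map_t dirty, int npages)
{
    map_t mask = (npages >= 64) ? ~0UL : ((1UL << npages) - 1);
    return (unsentmap | dirty) & mask;
}
```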

Later, Juan.
Dr. David Alan Gilbert Oct. 30, 2015, 6:43 p.m. UTC | #2
* Juan Quintela (quintela@redhat.com) wrote:
> "Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > Where postcopy is preceded by a period of precopy, the destination will
> > have received pages that may have been dirtied on the source after the
> > page was sent.  The destination must throw these pages away before
> > starting its CPUs.
> >
> > Maintain a 'sentmap' of pages that have already been sent.
> > Calculate list of sent & dirty pages
> > Provide helpers on the destination side to discard these.
> >
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reviewed-by: Amit Shah <amit.shah@redhat.com>
> 
> 
> Hi

(I'm going to reply to this mail in a few separate mails as I get
to them)

> >      /* Flag set once the migration has been asked to enter postcopy */
> >      bool start_postcopy;
> 
> 
> This is from a previous patch, but ....
> 
> Change the name of the variable or the comment?  From the comment it
> should be "in_postcopy", no?

We have to be careful to differentiate between two separate things:
  1) The user has issued 'migrate_start_postcopy'
     - that sets this 'start_postcopy' flag

  2) The non-postcopiable data has dropped below the limit and we've
     now been able to take notice of 'start_postcopy' and actually
     enter postcopy.

  I think 'in_postcopy' would imply (2); while 'start_postcopy'
  matches the command that's been issued.

> > +struct PostcopyDiscardState {
> > +    const char *name;
> 
> It is not obvious to me what name means here.  I assume it's the ram
> block name; change it to block_name or ramblock?

Now ramblock_name.

> 
> > + * returns: 0 on success.
> > + */
> > +int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> > +                               size_t length)
> > +{
> > +    trace_postcopy_ram_discard_range(start, length);
> > +    if (madvise(start, length, MADV_DONTNEED)) {
> > +        error_report("%s MADV_DONTNEED: %s", __func__, strerror(errno));
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  #else
> >  /* No target OS support, stubs just fail */
> >  bool postcopy_ram_supported_by_host(void)
> > @@ -153,5 +192,95 @@ bool postcopy_ram_supported_by_host(void)
> >      return false;
> >  }
> >  
> > +int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> > +                               size_t length)
> > +{
> > +    assert(0);
> 
> I will assume that just returning -1 would work here.
> 
> But yes, I understand that this code shouldn't be reached ...

Yes, it really shouldn't happen if the previous code that says
postcopy isn't supported has been obeyed; I'm happy to change
it if you want.

> > +}
> >  #endif
> >  
> > +/* ------------------------------------------------------------------------- */
> > +
> > +/**
> > + * postcopy_discard_send_init: Called at the start of each RAMBlock before
> > + *   asking to discard individual ranges.
> > + *
> > + * @ms: The current migration state.
> > + * @offset: the bitmap offset of the named RAMBlock in the migration
> > + *   bitmap.
> > + * @name: RAMBlock that discards will operate on.
> > + *
> > + * returns: a new PDS.
> > + */
> > +PostcopyDiscardState *postcopy_discard_send_init(MigrationState *ms,
> > +                                                 unsigned long offset,
> > +                                                 const char *name)
> > +{
> > +    PostcopyDiscardState *res = g_try_malloc(sizeof(PostcopyDiscardState));
> 
> Why are we using g_try_malloc here instead of g_malloc()?  Even
> g_malloc0()?
>
> Especially since we don't check whether res is NULL on return.  Please change.

Eek yes; I've gone with malloc0.

> 
> 
> > +
> > +    if (res) {
> > +        res->name = name;
> > +        res->cur_entry = 0;
> > +        res->nsentwords = 0;
> > +        res->nsentcmds = 0;
> 
> With the zero variant, these three can be removed.

Done.

> 
> > +        res->offset = offset;
> > +    }
> > +
> > +    return res;
> > +}
> 
> > -/* Called with rcu_read_lock() to protect migration_bitmap */
> > +/* Called with rcu_read_lock() to protect migration_bitmap
> > + * mr: The region to search for dirty pages in
> 
> Haha, you forgot to update the comment when you moved the function to
> use ram blocks O:-)

Oops, fixed :-)

(Rest of the patch another time)

Dave

> > @@ -662,6 +672,24 @@ static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
> >  }
> >  
> >  /**
> > + * ram_find_block_by_id: Find a ramblock by name.
> > + *
> > + * Returns: The RAMBlock with matching ID, or NULL.
> > + */
> > +static RAMBlock *ram_find_block_by_id(const char *id)
> > +{
> > +    RAMBlock *block;
> > +
> > +    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> > +        if (!strcmp(id, block->idstr)) {
> > +            return block;
> > +        }
> > +    }
> > +
> > +    return NULL;
> > +}
> 
> We don't have this function already.....
> 
> Once here, could we split it in its own patch and use it in ram_load?
> 
> 
>                 QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
>                     if (!strncmp(id, block->idstr, sizeof(id))) {
>                         if (length != block->used_length) {
>                             Error *local_err = NULL;
> 
>                             ret = qemu_ram_resize(block->offset, length, &local_err);
>                             if (local_err) {
>                                 error_report_err(local_err);
>                             }
>                         }
>                         ram_control_load_hook(f, RAM_CONTROL_BLOCK_REG,
>                                               block->idstr);
>                         break;
>                     }
>                 }
> 
>                 if (!block) {
>                     error_report("Unknown ramblock \"%s\", cannot "
>                                  "accept migration", id);
>                     ret = -EINVAL;
>                 }
> 
> 
> We could also use it in:
> 
> host_from_stream_offset
> 
> 
> > +/* **** functions for postcopy ***** */
> > +
> > +/*
> > + * Callback from postcopy_each_ram_send_discard for each RAMBlock
> > + * start,end: Indexes into the bitmap for the first and last bit
> > + *            representing the named block
> > + */
> > +static int postcopy_send_discard_bm_ram(MigrationState *ms,
> > +                                        PostcopyDiscardState *pds,
> > +                                        unsigned long start, unsigned long end)
> > +{
> > +    unsigned long current;
> > +
> > +    for (current = start; current <= end; ) {
> > +        unsigned long set = find_next_bit(ms->sentmap, end + 1, current);
> > +
> > +        if (set <= end) {
> > +            unsigned long zero = find_next_zero_bit(ms->sentmap,
> > +                                                    end + 1, set + 1);
> > +
> > +            if (zero > end) {
> > +                zero = end + 1;
> > +            }
> > +            postcopy_discard_send_range(ms, pds, set, zero - 1);
> > +            current = zero + 1;
> > +        } else {
> > +            current = set;
> > +        }
> > +    }
> 
> I think I understood the logic here in the end...
> 
> But if we change the meaning of postcopy_discard_send_range() from
> (begin, end) to (begin, length), I think everything becomes clearer, no?
> 
>         if (set <= end) {
>             unsigned long zero = find_next_zero_bit(ms->sentmap,
>                                                     end + 1, set + 1);
>             unsigned long length;
> 
>             if (zero > end) {
>                 length = end - set;
>             } else {
>                 length = zero - 1 - set;
>                 current = zero + 1;
>             }
>             postcopy_discard_send_range(ms, pds, set, length);
>         } else {
>             current = set;
>         }
>     }
> 
> I would claim that if we call one 'zero', the other should be called
> 'one'.  Or change to set/unset, but that is just me.  Yes, I haven't
> tested, and it is possible that there is an off-by-one somewhere...
> 
> 
> Looking at postcopy_each_ram_send_discard, I even think that it would be
> a good idea to pass length to this function.
> 
> > +/*
> > + * Transmit the set of pages to be discarded after precopy to the target
> > + * these are pages that:
> > + *     a) Have been previously transmitted but are now dirty again
> > + *     b) Pages that have never been transmitted, this ensures that
> > + *        any pages on the destination that have been mapped by background
> > + *        tasks get discarded (transparent huge pages is the specific concern)
> > + * Hopefully this is pretty sparse
> > + */
> > +int ram_postcopy_send_discard_bitmap(MigrationState *ms)
> > +{
> > +    int ret;
> > +
> > +    rcu_read_lock();
> > +
> > +    /* This should be our last sync, the src is now paused */
> > +    migration_bitmap_sync();
> > +
> > +    /*
> > +     * Update the sentmap to be sentmap = ~sentmap | dirty
> > +     */
> > +    bitmap_complement(ms->sentmap, ms->sentmap,
> > +               last_ram_offset() >> TARGET_PAGE_BITS);
> > +
> > +    bitmap_or(ms->sentmap, ms->sentmap, migration_bitmap,
> > +               last_ram_offset() >> TARGET_PAGE_BITS);
> 
> These bitmaps are really big; I don't know how long these operations
> would take here, but I think that we can avoid (at least) the
> bitmap_complement.  We can change the bitmap name to notsentbitmap,
> initialize it to all ones and clear a bit each time that we send a page, no?
> 
> We can also do the bitmap_or() at migration_bitmap_sync() time; at that
> point we shouldn't be on the critical path, right?
> 
> Later, Juan.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Dr. David Alan Gilbert Nov. 2, 2015, 5:31 p.m. UTC | #3
* Juan Quintela (quintela@redhat.com) wrote:

> > +/*
> > + * Transmit the set of pages to be discarded after precopy to the target
> > + * these are pages that:
> > + *     a) Have been previously transmitted but are now dirty again
> > + *     b) Pages that have never been transmitted, this ensures that
> > + *        any pages on the destination that have been mapped by background
> > + *        tasks get discarded (transparent huge pages is the specific concern)
> > + * Hopefully this is pretty sparse
> > + */
> > +int ram_postcopy_send_discard_bitmap(MigrationState *ms)
> > +{
> > +    int ret;
> > +
> > +    rcu_read_lock();
> > +
> > +    /* This should be our last sync, the src is now paused */
> > +    migration_bitmap_sync();
> > +
> > +    /*
> > +     * Update the sentmap to be sentmap = ~sentmap | dirty
> > +     */
> > +    bitmap_complement(ms->sentmap, ms->sentmap,
> > +               last_ram_offset() >> TARGET_PAGE_BITS);
> > +
> > +    bitmap_or(ms->sentmap, ms->sentmap, migration_bitmap,
> > +               last_ram_offset() >> TARGET_PAGE_BITS);
> 
> These bitmaps are really big; I don't know how long these operations
> would take here, but I think that we can avoid (at least) the
> bitmap_complement.  We can change the bitmap name to notsentbitmap,
> initialize it to all ones and clear a bit each time that we send a page, no?

Done, it's now 'unsentmap' - although I suspect the complement step is
probably one of the simpler steps in the process, I'm not sure it's a vast
saving.

> We can also do the bitmap_or() at migration_bitmap_sync() time; at that
> point we shouldn't be on the critical path, right?

Given that we're doing the bitmap_sync immediately before the OR, I don't
understand that line; are you talking about a modified migration_bitmap_sync?

Dave

> 
> Later, Juan.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Dr. David Alan Gilbert Nov. 2, 2015, 6:19 p.m. UTC | #4
* Juan Quintela (quintela@redhat.com) wrote:

> > @@ -662,6 +672,24 @@ static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
> >  }
> >  
> >  /**
> > + * ram_find_block_by_id: Find a ramblock by name.
> > + *
> > + * Returns: The RAMBlock with matching ID, or NULL.
> > + */
> > +static RAMBlock *ram_find_block_by_id(const char *id)
> > +{
> > +    RAMBlock *block;
> > +
> > +    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> > +        if (!strcmp(id, block->idstr)) {
> > +            return block;
> > +        }
> > +    }
> > +
> > +    return NULL;
> > +}
> 
> We don't have this function already.....
> 
> Once here, could we split it in its own patch and use it in ram_load?
> 
> 
>                 QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
>                     if (!strncmp(id, block->idstr, sizeof(id))) {
>                         if (length != block->used_length) {
>                             Error *local_err = NULL;
> 
>                             ret = qemu_ram_resize(block->offset, length, &local_err);
>                             if (local_err) {
>                                 error_report_err(local_err);
>                             }
>                         }
>                         ram_control_load_hook(f, RAM_CONTROL_BLOCK_REG,
>                                               block->idstr);
>                         break;
>                     }
>                 }
> 
>                 if (!block) {
>                     error_report("Unknown ramblock \"%s\", cannot "
>                                  "accept migration", id);
>                     ret = -EINVAL;
>                 }
> 
> 
> We could also use it in:
> 
> host_from_stream_offset

Done; replaced both uses and it's now called 'qemu_ram_block_by_name'
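The lookup itself is just a strcmp() walk over the block list; a toy stand-in (illustrative types, not QEMU's RAMBlock or its RCU list) might look like:

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for the RAMBlock list lookup: a singly linked list
 * searched by idstr, as qemu_ram_block_by_name does. */
typedef struct Block {
    char idstr[32];
    struct Block *next;
} Block;

static Block *block_by_name(Block *head, const char *id)
{
    Block *b;
    for (b = head; b; b = b->next) {
        if (!strcmp(id, b->idstr)) {
            return b;
        }
    }
    return NULL; /* caller reports "Unknown ramblock" */
}
```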

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Dr. David Alan Gilbert Nov. 2, 2015, 8:14 p.m. UTC | #5
* Juan Quintela (quintela@redhat.com) wrote:
> > +/* **** functions for postcopy ***** */
> > +
> > +/*
> > + * Callback from postcopy_each_ram_send_discard for each RAMBlock
> > + * start,end: Indexes into the bitmap for the first and last bit
> > + *            representing the named block
> > + */
> > +static int postcopy_send_discard_bm_ram(MigrationState *ms,
> > +                                        PostcopyDiscardState *pds,
> > +                                        unsigned long start, unsigned long end)
> > +{
> > +    unsigned long current;
> > +
> > +    for (current = start; current <= end; ) {
> > +        unsigned long set = find_next_bit(ms->sentmap, end + 1, current);
> > +
> > +        if (set <= end) {
> > +            unsigned long zero = find_next_zero_bit(ms->sentmap,
> > +                                                    end + 1, set + 1);
> > +
> > +            if (zero > end) {
> > +                zero = end + 1;
> > +            }
> > +            postcopy_discard_send_range(ms, pds, set, zero - 1);
> > +            current = zero + 1;
> > +        } else {
> > +            current = set;
> > +        }
> > +    }
> 
> I think I understood the logic here in the end...
> 
> But if we change the meaning of postcopy_discard_send_range() from
> (begin, end) to (begin, length), I think everything becomes clearer, no?
> 
>         if (set <= end) {
>             unsigned long zero = find_next_zero_bit(ms->sentmap,
>                                                     end + 1, set + 1);
>             unsigned long length;
> 
>             if (zero > end) {
>                 length = end - set;
>             } else {
>                 length = zero - 1 - set;
>                 current = zero + 1;
>             }
>             postcopy_discard_send_range(ms, pds, set, length);
>         } else {
>             current = set;
>         }
>     }
> 
> I would claim that if we call one 'zero', the other should be called
> 'one'.  Or change to set/unset, but that is just me.  Yes, I haven't
> tested, and it is possible that there is an off-by-one somewhere...
> 
> Looking at postcopy_each_ram_send_discard, I even think that it would be
> a good idea to pass length to this function.

OK, so I've ended up with (build tested only so far):
/*
 * Callback from postcopy_each_ram_send_discard for each RAMBlock
 * Note: At this point the 'unsentmap' is the processed bitmap combined
 *       with the dirtymap; so a '1' means it's either dirty or unsent.
 * start,length: Indexes into the bitmap for the first bit
 *            representing the named block and length in target-pages
 */
static int postcopy_send_discard_bm_ram(MigrationState *ms,
                                        PostcopyDiscardState *pds,
                                        unsigned long start,
                                        unsigned long length)
{
    unsigned long end = start + length; /* one after the end */
    unsigned long current;

    for (current = start; current < end; ) {
        unsigned long one = find_next_bit(ms->unsentmap, end, current);

        if (one <= end) {
            unsigned long zero = find_next_zero_bit(ms->unsentmap,
                                                    end, one + 1);
            unsigned long discard_length;

            if (zero >= end) {
                discard_length = end - one;
            } else {
                discard_length = zero - one;
            }
            postcopy_discard_send_range(ms, pds, one, discard_length);
            current = one + discard_length;
        } else {
            current = one;
        }
    }

    return 0;
}

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
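The PostcopyDiscardState in the patch below batches discard ranges into commands of at most MAX_DISCARDS_PER_COMMAND entries before sending; the batching mechanism can be sketched in isolation (a reduced toy version, with transmission replaced by a counter):

```c
#include <assert.h>
#include <stdint.h>

/* Toy version of the PDS batching: queue (start, length) pairs and
 * flush (here: just count commands) whenever the per-command limit is
 * reached.  The limit and field names echo the patch, but this is
 * illustrative, not the real sender. */
#define MAX_PER_CMD 12

typedef struct {
    uint64_t start_list[MAX_PER_CMD];
    uint64_t length_list[MAX_PER_CMD];
    unsigned cur_entry;
    unsigned nsentcmds;
} ToyPDS;

static void toy_flush(ToyPDS *pds)
{
    if (pds->cur_entry) {
        /* a real implementation would transmit the queued pairs here */
        pds->nsentcmds++;
        pds->cur_entry = 0;
    }
}

/* Queue one range; flush automatically when the command fills up. */
static void toy_send_range(ToyPDS *pds, uint64_t start, uint64_t length)
{
    pds->start_list[pds->cur_entry] = start;
    pds->length_list[pds->cur_entry] = length;
    if (++pds->cur_entry == MAX_PER_CMD) {
        toy_flush(pds);
    }
}

/* Flush any partial command; returns total commands sent. */
static unsigned toy_finish(ToyPDS *pds)
{
    toy_flush(pds);
    return pds->nsentcmds;
}
```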

Patch

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 219032d..4904d00 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -130,6 +130,13 @@  struct MigrationState
 
     /* Flag set once the migration has been asked to enter postcopy */
     bool start_postcopy;
+
+    /* bitmap of pages that have been sent at least once
+     * only maintained and used in postcopy at the moment
+     * where it's used to send the dirtymap at the start
+     * of the postcopy phase
+     */
+    unsigned long *sentmap;
 };
 
 void process_incoming_migration(QEMUFile *f);
@@ -199,6 +206,11 @@  double xbzrle_mig_cache_miss_rate(void);
 
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 void ram_debug_dump_bitmap(unsigned long *todump, bool expected);
+/* For outgoing discard bitmap */
+int ram_postcopy_send_discard_bitmap(MigrationState *ms);
+/* For incoming postcopy discard */
+int ram_discard_range(MigrationIncomingState *mis, const char *block_name,
+                      uint64_t start, size_t length);
 
 /**
  * @migrate_add_blocker - prevent migration from proceeding
diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
index d81934f..80ed2d9 100644
--- a/include/migration/postcopy-ram.h
+++ b/include/migration/postcopy-ram.h
@@ -16,4 +16,39 @@ 
 /* Return true if the host supports everything we need to do postcopy-ram */
 bool postcopy_ram_supported_by_host(void);
 
+/*
+ * Discard the contents of 'length' bytes from 'start'
+ * We can assume that if we've been called postcopy_ram_hosttest returned true
+ */
+int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
+                               size_t length);
+
+
+/*
+ * Called at the start of each RAMBlock by the bitmap code.
+ * 'offset' is the bitmap offset of the named RAMBlock in the migration
+ * bitmap.
+ * Returns a new PDS
+ */
+PostcopyDiscardState *postcopy_discard_send_init(MigrationState *ms,
+                                                 unsigned long offset,
+                                                 const char *name);
+
+/*
+ * Called by the bitmap code for each chunk to discard.
+ * May send a discard message, may just leave it queued to
+ * be sent later.
+ * 'start' and 'end' describe an inclusive range of pages in the
+ * migration bitmap in the RAM block passed to postcopy_discard_send_init.
+ */
+void postcopy_discard_send_range(MigrationState *ms, PostcopyDiscardState *pds,
+                                 unsigned long start, unsigned long end);
+
+/*
+ * Called at the end of each RAMBlock by the bitmap code.
+ * Sends any outstanding discard messages, frees the PDS.
+ */
+void postcopy_discard_send_finish(MigrationState *ms,
+                                  PostcopyDiscardState *pds);
+
 #endif
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 0bf7967..32332b8 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -63,6 +63,7 @@  typedef struct PCMachineState PCMachineState;
 typedef struct PCMachineClass PCMachineClass;
 typedef struct PCMCIACardState PCMCIACardState;
 typedef struct PixelFormat PixelFormat;
+typedef struct PostcopyDiscardState PostcopyDiscardState;
 typedef struct PropertyInfo PropertyInfo;
 typedef struct Property Property;
 typedef struct QEMUBH QEMUBH;
diff --git a/migration/migration.c b/migration/migration.c
index 2ae5909..b57a0e6 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -24,6 +24,7 @@ 
 #include "qemu/sockets.h"
 #include "qemu/rcu.h"
 #include "migration/block.h"
+#include "migration/postcopy-ram.h"
 #include "qemu/thread.h"
 #include "qmp-commands.h"
 #include "trace.h"
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index cdd0168..10c9cab 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -27,6 +27,24 @@ 
 #include "qemu/error-report.h"
 #include "trace.h"
 
+/* Arbitrary limit on size of each discard command,
+ * keeps them around ~200 bytes
+ */
+#define MAX_DISCARDS_PER_COMMAND 12
+
+struct PostcopyDiscardState {
+    const char *name;
+    uint64_t offset; /* Bitmap index of the first bit of this RAMBlock */
+    uint16_t cur_entry;
+    /*
+     * Start and length of a discard range (bytes)
+     */
+    uint64_t start_list[MAX_DISCARDS_PER_COMMAND];
+    uint64_t length_list[MAX_DISCARDS_PER_COMMAND];
+    unsigned int nsentwords;
+    unsigned int nsentcmds;
+};
+
 /* Postcopy needs to detect accesses to pages that haven't yet been copied
  * across, and efficiently map new pages in, the techniques for doing this
  * are target OS specific.
@@ -145,6 +163,27 @@  out:
     return ret;
 }
 
+/**
+ * postcopy_ram_discard_range: Discard a range of memory.
+ * We can assume that, if we've been called, postcopy_ram_supported_by_host()
+ * returned true.
+ *
+ * @mis: Current incoming migration state.
+ * @start, @length: range of memory to discard.
+ *
+ * returns: 0 on success.
+ */
+int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
+                               size_t length)
+{
+    trace_postcopy_ram_discard_range(start, length);
+    if (madvise(start, length, MADV_DONTNEED)) {
+        error_report("%s MADV_DONTNEED: %s", __func__, strerror(errno));
+        return -1;
+    }
+
+    return 0;
+}
+
 #else
 /* No target OS support, stubs just fail */
 bool postcopy_ram_supported_by_host(void)
@@ -153,5 +192,95 @@  bool postcopy_ram_supported_by_host(void)
     return false;
 }
 
+int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
+                               size_t length)
+{
+    assert(0);
+    return -1;
+}
 #endif
 
+/* ------------------------------------------------------------------------- */
+
+/**
+ * postcopy_discard_send_init: Called at the start of each RAMBlock before
+ *   asking to discard individual ranges.
+ *
+ * @ms: The current migration state.
+ * @offset: the bitmap offset of the named RAMBlock in the migration
+ *   bitmap.
+ * @name: RAMBlock that discards will operate on.
+ *
+ * returns: a new PDS.
+ */
+PostcopyDiscardState *postcopy_discard_send_init(MigrationState *ms,
+                                                 unsigned long offset,
+                                                 const char *name)
+{
+    PostcopyDiscardState *res = g_try_malloc(sizeof(PostcopyDiscardState));
+
+    if (res) {
+        res->name = name;
+        res->cur_entry = 0;
+        res->nsentwords = 0;
+        res->nsentcmds = 0;
+        res->offset = offset;
+    }
+
+    return res;
+}
+
+/**
+ * postcopy_discard_send_range: Called by the bitmap code for each chunk to
+ *   discard. May send a discard message, may just leave it queued to
+ *   be sent later.
+ *
+ * @ms: Current migration state.
+ * @pds: Structure initialised by postcopy_discard_send_init().
+ * @start,@end: an inclusive range of pages in the migration bitmap in the
+ *   RAM block passed to postcopy_discard_send_init().
+ */
+void postcopy_discard_send_range(MigrationState *ms, PostcopyDiscardState *pds,
+                                 unsigned long start, unsigned long end)
+{
+    size_t tp_bits = qemu_target_page_bits();
+    /* Convert to byte offsets within the RAM block */
+    pds->start_list[pds->cur_entry] = (start - pds->offset) << tp_bits;
+    pds->length_list[pds->cur_entry] = ((1 + end - pds->offset) << tp_bits) -
+                                       pds->start_list[pds->cur_entry];
+    pds->cur_entry++;
+    pds->nsentwords++;
+
+    if (pds->cur_entry == MAX_DISCARDS_PER_COMMAND) {
+        /* Full set, ship it! */
+        qemu_savevm_send_postcopy_ram_discard(ms->file, pds->name,
+                                              pds->cur_entry,
+                                              pds->start_list,
+                                              pds->length_list);
+        pds->nsentcmds++;
+        pds->cur_entry = 0;
+    }
+}
+
+/**
+ * postcopy_discard_send_finish: Called at the end of each RAMBlock by the
+ * bitmap code. Sends any outstanding discard messages, frees the PDS
+ *
+ * @ms: Current migration state.
+ * @pds: Structure initialised by postcopy_discard_send_init().
+ */
+void postcopy_discard_send_finish(MigrationState *ms, PostcopyDiscardState *pds)
+{
+    /* Anything unsent? */
+    if (pds->cur_entry) {
+        qemu_savevm_send_postcopy_ram_discard(ms->file, pds->name,
+                                              pds->cur_entry,
+                                              pds->start_list,
+                                              pds->length_list);
+        pds->nsentcmds++;
+    }
+
+    trace_postcopy_discard_send_finish(pds->name, pds->nsentwords,
+                                       pds->nsentcmds);
+
+    g_free(pds);
+}
diff --git a/migration/ram.c b/migration/ram.c
index 8644675..e1c6c4a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -32,6 +32,7 @@ 
 #include "qemu/timer.h"
 #include "qemu/main-loop.h"
 #include "migration/migration.h"
+#include "migration/postcopy-ram.h"
 #include "exec/address-spaces.h"
 #include "migration/page_cache.h"
 #include "qemu/error-report.h"
@@ -506,10 +507,18 @@  static int save_xbzrle_page(QEMUFile *f, uint8_t **current_data,
     return 1;
 }
 
-/* Called with rcu_read_lock() to protect migration_bitmap */
+/* Called with rcu_read_lock() to protect migration_bitmap
+ * rb: The RAMBlock to search for dirty pages in
+ * start: Start address (typically so we can continue from previous page)
+ * ram_addr_abs: Pointer into which to store the address of the dirty page
+ *               within the global ram_addr space
+ *
+ * Returns: byte offset within memory region of the start of a dirty page
+ */
 static inline
 ram_addr_t migration_bitmap_find_and_reset_dirty(RAMBlock *rb,
-                                                 ram_addr_t start)
+                                                 ram_addr_t start,
+                                                 ram_addr_t *ram_addr_abs)
 {
     unsigned long base = rb->offset >> TARGET_PAGE_BITS;
     unsigned long nr = base + (start >> TARGET_PAGE_BITS);
@@ -530,6 +539,7 @@  ram_addr_t migration_bitmap_find_and_reset_dirty(RAMBlock *rb,
         clear_bit(next, bitmap);
         migration_dirty_pages--;
     }
+    *ram_addr_abs = next << TARGET_PAGE_BITS;
     return (next - base) << TARGET_PAGE_BITS;
 }
 
@@ -662,6 +672,24 @@  static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
 }
 
 /**
+ * ram_find_block_by_id: Find a ramblock by name.
+ *
+ * Returns: The RAMBlock with matching ID, or NULL.
+ */
+static RAMBlock *ram_find_block_by_id(const char *id)
+{
+    RAMBlock *block;
+
+    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+        if (!strcmp(id, block->idstr)) {
+            return block;
+        }
+    }
+
+    return NULL;
+}
+
+/**
  * ram_save_page: Send the given page to the stream
  *
  * Returns: Number of pages written.
@@ -926,12 +954,15 @@  static int ram_save_compressed_page(QEMUFile *f, RAMBlock *block,
  * @f: Current migration stream.
  * @pss: Data about the state of the current dirty page scan.
  * @*again: Set to false if the search has scanned the whole of RAM
+ * @ram_addr_abs: Pointer into which to store the address of the dirty page
+ *                within the global ram_addr space
  */
 static bool find_dirty_block(QEMUFile *f, PageSearchStatus *pss,
-                             bool *again)
+                             bool *again, ram_addr_t *ram_addr_abs)
 {
     pss->offset = migration_bitmap_find_and_reset_dirty(pss->block,
-                                                       pss->offset);
+                                                       pss->offset,
+                                                       ram_addr_abs);
     if (pss->complete_round && pss->block == last_seen_block &&
         pss->offset >= last_offset) {
         /*
@@ -989,6 +1020,8 @@  static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
     PageSearchStatus pss;
     int pages = 0;
     bool again, found;
+    ram_addr_t dirty_ram_abs; /* Address of the start of the dirty page in
+                                 ram_addr_t space */
 
     pss.block = last_seen_block;
     pss.offset = last_offset;
@@ -999,7 +1032,7 @@  static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
     }
 
     do {
-        found = find_dirty_block(f, &pss, &again);
+        found = find_dirty_block(f, &pss, &again, &dirty_ram_abs);
 
         if (found) {
             if (compression_switch && migrate_use_compression()) {
@@ -1013,7 +1046,11 @@  static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
 
             /* if page is unmodified, continue to the next */
             if (pages > 0) {
+                MigrationState *ms = migrate_get_current();
                 last_sent_block = pss.block;
+                if (ms->sentmap) {
+                    set_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap);
+                }
             }
         }
     } while (!pages && again);
@@ -1071,6 +1108,8 @@  void free_xbzrle_decoded_buf(void)
 
 static void migration_end(void)
 {
+    MigrationState *s = migrate_get_current();
+
     /* caller have hold iothread lock or is in a bh, so there is
      * no writing race against this migration_bitmap
      */
@@ -1082,6 +1121,9 @@  static void migration_end(void)
         g_free(bitmap);
     }
 
+    g_free(s->sentmap);
+    s->sentmap = NULL;
+
     XBZRLE_cache_lock();
     if (XBZRLE.cache) {
         cache_fini(XBZRLE.cache);
@@ -1174,6 +1216,166 @@  void ram_debug_dump_bitmap(unsigned long *todump, bool expected)
     }
 }
 
+/* **** functions for postcopy ***** */
+
+/*
+ * Callback from postcopy_each_ram_send_discard for each RAMBlock
+ * start,end: Indexes into the bitmap for the first and last bit
+ *            representing the named block
+ */
+static int postcopy_send_discard_bm_ram(MigrationState *ms,
+                                        PostcopyDiscardState *pds,
+                                        unsigned long start, unsigned long end)
+{
+    unsigned long current;
+
+    for (current = start; current <= end; ) {
+        unsigned long set = find_next_bit(ms->sentmap, end + 1, current);
+
+        if (set <= end) {
+            unsigned long zero = find_next_zero_bit(ms->sentmap,
+                                                    end + 1, set + 1);
+
+            if (zero > end) {
+                zero = end + 1;
+            }
+            postcopy_discard_send_range(ms, pds, set, zero - 1);
+            current = zero + 1;
+        } else {
+            current = set;
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Utility for the outgoing postcopy code.
+ *   Calls postcopy_send_discard_bm_ram for each RAMBlock
+ *   passing it bitmap indexes and name.
+ * Returns: 0 on success
+ * (qemu_ram_foreach_block ends up passing unscaled lengths
+ *  which would mean the postcopy code would have to deal with
+ *  target page sizes)
+ */
+static int postcopy_each_ram_send_discard(MigrationState *ms)
+{
+    struct RAMBlock *block;
+    int ret;
+
+    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+        unsigned long first = block->offset >> TARGET_PAGE_BITS;
+        unsigned long last = (block->offset + (block->used_length - 1))
+                                >> TARGET_PAGE_BITS;
+        PostcopyDiscardState *pds = postcopy_discard_send_init(ms,
+                                                               first,
+                                                               block->idstr);
+
+        /*
+         * Postcopy sends chunks of bitmap over the wire, but at this
+         * point it just needs indexes, which avoids it having
+         * target page specific code.
+         */
+        ret = postcopy_send_discard_bm_ram(ms, pds, first, last);
+        postcopy_discard_send_finish(ms, pds);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Transmit the set of pages to be discarded after precopy to the target;
+ * these are pages that:
+ *     a) Have been previously transmitted but are now dirty again
+ *     b) Have never been transmitted; this ensures that any pages on the
+ *        destination that have been mapped by background tasks get
+ *        discarded (transparent huge pages are the specific concern)
+ * Hopefully this is pretty sparse
+ */
+int ram_postcopy_send_discard_bitmap(MigrationState *ms)
+{
+    int ret;
+
+    rcu_read_lock();
+
+    /* This should be our last sync, the src is now paused */
+    migration_bitmap_sync();
+
+    /*
+     * Update the sentmap to be sentmap = ~sentmap | dirty
+     */
+    bitmap_complement(ms->sentmap, ms->sentmap,
+               last_ram_offset() >> TARGET_PAGE_BITS);
+
+    bitmap_or(ms->sentmap, ms->sentmap, migration_bitmap,
+               last_ram_offset() >> TARGET_PAGE_BITS);
+
+
+    trace_ram_postcopy_send_discard_bitmap();
+#ifdef DEBUG_POSTCOPY
+    ram_debug_dump_bitmap(ms->sentmap, false);
+#endif
+
+    ret = postcopy_each_ram_send_discard(ms);
+    rcu_read_unlock();
+
+    return ret;
+}
+
+/*
+ * At the start of the postcopy phase of migration, any now-dirty
+ * precopied pages are discarded.
+ *
+ * start, length describe a byte address range within the RAMBlock
+ *
+ * Returns 0 on success.
+ */
+int ram_discard_range(MigrationIncomingState *mis,
+                      const char *block_name,
+                      uint64_t start, size_t length)
+{
+    int ret = -1;
+
+    rcu_read_lock();
+    RAMBlock *rb = ram_find_block_by_id(block_name);
+
+    if (!rb) {
+        error_report("ram_discard_range: Failed to find block '%s'",
+                     block_name);
+        goto err;
+    }
+
+    uint8_t *host_startaddr = rb->host + start;
+
+    if ((uintptr_t)host_startaddr & (qemu_host_page_size - 1)) {
+        error_report("ram_discard_range: Unaligned start address: %p",
+                     host_startaddr);
+        goto err;
+    }
+
+    if ((start + length) <= rb->used_length) {
+        uint8_t *host_endaddr = host_startaddr + length;
+        if ((uintptr_t)host_endaddr & (qemu_host_page_size - 1)) {
+            error_report("ram_discard_range: Unaligned end address: %p",
+                         host_endaddr);
+            goto err;
+        }
+        ret = postcopy_ram_discard_range(mis, host_startaddr, length);
+    } else {
+        error_report("ram_discard_range: Overrun block '%s' (%" PRIu64
+                     "/%zu/%zu)",
+                     block_name, start, length, rb->used_length);
+    }
+
+err:
+    rcu_read_unlock();
+
+    return ret;
+}
+
+
 /* Each of ram_save_setup, ram_save_iterate and ram_save_complete has
  * long-running RCU critical section.  When rcu-reclaims in the code
  * start to become numerous it will be necessary to reduce the
@@ -1232,6 +1434,12 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
     migration_bitmap = bitmap_new(ram_bitmap_pages);
     bitmap_set(migration_bitmap, 0, ram_bitmap_pages);
 
+    if (migrate_postcopy_ram()) {
+        MigrationState *s = migrate_get_current();
+        s->sentmap = bitmap_new(ram_bitmap_pages);
+        bitmap_clear(s->sentmap, 0, ram_bitmap_pages);
+    }
+
     /*
      * Count the total number of pages used by ram blocks not including any
      * gaps due to alignment or unplugs.
diff --git a/migration/savevm.c b/migration/savevm.c
index 52fca3c..85462b1 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1347,7 +1347,6 @@  static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
     }
     trace_loadvm_postcopy_ram_handle_discard_header(ramid, len);
     while (len) {
-        /* TODO - ram_discard_range gets added in a later patch
         uint64_t start_addr, block_length;
         start_addr = qemu_get_be64(mis->from_src_file);
         block_length = qemu_get_be64(mis->from_src_file);
@@ -1358,7 +1357,6 @@  static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
         if (ret) {
             return ret;
         }
-        */
     }
     trace_loadvm_postcopy_ram_handle_discard_end();
 
diff --git a/trace-events b/trace-events
index e68e69d..aa2d2e7 100644
--- a/trace-events
+++ b/trace-events
@@ -1247,6 +1247,7 @@  qemu_file_fclose(void) ""
 migration_bitmap_sync_start(void) ""
 migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64""
 migration_throttle(void) ""
+ram_postcopy_send_discard_bitmap(void) ""
 
 # hw/display/qxl.c
 disable qxl_interface_set_mm_time(int qid, uint32_t mm_time) "%d %d"
@@ -1518,6 +1519,10 @@  rdma_start_incoming_migration_after_rdma_listen(void) ""
 rdma_start_outgoing_migration_after_rdma_connect(void) ""
 rdma_start_outgoing_migration_after_rdma_source_init(void) ""
 
+# migration/postcopy-ram.c
+postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
+postcopy_ram_discard_range(void *start, size_t length) "%p,+%zx"
+
 # kvm-all.c
 kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"
 kvm_vm_ioctl(int type, void *arg) "type 0x%x, arg %p"