mbox

[RFC,00/14] Migration thread

Message ID 1348236500-2565-1-git-send-email-quintela@redhat.com
State New
Headers show

Pull-request

http://repo.or.cz/r/qemu/quintela.git migration-thread-v3

Message

Juan Quintela Sept. 21, 2012, 2:08 p.m. UTC
Hi

This is work in progress on top of the previous migration series just sent.

- Introduces a thread for migration instead of using a timer and callback
- remove the writting to the fd from the iothread lock
- make the writes synchronous
- Introduce a new pending method that returns how many bytes are pending for
  one save live section
- last patch just shows printfs to see where the time is being spent
  on the migration complete phase.
  (yes it pollutes all uses of stop on the monitor)

So far I have found that we spent a lot of time on bdrv_flush_all() It
can take from 1ms to 600ms (yes, it is not a typo).  That dwarfs the
migration default downtime time (30ms).

Stop all vcpus:

- it works now (after the changes on qemu_cpu_is_vcpu on the previous
  series) caveat is that the time that brdv_flush_all() takes is
  "unpredictable".  Any silver bullets?

  Paolo suggested to call for migration completion phase:

  bdrv_aio_flush_all();
  Sent the dirty pages;
  bdrv_drain_all()
  brdv_flush_all()
  another round through the bitmap in case that completions have
  changed some page

  Paolo, did I get it right?
  Any other suggestion?

- migrate_cancel() is not properly implemented (as in the film that we
  take no locks, ...)

- expected_downtime is not calculated.

  I am about to merge migrate_fd_put_ready & buffered_thread() and
  that would make trivial to calculate.

It outputs something like:

wakeup_request 0
time cpu_disable_ticks 0
time pause_all_vcpus 1
time runstate_set 1
time vmstate_notify 2
time bdrv_drain_all 2
time flush device /dev/disk/by-path/ip-192.168.10.200:3260-iscsi-iqn.2010-12.org.trasno:iscsi.lvm-lun-1: 3
time flush device : 3
time flush device : 3
time flush device : 3
time bdrv_flush_all 5
time monitor_protocol_event 5
vm_stop 2 5
synchronize_all_states 1
migrate RAM 37
migrate rest devices 1
complete without error 3a 44
completed 45
end completed stage 45

As you can see, we estimate that we can sent all pending data in 30ms,
it took 37ms to send the RAM (that is what we calculate).  So
estimation is quite good.

What it gives me lots of variation is on the line with device name of "time flush device".
That is what varies between 1ms to 600ms

This is in a completely idle guest.  I am running:

	while (1) {
		uint64_t delay;

		if (gettimeofday(&t0, NULL) != 0)
			perror("gettimeofday 1");
		if (usleep(ms2us(10)) != 0)
			perror("usleep");
		if (gettimeofday(&t1, NULL) != 0)
			perror("gettimeofday 2");

		t1.tv_usec -= t0.tv_usec;
		if (t1.tv_usec < 0) {
			t1.tv_usec += 1000000;
			t1.tv_sec--;
		}
		t1.tv_sec -= t0.tv_sec;

		delay = t1.tv_sec * 1000 + t1.tv_usec/1000;

		if (delay > 100)
			printf("delay of %ld ms\n", delay);
       }

To see the latency inside the guest (i.e. ask for a 10ms sleep, and see how long it takes).


[root@d1 ~]# ./timer 
delay of 161 ms
delay of 135 ms
delay of 143 ms
delay of 132 ms
delay of 131 ms
delay of 141 ms
delay of 113 ms
delay of 119 ms
delay of 114 ms


But that values are independent of migration.  Without even starting
the migration, idle guest doing nothing, we get it sometimes.

Comments?



Thanks, Juan.


The following changes since commit 4bce0b88b10ed790ad3669ce4ff61c945cd655eb:

  cpus: create qemu_cpu_is_vcpu() (2012-09-21 10:43:10 +0200)

are available in the git repository at:

  http://repo.or.cz/r/qemu/quintela.git migration-thread-v3

for you to fetch changes up to 0e0f8dfd9fc308b790e55ceca5c2c193e1802417:

  migration: print times for end phase (2012-09-21 11:52:20 +0200)

Juan Quintela (11):
  buffered_file: Move from using a timer to use a thread
  migration: make qemu_fopen_ops_buffered() return void
  migration: stop all cpus correctly
  migration: make writes blocking
  migration: remove unfreeze logic
  migration: take finer locking
  buffered_file: Unfold the trick to restart generating migration data
  buffered_file: don't flush on put buffer
  buffered_file: unfold buffered_append in buffered_put_buffer
  savevm: New save live migration method: pending
  migration: print times for end phase

Paolo Bonzini (1):
  split MRU ram list

Umesh Deshpande (2):
  add a version number to ram_list
  protect the ramlist with a separate mutex

 arch_init.c       |  62 +++++++++++++-------------
 block-migration.c |  49 +++++---------------
 block.c           |   6 +++
 buffered_file.c   | 130 +++++++++++++++++++++++++-----------------------------
 buffered_file.h   |   2 +-
 cpu-all.h         |  13 +++++-
 cpus.c            |  17 +++++++
 exec.c            |  43 +++++++++++++++---
 migration-exec.c  |   2 -
 migration-fd.c    |   6 ---
 migration-tcp.c   |   2 +-
 migration-unix.c  |   2 -
 migration.c       | 108 +++++++++++++++++++--------------------------
 migration.h       |   4 +-
 qemu-file.h       |   5 ---
 savevm.c          |  37 +++++++++++++---
 sysemu.h          |   1 +
 vmstate.h         |   1 +
 18 files changed, 255 insertions(+), 235 deletions(-)

Comments

Paolo Bonzini Sept. 21, 2012, 2:42 p.m. UTC | #1
Il 21/09/2012 16:08, Juan Quintela ha scritto:
> From: Umesh Deshpande <udeshpan@redhat.com>
> 
> This will be used to detect if last_block might have become invalid
> across different calls to ram_save_live.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Umesh Deshpande <udeshpan@redhat.com>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  arch_init.c | 7 ++++++-
>  cpu-all.h   | 1 +
>  exec.c      | 4 ++++
>  3 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index d96e888..eb33fdd 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -336,6 +336,7 @@ static RAMBlock *last_block;
>  static ram_addr_t last_offset;
>  static unsigned long *migration_bitmap;
>  static uint64_t migration_dirty_pages;
> +static uint32_t last_version;
> 
>  static inline bool migration_bitmap_test_and_reset_dirty(MemoryRegion *mr,
>                                                           ram_addr_t offset)
> @@ -406,7 +407,6 @@ static void migration_bitmap_sync(void)
>      }
>  }
> 
> -
>  /*
>   * ram_save_block: Writes a page of memory to the stream f
>   *
> @@ -558,6 +558,7 @@ static void reset_ram_globals(void)
>  {
>      last_block = NULL;
>      last_offset = 0;
> +    last_version = ram_list.version;
>      sort_ram_list();
>  }
> 
> @@ -613,6 +614,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>      uint64_t expected_downtime;
>      MigrationState *s = migrate_get_current();
> 
> +    if (ram_list.version != last_version) {
> +        reset_ram_globals();
> +    }
> +
>      bytes_transferred_last = bytes_transferred;
>      bwidth = qemu_get_clock_ns(rt_clock);
> 

Assuming ram_save_iterate() will take the RAM lock throughout, that's ok.

> diff --git a/cpu-all.h b/cpu-all.h
> index 3e67019..6576229 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -497,6 +497,7 @@ typedef struct RAMBlock {
> 
>  typedef struct RAMList {
>      uint8_t *phys_dirty;
> +    uint32_t version;
>      QLIST_HEAD(, RAMBlock) blocks_mru;
>      QLIST_HEAD(, RAMBlock) blocks;
>  } RAMList;
> diff --git a/exec.c b/exec.c
> index 0db2beb..e9d1509 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2583,6 +2583,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
>      QLIST_INSERT_HEAD(&ram_list.blocks, new_block, next);
>      QLIST_INSERT_HEAD(&ram_list.blocks_mru, new_block, next_mru);
> 
> +    ram_list.version++;
> +
>      ram_list.phys_dirty = g_realloc(ram_list.phys_dirty,
>                                         last_ram_offset() >> TARGET_PAGE_BITS);
>      memset(ram_list.phys_dirty + (new_block->offset >> TARGET_PAGE_BITS),
> @@ -2610,6 +2612,7 @@ void qemu_ram_free_from_ptr(ram_addr_t addr)
>          if (addr == block->offset) {
>              QLIST_REMOVE(block, next);
>              QLIST_REMOVE(block, next_mru);
> +            ram_list.version++;
>              g_free(block);
>              return;
>          }
> @@ -2624,6 +2627,7 @@ void qemu_ram_free(ram_addr_t addr)
>          if (addr == block->offset) {
>              QLIST_REMOVE(block, next);
>              QLIST_REMOVE(block, next_mru);
> +            ram_list.version++;
>              if (block->flags & RAM_PREALLOC_MASK) {
>                  ;
>              } else if (mem_path) {
> 

Reviewed-by: Paolo Bonzini <pbonzini@gnu.org>
Paolo Bonzini Sept. 21, 2012, 2:58 p.m. UTC | #2
Il 21/09/2012 16:08, Juan Quintela ha scritto:
> Now that we have a thread, and blocking writes, we don't need it.
> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  buffered_file.c | 24 +-----------------------
>  migration.c     | 23 -----------------------
>  migration.h     |  1 -
>  3 files changed, 1 insertion(+), 47 deletions(-)
> 
> diff --git a/buffered_file.c b/buffered_file.c
> index 6c3d057..2c9859b 100644
> --- a/buffered_file.c
> +++ b/buffered_file.c
> @@ -26,7 +26,6 @@ typedef struct QEMUFileBuffered
>  {
>      MigrationState *migration_state;
>      QEMUFile *file;
> -    int freeze_output;
>      size_t bytes_xfer;
>      size_t xfer_limit;
>      uint8_t *buffer;
> @@ -70,13 +69,6 @@ static int buffered_flush(QEMUFileBuffered *s)
> 
>          ret = migrate_fd_put_buffer(s->migration_state, s->buffer + offset,
>                                      s->buffer_size - offset);
> -        if (ret == -EAGAIN) {
> -            DPRINTF("backend not ready, freezing\n");
> -            ret = 0;
> -            s->freeze_output = 1;
> -            break;
> -        }
> -
>          if (ret <= 0) {
>              DPRINTF("error flushing data, %zd\n", ret);
>              break;
> @@ -110,9 +102,6 @@ static int buffered_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, in
>          return error;
>      }
> 
> -    DPRINTF("unfreezing output\n");
> -    s->freeze_output = 0;
> -
>      if (size > 0) {
>          DPRINTF("buffering %d bytes\n", size - offset);
>          buffered_append(s, buf, size);
> @@ -126,7 +115,7 @@ static int buffered_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, in
> 
>      if (pos == 0 && size == 0) {
>          DPRINTF("file is ready\n");
> -        if (!s->freeze_output && s->bytes_xfer < s->xfer_limit) {
> +        if (s->bytes_xfer < s->xfer_limit) {
>              DPRINTF("notifying client\n");
>              migrate_fd_put_ready(s->migration_state);
>          }
> @@ -148,12 +137,6 @@ static int buffered_close(void *opaque)
>          if (ret < 0) {
>              break;
>          }
> -        if (s->freeze_output) {
> -            ret = migrate_fd_wait_for_unfreeze(s->migration_state);
> -            if (ret < 0) {
> -                break;
> -            }
> -        }
>      }
> 
>      ret2 = migrate_fd_close(s->migration_state);
> @@ -180,8 +163,6 @@ static int buffered_rate_limit(void *opaque)
>      if (ret) {
>          return ret;
>      }
> -    if (s->freeze_output)
> -        return 1;
> 
>      if (s->bytes_xfer > s->xfer_limit)
>          return 1;
> @@ -226,9 +207,6 @@ static void *buffered_file_thread(void *opaque)
>          if (s->migration_state->complete) {
>              break;
>          }
> -        if (s->freeze_output) {
> -            continue;
> -        }
>          if (current_time >= expire_time) {
>              s->bytes_xfer = 0;
>              expire_time = current_time + BUFFER_DELAY;
> diff --git a/migration.c b/migration.c
> index a8b2f4a..29ee710 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -367,29 +367,6 @@ static void migrate_fd_cancel(MigrationState *s)
>      migrate_fd_cleanup(s);
>  }
> 
> -int migrate_fd_wait_for_unfreeze(MigrationState *s)
> -{
> -    int ret;
> -
> -    DPRINTF("wait for unfreeze\n");
> -    if (s->state != MIG_STATE_ACTIVE)
> -        return -EINVAL;
> -
> -    do {
> -        fd_set wfds;
> -
> -        FD_ZERO(&wfds);
> -        FD_SET(s->fd, &wfds);
> -
> -        ret = select(s->fd + 1, NULL, &wfds, NULL, NULL);
> -    } while (ret == -1 && (s->get_error(s)) == EINTR);
> -
> -    if (ret == -1) {
> -        return -s->get_error(s);
> -    }
> -    return 0;
> -}
> -
>  int migrate_fd_close(MigrationState *s)
>  {
>      return s->close(s);
> diff --git a/migration.h b/migration.h
> index a63c5d5..505f792 100644
> --- a/migration.h
> +++ b/migration.h
> @@ -82,7 +82,6 @@ void migrate_fd_connect(MigrationState *s);
>  ssize_t migrate_fd_put_buffer(MigrationState *s, const void *data,
>                                size_t size);
>  void migrate_fd_put_ready(MigrationState *s);
> -int migrate_fd_wait_for_unfreeze(MigrationState *s);
>  int migrate_fd_close(MigrationState *s);
> 
>  void add_migration_state_change_notifier(Notifier *notify);
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini Sept. 21, 2012, 3:29 p.m. UTC | #3
Il 21/09/2012 16:08, Juan Quintela ha scritto:
> We still protect everything except the wait with the iothread lock.
> But we moved from a timer to a thread.  Steps one by one.
> 
> We also need to detect when we have finished with a variable "complete".
> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  buffered_file.c | 58 +++++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 36 insertions(+), 22 deletions(-)
> 
> diff --git a/buffered_file.c b/buffered_file.c
> index 318d0f0..61985a9 100644
> --- a/buffered_file.c
> +++ b/buffered_file.c
> @@ -18,6 +18,7 @@
>  #include "qemu-timer.h"
>  #include "qemu-char.h"
>  #include "buffered_file.h"
> +#include "qemu-thread.h"
> 
>  //#define DEBUG_BUFFERED_FILE
> 
> @@ -31,7 +32,8 @@ typedef struct QEMUFileBuffered
>      uint8_t *buffer;
>      size_t buffer_size;
>      size_t buffer_capacity;
> -    QEMUTimer *timer;
> +    QemuThread thread;
> +    bool complete;
>  } QEMUFileBuffered;
> 
>  #ifdef DEBUG_BUFFERED_FILE
> @@ -159,11 +161,8 @@ static int buffered_close(void *opaque)
>      if (ret >= 0) {
>          ret = ret2;
>      }
> -    qemu_del_timer(s->timer);
> -    qemu_free_timer(s->timer);
> -    g_free(s->buffer);
> -    g_free(s);
> -
> +    ret = migrate_fd_close(s->migration_state);
> +    s->complete = true;
>      return ret;
>  }
> 
> @@ -214,23 +213,38 @@ static int64_t buffered_get_rate_limit(void *opaque)
>      return s->xfer_limit;
>  }
> 
> -static void buffered_rate_tick(void *opaque)
> +/* 10ms  xfer_limit is the limit that we should write each 10ms */
> +#define BUFFER_DELAY 100
> +
> +static void *buffered_file_thread(void *opaque)
>  {
>      QEMUFileBuffered *s = opaque;
> +    int64_t expire_time = qemu_get_clock_ms(rt_clock) + BUFFER_DELAY;
> 
> -    if (qemu_file_get_error(s->file)) {
> -        buffered_close(s);
> -        return;
> -    }
> -
> -    qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100);
> -
> -    if (s->freeze_output)
> -        return;
> -
> -    s->bytes_xfer = 0;
> +    while (true) {
> +        int64_t current_time = qemu_get_clock_ms(rt_clock);
> 
> -    buffered_put_buffer(s, NULL, 0, 0);
> +        if (s->complete) {
> +            break;
> +        }
> +        if (s->freeze_output) {
> +            continue;
> +        }
> +        if (current_time >= expire_time) {
> +            s->bytes_xfer = 0;
> +            expire_time = current_time + BUFFER_DELAY;
> +        }
> +        if (s->bytes_xfer >= s->xfer_limit) {
> +            /* usleep expects microseconds */
> +            usleep((expire_time - current_time)*1000);

usleep is broken under IIRC Solaris (it uses signals and conflicts with
qemu-timer.c).  Please use g_usleep instead.

Bonus points for converting other users (hw/xenfb.c, qemu-ga.c,
tests/test-iov.c).

> +        }
> +        qemu_mutex_lock_iothread();
> +        buffered_put_buffer(s, NULL, 0, 0);
> +        qemu_mutex_unlock_iothread();
> +    }
> +    g_free(s->buffer);
> +    g_free(s);

I think freeing "s" here is incorrect.  Instead, you should "join" the
thread after setting s->complete = true, and move the g_free(s) to
buffered_close.

> +    return NULL;
>  }
> 
>  QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state)
> @@ -241,15 +255,15 @@ QEMUFile *qemu_fopen_ops_buffered(MigrationState *migration_state)
> 
>      s->migration_state = migration_state;
>      s->xfer_limit = migration_state->bandwidth_limit / 10;
> +    s->complete = false;
> 
>      s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
>                               buffered_close, buffered_rate_limit,
>                               buffered_set_rate_limit,
>  			     buffered_get_rate_limit);
> 
> -    s->timer = qemu_new_timer_ms(rt_clock, buffered_rate_tick, s);
> -
> -    qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100);
> +    qemu_thread_create(&s->thread, buffered_file_thread, s,
> +                       QEMU_THREAD_DETACHED);

So this needs to become QEMU_THREAD_JOINABLE.

> 
>      return s->file;
>  }
> 

Paolo