Message ID: AB5A8C7661872E428D6B8E1C2DFA35085D84AF0480@DEWDFECCR02.wdf.sap.corp
State: New
On 08/08/2011 03:42 AM, Shribman, Aidan wrote:
> Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
> From: Aidan Shribman <aidan.shribman@sap.com>
>
> By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime
> and total live-migration time of VMs running memory-write-intensive workloads
> typical of large enterprise applications such as SAP ERP systems, and, generally
> speaking, of any application with a sparse memory update pattern.
>
> On the sender side XBZRLE is used as a compact delta encoding of page updates,
> retrieving the old page content from an LRU cache (default size of 64 MB). The
> receiving side uses the existing page content and XBZRLE to decode the new page
> content.
>
> Work was originally based on research results published at VEE 2011: "Evaluation
> of Delta Compression Techniques for Efficient Live Migration of Large Virtual
> Machines" by Benoit, Svard, Tordsson and Elmroth. The paper's delta encoder,
> XBRLE, was subsequently improved into XBZRLE.
>
> XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads, making it
> ideal for in-line, real-time encoding such as is needed for live migration.
>
> A typical usage scenario:
>
>     {qemu} migrate_set_cachesize 256m
>     {qemu} migrate -x -d tcp:destination.host:4444
>     {qemu} info migrate
>     ...
>     transferred ram-duplicate: A kbytes
>     transferred ram-duplicate: B pages
>     transferred ram-normal: C kbytes
>     transferred ram-normal: D pages
>     transferred ram-xbrle: E kbytes
>     transferred ram-xbrle: F pages
>     overflow ram-xbrle: G pages
>     cache-hit ram-xbrle: H pages
>     cache-lookup ram-xbrle: J pages
>
> Testing: live migration with XBZRLE completed in 110 seconds; without XBZRLE,
> live migration was not able to complete.
>
> A simple synthetic memory r/w load generator:
>
>     #include <stdlib.h>
>     #include <stdio.h>
>     int main()
>     {
>         char *buf = (char *) calloc(4096, 4096);
>         while (1) {
>             int i;
>             for (i = 0; i < 4096 * 4; i++) {
>                 buf[i * 4096 / 4]++;
>             }
>             printf(".");
>         }
>     }
>
> Signed-off-by: Benoit Hudzia <benoit.hudzia@sap.com>
> Signed-off-by: Petter Svard <petters@cs.umu.se>
> Signed-off-by: Aidan Shribman <aidan.shribman@sap.com>

One thing that strikes me about this algorithm is that it's very good for a particular type of workload--shockingly good, really.

I think workload-aware migration compression is possible for a lot of different types of workloads. That makes me a bit wary of QEMU growing quite a lot of compression mechanisms.

It makes me think that this logic may really belong at a higher level where more information is known about the workload. For instance, I can imagine XBZRLE living in something like libvirt.

Today, parsing migration traffic is pretty horrible, but I think we're pretty strongly committed to fixing that in 1.0. That makes me wonder if it would be nicer architecturally for a higher-level tool to own something like this.

Originally, when I added migration, I had the view that we would have transport plugins based on the exec: protocol. That hasn't really happened, since libvirt really owns migration, but I think having XBZRLE as a transport plugin for libvirt is something worth considering.

I'm curious what people think about this type of approach. CC'ing libvirt to get their input.
Regards, Anthony Liguori > > -- > > Makefile.target | 1 + > arch_init.c | 351 ++++++++++++++++++++++++++++++++++++++++++++++------ > block-migration.c | 3 +- > hash.h | 72 +++++++++++ > hmp-commands.hx | 36 ++++-- > hw/hw.h | 3 +- > lru.c | 142 +++++++++++++++++++++ > lru.h | 13 ++ > migration-exec.c | 6 +- > migration-fd.c | 6 +- > migration-tcp.c | 6 +- > migration-unix.c | 6 +- > migration.c | 119 +++++++++++++++++- > migration.h | 25 +++- > qmp-commands.hx | 43 ++++++- > savevm.c | 13 ++- > sysemu.h | 13 ++- > xbzrle.c | 126 +++++++++++++++++++ > xbzrle.h | 12 ++ > 19 files changed, 917 insertions(+), 79 deletions(-) > > diff --git a/Makefile.target b/Makefile.target > index 2800f47..b3215de 100644 > --- a/Makefile.target > +++ b/Makefile.target > @@ -186,6 +186,7 @@ endif #CONFIG_BSD_USER > ifdef CONFIG_SOFTMMU > > obj-y = arch_init.o cpus.o monitor.o machine.o gdbstub.o balloon.o > +obj-y += lru.o xbzrle.o > # virtio has to be here due to weird dependency between PCI and virtio-net. > # need to fix this properly > obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-serial-bus.o > diff --git a/arch_init.c b/arch_init.c > old mode 100644 > new mode 100755 > index 4486925..d67dc82 > --- a/arch_init.c > +++ b/arch_init.c > @@ -40,6 +40,17 @@ > #include "net.h" > #include "gdbstub.h" > #include "hw/smbios.h" > +#include "lru.h" > +#include "xbzrle.h" > + > +//#define DEBUG_ARCH_INIT > +#ifdef DEBUG_ARCH_INIT > +#define DPRINTF(fmt, ...) \ > + do { fprintf(stdout, "arch_init: " fmt, ## __VA_ARGS__); } while (0) > +#else > +#define DPRINTF(fmt, ...) \ > + do { } while (0) > +#endif > > #ifdef TARGET_SPARC > int graphic_width = 1024; > @@ -88,6 +99,161 @@ const uint32_t arch_type = QEMU_ARCH; > #define RAM_SAVE_FLAG_PAGE 0x08 > #define RAM_SAVE_FLAG_EOS 0x10 > #define RAM_SAVE_FLAG_CONTINUE 0x20 > +#define RAM_SAVE_FLAG_XBZRLE 0x40 > + > +/***********************************************************/ > +/* RAM Migration State */ > +typedef struct ArchMigrationState { > + int use_xbrle; > + int64_t xbrle_cache_size; > +} ArchMigrationState; > + > +static ArchMigrationState arch_mig_state; > + > +void arch_set_params(int blk_enable, int shared_base, int use_xbrle, > + int64_t xbrle_cache_size, void *opaque) > +{ > + arch_mig_state.use_xbrle = use_xbrle; > + arch_mig_state.xbrle_cache_size = xbrle_cache_size; > +} > + > +#define BE16_MAGIC 0x0123 > + > +/***********************************************************/ > +/* XBZRLE (Xor Binary Zero Run-Length Encoding) */ > +typedef struct XBZRLEHeader { > + uint32_t xh_cksum; /* not used */ > + uint16_t xh_magic; > + uint16_t xh_len; > + uint8_t xh_flags; > +} XBZRLEHeader; > + > +static uint8_t dup_buf[TARGET_PAGE_SIZE]; > + > +/***********************************************************/ > +/* accounting */ > +typedef struct AccountingInfo{ > + uint64_t dup_pages; > + uint64_t norm_pages; > + uint64_t xbrle_bytes; > + uint64_t xbrle_pages; > + uint64_t xbrle_overflow; > + uint64_t xbrle_cache_lookup; > + uint64_t xbrle_cache_hit; > + uint64_t iterations; > +} AccountingInfo; > + > +static AccountingInfo acct_info; > + > +static void acct_clear(void) > +{ > + memset(&acct_info, 0, sizeof(acct_info)); > +} > + > +uint64_t dup_mig_bytes_transferred(void) > +{ > + return acct_info.dup_pages; > +} > + > +uint64_t dup_mig_pages_transferred(void) > +{ > + return acct_info.dup_pages; > +} > + > +uint64_t norm_mig_bytes_transferred(void) > +{ > + return acct_info.norm_pages * TARGET_PAGE_SIZE; > +} > + > +uint64_t norm_mig_pages_transferred(void) > 
+{ > + return acct_info.norm_pages; > +} > + > +uint64_t xbrle_mig_bytes_transferred(void) > +{ > + return acct_info.xbrle_bytes; > +} > + > +uint64_t xbrle_mig_pages_transferred(void) > +{ > + return acct_info.xbrle_pages; > +} > + > +uint64_t xbrle_mig_pages_overflow(void) > +{ > + return acct_info.xbrle_overflow; > +} > + > +uint64_t xbrle_mig_pages_cache_hit(void) > +{ > + return acct_info.xbrle_cache_hit; > +} > + > +uint64_t xbrle_mig_pages_cache_lookup(void) > +{ > + return acct_info.xbrle_cache_lookup; > +} > + > +static void save_block_hdr(QEMUFile *f, RAMBlock *block, ram_addr_t offset, > + int cont, int flag) > +{ > + qemu_put_be64(f, offset | cont | flag); > + if (!cont) { > + qemu_put_byte(f, strlen(block->idstr)); > + qemu_put_buffer(f, (uint8_t *)block->idstr, > + strlen(block->idstr)); > + } > +} > + > +#define ENCODING_FLAG_XBZRLE 0x1 > + > +static int save_xbrle_page(QEMUFile *f, uint8_t *current_page, > + ram_addr_t current_addr, RAMBlock *block, ram_addr_t offset, int cont) > +{ > + int encoded_len = 0, bytes_sent = 0; > + XBZRLEHeader hdr = {0, BE16_MAGIC}; > + uint8_t *encoded, *old_page; > + > + /* abort if page not cached */ > + acct_info.xbrle_cache_lookup++; > + old_page = lru_lookup(current_addr); > + if (!old_page) { > + goto done; > + } > + acct_info.xbrle_cache_hit++; > + > + /* XBZRLE (XOR+ZRLE) encoding */ > + encoded = (uint8_t *) qemu_malloc(TARGET_PAGE_SIZE); > + encoded_len = xbzrle_encode(encoded, old_page, current_page, > + TARGET_PAGE_SIZE); > + > + if (encoded_len< 0) { > + DPRINTF("XBZRLE encoding overflow - sending uncompressed\n"); > + acct_info.xbrle_overflow++; > + goto done; > + } > + > + hdr.xh_len = encoded_len; > + hdr.xh_flags |= ENCODING_FLAG_XBZRLE; > + > + /* Send XBZRLE compressed page */ > + save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_XBZRLE); > + > + qemu_put_be32(f, hdr.xh_cksum); > + qemu_put_buffer(f, (uint8_t *)&hdr.xh_magic, sizeof (hdr.xh_magic)); > + qemu_put_be16(f, hdr.xh_len); > + qemu_put_byte(f, hdr.xh_flags); > + > + qemu_put_buffer(f, encoded, encoded_len); > + acct_info.xbrle_pages++; > + bytes_sent = encoded_len + sizeof(hdr); > + acct_info.xbrle_bytes += bytes_sent; > + > +done: > + qemu_free(encoded); > + return bytes_sent; > +} > > static int is_dup_page(uint8_t *page, uint8_t ch) > { > @@ -107,7 +273,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch) > static RAMBlock *last_block; > static ram_addr_t last_offset; > > -static int ram_save_block(QEMUFile *f) > +static int ram_save_block(QEMUFile *f, int stage) > { > RAMBlock *block = last_block; > ram_addr_t offset = last_offset; > @@ -120,6 +286,7 @@ static int ram_save_block(QEMUFile *f) > current_addr = block->offset + offset; > > do { > + lru_free_cb_t free_cb = qemu_free; > if (cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) { > uint8_t *p; > int cont = (block == last_block) ? 
RAM_SAVE_FLAG_CONTINUE : 0; > @@ -128,28 +295,35 @@ static int ram_save_block(QEMUFile *f) > current_addr + TARGET_PAGE_SIZE, > MIGRATION_DIRTY_FLAG); > > - p = block->host + offset; > + if (arch_mig_state.use_xbrle) { > + p = qemu_malloc(TARGET_PAGE_SIZE); > + memcpy(p, block->host + offset, TARGET_PAGE_SIZE); > + } else { > + p = block->host + offset; > + } > > if (is_dup_page(p, *p)) { > - qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_COMPRESS); > - if (!cont) { > - qemu_put_byte(f, strlen(block->idstr)); > - qemu_put_buffer(f, (uint8_t *)block->idstr, > - strlen(block->idstr)); > - } > + save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_COMPRESS); > qemu_put_byte(f, *p); > bytes_sent = 1; > - } else { > - qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_PAGE); > - if (!cont) { > - qemu_put_byte(f, strlen(block->idstr)); > - qemu_put_buffer(f, (uint8_t *)block->idstr, > - strlen(block->idstr)); > + acct_info.dup_pages++; > + if (arch_mig_state.use_xbrle&& !*p) { > + p = dup_buf; > + free_cb = NULL; > } > + } else if (stage == 2&& arch_mig_state.use_xbrle) { > + bytes_sent = save_xbrle_page(f, p, current_addr, block, > + offset, cont); > + } > + if (!bytes_sent) { > + save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE); > qemu_put_buffer(f, p, TARGET_PAGE_SIZE); > bytes_sent = TARGET_PAGE_SIZE; > + acct_info.norm_pages++; > + } > + if (arch_mig_state.use_xbrle) { > + lru_insert(current_addr, p, free_cb); > } > - > break; > } > > @@ -221,6 +395,9 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) > > if (stage< 0) { > cpu_physical_memory_set_dirty_tracking(0); > + if (arch_mig_state.use_xbrle) { > + lru_fini(); > + } > return 0; > } > > @@ -235,6 +412,11 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) > last_block = NULL; > last_offset = 0; > > + if (arch_mig_state.use_xbrle) { > + lru_init(arch_mig_state.xbrle_cache_size/TARGET_PAGE_SIZE, 0); > + acct_clear(); > + } > + > /* Make sure all dirty bits are set */ > QLIST_FOREACH(block,&ram_list.blocks, next) { > for (addr = block->offset; addr< block->offset + block->length; > @@ -264,8 +446,9 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) > while (!qemu_file_rate_limit(f)) { > int bytes_sent; > > - bytes_sent = ram_save_block(f); > + bytes_sent = ram_save_block(f, stage); > bytes_transferred += bytes_sent; > + acct_info.iterations++; > if (bytes_sent == 0) { /* no more blocks */ > break; > } > @@ -285,19 +468,79 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) > int bytes_sent; > > /* flush all remaining blocks regardless of rate limiting */ > - while ((bytes_sent = ram_save_block(f)) != 0) { > + while ((bytes_sent = ram_save_block(f, stage))) { > bytes_transferred += bytes_sent; > } > cpu_physical_memory_set_dirty_tracking(0); > + if (arch_mig_state.use_xbrle) { > + lru_fini(); > + } > } > > qemu_put_be64(f, RAM_SAVE_FLAG_EOS); > > expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth; > > + DPRINTF("ram_save_live: expected(%ld)<= max(%ld)?\n", expected_time, > + migrate_max_downtime()); > + > return (stage == 2)&& (expected_time<= migrate_max_downtime()); > } > > +static int load_xbrle(QEMUFile *f, ram_addr_t addr, void *host) > +{ > + int len, rc = -1; > + uint8_t *encoded; > + XBZRLEHeader hdr = {0}; > + > + /* extract ZRLE header */ > + hdr.xh_cksum = qemu_get_be32(f); > + qemu_get_buffer(f, (uint8_t *)&hdr.xh_magic, sizeof (hdr.xh_magic)); > + hdr.xh_len = qemu_get_be16(f); > + hdr.xh_flags = qemu_get_byte(f); > + > + 
if (!(hdr.xh_flags& ENCODING_FLAG_XBZRLE)) { > + fprintf(stderr, "Failed to load XZBRLE page - wrong compression!\n"); > + goto done; > + } > + > + if (hdr.xh_len> TARGET_PAGE_SIZE) { > + fprintf(stderr, "Failed to load XZBRLE page - len overflow!\n"); > + goto done; > + } > + > + /* load data and decode */ > + encoded = (uint8_t *) qemu_malloc(hdr.xh_len); > + qemu_get_buffer(f, encoded, hdr.xh_len); > + /* covert endianess if magic indicated destination differs from source */ > + if (hdr.xh_magic != BE16_MAGIC) { > + const uint64_t *end = (uint64_t *) encoded + > + hdr.xh_len / sizeof (uint64_t); > + uint64_t *p; > + for (p = (uint64_t *) encoded; p< end; p++) { > + bswap64s(p); > + } > + } > + > + /* decode ZRLE */ > + len = xbzrle_decode(host, host, encoded, hdr.xh_len); > + if (len == -1) { > + fprintf(stderr, "Failed to load XBZRLE page - decode error!\n"); > + goto done; > + } > + > + if (len != TARGET_PAGE_SIZE) { > + fprintf(stderr, "Failed to load XBZRLE page - size %d expected %d!\n", > + len, TARGET_PAGE_SIZE); > + goto done; > + } > + > + rc = 0; > +done: > + qemu_free(encoded); > + return rc; > +} > + > static inline void *host_from_stream_offset(QEMUFile *f, > ram_addr_t offset, > int flags) > @@ -328,16 +571,38 @@ static inline void *host_from_stream_offset(QEMUFile *f, > return NULL; > } > > +static inline void *host_from_stream_offset_versioned(int version_id, > + QEMUFile *f, ram_addr_t offset, int flags) > +{ > + void *host; > + if (version_id == 3) { > + host = qemu_get_ram_ptr(offset); > + } else { > + host = host_from_stream_offset(f, offset, flags); > + } > + if (!host) { > + fprintf(stderr, "Failed to convert RAM address to host" > + " for offset 0x%lX!\n", offset); > + abort(); > + } > + return host; > +} > + > int ram_load(QEMUFile *f, void *opaque, int version_id) > { > ram_addr_t addr; > - int flags; > + int flags, ret = 0; > + static uint64_t seq_iter; > + > + seq_iter++; > > if (version_id< 3 || version_id> 4) { > - return -EINVAL; > + ret = -EINVAL; > + goto done; > } > > do { > + void *host; > addr = qemu_get_be64(f); > > flags = addr& ~TARGET_PAGE_MASK; > @@ -346,7 +611,8 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) > if (flags& RAM_SAVE_FLAG_MEM_SIZE) { > if (version_id == 3) { > if (addr != ram_bytes_total()) { > - return -EINVAL; > + ret = -EINVAL; > + goto done; > } > } else { > /* Synchronize RAM block list */ > @@ -365,8 +631,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) > > QLIST_FOREACH(block,&ram_list.blocks, next) { > if (!strncmp(id, block->idstr, sizeof(id))) { > - if (block->length != length) > - return -EINVAL; > + if (block->length != length) { > + ret = -EINVAL; > + goto done; > + } > break; > } > } > @@ -374,7 +642,8 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) > if (!block) { > fprintf(stderr, "Unknown ramblock \"%s\", cannot " > "accept migration\n", id); > - return -EINVAL; > + ret = -EINVAL; > + goto done; > } > > total_ram_bytes -= length; > @@ -383,17 +652,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) > } > > if (flags& RAM_SAVE_FLAG_COMPRESS) { > - void *host; > uint8_t ch; > > - if (version_id == 3) > - host = qemu_get_ram_ptr(addr); > - else > - host = host_from_stream_offset(f, addr, flags); > - if (!host) { > - return -EINVAL; > - } > - > + host = host_from_stream_offset_versioned(version_id, > + f, addr, flags); > ch = qemu_get_byte(f); > memset(host, ch, TARGET_PAGE_SIZE); > #ifndef _WIN32 > @@ -403,21 +665,28 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) 
> } > #endif > } else if (flags& RAM_SAVE_FLAG_PAGE) { > - void *host; > - > - if (version_id == 3) > - host = qemu_get_ram_ptr(addr); > - else > - host = host_from_stream_offset(f, addr, flags); > - > + host = host_from_stream_offset_versioned(version_id, > + f, addr, flags); > qemu_get_buffer(f, host, TARGET_PAGE_SIZE); > + } else if (flags& RAM_SAVE_FLAG_XBZRLE) { > + host = host_from_stream_offset_versioned(version_id, > + f, addr, flags); > + if (load_xbrle(f, addr, host)< 0) { > + ret = -EINVAL; > + goto done; > + } > } > + > if (qemu_file_has_error(f)) { > - return -EIO; > + ret = -EIO; > + goto done; > } > } while (!(flags& RAM_SAVE_FLAG_EOS)); > > - return 0; > +done: > + DPRINTF("Completed load of VM with exit code %d seq iteration %ld\n", > + ret, seq_iter); > + return ret; > } > > void qemu_service_io(void) > diff --git a/block-migration.c b/block-migration.c > index 3e66f49..504df70 100644 > --- a/block-migration.c > +++ b/block-migration.c > @@ -689,7 +689,8 @@ static int block_load(QEMUFile *f, void *opaque, int version_id) > return 0; > } > > -static void block_set_params(int blk_enable, int shared_base, void *opaque) > +static void block_set_params(int blk_enable, int shared_base, > + int use_xbrle, int64_t xbrle_cache_size, void *opaque) > { > block_mig_state.blk_enable = blk_enable; > block_mig_state.shared_base = shared_base; > diff --git a/hash.h b/hash.h > new file mode 100644 > index 0000000..7109905 > --- /dev/null > +++ b/hash.h > @@ -0,0 +1,72 @@ > +#ifndef _LINUX_HASH_H > +#define _LINUX_HASH_H > +/* Fast hashing routine for ints, longs and pointers. > + (C) 2002 William Lee Irwin III, IBM */ > + > +/* > + * Knuth recommends primes in approximately golden ratio to the maximum > + * integer representable by a machine word for multiplicative hashing. > + * Chuck Lever verified the effectiveness of this technique: > + * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf > + * > + * These primes are chosen to be bit-sparse, that is operations on > + * them can use shifts and additions instead of multiplications for > + * machines where multiplications are slow. > + */ > + > +typedef uint64_t u64; > +typedef uint32_t u32; > +#define BITS_PER_LONG TARGET_LONG_BITS > + > +/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */ > +#define GOLDEN_RATIO_PRIME_32 0x9e370001UL > +/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */ > +#define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL > + > +#if BITS_PER_LONG == 32 > +#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_32 > +#define hash_long(val, bits) hash_32(val, bits) > +#elif BITS_PER_LONG == 64 > +#define hash_long(val, bits) hash_64(val, bits) > +#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_64 > +#else > +#error Wordsize not 32 or 64 > +#endif > + > +static inline u64 hash_64(u64 val, unsigned int bits) > +{ > + u64 hash = val; > + > + /* Sigh, gcc can't optimise this alone like it does for 32 bits. */ > + u64 n = hash; > + n<<= 18; > + hash -= n; > + n<<= 33; > + hash -= n; > + n<<= 3; > + hash += n; > + n<<= 3; > + hash -= n; > + n<<= 4; > + hash += n; > + n<<= 2; > + hash += n; > + > + /* High bits are more random, so use them. */ > + return hash>> (64 - bits); > +} > + > +static inline u32 hash_32(u32 val, unsigned int bits) > +{ > + /* On some cpus multiply is faster, on others gcc will do shifts */ > + u32 hash = val * GOLDEN_RATIO_PRIME_32; > + > + /* High bits are more random, so use them. 
*/ > + return hash>> (32 - bits); > +} > + > +static inline unsigned long hash_ptr(void *ptr, unsigned int bits) > +{ > + return hash_long((unsigned long)ptr, bits); > +} > +#endif /* _LINUX_HASH_H */ > diff --git a/hmp-commands.hx b/hmp-commands.hx > old mode 100644 > new mode 100755 > index e5585ba..e49d5be > --- a/hmp-commands.hx > +++ b/hmp-commands.hx > @@ -717,24 +717,27 @@ ETEXI > > { > .name = "migrate", > - .args_type = "detach:-d,blk:-b,inc:-i,uri:s", > - .params = "[-d] [-b] [-i] uri", > - .help = "migrate to URI (using -d to not wait for completion)" > - "\n\t\t\t -b for migration without shared storage with" > - " full copy of disk\n\t\t\t -i for migration without " > - "shared storage with incremental copy of disk " > - "(base image shared between src and destination)", > + .args_type = "detach:-d,blk:-b,inc:-i,xbrle:-x,uri:s", > + .params = "[-d] [-b] [-i] [-x] uri", > + .help = "migrate to URI" > + "\n\t -d to not wait for completion" > + "\n\t -b for migration without shared storage with" > + " full copy of disk" > + "\n\t -i for migration without" > + " shared storage with incremental copy of disk" > + " (base image shared between source and destination)" > + "\n\t -x to use XBRLE page delta compression", > .user_print = monitor_user_noop, > .mhandler.cmd_new = do_migrate, > }, > > - > STEXI > -@item migrate [-d] [-b] [-i] @var{uri} > +@item migrate [-d] [-b] [-i] [-x] @var{uri} > @findex migrate > Migrate to @var{uri} (using -d to not wait for completion). > -b for migration with full copy of disk > -i for migration with incremental copy of disk (base image is shared) > + -x to use XBRLE page delta compression > ETEXI > > { > @@ -753,10 +756,23 @@ Cancel the current VM migration. > ETEXI > > { > + .name = "migrate_set_cachesize", > + .args_type = "value:s", > + .params = "value", > + .help = "set cache size (in MB) for XBRLE migrations", > + .mhandler.cmd = do_migrate_set_cachesize, > + }, > + > +STEXI > +@item migrate_set_cachesize @var{value} > +Set cache size (in MB) for xbrle migrations. > +ETEXI > + > + { > .name = "migrate_set_speed", > .args_type = "value:o", > .params = "value", > - .help = "set maximum speed (in bytes) for migrations. " > + .help = "set maximum XBRLE cache size (in bytes) for migrations. " > "Defaults to MB if no size suffix is specified, ie. 
B/K/M/G/T", > .user_print = monitor_user_noop, > .mhandler.cmd_new = do_migrate_set_speed, > diff --git a/hw/hw.h b/hw/hw.h > index 9d2cfc2..aa336ec 100644 > --- a/hw/hw.h > +++ b/hw/hw.h > @@ -239,7 +239,8 @@ static inline void qemu_get_sbe64s(QEMUFile *f, int64_t *pv) > int64_t qemu_ftell(QEMUFile *f); > int64_t qemu_fseek(QEMUFile *f, int64_t pos, int whence); > > -typedef void SaveSetParamsHandler(int blk_enable, int shared, void * opaque); > +typedef void SaveSetParamsHandler(int blk_enable, int shared, > + int use_xbrle, int64_t xbrle_cache_size, void *opaque); > typedef void SaveStateHandler(QEMUFile *f, void *opaque); > typedef int SaveLiveStateHandler(Monitor *mon, QEMUFile *f, int stage, > void *opaque); > diff --git a/lru.c b/lru.c > new file mode 100644 > index 0000000..e7230d0 > --- /dev/null > +++ b/lru.c > @@ -0,0 +1,142 @@ > +#include<assert.h> > +#include<math.h> > +#include "qemu-common.h" > +#include "qemu-queue.h" > +#include "host-utils.h" > +#include "lru.h" > +#include "hash.h" > + > +typedef struct CacheItem { > + ram_addr_t it_addr; > + uint8_t *it_data; > + lru_free_cb_t it_free; > + QCIRCLEQ_ENTRY(CacheItem) it_lru_next; > + QCIRCLEQ_ENTRY(CacheItem) it_bucket_next; > +} CacheItem; > + > +typedef QCIRCLEQ_HEAD(, CacheItem) CacheBucket; > +static CacheBucket *page_hash; > +static int64_t cache_table_size; > +static uint64_t cache_max_items; > +static int64_t cache_num_items; > +static uint8_t cache_hash_bits; > + > +static QCIRCLEQ_HEAD(page_lru, CacheItem) page_lru; > + > +static uint64_t next_pow_of_2(uint64_t v) > +{ > + v--; > + v |= v>> 1; > + v |= v>> 2; > + v |= v>> 4; > + v |= v>> 8; > + v |= v>> 16; > + v |= v>> 32; > + v++; > + return v; > +} > + > +void lru_init(int64_t max_items, void *param) > +{ > + int i; > + > + cache_num_items = 0; > + cache_max_items = max_items; > + /* add 20% to table size to reduce collisions */ > + cache_table_size = next_pow_of_2(1.2 * max_items); > + cache_hash_bits = ctz64(cache_table_size) - 1; > + > + QCIRCLEQ_INIT(&page_lru); > + > + page_hash = qemu_mallocz(sizeof(CacheBucket) * cache_table_size); > + assert(page_hash); > + for (i = 0; i< cache_table_size; i++) { > + QCIRCLEQ_INIT(&page_hash[i]); > + } > +} > + > +static CacheBucket *page_bucket_list(ram_addr_t addr) > +{ > + return&page_hash[hash_long(addr, cache_hash_bits)]; > +} > + > +static void do_lru_remove(CacheItem *it) > +{ > + assert(it); > + > + QCIRCLEQ_REMOVE(&page_lru, it, it_lru_next); > + QCIRCLEQ_REMOVE(page_bucket_list(it->it_addr), it, it_bucket_next); > + if (it->it_free) { > + (*it->it_free)(it->it_data); > + } > + qemu_free(it); > + cache_num_items--; > +} > + > +static int do_lru_remove_first(void) > +{ > + CacheItem *first; > + > + if (QCIRCLEQ_EMPTY(&page_lru)) { > + return -1; > + } > + first = QCIRCLEQ_FIRST(&page_lru); > + do_lru_remove(first); > + return 0; > +} > + > + > +void lru_fini(void) > +{ > + while (!do_lru_remove_first()) { > + } > + qemu_free(page_hash); > +} > + > +static CacheItem *do_lru_lookup(ram_addr_t addr) > +{ > + CacheBucket *head = page_bucket_list(addr); > + CacheItem *it; > + > + if (QCIRCLEQ_EMPTY(head)) { > + return NULL; > + } > + QCIRCLEQ_FOREACH(it, head, it_bucket_next) { > + if (addr == it->it_addr) { > + return it; > + } > + } > + return NULL; > +} > + > +uint8_t *lru_lookup(ram_addr_t addr) > +{ > + CacheItem *it = do_lru_lookup(addr); > + return it ? 
it->it_data : NULL; > +} > + > +void lru_insert(ram_addr_t addr, uint8_t *data, lru_free_cb_t free_cb) > +{ > + CacheItem *it; > + > + /* remove old if item exists */ > + it = do_lru_lookup(addr); > + if (it) { > + do_lru_remove(it); > + } > + > + /* evict LRU if require free space */ > + if (cache_num_items == cache_max_items) { > + do_lru_remove_first(); > + } > + > + /* add new entry */ > + it = qemu_mallocz(sizeof(*it)); > + it->it_addr = addr; > + it->it_data = data; > + it->it_free = free_cb; > + QCIRCLEQ_INSERT_HEAD(page_bucket_list(addr), it, it_bucket_next); > + QCIRCLEQ_INSERT_TAIL(&page_lru, it, it_lru_next); > + cache_num_items++; > +} > + > diff --git a/lru.h b/lru.h > new file mode 100644 > index 0000000..6c70095 > --- /dev/null > +++ b/lru.h > @@ -0,0 +1,13 @@ > +#ifndef _LRU_H_ > +#define _LRU_H_ > + > +#include<unistd.h> > +#include<stdint.h> > +#include "cpu-all.h" > +typedef void (*lru_free_cb_t)(void *); > +void lru_init(ssize_t num_items, void *param); > +void lru_fini(void); > +void lru_insert(ram_addr_t id, uint8_t *pdata, lru_free_cb_t free_cb); > +uint8_t *lru_lookup(ram_addr_t addr); > +#endif > + > diff --git a/migration-exec.c b/migration-exec.c > index 14718dd..fe8254a 100644 > --- a/migration-exec.c > +++ b/migration-exec.c > @@ -67,7 +67,9 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon, > int64_t bandwidth_limit, > int detach, > int blk, > - int inc) > + int inc, > + int use_xbrle, > + int64_t xbrle_cache_size) > { > FdMigrationState *s; > FILE *f; > @@ -99,6 +101,8 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon, > > s->mig_state.blk = blk; > s->mig_state.shared = inc; > + s->mig_state.use_xbrle = use_xbrle; > + s->mig_state.xbrle_cache_size = xbrle_cache_size; > > s->state = MIG_STATE_ACTIVE; > s->mon = NULL; > diff --git a/migration-fd.c b/migration-fd.c > index 6d14505..4a1ddbd 100644 > --- a/migration-fd.c > +++ b/migration-fd.c > @@ -56,7 +56,9 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon, > int64_t bandwidth_limit, > int detach, > int blk, > - int inc) > + int inc, > + int use_xbrle, > + int64_t xbrle_cache_size) > { > FdMigrationState *s; > > @@ -82,6 +84,8 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon, > > s->mig_state.blk = blk; > s->mig_state.shared = inc; > + s->mig_state.use_xbrle = use_xbrle; > + s->mig_state.xbrle_cache_size = xbrle_cache_size; > > s->state = MIG_STATE_ACTIVE; > s->mon = NULL; > diff --git a/migration-tcp.c b/migration-tcp.c > index b55f419..4ca5bf6 100644 > --- a/migration-tcp.c > +++ b/migration-tcp.c > @@ -81,7 +81,9 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon, > int64_t bandwidth_limit, > int detach, > int blk, > - int inc) > + int inc, > + int use_xbrle, > + int64_t xbrle_cache_size) > { > struct sockaddr_in addr; > FdMigrationState *s; > @@ -101,6 +103,8 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon, > > s->mig_state.blk = blk; > s->mig_state.shared = inc; > + s->mig_state.use_xbrle = use_xbrle; > + s->mig_state.xbrle_cache_size = xbrle_cache_size; > > s->state = MIG_STATE_ACTIVE; > s->mon = NULL; > diff --git a/migration-unix.c b/migration-unix.c > index 57232c0..0813902 100644 > --- a/migration-unix.c > +++ b/migration-unix.c > @@ -80,7 +80,9 @@ MigrationState *unix_start_outgoing_migration(Monitor *mon, > int64_t bandwidth_limit, > int detach, > int blk, > - int inc) > + int inc, > + int use_xbrle, > + int64_t xbrle_cache_size) > { > FdMigrationState *s; > struct sockaddr_un addr; > @@ -100,6 +102,8 @@ MigrationState 
*unix_start_outgoing_migration(Monitor *mon, > > s->mig_state.blk = blk; > s->mig_state.shared = inc; > + s->mig_state.use_xbrle = use_xbrle; > + s->mig_state.xbrle_cache_size = xbrle_cache_size; > > s->state = MIG_STATE_ACTIVE; > s->mon = NULL; > diff --git a/migration.c b/migration.c > old mode 100644 > new mode 100755 > index 9ee8b17..ccacf81 > --- a/migration.c > +++ b/migration.c > @@ -34,6 +34,11 @@ > /* Migration speed throttling */ > static uint32_t max_throttle = (32<< 20); > > +/* Migration XBRLE cache size */ > +#define DEFAULT_MIGRATE_CACHE_SIZE (64 * 1024 * 1024) > + > +static int64_t migrate_cache_size = DEFAULT_MIGRATE_CACHE_SIZE; > + > static MigrationState *current_migration; > > int qemu_start_incoming_migration(const char *uri) > @@ -80,6 +85,7 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) > int detach = qdict_get_try_bool(qdict, "detach", 0); > int blk = qdict_get_try_bool(qdict, "blk", 0); > int inc = qdict_get_try_bool(qdict, "inc", 0); > + int use_xbrle = qdict_get_try_bool(qdict, "xbrle", 0); > const char *uri = qdict_get_str(qdict, "uri"); > > if (current_migration&& > @@ -90,17 +96,21 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data) > > if (strstart(uri, "tcp:",&p)) { > s = tcp_start_outgoing_migration(mon, p, max_throttle, detach, > - blk, inc); > + blk, inc, use_xbrle, > + migrate_cache_size); > #if !defined(WIN32) > } else if (strstart(uri, "exec:",&p)) { > s = exec_start_outgoing_migration(mon, p, max_throttle, detach, > - blk, inc); > + blk, inc, use_xbrle, > + migrate_cache_size); > } else if (strstart(uri, "unix:",&p)) { > s = unix_start_outgoing_migration(mon, p, max_throttle, detach, > - blk, inc); > + blk, inc, use_xbrle, > + migrate_cache_size); > } else if (strstart(uri, "fd:",&p)) { > s = fd_start_outgoing_migration(mon, p, max_throttle, detach, > - blk, inc); > + blk, inc, use_xbrle, > + migrate_cache_size); > #endif > } else { > monitor_printf(mon, "unknown migration protocol: %s\n", uri); > @@ -185,6 +195,36 @@ static void migrate_print_status(Monitor *mon, const char *name, > qdict_get_int(qdict, "total")>> 10); > } > > +static void migrate_print_ram_status(Monitor *mon, const char *name, > + const QDict *status_dict) > +{ > + QDict *qdict; > + uint64_t overflow, cache_hit, cache_lookup; > + > + qdict = qobject_to_qdict(qdict_get(status_dict, name)); > + > + monitor_printf(mon, "transferred %s: %" PRIu64 " kbytes\n", name, > + qdict_get_int(qdict, "bytes")>> 10); > + monitor_printf(mon, "transferred %s: %" PRIu64 " pages\n", name, > + qdict_get_int(qdict, "pages")); > + overflow = qdict_get_int(qdict, "overflow"); > + if (overflow> 0) { > + monitor_printf(mon, "overflow %s: %" PRIu64 " pages\n", name, > + overflow); > + } > + cache_hit = qdict_get_int(qdict, "cache-hit"); > + if (cache_hit> 0) { > + monitor_printf(mon, "cache-hit %s: %" PRIu64 " pages\n", name, > + cache_hit); > + } > + cache_lookup = qdict_get_int(qdict, "cache-lookup"); > + if (cache_lookup> 0) { > + monitor_printf(mon, "cache-lookup %s: %" PRIu64 " pages\n", name, > + cache_lookup); > + } > + > +} > + > void do_info_migrate_print(Monitor *mon, const QObject *data) > { > QDict *qdict; > @@ -198,6 +238,18 @@ void do_info_migrate_print(Monitor *mon, const QObject *data) > migrate_print_status(mon, "ram", qdict); > } > > + if (qdict_haskey(qdict, "ram-duplicate")) { > + migrate_print_ram_status(mon, "ram-duplicate", qdict); > + } > + > + if (qdict_haskey(qdict, "ram-normal")) { > + migrate_print_ram_status(mon, "ram-normal", qdict); 
> + } > + > + if (qdict_haskey(qdict, "ram-xbrle")) { > + migrate_print_ram_status(mon, "ram-xbrle", qdict); > + } > + > if (qdict_haskey(qdict, "disk")) { > migrate_print_status(mon, "disk", qdict); > } > @@ -214,6 +266,23 @@ static void migrate_put_status(QDict *qdict, const char *name, > qdict_put_obj(qdict, name, obj); > } > > +static void migrate_put_ram_status(QDict *qdict, const char *name, > + uint64_t bytes, uint64_t pages, > + uint64_t overflow, uint64_t cache_hit, > + uint64_t cache_lookup) > +{ > + QObject *obj; > + > + obj = qobject_from_jsonf("{ 'bytes': %" PRId64 ", " > + "'pages': %" PRId64 ", " > + "'overflow': %" PRId64 ", " > + "'cache-hit': %" PRId64 ", " > + "'cache-lookup': %" PRId64 " }", > + bytes, pages, overflow, cache_hit, > + cache_lookup); > + qdict_put_obj(qdict, name, obj); > +} > + > void do_info_migrate(Monitor *mon, QObject **ret_data) > { > QDict *qdict; > @@ -228,6 +297,21 @@ void do_info_migrate(Monitor *mon, QObject **ret_data) > migrate_put_status(qdict, "ram", ram_bytes_transferred(), > ram_bytes_remaining(), ram_bytes_total()); > > + if (s->use_xbrle) { > + migrate_put_ram_status(qdict, "ram-duplicate", > + dup_mig_bytes_transferred(), > + dup_mig_pages_transferred(), 0, 0, 0); > + migrate_put_ram_status(qdict, "ram-normal", > + norm_mig_bytes_transferred(), > + norm_mig_pages_transferred(), 0, 0, 0); > + migrate_put_ram_status(qdict, "ram-xbrle", > + xbrle_mig_bytes_transferred(), > + xbrle_mig_pages_transferred(), > + xbrle_mig_pages_overflow(), > + xbrle_mig_pages_cache_hit(), > + xbrle_mig_pages_cache_lookup()); > + } > + > if (blk_mig_active()) { > migrate_put_status(qdict, "disk", blk_mig_bytes_transferred(), > blk_mig_bytes_remaining(), > @@ -341,7 +425,8 @@ void migrate_fd_connect(FdMigrationState *s) > > DPRINTF("beginning savevm\n"); > ret = qemu_savevm_state_begin(s->mon, s->file, s->mig_state.blk, > - s->mig_state.shared); > + s->mig_state.shared, s->mig_state.use_xbrle, > + s->mig_state.xbrle_cache_size); > if (ret< 0) { > DPRINTF("failed, %d\n", ret); > migrate_fd_error(s); > @@ -448,3 +533,27 @@ int migrate_fd_close(void *opaque) > qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL); > return s->close(s); > } > + > +void do_migrate_set_cachesize(Monitor *mon, const QDict *qdict) > +{ > + ssize_t bytes; > + const char *value = qdict_get_str(qdict, "value"); > + > + bytes = strtosz(value, NULL); > + if (bytes< 0) { > + monitor_printf(mon, "invalid cache size: %s\n", value); > + return; > + } > + > + /* On 32-bit hosts, QEMU is limited by virtual address space */ > + if (bytes> (2047<< 20)&& HOST_LONG_BITS == 32) { > + monitor_printf(mon, "cache can't exceed 2047 MB RAM limit on host\n"); > + return; > + } > + if (bytes != (uint64_t) bytes) { > + monitor_printf(mon, "cache size too large\n"); > + return; > + } > + migrate_cache_size = bytes; > +} > + > diff --git a/migration.h b/migration.h > index d13ed4f..6dc0543 100644 > --- a/migration.h > +++ b/migration.h > @@ -32,6 +32,8 @@ struct MigrationState > void (*release)(MigrationState *s); > int blk; > int shared; > + int use_xbrle; > + int64_t xbrle_cache_size; > }; > > typedef struct FdMigrationState FdMigrationState; > @@ -76,7 +78,9 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon, > int64_t bandwidth_limit, > int detach, > int blk, > - int inc); > + int inc, > + int use_xbrle, > + int64_t xbrle_cache_size); > > int tcp_start_incoming_migration(const char *host_port); > > @@ -85,7 +89,9 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon, > int64_t 
bandwidth_limit, > int detach, > int blk, > - int inc); > + int inc, > + int use_xbrle, > + int64_t xbrle_cache_size); > > int unix_start_incoming_migration(const char *path); > > @@ -94,7 +100,9 @@ MigrationState *unix_start_outgoing_migration(Monitor *mon, > int64_t bandwidth_limit, > int detach, > int blk, > - int inc); > + int inc, > + int use_xbrle, > + int64_t xbrle_cache_size); > > int fd_start_incoming_migration(const char *path); > > @@ -103,7 +111,9 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon, > int64_t bandwidth_limit, > int detach, > int blk, > - int inc); > + int inc, > + int use_xbrle, > + int64_t xbrle_cache_size); > > void migrate_fd_monitor_suspend(FdMigrationState *s, Monitor *mon); > > @@ -134,4 +144,11 @@ static inline FdMigrationState *migrate_to_fms(MigrationState *mig_state) > return container_of(mig_state, FdMigrationState, mig_state); > } > > +void do_migrate_set_cachesize(Monitor *mon, const QDict *qdict); > + > +void arch_set_params(int blk_enable, int shared_base, > + int use_xbrle, int64_t xbrle_cache_size, void *opaque); > + > +int xbrle_mig_active(void); > + > #endif > diff --git a/qmp-commands.hx b/qmp-commands.hx > index 793cf1c..8fbe64b 100644 > --- a/qmp-commands.hx > +++ b/qmp-commands.hx > @@ -431,13 +431,16 @@ EQMP > > { > .name = "migrate", > - .args_type = "detach:-d,blk:-b,inc:-i,uri:s", > - .params = "[-d] [-b] [-i] uri", > - .help = "migrate to URI (using -d to not wait for completion)" > - "\n\t\t\t -b for migration without shared storage with" > - " full copy of disk\n\t\t\t -i for migration without " > - "shared storage with incremental copy of disk " > - "(base image shared between src and destination)", > + .args_type = "detach:-d,blk:-b,inc:-i,xbrle:-x,uri:s", > + .params = "[-d] [-b] [-i] [-x] uri", > + .help = "migrate to URI" > + "\n\t -d to not wait for completion" > + "\n\t -b for migration without shared storage with" > + " full copy of disk" > + "\n\t -i for migration without" > + " shared storage with incremental copy of disk" > + " (base image shared between source and destination)" > + "\n\t -x to use XBRLE page delta compression", > .user_print = monitor_user_noop, > .mhandler.cmd_new = do_migrate, > }, > @@ -453,6 +456,7 @@ Arguments: > - "blk": block migration, full disk copy (json-bool, optional) > - "inc": incremental disk copy (json-bool, optional) > - "uri": Destination URI (json-string) > +- "xbrle": to use XBRLE page delta compression > > Example: > > @@ -494,6 +498,31 @@ Example: > EQMP > > { > + .name = "migrate_set_cachesize", > + .args_type = "value:s", > + .params = "value", > + .help = "set cache size (in MB) for xbrle migrations", > + .mhandler.cmd = do_migrate_set_cachesize, > + }, > + > +SQMP > +migrate_set_cachesize > +--------------------- > + > +Set cache size to be used by XBRLE migration > + > +Arguments: > + > +- "value": cache size in bytes (json-number) > + > +Example: > + > +-> { "execute": "migrate_set_cachesize", "arguments": { "value": 500M } } > +<- { "return": {} } > + > +EQMP > + > + { > .name = "migrate_set_speed", > .args_type = "value:f", > .params = "value", > diff --git a/savevm.c b/savevm.c > index 4e49765..93b512b 100644 > --- a/savevm.c > +++ b/savevm.c > @@ -1141,7 +1141,8 @@ int register_savevm(DeviceState *dev, > void *opaque) > { > return register_savevm_live(dev, idstr, instance_id, version_id, > - NULL, NULL, save_state, load_state, opaque); > + arch_set_params, NULL, save_state, > + load_state, opaque); > } > > void unregister_savevm(DeviceState *dev, const char 
*idstr, void *opaque) > @@ -1428,15 +1429,17 @@ static int vmstate_save(QEMUFile *f, SaveStateEntry *se) > #define QEMU_VM_SUBSECTION 0x05 > > int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, > - int shared) > + int shared, int use_xbrle, > + int64_t xbrle_cache_size) > { > SaveStateEntry *se; > > QTAILQ_FOREACH(se,&savevm_handlers, entry) { > if(se->set_params == NULL) { > continue; > - } > - se->set_params(blk_enable, shared, se->opaque); > + } > + se->set_params(blk_enable, shared, use_xbrle, xbrle_cache_size, > + se->opaque); > } > > qemu_put_be32(f, QEMU_VM_FILE_MAGIC); > @@ -1577,7 +1580,7 @@ static int qemu_savevm_state(Monitor *mon, QEMUFile *f) > > bdrv_flush_all(); > > - ret = qemu_savevm_state_begin(mon, f, 0, 0); > + ret = qemu_savevm_state_begin(mon, f, 0, 0, 0, 0); > if (ret< 0) > goto out; > > diff --git a/sysemu.h b/sysemu.h > index b81a70e..eb53bf7 100644 > --- a/sysemu.h > +++ b/sysemu.h > @@ -44,6 +44,16 @@ uint64_t ram_bytes_remaining(void); > uint64_t ram_bytes_transferred(void); > uint64_t ram_bytes_total(void); > > +uint64_t dup_mig_bytes_transferred(void); > +uint64_t dup_mig_pages_transferred(void); > +uint64_t norm_mig_bytes_transferred(void); > +uint64_t norm_mig_pages_transferred(void); > +uint64_t xbrle_mig_bytes_transferred(void); > +uint64_t xbrle_mig_pages_transferred(void); > +uint64_t xbrle_mig_pages_overflow(void); > +uint64_t xbrle_mig_pages_cache_lookup(void); > +uint64_t xbrle_mig_pages_cache_hit(void); > + > int64_t cpu_get_ticks(void); > void cpu_enable_ticks(void); > void cpu_disable_ticks(void); > @@ -74,7 +84,8 @@ void qemu_announce_self(void); > void main_loop_wait(int nonblocking); > > int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, > - int shared); > + int shared, int use_xbrle, > + int64_t xbrle_cache_size); > int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f); > int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f); > void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f); > diff --git a/xbzrle.c b/xbzrle.c > new file mode 100644 > index 0000000..e9285e0 > --- /dev/null > +++ b/xbzrle.c > @@ -0,0 +1,126 @@ > +#include<stdint.h> > +#include<string.h> > +#include<assert.h> > +#include "cpu-all.h" > +#include "xbzrle.h" > + > +typedef struct { > + uint64_t c; > + uint64_t num; > +} zero_encoding_t; > + > +typedef struct { > + uint64_t c; > +} char_encoding_t; > + > +static int rle_encode(uint64_t *in, int slen, uint8_t *out, const int dlen) > +{ > + int dl = 0; > + uint64_t cp = 0, c, run_len = 0; > + > + if (slen<= 0) > + return -1; > + > + while (1) { > + if (!slen) > + break; > + c = *in++; > + slen--; > + if (!(cp || c)) { > + run_len++; > + } else if (!cp) { > + ((zero_encoding_t *)out)->c = cp; > + ((zero_encoding_t *)out)->num = run_len; > + dl += sizeof(zero_encoding_t); > + out += sizeof(zero_encoding_t); > + run_len = 1; > + } else { > + ((char_encoding_t *)out)->c = cp; > + dl += sizeof(char_encoding_t); > + out += sizeof(char_encoding_t); > + } > + cp = c; > + } > + > + if (!cp) { > + ((zero_encoding_t *)out)->c = cp; > + ((zero_encoding_t *)out)->num = run_len; > + dl += sizeof(zero_encoding_t); > + out += sizeof(zero_encoding_t); > + } else { > + ((char_encoding_t *)out)->c = cp; > + dl += sizeof(char_encoding_t); > + out += sizeof(char_encoding_t); > + } > + return dl; > +} > + > +static int rle_decode(const uint8_t *in, int slen, uint64_t *out, int dlen) > +{ > + int tb = 0; > + uint64_t run_len, c; > + > + while (slen> 0) { > + c = ((char_encoding_t *) in)->c; > + if (c) { 
> + slen -= sizeof(char_encoding_t); > + in += sizeof(char_encoding_t); > + *out++ = c; > + tb++; > + continue; > + } > + run_len = ((zero_encoding_t *) in)->num; > + slen -= sizeof(zero_encoding_t); > + in += sizeof(zero_encoding_t); > + while (run_len--> 0) { > + *out++ = c; > + tb++; > + } > + } > + return tb; > +} > + > +static void xor_encode_word(uint8_t *dst, const uint8_t *src1, > + const uint8_t *src2) > +{ > + int len = TARGET_PAGE_SIZE / sizeof (uint64_t); > + uint64_t *dstw = (uint64_t *) dst; > + const uint64_t *srcw1 = (const uint64_t *) src1; > + const uint64_t *srcw2 = (const uint64_t *) src2; > + > + while (len--) { > + *dstw++ = *srcw1++ ^ *srcw2++; > + } > +} > + > +int xbzrle_encode(uint8_t *xbzrle, const uint8_t *old, const uint8_t *curr, > + const size_t max_compressed_len) > +{ > + int compressed_len; > + uint8_t xor_buf[TARGET_PAGE_SIZE]; > + uint8_t work_buf[TARGET_PAGE_SIZE * 2]; /* worst case xbzrle is 150% */ > + > + xor_encode_word(xor_buf, old, curr); > + compressed_len = rle_encode((uint64_t *)xor_buf, > + sizeof(xor_buf)/sizeof(uint64_t), work_buf, > + sizeof(work_buf)); > + if (compressed_len> max_compressed_len) { > + return -1; > + } > + memcpy(xbzrle, work_buf, compressed_len); > + return compressed_len; > +} > + > +int xbzrle_decode(uint8_t *curr, const uint8_t *old, const uint8_t *xbrle, > + const size_t compressed_len) > +{ > + uint8_t xor_buf[TARGET_PAGE_SIZE]; > + > + int len = rle_decode(xbrle, compressed_len, > + (uint64_t *)xor_buf, sizeof(xor_buf)/sizeof(uint64_t)); > + if (len< 0) { > + return len; > + } > + xor_encode_word(curr, old, xor_buf); > + return len * sizeof(uint64_t); > +} > diff --git a/xbzrle.h b/xbzrle.h > new file mode 100644 > index 0000000..5d625a0 > --- /dev/null > +++ b/xbzrle.h > @@ -0,0 +1,12 @@ > +#ifndef _XBZRLE_H_ > +#define _XBZRLE_H_ > + > +#include<stdio.h> > + > +int xbzrle_encode(uint8_t *xbrle, const uint8_t *old, const uint8_t *curr, > + const size_t len); > +int xbzrle_decode(uint8_t *curr, const uint8_t *old, const uint8_t *xbrle, > + const size_t len); > + > +#endif > + >
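Stripped of the cache, header and byte-stream details, the encoder the patch adds in xbzrle.c comes down to two steps: XOR the cached copy of a page with its current contents, then run-length encode the zero words that dominate the XOR result when updates are sparse. The following is a simplified sketch of that scheme, not the patch's code; the function name, the two-word run header, and the fixed 4 KiB page size are illustrative assumptions.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_WORDS (4096 / sizeof(uint64_t))   /* assume 4 KiB pages */

    /*
     * Encode new_page as a delta against old_page.  Runs of unchanged
     * words (zero XOR) become a (0, run_length) pair; changed words are
     * emitted as literal XOR values.  A literal can never be zero, so
     * the zero marker is unambiguous for the decoder.  Returns the
     * encoded length in bytes, or -1 when the output would not fit, in
     * which case the caller falls back to sending the raw page, as the
     * patch does on overflow.
     */
    static ptrdiff_t xbzrle_encode_sketch(uint8_t *dst, size_t dst_len,
                                          const uint64_t *old_page,
                                          const uint64_t *new_page)
    {
        uint8_t *p = dst;
        size_t i = 0;

        while (i < PAGE_WORDS) {
            uint64_t x = old_page[i] ^ new_page[i];

            if (x == 0) {
                uint64_t zero = 0, run = 0;
                while (i < PAGE_WORDS && old_page[i] == new_page[i]) {
                    run++;
                    i++;
                }
                if ((size_t)(p - dst) + 2 * sizeof(uint64_t) > dst_len) {
                    return -1;
                }
                memcpy(p, &zero, sizeof(zero));          /* run marker */
                memcpy(p + sizeof(zero), &run, sizeof(run));
                p += 2 * sizeof(uint64_t);
            } else {
                if ((size_t)(p - dst) + sizeof(x) > dst_len) {
                    return -1;
                }
                memcpy(p, &x, sizeof(x));                /* literal word */
                p += sizeof(x);
                i++;
            }
        }
        return p - dst;
    }

Decoding walks the stream the same way: a zero word announces a run of run_length unchanged words, and any other word is XORed back into the old page, which is why the receiver must already hold the previous page content.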
On 08.08.2011, at 15:29, Anthony Liguori wrote:

> On 08/08/2011 03:42 AM, Shribman, Aidan wrote:
>> Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
>> From: Aidan Shribman <aidan.shribman@sap.com>
>>
>> By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime
>> and total live-migration time of VMs running memory-write-intensive workloads
>> typical of large enterprise applications such as SAP ERP systems, and, generally
>> speaking, of any application with a sparse memory update pattern.

[snip]

> One thing that strikes me about this algorithm is that it's very good for a
> particular type of workload--shockingly good, really.
>
> I think workload-aware migration compression is possible for a lot of different
> types of workloads. That makes me a bit wary of QEMU growing quite a lot of
> compression mechanisms.
>
> It makes me think that this logic may really belong at a higher level where more
> information is known about the workload. For instance, I can imagine XBZRLE
> living in something like libvirt.
>
> Today, parsing migration traffic is pretty horrible, but I think we're pretty
> strongly committed to fixing that in 1.0. That makes me wonder if it would be
> nicer architecturally for a higher-level tool to own something like this.
>
> Originally, when I added migration, I had the view that we would have transport
> plugins based on the exec: protocol. That hasn't really happened, since libvirt
> really owns migration, but I think having XBZRLE as a transport plugin for
> libvirt is something worth considering.
>
> I'm curious what people think about this type of approach. CC'ing libvirt to get
> their input.
In general, I believe it's a good idea to keep looking at libvirt as a VM management layer and only a VM management layer. Directly working with the migration protocol basically ties us to libvirt if we want to do migration, killing competition in the management stack. Just look at how xm is tied to Xen - it's one of the major points I dislike about it :).

Alex
On 08/08/2011 08:41 AM, Alexander Graf wrote:
> On 08.08.2011, at 15:29, Anthony Liguori wrote:
>
> [snip]
>
>> Originally, when I added migration, I had the view that we would have
>> transport plugins based on the exec: protocol. That hasn't really
>> happened, since libvirt really owns migration, but I think having
>> XBZRLE as a transport plugin for libvirt is something worth considering.
>>
>> I'm curious what people think about this type of approach. CC'ing
>> libvirt to get their input.
>
> In general, I believe it's a good idea to keep looking at libvirt as a
> VM management layer and only a VM management layer. Directly working
> with the migration protocol basically ties us to libvirt if we want to
> do migration, killing competition in the management stack. Just look at
> how xm is tied to Xen - it's one of the major points I dislike about
> it :).

The way I originally envisioned things, you'd have:

    (qemu) migrate xbzrle://destination?opt1=value1&opt2=value2

which would in turn be equivalent to:

    (qemu) migrate exec:///usr/libexec/qemu/migration-helper-xbzrle --opt1=value1 --opt2=value2

But even if we supported that, it wouldn't get exposed via libvirt unless the libvirt guys exposed QEMU URIs directly.

So I think the open question is: how do we do transport plugins in a way that makes libvirt and QEMU both happy?

Regards,

Anthony Liguori
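To make the exec: helper model above concrete: under the exec: protocol QEMU simply pipes the migration stream through the given command line, so a transport plugin is just a filter process. A minimal skeleton for the hypothetical migration-helper-xbzrle (the binary name and options are Anthony's illustration; the pass-through body below is this sketch's assumption) might look like:

    /*
     * Hypothetical exec: migration helper: QEMU writes the migration
     * stream to our stdin; we forward it to stdout, which could be a
     * socket to the destination.  A real helper would parse --opt1
     * style arguments and run an encoder such as XBZRLE over the page
     * data before forwarding.
     */
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char buf[65536];
        ssize_t n;

        (void)argc;
        (void)argv;
        while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
            ssize_t off = 0;
            while (off < n) {
                ssize_t w = write(STDOUT_FILENO, buf + off, n - off);
                if (w < 0) {
                    return 1;
                }
                off += w;
            }
        }
        return n < 0 ? 1 : 0;
    }

The appeal of this split is isolation: a buggy or slow encoder can't take QEMU down with it. The cost, as discussed below, is extra data copies across the pipe.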
On 08/08/2011 04:41 PM, Alexander Graf wrote:
> In general, I believe it's a good idea to keep looking at libvirt as a VM management layer and only a VM management layer.
Very much yes.
On 08/08/2011 04:29 PM, Anthony Liguori wrote:
> One thing that strikes me about this algorithm is that it's very good
> for a particular type of workload--shockingly good really.

Poking bytes at random places in memory is fairly generic. If you have a lot of small objects, and modify a subset of them, this is the pattern you get.

> I think workload aware migration compression is possible for a lot of
> different types of workloads. That makes me a bit wary of QEMU
> growing quite a lot of compression mechanisms.
>
> It makes me think that this logic may really belong at a higher level
> where more information is known about the workload. For instance, I
> can imagine XBZRLE living in something like libvirt.

A better model would be plugin based.
On Mon, Aug 08, 2011 at 08:29:51AM -0500, Anthony Liguori wrote:
> On 08/08/2011 03:42 AM, Shribman, Aidan wrote:
>> Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
>> From: Aidan Shribman <aidan.shribman@sap.com>
>>
>> By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime
>> and total live-migration time of VMs running memory-write-intensive workloads
>> typical of large enterprise applications such as SAP ERP systems, and, generally
>> speaking, of any application with a sparse memory update pattern.

[snip]

> One thing that strikes me about this algorithm is that it's very
> good for a particular type of workload--shockingly good really.
>
> I think workload aware migration compression is possible for a lot
> of different types of workloads. That makes me a bit wary of QEMU
> growing quite a lot of compression mechanisms.
>
> It makes me think that this logic may really belong at a higher
> level where more information is known about the workload. For
> instance, I can imagine XBZRLE living in something like libvirt.
>
> Today, parsing migration traffic is pretty horrible but I think
> we're pretty strongly committed to fixing that in 1.0. That makes
> me wonder if it would be nicer architecturally for a higher level
> tool to own something like this.
>
> Originally, when I added migration, I had the view that we would
> have transport plugins based on the exec: protocol. That hasn't
> really happened since libvirt really owns migration but I think
> having XBZRLE as a transport plugin for libvirt is something worth
> considering.

NB, I've not been much of a fan of the exec: migration code, since it has proved rather buggy in practice when we used it for 'save/restore to/from file' support. It has been hard to diagnose when things go wrong, and difficult for QEMU to report any useful error messages. Even with the tcp: protocol, QEMU is seemingly unable to provide any useful error reporting, even of things as simple as "unable to connect to remote host". So with one exception, current libvirt now uses the 'fd:' protocol for everything, and the last exception will be removed soon too.

> I'm curious what people think about this type of approach. CC'ing
> libvirt to get their input.

In "normal" migration, even when using fd:, we don't make any attempt to touch the data stream. We just pass a pre-connected TCP socket into QEMU and let it write directly to it. This avoids extra data copying via libvirt.

In our alternative "tunnelled" migration mode, libvirt does touch the data stream, passing a pipe FD into QEMU, copying the data from the pipe into packets to be sent over libvirtd's existing secure RPC stream, and then copying it back to QEMU on the destination. The downside here is that we've added several extra data copies.

In our 'save/restore to file' code, we use 'fd:' and always send the data via a filter program. For example, we have the ability to compress/decompress data via gzip, bzip, xz, and lzop, for which we instead pass QEMU a pipe FD to the external compression helper program. We also have a new option where we send data via an I/O helper program that uses O_DIRECT, so save/restore does not pollute the page cache.

With this kind of existing precedent, I won't strongly argue against libvirt adding a filter to support this XBZRLE encoding scheme for migration, or indeed save/restore too, if it proves better than lzop, which is our current optimal speed/compression winner.

My main concern with all these scenarios where libvirt touches the actual data stream, though, is that we're introducing extra data copies into the migration path which potentially waste CPU cycles. If QEMU can directly XBZRLE-encode data into the FD passed via 'fd:' then we minimize data copies. Whether this is a big enough benefit to offset the burden of having to maintain various compression code options in QEMU, I can't answer.

Regards,
Daniel
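For readers unfamiliar with the arrangement Daniel describes: the management layer builds the filter pipeline itself and hands QEMU nothing but a file descriptor via 'fd:'. Roughly, and with error handling trimmed (gzip stands in here for whichever compression helper is configured, and the function is invented for illustration):

    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /*
     * Returns an fd the caller can hand to QEMU ("migrate fd:...");
     * a child process compresses everything written to it into
     * save_path.
     */
    static int open_compressed_save_fd(const char *save_path)
    {
        int pipefd[2];
        pid_t pid;

        if (pipe(pipefd) < 0) {
            return -1;
        }
        pid = fork();
        if (pid == 0) {
            int out = open(save_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
            dup2(pipefd[0], STDIN_FILENO);   /* read the migration data */
            dup2(out, STDOUT_FILENO);        /* write the compressed file */
            close(pipefd[1]);
            execlp("gzip", "gzip", "-c", (char *)NULL);
            _exit(127);
        }
        close(pipefd[0]);
        return pipefd[1];
    }

    int main(void)
    {
        int fd = open_compressed_save_fd("guest.sav.gz");
        if (fd < 0) {
            return 1;
        }
        write(fd, "stream goes here\n", 17);  /* QEMU's role, faked */
        close(fd);
        wait(NULL);                           /* reap the compressor */
        return 0;
    }

Restore runs the same pipeline in reverse, with the helper decompressing the file into a pipe that QEMU reads from. Every filter inserted this way adds a copy through the pipe, which is exactly the CPU cost being weighed here.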
On 08/08/2011 08:51 AM, Avi Kivity wrote:
> On 08/08/2011 04:29 PM, Anthony Liguori wrote:
>> One thing that strikes me about this algorithm is that it's very good
>> for a particular type of workload--shockingly good really.
>
> Poking bytes at random places in memory is fairly generic. If you have a
> lot of small objects, and modify a subset of them, this is the pattern
> you get.
>
>> I think workload aware migration compression is possible for a lot of
>> different types of workloads. That makes me a bit wary of QEMU growing
>> quite a lot of compression mechanisms.
>>
>> It makes me think that this logic may really belong at a higher level
>> where more information is known about the workload. For instance, I
>> can imagine XBZRLE living in something like libvirt.
>
> A better model would be plugin based.

exec helpers are plugins. They just live in a different address space and have a channel to exchange data (a pipe).

If we did .so plugins, which I'm really not opposed to, I'd want the interface to be something like:

    typedef struct MigrationTransportClass
    {
        ssize_t (*writev)(MigrationTransport *obj,
                          struct iovec *iov,
                          int iovcnt);
    } MigrationTransportClass;

I think it's useful to use an interface like this because it makes it easy to put the transport in a dedicated thread that doesn't hold qemu_mutex (which is sort of equivalent to using a fork'd helper but is zero-copy at the expense of less isolation).

Regards,

Anthony Liguori
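For what it's worth, a .so plugin against that sketched interface could be quite small. Everything below except the writev() callback signature quoted above is invented to fill in the gaps (the MigrationTransport layout, the instance wiring):

    #include <stddef.h>
    #include <sys/uio.h>

    typedef struct MigrationTransport MigrationTransport;

    typedef struct MigrationTransportClass {
        ssize_t (*writev)(MigrationTransport *obj,
                          struct iovec *iov,
                          int iovcnt);
    } MigrationTransportClass;

    /* Assumed instance layout; the real interface never got defined. */
    struct MigrationTransport {
        const MigrationTransportClass *klass;
        int fd;                  /* pre-connected destination socket */
    };

    static ssize_t xbzrle_writev(MigrationTransport *obj,
                                 struct iovec *iov, int iovcnt)
    {
        /* A real plugin would delta-encode page-sized chunks before
         * forwarding; this stub passes the stream through untouched. */
        return writev(obj->fd, iov, iovcnt);
    }

    const MigrationTransportClass xbzrle_transport = {
        .writev = xbzrle_writev,
    };

Because the only entry point is writev(), QEMU could drive it from a dedicated migration thread exactly as described, without the plugin needing to know about threads at all.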
On 08/08/2011 05:15 PM, Anthony Liguori wrote:
>>
>>>
>>> I think workload aware migration compression is possible for a lot of
>>> different types of workloads. That makes me a bit wary of QEMU growing
>>> quite a lot of compression mechanisms.
>>>
>>> It makes me think that this logic may really belong at a higher level
>>> where more information is known about the workload. For instance, I
>>> can imagine XBZRLE living in something like libvirt.
>>
>> A better model would be plugin based.
>
>
> exec helpers are plugins. They just live in a different address space
> and a channel to exchange data (pipe).

libvirt isn't an exec helper.

>
> If we did .so plugins, which I'm really not opposed to, I'd want the
> interface to be something like:
>
> typedef struct MigrationTransportClass
> {
>     ssize_t (*writev)(MigrationTransport *obj,
>                       struct iovec *iov,
>                       int iovcnt);
> } MigrationTransportClass;
>
> I think it's useful to use an interface like this because it makes it
> easy to put the transport in a dedicated thread that didn't hold
> qemu_mutex (which is sort of equivalent to using a fork'd helper but
> is zero-copy at the expense of less isolation).

If we have a shared object helper, the thread should be maintained by qemu proper, not the plugin.

I wouldn't call it "migration transport", but instead a compression/decompression plugin.

I don't think it merits a plugin at all though. There's limited scope for compression and it best sits in qemu proper. If anything, it needs to be more integrated (for example turning itself off if it doesn't match enough).
On 08/08/2011 09:23 AM, Avi Kivity wrote:
> On 08/08/2011 05:15 PM, Anthony Liguori wrote:
>>
>> If we did .so plugins, which I'm really not opposed to, I'd want the
>> interface to be something like:
>>
>> typedef struct MigrationTransportClass
>> {
>>     ssize_t (*writev)(MigrationTransport *obj,
>>                       struct iovec *iov,
>>                       int iovcnt);
>> } MigrationTransportClass;
>>
>> I think it's useful to use an interface like this because it makes it
>> easy to put the transport in a dedicated thread that didn't hold
>> qemu_mutex (which is sort of equivalent to using a fork'd helper but
>> is zero-copy at the expense of less isolation).
>
> If we have a shared object helper, the thread should be maintained by
> qemu proper, not the plugin.
>
> I wouldn't call it "migration transport", but instead a
> compression/decompression plugin.
>
> I don't think it merits a plugin at all though. There's limited scope
> for compression and it best sits in qemu proper. If anything, it needs
> to be more integrated (for example turning itself off if it doesn't
> match enough).

That adds a tremendous amount of complexity to QEMU.

If we're going to change our compression algorithm, we would need to use a single algorithm that worked well for a wide variety of workloads.

We struggle enough with migration as it is, it only would get worse if we have 10 different algorithms that we were dynamically enabling/disabling.

The other option is to allow 1-off compression algorithms in the form of plugins. I think in this case, plugins are a pretty good compromise in terms of isolating complexity while allowing something that at least works very well for one particular type of workload.

Regards,

Anthony Liguori
On 08/08/2011 05:33 PM, Anthony Liguori wrote:
>> If we have a shared object helper, the thread should be maintained by
>> qemu proper, not the plugin.
>>
>> I wouldn't call it "migration transport", but instead a
>> compression/decompression plugin.
>>
>> I don't think it merits a plugin at all though. There's limited scope
>> for compression and it best sits in qemu proper. If anything, it needs
>> to be more integrated (for example turning itself off if it doesn't
>> match enough).
>
> That adds a tremendous amount of complexity to QEMU.

Tremendous? You exaggerate. It's a lot simpler than the block or char layers, for example.

> If we're going to change our compression algorithm, we would need to
> use a single algorithm that worked well for a wide variety of workloads.

That algorithm will have to include XBZRLE as a subset, since it matches what workloads actually do (touch memory sparsely).

>
> We struggle enough with migration as it is, it only would get worse if
> we have 10 different algorithms that we were dynamically
> enabling/disabling.
>
> The other option is to allow 1-off compression algorithms in the form
> of plugins. I think in this case, plugins are a pretty good
> compromise in terms of isolating complexity while allowing something
> that at least works very well for one particular type of workload.

I think you underestimate the generality of XBZRLE (or maybe I'm overestimating it?). It's not reasonable to ask users to match a compression algorithm to their workload; most times they won't be interacting with the host at all. We need compression to be enabled at all times, turning itself off if it finds it isn't effective so it can consume less cpu.
On 08/08/2011 05:04 PM, Daniel P. Berrange wrote:
> My main concern with all these scenarios where libvirt touches the
> actual data stream though is that we're introducing extra data copies
> into the migration path which potentially waste CPU cycles.
> If QEMU can directly XBZRLE encode data into the FD passed via 'fd:'
> then we minimize data copies. Whether this is a big enough benefit
> to offset the burden of having to maintain various compression code
> options in QEMU I can't answer.
>

It's counterproductive to force an unneeded data copy in order to increase bandwidth.
On 08/08/2011 11:42 AM, Shribman, Aidan wrote:
> Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
> From: Aidan Shribman<aidan.shribman@sap.com>
>
> By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime
> and total live-migration time of VMs running memory write intensive workloads
> typical of large enterprise applications such as SAP ERP Systems, and generally
> speaking for any application with a sparse memory update pattern.
>
> On the sender side XBZRLE is used as a compact delta encoding of page updates,
> retrieving the old page content from an LRU cache (default size of 64 MB). The
> receiving side uses the existing page content and XBZRLE to decode the new page
> content.
>
> Work was originally based on research results published VEE 2011: Evaluation of
> Delta Compression Techniques for Efficient Live Migration of Large Virtual
> Machines by Benoit, Svard, Tordsson and Elmroth. Additionally the delta encoder
> XBRLE was improved further using XBZRLE instead.
>
> XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads making it
> ideal for in-line, real-time encoding such as is needed for live-migration.
>
> A typical usage scenario:
> {qemu} migrate_set_cachesize 256m
> {qemu} migrate -x -d tcp:destination.host:4444
> {qemu} info migrate
> ...
> transferred ram-duplicate: A kbytes
> transferred ram-duplicate: B pages
> transferred ram-normal: C kbytes
> transferred ram-normal: D pages
> transferred ram-xbrle: E kbytes
> transferred ram-xbrle: F pages
> overflow ram-xbrle: G pages
> cache-hit ram-xbrle: H pages
> cache-lookup ram-xbrle: J pages
>
> Testing: live migration with XBZRLE completed in 110 seconds, without live
> migration was not able to complete.
>
> A simple synthetic memory r/w load generator:
> .. include<stdlib.h>
> .. include<stdio.h>
> .. int main()
> .. {
> .. char *buf = (char *) calloc(4096, 4096);
> .. while (1) {
> .. int i;
> .. for (i = 0; i< 4096 * 4; i++) {
> .. buf[i * 4096 / 4]++;
> .. }
> .. printf(".");
> .. }
> .. }
>

Please provide documentation in docs/ of the compression format.

IMO it should be enabled by default (with an option to disable it via, say, migrate-set-options, so we can migrate to older hosts).

The protocol should allow XBZRLE to turn itself off if it detects that it isn't effective.
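For reference while reading the rest of the thread, the record layout implied by the v4 patch (save_xbrle_page() on the sender, load_xbrle() on the receiver) works out to roughly the following. This is a reconstruction from the patch, not the requested docs/ text:

/*
 * XBZRLE page record as emitted by the v4 patch (reconstructed):
 *
 *   be64    addr | RAM_SAVE_FLAG_XBZRLE [| RAM_SAVE_FLAG_CONTINUE]
 *   u8      idstr length, followed by the RAMBlock idstr bytes
 *           (only when RAM_SAVE_FLAG_CONTINUE is not set)
 *   be32    xh_cksum   -- currently unused
 *   2 bytes xh_magic   -- 0x0123 written in host byte order; the
 *                         receiver compares it with BE16_MAGIC to
 *                         decide whether the encoded words need
 *                         byte-swapping
 *   be16    xh_len     -- encoded length, must be <= TARGET_PAGE_SIZE
 *   u8      xh_flags   -- must include ENCODING_FLAG_XBZRLE (0x1)
 *   xh_len bytes of encoded data: the new page XORed against the
 *   cached old page, then zero-run-length encoded
 */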
On 08/08/2011 05:46 PM, Avi Kivity wrote:
>
> Please provide documentation in docs/ of the compression format.
>
> IMO it should be enabled by default (with an option to disable it
> via, say, migrate-set-options, so we can migrate to older hosts).
>
> The protocol should allow XBZRLE to turn itself off if it detects that
> it isn't effective.
>

IOW, this should be part of the standard migration protocol, not some side option that is enabled if the user remembers. It should not be mutually exclusive with future migration extensions, including compression.
On Mon, Aug 8, 2011 at 3:47 PM, Avi Kivity <avi@redhat.com> wrote:
> On 08/08/2011 05:46 PM, Avi Kivity wrote:
>>
>> Please provide documentation in docs/ of the compression format.
>>
>> IMO it should be enabled by default (with an option to disable it
>> via, say, migrate-set-options, so we can migrate to older hosts).
>>
>> The protocol should allow XBZRLE to turn itself off if it detects that it
>> isn't effective.
>>
>
> IOW, this should be part of the standard migration protocol, not some side
> option that is enabled if the user remembers. It should not be mutually
> exclusive with future migration extensions, including compression.

This is an attractive option. With some polish maybe XBZRLE could be integrated as a default option that does not degrade performance. Adding features that require user configuration isn't worthwhile because they won't be used or they'll be misused - let's not make QEMU more complicated if it can be avoided.

If there is no way to make XBZRLE automatic then I think it should live outside QEMU because it will be a niche feature that relatively few will use but adds complexity to migration.

Stefan
On 08/08/2011 05:56 PM, Stefan Hajnoczi wrote:
> > IOW, this should be part of the standard migration protocol, not some side
> > option that is enabled if the user remembers. It should not be mutually
> > exclusive with future migration extensions, including compression.
>
> This is an attractive option. With some polish maybe XBZRLE could be
> integrated as a default option that does not degrade performance.
> Adding features that require user configuration isn't worthwhile
> because they won't be used or they'll be misused - let's not make QEMU
> more complicated if it can be avoided.
>
> If there is no way to make XBZRLE automatic then I think it should
> live outside QEMU because it will be a niche feature that relatively
> few will use but adds complexity to migration.

Agree.

Aidan, can you provide impact numbers on non-XBZRLE favourable workloads (both throughput and cpu usage)?

What about turning itself off automatically if the hit rate is too low?
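The self-disabling behaviour Avi and Stefan are converging on could be as small as the sketch below. The sampling window and threshold are invented and would need tuning against real measurements, and a complete version would also need a cheap way to probe whether re-enabling is worthwhile:

/* Sketch of hit-rate-driven auto-disable for XBZRLE (illustrative
 * thresholds, not measured ones). */
#include <stdbool.h>
#include <stdint.h>

#define XBZRLE_SAMPLE_PAGES 4096   /* re-evaluate every N dirty pages */
#define XBZRLE_MIN_HIT_RATE 20     /* percent of cache lookups */

static bool xbzrle_active = true;
static uint64_t window_lookups;
static uint64_t window_hits;

static void xbzrle_account_lookup(bool cache_hit)
{
    window_lookups++;
    window_hits += cache_hit;

    if (window_lookups < XBZRLE_SAMPLE_PAGES) {
        return;
    }
    /* If the cache rarely holds the previous version of a page, the
     * copy-into-cache and XOR overhead buys nothing: switch it off. */
    xbzrle_active = (window_hits * 100 / window_lookups)
                    >= XBZRLE_MIN_HIT_RATE;
    window_lookups = 0;
    window_hits = 0;
}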
On 08/08/2011 09:39 AM, Avi Kivity wrote:
>> The other option is to allow 1-off compression algorithms in the form
>> of plugins. I think in this case, plugins are a pretty good compromise
>> in terms of isolating complexity while allowing something that at
>> least works very well for one particular type of workload.
>
> I think you underestimate the generality of XBZRLE (or maybe I'm
> overestimating it?).

This is really my fundamental concern. When it comes to something that we have to support for a very long time, no one should be estimating anything. We should make these decisions based on an awful lot of analysis on a wide variety of workloads.

It's hard to do this in QEMU today because we don't have a module mechanism to make it easy for users to try out new things without fully committing to including something in the tree. But I don't think that's the root of the problem I have. I really am just extremely reluctant to commit to something that we have to support forever.

Thinking more about it though, I think there can be another solution--feature negotiation. I view adding feature negotiation as a pre-requisite to adding any type of transport compression such as XBZRLE. That will let us support migration to older QEMUs and also to eventually remove XBZRLE if we decide it doesn't make sense anymore.

Regards,

Anthony Liguori

> It's not reasonable to ask users to match a
> compression algorithm to their workload; most times they won't be
> interacting with the host at all. We need compression to be enabled at
> all times, turning itself off if it finds it isn't effective so it can
> consume less cpu.
>
On 08/08/2011 09:47 AM, Avi Kivity wrote:
> On 08/08/2011 05:46 PM, Avi Kivity wrote:
>>
>> Please provide documentation in docs/ of the compression format.
>>
>> IMO it should be enabled by default (with an option to disable it
>> via, say, migrate-set-options, so we can migrate to older hosts).
>>
>> The protocol should allow XBZRLE to turn itself off if it detects that
>> it isn't effective.
>>
>
> IOW, this should be part of the standard migration protocol, not some
> side option that is enabled if the user remembers. It should not be
> mutually exclusive with future migration extensions, including compression.

Are you thinking of a static decision or a dynamic decision?

I think feature negotiation would address static decision making. For dynamic decision making, you could look to something like the VNC protocol and how it encodes pixel data. The flow looks something like:

1) All clients/servers must support raw encoding

2) Client presents list of supported encodings

3) Server takes intersection of client supported encodings and server supported encodings.

4) Server can choose to encode updates using any encoding supported by client and server.

Regards,

Anthony Liguori
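A sketch of that flow with an illustrative bitmask representation; the enum values are invented, and the only rule carried over from the description above is that raw always survives the intersection:

/* VNC-style encoding negotiation, sketched as bitmasks. */
#include <stdbool.h>
#include <stdint.h>

enum {
    MIG_ENC_RAW    = 1u << 0,   /* step 1: mandatory everywhere */
    MIG_ENC_XBZRLE = 1u << 1,
    MIG_ENC_ASN1   = 1u << 2,
};

/* Steps 2 and 3: each side advertises, the server intersects. */
static uint32_t negotiate(uint32_t client_caps, uint32_t server_caps)
{
    return (client_caps & server_caps) | MIG_ENC_RAW;
}

/* Step 4: the sender may pick any negotiated encoding per update. */
static uint32_t pick_encoding(uint32_t negotiated, bool cache_hit)
{
    if ((negotiated & MIG_ENC_XBZRLE) && cache_hit) {
        return MIG_ENC_XBZRLE;
    }
    return MIG_ENC_RAW;
}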
On 08/08/2011 06:10 PM, Anthony Liguori wrote:
> On 08/08/2011 09:47 AM, Avi Kivity wrote:
>> On 08/08/2011 05:46 PM, Avi Kivity wrote:
>>>
>>> Please provide documentation in docs/ of the compression format.
>>>
>>> IMO it should be enabled by default (with an option to disable it
>>> via, say, migrate-set-options, so we can migrate to older hosts).
>>>
>>> The protocol should allow XBZRLE to turn itself off if it detects that
>>> it isn't effective.
>>>
>>
>> IOW, this should be part of the standard migration protocol, not some
>> side option that is enabled if the user remembers. It should not be
>> mutually exclusive with future migration extensions, including
>> compression.
>
> Are you thinking of a static decision or a dynamic decision?
>

Dynamic. If the cache hit rate is too low, disable XBZRLE and eliminate the overhead of copying pages to the history buffer.

> I think feature negotiation would address static decision making. For
> dynamic decision making, you could look to something like the VNC
> protocol and how it encodes pixel data. The flow looks something like:
>
> 1) All clients/servers must support raw encoding
>
> 2) Client presents list of supported encodings
>
> 3) Server takes intersection of client supported encodings and server
> supported encodings.
>
> 4) Server can choose to encode updates using any encoding supported by
> client and server.

Feature negotiation in the migration protocol itself would break exec: migration (and any existing single duplex proxies).

We can do a poor man's feature negotiation via capabilities, relying on management to disable features which don't exist on the other side. It isn't pretty, but it's the best we can do at this point.

Real feature negotiation will likely have to wait until the next version of the migration protocol.
On 08/08/2011 10:15 AM, Avi Kivity wrote:
> On 08/08/2011 06:10 PM, Anthony Liguori wrote:
>> On 08/08/2011 09:47 AM, Avi Kivity wrote:
>>> On 08/08/2011 05:46 PM, Avi Kivity wrote:
>>>>
>>>> Please provide documentation in docs/ of the compression format.
>>>>
>>>> IMO it should be enabled by default (with an option to disable it
>>>> via, say, migrate-set-options, so we can migrate to older hosts).
>>>>
>>>> The protocol should allow XBZRLE to turn itself off if it detects that
>>>> it isn't effective.
>>>>
>>>
>>> IOW, this should be part of the standard migration protocol, not some
>>> side option that is enabled if the user remembers. It should not be
>>> mutually exclusive with future migration extensions, including
>>> compression.
>>
>> Are you thinking of a static decision or a dynamic decision?
>>
>
> Dynamic. If the cache hit rate is too low, disable XBZRLE and eliminate
> the overhead of copying pages to the history buffer.
>
>> I think feature negotiation would address static decision making. For
>> dynamic decision making, you could look to something like the VNC
>> protocol and how it encodes pixel data. The flow looks something like:
>>
>> 1) All clients/servers must support raw encoding
>>
>> 2) Client presents list of supported encodings
>>
>> 3) Server takes intersection of client supported encodings and server
>> supported encodings.
>>
>> 4) Server can choose to encode updates using any encoding supported by
>> client and server.
>
> Feature negotiation in the migration protocol itself would break exec:
> migration (and any existing single duplex proxies).
>
> We can do a poor man's feature negotiation via capabilities, relying on
> management to disable features which don't exist on the other side. It
> isn't pretty, but it's the best we can do at this point.

I think the above can be done via capabilities too fwiw. In this case, the source and destination advertise the compression formats they support, and the management tool takes the intersection and sets the capability mask on the source and destination appropriately. There's no need for a full duplex protocol because the source just sends compressed data in whatever format it thinks is appropriate at any given point in time.

> Real feature negotiation will likely have to wait until the next version
> of the migration protocol.

Since we're talking about moving to ASN.1 for 1.0, I think we should also include memory compression and wait until we rev the protocol before introducing any type of compression.

Regards,

Anthony Liguori
On 08/08/2011 07:19 PM, Anthony Liguori wrote:
>
>> Real feature negotiation will likely have to wait until the next version
>> of the migration protocol.
>
> Since we're talking about moving to ASN.1 for 1.0, I think we should
> also include memory compression and wait until we rev the
> protocol before introducing any type of compression.

Is anyone actually working on this?
On 08/08/2011 11:53 AM, Avi Kivity wrote:
> On 08/08/2011 07:19 PM, Anthony Liguori wrote:
>>
>>> Real feature negotiation will likely have to wait until the next version
>>> of the migration protocol.
>>
>> Since we're talking about moving to ASN.1 for 1.0, I think we should
>> also include memory compression and wait until we rev the
>> protocol before introducing any type of compression.
>
> Is anyone actually working on this?

Yes.

Regards,

Anthony Liguori
-----Original Message-----
From: Anthony Liguori [mailto:anthony@codemonkey.ws]
Sent: Monday, August 08, 2011 7:56 PM
To: Avi Kivity
Cc: Blue Swirl; Stefan Hajnoczi; Shribman, Aidan; qemu-devel Developers
Subject: Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

On 08/08/2011 11:53 AM, Avi Kivity wrote:
> On 08/08/2011 07:19 PM, Anthony Liguori wrote:
>>
>>> Real feature negotiation will likely have to wait until the next version
>>> of the migration protocol.
>>
>> Since we're talking about moving to ASN.1 for 1.0, I think we should
>> also include memory compression and wait until we rev the
>> protocol before introducing any type of compression.
>
> Is anyone actually working on this?

XBZRLE will very rarely (if at all) degrade live-migration as it runs at ~2 GB/s or 16 Gbps. Additionally XBZRLE could get even faster by using 128bit registers instead of the 64bit registers used currently. IMO XBZRLE could safely be used by default, exposing capabilities by Qemu such that higher level management would handle static negotiation (as suggested).

Given that XBZRLE will seldom fail due to inflated encoded output (an example for such a case -> dirty the new page every 2nd 64bit word: the word-wise Xor would give 0x0y0z... ZRLE would further encode it as 01x01y01z... a +50% increase), I see little incentive in automatic XBZRLE disablement.

As to implementing XBZRLE delta compression as a compression plug-in - this is not that straightforward, as it has some interesting interplay with DUP packets, which are crucial for performance: specifically, a page consisting of only zeros is LRU cached as a reference without the standard qemu_malloc()/memcpy() done in other cases. This is especially important for eliminating slowdown during live-migration initiation.

As to waiting for ASN.1 capability - I can see this will make parsing of live-migration messages much more reliable (ensuring that Qemu is able to detect an incorrect protocol version) but I can't say I am very happy waiting for 1.0 - are there any alternatives?

Aidan
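Aidan's +50% worst case can be reproduced with a toy model. It assumes, as his "01x01y01z" notation implies, that run lengths are counted in 64-bit words and that each zero-run/data-run pair costs one length word apiece plus the data itself; this illustrates the arithmetic only and is not the patch's actual encoder:

/* Toy model of the XBZRLE worst case: every 2nd 64-bit word dirtied. */
#include <stdio.h>

int main(void)
{
    const int page_words = 4096 / 8;  /* 64-bit words in a 4K page */
    int in_words = 0, out_words = 0;
    int i;

    /* The XOR of old and new page alternates: zero word, data word. */
    for (i = 0; i < page_words; i += 2) {
        in_words += 2;   /* one zero word + one modified word */
        out_words += 3;  /* zero-run len + data-run len + data word */
    }
    printf("page: %d words, encoded: %d words (%+d%%)\n",
           in_words, out_words,
           (out_words - in_words) * 100 / in_words);
    return 0;
}

Running it prints 512 input words against 768 encoded words, i.e. the +50% figure above.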
On 08/10/2011 06:07 PM, Shribman, Aidan wrote:
> XBZRLE will very rarely (if at all) degrade live-migration as it runs at ~2 GB/s or 16 Gbps. Additionally XBZRLE could get even faster by using 128bit registers instead of the 64bit registers used currently. IMO XBZRLE could safely be used by default, exposing capabilities by Qemu such that higher level management would handle static negotiation (as suggested).
>
> Given that XBZRLE will seldom fail due to inflated encoded output (an example for such a case -> dirty the new page every 2nd 64bit word: the word-wise Xor would give 0x0y0z... ZRLE would further encode it as 01x01y01z... a +50% increase), I see little incentive in automatic XBZRLE disablement.

My concern is not reduced migration bandwidth or inflated image size, but increased cpu use for copying pages to the cache and xoring them.

> As to implementing XBZRLE delta compression as a compression plug-in - this is not that straightforward, as it has some interesting interplay with DUP packets, which are crucial for performance: specifically, a page consisting of only zeros is LRU cached as a reference without the standard qemu_malloc()/memcpy() done in other cases. This is especially important for eliminating slowdown during live-migration initiation.

I agree, it should be on-by-default and in the main code base. Please provide numbers to justify this on non-artificial workloads, and on artificial worst-case workloads.

> As to waiting for ASN.1 capability - I can see this will make parsing of live-migration messages much more reliable (ensuring that Qemu is able to detect an incorrect protocol version) but I can't say I am very happy waiting for 1.0 - are there any alternatives?
>

I don't think we should couple the two features together.
On 08/10/2011 10:12 AM, Avi Kivity wrote:
> On 08/10/2011 06:07 PM, Shribman, Aidan wrote:
>> XBZRLE will very rarely (if at all) degrade live-migration as it runs
>> at ~2 GB/s or 16 Gbps. Additionally XBZRLE could get even faster by
>> using 128bit registers instead of the 64bit registers used currently.
>> IMO XBZRLE could safely be used by default, exposing capabilities by
>> Qemu such that higher level management would handle static negotiation
>> (as suggested).
>>
>> Given that XBZRLE will seldom fail due to inflated encoded output (an
>> example for such a case -> dirty the new page every 2nd 64bit word:
>> the word-wise Xor would give 0x0y0z... ZRLE would further encode it as
>> 01x01y01z... a +50% increase), I see little incentive in automatic
>> XBZRLE disablement.
>
> My concern is not reduced migration bandwidth or inflated image size,
> but increased cpu use for copying pages to the cache and xoring them.
>
>> As to implementing XBZRLE delta compression as a compression plug-in -
>> this is not that straightforward, as it has some interesting interplay
>> with DUP packets, which are crucial for performance: specifically, a
>> page consisting of only zeros is LRU cached as a reference without the
>> standard qemu_malloc()/memcpy() done in other cases. This is
>> especially important for eliminating slowdown during live-migration
>> initiation.
>
> I agree, it should be on-by-default and in the main code base. Please
> provide numbers to justify this on non-artificial workloads, and on
> artificial worst-case workloads.
>
>> As to waiting for ASN.1 capability - I can see this will make parsing
>> of live-migration messages much more reliable (ensuring that Qemu is
>> able to detect an incorrect protocol version) but I can't say I am
>> very happy waiting for 1.0 - are there any alternatives?
>>
>
> I don't think we should couple the two features together.

ASN.1 is orthogonal to capabilities.

Capabilities are a hard requirement before merging any new type of compression algorithm IMO.

Regards,

Anthony Liguori
On 08/10/2011 06:58 PM, Anthony Liguori wrote:
>> I don't think we should couple the two features together.
>
> ASN.1 is orthogonal to capabilities.
>
> Capabilities are a hard requirement before merging any new type of
> compression algorithm IMO.

Right now we have capabilities in the form of -help output.

If -help says

 -no-xzbrle   disable xzbrle support

(or -migration-compression xzbrle=off, or something) that's sufficient for management tools.

We shouldn't block this feature just because some monitor facility is not yet implemented.
On 08/10/2011 11:08 AM, Avi Kivity wrote:
> On 08/10/2011 06:58 PM, Anthony Liguori wrote:
>>> I don't think we should couple the two features together.
>>
>> ASN.1 is orthogonal to capabilities.
>>
>> Capabilities are a hard requirement before merging any new type of
>> compression algorithm IMO.
>
> Right now we have capabilities in the form of -help output.
>
> If -help says
>
>  -no-xzbrle   disable xzbrle support
>
> (or -migration-compression xzbrle=off, or something) that's sufficient
> for management tools.

This is static, not dynamic. You may attempt to migrate to another host that supports it and then migrate to a second host that doesn't support it after the first migration fails.

>
> We shouldn't block this feature just because some monitor facility is
> not yet implemented.

We shouldn't make *any* changes to the migration protocol before we have a feature negotiation capability. I only want to do a hard break of the protocol once.

Regards,

Anthony Liguori
On 08/10/2011 07:23 PM, Anthony Liguori wrote:
>> Right now we have capabilities in the form of -help output.
>>
>> If -help says
>>
>>  -no-xzbrle   disable xzbrle support
>>
>> (or -migration-compression xzbrle=off, or something) that's sufficient
>> for management tools.
>
> This is static, not dynamic. You may attempt to migrate to another
> host that supports it and then migrate to a second host that doesn't
> support it after the first migration fails.

This may be acceptable, wait until the entire migration cluster is xzbrle capable before enabling it. If not, add a monitor command.

>
>>
>> We shouldn't block this feature just because some monitor facility is
>> not yet implemented.
>
> We shouldn't make *any* changes to the migration protocol before we
> have a feature negotiation capability. I only want to do a hard break
> of the protocol once.

Didn't we agree that management tool mediated feature negotiation (that is, outside the migration protocol itself) is acceptable?
On 08/10/2011 11:40 AM, Avi Kivity wrote:
> On 08/10/2011 07:23 PM, Anthony Liguori wrote:
>>> Right now we have capabilities in the form of -help output.
>>>
>>> If -help says
>>>
>>>  -no-xzbrle   disable xzbrle support
>>>
>>> (or -migration-compression xzbrle=off, or something) that's sufficient
>>> for management tools.
>>
>>
>> This is static, not dynamic. You may attempt to migrate to another
>> host that supports it and then migrate to a second host that doesn't
>> support it after the first migration fails.
>
> This may be acceptable, wait until the entire migration cluster is
> xzbrle capable before enabling it. If not, add a monitor command.

1) xzbrle needs to be disabled by default. That way management tools don't unknowingly enable it by not passing -no-xzbrle.

2) there needs to be a mechanism for the management tool to query whether qemu supports xzbrle.

3) a management tool should be able to query the source and destination, and then enable xzbrle if both sides support it.

You can argue that (3) could be static. A command could be added to toggle it dynamically through the monitor.

But no matter what, someone has to touch libvirt and any other tool that works with QEMU to make this thing work. But this is a general problem. Any optional change to the migration protocol has exactly the same characteristics whether it's XZBRLE, XZBRLE v2 (if there is a v2), ASN.1, or any other form of compression that rolls around.

Instead of teaching management tools how to deal with all of these things, let's just fix this problem once. It just takes:

a) A query-migration-caps command that returns a dict with two lists of strings. Something like:

{ 'execute': 'query-migration-caps' }
{ 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }

b) A set-migration-caps command that takes a list of strings. It simply takes the intersection of the capabilities set with the argument and sets the current set to the result. Something like:

{ 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ] }}
{ 'return' : {} }

c) An internal interface to register a capability and an internal interface to check if a capability is currently enabled. The xzbrle code just needs to disable itself if the capability isn't set.

Then we teach libvirt (and other tools) to query the caps list on the source, set the destination, query the current set on the destination, and then set that set on the source.

As we introduce new things, like the next great compression protocol, or ASN.1, we don't need to touch libvirt again. libvirt can still know about the caps and selectively override QEMU if it's so inclined but it prevents us from reinventing the same mechanisms over and over again.

>>> We shouldn't block this feature just because some monitor facility is
>>> not yet implemented.
>>
>> We shouldn't make *any* changes to the migration protocol before we
>> have a feature negotiation capability. I only want to do a hard break
>> of the protocol once.
>
> Didn't we agree that management tool mediated feature negotiation (that
> is, outside the migration protocol itself) is acceptable?

Yes. But that negotiation needs to become part of the "protocol" for migration. In the absence of that negotiation, we need to use the wire protocol we use today. We cannot have ad-hoc feature negotiation for every change we make to the wire protocol.

Regards,

Anthony Liguori
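Point (c) could be as simple as the sketch below; the function names and the fixed-size table are invented for illustration. ram_save_block() would then ask migrate_capability_enabled("xbzrle") instead of consulting a one-off flag:

/* Sketch of the internal capability registry from point (c) above.
 * All names are hypothetical. */
#include <stdbool.h>
#include <string.h>

#define MAX_MIG_CAPS 16

static struct {
    const char *name;
    bool enabled;
} mig_caps[MAX_MIG_CAPS];
static int mig_num_caps;

void migrate_register_capability(const char *name)
{
    if (mig_num_caps < MAX_MIG_CAPS) {
        mig_caps[mig_num_caps].name = name;
        mig_caps[mig_num_caps].enabled = false;  /* off by default */
        mig_num_caps++;
    }
}

/* set-migration-caps would call this for each negotiated name. */
void migrate_enable_capability(const char *name)
{
    int i;

    for (i = 0; i < mig_num_caps; i++) {
        if (!strcmp(mig_caps[i].name, name)) {
            mig_caps[i].enabled = true;
        }
    }
}

bool migrate_capability_enabled(const char *name)
{
    int i;

    for (i = 0; i < mig_num_caps; i++) {
        if (!strcmp(mig_caps[i].name, name)) {
            return mig_caps[i].enabled;
        }
    }
    return false;
}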
> From: Anthony Liguori [mailto:anthony@codemonkey.ws]
> Sent: Wednesday, August 10, 2011 10:28 PM
> To: Avi Kivity
> Cc: Blue Swirl; Stefan Hajnoczi; Shribman, Aidan; qemu-devel
> Developers; libvir-list@redhat.com
> Subject: Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of
> large memory apps

> a) A query-migration-caps command that returns a dict with two lists of
> strings. Something like:
>
> { 'execute': 'query-migration-caps' }
> { 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }
>
> b) A set-migration-caps command that takes a list of strings. It simply
> takes the intersection of the capabilities set with the argument and
> sets the current set to the result. Something like:
>
> { 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ] }}
> { 'return' : {} }

We may want to further sub-divide capabilities into categories:

{ 'execute': 'query-migration-caps' }
{ 'return' :
    { 'encoding' : { 'current', 'asn.1', 'proto2', 'thrift', etc. } },
    { 'delta' : { 'xbzrle', 'xdelta', ...} },
    { 'compression' : { 'snappy', 'lzo' } } }

This would help libvirt/management to select features automatically or manually (via UI) without having to 'understand' any given capability's meaning.

> Yes. But that negotiation needs to become part of the "protocol" for
> migration. In the absence of that negotiation, we need to use the wire
> protocol we use today. We cannot have ad-hoc feature negotiation for
> every change we make to the wire protocol.

Agreed. Therefore caps plus xbzrle could be added before ASN.1/v1.0 without breaking anything, as long as, when 'set-migration-caps' is not issued, Qemu uses the current protocol.

Aidan
On 08/10/2011 10:27 PM, Anthony Liguori wrote:
>> This may be acceptable, wait until the entire migration cluster is
>> xzbrle capable before enabling it. If not, add a monitor command.
>
> 1) xzbrle needs to be disabled by default. That way management tools
> don't unknowingly enable it by not passing -no-xzbrle.

We could hook it to -M, though it's a bit gross.

Otherwise we need to document this clearly in the management tool author's guide.

>
> 3) a management tool should be able to query the source and
> destination, and then enable xzbrle if both sides support it.
>
> You can argue that (3) could be static. A command could be added to
> toggle it dynamically through the monitor.
>
> But no matter what, someone has to touch libvirt and any other tool
> that works with QEMU to make this thing work. But this is a general
> problem. Any optional change to the migration protocol has exactly
> the same characteristics whether it's XZBRLE, XZBRLE v2 (if there is a
> v2), ASN.1, or any other form of compression that rolls around.

If we have two-way communication we can do this transparently in the protocol itself.

>
> Instead of teaching management tools how to deal with all of these
> things, let's just fix this problem once. It just takes:
>
> a) A query-migration-caps command that returns a dict with two lists
> of strings. Something like:
>
> { 'execute': 'query-migration-caps' }
> { 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }
>
> b) A set-migration-caps command that takes a list of strings. It
> simply takes the intersection of the capabilities set with the
> argument and sets the current set to the result. Something like:
>
> { 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ] }}
> { 'return' : {} }
>
> c) An internal interface to register a capability and an internal
> interface to check if a capability is currently enabled. The xzbrle
> code just needs to disable itself if the capability isn't set.
>
> Then we teach libvirt (and other tools) to query the caps list on the
> source, set the destination, query the current set on the destination,
> and then set that set on the source.

This is only if the capability has no side effect.

>
> As we introduce new things, like the next great compression protocol,
> or ASN.1, we don't need to touch libvirt again. libvirt can still
> know about the caps and selectively override QEMU if it's so inclined
> but it prevents us from reinventing the same mechanisms over and over
> again.

Right.

>
> Yes. But that negotiation needs to become part of the "protocol" for
> migration. In the absence of that negotiation, we need to use the
> wire protocol we use today. We cannot have ad-hoc feature negotiation
> for every change we make to the wire protocol.

Okay, as long as we have someone willing to implement it.
On Thu, Aug 11, 2011 at 11:17:09AM +0300, Avi Kivity wrote:
> On 08/10/2011 10:27 PM, Anthony Liguori wrote:
> >>This may be acceptable, wait until the entire migration cluster is
> >>xzbrle capable before enabling it. If not, add a monitor command.
> >
> >
> >1) xzbrle needs to be disabled by default. That way management
> >tools don't unknowingly enable it by not passing -no-xzbrle.
>
> We could hook it to -M, though it's a bit gross.

That would needlessly prevent its use for any existing installed guests with an older machine type, which are running in a new QEMU.

Some kind of monitor capabilities seems good to me.

Daniel
On 08/11/2011 12:16 PM, Daniel P. Berrange wrote:
> On Thu, Aug 11, 2011 at 11:17:09AM +0300, Avi Kivity wrote:
> > On 08/10/2011 10:27 PM, Anthony Liguori wrote:
> > >>This may be acceptable, wait until the entire migration cluster is
> > >>xzbrle capable before enabling it. If not, add a monitor command.
> > >
> > >
> > >1) xzbrle needs to be disabled by default. That way management
> > >tools don't unknowingly enable it by not passing -no-xzbrle.
> >
> > We could hook it to -M, though it's a bit gross.
>
> That would needlessly prevent its use for any existing installed
> guests with an older machine type, which are running in a new QEMU.

You could still enable it explicitly; I'm just trying to get it to be enabled by default.

> Some kind of monitor capabilities seems good to me.
>

Live migration is probably mostly done in managed environments, so I think you're right.
On Wed, Aug 10, 2011 at 02:27:41PM -0500, Anthony Liguori wrote:
> On 08/10/2011 11:40 AM, Avi Kivity wrote:
> Instead of teaching management tools how to deal with all of these
> things, let's just fix this problem once. It just takes:
>
> a) A query-migration-caps command that returns a dict with two lists
> of strings. Something like:
>
> { 'execute': 'query-migration-caps' }
> { 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }
>
> b) A set-migration-caps command that takes a list of strings. It
> simply takes the intersection of the capabilities set with the
> argument and sets the current set to the result. Something like:
>
> { 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ] }}
> { 'return' : {} }

We have a number of commands to set migration parameters (bandwidth, max downtime, etc). One thing that has always troubled me with this, is that they are a global setting for QEMU, not simply the next migration command.

This is fine if you expect that only a single 'migrate' command will ever be invoked during the lifetime of QEMU, but this doesn't hold true when we use 'migrate' as a means to implement saving of snapshots/save-to-file, or "core" dumps.

For example, when libvirt does 'save to file', we want to set the max bandwidth to unlimited for that, but we don't want that 'unlimited' setting to apply to a future cross-host migration attempt. This means we have to send three commands

  migrate_set_speed 9223372036854775808
  migrate file:/some/file
  migrate_set_speed 33554432

It doubly sucks because there is no way to reset the migration speed to the QEMU default, so we have to hardcode (32 << 20), which is what QEMU currently uses.

It would be more desirable if we could simply pass in the desired speed, compression algorithm, max downtime, etc, as parameters to the 'migrate' command. And then have 'migrate_set_speed' only affect the current active migration, not any future ones.

So a 'query-migration-caps' command is nice, but I think having a set-migration-caps command is wrong. There should just be a 'caps' parameter for 'migrate'.

Regards,

Daniel
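What Daniel is asking for might look like the following sketch, where the tunables travel with the migrate invocation instead of mutating global state. The struct, its fields, and do_migrate()'s signature are all invented for illustration:

/* Sketch: per-invocation migration parameters (hypothetical API). */
#include <stdint.h>
#include <stdio.h>

typedef struct MigrateParams {
    int64_t  max_bandwidth;    /* bytes/sec; 0 means QEMU's default */
    int64_t  max_downtime_ns;  /* 0 means QEMU's default */
    uint32_t caps;             /* negotiated capability mask */
} MigrateParams;

/* Taking the params by value means a save-to-file invocation with
 * unlimited bandwidth cannot leak its settings into a later
 * cross-host migration. */
static int do_migrate(const char *uri, MigrateParams params)
{
    (void)params;
    printf("migrating to %s\n", uri);   /* stub */
    return 0;
}

int main(void)
{
    /* save-to-file: unlimited bandwidth, no set_speed dance needed */
    MigrateParams save = { .max_bandwidth = INT64_MAX };
    do_migrate("file:/some/file", save);

    /* later cross-host migration: untouched defaults */
    MigrateParams live = { 0 };
    return do_migrate("tcp:destination.host:4444", live);
}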
On 08/11/2011 03:03 AM, Shribman, Aidan wrote:
>> From: Anthony Liguori [mailto:anthony@codemonkey.ws]
>> Sent: Wednesday, August 10, 2011 10:28 PM
>> To: Avi Kivity
>> Cc: Blue Swirl; Stefan Hajnoczi; Shribman, Aidan; qemu-devel
>> Developers; libvir-list@redhat.com
>> Subject: Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of
>> large memory apps
>
>> a) A query-migration-caps command that returns a dict with two lists of
>> strings. Something like:
>>
>> { 'execute': 'query-migration-caps' }
>> { 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }
>>
>> b) A set-migration-caps command that takes a list of strings. It simply
>> takes the intersection of the capabilities set with the argument and
>> sets the current set to the result. Something like:
>>
>> { 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ] }}
>> { 'return' : {} }
>
> We may want to further sub-divide capabilities into categories:
> { 'execute': 'query-migration-caps' }
> { 'return' :
>     { 'encoding' : { 'current', 'asn.1', 'proto2', 'thrift', etc. } },
>     { 'delta' : { 'xbzrle', 'xdelta', ...} },
>     { 'compression' : { 'snappy', 'lzo' } } }
> This would help libvirt/management to select features automatically or
> manually (via UI) without having to 'understand' any given capability's
> meaning.

I would prefer caps to be mostly transparent to libvirt. In fact, I'd like to see exactly three caps: xbzrle, asn1, and autonegotiate. I'd like to move the caps negotiation into the protocol itself.

>> Yes. But that negotiation needs to become part of the "protocol" for
>> migration. In the absence of that negotiation, we need to use the wire
>> protocol we use today. We cannot have ad-hoc feature negotiation for
>> every change we make to the wire protocol.
>
> Agreed. Therefore caps plus xbzrle could be added before ASN.1/v1.0
> without breaking anything, as long as, when 'set-migration-caps' is not
> issued, Qemu uses the current protocol.

Exactly.

Regards,

Anthony Liguori

> Aidan
>
On 08/11/2011 03:17 AM, Avi Kivity wrote:
>> 3) a management tool should be able to query the source and
>> destination, and then enable xzbrle if both sides support it.
>>
>> You can argue that (3) could be static. A command could be added to
>> toggle it dynamically through the monitor.
>>
>> But no matter what, someone has to touch libvirt and any other tool
>> that works with QEMU to make this thing work. But this is a general
>> problem. Any optional change to the migration protocol has exactly the
>> same characteristics whether it's XZBRLE, XZBRLE v2 (if there is a
>> v2), ASN.1, or any other form of compression that rolls around.
>
> If we have two-way communication we can do this transparently in the
> protocol itself.

Yes. This should be one of the initial caps to introduce.

>> Instead of teaching management tools how to deal with all of these
>> things, let's just fix this problem once. It just takes:
>>
>> a) A query-migration-caps command that returns a dict with two lists
>> of strings. Something like:
>>
>> { 'execute': 'query-migration-caps' }
>> { 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }
>>
>> b) A set-migration-caps command that takes a list of strings. It
>> simply takes the intersection of the capabilities set with the
>> argument and sets the current set to the result. Something like:
>>
>> { 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ] }}
>> { 'return' : {} }
>>
>> c) An internal interface to register a capability and an internal
>> interface to check if a capability is currently enabled. The xzbrle
>> code just needs to disable itself if the capability isn't set.
>>
>> Then we teach libvirt (and other tools) to query the caps list on the
>> source, set the destination, query the current set on the destination,
>> and then set that set on the source.
>
> This is only if the capability has no side effect.

Right, it can't change the output of any monitor commands or anything like that. It's strictly about the encoding of the wire protocol which ought to be transparent to libvirt.

>> As we introduce new things, like the next great compression protocol,
>> or ASN.1, we don't need to touch libvirt again. libvirt can still know
>> about the caps and selectively override QEMU if it's so inclined but
>> it prevents us from reinventing the same mechanisms over and over again.
>
> Right.
>
>>
>> Yes. But that negotiation needs to become part of the "protocol" for
>> migration. In the absence of that negotiation, we need to use the wire
>> protocol we use today. We cannot have ad-hoc feature negotiation for
>> every change we make to the wire protocol.
>
> Okay, as long as we have someone willing to implement it.

Sounds like a good hackathon project :-)

Regards,

Anthony Liguori
diff --git a/Makefile.target b/Makefile.target index 2800f47..b3215de 100644 --- a/Makefile.target +++ b/Makefile.target @@ -186,6 +186,7 @@ endif #CONFIG_BSD_USER ifdef CONFIG_SOFTMMU obj-y = arch_init.o cpus.o monitor.o machine.o gdbstub.o balloon.o +obj-y += lru.o xbzrle.o # virtio has to be here due to weird dependency between PCI and virtio-net. # need to fix this properly obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-serial-bus.o diff --git a/arch_init.c b/arch_init.c old mode 100644 new mode 100755 index 4486925..d67dc82 --- a/arch_init.c +++ b/arch_init.c @@ -40,6 +40,17 @@ #include "net.h" #include "gdbstub.h" #include "hw/smbios.h" +#include "lru.h" +#include "xbzrle.h" + +//#define DEBUG_ARCH_INIT +#ifdef DEBUG_ARCH_INIT +#define DPRINTF(fmt, ...) \ + do { fprintf(stdout, "arch_init: " fmt, ## __VA_ARGS__); } while (0) +#else +#define DPRINTF(fmt, ...) \ + do { } while (0) +#endif #ifdef TARGET_SPARC int graphic_width = 1024; @@ -88,6 +99,161 @@ const uint32_t arch_type = QEMU_ARCH; #define RAM_SAVE_FLAG_PAGE 0x08 #define RAM_SAVE_FLAG_EOS 0x10 #define RAM_SAVE_FLAG_CONTINUE 0x20 +#define RAM_SAVE_FLAG_XBZRLE 0x40 + +/***********************************************************/ +/* RAM Migration State */ +typedef struct ArchMigrationState { + int use_xbrle; + int64_t xbrle_cache_size; +} ArchMigrationState; + +static ArchMigrationState arch_mig_state; + +void arch_set_params(int blk_enable, int shared_base, int use_xbrle, + int64_t xbrle_cache_size, void *opaque) +{ + arch_mig_state.use_xbrle = use_xbrle; + arch_mig_state.xbrle_cache_size = xbrle_cache_size; +} + +#define BE16_MAGIC 0x0123 + +/***********************************************************/ +/* XBZRLE (Xor Binary Zero Run-Length Encoding) */ +typedef struct XBZRLEHeader { + uint32_t xh_cksum; /* not used */ + uint16_t xh_magic; + uint16_t xh_len; + uint8_t xh_flags; +} XBZRLEHeader; + +static uint8_t dup_buf[TARGET_PAGE_SIZE]; + +/***********************************************************/ +/* accounting */ +typedef struct AccountingInfo{ + uint64_t dup_pages; + uint64_t norm_pages; + uint64_t xbrle_bytes; + uint64_t xbrle_pages; + uint64_t xbrle_overflow; + uint64_t xbrle_cache_lookup; + uint64_t xbrle_cache_hit; + uint64_t iterations; +} AccountingInfo; + +static AccountingInfo acct_info; + +static void acct_clear(void) +{ + memset(&acct_info, 0, sizeof(acct_info)); +} + +uint64_t dup_mig_bytes_transferred(void) +{ + return acct_info.dup_pages; +} + +uint64_t dup_mig_pages_transferred(void) +{ + return acct_info.dup_pages; +} + +uint64_t norm_mig_bytes_transferred(void) +{ + return acct_info.norm_pages * TARGET_PAGE_SIZE; +} + +uint64_t norm_mig_pages_transferred(void) +{ + return acct_info.norm_pages; +} + +uint64_t xbrle_mig_bytes_transferred(void) +{ + return acct_info.xbrle_bytes; +} + +uint64_t xbrle_mig_pages_transferred(void) +{ + return acct_info.xbrle_pages; +} + +uint64_t xbrle_mig_pages_overflow(void) +{ + return acct_info.xbrle_overflow; +} + +uint64_t xbrle_mig_pages_cache_hit(void) +{ + return acct_info.xbrle_cache_hit; +} + +uint64_t xbrle_mig_pages_cache_lookup(void) +{ + return acct_info.xbrle_cache_lookup; +} + +static void save_block_hdr(QEMUFile *f, RAMBlock *block, ram_addr_t offset, + int cont, int flag) +{ + qemu_put_be64(f, offset | cont | flag); + if (!cont) { + qemu_put_byte(f, strlen(block->idstr)); + qemu_put_buffer(f, (uint8_t *)block->idstr, + strlen(block->idstr)); + } +} + +#define ENCODING_FLAG_XBZRLE 0x1 + +static int save_xbrle_page(QEMUFile *f, uint8_t 
*current_page, + ram_addr_t current_addr, RAMBlock *block, ram_addr_t offset, int cont) +{ + int encoded_len = 0, bytes_sent = 0; + XBZRLEHeader hdr = {0, BE16_MAGIC}; + uint8_t *encoded, *old_page; + + /* abort if page not cached */ + acct_info.xbrle_cache_lookup++; + old_page = lru_lookup(current_addr); + if (!old_page) { + goto done; + } + acct_info.xbrle_cache_hit++; + + /* XBZRLE (XOR+ZRLE) encoding */ + encoded = (uint8_t *) qemu_malloc(TARGET_PAGE_SIZE); + encoded_len = xbzrle_encode(encoded, old_page, current_page, + TARGET_PAGE_SIZE); + + if (encoded_len < 0) { + DPRINTF("XBZRLE encoding overflow - sending uncompressed\n"); + acct_info.xbrle_overflow++; + goto done; + } + + hdr.xh_len = encoded_len; + hdr.xh_flags |= ENCODING_FLAG_XBZRLE; + + /* Send XBZRLE compressed page */ + save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_XBZRLE); + + qemu_put_be32(f, hdr.xh_cksum); + qemu_put_buffer(f, (uint8_t *) &hdr.xh_magic, sizeof (hdr.xh_magic)); + qemu_put_be16(f, hdr.xh_len); + qemu_put_byte(f, hdr.xh_flags); + + qemu_put_buffer(f, encoded, encoded_len); + acct_info.xbrle_pages++; + bytes_sent = encoded_len + sizeof(hdr); + acct_info.xbrle_bytes += bytes_sent; + +done: + qemu_free(encoded); + return bytes_sent; +} static int is_dup_page(uint8_t *page, uint8_t ch) { @@ -107,7 +273,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch) static RAMBlock *last_block; static ram_addr_t last_offset; -static int ram_save_block(QEMUFile *f) +static int ram_save_block(QEMUFile *f, int stage) { RAMBlock *block = last_block; ram_addr_t offset = last_offset; @@ -120,6 +286,7 @@ static int ram_save_block(QEMUFile *f) current_addr = block->offset + offset; do { + lru_free_cb_t free_cb = qemu_free; if (cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) { uint8_t *p; int cont = (block == last_block) ? 
RAM_SAVE_FLAG_CONTINUE : 0; @@ -128,28 +295,35 @@ static int ram_save_block(QEMUFile *f) current_addr + TARGET_PAGE_SIZE, MIGRATION_DIRTY_FLAG); - p = block->host + offset; + if (arch_mig_state.use_xbrle) { + p = qemu_malloc(TARGET_PAGE_SIZE); + memcpy(p, block->host + offset, TARGET_PAGE_SIZE); + } else { + p = block->host + offset; + } if (is_dup_page(p, *p)) { - qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_COMPRESS); - if (!cont) { - qemu_put_byte(f, strlen(block->idstr)); - qemu_put_buffer(f, (uint8_t *)block->idstr, - strlen(block->idstr)); - } + save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_COMPRESS); qemu_put_byte(f, *p); bytes_sent = 1; - } else { - qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_PAGE); - if (!cont) { - qemu_put_byte(f, strlen(block->idstr)); - qemu_put_buffer(f, (uint8_t *)block->idstr, - strlen(block->idstr)); + acct_info.dup_pages++; + if (arch_mig_state.use_xbrle && !*p) { + p = dup_buf; + free_cb = NULL; } + } else if (stage == 2 && arch_mig_state.use_xbrle) { + bytes_sent = save_xbrle_page(f, p, current_addr, block, + offset, cont); + } + if (!bytes_sent) { + save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE); qemu_put_buffer(f, p, TARGET_PAGE_SIZE); bytes_sent = TARGET_PAGE_SIZE; + acct_info.norm_pages++; + } + if (arch_mig_state.use_xbrle) { + lru_insert(current_addr, p, free_cb); } - break; } @@ -221,6 +395,9 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) if (stage < 0) { cpu_physical_memory_set_dirty_tracking(0); + if (arch_mig_state.use_xbrle) { + lru_fini(); + } return 0; } @@ -235,6 +412,11 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) last_block = NULL; last_offset = 0; + if (arch_mig_state.use_xbrle) { + lru_init(arch_mig_state.xbrle_cache_size/TARGET_PAGE_SIZE, 0); + acct_clear(); + } + /* Make sure all dirty bits are set */ QLIST_FOREACH(block, &ram_list.blocks, next) { for (addr = block->offset; addr < block->offset + block->length; @@ -264,8 +446,9 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) while (!qemu_file_rate_limit(f)) { int bytes_sent; - bytes_sent = ram_save_block(f); + bytes_sent = ram_save_block(f, stage); bytes_transferred += bytes_sent; + acct_info.iterations++; if (bytes_sent == 0) { /* no more blocks */ break; } @@ -285,19 +468,79 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque) int bytes_sent; /* flush all remaining blocks regardless of rate limiting */ - while ((bytes_sent = ram_save_block(f)) != 0) { + while ((bytes_sent = ram_save_block(f, stage))) { bytes_transferred += bytes_sent; } cpu_physical_memory_set_dirty_tracking(0); + if (arch_mig_state.use_xbrle) { + lru_fini(); + } } qemu_put_be64(f, RAM_SAVE_FLAG_EOS); expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth; + DPRINTF("ram_save_live: expected(%ld) <= max(%ld)?\n", expected_time, + migrate_max_downtime()); + return (stage == 2) && (expected_time <= migrate_max_downtime()); } +static int load_xbrle(QEMUFile *f, ram_addr_t addr, void *host) +{ + int len, rc = -1; + uint8_t *encoded; + XBZRLEHeader hdr = {0}; + + /* extract ZRLE header */ + hdr.xh_cksum = qemu_get_be32(f); + qemu_get_buffer(f, (uint8_t *) &hdr.xh_magic, sizeof (hdr.xh_magic)); + hdr.xh_len = qemu_get_be16(f); + hdr.xh_flags = qemu_get_byte(f); + + if (!(hdr.xh_flags & ENCODING_FLAG_XBZRLE)) { + fprintf(stderr, "Failed to load XZBRLE page - wrong compression!\n"); + goto done; + } + + if (hdr.xh_len > TARGET_PAGE_SIZE) { + fprintf(stderr, "Failed to load XZBRLE page - len 
overflow!\n"); + goto done; + } + + /* load data and decode */ + encoded = (uint8_t *) qemu_malloc(hdr.xh_len); + qemu_get_buffer(f, encoded, hdr.xh_len); + /* covert endianess if magic indicated destination differs from source */ + if (hdr.xh_magic != BE16_MAGIC) { + const uint64_t *end = (uint64_t *) encoded + + hdr.xh_len / sizeof (uint64_t); + uint64_t *p; + for (p = (uint64_t *) encoded; p < end; p++) { + bswap64s(p); + } + } + + /* decode ZRLE */ + len = xbzrle_decode(host, host, encoded, hdr.xh_len); + if (len == -1) { + fprintf(stderr, "Failed to load XBZRLE page - decode error!\n"); + goto done; + } + + if (len != TARGET_PAGE_SIZE) { + fprintf(stderr, "Failed to load XBZRLE page - size %d expected %d!\n", + len, TARGET_PAGE_SIZE); + goto done; + } + + rc = 0; +done: + qemu_free(encoded); + return rc; +} + static inline void *host_from_stream_offset(QEMUFile *f, ram_addr_t offset, int flags) @@ -328,16 +571,38 @@ static inline void *host_from_stream_offset(QEMUFile *f, return NULL; } +static inline void *host_from_stream_offset_versioned(int version_id, + QEMUFile *f, ram_addr_t offset, int flags) +{ + void *host; + if (version_id == 3) { + host = qemu_get_ram_ptr(offset); + } else { + host = host_from_stream_offset(f, offset, flags); + } + if (!host) { + fprintf(stderr, "Failed to convert RAM address to host" + " for offset 0x%lX!\n", offset); + abort(); + } + return host; +} + int ram_load(QEMUFile *f, void *opaque, int version_id) { ram_addr_t addr; - int flags; + int flags, ret = 0; + static uint64_t seq_iter; + + seq_iter++; if (version_id < 3 || version_id > 4) { - return -EINVAL; + ret = -EINVAL; + goto done; } do { + void *host; addr = qemu_get_be64(f); flags = addr & ~TARGET_PAGE_MASK; @@ -346,7 +611,8 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) if (flags & RAM_SAVE_FLAG_MEM_SIZE) { if (version_id == 3) { if (addr != ram_bytes_total()) { - return -EINVAL; + ret = -EINVAL; + goto done; } } else { /* Synchronize RAM block list */ @@ -365,8 +631,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) QLIST_FOREACH(block, &ram_list.blocks, next) { if (!strncmp(id, block->idstr, sizeof(id))) { - if (block->length != length) - return -EINVAL; + if (block->length != length) { + ret = -EINVAL; + goto done; + } break; } } @@ -374,7 +642,8 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) if (!block) { fprintf(stderr, "Unknown ramblock \"%s\", cannot " "accept migration\n", id); - return -EINVAL; + ret = -EINVAL; + goto done; } total_ram_bytes -= length; @@ -383,17 +652,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) } if (flags & RAM_SAVE_FLAG_COMPRESS) { - void *host; uint8_t ch; - if (version_id == 3) - host = qemu_get_ram_ptr(addr); - else - host = host_from_stream_offset(f, addr, flags); - if (!host) { - return -EINVAL; - } - + host = host_from_stream_offset_versioned(version_id, + f, addr, flags); ch = qemu_get_byte(f); memset(host, ch, TARGET_PAGE_SIZE); #ifndef _WIN32 @@ -403,21 +665,28 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) } #endif } else if (flags & RAM_SAVE_FLAG_PAGE) { - void *host; - - if (version_id == 3) - host = qemu_get_ram_ptr(addr); - else - host = host_from_stream_offset(f, addr, flags); - + host = host_from_stream_offset_versioned(version_id, + f, addr, flags); qemu_get_buffer(f, host, TARGET_PAGE_SIZE); + } else if (flags & RAM_SAVE_FLAG_XBZRLE) { + host = host_from_stream_offset_versioned(version_id, + f, addr, flags); + if (load_xbrle(f, addr, host) < 0) { + ret = -EINVAL; + goto done; + } } + 
if (qemu_file_has_error(f)) { - return -EIO; + ret = -EIO; + goto done; } } while (!(flags & RAM_SAVE_FLAG_EOS)); - return 0; +done: + DPRINTF("Completed load of VM with exit code %d seq iteration %ld\n", + ret, seq_iter); + return ret; } void qemu_service_io(void) diff --git a/block-migration.c b/block-migration.c index 3e66f49..504df70 100644 --- a/block-migration.c +++ b/block-migration.c @@ -689,7 +689,8 @@ static int block_load(QEMUFile *f, void *opaque, int version_id) return 0; } -static void block_set_params(int blk_enable, int shared_base, void *opaque) +static void block_set_params(int blk_enable, int shared_base, + int use_xbrle, int64_t xbrle_cache_size, void *opaque) { block_mig_state.blk_enable = blk_enable; block_mig_state.shared_base = shared_base; diff --git a/hash.h b/hash.h new file mode 100644 index 0000000..7109905 --- /dev/null +++ b/hash.h @@ -0,0 +1,72 @@ +#ifndef _LINUX_HASH_H +#define _LINUX_HASH_H +/* Fast hashing routine for ints, longs and pointers. + (C) 2002 William Lee Irwin III, IBM */ + +/* + * Knuth recommends primes in approximately golden ratio to the maximum + * integer representable by a machine word for multiplicative hashing. + * Chuck Lever verified the effectiveness of this technique: + * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf + * + * These primes are chosen to be bit-sparse, that is operations on + * them can use shifts and additions instead of multiplications for + * machines where multiplications are slow. + */ + +typedef uint64_t u64; +typedef uint32_t u32; +#define BITS_PER_LONG TARGET_LONG_BITS + +/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */ +#define GOLDEN_RATIO_PRIME_32 0x9e370001UL +/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */ +#define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL + +#if BITS_PER_LONG == 32 +#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_32 +#define hash_long(val, bits) hash_32(val, bits) +#elif BITS_PER_LONG == 64 +#define hash_long(val, bits) hash_64(val, bits) +#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_64 +#else +#error Wordsize not 32 or 64 +#endif + +static inline u64 hash_64(u64 val, unsigned int bits) +{ + u64 hash = val; + + /* Sigh, gcc can't optimise this alone like it does for 32 bits. */ + u64 n = hash; + n <<= 18; + hash -= n; + n <<= 33; + hash -= n; + n <<= 3; + hash += n; + n <<= 3; + hash -= n; + n <<= 4; + hash += n; + n <<= 2; + hash += n; + + /* High bits are more random, so use them. */ + return hash >> (64 - bits); +} + +static inline u32 hash_32(u32 val, unsigned int bits) +{ + /* On some cpus multiply is faster, on others gcc will do shifts */ + u32 hash = val * GOLDEN_RATIO_PRIME_32; + + /* High bits are more random, so use them. 
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> old mode 100644
> new mode 100755
> index e5585ba..e49d5be
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -717,24 +717,27 @@ ETEXI
>     {
>         .name       = "migrate",
> -        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
> -        .params     = "[-d] [-b] [-i] uri",
> -        .help       = "migrate to URI (using -d to not wait for completion)"
> -                      "\n\t\t\t -b for migration without shared storage with"
> -                      " full copy of disk\n\t\t\t -i for migration without "
> -                      "shared storage with incremental copy of disk "
> -                      "(base image shared between src and destination)",
> +        .args_type  = "detach:-d,blk:-b,inc:-i,xbrle:-x,uri:s",
> +        .params     = "[-d] [-b] [-i] [-x] uri",
> +        .help       = "migrate to URI"
> +                      "\n\t -d to not wait for completion"
> +                      "\n\t -b for migration without shared storage with"
> +                      " full copy of disk"
> +                      "\n\t -i for migration without"
> +                      " shared storage with incremental copy of disk"
> +                      " (base image shared between source and destination)"
> +                      "\n\t -x to use XBRLE page delta compression",
>         .user_print = monitor_user_noop,
>         .mhandler.cmd_new = do_migrate,
>     },
> -
> STEXI
> -@item migrate [-d] [-b] [-i] @var{uri}
> +@item migrate [-d] [-b] [-i] [-x] @var{uri}
> @findex migrate
> Migrate to @var{uri} (using -d to not wait for completion).
>  -b for migration with full copy of disk
>  -i for migration with incremental copy of disk (base image is shared)
> + -x to use XBRLE page delta compression
> ETEXI
>
>     {
> @@ -753,10 +756,23 @@
> Cancel the current VM migration.
> ETEXI
>
>     {
> +        .name       = "migrate_set_cachesize",
> +        .args_type  = "value:s",
> +        .params     = "value",
> +        .help       = "set cache size (in MB) for XBRLE migrations",
> +        .mhandler.cmd = do_migrate_set_cachesize,
> +    },
> +
> +STEXI
> +@item migrate_set_cachesize @var{value}
> +Set cache size (in MB) for XBRLE migrations.
> +ETEXI
> +
> +    {
>         .name       = "migrate_set_speed",
>         .args_type  = "value:o",
>         .params     = "value",
>         .help       = "set maximum speed (in bytes) for migrations. "
>                       "Defaults to MB if no size suffix is specified, ie. B/K/M/G/T",
>         .user_print = monitor_user_noop,
>         .mhandler.cmd_new = do_migrate_set_speed,
> diff --git a/hw/hw.h b/hw/hw.h
> index 9d2cfc2..aa336ec 100644
> --- a/hw/hw.h
> +++ b/hw/hw.h
> @@ -239,7 +239,8 @@ static inline void qemu_get_sbe64s(QEMUFile *f, int64_t *pv)
> int64_t qemu_ftell(QEMUFile *f);
> int64_t qemu_fseek(QEMUFile *f, int64_t pos, int whence);
>
> -typedef void SaveSetParamsHandler(int blk_enable, int shared, void * opaque);
> +typedef void SaveSetParamsHandler(int blk_enable, int shared,
> +                                  int use_xbrle, int64_t xbrle_cache_size, void *opaque);
> typedef void SaveStateHandler(QEMUFile *f, void *opaque);
> typedef int SaveLiveStateHandler(Monitor *mon, QEMUFile *f, int stage,
>                                  void *opaque);
> diff --git a/lru.c b/lru.c
> new file mode 100644
> index 0000000..e7230d0
> --- /dev/null
> +++ b/lru.c
> @@ -0,0 +1,142 @@
> +#include <assert.h>
> +#include <math.h>
> +#include "qemu-common.h"
> +#include "qemu-queue.h"
> +#include "host-utils.h"
> +#include "lru.h"
> +#include "hash.h"
> +
> +typedef struct CacheItem {
> +    ram_addr_t it_addr;
> +    uint8_t *it_data;
> +    lru_free_cb_t it_free;
> +    QCIRCLEQ_ENTRY(CacheItem) it_lru_next;
> +    QCIRCLEQ_ENTRY(CacheItem) it_bucket_next;
> +} CacheItem;
> +
> +typedef QCIRCLEQ_HEAD(, CacheItem) CacheBucket;
> +static CacheBucket *page_hash;
> +static int64_t cache_table_size;
> +static uint64_t cache_max_items;
> +static int64_t cache_num_items;
> +static uint8_t cache_hash_bits;
> +
> +static QCIRCLEQ_HEAD(page_lru, CacheItem) page_lru;
> +
> +static uint64_t next_pow_of_2(uint64_t v)
> +{
> +    v--;
> +    v |= v >> 1;
> +    v |= v >> 2;
> +    v |= v >> 4;
> +    v |= v >> 8;
> +    v |= v >> 16;
> +    v |= v >> 32;
> +    v++;
> +    return v;
> +}
> +
> +void lru_init(int64_t max_items, void *param)
> +{
> +    int i;
> +
> +    cache_num_items = 0;
> +    cache_max_items = max_items;
> +    /* add 20% to table size to reduce collisions */
> +    cache_table_size = next_pow_of_2(1.2 * max_items);
> +    cache_hash_bits = ctz64(cache_table_size) - 1;
> +
> +    QCIRCLEQ_INIT(&page_lru);
> +
> +    page_hash = qemu_mallocz(sizeof(CacheBucket) * cache_table_size);
> +    assert(page_hash);
> +    for (i = 0; i < cache_table_size; i++) {
> +        QCIRCLEQ_INIT(&page_hash[i]);
> +    }
> +}
> +
> +static CacheBucket *page_bucket_list(ram_addr_t addr)
> +{
> +    return &page_hash[hash_long(addr, cache_hash_bits)];
> +}
> +
> +static void do_lru_remove(CacheItem *it)
> +{
> +    assert(it);
> +
> +    QCIRCLEQ_REMOVE(&page_lru, it, it_lru_next);
> +    QCIRCLEQ_REMOVE(page_bucket_list(it->it_addr), it, it_bucket_next);
> +    if (it->it_free) {
> +        (*it->it_free)(it->it_data);
> +    }
> +    qemu_free(it);
> +    cache_num_items--;
> +}
> +
> +static int do_lru_remove_first(void)
> +{
> +    CacheItem *first;
> +
> +    if (QCIRCLEQ_EMPTY(&page_lru)) {
> +        return -1;
> +    }
> +    first = QCIRCLEQ_FIRST(&page_lru);
> +    do_lru_remove(first);
> +    return 0;
> +}
> +
> +void lru_fini(void)
> +{
> +    while (!do_lru_remove_first()) {
> +    }
> +    qemu_free(page_hash);
> +}
> +
> +static CacheItem *do_lru_lookup(ram_addr_t addr)
> +{
> +    CacheBucket *head = page_bucket_list(addr);
> +    CacheItem *it;
> +
> +    if (QCIRCLEQ_EMPTY(head)) {
> +        return NULL;
> +    }
> +    QCIRCLEQ_FOREACH(it, head, it_bucket_next) {
> +        if (addr == it->it_addr) {
> +            return it;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +uint8_t *lru_lookup(ram_addr_t addr)
> +{
> +    CacheItem *it = do_lru_lookup(addr);
> +    return it ? it->it_data : NULL;
> +}
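To make the cache contract concrete, here is a sketch (mine, not patch
code) of how a sender-side caller would drive this API, which is
essentially what the save path described at the top of the mail does;
qemu_free() doubles as the eviction destructor:

    #include <string.h>
    #include "qemu-common.h"   /* qemu_malloc(), qemu_free() */
    #include "lru.h"

    /* Look up the previous content of a page for delta encoding, then
     * cache the current content.  Assumes lru_init() has been called. */
    static void remember_page(ram_addr_t addr, const uint8_t *page,
                              size_t size)
    {
        uint8_t *prev = lru_lookup(addr);
        uint8_t *copy;

        if (prev) {
            /* cache hit: XBZRLE-encode 'page' against 'prev' here,
             * since 'prev' is what the destination already holds */
        }

        copy = qemu_malloc(size);
        memcpy(copy, page, size);
        /* replaces any previous entry for addr; silently evicts the
         * least recently used page when the cache is full */
        lru_insert(addr, copy, qemu_free);
    }
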
> +
> +void lru_insert(ram_addr_t addr, uint8_t *data, lru_free_cb_t free_cb)
> +{
> +    CacheItem *it;
> +
> +    /* remove any existing entry for this address */
> +    it = do_lru_lookup(addr);
> +    if (it) {
> +        do_lru_remove(it);
> +    }
> +
> +    /* evict the least recently used entry if we need the space */
> +    if (cache_num_items == cache_max_items) {
> +        do_lru_remove_first();
> +    }
> +
> +    /* add new entry */
> +    it = qemu_mallocz(sizeof(*it));
> +    it->it_addr = addr;
> +    it->it_data = data;
> +    it->it_free = free_cb;
> +    QCIRCLEQ_INSERT_HEAD(page_bucket_list(addr), it, it_bucket_next);
> +    QCIRCLEQ_INSERT_TAIL(&page_lru, it, it_lru_next);
> +    cache_num_items++;
> +}
> diff --git a/lru.h b/lru.h
> new file mode 100644
> index 0000000..6c70095
> --- /dev/null
> +++ b/lru.h
> @@ -0,0 +1,13 @@
> +#ifndef _LRU_H_
> +#define _LRU_H_
> +
> +#include <unistd.h>
> +#include <stdint.h>
> +#include "cpu-all.h"
> +
> +typedef void (*lru_free_cb_t)(void *);
> +
> +void lru_init(int64_t num_items, void *param);
> +void lru_fini(void);
> +void lru_insert(ram_addr_t id, uint8_t *pdata, lru_free_cb_t free_cb);
> +uint8_t *lru_lookup(ram_addr_t addr);
> +#endif
> diff --git a/migration-exec.c b/migration-exec.c
> index 14718dd..fe8254a 100644
> --- a/migration-exec.c
> +++ b/migration-exec.c
> @@ -67,7 +67,9 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon,
>                                               int64_t bandwidth_limit,
>                                               int detach,
>                                               int blk,
> -                                              int inc)
> +                                              int inc,
> +                                              int use_xbrle,
> +                                              int64_t xbrle_cache_size)
> {
>     FdMigrationState *s;
>     FILE *f;
> @@ -99,6 +101,8 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon,
>
>     s->mig_state.blk = blk;
>     s->mig_state.shared = inc;
> +    s->mig_state.use_xbrle = use_xbrle;
> +    s->mig_state.xbrle_cache_size = xbrle_cache_size;
>
>     s->state = MIG_STATE_ACTIVE;
>     s->mon = NULL;
> diff --git a/migration-fd.c b/migration-fd.c
> index 6d14505..4a1ddbd 100644
> --- a/migration-fd.c
> +++ b/migration-fd.c
> @@ -56,7 +56,9 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon,
>                                             int64_t bandwidth_limit,
>                                             int detach,
>                                             int blk,
> -                                            int inc)
> +                                            int inc,
> +                                            int use_xbrle,
> +                                            int64_t xbrle_cache_size)
> {
>     FdMigrationState *s;
>
> @@ -82,6 +84,8 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon,
>
>     s->mig_state.blk = blk;
>     s->mig_state.shared = inc;
> +    s->mig_state.use_xbrle = use_xbrle;
> +    s->mig_state.xbrle_cache_size = xbrle_cache_size;
>
>     s->state = MIG_STATE_ACTIVE;
>     s->mon = NULL;
> diff --git a/migration-tcp.c b/migration-tcp.c
> index b55f419..4ca5bf6 100644
> --- a/migration-tcp.c
> +++ b/migration-tcp.c
> @@ -81,7 +81,9 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
>                                              int64_t bandwidth_limit,
>                                              int detach,
>                                              int blk,
> -                                             int inc)
> +                                             int inc,
> +                                             int use_xbrle,
> +                                             int64_t xbrle_cache_size)
> {
>     struct sockaddr_in addr;
>     FdMigrationState *s;
> @@ -101,6 +103,8 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
>
>     s->mig_state.blk = blk;
>     s->mig_state.shared = inc;
> +    s->mig_state.use_xbrle = use_xbrle;
> +    s->mig_state.xbrle_cache_size = xbrle_cache_size;
>
>     s->state = MIG_STATE_ACTIVE;
>     s->mon = NULL;
> diff --git a/migration-unix.c b/migration-unix.c
> index 57232c0..0813902 100644
> --- a/migration-unix.c
> +++ b/migration-unix.c
> @@ -80,7 +80,9 @@ MigrationState *unix_start_outgoing_migration(Monitor *mon,
>                                               int64_t bandwidth_limit,
>                                               int detach,
>                                               int blk,
> -                                              int inc)
> +                                              int inc,
> +                                              int use_xbrle,
> +                                              int64_t xbrle_cache_size)
> {
>     FdMigrationState *s;
>     struct sockaddr_un addr;
> @@ -100,6 +102,8 @@ MigrationState *unix_start_outgoing_migration(Monitor *mon,
>
>     s->mig_state.blk = blk;
>     s->mig_state.shared = inc;
> +    s->mig_state.use_xbrle = use_xbrle;
> +    s->mig_state.xbrle_cache_size = xbrle_cache_size;
>
>     s->state = MIG_STATE_ACTIVE;
>     s->mon = NULL;
> diff --git a/migration.c b/migration.c
> old mode 100644
> new mode 100755
> index 9ee8b17..ccacf81
> --- a/migration.c
> +++ b/migration.c
> @@ -34,6 +34,11 @@
> /* Migration speed throttling */
> static uint32_t max_throttle = (32 << 20);
>
> +/* Migration XBRLE cache size */
> +#define DEFAULT_MIGRATE_CACHE_SIZE (64 * 1024 * 1024)
> +
> +static int64_t migrate_cache_size = DEFAULT_MIGRATE_CACHE_SIZE;
> +
> static MigrationState *current_migration;
>
> int qemu_start_incoming_migration(const char *uri)
> @@ -80,6 +85,7 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
>     int detach = qdict_get_try_bool(qdict, "detach", 0);
>     int blk = qdict_get_try_bool(qdict, "blk", 0);
>     int inc = qdict_get_try_bool(qdict, "inc", 0);
> +    int use_xbrle = qdict_get_try_bool(qdict, "xbrle", 0);
>     const char *uri = qdict_get_str(qdict, "uri");
>
>     if (current_migration &&
> @@ -90,17 +96,21 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
>
>     if (strstart(uri, "tcp:", &p)) {
>         s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
> -                                         blk, inc);
> +                                         blk, inc, use_xbrle,
> +                                         migrate_cache_size);
> #if !defined(WIN32)
>     } else if (strstart(uri, "exec:", &p)) {
>         s = exec_start_outgoing_migration(mon, p, max_throttle, detach,
> -                                          blk, inc);
> +                                          blk, inc, use_xbrle,
> +                                          migrate_cache_size);
>     } else if (strstart(uri, "unix:", &p)) {
>         s = unix_start_outgoing_migration(mon, p, max_throttle, detach,
> -                                          blk, inc);
> +                                          blk, inc, use_xbrle,
> +                                          migrate_cache_size);
>     } else if (strstart(uri, "fd:", &p)) {
>         s = fd_start_outgoing_migration(mon, p, max_throttle, detach,
> -                                        blk, inc);
> +                                        blk, inc, use_xbrle,
> +                                        migrate_cache_size);
> #endif
>     } else {
>         monitor_printf(mon, "unknown migration protocol: %s\n", uri);
> @@ -185,6 +195,36 @@ static void migrate_print_status(Monitor *mon, const char *name,
>                        qdict_get_int(qdict, "total") >> 10);
> }
>
> +static void migrate_print_ram_status(Monitor *mon, const char *name,
> +                                     const QDict *status_dict)
> +{
> +    QDict *qdict;
> +    uint64_t overflow, cache_hit, cache_lookup;
> +
> +    qdict = qobject_to_qdict(qdict_get(status_dict, name));
> +
> +    monitor_printf(mon, "transferred %s: %" PRIu64 " kbytes\n", name,
> +                   qdict_get_int(qdict, "bytes") >> 10);
> +    monitor_printf(mon, "transferred %s: %" PRIu64 " pages\n", name,
> +                   qdict_get_int(qdict, "pages"));
> +    overflow = qdict_get_int(qdict, "overflow");
> +    if (overflow > 0) {
> +        monitor_printf(mon, "overflow %s: %" PRIu64 " pages\n", name,
> +                       overflow);
> +    }
> +    cache_hit = qdict_get_int(qdict, "cache-hit");
> +    if (cache_hit > 0) {
> +        monitor_printf(mon, "cache-hit %s: %" PRIu64 " pages\n", name,
> +                       cache_hit);
> +    }
> +    cache_lookup = qdict_get_int(qdict, "cache-lookup");
> +    if (cache_lookup > 0) {
> +        monitor_printf(mon, "cache-lookup %s: %" PRIu64 " pages\n", name,
> +                       cache_lookup);
> +    }
> +}
> +
> void do_info_migrate_print(Monitor *mon, const QObject *data)
> {
>     QDict *qdict;
> @@ -198,6 +238,18 @@ void do_info_migrate_print(Monitor *mon, const QObject *data)
>         migrate_print_status(mon, "ram", qdict);
>     }
>
> +    if (qdict_haskey(qdict, "ram-duplicate")) {
> +        migrate_print_ram_status(mon, "ram-duplicate", qdict);
> +    }
> +
> +    if (qdict_haskey(qdict, "ram-normal")) {
> +        migrate_print_ram_status(mon, "ram-normal", qdict);
> +    }
> +
> +    if (qdict_haskey(qdict, "ram-xbrle")) {
> +        migrate_print_ram_status(mon, "ram-xbrle", qdict);
> +    }
> +
>     if (qdict_haskey(qdict, "disk")) {
>         migrate_print_status(mon, "disk", qdict);
>     }
> @@ -214,6 +266,23 @@ static void migrate_put_status(QDict *qdict, const char *name,
>     qdict_put_obj(qdict, name, obj);
> }
>
> +static void migrate_put_ram_status(QDict *qdict, const char *name,
> +                                   uint64_t bytes, uint64_t pages,
> +                                   uint64_t overflow, uint64_t cache_hit,
> +                                   uint64_t cache_lookup)
> +{
> +    QObject *obj;
> +
> +    obj = qobject_from_jsonf("{ 'bytes': %" PRId64 ", "
> +                             "'pages': %" PRId64 ", "
> +                             "'overflow': %" PRId64 ", "
> +                             "'cache-hit': %" PRId64 ", "
> +                             "'cache-lookup': %" PRId64 " }",
> +                             bytes, pages, overflow, cache_hit,
> +                             cache_lookup);
> +    qdict_put_obj(qdict, name, obj);
> +}
> +
> void do_info_migrate(Monitor *mon, QObject **ret_data)
> {
>     QDict *qdict;
> @@ -228,6 +297,21 @@ void do_info_migrate(Monitor *mon, QObject **ret_data)
>         migrate_put_status(qdict, "ram", ram_bytes_transferred(),
>                            ram_bytes_remaining(), ram_bytes_total());
>
> +        if (s->use_xbrle) {
> +            migrate_put_ram_status(qdict, "ram-duplicate",
> +                                   dup_mig_bytes_transferred(),
> +                                   dup_mig_pages_transferred(), 0, 0, 0);
> +            migrate_put_ram_status(qdict, "ram-normal",
> +                                   norm_mig_bytes_transferred(),
> +                                   norm_mig_pages_transferred(), 0, 0, 0);
> +            migrate_put_ram_status(qdict, "ram-xbrle",
> +                                   xbrle_mig_bytes_transferred(),
> +                                   xbrle_mig_pages_transferred(),
> +                                   xbrle_mig_pages_overflow(),
> +                                   xbrle_mig_pages_cache_hit(),
> +                                   xbrle_mig_pages_cache_lookup());
> +        }
> +
>         if (blk_mig_active()) {
>             migrate_put_status(qdict, "disk", blk_mig_bytes_transferred(),
>                                blk_mig_bytes_remaining(),
> @@ -341,7 +425,8 @@ void migrate_fd_connect(FdMigrationState *s)
>     DPRINTF("beginning savevm\n");
>     ret = qemu_savevm_state_begin(s->mon, s->file, s->mig_state.blk,
> -                                  s->mig_state.shared);
> +                                  s->mig_state.shared, s->mig_state.use_xbrle,
> +                                  s->mig_state.xbrle_cache_size);
>     if (ret < 0) {
>         DPRINTF("failed, %d\n", ret);
>         migrate_fd_error(s);
> @@ -448,3 +533,27 @@ int migrate_fd_close(void *opaque)
>     qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
>     return s->close(s);
> }
> +
> +void do_migrate_set_cachesize(Monitor *mon, const QDict *qdict)
> +{
> +    ssize_t bytes;
> +    const char *value = qdict_get_str(qdict, "value");
> +
> +    bytes = strtosz(value, NULL);
> +    if (bytes < 0) {
> +        monitor_printf(mon, "invalid cache size: %s\n", value);
> +        return;
> +    }
> +
> +    /* On 32-bit hosts, QEMU is limited by virtual address space */
> +    if (bytes > (2047 << 20) && HOST_LONG_BITS == 32) {
> +        monitor_printf(mon, "cache can't exceed 2047 MB RAM limit on host\n");
> +        return;
> +    }
> +    if (bytes != (uint64_t) bytes) {
> +        monitor_printf(mon, "cache size too large\n");
> +        return;
> +    }
> +    migrate_cache_size = bytes;
> +}
> diff --git a/migration.h b/migration.h
> index d13ed4f..6dc0543 100644
> --- a/migration.h
> +++ b/migration.h
> @@ -32,6 +32,8 @@ struct MigrationState
>     void (*release)(MigrationState *s);
>     int blk;
>     int shared;
> +    int use_xbrle;
> +    int64_t xbrle_cache_size;
> };
>
> typedef struct FdMigrationState FdMigrationState;
> @@ -76,7 +78,9 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon,
>                                               int64_t bandwidth_limit,
>                                               int detach,
>                                               int blk,
> -                                              int inc);
> +                                              int inc,
> +                                              int use_xbrle,
> +                                              int64_t xbrle_cache_size);
>
> int tcp_start_incoming_migration(const char *host_port);
>
> @@ -85,7 +89,9 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
>                                              int64_t bandwidth_limit,
>                                              int detach,
>                                              int blk,
> -                                             int inc);
> +                                             int inc,
> +                                             int use_xbrle,
> +                                             int64_t xbrle_cache_size);
>
> int unix_start_incoming_migration(const char *path);
>
> @@ -94,7 +100,9 @@ MigrationState *unix_start_outgoing_migration(Monitor *mon,
>                                               int64_t bandwidth_limit,
>                                               int detach,
>                                               int blk,
> -                                              int inc);
> +                                              int inc,
> +                                              int use_xbrle,
> +                                              int64_t xbrle_cache_size);
>
> int fd_start_incoming_migration(const char *path);
>
> @@ -103,7 +111,9 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon,
>                                             int64_t bandwidth_limit,
>                                             int detach,
>                                             int blk,
> -                                            int inc);
> +                                            int inc,
> +                                            int use_xbrle,
> +                                            int64_t xbrle_cache_size);
>
> void migrate_fd_monitor_suspend(FdMigrationState *s, Monitor *mon);
>
> @@ -134,4 +144,11 @@ static inline FdMigrationState *migrate_to_fms(MigrationState *mig_state)
>     return container_of(mig_state, FdMigrationState, mig_state);
> }
>
> +void do_migrate_set_cachesize(Monitor *mon, const QDict *qdict);
> +
> +void arch_set_params(int blk_enable, int shared_base,
> +                     int use_xbrle, int64_t xbrle_cache_size, void *opaque);
> +
> +int xbrle_mig_active(void);
> +
> #endif
> diff --git a/qmp-commands.hx b/qmp-commands.hx
> index 793cf1c..8fbe64b 100644
> --- a/qmp-commands.hx
> +++ b/qmp-commands.hx
> @@ -431,13 +431,16 @@ EQMP
>     {
>         .name       = "migrate",
> -        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
> -        .params     = "[-d] [-b] [-i] uri",
> -        .help       = "migrate to URI (using -d to not wait for completion)"
> -                      "\n\t\t\t -b for migration without shared storage with"
> -                      " full copy of disk\n\t\t\t -i for migration without "
> -                      "shared storage with incremental copy of disk "
> -                      "(base image shared between src and destination)",
> +        .args_type  = "detach:-d,blk:-b,inc:-i,xbrle:-x,uri:s",
> +        .params     = "[-d] [-b] [-i] [-x] uri",
> +        .help       = "migrate to URI"
> +                      "\n\t -d to not wait for completion"
> +                      "\n\t -b for migration without shared storage with"
> +                      " full copy of disk"
> +                      "\n\t -i for migration without"
> +                      " shared storage with incremental copy of disk"
> +                      " (base image shared between source and destination)"
> +                      "\n\t -x to use XBRLE page delta compression",
>         .user_print = monitor_user_noop,
>         .mhandler.cmd_new = do_migrate,
>     },
> @@ -453,6 +456,7 @@
> Arguments:
>
> - "blk": block migration, full disk copy (json-bool, optional)
> - "inc": incremental disk copy (json-bool, optional)
> - "uri": Destination URI (json-string)
> +- "xbrle": use XBRLE page delta compression (json-bool, optional)
>
> Example:
>
> @@ -494,6 +498,31 @@ Example:
>
> EQMP
>
>     {
> +        .name       = "migrate_set_cachesize",
> +        .args_type  = "value:s",
> +        .params     = "value",
> +        .help       = "set cache size (in MB) for XBRLE migrations",
> +        .mhandler.cmd = do_migrate_set_cachesize,
> +    },
> +
> +SQMP
> +migrate_set_cachesize
> +---------------------
> +
> +Set cache size to be used by XBRLE migration
> +
> +Arguments:
> +
> +- "value": cache size in bytes (json-number)
> +
> +Example:
> +
> +-> { "execute": "migrate_set_cachesize", "arguments": { "value": 536870912 } }
> +<- { "return": {} }
> +
> +EQMP
> +
> +    {
>         .name       = "migrate_set_speed",
>         .args_type  = "value:f",
>         .params     = "value",
> diff --git a/savevm.c b/savevm.c
> index 4e49765..93b512b 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -1141,7 +1141,8 @@ int register_savevm(DeviceState *dev,
>                     void *opaque)
> {
>     return register_savevm_live(dev, idstr, instance_id, version_id,
> -                                NULL, NULL, save_state, load_state, opaque);
> +                                arch_set_params, NULL, save_state,
> +                                load_state, opaque);
> }
>
> void unregister_savevm(DeviceState *dev, const char *idstr, void *opaque)
> @@ -1428,15 +1429,17 @@ static int vmstate_save(QEMUFile *f, SaveStateEntry *se)
> #define QEMU_VM_SUBSECTION           0x05
>
> int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
> -                            int shared)
> +                            int shared, int use_xbrle,
> +                            int64_t xbrle_cache_size)
> {
>     SaveStateEntry *se;
>
>     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
>         if(se->set_params == NULL) {
>             continue;
> -        }
> -        se->set_params(blk_enable, shared, se->opaque);
> +        }
> +        se->set_params(blk_enable, shared, use_xbrle, xbrle_cache_size,
> +                       se->opaque);
>     }
>
>     qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
> @@ -1577,7 +1580,7 @@ static int qemu_savevm_state(Monitor *mon, QEMUFile *f)
>
>     bdrv_flush_all();
>
> -    ret = qemu_savevm_state_begin(mon, f, 0, 0);
> +    ret = qemu_savevm_state_begin(mon, f, 0, 0, 0, 0);
>     if (ret < 0)
>         goto out;
> diff --git a/sysemu.h b/sysemu.h
> index b81a70e..eb53bf7 100644
> --- a/sysemu.h
> +++ b/sysemu.h
> @@ -44,6 +44,16 @@ uint64_t ram_bytes_remaining(void);
> uint64_t ram_bytes_transferred(void);
> uint64_t ram_bytes_total(void);
>
> +uint64_t dup_mig_bytes_transferred(void);
> +uint64_t dup_mig_pages_transferred(void);
> +uint64_t norm_mig_bytes_transferred(void);
> +uint64_t norm_mig_pages_transferred(void);
> +uint64_t xbrle_mig_bytes_transferred(void);
> +uint64_t xbrle_mig_pages_transferred(void);
> +uint64_t xbrle_mig_pages_overflow(void);
> +uint64_t xbrle_mig_pages_cache_lookup(void);
> +uint64_t xbrle_mig_pages_cache_hit(void);
> +
> int64_t cpu_get_ticks(void);
> void cpu_enable_ticks(void);
> void cpu_disable_ticks(void);
> @@ -74,7 +84,8 @@ void qemu_announce_self(void);
> void main_loop_wait(int nonblocking);
>
> int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
> -                            int shared);
> +                            int shared, int use_xbrle,
> +                            int64_t xbrle_cache_size);
> int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
> int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
> void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
> diff --git a/xbzrle.c b/xbzrle.c
> new file mode 100644
> index 0000000..e9285e0
> --- /dev/null
> +++ b/xbzrle.c
> @@ -0,0 +1,126 @@
> +#include <stdint.h>
> +#include <string.h>
> +#include <assert.h>
> +#include "cpu-all.h"
> +#include "xbzrle.h"
> +
> +typedef struct {
> +    uint64_t c;
> +    uint64_t num;
> +} zero_encoding_t;
> +
> +typedef struct {
> +    uint64_t c;
> +} char_encoding_t;
> +
> +static int rle_encode(uint64_t *in, int slen, uint8_t *out, const int dlen)
> +{
> +    int dl = 0;
> +    uint64_t cp = 0, c, run_len = 0;
> +
> +    if (slen <= 0)
> +        return -1;
> +
> +    while (1) {
> +        if (!slen)
> +            break;
> +        c = *in++;
> +        slen--;
> +        if (!(cp || c)) {
> +            run_len++;
> +        } else if (!cp) {
> +            ((zero_encoding_t *)out)->c = cp;
> +            ((zero_encoding_t *)out)->num = run_len;
> +            dl += sizeof(zero_encoding_t);
> +            out += sizeof(zero_encoding_t);
> +            run_len = 1;
> +        } else {
> +            ((char_encoding_t *)out)->c = cp;
> +            dl += sizeof(char_encoding_t);
> +            out += sizeof(char_encoding_t);
> +        }
> +        cp = c;
> +    }
> +
> +    if (!cp) {
> +        ((zero_encoding_t *)out)->c = cp;
> +        ((zero_encoding_t *)out)->num = run_len;
> +        dl += sizeof(zero_encoding_t);
> +        out += sizeof(zero_encoding_t);
> +    } else {
> +        ((char_encoding_t *)out)->c = cp;
> +        dl += sizeof(char_encoding_t);
> +        out += sizeof(char_encoding_t);
> +    }
> +    return dl;
> +}
> +
> +static int rle_decode(const uint8_t *in, int slen, uint64_t *out, int dlen)
> +{
> +    int tb = 0;
> +    uint64_t run_len, c;
> +
> +    while (slen > 0) {
> +        c = ((char_encoding_t *) in)->c;
> +        if (c) {
> +            slen -= sizeof(char_encoding_t);
> +            in += sizeof(char_encoding_t);
> +            *out++ = c;
> +            tb++;
> +            continue;
> +        }
> +        run_len = ((zero_encoding_t *) in)->num;
> +        slen -= sizeof(zero_encoding_t);
> +        in += sizeof(zero_encoding_t);
> +        while (run_len-- > 0) {
> +            *out++ = c;
> +            tb++;
> +        }
> +    }
> +    return tb;
> +}
> +
> +static void xor_encode_word(uint8_t *dst, const uint8_t *src1,
> +                            const uint8_t *src2)
> +{
> +    int len = TARGET_PAGE_SIZE / sizeof (uint64_t);
> +    uint64_t *dstw = (uint64_t *) dst;
> +    const uint64_t *srcw1 = (const uint64_t *) src1;
> +    const uint64_t *srcw2 = (const uint64_t *) src2;
> +
> +    while (len--) {
> +        *dstw++ = *srcw1++ ^ *srcw2++;
> +    }
> +}
> +
> +int xbzrle_encode(uint8_t *xbzrle, const uint8_t *old, const uint8_t *curr,
> +                  const size_t max_compressed_len)
> +{
> +    int compressed_len;
> +    uint8_t xor_buf[TARGET_PAGE_SIZE];
> +    uint8_t work_buf[TARGET_PAGE_SIZE * 2]; /* worst case xbzrle is 150% */
> +
> +    xor_encode_word(xor_buf, old, curr);
> +    compressed_len = rle_encode((uint64_t *)xor_buf,
> +                                sizeof(xor_buf)/sizeof(uint64_t), work_buf,
> +                                sizeof(work_buf));
> +    if (compressed_len > max_compressed_len) {
> +        return -1;
> +    }
> +    memcpy(xbzrle, work_buf, compressed_len);
> +    return compressed_len;
> +}
> +
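To make the wire format above concrete: each zero run costs 16 bytes (a
zero word plus a 64-bit count) and each non-zero word is stored verbatim
in 8 bytes, so a page whose XOR against the cached copy is sparse shrinks
dramatically. A standalone toy that spells out the arithmetic (not patch
code; the struct names are copied from xbzrle.c):

    #include <stdint.h>
    #include <stdio.h>

    /* mirrors the records rle_encode() writes */
    typedef struct { uint64_t c; uint64_t num; } zero_encoding_t;
    typedef struct { uint64_t c; } char_encoding_t;

    int main(void)
    {
        /* a 4096-byte page XORed against its cached copy, with one
         * changed 64-bit word at word index 100: the encoder emits a
         * 100-word zero run, the literal, then a 411-word zero run */
        zero_encoding_t run1 = { 0, 100 };
        char_encoding_t lit  = { 0xdeadbeefULL };
        zero_encoding_t run2 = { 0, 411 };

        printf("zero run %llu + literal 0x%llx + zero run %llu\n",
               (unsigned long long)run1.num,
               (unsigned long long)lit.c,
               (unsigned long long)run2.num);
        printf("4096 raw bytes -> %d encoded bytes\n",
               (int)(sizeof(run1) + sizeof(lit) + sizeof(run2))); /* 40 */
        return 0;
    }
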
> +int xbzrle_decode(uint8_t *curr, const uint8_t *old, const uint8_t *xbrle,
> +                  const size_t compressed_len)
> +{
> +    uint8_t xor_buf[TARGET_PAGE_SIZE];
> +
> +    int len = rle_decode(xbrle, compressed_len,
> +                         (uint64_t *)xor_buf, sizeof(xor_buf)/sizeof(uint64_t));
> +    if (len < 0) {
> +        return len;
> +    }
> +    xor_encode_word(curr, old, xor_buf);
> +    return len * sizeof(uint64_t);
> +}
> diff --git a/xbzrle.h b/xbzrle.h
> new file mode 100644
> index 0000000..5d625a0
> --- /dev/null
> +++ b/xbzrle.h
> @@ -0,0 +1,12 @@
> +#ifndef _XBZRLE_H_
> +#define _XBZRLE_H_
> +
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +int xbzrle_encode(uint8_t *xbrle, const uint8_t *old, const uint8_t *curr,
> +                  const size_t len);
> +int xbzrle_decode(uint8_t *curr, const uint8_t *old, const uint8_t *xbrle,
> +                  const size_t len);
> +
> +#endif
> +
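For anyone who wants to poke at the encoder in isolation, a minimal
round-trip check against the API above. This is my sketch, not part of
the patch, and it assumes it is compiled inside the patched tree so that
cpu-all.h supplies TARGET_PAGE_SIZE:

    #include <assert.h>
    #include <stdio.h>
    #include <string.h>
    #include "cpu-all.h"   /* TARGET_PAGE_SIZE */
    #include "xbzrle.h"

    int main(void)
    {
        static uint8_t old_page[TARGET_PAGE_SIZE];
        static uint8_t new_page[TARGET_PAGE_SIZE];
        static uint8_t delta[TARGET_PAGE_SIZE * 2]; /* worst case is 150% */
        static uint8_t out[TARGET_PAGE_SIZE];
        int dlen, len;

        memcpy(new_page, old_page, TARGET_PAGE_SIZE);
        new_page[100] ^= 0xff;   /* a sparse update, XBZRLE's best case */

        /* cap the delta at one page; a -1 return is the "overflow" case
         * where the sender falls back to transmitting the raw page */
        dlen = xbzrle_encode(delta, old_page, new_page, TARGET_PAGE_SIZE);
        assert(dlen >= 0);

        /* reconstruct the new page from the old page plus the delta */
        len = xbzrle_decode(out, old_page, delta, dlen);
        assert(len == TARGET_PAGE_SIZE);
        assert(memcmp(out, new_page, TARGET_PAGE_SIZE) == 0);

        printf("page delta: %d bytes\n", dlen);
        return 0;
    }
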