diff mbox

[v4] XBZRLE delta for live migration of large memory apps

Message ID AB5A8C7661872E428D6B8E1C2DFA35085D84AF0480@DEWDFECCR02.wdf.sap.corp
State New
Headers show

Commit Message

Shribman, Aidan Aug. 8, 2011, 8:42 a.m. UTC
Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
From: Aidan Shribman <aidan.shribman@sap.com>

By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime
and total live-migration time of VMs running memory write intensive workloads
typical of large enterprise applications such as SAP ERP Systems, and generally
speaking for any application with a sparse memory update pattern.

On the sender side XBZRLE is used as a compact delta encoding of page updates,
retrieving the old page content from an LRU cache (default size of 64 MB). The
receiving side uses the existing page content and XBZRLE to decode the new page
content.

Work was originally based on research results published VEE 2011: Evaluation of
Delta Compression Techniques for Efficient Live Migration of Large Virtual
Machines by Benoit, Svard, Tordsson and Elmroth. Additionally the delta encoder
XBRLE was improved further using XBZRLE instead.

XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads making it
ideal for in-line, real-time encoding such as is needed for live-migration.

A typical usage scenario:
    {qemu} migrate_set_cachesize 256m
    {qemu} migrate -x -d tcp:destination.host:4444
    {qemu} info migrate
    ...
    transferred ram-duplicate: A kbytes
    transferred ram-duplicate: B pages
    transferred ram-normal: C kbytes
    transferred ram-normal: D pages
    transferred ram-xbrle: E kbytes
    transferred ram-xbrle: F pages
    overflow ram-xbrle: G pages
    cache-hit ram-xbrle: H pages
    cache-lookup ram-xbrle: J pages

Testing: live migration with XBZRLE completed in 110 seconds, without live
migration was not able to complete.

A simple synthetic memory r/w load generator:
..    include <stdlib.h>
..    include <stdio.h>
..    int main()
..    {
..        char *buf = (char *) calloc(4096, 4096);
..        while (1) {
..            int i;
..            for (i = 0; i < 4096 * 4; i++) {
..                buf[i * 4096 / 4]++;
..            }
..            printf(".");
..        }
..    }

Signed-off-by: Benoit Hudzia <benoit.hudzia@sap.com>
Signed-off-by: Petter Svard <petters@cs.umu.se>
Signed-off-by: Aidan Shribman <aidan.shribman@sap.com>

--

 Makefile.target   |    1 +
 arch_init.c       |  351 ++++++++++++++++++++++++++++++++++++++++++++++------
 block-migration.c |    3 +-
 hash.h            |   72 +++++++++++
 hmp-commands.hx   |   36 ++++--
 hw/hw.h           |    3 +-
 lru.c             |  142 +++++++++++++++++++++
 lru.h             |   13 ++
 migration-exec.c  |    6 +-
 migration-fd.c    |    6 +-
 migration-tcp.c   |    6 +-
 migration-unix.c  |    6 +-
 migration.c       |  119 +++++++++++++++++-
 migration.h       |   25 +++-
 qmp-commands.hx   |   43 ++++++-
 savevm.c          |   13 ++-
 sysemu.h          |   13 ++-
 xbzrle.c          |  126 +++++++++++++++++++
 xbzrle.h          |   12 ++
 19 files changed, 917 insertions(+), 79 deletions(-)

Comments

Anthony Liguori Aug. 8, 2011, 1:29 p.m. UTC | #1
On 08/08/2011 03:42 AM, Shribman, Aidan wrote:
> Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
> From: Aidan Shribman<aidan.shribman@sap.com>
>
> By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime
> and total live-migration time of VMs running memory write intensive workloads
> typical of large enterprise applications such as SAP ERP Systems, and generally
> speaking for any application with a sparse memory update pattern.
>
> On the sender side XBZRLE is used as a compact delta encoding of page updates,
> retrieving the old page content from an LRU cache (default size of 64 MB). The
> receiving side uses the existing page content and XBZRLE to decode the new page
> content.
>
> Work was originally based on research results published VEE 2011: Evaluation of
> Delta Compression Techniques for Efficient Live Migration of Large Virtual
> Machines by Benoit, Svard, Tordsson and Elmroth. Additionally the delta encoder
> XBRLE was improved further using XBZRLE instead.
>
> XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads making it
> ideal for in-line, real-time encoding such as is needed for live-migration.
>
> A typical usage scenario:
>      {qemu} migrate_set_cachesize 256m
>      {qemu} migrate -x -d tcp:destination.host:4444
>      {qemu} info migrate
>      ...
>      transferred ram-duplicate: A kbytes
>      transferred ram-duplicate: B pages
>      transferred ram-normal: C kbytes
>      transferred ram-normal: D pages
>      transferred ram-xbrle: E kbytes
>      transferred ram-xbrle: F pages
>      overflow ram-xbrle: G pages
>      cache-hit ram-xbrle: H pages
>      cache-lookup ram-xbrle: J pages
>
> Testing: live migration with XBZRLE completed in 110 seconds, without live
> migration was not able to complete.
>
> A simple synthetic memory r/w load generator:
> ..    include<stdlib.h>
> ..    include<stdio.h>
> ..    int main()
> ..    {
> ..        char *buf = (char *) calloc(4096, 4096);
> ..        while (1) {
> ..            int i;
> ..            for (i = 0; i<  4096 * 4; i++) {
> ..                buf[i * 4096 / 4]++;
> ..            }
> ..            printf(".");
> ..        }
> ..    }
>
> Signed-off-by: Benoit Hudzia<benoit.hudzia@sap.com>
> Signed-off-by: Petter Svard<petters@cs.umu.se>
> Signed-off-by: Aidan Shribman<aidan.shribman@sap.com>

One thing that strikes me about this algorithm is that it's very good 
for a particular type of workload--shockingly good really.

I think workload aware migration compression is possible for a lot of 
different types of workloads.  That makes me a bit wary of QEMU growing 
quite a lot of compression mechanisms.

It makes me think that this logic may really belong at a higher level 
where more information is known about the workload.  For instance, I can 
imagine XBZRLE living in something like libvirt.

Today, parsing migration traffic is pretty horrible but I think we're 
pretty strongly committed to fixing that in 1.0.  That makes me wonder 
if it would be nicer architecturally for a higher level tool to own 
something like this.

Originally, when I added migration, I had the view that we would have 
transport plugins based on the exec: protocol.  That hasn't really 
happened since libvirt really owns migration but I think having XBZRLE 
as a transport plugin for libvirt is something worth considering.

I'm curious what people think about this type of approach.  CC'ing 
libvirt to get their input.

Regards,

Anthony Liguori

>
> --
>
>   Makefile.target   |    1 +
>   arch_init.c       |  351 ++++++++++++++++++++++++++++++++++++++++++++++------
>   block-migration.c |    3 +-
>   hash.h            |   72 +++++++++++
>   hmp-commands.hx   |   36 ++++--
>   hw/hw.h           |    3 +-
>   lru.c             |  142 +++++++++++++++++++++
>   lru.h             |   13 ++
>   migration-exec.c  |    6 +-
>   migration-fd.c    |    6 +-
>   migration-tcp.c   |    6 +-
>   migration-unix.c  |    6 +-
>   migration.c       |  119 +++++++++++++++++-
>   migration.h       |   25 +++-
>   qmp-commands.hx   |   43 ++++++-
>   savevm.c          |   13 ++-
>   sysemu.h          |   13 ++-
>   xbzrle.c          |  126 +++++++++++++++++++
>   xbzrle.h          |   12 ++
>   19 files changed, 917 insertions(+), 79 deletions(-)
>
> diff --git a/Makefile.target b/Makefile.target
> index 2800f47..b3215de 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -186,6 +186,7 @@ endif #CONFIG_BSD_USER
>   ifdef CONFIG_SOFTMMU
>
>   obj-y = arch_init.o cpus.o monitor.o machine.o gdbstub.o balloon.o
> +obj-y += lru.o xbzrle.o
>   # virtio has to be here due to weird dependency between PCI and virtio-net.
>   # need to fix this properly
>   obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-serial-bus.o
> diff --git a/arch_init.c b/arch_init.c
> old mode 100644
> new mode 100755
> index 4486925..d67dc82
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -40,6 +40,17 @@
>   #include "net.h"
>   #include "gdbstub.h"
>   #include "hw/smbios.h"
> +#include "lru.h"
> +#include "xbzrle.h"
> +
> +//#define DEBUG_ARCH_INIT
> +#ifdef DEBUG_ARCH_INIT
> +#define DPRINTF(fmt, ...) \
> +    do { fprintf(stdout, "arch_init: " fmt, ## __VA_ARGS__); } while (0)
> +#else
> +#define DPRINTF(fmt, ...) \
> +    do { } while (0)
> +#endif
>
>   #ifdef TARGET_SPARC
>   int graphic_width = 1024;
> @@ -88,6 +99,161 @@ const uint32_t arch_type = QEMU_ARCH;
>   #define RAM_SAVE_FLAG_PAGE     0x08
>   #define RAM_SAVE_FLAG_EOS      0x10
>   #define RAM_SAVE_FLAG_CONTINUE 0x20
> +#define RAM_SAVE_FLAG_XBZRLE    0x40
> +
> +/***********************************************************/
> +/* RAM Migration State */
> +typedef struct ArchMigrationState {
> +    int use_xbrle;
> +    int64_t xbrle_cache_size;
> +} ArchMigrationState;
> +
> +static ArchMigrationState arch_mig_state;
> +
> +void arch_set_params(int blk_enable, int shared_base, int use_xbrle,
> +        int64_t xbrle_cache_size, void *opaque)
> +{
> +    arch_mig_state.use_xbrle = use_xbrle;
> +    arch_mig_state.xbrle_cache_size = xbrle_cache_size;
> +}
> +
> +#define BE16_MAGIC 0x0123
> +
> +/***********************************************************/
> +/* XBZRLE (Xor Binary Zero Run-Length Encoding) */
> +typedef struct XBZRLEHeader {
> +    uint32_t xh_cksum; /* not used */
> +    uint16_t xh_magic;
> +    uint16_t xh_len;
> +    uint8_t xh_flags;
> +} XBZRLEHeader;
> +
> +static uint8_t dup_buf[TARGET_PAGE_SIZE];
> +
> +/***********************************************************/
> +/* accounting */
> +typedef struct AccountingInfo{
> +    uint64_t dup_pages;
> +    uint64_t norm_pages;
> +    uint64_t xbrle_bytes;
> +    uint64_t xbrle_pages;
> +    uint64_t xbrle_overflow;
> +    uint64_t xbrle_cache_lookup;
> +    uint64_t xbrle_cache_hit;
> +    uint64_t iterations;
> +} AccountingInfo;
> +
> +static AccountingInfo acct_info;
> +
> +static void acct_clear(void)
> +{
> +    memset(&acct_info, 0, sizeof(acct_info));
> +}
> +
> +uint64_t dup_mig_bytes_transferred(void)
> +{
> +    return acct_info.dup_pages;
> +}
> +
> +uint64_t dup_mig_pages_transferred(void)
> +{
> +    return acct_info.dup_pages;
> +}
> +
> +uint64_t norm_mig_bytes_transferred(void)
> +{
> +    return acct_info.norm_pages * TARGET_PAGE_SIZE;
> +}
> +
> +uint64_t norm_mig_pages_transferred(void)
> +{
> +    return acct_info.norm_pages;
> +}
> +
> +uint64_t xbrle_mig_bytes_transferred(void)
> +{
> +    return acct_info.xbrle_bytes;
> +}
> +
> +uint64_t xbrle_mig_pages_transferred(void)
> +{
> +    return acct_info.xbrle_pages;
> +}
> +
> +uint64_t xbrle_mig_pages_overflow(void)
> +{
> +    return acct_info.xbrle_overflow;
> +}
> +
> +uint64_t xbrle_mig_pages_cache_hit(void)
> +{
> +    return acct_info.xbrle_cache_hit;
> +}
> +
> +uint64_t xbrle_mig_pages_cache_lookup(void)
> +{
> +    return acct_info.xbrle_cache_lookup;
> +}
> +
> +static void save_block_hdr(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
> +        int cont, int flag)
> +{
> +        qemu_put_be64(f, offset | cont | flag);
> +        if (!cont) {
> +                qemu_put_byte(f, strlen(block->idstr));
> +                qemu_put_buffer(f, (uint8_t *)block->idstr,
> +                                strlen(block->idstr));
> +        }
> +}
> +
> +#define ENCODING_FLAG_XBZRLE 0x1
> +
> +static int save_xbrle_page(QEMUFile *f, uint8_t *current_page,
> +        ram_addr_t current_addr, RAMBlock *block, ram_addr_t offset, int cont)
> +{
> +    int encoded_len = 0, bytes_sent = 0;
> +    XBZRLEHeader hdr = {0, BE16_MAGIC};
> +    uint8_t *encoded, *old_page;
> +
> +    /* abort if page not cached */
> +    acct_info.xbrle_cache_lookup++;
> +    old_page = lru_lookup(current_addr);
> +    if (!old_page) {
> +        goto done;
> +    }
> +    acct_info.xbrle_cache_hit++;
> +
> +    /* XBZRLE (XOR+ZRLE) encoding */
> +    encoded = (uint8_t *) qemu_malloc(TARGET_PAGE_SIZE);
> +    encoded_len = xbzrle_encode(encoded, old_page, current_page,
> +            TARGET_PAGE_SIZE);
> +
> +    if (encoded_len<  0) {
> +        DPRINTF("XBZRLE encoding overflow - sending uncompressed\n");
> +        acct_info.xbrle_overflow++;
> +        goto done;
> +    }
> +
> +    hdr.xh_len = encoded_len;
> +    hdr.xh_flags |= ENCODING_FLAG_XBZRLE;
> +
> +    /* Send XBZRLE compressed page */
> +    save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_XBZRLE);
> +
> +    qemu_put_be32(f, hdr.xh_cksum);
> +    qemu_put_buffer(f, (uint8_t *)&hdr.xh_magic, sizeof (hdr.xh_magic));
> +    qemu_put_be16(f, hdr.xh_len);
> +    qemu_put_byte(f, hdr.xh_flags);
> +
> +    qemu_put_buffer(f, encoded, encoded_len);
> +    acct_info.xbrle_pages++;
> +    bytes_sent = encoded_len + sizeof(hdr);
> +    acct_info.xbrle_bytes += bytes_sent;
> +
> +done:
> +    qemu_free(encoded);
> +    return bytes_sent;
> +}
>
>   static int is_dup_page(uint8_t *page, uint8_t ch)
>   {
> @@ -107,7 +273,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
>   static RAMBlock *last_block;
>   static ram_addr_t last_offset;
>
> -static int ram_save_block(QEMUFile *f)
> +static int ram_save_block(QEMUFile *f, int stage)
>   {
>       RAMBlock *block = last_block;
>       ram_addr_t offset = last_offset;
> @@ -120,6 +286,7 @@ static int ram_save_block(QEMUFile *f)
>       current_addr = block->offset + offset;
>
>       do {
> +        lru_free_cb_t free_cb = qemu_free;
>           if (cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) {
>               uint8_t *p;
>               int cont = (block == last_block) ? RAM_SAVE_FLAG_CONTINUE : 0;
> @@ -128,28 +295,35 @@ static int ram_save_block(QEMUFile *f)
>                                               current_addr + TARGET_PAGE_SIZE,
>                                               MIGRATION_DIRTY_FLAG);
>
> -            p = block->host + offset;
> +            if (arch_mig_state.use_xbrle) {
> +                p = qemu_malloc(TARGET_PAGE_SIZE);
> +                memcpy(p, block->host + offset, TARGET_PAGE_SIZE);
> +            } else {
> +                p = block->host + offset;
> +            }
>
>               if (is_dup_page(p, *p)) {
> -                qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_COMPRESS);
> -                if (!cont) {
> -                    qemu_put_byte(f, strlen(block->idstr));
> -                    qemu_put_buffer(f, (uint8_t *)block->idstr,
> -                                    strlen(block->idstr));
> -                }
> +                save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_COMPRESS);
>                   qemu_put_byte(f, *p);
>                   bytes_sent = 1;
> -            } else {
> -                qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_PAGE);
> -                if (!cont) {
> -                    qemu_put_byte(f, strlen(block->idstr));
> -                    qemu_put_buffer(f, (uint8_t *)block->idstr,
> -                                    strlen(block->idstr));
> +                acct_info.dup_pages++;
> +                if (arch_mig_state.use_xbrle&&  !*p) {
> +                    p = dup_buf;
> +                    free_cb = NULL;
>                   }
> +            } else if (stage == 2&&  arch_mig_state.use_xbrle) {
> +                bytes_sent = save_xbrle_page(f, p, current_addr, block,
> +                    offset, cont);
> +            }
> +            if (!bytes_sent) {
> +                save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE);
>                   qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
>                   bytes_sent = TARGET_PAGE_SIZE;
> +                acct_info.norm_pages++;
> +            }
> +            if (arch_mig_state.use_xbrle) {
> +                lru_insert(current_addr, p, free_cb);
>               }
> -
>               break;
>           }
>
> @@ -221,6 +395,9 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>
>       if (stage<  0) {
>           cpu_physical_memory_set_dirty_tracking(0);
> +        if (arch_mig_state.use_xbrle) {
> +            lru_fini();
> +        }
>           return 0;
>       }
>
> @@ -235,6 +412,11 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>           last_block = NULL;
>           last_offset = 0;
>
> +        if (arch_mig_state.use_xbrle) {
> +            lru_init(arch_mig_state.xbrle_cache_size/TARGET_PAGE_SIZE, 0);
> +            acct_clear();
> +        }
> +
>           /* Make sure all dirty bits are set */
>           QLIST_FOREACH(block,&ram_list.blocks, next) {
>               for (addr = block->offset; addr<  block->offset + block->length;
> @@ -264,8 +446,9 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>       while (!qemu_file_rate_limit(f)) {
>           int bytes_sent;
>
> -        bytes_sent = ram_save_block(f);
> +        bytes_sent = ram_save_block(f, stage);
>           bytes_transferred += bytes_sent;
> +        acct_info.iterations++;
>           if (bytes_sent == 0) { /* no more blocks */
>               break;
>           }
> @@ -285,19 +468,79 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>           int bytes_sent;
>
>           /* flush all remaining blocks regardless of rate limiting */
> -        while ((bytes_sent = ram_save_block(f)) != 0) {
> +        while ((bytes_sent = ram_save_block(f, stage))) {
>               bytes_transferred += bytes_sent;
>           }
>           cpu_physical_memory_set_dirty_tracking(0);
> +        if (arch_mig_state.use_xbrle) {
> +            lru_fini();
> +        }
>       }
>
>       qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>
>       expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
>
> +    DPRINTF("ram_save_live: expected(%ld)<= max(%ld)?\n", expected_time,
> +        migrate_max_downtime());
> +
>       return (stage == 2)&&  (expected_time<= migrate_max_downtime());
>   }
>
> +static int load_xbrle(QEMUFile *f, ram_addr_t addr, void *host)
> +{
> +    int len, rc = -1;
> +    uint8_t *encoded;
> +    XBZRLEHeader hdr = {0};
> +
> +    /* extract ZRLE header */
> +    hdr.xh_cksum = qemu_get_be32(f);
> +    qemu_get_buffer(f, (uint8_t *)&hdr.xh_magic, sizeof (hdr.xh_magic));
> +    hdr.xh_len = qemu_get_be16(f);
> +    hdr.xh_flags = qemu_get_byte(f);
> +
> +    if (!(hdr.xh_flags&  ENCODING_FLAG_XBZRLE)) {
> +        fprintf(stderr, "Failed to load XZBRLE page - wrong compression!\n");
> +        goto done;
> +    }
> +
> +    if (hdr.xh_len>  TARGET_PAGE_SIZE) {
> +        fprintf(stderr, "Failed to load XZBRLE page - len overflow!\n");
> +        goto done;
> +    }
> +
> +    /* load data and decode */
> +    encoded = (uint8_t *) qemu_malloc(hdr.xh_len);
> +    qemu_get_buffer(f, encoded, hdr.xh_len);
> +    /* covert endianess if magic indicated destination differs from source */
> +    if (hdr.xh_magic != BE16_MAGIC) {
> +        const uint64_t *end = (uint64_t *) encoded +
> +            hdr.xh_len / sizeof (uint64_t);
> +        uint64_t *p;
> +        for (p = (uint64_t *) encoded; p<  end; p++) {
> +            bswap64s(p);
> +        }
> +    }
> +
> +    /* decode ZRLE */
> +    len = xbzrle_decode(host, host, encoded, hdr.xh_len);
> +    if (len == -1) {
> +        fprintf(stderr, "Failed to load XBZRLE page - decode error!\n");
> +        goto done;
> +    }
> +
> +    if (len != TARGET_PAGE_SIZE) {
> +        fprintf(stderr, "Failed to load XBZRLE page - size %d expected %d!\n",
> +            len, TARGET_PAGE_SIZE);
> +        goto done;
> +    }
> +
> +    rc = 0;
> +done:
> +    qemu_free(encoded);
> +    return rc;
> +}
> +
>   static inline void *host_from_stream_offset(QEMUFile *f,
>                                               ram_addr_t offset,
>                                               int flags)
> @@ -328,16 +571,38 @@ static inline void *host_from_stream_offset(QEMUFile *f,
>       return NULL;
>   }
>
> +static inline void *host_from_stream_offset_versioned(int version_id,
> +        QEMUFile *f, ram_addr_t offset, int flags)
> +{
> +        void *host;
> +        if (version_id == 3) {
> +                host = qemu_get_ram_ptr(offset);
> +        } else {
> +                host = host_from_stream_offset(f, offset, flags);
> +        }
> +        if (!host) {
> +            fprintf(stderr, "Failed to convert RAM address to host"
> +                    " for offset 0x%lX!\n", offset);
> +            abort();
> +        }
> +        return host;
> +}
> +
>   int ram_load(QEMUFile *f, void *opaque, int version_id)
>   {
>       ram_addr_t addr;
> -    int flags;
> +    int flags, ret = 0;
> +    static uint64_t seq_iter;
> +
> +    seq_iter++;
>
>       if (version_id<  3 || version_id>  4) {
> -        return -EINVAL;
> +        ret = -EINVAL;
> +        goto done;
>       }
>
>       do {
> +        void *host;
>           addr = qemu_get_be64(f);
>
>           flags = addr&  ~TARGET_PAGE_MASK;
> @@ -346,7 +611,8 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
>           if (flags&  RAM_SAVE_FLAG_MEM_SIZE) {
>               if (version_id == 3) {
>                   if (addr != ram_bytes_total()) {
> -                    return -EINVAL;
> +                    ret = -EINVAL;
> +                    goto done;
>                   }
>               } else {
>                   /* Synchronize RAM block list */
> @@ -365,8 +631,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
>
>                       QLIST_FOREACH(block,&ram_list.blocks, next) {
>                           if (!strncmp(id, block->idstr, sizeof(id))) {
> -                            if (block->length != length)
> -                                return -EINVAL;
> +                            if (block->length != length) {
> +                                ret = -EINVAL;
> +                                goto done;
> +                            }
>                               break;
>                           }
>                       }
> @@ -374,7 +642,8 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
>                       if (!block) {
>                           fprintf(stderr, "Unknown ramblock \"%s\", cannot "
>                                   "accept migration\n", id);
> -                        return -EINVAL;
> +                        ret = -EINVAL;
> +                        goto done;
>                       }
>
>                       total_ram_bytes -= length;
> @@ -383,17 +652,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
>           }
>
>           if (flags&  RAM_SAVE_FLAG_COMPRESS) {
> -            void *host;
>               uint8_t ch;
>
> -            if (version_id == 3)
> -                host = qemu_get_ram_ptr(addr);
> -            else
> -                host = host_from_stream_offset(f, addr, flags);
> -            if (!host) {
> -                return -EINVAL;
> -            }
> -
> +            host = host_from_stream_offset_versioned(version_id,
> +                            f, addr, flags);
>               ch = qemu_get_byte(f);
>               memset(host, ch, TARGET_PAGE_SIZE);
>   #ifndef _WIN32
> @@ -403,21 +665,28 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
>               }
>   #endif
>           } else if (flags&  RAM_SAVE_FLAG_PAGE) {
> -            void *host;
> -
> -            if (version_id == 3)
> -                host = qemu_get_ram_ptr(addr);
> -            else
> -                host = host_from_stream_offset(f, addr, flags);
> -
> +            host = host_from_stream_offset_versioned(version_id,
> +                            f, addr, flags);
>               qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
> +        } else if (flags&  RAM_SAVE_FLAG_XBZRLE) {
> +            host = host_from_stream_offset_versioned(version_id,
> +                            f, addr, flags);
> +            if (load_xbrle(f, addr, host)<  0) {
> +                ret = -EINVAL;
> +                goto done;
> +            }
>           }
> +
>           if (qemu_file_has_error(f)) {
> -            return -EIO;
> +            ret = -EIO;
> +            goto done;
>           }
>       } while (!(flags&  RAM_SAVE_FLAG_EOS));
>
> -    return 0;
> +done:
> +    DPRINTF("Completed load of VM with exit code %d seq iteration %ld\n",
> +            ret, seq_iter);
> +    return ret;
>   }
>
>   void qemu_service_io(void)
> diff --git a/block-migration.c b/block-migration.c
> index 3e66f49..504df70 100644
> --- a/block-migration.c
> +++ b/block-migration.c
> @@ -689,7 +689,8 @@ static int block_load(QEMUFile *f, void *opaque, int version_id)
>       return 0;
>   }
>
> -static void block_set_params(int blk_enable, int shared_base, void *opaque)
> +static void block_set_params(int blk_enable, int shared_base,
> +        int use_xbrle, int64_t xbrle_cache_size, void *opaque)
>   {
>       block_mig_state.blk_enable = blk_enable;
>       block_mig_state.shared_base = shared_base;
> diff --git a/hash.h b/hash.h
> new file mode 100644
> index 0000000..7109905
> --- /dev/null
> +++ b/hash.h
> @@ -0,0 +1,72 @@
> +#ifndef _LINUX_HASH_H
> +#define _LINUX_HASH_H
> +/* Fast hashing routine for ints,  longs and pointers.
> +   (C) 2002 William Lee Irwin III, IBM */
> +
> +/*
> + * Knuth recommends primes in approximately golden ratio to the maximum
> + * integer representable by a machine word for multiplicative hashing.
> + * Chuck Lever verified the effectiveness of this technique:
> + * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
> + *
> + * These primes are chosen to be bit-sparse, that is operations on
> + * them can use shifts and additions instead of multiplications for
> + * machines where multiplications are slow.
> + */
> +
> +typedef uint64_t u64;
> +typedef uint32_t u32;
> +#define BITS_PER_LONG TARGET_LONG_BITS
> +
> +/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
> +#define GOLDEN_RATIO_PRIME_32 0x9e370001UL
> +/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
> +#define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL
> +
> +#if BITS_PER_LONG == 32
> +#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_32
> +#define hash_long(val, bits) hash_32(val, bits)
> +#elif BITS_PER_LONG == 64
> +#define hash_long(val, bits) hash_64(val, bits)
> +#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_64
> +#else
> +#error Wordsize not 32 or 64
> +#endif
> +
> +static inline u64 hash_64(u64 val, unsigned int bits)
> +{
> +    u64 hash = val;
> +
> +    /*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
> +    u64 n = hash;
> +    n<<= 18;
> +    hash -= n;
> +    n<<= 33;
> +    hash -= n;
> +    n<<= 3;
> +    hash += n;
> +    n<<= 3;
> +    hash -= n;
> +    n<<= 4;
> +    hash += n;
> +    n<<= 2;
> +    hash += n;
> +
> +    /* High bits are more random, so use them. */
> +    return hash>>  (64 - bits);
> +}
> +
> +static inline u32 hash_32(u32 val, unsigned int bits)
> +{
> +    /* On some cpus multiply is faster, on others gcc will do shifts */
> +    u32 hash = val * GOLDEN_RATIO_PRIME_32;
> +
> +    /* High bits are more random, so use them. */
> +    return hash>>  (32 - bits);
> +}
> +
> +static inline unsigned long hash_ptr(void *ptr, unsigned int bits)
> +{
> +    return hash_long((unsigned long)ptr, bits);
> +}
> +#endif /* _LINUX_HASH_H */
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> old mode 100644
> new mode 100755
> index e5585ba..e49d5be
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -717,24 +717,27 @@ ETEXI
>
>       {
>           .name       = "migrate",
> -        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
> -        .params     = "[-d] [-b] [-i] uri",
> -        .help       = "migrate to URI (using -d to not wait for completion)"
> -                     "\n\t\t\t -b for migration without shared storage with"
> -                     " full copy of disk\n\t\t\t -i for migration without "
> -                     "shared storage with incremental copy of disk "
> -                     "(base image shared between src and destination)",
> +        .args_type  = "detach:-d,blk:-b,inc:-i,xbrle:-x,uri:s",
> +        .params     = "[-d] [-b] [-i] [-x] uri",
> +        .help       = "migrate to URI"
> +                      "\n\t -d to not wait for completion"
> +                      "\n\t -b for migration without shared storage with"
> +                      " full copy of disk"
> +                      "\n\t -i for migration without"
> +                      " shared storage with incremental copy of disk"
> +                      " (base image shared between source and destination)"
> +                      "\n\t -x to use XBRLE page delta compression",
>           .user_print = monitor_user_noop,
>          .mhandler.cmd_new = do_migrate,
>       },
>
> -
>   STEXI
> -@item migrate [-d] [-b] [-i] @var{uri}
> +@item migrate [-d] [-b] [-i] [-x] @var{uri}
>   @findex migrate
>   Migrate to @var{uri} (using -d to not wait for completion).
>          -b for migration with full copy of disk
>          -i for migration with incremental copy of disk (base image is shared)
> +    -x to use XBRLE page delta compression
>   ETEXI
>
>       {
> @@ -753,10 +756,23 @@ Cancel the current VM migration.
>   ETEXI
>
>       {
> +        .name       = "migrate_set_cachesize",
> +        .args_type  = "value:s",
> +        .params     = "value",
> +        .help       = "set cache size (in MB) for XBRLE migrations",
> +        .mhandler.cmd = do_migrate_set_cachesize,
> +    },
> +
> +STEXI
> +@item migrate_set_cachesize @var{value}
> +Set cache size (in MB) for xbrle migrations.
> +ETEXI
> +
> +    {
>           .name       = "migrate_set_speed",
>           .args_type  = "value:o",
>           .params     = "value",
> -        .help       = "set maximum speed (in bytes) for migrations. "
> +        .help       = "set maximum XBRLE cache size (in bytes) for migrations. "
>          "Defaults to MB if no size suffix is specified, ie. B/K/M/G/T",
>           .user_print = monitor_user_noop,
>           .mhandler.cmd_new = do_migrate_set_speed,
> diff --git a/hw/hw.h b/hw/hw.h
> index 9d2cfc2..aa336ec 100644
> --- a/hw/hw.h
> +++ b/hw/hw.h
> @@ -239,7 +239,8 @@ static inline void qemu_get_sbe64s(QEMUFile *f, int64_t *pv)
>   int64_t qemu_ftell(QEMUFile *f);
>   int64_t qemu_fseek(QEMUFile *f, int64_t pos, int whence);
>
> -typedef void SaveSetParamsHandler(int blk_enable, int shared, void * opaque);
> +typedef void SaveSetParamsHandler(int blk_enable, int shared,
> +        int use_xbrle, int64_t xbrle_cache_size, void *opaque);
>   typedef void SaveStateHandler(QEMUFile *f, void *opaque);
>   typedef int SaveLiveStateHandler(Monitor *mon, QEMUFile *f, int stage,
>                                    void *opaque);
> diff --git a/lru.c b/lru.c
> new file mode 100644
> index 0000000..e7230d0
> --- /dev/null
> +++ b/lru.c
> @@ -0,0 +1,142 @@
> +#include<assert.h>
> +#include<math.h>
> +#include "qemu-common.h"
> +#include "qemu-queue.h"
> +#include "host-utils.h"
> +#include "lru.h"
> +#include "hash.h"
> +
> +typedef struct CacheItem {
> +    ram_addr_t it_addr;
> +    uint8_t *it_data;
> +    lru_free_cb_t it_free;
> +    QCIRCLEQ_ENTRY(CacheItem) it_lru_next;
> +    QCIRCLEQ_ENTRY(CacheItem) it_bucket_next;
> +} CacheItem;
> +
> +typedef QCIRCLEQ_HEAD(, CacheItem) CacheBucket;
> +static CacheBucket *page_hash;
> +static int64_t cache_table_size;
> +static uint64_t cache_max_items;
> +static int64_t cache_num_items;
> +static uint8_t cache_hash_bits;
> +
> +static QCIRCLEQ_HEAD(page_lru, CacheItem) page_lru;
> +
> +static uint64_t next_pow_of_2(uint64_t v)
> +{
> +    v--;
> +    v |= v>>  1;
> +    v |= v>>  2;
> +    v |= v>>  4;
> +    v |= v>>  8;
> +    v |= v>>  16;
> +    v |= v>>  32;
> +    v++;
> +    return v;
> +}
> +
> +void lru_init(int64_t max_items, void *param)
> +{
> +    int i;
> +
> +    cache_num_items = 0;
> +    cache_max_items = max_items;
> +    /* add 20% to table size to reduce collisions */
> +    cache_table_size = next_pow_of_2(1.2 * max_items);
> +    cache_hash_bits = ctz64(cache_table_size) - 1;
> +
> +    QCIRCLEQ_INIT(&page_lru);
> +
> +    page_hash = qemu_mallocz(sizeof(CacheBucket) * cache_table_size);
> +    assert(page_hash);
> +    for (i = 0; i<  cache_table_size; i++) {
> +        QCIRCLEQ_INIT(&page_hash[i]);
> +    }
> +}
> +
> +static CacheBucket *page_bucket_list(ram_addr_t addr)
> +{
> +    return&page_hash[hash_long(addr, cache_hash_bits)];
> +}
> +
> +static void do_lru_remove(CacheItem *it)
> +{
> +    assert(it);
> +
> +    QCIRCLEQ_REMOVE(&page_lru, it, it_lru_next);
> +    QCIRCLEQ_REMOVE(page_bucket_list(it->it_addr), it, it_bucket_next);
> +    if (it->it_free) {
> +        (*it->it_free)(it->it_data);
> +    }
> +    qemu_free(it);
> +    cache_num_items--;
> +}
> +
> +static int do_lru_remove_first(void)
> +{
> +    CacheItem *first;
> +
> +    if (QCIRCLEQ_EMPTY(&page_lru)) {
> +        return -1;
> +    }
> +    first = QCIRCLEQ_FIRST(&page_lru);
> +    do_lru_remove(first);
> +    return 0;
> +}
> +
> +
> +void lru_fini(void)
> +{
> +    while (!do_lru_remove_first()) {
> +    }
> +    qemu_free(page_hash);
> +}
> +
> +static CacheItem *do_lru_lookup(ram_addr_t addr)
> +{
> +    CacheBucket *head = page_bucket_list(addr);
> +    CacheItem *it;
> +
> +    if (QCIRCLEQ_EMPTY(head)) {
> +        return NULL;
> +    }
> +    QCIRCLEQ_FOREACH(it, head, it_bucket_next) {
> +        if (addr == it->it_addr) {
> +            return it;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +uint8_t *lru_lookup(ram_addr_t addr)
> +{
> +    CacheItem *it = do_lru_lookup(addr);
> +    return it ? it->it_data : NULL;
> +}
> +
> +void lru_insert(ram_addr_t addr, uint8_t *data, lru_free_cb_t free_cb)
> +{
> +    CacheItem *it;
> +
> +    /* remove old if item exists */
> +    it = do_lru_lookup(addr);
> +    if (it) {
> +        do_lru_remove(it);
> +    }
> +
> +    /* evict LRU if require free space */
> +    if (cache_num_items == cache_max_items) {
> +        do_lru_remove_first();
> +    }
> +
> +    /* add new entry */
> +    it = qemu_mallocz(sizeof(*it));
> +    it->it_addr = addr;
> +    it->it_data = data;
> +    it->it_free = free_cb;
> +    QCIRCLEQ_INSERT_HEAD(page_bucket_list(addr), it, it_bucket_next);
> +    QCIRCLEQ_INSERT_TAIL(&page_lru, it, it_lru_next);
> +    cache_num_items++;
> +}
> +
> diff --git a/lru.h b/lru.h
> new file mode 100644
> index 0000000..6c70095
> --- /dev/null
> +++ b/lru.h
> @@ -0,0 +1,13 @@
> +#ifndef _LRU_H_
> +#define _LRU_H_
> +
> +#include<unistd.h>
> +#include<stdint.h>
> +#include "cpu-all.h"
> +typedef void (*lru_free_cb_t)(void *);
> +void lru_init(ssize_t num_items, void *param);
> +void lru_fini(void);
> +void lru_insert(ram_addr_t id, uint8_t *pdata, lru_free_cb_t free_cb);
> +uint8_t *lru_lookup(ram_addr_t addr);
> +#endif
> +
> diff --git a/migration-exec.c b/migration-exec.c
> index 14718dd..fe8254a 100644
> --- a/migration-exec.c
> +++ b/migration-exec.c
> @@ -67,7 +67,9 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon,
>                                                int64_t bandwidth_limit,
>                                                int detach,
>                                                int blk,
> -                                             int inc)
> +                          int inc,
> +                          int use_xbrle,
> +                          int64_t xbrle_cache_size)
>   {
>       FdMigrationState *s;
>       FILE *f;
> @@ -99,6 +101,8 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon,
>
>       s->mig_state.blk = blk;
>       s->mig_state.shared = inc;
> +    s->mig_state.use_xbrle = use_xbrle;
> +    s->mig_state.xbrle_cache_size = xbrle_cache_size;
>
>       s->state = MIG_STATE_ACTIVE;
>       s->mon = NULL;
> diff --git a/migration-fd.c b/migration-fd.c
> index 6d14505..4a1ddbd 100644
> --- a/migration-fd.c
> +++ b/migration-fd.c
> @@ -56,7 +56,9 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon,
>                                              int64_t bandwidth_limit,
>                                              int detach,
>                                              int blk,
> -                                           int inc)
> +                        int inc,
> +                        int use_xbrle,
> +                        int64_t xbrle_cache_size)
>   {
>       FdMigrationState *s;
>
> @@ -82,6 +84,8 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon,
>
>       s->mig_state.blk = blk;
>       s->mig_state.shared = inc;
> +    s->mig_state.use_xbrle = use_xbrle;
> +    s->mig_state.xbrle_cache_size = xbrle_cache_size;
>
>       s->state = MIG_STATE_ACTIVE;
>       s->mon = NULL;
> diff --git a/migration-tcp.c b/migration-tcp.c
> index b55f419..4ca5bf6 100644
> --- a/migration-tcp.c
> +++ b/migration-tcp.c
> @@ -81,7 +81,9 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
>                                                int64_t bandwidth_limit,
>                                                int detach,
>                                               int blk,
> -                                            int inc)
> +                         int inc,
> +                         int use_xbrle,
> +                         int64_t xbrle_cache_size)
>   {
>       struct sockaddr_in addr;
>       FdMigrationState *s;
> @@ -101,6 +103,8 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
>
>       s->mig_state.blk = blk;
>       s->mig_state.shared = inc;
> +    s->mig_state.use_xbrle = use_xbrle;
> +    s->mig_state.xbrle_cache_size = xbrle_cache_size;
>
>       s->state = MIG_STATE_ACTIVE;
>       s->mon = NULL;
> diff --git a/migration-unix.c b/migration-unix.c
> index 57232c0..0813902 100644
> --- a/migration-unix.c
> +++ b/migration-unix.c
> @@ -80,7 +80,9 @@ MigrationState *unix_start_outgoing_migration(Monitor *mon,
>                                                int64_t bandwidth_limit,
>                                                int detach,
>                                                int blk,
> -                                             int inc)
> +                          int inc,
> +                          int use_xbrle,
> +                          int64_t xbrle_cache_size)
>   {
>       FdMigrationState *s;
>       struct sockaddr_un addr;
> @@ -100,6 +102,8 @@ MigrationState *unix_start_outgoing_migration(Monitor *mon,
>
>       s->mig_state.blk = blk;
>       s->mig_state.shared = inc;
> +    s->mig_state.use_xbrle = use_xbrle;
> +    s->mig_state.xbrle_cache_size = xbrle_cache_size;
>
>       s->state = MIG_STATE_ACTIVE;
>       s->mon = NULL;
> diff --git a/migration.c b/migration.c
> old mode 100644
> new mode 100755
> index 9ee8b17..ccacf81
> --- a/migration.c
> +++ b/migration.c
> @@ -34,6 +34,11 @@
>   /* Migration speed throttling */
>   static uint32_t max_throttle = (32<<  20);
>
> +/* Migration XBRLE cache size */
> +#define DEFAULT_MIGRATE_CACHE_SIZE (64 * 1024 * 1024)
> +
> +static int64_t migrate_cache_size = DEFAULT_MIGRATE_CACHE_SIZE;
> +
>   static MigrationState *current_migration;
>
>   int qemu_start_incoming_migration(const char *uri)
> @@ -80,6 +85,7 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
>       int detach = qdict_get_try_bool(qdict, "detach", 0);
>       int blk = qdict_get_try_bool(qdict, "blk", 0);
>       int inc = qdict_get_try_bool(qdict, "inc", 0);
> +    int use_xbrle = qdict_get_try_bool(qdict, "xbrle", 0);
>       const char *uri = qdict_get_str(qdict, "uri");
>
>       if (current_migration&&
> @@ -90,17 +96,21 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
>
>       if (strstart(uri, "tcp:",&p)) {
>           s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
> -                                         blk, inc);
> +                                         blk, inc, use_xbrle,
> +                                         migrate_cache_size);
>   #if !defined(WIN32)
>       } else if (strstart(uri, "exec:",&p)) {
>           s = exec_start_outgoing_migration(mon, p, max_throttle, detach,
> -                                          blk, inc);
> +                                          blk, inc, use_xbrle,
> +                                          migrate_cache_size);
>       } else if (strstart(uri, "unix:",&p)) {
>           s = unix_start_outgoing_migration(mon, p, max_throttle, detach,
> -                                          blk, inc);
> +                                          blk, inc, use_xbrle,
> +                                          migrate_cache_size);
>       } else if (strstart(uri, "fd:",&p)) {
>           s = fd_start_outgoing_migration(mon, p, max_throttle, detach,
> -                                        blk, inc);
> +                                        blk, inc, use_xbrle,
> +                                        migrate_cache_size);
>   #endif
>       } else {
>           monitor_printf(mon, "unknown migration protocol: %s\n", uri);
> @@ -185,6 +195,36 @@ static void migrate_print_status(Monitor *mon, const char *name,
>                           qdict_get_int(qdict, "total")>>  10);
>   }
>
> +static void migrate_print_ram_status(Monitor *mon, const char *name,
> +                                 const QDict *status_dict)
> +{
> +    QDict *qdict;
> +    uint64_t overflow, cache_hit, cache_lookup;
> +
> +    qdict = qobject_to_qdict(qdict_get(status_dict, name));
> +
> +    monitor_printf(mon, "transferred %s: %" PRIu64 " kbytes\n", name,
> +                        qdict_get_int(qdict, "bytes")>>  10);
> +    monitor_printf(mon, "transferred %s: %" PRIu64 " pages\n", name,
> +                        qdict_get_int(qdict, "pages"));
> +    overflow = qdict_get_int(qdict, "overflow");
> +    if (overflow>  0) {
> +        monitor_printf(mon, "overflow %s: %" PRIu64 " pages\n", name,
> +            overflow);
> +    }
> +    cache_hit = qdict_get_int(qdict, "cache-hit");
> +    if (cache_hit>  0) {
> +        monitor_printf(mon, "cache-hit %s: %" PRIu64 " pages\n", name,
> +            cache_hit);
> +    }
> +    cache_lookup = qdict_get_int(qdict, "cache-lookup");
> +    if (cache_lookup>  0) {
> +        monitor_printf(mon, "cache-lookup %s: %" PRIu64 " pages\n", name,
> +            cache_lookup);
> +    }
> +
> +}
> +
>   void do_info_migrate_print(Monitor *mon, const QObject *data)
>   {
>       QDict *qdict;
> @@ -198,6 +238,18 @@ void do_info_migrate_print(Monitor *mon, const QObject *data)
>           migrate_print_status(mon, "ram", qdict);
>       }
>
> +    if (qdict_haskey(qdict, "ram-duplicate")) {
> +        migrate_print_ram_status(mon, "ram-duplicate", qdict);
> +    }
> +
> +    if (qdict_haskey(qdict, "ram-normal")) {
> +        migrate_print_ram_status(mon, "ram-normal", qdict);
> +    }
> +
> +    if (qdict_haskey(qdict, "ram-xbrle")) {
> +        migrate_print_ram_status(mon, "ram-xbrle", qdict);
> +    }
> +
>       if (qdict_haskey(qdict, "disk")) {
>           migrate_print_status(mon, "disk", qdict);
>       }
> @@ -214,6 +266,23 @@ static void migrate_put_status(QDict *qdict, const char *name,
>       qdict_put_obj(qdict, name, obj);
>   }
>
> +static void migrate_put_ram_status(QDict *qdict, const char *name,
> +                               uint64_t bytes, uint64_t pages,
> +                               uint64_t overflow, uint64_t cache_hit,
> +                               uint64_t cache_lookup)
> +{
> +    QObject *obj;
> +
> +    obj = qobject_from_jsonf("{ 'bytes': %" PRId64 ", "
> +                               "'pages': %" PRId64 ", "
> +                               "'overflow': %" PRId64 ", "
> +                               "'cache-hit': %" PRId64 ", "
> +                               "'cache-lookup': %" PRId64 " }",
> +                               bytes, pages, overflow, cache_hit,
> +                               cache_lookup);
> +    qdict_put_obj(qdict, name, obj);
> +}
> +
>   void do_info_migrate(Monitor *mon, QObject **ret_data)
>   {
>       QDict *qdict;
> @@ -228,6 +297,21 @@ void do_info_migrate(Monitor *mon, QObject **ret_data)
>               migrate_put_status(qdict, "ram", ram_bytes_transferred(),
>                                  ram_bytes_remaining(), ram_bytes_total());
>
> +            if (s->use_xbrle) {
> +                migrate_put_ram_status(qdict, "ram-duplicate",
> +                                   dup_mig_bytes_transferred(),
> +                                   dup_mig_pages_transferred(), 0, 0, 0);
> +                migrate_put_ram_status(qdict, "ram-normal",
> +                                   norm_mig_bytes_transferred(),
> +                                   norm_mig_pages_transferred(), 0, 0, 0);
> +                migrate_put_ram_status(qdict, "ram-xbrle",
> +                                   xbrle_mig_bytes_transferred(),
> +                                   xbrle_mig_pages_transferred(),
> +                                   xbrle_mig_pages_overflow(),
> +                                   xbrle_mig_pages_cache_hit(),
> +                                   xbrle_mig_pages_cache_lookup());
> +            }
> +
>               if (blk_mig_active()) {
>                   migrate_put_status(qdict, "disk", blk_mig_bytes_transferred(),
>                                      blk_mig_bytes_remaining(),
> @@ -341,7 +425,8 @@ void migrate_fd_connect(FdMigrationState *s)
>
>       DPRINTF("beginning savevm\n");
>       ret = qemu_savevm_state_begin(s->mon, s->file, s->mig_state.blk,
> -                                  s->mig_state.shared);
> +                                  s->mig_state.shared, s->mig_state.use_xbrle,
> +                                  s->mig_state.xbrle_cache_size);
>       if (ret<  0) {
>           DPRINTF("failed, %d\n", ret);
>           migrate_fd_error(s);
> @@ -448,3 +533,27 @@ int migrate_fd_close(void *opaque)
>       qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
>       return s->close(s);
>   }
> +
> +void do_migrate_set_cachesize(Monitor *mon, const QDict *qdict)
> +{
> +    ssize_t bytes;
> +    const char *value = qdict_get_str(qdict, "value");
> +
> +    bytes = strtosz(value, NULL);
> +    if (bytes<  0) {
> +        monitor_printf(mon, "invalid cache size: %s\n", value);
> +        return;
> +    }
> +
> +    /* On 32-bit hosts, QEMU is limited by virtual address space */
> +    if (bytes>  (2047<<  20)&&  HOST_LONG_BITS == 32) {
> +        monitor_printf(mon, "cache can't exceed 2047 MB RAM limit on host\n");
> +        return;
> +    }
> +    if (bytes != (uint64_t) bytes) {
> +        monitor_printf(mon, "cache size too large\n");
> +        return;
> +    }
> +    migrate_cache_size = bytes;
> +}
> +
> diff --git a/migration.h b/migration.h
> index d13ed4f..6dc0543 100644
> --- a/migration.h
> +++ b/migration.h
> @@ -32,6 +32,8 @@ struct MigrationState
>       void (*release)(MigrationState *s);
>       int blk;
>       int shared;
> +    int use_xbrle;
> +    int64_t xbrle_cache_size;
>   };
>
>   typedef struct FdMigrationState FdMigrationState;
> @@ -76,7 +78,9 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon,
>                                                int64_t bandwidth_limit,
>                                                int detach,
>                                                int blk,
> -                                             int inc);
> +                          int inc,
> +                          int use_xbrle,
> +                          int64_t xbrle_cache_size);
>
>   int tcp_start_incoming_migration(const char *host_port);
>
> @@ -85,7 +89,9 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
>                                               int64_t bandwidth_limit,
>                                               int detach,
>                                               int blk,
> -                                            int inc);
> +                         int inc,
> +                         int use_xbrle,
> +                         int64_t xbrle_cache_size);
>
>   int unix_start_incoming_migration(const char *path);
>
> @@ -94,7 +100,9 @@ MigrationState *unix_start_outgoing_migration(Monitor *mon,
>                                                int64_t bandwidth_limit,
>                                                int detach,
>                                                int blk,
> -                                             int inc);
> +                          int inc,
> +                          int use_xbrle,
> +                          int64_t xbrle_cache_size);
>
>   int fd_start_incoming_migration(const char *path);
>
> @@ -103,7 +111,9 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon,
>                                              int64_t bandwidth_limit,
>                                              int detach,
>                                              int blk,
> -                                           int inc);
> +                        int inc,
> +                        int use_xbrle,
> +                        int64_t xbrle_cache_size);
>
>   void migrate_fd_monitor_suspend(FdMigrationState *s, Monitor *mon);
>
> @@ -134,4 +144,11 @@ static inline FdMigrationState *migrate_to_fms(MigrationState *mig_state)
>       return container_of(mig_state, FdMigrationState, mig_state);
>   }
>
> +void do_migrate_set_cachesize(Monitor *mon, const QDict *qdict);
> +
> +void arch_set_params(int blk_enable, int shared_base,
> +        int use_xbrle, int64_t xbrle_cache_size, void *opaque);
> +
> +int xbrle_mig_active(void);
> +
>   #endif
> diff --git a/qmp-commands.hx b/qmp-commands.hx
> index 793cf1c..8fbe64b 100644
> --- a/qmp-commands.hx
> +++ b/qmp-commands.hx
> @@ -431,13 +431,16 @@ EQMP
>
>       {
>           .name       = "migrate",
> -        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
> -        .params     = "[-d] [-b] [-i] uri",
> -        .help       = "migrate to URI (using -d to not wait for completion)"
> -                     "\n\t\t\t -b for migration without shared storage with"
> -                     " full copy of disk\n\t\t\t -i for migration without "
> -                     "shared storage with incremental copy of disk "
> -                     "(base image shared between src and destination)",
> +        .args_type  = "detach:-d,blk:-b,inc:-i,xbrle:-x,uri:s",
> +        .params     = "[-d] [-b] [-i] [-x] uri",
> +        .help       = "migrate to URI"
> +                      "\n\t -d to not wait for completion"
> +                      "\n\t -b for migration without shared storage with"
> +                      " full copy of disk"
> +                      "\n\t -i for migration without"
> +                      " shared storage with incremental copy of disk"
> +                      " (base image shared between source and destination)"
> +                      "\n\t -x to use XBRLE page delta compression",
>           .user_print = monitor_user_noop,
>          .mhandler.cmd_new = do_migrate,
>       },
> @@ -453,6 +456,7 @@ Arguments:
>   - "blk": block migration, full disk copy (json-bool, optional)
>   - "inc": incremental disk copy (json-bool, optional)
>   - "uri": Destination URI (json-string)
> +- "xbrle": to use XBRLE page delta compression
>
>   Example:
>
> @@ -494,6 +498,31 @@ Example:
>   EQMP
>
>       {
> +        .name       = "migrate_set_cachesize",
> +        .args_type  = "value:s",
> +        .params     = "value",
> +        .help       = "set cache size (in MB) for xbrle migrations",
> +        .mhandler.cmd = do_migrate_set_cachesize,
> +    },
> +
> +SQMP
> +migrate_set_cachesize
> +---------------------
> +
> +Set cache size to be used by XBRLE migration
> +
> +Arguments:
> +
> +- "value": cache size in bytes (json-number)
> +
> +Example:
> +
> +->  { "execute": "migrate_set_cachesize", "arguments": { "value": 500M } }
> +<- { "return": {} }
> +
> +EQMP
> +
> +    {
>           .name       = "migrate_set_speed",
>           .args_type  = "value:f",
>           .params     = "value",
> diff --git a/savevm.c b/savevm.c
> index 4e49765..93b512b 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -1141,7 +1141,8 @@ int register_savevm(DeviceState *dev,
>                       void *opaque)
>   {
>       return register_savevm_live(dev, idstr, instance_id, version_id,
> -                                NULL, NULL, save_state, load_state, opaque);
> +                                arch_set_params, NULL, save_state,
> +                                load_state, opaque);
>   }
>
>   void unregister_savevm(DeviceState *dev, const char *idstr, void *opaque)
> @@ -1428,15 +1429,17 @@ static int vmstate_save(QEMUFile *f, SaveStateEntry *se)
>   #define QEMU_VM_SUBSECTION           0x05
>
>   int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
> -                            int shared)
> +                            int shared, int use_xbrle,
> +                            int64_t xbrle_cache_size)
>   {
>       SaveStateEntry *se;
>
>       QTAILQ_FOREACH(se,&savevm_handlers, entry) {
>           if(se->set_params == NULL) {
>               continue;
> -       }
> -       se->set_params(blk_enable, shared, se->opaque);
> +        }
> +        se->set_params(blk_enable, shared, use_xbrle, xbrle_cache_size,
> +                se->opaque);
>       }
>
>       qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
> @@ -1577,7 +1580,7 @@ static int qemu_savevm_state(Monitor *mon, QEMUFile *f)
>
>       bdrv_flush_all();
>
> -    ret = qemu_savevm_state_begin(mon, f, 0, 0);
> +    ret = qemu_savevm_state_begin(mon, f, 0, 0, 0, 0);
>       if (ret<  0)
>           goto out;
>
> diff --git a/sysemu.h b/sysemu.h
> index b81a70e..eb53bf7 100644
> --- a/sysemu.h
> +++ b/sysemu.h
> @@ -44,6 +44,16 @@ uint64_t ram_bytes_remaining(void);
>   uint64_t ram_bytes_transferred(void);
>   uint64_t ram_bytes_total(void);
>
> +uint64_t dup_mig_bytes_transferred(void);
> +uint64_t dup_mig_pages_transferred(void);
> +uint64_t norm_mig_bytes_transferred(void);
> +uint64_t norm_mig_pages_transferred(void);
> +uint64_t xbrle_mig_bytes_transferred(void);
> +uint64_t xbrle_mig_pages_transferred(void);
> +uint64_t xbrle_mig_pages_overflow(void);
> +uint64_t xbrle_mig_pages_cache_lookup(void);
> +uint64_t xbrle_mig_pages_cache_hit(void);
> +
>   int64_t cpu_get_ticks(void);
>   void cpu_enable_ticks(void);
>   void cpu_disable_ticks(void);
> @@ -74,7 +84,8 @@ void qemu_announce_self(void);
>   void main_loop_wait(int nonblocking);
>
>   int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
> -                            int shared);
> +                            int shared, int use_xbrle,
> +                            int64_t xbrle_cache_size);
>   int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
>   int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
>   void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
> diff --git a/xbzrle.c b/xbzrle.c
> new file mode 100644
> index 0000000..e9285e0
> --- /dev/null
> +++ b/xbzrle.c
> @@ -0,0 +1,126 @@
> +#include<stdint.h>
> +#include<string.h>
> +#include<assert.h>
> +#include "cpu-all.h"
> +#include "xbzrle.h"
> +
> +typedef struct {
> +    uint64_t c;
> +    uint64_t num;
> +} zero_encoding_t;
> +
> +typedef struct {
> +    uint64_t c;
> +} char_encoding_t;
> +
> +static int rle_encode(uint64_t *in, int slen, uint8_t *out, const int dlen)
> +{
> +    int dl = 0;
> +    uint64_t cp = 0, c, run_len = 0;
> +
> +    if (slen<=  0)
> +        return -1;
> +
> +    while (1) {
> +        if (!slen)
> +            break;
> +        c = *in++;
> +        slen--;
> +        if (!(cp || c)) {
> +            run_len++;
> +        } else if (!cp) {
> +            ((zero_encoding_t *)out)->c = cp;
> +            ((zero_encoding_t *)out)->num = run_len;
> +            dl += sizeof(zero_encoding_t);
> +            out += sizeof(zero_encoding_t);
> +            run_len = 1;
> +        } else {
> +            ((char_encoding_t *)out)->c = cp;
> +            dl += sizeof(char_encoding_t);
> +            out += sizeof(char_encoding_t);
> +                }
> +        cp = c;
> +    }
> +
> +    if (!cp) {
> +        ((zero_encoding_t *)out)->c = cp;
> +        ((zero_encoding_t *)out)->num = run_len;
> +        dl += sizeof(zero_encoding_t);
> +        out += sizeof(zero_encoding_t);
> +    } else {
> +        ((char_encoding_t *)out)->c = cp;
> +        dl += sizeof(char_encoding_t);
> +        out += sizeof(char_encoding_t);
> +    }
> +    return dl;
> +}
> +
> +static int rle_decode(const uint8_t *in, int slen, uint64_t *out, int dlen)
> +{
> +    int tb = 0;
> +    uint64_t run_len, c;
> +
> +    while (slen>  0) {
> +        c = ((char_encoding_t *) in)->c;
> +        if (c) {
> +            slen -= sizeof(char_encoding_t);
> +            in += sizeof(char_encoding_t);
> +            *out++ = c;
> +            tb++;
> +            continue;
> +        }
> +        run_len = ((zero_encoding_t *) in)->num;
> +        slen -= sizeof(zero_encoding_t);
> +        in += sizeof(zero_encoding_t);
> +        while (run_len-->  0) {
> +            *out++ = c;
> +            tb++;
> +        }
> +    }
> +    return tb;
> +}
> +
> +static void xor_encode_word(uint8_t *dst, const uint8_t *src1,
> +    const uint8_t *src2)
> +{
> +    int len = TARGET_PAGE_SIZE / sizeof (uint64_t);
> +    uint64_t *dstw = (uint64_t *) dst;
> +    const uint64_t *srcw1 = (const uint64_t *) src1;
> +    const uint64_t *srcw2 = (const uint64_t *) src2;
> +
> +    while (len--) {
> +        *dstw++ = *srcw1++ ^ *srcw2++;
> +    }
> +}
> +
> +int xbzrle_encode(uint8_t *xbzrle, const uint8_t *old, const uint8_t *curr,
> +    const size_t max_compressed_len)
> +{
> +    int compressed_len;
> +    uint8_t xor_buf[TARGET_PAGE_SIZE];
> +    uint8_t work_buf[TARGET_PAGE_SIZE * 2]; /* worst case xbzrle is 150% */
> +
> +    xor_encode_word(xor_buf, old, curr);
> +    compressed_len = rle_encode((uint64_t *)xor_buf,
> +        sizeof(xor_buf)/sizeof(uint64_t), work_buf,
> +        sizeof(work_buf));
> +    if (compressed_len>  max_compressed_len) {
> +        return -1;
> +    }
> +    memcpy(xbzrle, work_buf, compressed_len);
> +    return compressed_len;
> +}
> +
> +int xbzrle_decode(uint8_t *curr, const uint8_t *old, const uint8_t *xbrle,
> +    const size_t compressed_len)
> +{
> +    uint8_t xor_buf[TARGET_PAGE_SIZE];
> +
> +    int len = rle_decode(xbrle, compressed_len,
> +         (uint64_t *)xor_buf, sizeof(xor_buf)/sizeof(uint64_t));
> +    if (len<  0) {
> +        return len;
> +    }
> +    xor_encode_word(curr, old, xor_buf);
> +    return len * sizeof(uint64_t);
> +}
> diff --git a/xbzrle.h b/xbzrle.h
> new file mode 100644
> index 0000000..5d625a0
> --- /dev/null
> +++ b/xbzrle.h
> @@ -0,0 +1,12 @@
> +#ifndef _XBZRLE_H_
> +#define _XBZRLE_H_
> +
> +#include<stdio.h>
> +
> +int xbzrle_encode(uint8_t *xbrle, const uint8_t *old, const uint8_t *curr,
> +    const size_t len);
> +int xbzrle_decode(uint8_t *curr, const uint8_t *old, const uint8_t *xbrle,
> +    const size_t len);
> +
> +#endif
> +
>
Alexander Graf Aug. 8, 2011, 1:41 p.m. UTC | #2
On 08.08.2011, at 15:29, Anthony Liguori wrote:

> On 08/08/2011 03:42 AM, Shribman, Aidan wrote:
>> Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
>> From: Aidan Shribman<aidan.shribman@sap.com>
>> 
>> By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime
>> and total live-migration time of VMs running memory write intensive workloads
>> typical of large enterprise applications such as SAP ERP Systems, and generally
>> speaking for any application with a sparse memory update pattern.
>> 
>> On the sender side XBZRLE is used as a compact delta encoding of page updates,
>> retrieving the old page content from an LRU cache (default size of 64 MB). The
>> receiving side uses the existing page content and XBZRLE to decode the new page
>> content.
>> 
>> Work was originally based on research results published VEE 2011: Evaluation of
>> Delta Compression Techniques for Efficient Live Migration of Large Virtual
>> Machines by Benoit, Svard, Tordsson and Elmroth. Additionally the delta encoder
>> XBRLE was improved further using XBZRLE instead.
>> 
>> XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads making it
>> ideal for in-line, real-time encoding such as is needed for live-migration.
>> 
>> A typical usage scenario:
>>     {qemu} migrate_set_cachesize 256m
>>     {qemu} migrate -x -d tcp:destination.host:4444
>>     {qemu} info migrate
>>     ...
>>     transferred ram-duplicate: A kbytes
>>     transferred ram-duplicate: B pages
>>     transferred ram-normal: C kbytes
>>     transferred ram-normal: D pages
>>     transferred ram-xbrle: E kbytes
>>     transferred ram-xbrle: F pages
>>     overflow ram-xbrle: G pages
>>     cache-hit ram-xbrle: H pages
>>     cache-lookup ram-xbrle: J pages
>> 
>> Testing: live migration with XBZRLE completed in 110 seconds, without live
>> migration was not able to complete.
>> 
>> A simple synthetic memory r/w load generator:
>> ..    include<stdlib.h>
>> ..    include<stdio.h>
>> ..    int main()
>> ..    {
>> ..        char *buf = (char *) calloc(4096, 4096);
>> ..        while (1) {
>> ..            int i;
>> ..            for (i = 0; i<  4096 * 4; i++) {
>> ..                buf[i * 4096 / 4]++;
>> ..            }
>> ..            printf(".");
>> ..        }
>> ..    }
>> 
>> Signed-off-by: Benoit Hudzia<benoit.hudzia@sap.com>
>> Signed-off-by: Petter Svard<petters@cs.umu.se>
>> Signed-off-by: Aidan Shribman<aidan.shribman@sap.com>
> 
> One thing that strikes me about this algorithm is that it's very good for a particular type of workload--shockingly good really.
> 
> I think workload aware migration compression is possible for a lot of different types of workloads.  That makes me a bit wary of QEMU growing quite a lot of compression mechanisms.
> 
> It makes me think that this logic may really belong at a higher level where more information is known about the workload.  For instance, I can imagine XBZRLE living in something like libvirt.
> 
> Today, parsing migration traffic is pretty horrible but I think we're pretty strongly committed to fixing that in 1.0.  That makes me wonder if it would be nicer architecturally for a higher level tool to own something like this.
> 
> Originally, when I added migration, I had the view that we would have transport plugins based on the exec: protocol.  That hasn't really happened since libvirt really owns migration but I think having XBZRLE as a transport plugin for libvirt is something worth considering.
> 
> I'm curious what people think about this type of approach.  CC'ing libvirt to get their input.

In general, I believe it's a good idea to keep looking at libvirt as a vm management layer and only a vm management layer. Directly working with the migration protocol basically ties us to libvirt if we want to do migration, killing competition in the management stack. Just look at how xm is tied to xen - it's one of the major points I dislike about it :).


Alex
Anthony Liguori Aug. 8, 2011, 1:46 p.m. UTC | #3
On 08/08/2011 08:41 AM, Alexander Graf wrote:
>
> On 08.08.2011, at 15:29, Anthony Liguori wrote:
>
>> One thing that strikes me about this algorithm is that it's very good for a particular type of workload--shockingly good really.
>>
>> I think workload aware migration compression is possible for a lot of different types of workloads.  That makes me a bit wary of QEMU growing quite a lot of compression mechanisms.
>>
>> It makes me think that this logic may really belong at a higher level where more information is known about the workload.  For instance, I can imagine XBZRLE living in something like libvirt.
>>
>> Today, parsing migration traffic is pretty horrible but I think we're pretty strongly committed to fixing that in 1.0.  That makes me wonder if it would be nicer architecturally for a higher level tool to own something like this.
>>
>> Originally, when I added migration, I had the view that we would have transport plugins based on the exec: protocol.  That hasn't really happened since libvirt really owns migration but I think having XBZRLE as a transport plugin for libvirt is something worth considering.
>>
>> I'm curious what people think about this type of approach.  CC'ing libvirt to get their input.
>
> In general, I believe it's a good idea to keep looking at libvirt as a vm management layer and only a vm management layer. Directly working with the migration protocol basically ties us to libvirt if we want to do migration, killing competition in the management stack. Just look at how xm is tied to xen - it's one of the major points I dislike about it :).

The way I originally envisioned things, you'd have:

(qemu) migrate xbzrle://destination?opt1=value1&opt2=value2

Which would in turn be equivalent to:

(qemu) migrate exec:///usr/libexec/qemu/migration-helper-xbzrle 
--opt1=value1 --opt2=value2

But even if we supported that, it wouldn't get exposed via libvirt 
unless the libvirt guys exposed QEMU URIs directly.

So I think the open question is, how do we do transport plugins in a way 
that makes libvirt and QEMU both happy?

Regards,

Anthony Liguori

>
> Alex
>
>
Avi Kivity Aug. 8, 2011, 1:49 p.m. UTC | #4
On 08/08/2011 04:41 PM, Alexander Graf wrote:
> In general, I believe it's a good idea to keep looking at libvirt as a vm management layer and only a vm management layer.

Very much yes.
Avi Kivity Aug. 8, 2011, 1:51 p.m. UTC | #5
On 08/08/2011 04:29 PM, Anthony Liguori wrote:
>
> One thing that strikes me about this algorithm is that it's very good 
> for a particular type of workload--shockingly good really.

Poking bytes at random places in memory is fairly generic.  If you have 
a lot of small objects, and modify a subset of them, this is the pattern 
you get.

>
> I think workload aware migration compression is possible for a lot of 
> different types of workloads.  That makes me a bit wary of QEMU 
> growing quite a lot of compression mechanisms.
>
> It makes me think that this logic may really belong at a higher level 
> where more information is known about the workload.  For instance, I 
> can imagine XBZRLE living in something like libvirt.

A better model would be plugin based.
Daniel P. Berrangé Aug. 8, 2011, 2:04 p.m. UTC | #6
On Mon, Aug 08, 2011 at 08:29:51AM -0500, Anthony Liguori wrote:
> On 08/08/2011 03:42 AM, Shribman, Aidan wrote:
> >Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
> >From: Aidan Shribman<aidan.shribman@sap.com>
> >
> >By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime
> >and total live-migration time of VMs running memory write intensive workloads
> >typical of large enterprise applications such as SAP ERP Systems, and generally
> >speaking for any application with a sparse memory update pattern.

[snip]

> One thing that strikes me about this algorithm is that it's very
> good for a particular type of workload--shockingly good really.
> 
> I think workload aware migration compression is possible for a lot
> of different types of workloads.  That makes me a bit wary of QEMU
> growing quite a lot of compression mechanisms.
> 
> It makes me think that this logic may really belong at a higher
> level where more information is known about the workload.  For
> instance, I can imagine XBZRLE living in something like libvirt.
> 
> Today, parsing migration traffic is pretty horrible but I think
> we're pretty strongly committed to fixing that in 1.0.  That makes
> me wonder if it would be nicer architecturally for a higher level
> tool to own something like this.
> 
> Originally, when I added migration, I had the view that we would
> have transport plugins based on the exec: protocol.  That hasn't
> really happened since libvirt really owns migration but I think
> having XBZRLE as a transport plugin for libvirt is something worth
> considering.

NB I've not been much of a fan of the exec: migration code, since it
has proved rather buggy in practice when we used it for 'save/restore
to/from file' support. It has been hard to diagnose when things go
wrong, and difficult for QEMU to report any useful error messages.
Even with the tcp: protocol, QEMU is seemingly unable to provide any
useful error reporting even of things as simple as "unable to connect
to remote host". So with one exception, current libvirt now uses the
'fd:' protocol for everything, and the last exception will be removed
soon too.

> I'm curious what people think about this type of approach.  CC'ing
> libvirt to get their input.

In "normal" migration though, even when using fd:, we don't make
any attempt to touch the data stream. We just pass a pre-connected
TCP socket into QEMU and let it write directly to it. This avoids
extra data copying via libvirt.

In our alternative "tunnelled" migration mode, libvirt does touch
the data stream, passing a pipe FD into QEMU, and copying the data
from the pipe into packets to be sent over libvirtd's existing
secure RPC stream, and then copying it back to QEMU on the destination.
The downside here is that we've added several extra data copies.

In our "save/restore to file" code, we use 'fd:' and always have
to send the data via a filter program. For example, we have the
ability to compress/decompress data via gzip, bzip, xz, and lzop,
for which instead pass QEMU as pipe FD to the external compression
helper program. We also have another new option where we send data
via another I/O helper program that uses O_DIRECT, so save/restore
does not pollute the page cache.

With this kind of existing precedent, I won't strongly argue against
libvirt adding a filter to support this XBZRLE encoding scheme for
migration, or indeed save/restore too, if it proves better than
lzop which is our current optimal speed/compression winner.

My main concern with all these scenarios where libvirt touches the
actual data stream though is that we're introducing extra data copies
into the migration path which potentially waste CPU cycles.
If QEMU can directly XBZRLE encode data into the FD passed via 'fd:'
then we minimize data copies. Whether this is a big enough benefit
to offset the burden of having to maintain various compression code
options in QEMU I can't answer.

Regards,
Daniel
Anthony Liguori Aug. 8, 2011, 2:15 p.m. UTC | #7
On 08/08/2011 08:51 AM, Avi Kivity wrote:
> On 08/08/2011 04:29 PM, Anthony Liguori wrote:
>>
>> One thing that strikes me about this algorithm is that it's very good
>> for a particular type of workload--shockingly good really.
>
> Poking bytes at random places in memory is fairly generic. If you have a
> lot of small objects, and modify a subset of them, this is the pattern
> you get.
>
>>
>> I think workload aware migration compression is possible for a lot of
>> different types of workloads. That makes me a bit wary of QEMU growing
>> quite a lot of compression mechanisms.
>>
>> It makes me think that this logic may really belong at a higher level
>> where more information is known about the workload. For instance, I
>> can imagine XBZRLE living in something like libvirt.
>
> A better model would be plugin based.

exec helpers are plugins.  They just live in a different address space 
and a channel to exchange data (pipe).

If we did .so plugins, which I'm really not opposed to, I'd want the 
interface to be something like:

typedef struct MigrationTransportClass
{
    ssize_t (*writev)(MigrationTransport *obj,
                      struct iovec *iov,
                      int iovcnt);
} MigrationTransportClass;

I think it's useful to use an interface like this because it makes it 
easy to put the transport in a dedicated thread that didn't hold 
qemu_mutex (which is sort of equivalent to using a fork'd helper but is 
zero-copy at the expense of less isolation).

Regards,

Anthony Liguori

>
Avi Kivity Aug. 8, 2011, 2:23 p.m. UTC | #8
On 08/08/2011 05:15 PM, Anthony Liguori wrote:
>>
>>>
>>> I think workload aware migration compression is possible for a lot of
>>> different types of workloads. That makes me a bit wary of QEMU growing
>>> quite a lot of compression mechanisms.
>>>
>>> It makes me think that this logic may really belong at a higher level
>>> where more information is known about the workload. For instance, I
>>> can imagine XBZRLE living in something like libvirt.
>>
>> A better model would be plugin based.
>
>
> exec helpers are plugins.  They just live in a different address space 
> and a channel to exchange data (pipe).

libvirt isn't an exec helper.

>
> If we did .so plugins, which I'm really not opposed to, I'd want the 
> interface to be something like:
>
> typedef struct MigrationTransportClass
> {
>    ssize_t (*writev)(MigrationTransport *obj,
>                      struct iovec *iov,
>                      int iovcnt);
> } MigrationTransportClass;
>
> I think it's useful to use an interface like this because it makes it 
> easy to put the transport in a dedicated thread that didn't hold 
> qemu_mutex (which is sort of equivalent to using a fork'd helper but 
> is zero-copy at the expense of less isolation).

If we have a shared object helper, the thread should be maintained by 
qemu proper, not the plugin.

I wouldn't call it "migration transport", but instead a 
compression/decompression plugin.

I don't think it merits a plugin at all though.  There's limited scope 
for compression and it best sits in qemu proper.  If anything, it needs 
to be more integrated (for example turning itself off if it doesn't 
match enough).
Anthony Liguori Aug. 8, 2011, 2:33 p.m. UTC | #9
On 08/08/2011 09:23 AM, Avi Kivity wrote:
> On 08/08/2011 05:15 PM, Anthony Liguori wrote:
>>
>> If we did .so plugins, which I'm really not opposed to, I'd want the
>> interface to be something like:
>>
>> typedef struct MigrationTransportClass
>> {
>> ssize_t (*writev)(MigrationTransport *obj,
>> struct iovec *iov,
>> int iovcnt);
>> } MigrationTransportClass;
>>
>> I think it's useful to use an interface like this because it makes it
>> easy to put the transport in a dedicated thread that didn't hold
>> qemu_mutex (which is sort of equivalent to using a fork'd helper but
>> is zero-copy at the expense of less isolation).
>
> If we have a shared object helper, the thread should be maintained by
> qemu proper, not the plugin.
>
> I wouldn't call it "migration transport", but instead a
> compression/decompression plugin.
>
> I don't think it merits a plugin at all though. There's limited scope
> for compression and it best sits in qemu proper. If anything, it needs
> to be more integrated (for example turning itself off if it doesn't
> match enough).

That adds a tremendous amount of complexity to QEMU.  If we're going to 
change our compression algorithm, we would need to use a single 
algorithm that worked well for a wide variety of workloads.

We struggle enough with migration as it is, it only would get worse if 
we have 10 different algorithms that we were dynamically enabling/disabling.

The other option is to allow 1-off compression algorithms in the form of 
plugins.  I think in this case, plugins are a pretty good compromise in 
terms of isolating complexity while allowing something that at least 
works very well for one particular type of workload.

Regards,

Anthony Liguori
Avi Kivity Aug. 8, 2011, 2:39 p.m. UTC | #10
On 08/08/2011 05:33 PM, Anthony Liguori wrote:
>> If we have a shared object helper, the thread should be maintained by
>> qemu proper, not the plugin.
>>
>> I wouldn't call it "migration transport", but instead a
>> compression/decompression plugin.
>>
>> I don't think it merits a plugin at all though. There's limited scope
>> for compression and it best sits in qemu proper. If anything, it needs
>> to be more integrated (for example turning itself off if it doesn't
>> match enough).
>
>
> That adds a tremendous amount of complexity to QEMU. 

Tremendous?  You exaggerate.  It's a lot simpler than the block or char 
layers, for example.

> If we're going to change our compression algorithm, we would need to 
> use a single algorithm that worked well for a wide variety of workloads.

That algorithm will have to include XBZRLE as a subset, since it matches 
what workloads actually do (touch memory sparsely).

>
> We struggle enough with migration as it is, it only would get worse if 
> we have 10 different algorithms that we were dynamically 
> enabling/disabling.
>
> The other option is to allow 1-off compression algorithms in the form 
> of plugins.  I think in this case, plugins are a pretty good 
> compromise in terms of isolating complexity while allowing something 
> that at least works very well for one particular type of workload.

I think you underestimate the generality of XBZRLE (or maybe I'm 
overestimating it?).  It's not reasonable to ask users to match a 
compression algorithm to their workload; most times they won't be 
interacting with the host at all.  We need compression to be enabled at 
all time, turning itself off if it finds it isn't effective so it can 
consume less cpu.
Avi Kivity Aug. 8, 2011, 2:42 p.m. UTC | #11
On 08/08/2011 05:04 PM, Daniel P. Berrange wrote:
> My main concern with all these scenarios where libvirt touches the
> actual data stream though is that we're introducing extra data copies
> into the migration path which potentially waste CPU cycles.
> If QEMU can directly XBZRLE encode data into the FD passed via 'fd:'
> then we minimize data copies. Whether this is a big enough benefit
> to offset the burden of having to maintain various compression code
> options in QEMU I can't answer.
>

It's counterproductive to force an unneeded data copy in order to 
increase bandwidth.
Avi Kivity Aug. 8, 2011, 2:46 p.m. UTC | #12
On 08/08/2011 11:42 AM, Shribman, Aidan wrote:
> Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
> From: Aidan Shribman<aidan.shribman@sap.com>
>
> By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime
> and total live-migration time of VMs running memory write intensive workloads
> typical of large enterprise applications such as SAP ERP Systems, and generally
> speaking for any application with a sparse memory update pattern.
>
> On the sender side XBZRLE is used as a compact delta encoding of page updates,
> retrieving the old page content from an LRU cache (default size of 64 MB). The
> receiving side uses the existing page content and XBZRLE to decode the new page
> content.
>
> Work was originally based on research results published VEE 2011: Evaluation of
> Delta Compression Techniques for Efficient Live Migration of Large Virtual
> Machines by Benoit, Svard, Tordsson and Elmroth. Additionally the delta encoder
> XBRLE was improved further using XBZRLE instead.
>
> XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads making it
> ideal for in-line, real-time encoding such as is needed for live-migration.
>
> A typical usage scenario:
>      {qemu} migrate_set_cachesize 256m
>      {qemu} migrate -x -d tcp:destination.host:4444
>      {qemu} info migrate
>      ...
>      transferred ram-duplicate: A kbytes
>      transferred ram-duplicate: B pages
>      transferred ram-normal: C kbytes
>      transferred ram-normal: D pages
>      transferred ram-xbrle: E kbytes
>      transferred ram-xbrle: F pages
>      overflow ram-xbrle: G pages
>      cache-hit ram-xbrle: H pages
>      cache-lookup ram-xbrle: J pages
>
> Testing: live migration with XBZRLE completed in 110 seconds, without live
> migration was not able to complete.
>
> A simple synthetic memory r/w load generator:
> ..    include<stdlib.h>
> ..    include<stdio.h>
> ..    int main()
> ..    {
> ..        char *buf = (char *) calloc(4096, 4096);
> ..        while (1) {
> ..            int i;
> ..            for (i = 0; i<  4096 * 4; i++) {
> ..                buf[i * 4096 / 4]++;
> ..            }
> ..            printf(".");
> ..        }
> ..    }
>
>


Please provide documentation in docs/ of the compression format.

IMO it should be disabled by default (with an option to disable it, via, 
sat, migrate-set-options, so we can migrate to older hosts).

The protocol should allow XBZRLE to turn itself off if it detects that 
it isn't effective.
Avi Kivity Aug. 8, 2011, 2:47 p.m. UTC | #13
On 08/08/2011 05:46 PM, Avi Kivity wrote:
>
> Please provide documentation in docs/ of the compression format.
>
> IMO it should be disabled by default (with an option to disable it, 
> via, sat, migrate-set-options, so we can migrate to older hosts).
>
> The protocol should allow XBZRLE to turn itself off if it detects that 
> it isn't effective.
>

IOW, this should be part of the standard migration protocol, not some 
side option that is enabled if the user remembers.  It should not be 
mutually exclusive with future migration extensions, including compression.
Stefan Hajnoczi Aug. 8, 2011, 2:56 p.m. UTC | #14
On Mon, Aug 8, 2011 at 3:47 PM, Avi Kivity <avi@redhat.com> wrote:
> On 08/08/2011 05:46 PM, Avi Kivity wrote:
>>
>> Please provide documentation in docs/ of the compression format.
>>
>> IMO it should be disabled by default (with an option to disable it, via,
>> sat, migrate-set-options, so we can migrate to older hosts).
>>
>> The protocol should allow XBZRLE to turn itself off if it detects that it
>> isn't effective.
>>
>
> IOW, this should be part of the standard migration protocol, not some side
> option that is enabled if the user remembers.  It should not be mutually
> exclusive with future migration extensions, including compression.

This is an attractive option.  With some polish maybe XBZRLE could be
integrated as a default option that does not degrade performance.
Adding features that require user configuration isn't worthwhile
because they won't be used or they'll be misused - let's not make QEMU
more complicated if it can be avoided.

If there is no way to make XBZRLE automatic then I think it should
live outside QEMU because it will be a niche feature that relatively
few will use but adds complexity to migration.

Stefan
Avi Kivity Aug. 8, 2011, 3:01 p.m. UTC | #15
On 08/08/2011 05:56 PM, Stefan Hajnoczi wrote:
> >  IOW, this should be part of the standard migration protocol, not some side
> >  option that is enabled if the user remembers.  It should not be mutually
> >  exclusive with future migration extensions, including compression.
>
> This is an attractive option.  With some polish maybe XBZRLE could be
> integrated as a default option that does not degrade performance.
> Adding features that require user configuration isn't worthwhile
> because they won't be used or they'll be misused - let's not make QEMU
> more complicated if it can be avoided.
>
> If there is no way to make XBZRLE automatic then I think it should
> live outside QEMU because it will be a niche feature that relatively
> few will use but adds complexity to migration.

Agree.  Aidan, can you provide impact numbers on non-XBZRLE favourable 
workloads (both throughput and cpu usage)?  What about turning itself 
off automatically if the hit rate is too low?
Anthony Liguori Aug. 8, 2011, 3:08 p.m. UTC | #16
On 08/08/2011 09:39 AM, Avi Kivity wrote:
>> The other option is to allow 1-off compression algorithms in the form
>> of plugins. I think in this case, plugins are a pretty good compromise
>> in terms of isolating complexity while allowing something that at
>> least works very well for one particular type of workload.
>
> I think you underestimate the generality of XBZRLE (or maybe I'm
> overestimating it?).

This is really my fundamental concern.  When it comes to something that 
we have to support for a very long time, no one should be estimating 
anything.  We should make these decisions based on an awful lot of 
analysis on a wide variety of workloads.

It's hard to do this in QEMU today because we don't have a module 
mechanism to make it easy for users to try out new things without fully 
committing to including something in the tree.

But I don't think that's the root of the problem I have.  I really am 
just extremely reluctant to commit to something that we have to support 
forever.

Thinking more about it though, I think there can be another 
solution--feature negotiation.

I view adding feature negotiation as a pre-requisite to adding any type 
of transport compression such as XBZRLE.  That will let us support 
migration to older QEMUs and also to eventually remove XBZRLE if we 
decide it doesn't make sense anymore.

Regards,

Anthony Liguori

> It's not reasonable to ask users to match a
> compression algorithm to their workload; most times they won't be
> interacting with the host at all. We need compression to be enabled at
> all time, turning itself off if it finds it isn't effective so it can
> consume less cpu.
>
Anthony Liguori Aug. 8, 2011, 3:10 p.m. UTC | #17
On 08/08/2011 09:47 AM, Avi Kivity wrote:
> On 08/08/2011 05:46 PM, Avi Kivity wrote:
>>
>> Please provide documentation in docs/ of the compression format.
>>
>> IMO it should be disabled by default (with an option to disable it,
>> via, sat, migrate-set-options, so we can migrate to older hosts).
>>
>> The protocol should allow XBZRLE to turn itself off if it detects that
>> it isn't effective.
>>
>
> IOW, this should be part of the standard migration protocol, not some
> side option that is enabled if the user remembers. It should not be
> mutually exclusive with future migration extensions, including compression.

Are you thinking of a static decision or a dynamic decision?

I think feature negotiation would address static decision making.  For 
dynamic decision making, you could look to something like the VNC 
protocol and how it encodes pixel data.  The flow looks something like:

1) All clients/servers must support raw encoding

2) Client presents list of support encodings

3) Server takes intersection of client supported encodings and server 
supported encodings.

4) Server can choose to encode updates using any encoding supported by 
client and server.

Regards,

Anthony Liguori
Avi Kivity Aug. 8, 2011, 3:15 p.m. UTC | #18
On 08/08/2011 06:10 PM, Anthony Liguori wrote:
> On 08/08/2011 09:47 AM, Avi Kivity wrote:
>> On 08/08/2011 05:46 PM, Avi Kivity wrote:
>>>
>>> Please provide documentation in docs/ of the compression format.
>>>
>>> IMO it should be disabled by default (with an option to disable it,
>>> via, sat, migrate-set-options, so we can migrate to older hosts).
>>>
>>> The protocol should allow XBZRLE to turn itself off if it detects that
>>> it isn't effective.
>>>
>>
>> IOW, this should be part of the standard migration protocol, not some
>> side option that is enabled if the user remembers. It should not be
>> mutually exclusive with future migration extensions, including 
>> compression.
>
> Are you thinking of a static decision or a dynamic decision?
>

Dynamic.  If the cache hit rate is too low, disable XBZRLE and eliminate 
the overhead of copying pages to the history buffer.

> I think feature negotiation would address static decision making.  For 
> dynamic decision making, you could look to something like the VNC 
> protocol and how it encodes pixel data.  The flow looks something like:
>
> 1) All clients/servers must support raw encoding
>
> 2) Client presents list of support encodings
>
> 3) Server takes intersection of client supported encodings and server 
> supported encodings.
>
> 4) Server can choose to encode updates using any encoding supported by 
> client and server.

Feature negotiation in the migration protocol itself would break exec: 
migration (and any existing single duplex proxies).

We can do a poor man's feature negotiation via capabilities, relying on 
management to disable features which don't exist on the other side.  It 
isn't pretty, but it's the best we can do at this point.

Real feature negotiation will likely have to wait until the next version 
of the migration protocol.
Anthony Liguori Aug. 8, 2011, 4:19 p.m. UTC | #19
On 08/08/2011 10:15 AM, Avi Kivity wrote:
> On 08/08/2011 06:10 PM, Anthony Liguori wrote:
>> On 08/08/2011 09:47 AM, Avi Kivity wrote:
>>> On 08/08/2011 05:46 PM, Avi Kivity wrote:
>>>>
>>>> Please provide documentation in docs/ of the compression format.
>>>>
>>>> IMO it should be disabled by default (with an option to disable it,
>>>> via, sat, migrate-set-options, so we can migrate to older hosts).
>>>>
>>>> The protocol should allow XBZRLE to turn itself off if it detects that
>>>> it isn't effective.
>>>>
>>>
>>> IOW, this should be part of the standard migration protocol, not some
>>> side option that is enabled if the user remembers. It should not be
>>> mutually exclusive with future migration extensions, including
>>> compression.
>>
>> Are you thinking of a static decision or a dynamic decision?
>>
>
> Dynamic. If the cache hit rate is too low, disable XBZRLE and eliminate
> the overhead of copying pages to the history buffer.
>
>> I think feature negotiation would address static decision making. For
>> dynamic decision making, you could look to something like the VNC
>> protocol and how it encodes pixel data. The flow looks something like:
>>
>> 1) All clients/servers must support raw encoding
>>
>> 2) Client presents list of support encodings
>>
>> 3) Server takes intersection of client supported encodings and server
>> supported encodings.
>>
>> 4) Server can choose to encode updates using any encoding supported by
>> client and server.
>
> Feature negotiation in the migration protocol itself would break exec:
> migration (and any existing single duplex proxies).
>
> We can do a poor man's feature negotiation via capabilities, relying on
> management to disable features which don't exist on the other side. It
> isn't pretty, but it's the best we can do at this point.

I think the above can be done via capabilities too fwiw.  In this case, 
the source and destination advertise the compression formats they 
support, and the management tool takes the interaction and sets the 
capability mask on the source and destination appropriately.

There's no need for a full duplex protocol because the source just sends 
compressed data in whatever format it thinks is appropriate at any given 
point in time.

> Real feature negotiation will likely have to wait until the next version
> of the migration protocol.

Since we're talking about moving to ASN.1 for 1.0, I think we should 
think also include memory compression and wait until we rev the protocol 
before introducing any type of compression.

Regards,

Anthony Liguori
Avi Kivity Aug. 8, 2011, 4:53 p.m. UTC | #20
On 08/08/2011 07:19 PM, Anthony Liguori wrote:
>
>> Real feature negotiation will likely have to wait until the next version
>> of the migration protocol.
>
> Since we're talking about moving to ASN.1 for 1.0, I think we should 
> think also include memory compression and wait until we rev the 
> protocol before introducing any type of compression.

Is anyone actually working on this?
Anthony Liguori Aug. 8, 2011, 4:55 p.m. UTC | #21
On 08/08/2011 11:53 AM, Avi Kivity wrote:
> On 08/08/2011 07:19 PM, Anthony Liguori wrote:
>>
>>> Real feature negotiation will likely have to wait until the next version
>>> of the migration protocol.
>>
>> Since we're talking about moving to ASN.1 for 1.0, I think we should
>> think also include memory compression and wait until we rev the
>> protocol before introducing any type of compression.
>
> Is anyone actually working on this?

Yes.

Regards,

Anthony Liguori
Shribman, Aidan Aug. 10, 2011, 3:07 p.m. UTC | #22
-----Original Message-----
From: Anthony Liguori [mailto:anthony@codemonkey.ws] 
Sent: Monday, August 08, 2011 7:56 PM
To: Avi Kivity
Cc: Blue Swirl; Stefan Hajnoczi; Shribman, Aidan; qemu-devel Developers
Subject: Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

On 08/08/2011 11:53 AM, Avi Kivity wrote:
> On 08/08/2011 07:19 PM, Anthony Liguori wrote:
>>
>>> Real feature negotiation will likely have to wait until the next version
>>> of the migration protocol.
>>
>> Since we're talking about moving to ASN.1 for 1.0, I think we should
>> think also include memory compression and wait until we rev the
>> protocol before introducing any type of compression.
>
> Is anyone actually working on this?

XBZRLE will very rarely (if at all) degrade live-migration as it runs at ~2 GB/s or 16 Gbps. Additionally XBZRLE could get even faster by using 128bit registers instead of the 64bit registers used currently. IMO XBZRLE could safely be used by default exposing capabilities by Qemu such that higher level management would handle static negotiation (as suggested).

Given that XBZRLE will seldom fail due to inflated encoded output (an example for such a case -> dirty the new page every 2nd 64bit word: the word-wise Xor would give 0x0y0z... ZRLE would future encode as 01x01y01z... a +50% increase), I see little incentive in automatic XBZRLE disablement.

As to implementing XBZRLE delta compression as a compression plug-in - this is not that straight forward as it has some interesting interplay with DUP packat's which are crucial for performance, specifically a page consisting of only zero's is LRU cached as reference without the standard qemu_malloc()/memcpy() done in other cases. This is especially important for eliminating slowdown during live-migration initiation.

As to waiting for ASN.1 capability - I can see this will make parsing of live-migration messages much more reliable (ensuring that Qemu is able to detect an incorrect protocol version) but I can't say I am very happy waiting for 1.0 - are there any alternatives?

Aidan
Avi Kivity Aug. 10, 2011, 3:12 p.m. UTC | #23
On 08/10/2011 06:07 PM, Shribman, Aidan wrote:
> XBZRLE will very rarely (if at all) degrade live-migration as it runs at ~2 GB/s or 16 Gbps. Additionally XBZRLE could get even faster by using 128bit registers instead of the 64bit registers used currently. IMO XBZRLE could safely be used by default exposing capabilities by Qemu such that higher level management would handle static negotiation (as suggested).
>
> Given that XBZRLE will seldom fail due to inflated encoded output (an example for such a case ->  dirty the new page every 2nd 64bit word: the word-wise Xor would give 0x0y0z... ZRLE would future encode as 01x01y01z... a +50% increase), I see little incentive in automatic XBZRLE disablement.

My concern is not reduced migration bandwidth or inflated image size, 
but increased cpu use for copying pages to the cache and xoring them.

> As to implementing XBZRLE delta compression as a compression plug-in - this is not that straight forward as it has some interesting interplay with DUP packat's which are crucial for performance, specifically a page consisting of only zero's is LRU cached as reference without the standard qemu_malloc()/memcpy() done in other cases. This is especially important for eliminating slowdown during live-migration initiation.

I agree, it should be on-by-default and in the main code base.  Please 
provide numbers to justify this on non-artificial workloads, and on 
artificial worst-case workloads.

> As to waiting for ASN.1 capability - I can see this will make parsing of live-migration messages much more reliable (ensuring that Qemu is able to detect an incorrect protocol version) but I can't say I am very happy waiting for 1.0 - are there any alternatives?
>

I don't think we should couple the two features together.
Anthony Liguori Aug. 10, 2011, 3:58 p.m. UTC | #24
On 08/10/2011 10:12 AM, Avi Kivity wrote:
> On 08/10/2011 06:07 PM, Shribman, Aidan wrote:
>> XBZRLE will very rarely (if at all) degrade live-migration as it runs
>> at ~2 GB/s or 16 Gbps. Additionally XBZRLE could get even faster by
>> using 128bit registers instead of the 64bit registers used currently.
>> IMO XBZRLE could safely be used by default exposing capabilities by
>> Qemu such that higher level management would handle static negotiation
>> (as suggested).
>>
>> Given that XBZRLE will seldom fail due to inflated encoded output (an
>> example for such a case -> dirty the new page every 2nd 64bit word:
>> the word-wise Xor would give 0x0y0z... ZRLE would future encode as
>> 01x01y01z... a +50% increase), I see little incentive in automatic
>> XBZRLE disablement.
>
> My concern is not reduced migration bandwidth or inflated image size,
> but increased cpu use for copying pages to the cache and xoring them.
>
>> As to implementing XBZRLE delta compression as a compression plug-in -
>> this is not that straight forward as it has some interesting interplay
>> with DUP packat's which are crucial for performance, specifically a
>> page consisting of only zero's is LRU cached as reference without the
>> standard qemu_malloc()/memcpy() done in other cases. This is
>> especially important for eliminating slowdown during live-migration
>> initiation.
>
> I agree, it should be on-by-default and in the main code base. Please
> provide numbers to justify this on non-artificial workloads, and on
> artificial worst-case workloads.
>
>> As to waiting for ASN.1 capability - I can see this will make parsing
>> of live-migration messages much more reliable (ensuring that Qemu is
>> able to detect an incorrect protocol version) but I can't say I am
>> very happy waiting for 1.0 - are there any alternatives?
>>
>
> I don't think we should couple the two features together.

ASN.1 is orthogonal to capabilities.

Capabilities are a hard requirement before merging any new type of 
compression algorithm IMO.

Regards,

Anthony Liguori

>
Avi Kivity Aug. 10, 2011, 4:08 p.m. UTC | #25
On 08/10/2011 06:58 PM, Anthony Liguori wrote:
>> I don't think we should couple the two features together.
>
>
> ASN.1 is orthogonal to capabilities.
>
> Capabilities are a hard requirement before merging any new type of 
> compression algorithm IMO.

Right now we have capabilties in the form of -help output.

If -help says

   -no-xzbrle   disable xzbrle support

(or -migration-compression xzbrle=off, or something) that's sufficient 
for management tools.

We shouldn't block this feature just because some monitor facility is 
not yet implemented.
Anthony Liguori Aug. 10, 2011, 4:23 p.m. UTC | #26
On 08/10/2011 11:08 AM, Avi Kivity wrote:
> On 08/10/2011 06:58 PM, Anthony Liguori wrote:
>>> I don't think we should couple the two features together.
>>
>>
>> ASN.1 is orthogonal to capabilities.
>>
>> Capabilities are a hard requirement before merging any new type of
>> compression algorithm IMO.
>
> Right now we have capabilties in the form of -help output.
>
> If -help says
>
> -no-xzbrle disable xzbrle support
>
> (or -migration-compression xzbrle=off, or something) that's sufficient
> for management tools.

This is static, not dynamic.  You may attempt to migrate to another host 
that supports it and then migrate to a second host that doesn't support 
it after the first migration fails.

>
> We shouldn't block this feature just because some monitor facility is
> not yet implemented.

We shouldn't make *any* changes to the migration protocol before we have 
a feature negotiation capability.  I only want to do a hard break of the 
protocol once.

Regards,

Anthony Liguori
Avi Kivity Aug. 10, 2011, 4:40 p.m. UTC | #27
On 08/10/2011 07:23 PM, Anthony Liguori wrote:
>> Right now we have capabilties in the form of -help output.
>>
>> If -help says
>>
>> -no-xzbrle disable xzbrle support
>>
>> (or -migration-compression xzbrle=off, or something) that's sufficient
>> for management tools.
>
>
> This is static, not dynamic.  You may attempt to migrate to another 
> host that supports it and then migrate to a second host that doesn't 
> support it after the first migration fails.

This may be acceptable, wait until the entire migration cluster is 
xzbrle capable before enabling it.  If not, add a monitor command.

>
>>
>> We shouldn't block this feature just because some monitor facility is
>> not yet implemented.
>
> We shouldn't make *any* changes to the migration protocol before we 
> have a feature negotiation capability.  I only want to do a hard break 
> of the protocol once.

Didn't we agree that management tool mediated feature negotiation (that 
is, outside the migration protocol itself) is acceptable?
Anthony Liguori Aug. 10, 2011, 7:27 p.m. UTC | #28
On 08/10/2011 11:40 AM, Avi Kivity wrote:
> On 08/10/2011 07:23 PM, Anthony Liguori wrote:
>>> Right now we have capabilties in the form of -help output.
>>>
>>> If -help says
>>>
>>> -no-xzbrle disable xzbrle support
>>>
>>> (or -migration-compression xzbrle=off, or something) that's sufficient
>>> for management tools.
>>
>>
>> This is static, not dynamic. You may attempt to migrate to another
>> host that supports it and then migrate to a second host that doesn't
>> support it after the first migration fails.
>
> This may be acceptable, wait until the entire migration cluster is
> xzbrle capable before enabling it. If not, add a monitor command.

1) xzbrle needs to be disabled by default.  That way management tools 
don't unknowingly enable it by not passing -no-xzbrle.

2) there needs to be a mechanism for the management tool to query 
whether qemu supports xzbrle.

3) a management tool should be able to query the source and destination, 
and then enable xzbrle if both sides support it.

You can argue that (3) could be static.  A command could be added to 
toggle it dynamically through the monitor.

But no matter what, someone has to touch libvirt and any other tool that 
works with QEMU to make this thing work.  But this is a general problem. 
  Any optional change to the migration protocol has exactly the same 
characteristics whether it's XZBRLE, XZBRLE v2 (if there is a v2), 
ASN.1, or any other form of compression that rolls around.

Instead of teaching management tools how to deal with all of these 
things, let's just fix this problem once.  It just takes:

a) A query-migration-caps command that returns a dict with two lists of 
strings.  Something like:

{ 'execute': 'query-migration-caps' }
{ 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }

b) A set-migration-caps command that takes a list of strings.  It simply 
takes the intersection of the capabilities set with the argument and 
sets the current set to the result.  Something like:

{ 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ] }}
{ 'return' : {} }

c) An internal interface to register a capability and an internal 
interface to check if a capability is currently enabled.  The xzbrle 
code just needs to disable itself if the capability isn't set.

Then we teach libvirt (and other tools) to query the caps list on the 
source, set the destination, query the current set on the destination, 
and then set that set on the source.

As we introduce new things, like the next great compression protocol, or 
ASN.1, we don't need to touch libvirt again.  libvirt can still know 
about the caps and selectively override QEMU if it's so inclined but it 
prevents us from reinventing the same mechanisms over and over again.

>>> We shouldn't block this feature just because some monitor facility is
>>> not yet implemented.
>>
>> We shouldn't make *any* changes to the migration protocol before we
>> have a feature negotiation capability. I only want to do a hard break
>> of the protocol once.
>
> Didn't we agree that management tool mediated feature negotiation (that
> is, outside the migration protocol itself) is acceptable?

Yes.  But that negotiation needs to become part of the "protocol" for 
migration.  In the absence of that negotiation, we need to use the wire 
protocol we use today.  We cannot have ad-hoc feature negotiation for 
every change we make to the wire protocol.

Regards,

Anthony Liguori

>
Shribman, Aidan Aug. 11, 2011, 8:03 a.m. UTC | #29
> From: Anthony Liguori [mailto:anthony@codemonkey.ws]
> Sent: Wednesday, August 10, 2011 10:28 PM
> To: Avi Kivity
> Cc: Blue Swirl; Stefan Hajnoczi; Shribman, Aidan; qemu-devel
> Developers; libvir-list@redhat.com
> Subject: Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of
> large memory apps

> a) A query-migration-caps command that returns a dict with two lists of
> strings.  Something like:
> 
> { 'execute': 'query-migration-caps' }
> { 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }
> 
> b) A set-migration-caps command that takes a list of strings.  It
> simply
> takes the intersection of the capabilities set with the argument and
> sets the current set to the result.  Something like:
> 
> { 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ]
> }}
> { 'return' : {} }

We may want to further sub-divide capabilities into categories:
{ 'execute': 'query-migration-caps' }
{ 'return' : 
  { 'encoding' : { 'current', 'asn.1', 'proto2', 'thrift', etc. } },
  { 'delta' : { 'xbzrle', "xdelta", ...} },
  { 'compression' : { 'snappy', 'lzo' } } }
This would help libvirt/management to select features automatically or manually (via UI) without having to 'understand' the any given capability meaning.

> Yes.  But that negotiation needs to become part of the "protocol" for
> migration.  In the absence of that negotiation, we need to use the wire
> protocol we use today.  We cannot have ad-hoc feature negotiation for
> every change we make to the wire protocol.

Agreed. Therefore caps plus xbzrle could be added before ASN.1/v1.0 without breaking anything as long as when 'set-migration-caps' is not issued Qemu uses the current protocol.

Aidan
Avi Kivity Aug. 11, 2011, 8:17 a.m. UTC | #30
On 08/10/2011 10:27 PM, Anthony Liguori wrote:
>> This may be acceptable, wait until the entire migration cluster is
>> xzbrle capable before enabling it. If not, add a monitor command.
>
>
> 1) xzbrle needs to be disabled by default.  That way management tools 
> don't unknowingly enable it by not passing -no-xzbrle.

We could hook it to -M, though it's a bit gross.  Otherwise we need to 
document this clearly in the management tool author's guide.

>
> 3) a management tool should be able to query the source and 
> destination, and then enable xzbrle if both sides support it.
>
> You can argue that (3) could be static.  A command could be added to 
> toggle it dynamically through the monitor.
>
> But no matter what, someone has to touch libvirt and any other tool 
> that works with QEMU to make this thing work.  But this is a general 
> problem.  Any optional change to the migration protocol has exactly 
> the same characteristics whether it's XZBRLE, XZBRLE v2 (if there is a 
> v2), ASN.1, or any other form of compression that rolls around.

If we have two-way communication we can do this transparently in the 
protocol itself.

>
> Instead of teaching management tools how to deal with all of these 
> things, let's just fix this problem once.  It just takes:
>
> a) A query-migration-caps command that returns a dict with two lists 
> of strings.  Something like:
>
> { 'execute': 'query-migration-caps' }
> { 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }
>
> b) A set-migration-caps command that takes a list of strings.  It 
> simply takes the intersection of the capabilities set with the 
> argument and sets the current set to the result.  Something like:
>
> { 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ] }}
> { 'return' : {} }
>
> c) An internal interface to register a capability and an internal 
> interface to check if a capability is currently enabled.  The xzbrle 
> code just needs to disable itself if the capability isn't set.
>
> Then we teach libvirt (and other tools) to query the caps list on the 
> source, set the destination, query the current set on the destination, 
> and then set that set on the source.

This is only if the capability has no side effect.

>
> As we introduce new things, like the next great compression protocol, 
> or ASN.1, we don't need to touch libvirt again.  libvirt can still 
> know about the caps and selectively override QEMU if it's so inclined 
> but it prevents us from reinventing the same mechanisms over and over 
> again.

Right.

>
> Yes.  But that negotiation needs to become part of the "protocol" for 
> migration.  In the absence of that negotiation, we need to use the 
> wire protocol we use today.  We cannot have ad-hoc feature negotiation 
> for every change we make to the wire protocol.

Okay, as long as we have someone willing to implement it.
Daniel P. Berrangé Aug. 11, 2011, 9:16 a.m. UTC | #31
On Thu, Aug 11, 2011 at 11:17:09AM +0300, Avi Kivity wrote:
> On 08/10/2011 10:27 PM, Anthony Liguori wrote:
> >>This may be acceptable, wait until the entire migration cluster is
> >>xzbrle capable before enabling it. If not, add a monitor command.
> >
> >
> >1) xzbrle needs to be disabled by default.  That way management
> >tools don't unknowingly enable it by not passing -no-xzbrle.
> 
> We could hook it to -M, though it's a bit gross.

That would needlessly prevent its use for any existing installed
guests with a older machine type, which are running in a new QEMU
Some kind of monitor capabilities seems good to me.

Daniel
Avi Kivity Aug. 11, 2011, 9:20 a.m. UTC | #32
On 08/11/2011 12:16 PM, Daniel P. Berrange wrote:
> On Thu, Aug 11, 2011 at 11:17:09AM +0300, Avi Kivity wrote:
> >  On 08/10/2011 10:27 PM, Anthony Liguori wrote:
> >  >>This may be acceptable, wait until the entire migration cluster is
> >  >>xzbrle capable before enabling it. If not, add a monitor command.
> >  >
> >  >
> >  >1) xzbrle needs to be disabled by default.  That way management
> >  >tools don't unknowingly enable it by not passing -no-xzbrle.
> >
> >  We could hook it to -M, though it's a bit gross.
>
> That would needlessly prevent its use for any existing installed
> guests with a older machine type, which are running in a new QEMU

You could still enable it explicitly; I'm just trying to get it to be 
enabled by default.

> Some kind of monitor capabilities seems good to me.
>

Live migration is probably mostly done in managed environments, so I 
think you're right.
Daniel P. Berrangé Aug. 11, 2011, 9:24 a.m. UTC | #33
On Wed, Aug 10, 2011 at 02:27:41PM -0500, Anthony Liguori wrote:
> On 08/10/2011 11:40 AM, Avi Kivity wrote:
> Instead of teaching management tools how to deal with all of these
> things, let's just fix this problem once.  It just takes:
> 
> a) A query-migration-caps command that returns a dict with two lists
> of strings.  Something like:
> 
> { 'execute': 'query-migration-caps' }
> { 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }
> 
> b) A set-migration-caps command that takes a list of strings.  It
> simply takes the intersection of the capabilities set with the
> argument and sets the current set to the result.  Something like:
> 
> { 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ] }}
> { 'return' : {} }

We have a number of commands to set migration parameters (bandwidth,
max downtime, etc). One thing that has always troubled me with this,
is that they are a global setting for QEMU, not simply the next
migration command. This is fine if you expect that only a single
'migrate' command will ever be invoked during the lifetime of QEMU,
but this doesn't hold true when we use 'migrate' as a means to
implement saving of snapshots/save-to-file, or "core" dumps.

For example, when libvirt does 'save to file', we want to set the
max bandwidth to unlimited for that, but we don't want that 'unlimited'
setting to apply to a future cross-host migration attempt. This means
we have to send three commands 

   migrate_set_speed 9223372036854775808
   migrate file:/some/file
   migrate_set_speed 33554432

it doubly sucks because there is no way to reset the migration speed
to the QEMU default, so we have to hardcoded (32 << 20) which is what
QEMU currently uses.

It would be more desirable if we could simply pass in the desired
speed, compression algorithm, max downtime, etc, as parameters to
the 'migrate' command.  And then have 'migrate_set_speed' only
affect the current active migration, not any future ones.

So a 'query-migration-caps' command is nice, but I think having a
set-migration-caps command is wrong. There should just be a 'caps'
parameter for 'migrate'.

Regards,
Daniel
Anthony Liguori Aug. 11, 2011, 1 p.m. UTC | #34
On 08/11/2011 03:03 AM, Shribman, Aidan wrote:
>> From: Anthony Liguori [mailto:anthony@codemonkey.ws]
>> Sent: Wednesday, August 10, 2011 10:28 PM
>> To: Avi Kivity
>> Cc: Blue Swirl; Stefan Hajnoczi; Shribman, Aidan; qemu-devel
>> Developers; libvir-list@redhat.com
>> Subject: Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of
>> large memory apps
>
>> a) A query-migration-caps command that returns a dict with two lists of
>> strings.  Something like:
>>
>> { 'execute': 'query-migration-caps' }
>> { 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }
>>
>> b) A set-migration-caps command that takes a list of strings.  It
>> simply
>> takes the intersection of the capabilities set with the argument and
>> sets the current set to the result.  Something like:
>>
>> { 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ]
>> }}
>> { 'return' : {} }
>
> We may want to further sub-divide capabilities into categories:
> { 'execute': 'query-migration-caps' }
> { 'return' :
>    { 'encoding' : { 'current', 'asn.1', 'proto2', 'thrift', etc. } },
>    { 'delta' : { 'xbzrle', "xdelta", ...} },
>    { 'compression' : { 'snappy', 'lzo' } } }
> This would help libvirt/management to select features automatically or manually (via UI) without having to 'understand' the any given capability meaning.

I would prefer caps to be mostly transparent to libvirt.  In fact, I'd 
like to see exactly three caps: xbzrle, asn1, and autonegotiate.

I'd like to move the caps negotation into the protocol itself.

>> Yes.  But that negotiation needs to become part of the "protocol" for
>> migration.  In the absence of that negotiation, we need to use the wire
>> protocol we use today.  We cannot have ad-hoc feature negotiation for
>> every change we make to the wire protocol.
>
> Agreed. Therefore caps plus xbzrle could be added before ASN.1/v1.0 without breaking anything as long as when 'set-migration-caps' is not issued Qemu uses the current protocol.

Exactly.

Regards,

Anthony Liguori

> Aidan
>
Anthony Liguori Aug. 11, 2011, 1:03 p.m. UTC | #35
On 08/11/2011 03:17 AM, Avi Kivity wrote:
>> 3) a management tool should be able to query the source and
>> destination, and then enable xzbrle if both sides support it.
>>
>> You can argue that (3) could be static. A command could be added to
>> toggle it dynamically through the monitor.
>>
>> But no matter what, someone has to touch libvirt and any other tool
>> that works with QEMU to make this thing work. But this is a general
>> problem. Any optional change to the migration protocol has exactly the
>> same characteristics whether it's XZBRLE, XZBRLE v2 (if there is a
>> v2), ASN.1, or any other form of compression that rolls around.
>
> If we have two-way communication we can do this transparently in the
> protocol itself.

Yes.  This should be one of the initial caps to introduce.

>> Instead of teaching management tools how to deal with all of these
>> things, let's just fix this problem once. It just takes:
>>
>> a) A query-migration-caps command that returns a dict with two lists
>> of strings. Something like:
>>
>> { 'execute': 'query-migration-caps' }
>> { 'return' : { 'capabilities': [ 'xbzrle' ], 'current': [] } }
>>
>> b) A set-migration-caps command that takes a list of strings. It
>> simply takes the intersection of the capabilities set with the
>> argument and sets the current set to the result. Something like:
>>
>> { 'execute': 'set-migration-caps', 'arguments': { 'set': [ 'xbzrle' ] }}
>> { 'return' : {} }
>>
>> c) An internal interface to register a capability and an internal
>> interface to check if a capability is currently enabled. The xzbrle
>> code just needs to disable itself if the capability isn't set.
>>
>> Then we teach libvirt (and other tools) to query the caps list on the
>> source, set the destination, query the current set on the destination,
>> and then set that set on the source.
>
> This is only if the capability has no side effect.

Right, it can't change the output of any monitor commands or anything 
like that.  It's strictly about the encoding of the wire protocol which 
ought to be transparent to libvirt.

>> As we introduce new things, like the next great compression protocol,
>> or ASN.1, we don't need to touch libvirt again. libvirt can still know
>> about the caps and selectively override QEMU if it's so inclined but
>> it prevents us from reinventing the same mechanisms over and over again.
>
> Right.
>
>>
>> Yes. But that negotiation needs to become part of the "protocol" for
>> migration. In the absence of that negotiation, we need to use the wire
>> protocol we use today. We cannot have ad-hoc feature negotiation for
>> every change we make to the wire protocol.
>
> Okay, as long as we have someone willing to implement it.

Sounds like a good hackathon project :-)

Regards,

Anthony Liguori
diff mbox

Patch

diff --git a/Makefile.target b/Makefile.target
index 2800f47..b3215de 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -186,6 +186,7 @@  endif #CONFIG_BSD_USER
 ifdef CONFIG_SOFTMMU

 obj-y = arch_init.o cpus.o monitor.o machine.o gdbstub.o balloon.o
+obj-y += lru.o xbzrle.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
 obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-serial-bus.o
diff --git a/arch_init.c b/arch_init.c
old mode 100644
new mode 100755
index 4486925..d67dc82
--- a/arch_init.c
+++ b/arch_init.c
@@ -40,6 +40,17 @@ 
 #include "net.h"
 #include "gdbstub.h"
 #include "hw/smbios.h"
+#include "lru.h"
+#include "xbzrle.h"
+
+//#define DEBUG_ARCH_INIT
+#ifdef DEBUG_ARCH_INIT
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stdout, "arch_init: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif

 #ifdef TARGET_SPARC
 int graphic_width = 1024;
@@ -88,6 +99,161 @@  const uint32_t arch_type = QEMU_ARCH;
 #define RAM_SAVE_FLAG_PAGE     0x08
 #define RAM_SAVE_FLAG_EOS      0x10
 #define RAM_SAVE_FLAG_CONTINUE 0x20
+#define RAM_SAVE_FLAG_XBZRLE    0x40
+
+/***********************************************************/
+/* RAM Migration State */
+typedef struct ArchMigrationState {
+    int use_xbrle;
+    int64_t xbrle_cache_size;
+} ArchMigrationState;
+
+static ArchMigrationState arch_mig_state;
+
+void arch_set_params(int blk_enable, int shared_base, int use_xbrle,
+        int64_t xbrle_cache_size, void *opaque)
+{
+    arch_mig_state.use_xbrle = use_xbrle;
+    arch_mig_state.xbrle_cache_size = xbrle_cache_size;
+}
+
+#define BE16_MAGIC 0x0123
+
+/***********************************************************/
+/* XBZRLE (Xor Binary Zero Run-Length Encoding) */
+typedef struct XBZRLEHeader {
+    uint32_t xh_cksum; /* not used */
+    uint16_t xh_magic;
+    uint16_t xh_len;
+    uint8_t xh_flags;
+} XBZRLEHeader;
+
+static uint8_t dup_buf[TARGET_PAGE_SIZE];
+
+/***********************************************************/
+/* accounting */
+typedef struct AccountingInfo{
+    uint64_t dup_pages;
+    uint64_t norm_pages;
+    uint64_t xbrle_bytes;
+    uint64_t xbrle_pages;
+    uint64_t xbrle_overflow;
+    uint64_t xbrle_cache_lookup;
+    uint64_t xbrle_cache_hit;
+    uint64_t iterations;
+} AccountingInfo;
+
+static AccountingInfo acct_info;
+
+static void acct_clear(void)
+{
+    memset(&acct_info, 0, sizeof(acct_info));
+}
+
+uint64_t dup_mig_bytes_transferred(void)
+{
+    return acct_info.dup_pages;
+}
+
+uint64_t dup_mig_pages_transferred(void)
+{
+    return acct_info.dup_pages;
+}
+
+uint64_t norm_mig_bytes_transferred(void)
+{
+    return acct_info.norm_pages * TARGET_PAGE_SIZE;
+}
+
+uint64_t norm_mig_pages_transferred(void)
+{
+    return acct_info.norm_pages;
+}
+
+uint64_t xbrle_mig_bytes_transferred(void)
+{
+    return acct_info.xbrle_bytes;
+}
+
+uint64_t xbrle_mig_pages_transferred(void)
+{
+    return acct_info.xbrle_pages;
+}
+
+uint64_t xbrle_mig_pages_overflow(void)
+{
+    return acct_info.xbrle_overflow;
+}
+
+uint64_t xbrle_mig_pages_cache_hit(void)
+{
+    return acct_info.xbrle_cache_hit;
+}
+
+uint64_t xbrle_mig_pages_cache_lookup(void)
+{
+    return acct_info.xbrle_cache_lookup;
+}
+
+static void save_block_hdr(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
+        int cont, int flag)
+{
+        qemu_put_be64(f, offset | cont | flag);
+        if (!cont) {
+                qemu_put_byte(f, strlen(block->idstr));
+                qemu_put_buffer(f, (uint8_t *)block->idstr,
+                                strlen(block->idstr));
+        }
+}
+
+#define ENCODING_FLAG_XBZRLE 0x1
+
+static int save_xbrle_page(QEMUFile *f, uint8_t *current_page,
+        ram_addr_t current_addr, RAMBlock *block, ram_addr_t offset, int cont)
+{
+    int encoded_len = 0, bytes_sent = 0;
+    XBZRLEHeader hdr = {0, BE16_MAGIC};
+    uint8_t *encoded, *old_page;
+
+    /* abort if page not cached */
+    acct_info.xbrle_cache_lookup++;
+    old_page = lru_lookup(current_addr);
+    if (!old_page) {
+        goto done;
+    }
+    acct_info.xbrle_cache_hit++;
+
+    /* XBZRLE (XOR+ZRLE) encoding */
+    encoded = (uint8_t *) qemu_malloc(TARGET_PAGE_SIZE);
+    encoded_len = xbzrle_encode(encoded, old_page, current_page,
+            TARGET_PAGE_SIZE);
+
+    if (encoded_len < 0) {
+        DPRINTF("XBZRLE encoding overflow - sending uncompressed\n");
+        acct_info.xbrle_overflow++;
+        goto done;
+    }
+
+    hdr.xh_len = encoded_len;
+    hdr.xh_flags |= ENCODING_FLAG_XBZRLE;
+
+    /* Send XBZRLE compressed page */
+    save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_XBZRLE);
+
+    qemu_put_be32(f, hdr.xh_cksum);
+    qemu_put_buffer(f, (uint8_t *) &hdr.xh_magic, sizeof (hdr.xh_magic));
+    qemu_put_be16(f, hdr.xh_len);
+    qemu_put_byte(f, hdr.xh_flags);
+
+    qemu_put_buffer(f, encoded, encoded_len);
+    acct_info.xbrle_pages++;
+    bytes_sent = encoded_len + sizeof(hdr);
+    acct_info.xbrle_bytes += bytes_sent;
+
+done:
+    qemu_free(encoded);
+    return bytes_sent;
+}

 static int is_dup_page(uint8_t *page, uint8_t ch)
 {
@@ -107,7 +273,7 @@  static int is_dup_page(uint8_t *page, uint8_t ch)
 static RAMBlock *last_block;
 static ram_addr_t last_offset;

-static int ram_save_block(QEMUFile *f)
+static int ram_save_block(QEMUFile *f, int stage)
 {
     RAMBlock *block = last_block;
     ram_addr_t offset = last_offset;
@@ -120,6 +286,7 @@  static int ram_save_block(QEMUFile *f)
     current_addr = block->offset + offset;

     do {
+        lru_free_cb_t free_cb = qemu_free;
         if (cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) {
             uint8_t *p;
             int cont = (block == last_block) ? RAM_SAVE_FLAG_CONTINUE : 0;
@@ -128,28 +295,35 @@  static int ram_save_block(QEMUFile *f)
                                             current_addr + TARGET_PAGE_SIZE,
                                             MIGRATION_DIRTY_FLAG);

-            p = block->host + offset;
+            if (arch_mig_state.use_xbrle) {
+                p = qemu_malloc(TARGET_PAGE_SIZE);
+                memcpy(p, block->host + offset, TARGET_PAGE_SIZE);
+            } else {
+                p = block->host + offset;
+            }

             if (is_dup_page(p, *p)) {
-                qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_COMPRESS);
-                if (!cont) {
-                    qemu_put_byte(f, strlen(block->idstr));
-                    qemu_put_buffer(f, (uint8_t *)block->idstr,
-                                    strlen(block->idstr));
-                }
+                save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_COMPRESS);
                 qemu_put_byte(f, *p);
                 bytes_sent = 1;
-            } else {
-                qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_PAGE);
-                if (!cont) {
-                    qemu_put_byte(f, strlen(block->idstr));
-                    qemu_put_buffer(f, (uint8_t *)block->idstr,
-                                    strlen(block->idstr));
+                acct_info.dup_pages++;
+                if (arch_mig_state.use_xbrle && !*p) {
+                    p = dup_buf;
+                    free_cb = NULL;
                 }
+            } else if (stage == 2 && arch_mig_state.use_xbrle) {
+                bytes_sent = save_xbrle_page(f, p, current_addr, block,
+                    offset, cont);
+            }
+            if (!bytes_sent) {
+                save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE);
                 qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
                 bytes_sent = TARGET_PAGE_SIZE;
+                acct_info.norm_pages++;
+            }
+            if (arch_mig_state.use_xbrle) {
+                lru_insert(current_addr, p, free_cb);
             }
-
             break;
         }

@@ -221,6 +395,9 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)

     if (stage < 0) {
         cpu_physical_memory_set_dirty_tracking(0);
+        if (arch_mig_state.use_xbrle) {
+            lru_fini();
+        }
         return 0;
     }

@@ -235,6 +412,11 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         last_block = NULL;
         last_offset = 0;

+        if (arch_mig_state.use_xbrle) {
+            lru_init(arch_mig_state.xbrle_cache_size/TARGET_PAGE_SIZE, 0);
+            acct_clear();
+        }
+
         /* Make sure all dirty bits are set */
         QLIST_FOREACH(block, &ram_list.blocks, next) {
             for (addr = block->offset; addr < block->offset + block->length;
@@ -264,8 +446,9 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     while (!qemu_file_rate_limit(f)) {
         int bytes_sent;

-        bytes_sent = ram_save_block(f);
+        bytes_sent = ram_save_block(f, stage);
         bytes_transferred += bytes_sent;
+        acct_info.iterations++;
         if (bytes_sent == 0) { /* no more blocks */
             break;
         }
@@ -285,19 +468,79 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         int bytes_sent;

         /* flush all remaining blocks regardless of rate limiting */
-        while ((bytes_sent = ram_save_block(f)) != 0) {
+        while ((bytes_sent = ram_save_block(f, stage))) {
             bytes_transferred += bytes_sent;
         }
         cpu_physical_memory_set_dirty_tracking(0);
+        if (arch_mig_state.use_xbrle) {
+            lru_fini();
+        }
     }

     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);

     expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;

+    DPRINTF("ram_save_live: expected(%ld) <= max(%ld)?\n", expected_time,
+        migrate_max_downtime());
+
     return (stage == 2) && (expected_time <= migrate_max_downtime());
 }

+static int load_xbrle(QEMUFile *f, ram_addr_t addr, void *host)
+{
+    int len, rc = -1;
+    uint8_t *encoded;
+    XBZRLEHeader hdr = {0};
+
+    /* extract ZRLE header */
+    hdr.xh_cksum = qemu_get_be32(f);
+    qemu_get_buffer(f, (uint8_t *) &hdr.xh_magic, sizeof (hdr.xh_magic));
+    hdr.xh_len = qemu_get_be16(f);
+    hdr.xh_flags = qemu_get_byte(f);
+
+    if (!(hdr.xh_flags & ENCODING_FLAG_XBZRLE)) {
+        fprintf(stderr, "Failed to load XZBRLE page - wrong compression!\n");
+        goto done;
+    }
+
+    if (hdr.xh_len > TARGET_PAGE_SIZE) {
+        fprintf(stderr, "Failed to load XZBRLE page - len overflow!\n");
+        goto done;
+    }
+
+    /* load data and decode */
+    encoded = (uint8_t *) qemu_malloc(hdr.xh_len);
+    qemu_get_buffer(f, encoded, hdr.xh_len);
+    /* covert endianess if magic indicated destination differs from source */
+    if (hdr.xh_magic != BE16_MAGIC) {
+        const uint64_t *end = (uint64_t *) encoded +
+            hdr.xh_len / sizeof (uint64_t);
+        uint64_t *p;
+        for (p = (uint64_t *) encoded; p < end; p++) {
+            bswap64s(p);
+        }
+    }
+
+    /* decode ZRLE */
+    len = xbzrle_decode(host, host, encoded, hdr.xh_len);
+    if (len == -1) {
+        fprintf(stderr, "Failed to load XBZRLE page - decode error!\n");
+        goto done;
+    }
+
+    if (len != TARGET_PAGE_SIZE) {
+        fprintf(stderr, "Failed to load XBZRLE page - size %d expected %d!\n",
+            len, TARGET_PAGE_SIZE);
+        goto done;
+    }
+
+    rc = 0;
+done:
+    qemu_free(encoded);
+    return rc;
+}
+
 static inline void *host_from_stream_offset(QEMUFile *f,
                                             ram_addr_t offset,
                                             int flags)
@@ -328,16 +571,38 @@  static inline void *host_from_stream_offset(QEMUFile *f,
     return NULL;
 }

+static inline void *host_from_stream_offset_versioned(int version_id,
+        QEMUFile *f, ram_addr_t offset, int flags)
+{
+        void *host;
+        if (version_id == 3) {
+                host = qemu_get_ram_ptr(offset);
+        } else {
+                host = host_from_stream_offset(f, offset, flags);
+        }
+        if (!host) {
+            fprintf(stderr, "Failed to convert RAM address to host"
+                    " for offset 0x%lX!\n", offset);
+            abort();
+        }
+        return host;
+}
+
 int ram_load(QEMUFile *f, void *opaque, int version_id)
 {
     ram_addr_t addr;
-    int flags;
+    int flags, ret = 0;
+    static uint64_t seq_iter;
+
+    seq_iter++;

     if (version_id < 3 || version_id > 4) {
-        return -EINVAL;
+        ret = -EINVAL;
+        goto done;
     }

     do {
+        void *host;
         addr = qemu_get_be64(f);

         flags = addr & ~TARGET_PAGE_MASK;
@@ -346,7 +611,8 @@  int ram_load(QEMUFile *f, void *opaque, int version_id)
         if (flags & RAM_SAVE_FLAG_MEM_SIZE) {
             if (version_id == 3) {
                 if (addr != ram_bytes_total()) {
-                    return -EINVAL;
+                    ret = -EINVAL;
+                    goto done;
                 }
             } else {
                 /* Synchronize RAM block list */
@@ -365,8 +631,10 @@  int ram_load(QEMUFile *f, void *opaque, int version_id)

                     QLIST_FOREACH(block, &ram_list.blocks, next) {
                         if (!strncmp(id, block->idstr, sizeof(id))) {
-                            if (block->length != length)
-                                return -EINVAL;
+                            if (block->length != length) {
+                                ret = -EINVAL;
+                                goto done;
+                            }
                             break;
                         }
                     }
@@ -374,7 +642,8 @@  int ram_load(QEMUFile *f, void *opaque, int version_id)
                     if (!block) {
                         fprintf(stderr, "Unknown ramblock \"%s\", cannot "
                                 "accept migration\n", id);
-                        return -EINVAL;
+                        ret = -EINVAL;
+                        goto done;
                     }

                     total_ram_bytes -= length;
@@ -383,17 +652,10 @@  int ram_load(QEMUFile *f, void *opaque, int version_id)
         }

         if (flags & RAM_SAVE_FLAG_COMPRESS) {
-            void *host;
             uint8_t ch;

-            if (version_id == 3)
-                host = qemu_get_ram_ptr(addr);
-            else
-                host = host_from_stream_offset(f, addr, flags);
-            if (!host) {
-                return -EINVAL;
-            }
-
+            host = host_from_stream_offset_versioned(version_id,
+                            f, addr, flags);
             ch = qemu_get_byte(f);
             memset(host, ch, TARGET_PAGE_SIZE);
 #ifndef _WIN32
@@ -403,21 +665,28 @@  int ram_load(QEMUFile *f, void *opaque, int version_id)
             }
 #endif
         } else if (flags & RAM_SAVE_FLAG_PAGE) {
-            void *host;
-
-            if (version_id == 3)
-                host = qemu_get_ram_ptr(addr);
-            else
-                host = host_from_stream_offset(f, addr, flags);
-
+            host = host_from_stream_offset_versioned(version_id,
+                            f, addr, flags);
             qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
+        } else if (flags & RAM_SAVE_FLAG_XBZRLE) {
+            host = host_from_stream_offset_versioned(version_id,
+                            f, addr, flags);
+            if (load_xbrle(f, addr, host) < 0) {
+                ret = -EINVAL;
+                goto done;
+            }
         }
+
         if (qemu_file_has_error(f)) {
-            return -EIO;
+            ret = -EIO;
+            goto done;
         }
     } while (!(flags & RAM_SAVE_FLAG_EOS));

-    return 0;
+done:
+    DPRINTF("Completed load of VM with exit code %d seq iteration %ld\n",
+            ret, seq_iter);
+    return ret;
 }

 void qemu_service_io(void)
diff --git a/block-migration.c b/block-migration.c
index 3e66f49..504df70 100644
--- a/block-migration.c
+++ b/block-migration.c
@@ -689,7 +689,8 @@  static int block_load(QEMUFile *f, void *opaque, int version_id)
     return 0;
 }

-static void block_set_params(int blk_enable, int shared_base, void *opaque)
+static void block_set_params(int blk_enable, int shared_base,
+        int use_xbrle, int64_t xbrle_cache_size, void *opaque)
 {
     block_mig_state.blk_enable = blk_enable;
     block_mig_state.shared_base = shared_base;
diff --git a/hash.h b/hash.h
new file mode 100644
index 0000000..7109905
--- /dev/null
+++ b/hash.h
@@ -0,0 +1,72 @@ 
+#ifndef _LINUX_HASH_H
+#define _LINUX_HASH_H
+/* Fast hashing routine for ints,  longs and pointers.
+   (C) 2002 William Lee Irwin III, IBM */
+
+/*
+ * Knuth recommends primes in approximately golden ratio to the maximum
+ * integer representable by a machine word for multiplicative hashing.
+ * Chuck Lever verified the effectiveness of this technique:
+ * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ *
+ * These primes are chosen to be bit-sparse, that is operations on
+ * them can use shifts and additions instead of multiplications for
+ * machines where multiplications are slow.
+ */
+
+typedef uint64_t u64;
+typedef uint32_t u32;
+#define BITS_PER_LONG TARGET_LONG_BITS
+
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME_32 0x9e370001UL
+/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL
+
+#if BITS_PER_LONG == 32
+#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_32
+#define hash_long(val, bits) hash_32(val, bits)
+#elif BITS_PER_LONG == 64
+#define hash_long(val, bits) hash_64(val, bits)
+#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_64
+#else
+#error Wordsize not 32 or 64
+#endif
+
+static inline u64 hash_64(u64 val, unsigned int bits)
+{
+    u64 hash = val;
+
+    /*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
+    u64 n = hash;
+    n <<= 18;
+    hash -= n;
+    n <<= 33;
+    hash -= n;
+    n <<= 3;
+    hash += n;
+    n <<= 3;
+    hash -= n;
+    n <<= 4;
+    hash += n;
+    n <<= 2;
+    hash += n;
+
+    /* High bits are more random, so use them. */
+    return hash >> (64 - bits);
+}
+
+static inline u32 hash_32(u32 val, unsigned int bits)
+{
+    /* On some cpus multiply is faster, on others gcc will do shifts */
+    u32 hash = val * GOLDEN_RATIO_PRIME_32;
+
+    /* High bits are more random, so use them. */
+    return hash >> (32 - bits);
+}
+
+static inline unsigned long hash_ptr(void *ptr, unsigned int bits)
+{
+    return hash_long((unsigned long)ptr, bits);
+}
+#endif /* _LINUX_HASH_H */
diff --git a/hmp-commands.hx b/hmp-commands.hx
old mode 100644
new mode 100755
index e5585ba..e49d5be
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -717,24 +717,27 @@  ETEXI

     {
         .name       = "migrate",
-        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
-        .params     = "[-d] [-b] [-i] uri",
-        .help       = "migrate to URI (using -d to not wait for completion)"
-                     "\n\t\t\t -b for migration without shared storage with"
-                     " full copy of disk\n\t\t\t -i for migration without "
-                     "shared storage with incremental copy of disk "
-                     "(base image shared between src and destination)",
+        .args_type  = "detach:-d,blk:-b,inc:-i,xbrle:-x,uri:s",
+        .params     = "[-d] [-b] [-i] [-x] uri",
+        .help       = "migrate to URI"
+                      "\n\t -d to not wait for completion"
+                      "\n\t -b for migration without shared storage with"
+                      " full copy of disk"
+                      "\n\t -i for migration without"
+                      " shared storage with incremental copy of disk"
+                      " (base image shared between source and destination)"
+                      "\n\t -x to use XBRLE page delta compression",
         .user_print = monitor_user_noop,
        .mhandler.cmd_new = do_migrate,
     },

-
 STEXI
-@item migrate [-d] [-b] [-i] @var{uri}
+@item migrate [-d] [-b] [-i] [-x] @var{uri}
 @findex migrate
 Migrate to @var{uri} (using -d to not wait for completion).
        -b for migration with full copy of disk
        -i for migration with incremental copy of disk (base image is shared)
+    -x to use XBRLE page delta compression
 ETEXI

     {
@@ -753,10 +756,23 @@  Cancel the current VM migration.
 ETEXI

     {
+        .name       = "migrate_set_cachesize",
+        .args_type  = "value:s",
+        .params     = "value",
+        .help       = "set cache size (in MB) for XBRLE migrations",
+        .mhandler.cmd = do_migrate_set_cachesize,
+    },
+
+STEXI
+@item migrate_set_cachesize @var{value}
+Set cache size (in MB) for xbrle migrations.
+ETEXI
+
+    {
         .name       = "migrate_set_speed",
         .args_type  = "value:o",
         .params     = "value",
-        .help       = "set maximum speed (in bytes) for migrations. "
+        .help       = "set maximum XBRLE cache size (in bytes) for migrations. "
        "Defaults to MB if no size suffix is specified, ie. B/K/M/G/T",
         .user_print = monitor_user_noop,
         .mhandler.cmd_new = do_migrate_set_speed,
diff --git a/hw/hw.h b/hw/hw.h
index 9d2cfc2..aa336ec 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -239,7 +239,8 @@  static inline void qemu_get_sbe64s(QEMUFile *f, int64_t *pv)
 int64_t qemu_ftell(QEMUFile *f);
 int64_t qemu_fseek(QEMUFile *f, int64_t pos, int whence);

-typedef void SaveSetParamsHandler(int blk_enable, int shared, void * opaque);
+typedef void SaveSetParamsHandler(int blk_enable, int shared,
+        int use_xbrle, int64_t xbrle_cache_size, void *opaque);
 typedef void SaveStateHandler(QEMUFile *f, void *opaque);
 typedef int SaveLiveStateHandler(Monitor *mon, QEMUFile *f, int stage,
                                  void *opaque);
diff --git a/lru.c b/lru.c
new file mode 100644
index 0000000..e7230d0
--- /dev/null
+++ b/lru.c
@@ -0,0 +1,142 @@ 
+#include <assert.h>
+#include <math.h>
+#include "qemu-common.h"
+#include "qemu-queue.h"
+#include "host-utils.h"
+#include "lru.h"
+#include "hash.h"
+
+typedef struct CacheItem {
+    ram_addr_t it_addr;
+    uint8_t *it_data;
+    lru_free_cb_t it_free;
+    QCIRCLEQ_ENTRY(CacheItem) it_lru_next;
+    QCIRCLEQ_ENTRY(CacheItem) it_bucket_next;
+} CacheItem;
+
+typedef QCIRCLEQ_HEAD(, CacheItem) CacheBucket;
+static CacheBucket *page_hash;
+static int64_t cache_table_size;
+static uint64_t cache_max_items;
+static int64_t cache_num_items;
+static uint8_t cache_hash_bits;
+
+static QCIRCLEQ_HEAD(page_lru, CacheItem) page_lru;
+
+static uint64_t next_pow_of_2(uint64_t v)
+{
+    v--;
+    v |= v >> 1;
+    v |= v >> 2;
+    v |= v >> 4;
+    v |= v >> 8;
+    v |= v >> 16;
+    v |= v >> 32;
+    v++;
+    return v;
+}
+
+void lru_init(int64_t max_items, void *param)
+{
+    int i;
+
+    cache_num_items = 0;
+    cache_max_items = max_items;
+    /* add 20% to table size to reduce collisions */
+    cache_table_size = next_pow_of_2(1.2 * max_items);
+    cache_hash_bits = ctz64(cache_table_size) - 1;
+
+    QCIRCLEQ_INIT(&page_lru);
+
+    page_hash = qemu_mallocz(sizeof(CacheBucket) * cache_table_size);
+    assert(page_hash);
+    for (i = 0; i < cache_table_size; i++) {
+        QCIRCLEQ_INIT(&page_hash[i]);
+    }
+}
+
+static CacheBucket *page_bucket_list(ram_addr_t addr)
+{
+    return &page_hash[hash_long(addr, cache_hash_bits)];
+}
+
+static void do_lru_remove(CacheItem *it)
+{
+    assert(it);
+
+    QCIRCLEQ_REMOVE(&page_lru, it, it_lru_next);
+    QCIRCLEQ_REMOVE(page_bucket_list(it->it_addr), it, it_bucket_next);
+    if (it->it_free) {
+        (*it->it_free)(it->it_data);
+    }
+    qemu_free(it);
+    cache_num_items--;
+}
+
+static int do_lru_remove_first(void)
+{
+    CacheItem *first;
+
+    if (QCIRCLEQ_EMPTY(&page_lru)) {
+        return -1;
+    }
+    first = QCIRCLEQ_FIRST(&page_lru);
+    do_lru_remove(first);
+    return 0;
+}
+
+
+void lru_fini(void)
+{
+    while (!do_lru_remove_first()) {
+    }
+    qemu_free(page_hash);
+}
+
+static CacheItem *do_lru_lookup(ram_addr_t addr)
+{
+    CacheBucket *head = page_bucket_list(addr);
+    CacheItem *it;
+
+    if (QCIRCLEQ_EMPTY(head)) {
+        return NULL;
+    }
+    QCIRCLEQ_FOREACH(it, head, it_bucket_next) {
+        if (addr == it->it_addr) {
+            return it;
+        }
+    }
+    return NULL;
+}
+
+uint8_t *lru_lookup(ram_addr_t addr)
+{
+    CacheItem *it = do_lru_lookup(addr);
+    return it ? it->it_data : NULL;
+}
+
+void lru_insert(ram_addr_t addr, uint8_t *data, lru_free_cb_t free_cb)
+{
+    CacheItem *it;
+
+    /* remove old if item exists */
+    it = do_lru_lookup(addr);
+    if (it) {
+        do_lru_remove(it);
+    }
+
+    /* evict LRU if require free space */
+    if (cache_num_items == cache_max_items) {
+        do_lru_remove_first();
+    }
+
+    /* add new entry */
+    it = qemu_mallocz(sizeof(*it));
+    it->it_addr = addr;
+    it->it_data = data;
+    it->it_free = free_cb;
+    QCIRCLEQ_INSERT_HEAD(page_bucket_list(addr), it, it_bucket_next);
+    QCIRCLEQ_INSERT_TAIL(&page_lru, it, it_lru_next);
+    cache_num_items++;
+}
+
diff --git a/lru.h b/lru.h
new file mode 100644
index 0000000..6c70095
--- /dev/null
+++ b/lru.h
@@ -0,0 +1,13 @@ 
+#ifndef _LRU_H_
+#define _LRU_H_
+
+#include <unistd.h>
+#include <stdint.h>
+#include "cpu-all.h"
+typedef void (*lru_free_cb_t)(void *);
+void lru_init(ssize_t num_items, void *param);
+void lru_fini(void);
+void lru_insert(ram_addr_t id, uint8_t *pdata, lru_free_cb_t free_cb);
+uint8_t *lru_lookup(ram_addr_t addr);
+#endif
+
diff --git a/migration-exec.c b/migration-exec.c
index 14718dd..fe8254a 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -67,7 +67,9 @@  MigrationState *exec_start_outgoing_migration(Monitor *mon,
                                              int64_t bandwidth_limit,
                                              int detach,
                                              int blk,
-                                             int inc)
+                          int inc,
+                          int use_xbrle,
+                          int64_t xbrle_cache_size)
 {
     FdMigrationState *s;
     FILE *f;
@@ -99,6 +101,8 @@  MigrationState *exec_start_outgoing_migration(Monitor *mon,

     s->mig_state.blk = blk;
     s->mig_state.shared = inc;
+    s->mig_state.use_xbrle = use_xbrle;
+    s->mig_state.xbrle_cache_size = xbrle_cache_size;

     s->state = MIG_STATE_ACTIVE;
     s->mon = NULL;
diff --git a/migration-fd.c b/migration-fd.c
index 6d14505..4a1ddbd 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -56,7 +56,9 @@  MigrationState *fd_start_outgoing_migration(Monitor *mon,
                                            int64_t bandwidth_limit,
                                            int detach,
                                            int blk,
-                                           int inc)
+                        int inc,
+                        int use_xbrle,
+                        int64_t xbrle_cache_size)
 {
     FdMigrationState *s;

@@ -82,6 +84,8 @@  MigrationState *fd_start_outgoing_migration(Monitor *mon,

     s->mig_state.blk = blk;
     s->mig_state.shared = inc;
+    s->mig_state.use_xbrle = use_xbrle;
+    s->mig_state.xbrle_cache_size = xbrle_cache_size;

     s->state = MIG_STATE_ACTIVE;
     s->mon = NULL;
diff --git a/migration-tcp.c b/migration-tcp.c
index b55f419..4ca5bf6 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -81,7 +81,9 @@  MigrationState *tcp_start_outgoing_migration(Monitor *mon,
                                              int64_t bandwidth_limit,
                                              int detach,
                                             int blk,
-                                            int inc)
+                         int inc,
+                         int use_xbrle,
+                         int64_t xbrle_cache_size)
 {
     struct sockaddr_in addr;
     FdMigrationState *s;
@@ -101,6 +103,8 @@  MigrationState *tcp_start_outgoing_migration(Monitor *mon,

     s->mig_state.blk = blk;
     s->mig_state.shared = inc;
+    s->mig_state.use_xbrle = use_xbrle;
+    s->mig_state.xbrle_cache_size = xbrle_cache_size;

     s->state = MIG_STATE_ACTIVE;
     s->mon = NULL;
diff --git a/migration-unix.c b/migration-unix.c
index 57232c0..0813902 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -80,7 +80,9 @@  MigrationState *unix_start_outgoing_migration(Monitor *mon,
                                              int64_t bandwidth_limit,
                                              int detach,
                                              int blk,
-                                             int inc)
+                          int inc,
+                          int use_xbrle,
+                          int64_t xbrle_cache_size)
 {
     FdMigrationState *s;
     struct sockaddr_un addr;
@@ -100,6 +102,8 @@  MigrationState *unix_start_outgoing_migration(Monitor *mon,

     s->mig_state.blk = blk;
     s->mig_state.shared = inc;
+    s->mig_state.use_xbrle = use_xbrle;
+    s->mig_state.xbrle_cache_size = xbrle_cache_size;

     s->state = MIG_STATE_ACTIVE;
     s->mon = NULL;
diff --git a/migration.c b/migration.c
old mode 100644
new mode 100755
index 9ee8b17..ccacf81
--- a/migration.c
+++ b/migration.c
@@ -34,6 +34,11 @@ 
 /* Migration speed throttling */
 static uint32_t max_throttle = (32 << 20);

+/* Migration XBRLE cache size */
+#define DEFAULT_MIGRATE_CACHE_SIZE (64 * 1024 * 1024)
+
+static int64_t migrate_cache_size = DEFAULT_MIGRATE_CACHE_SIZE;
+
 static MigrationState *current_migration;

 int qemu_start_incoming_migration(const char *uri)
@@ -80,6 +85,7 @@  int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
     int detach = qdict_get_try_bool(qdict, "detach", 0);
     int blk = qdict_get_try_bool(qdict, "blk", 0);
     int inc = qdict_get_try_bool(qdict, "inc", 0);
+    int use_xbrle = qdict_get_try_bool(qdict, "xbrle", 0);
     const char *uri = qdict_get_str(qdict, "uri");

     if (current_migration &&
@@ -90,17 +96,21 @@  int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)

     if (strstart(uri, "tcp:", &p)) {
         s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
-                                         blk, inc);
+                                         blk, inc, use_xbrle,
+                                         migrate_cache_size);
 #if !defined(WIN32)
     } else if (strstart(uri, "exec:", &p)) {
         s = exec_start_outgoing_migration(mon, p, max_throttle, detach,
-                                          blk, inc);
+                                          blk, inc, use_xbrle,
+                                          migrate_cache_size);
     } else if (strstart(uri, "unix:", &p)) {
         s = unix_start_outgoing_migration(mon, p, max_throttle, detach,
-                                          blk, inc);
+                                          blk, inc, use_xbrle,
+                                          migrate_cache_size);
     } else if (strstart(uri, "fd:", &p)) {
         s = fd_start_outgoing_migration(mon, p, max_throttle, detach,
-                                        blk, inc);
+                                        blk, inc, use_xbrle,
+                                        migrate_cache_size);
 #endif
     } else {
         monitor_printf(mon, "unknown migration protocol: %s\n", uri);
@@ -185,6 +195,36 @@  static void migrate_print_status(Monitor *mon, const char *name,
                         qdict_get_int(qdict, "total") >> 10);
 }

+static void migrate_print_ram_status(Monitor *mon, const char *name,
+                                 const QDict *status_dict)
+{
+    QDict *qdict;
+    uint64_t overflow, cache_hit, cache_lookup;
+
+    qdict = qobject_to_qdict(qdict_get(status_dict, name));
+
+    monitor_printf(mon, "transferred %s: %" PRIu64 " kbytes\n", name,
+                        qdict_get_int(qdict, "bytes") >> 10);
+    monitor_printf(mon, "transferred %s: %" PRIu64 " pages\n", name,
+                        qdict_get_int(qdict, "pages"));
+    overflow = qdict_get_int(qdict, "overflow");
+    if (overflow > 0) {
+        monitor_printf(mon, "overflow %s: %" PRIu64 " pages\n", name,
+            overflow);
+    }
+    cache_hit = qdict_get_int(qdict, "cache-hit");
+    if (cache_hit > 0) {
+        monitor_printf(mon, "cache-hit %s: %" PRIu64 " pages\n", name,
+            cache_hit);
+    }
+    cache_lookup = qdict_get_int(qdict, "cache-lookup");
+    if (cache_lookup > 0) {
+        monitor_printf(mon, "cache-lookup %s: %" PRIu64 " pages\n", name,
+            cache_lookup);
+    }
+
+}
+
 void do_info_migrate_print(Monitor *mon, const QObject *data)
 {
     QDict *qdict;
@@ -198,6 +238,18 @@  void do_info_migrate_print(Monitor *mon, const QObject *data)
         migrate_print_status(mon, "ram", qdict);
     }

+    if (qdict_haskey(qdict, "ram-duplicate")) {
+        migrate_print_ram_status(mon, "ram-duplicate", qdict);
+    }
+
+    if (qdict_haskey(qdict, "ram-normal")) {
+        migrate_print_ram_status(mon, "ram-normal", qdict);
+    }
+
+    if (qdict_haskey(qdict, "ram-xbrle")) {
+        migrate_print_ram_status(mon, "ram-xbrle", qdict);
+    }
+
     if (qdict_haskey(qdict, "disk")) {
         migrate_print_status(mon, "disk", qdict);
     }
@@ -214,6 +266,23 @@  static void migrate_put_status(QDict *qdict, const char *name,
     qdict_put_obj(qdict, name, obj);
 }

+static void migrate_put_ram_status(QDict *qdict, const char *name,
+                               uint64_t bytes, uint64_t pages,
+                               uint64_t overflow, uint64_t cache_hit,
+                               uint64_t cache_lookup)
+{
+    QObject *obj;
+
+    obj = qobject_from_jsonf("{ 'bytes': %" PRId64 ", "
+                               "'pages': %" PRId64 ", "
+                               "'overflow': %" PRId64 ", "
+                               "'cache-hit': %" PRId64 ", "
+                               "'cache-lookup': %" PRId64 " }",
+                               bytes, pages, overflow, cache_hit,
+                               cache_lookup);
+    qdict_put_obj(qdict, name, obj);
+}
+
 void do_info_migrate(Monitor *mon, QObject **ret_data)
 {
     QDict *qdict;
@@ -228,6 +297,21 @@  void do_info_migrate(Monitor *mon, QObject **ret_data)
             migrate_put_status(qdict, "ram", ram_bytes_transferred(),
                                ram_bytes_remaining(), ram_bytes_total());

+            if (s->use_xbrle) {
+                migrate_put_ram_status(qdict, "ram-duplicate",
+                                   dup_mig_bytes_transferred(),
+                                   dup_mig_pages_transferred(), 0, 0, 0);
+                migrate_put_ram_status(qdict, "ram-normal",
+                                   norm_mig_bytes_transferred(),
+                                   norm_mig_pages_transferred(), 0, 0, 0);
+                migrate_put_ram_status(qdict, "ram-xbrle",
+                                   xbrle_mig_bytes_transferred(),
+                                   xbrle_mig_pages_transferred(),
+                                   xbrle_mig_pages_overflow(),
+                                   xbrle_mig_pages_cache_hit(),
+                                   xbrle_mig_pages_cache_lookup());
+            }
+
             if (blk_mig_active()) {
                 migrate_put_status(qdict, "disk", blk_mig_bytes_transferred(),
                                    blk_mig_bytes_remaining(),
@@ -341,7 +425,8 @@  void migrate_fd_connect(FdMigrationState *s)

     DPRINTF("beginning savevm\n");
     ret = qemu_savevm_state_begin(s->mon, s->file, s->mig_state.blk,
-                                  s->mig_state.shared);
+                                  s->mig_state.shared, s->mig_state.use_xbrle,
+                                  s->mig_state.xbrle_cache_size);
     if (ret < 0) {
         DPRINTF("failed, %d\n", ret);
         migrate_fd_error(s);
@@ -448,3 +533,27 @@  int migrate_fd_close(void *opaque)
     qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
     return s->close(s);
 }
+
+void do_migrate_set_cachesize(Monitor *mon, const QDict *qdict)
+{
+    ssize_t bytes;
+    const char *value = qdict_get_str(qdict, "value");
+
+    bytes = strtosz(value, NULL);
+    if (bytes < 0) {
+        monitor_printf(mon, "invalid cache size: %s\n", value);
+        return;
+    }
+
+    /* On 32-bit hosts, QEMU is limited by virtual address space */
+    if (bytes > (2047 << 20) && HOST_LONG_BITS == 32) {
+        monitor_printf(mon, "cache can't exceed 2047 MB RAM limit on host\n");
+        return;
+    }
+    if (bytes != (uint64_t) bytes) {
+        monitor_printf(mon, "cache size too large\n");
+        return;
+    }
+    migrate_cache_size = bytes;
+}
+
diff --git a/migration.h b/migration.h
index d13ed4f..6dc0543 100644
--- a/migration.h
+++ b/migration.h
@@ -32,6 +32,8 @@  struct MigrationState
     void (*release)(MigrationState *s);
     int blk;
     int shared;
+    int use_xbrle;
+    int64_t xbrle_cache_size;
 };

 typedef struct FdMigrationState FdMigrationState;
@@ -76,7 +78,9 @@  MigrationState *exec_start_outgoing_migration(Monitor *mon,
                                              int64_t bandwidth_limit,
                                              int detach,
                                              int blk,
-                                             int inc);
+                          int inc,
+                          int use_xbrle,
+                          int64_t xbrle_cache_size);

 int tcp_start_incoming_migration(const char *host_port);

@@ -85,7 +89,9 @@  MigrationState *tcp_start_outgoing_migration(Monitor *mon,
                                             int64_t bandwidth_limit,
                                             int detach,
                                             int blk,
-                                            int inc);
+                         int inc,
+                         int use_xbrle,
+                         int64_t xbrle_cache_size);

 int unix_start_incoming_migration(const char *path);

@@ -94,7 +100,9 @@  MigrationState *unix_start_outgoing_migration(Monitor *mon,
                                              int64_t bandwidth_limit,
                                              int detach,
                                              int blk,
-                                             int inc);
+                          int inc,
+                          int use_xbrle,
+                          int64_t xbrle_cache_size);

 int fd_start_incoming_migration(const char *path);

@@ -103,7 +111,9 @@  MigrationState *fd_start_outgoing_migration(Monitor *mon,
                                            int64_t bandwidth_limit,
                                            int detach,
                                            int blk,
-                                           int inc);
+                        int inc,
+                        int use_xbrle,
+                        int64_t xbrle_cache_size);

 void migrate_fd_monitor_suspend(FdMigrationState *s, Monitor *mon);

@@ -134,4 +144,11 @@  static inline FdMigrationState *migrate_to_fms(MigrationState *mig_state)
     return container_of(mig_state, FdMigrationState, mig_state);
 }

+void do_migrate_set_cachesize(Monitor *mon, const QDict *qdict);
+
+void arch_set_params(int blk_enable, int shared_base,
+        int use_xbrle, int64_t xbrle_cache_size, void *opaque);
+
+int xbrle_mig_active(void);
+
 #endif
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 793cf1c..8fbe64b 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -431,13 +431,16 @@  EQMP

     {
         .name       = "migrate",
-        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
-        .params     = "[-d] [-b] [-i] uri",
-        .help       = "migrate to URI (using -d to not wait for completion)"
-                     "\n\t\t\t -b for migration without shared storage with"
-                     " full copy of disk\n\t\t\t -i for migration without "
-                     "shared storage with incremental copy of disk "
-                     "(base image shared between src and destination)",
+        .args_type  = "detach:-d,blk:-b,inc:-i,xbrle:-x,uri:s",
+        .params     = "[-d] [-b] [-i] [-x] uri",
+        .help       = "migrate to URI"
+                      "\n\t -d to not wait for completion"
+                      "\n\t -b for migration without shared storage with"
+                      " full copy of disk"
+                      "\n\t -i for migration without"
+                      " shared storage with incremental copy of disk"
+                      " (base image shared between source and destination)"
+                      "\n\t -x to use XBRLE page delta compression",
         .user_print = monitor_user_noop,
        .mhandler.cmd_new = do_migrate,
     },
@@ -453,6 +456,7 @@  Arguments:
 - "blk": block migration, full disk copy (json-bool, optional)
 - "inc": incremental disk copy (json-bool, optional)
 - "uri": Destination URI (json-string)
+- "xbrle": to use XBRLE page delta compression

 Example:

@@ -494,6 +498,31 @@  Example:
 EQMP

     {
+        .name       = "migrate_set_cachesize",
+        .args_type  = "value:s",
+        .params     = "value",
+        .help       = "set cache size (in MB) for xbrle migrations",
+        .mhandler.cmd = do_migrate_set_cachesize,
+    },
+
+SQMP
+migrate_set_cachesize
+---------------------
+
+Set cache size to be used by XBRLE migration
+
+Arguments:
+
+- "value": cache size in bytes (json-number)
+
+Example:
+
+-> { "execute": "migrate_set_cachesize", "arguments": { "value": 500M } }
+<- { "return": {} }
+
+EQMP
+
+    {
         .name       = "migrate_set_speed",
         .args_type  = "value:f",
         .params     = "value",
diff --git a/savevm.c b/savevm.c
index 4e49765..93b512b 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1141,7 +1141,8 @@  int register_savevm(DeviceState *dev,
                     void *opaque)
 {
     return register_savevm_live(dev, idstr, instance_id, version_id,
-                                NULL, NULL, save_state, load_state, opaque);
+                                arch_set_params, NULL, save_state,
+                                load_state, opaque);
 }

 void unregister_savevm(DeviceState *dev, const char *idstr, void *opaque)
@@ -1428,15 +1429,17 @@  static int vmstate_save(QEMUFile *f, SaveStateEntry *se)
 #define QEMU_VM_SUBSECTION           0x05

 int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
-                            int shared)
+                            int shared, int use_xbrle,
+                            int64_t xbrle_cache_size)
 {
     SaveStateEntry *se;

     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
         if(se->set_params == NULL) {
             continue;
-       }
-       se->set_params(blk_enable, shared, se->opaque);
+        }
+        se->set_params(blk_enable, shared, use_xbrle, xbrle_cache_size,
+                se->opaque);
     }

     qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
@@ -1577,7 +1580,7 @@  static int qemu_savevm_state(Monitor *mon, QEMUFile *f)

     bdrv_flush_all();

-    ret = qemu_savevm_state_begin(mon, f, 0, 0);
+    ret = qemu_savevm_state_begin(mon, f, 0, 0, 0, 0);
     if (ret < 0)
         goto out;

diff --git a/sysemu.h b/sysemu.h
index b81a70e..eb53bf7 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -44,6 +44,16 @@  uint64_t ram_bytes_remaining(void);
 uint64_t ram_bytes_transferred(void);
 uint64_t ram_bytes_total(void);

+uint64_t dup_mig_bytes_transferred(void);
+uint64_t dup_mig_pages_transferred(void);
+uint64_t norm_mig_bytes_transferred(void);
+uint64_t norm_mig_pages_transferred(void);
+uint64_t xbrle_mig_bytes_transferred(void);
+uint64_t xbrle_mig_pages_transferred(void);
+uint64_t xbrle_mig_pages_overflow(void);
+uint64_t xbrle_mig_pages_cache_lookup(void);
+uint64_t xbrle_mig_pages_cache_hit(void);
+
 int64_t cpu_get_ticks(void);
 void cpu_enable_ticks(void);
 void cpu_disable_ticks(void);
@@ -74,7 +84,8 @@  void qemu_announce_self(void);
 void main_loop_wait(int nonblocking);

 int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
-                            int shared);
+                            int shared, int use_xbrle,
+                            int64_t xbrle_cache_size);
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
diff --git a/xbzrle.c b/xbzrle.c
new file mode 100644
index 0000000..e9285e0
--- /dev/null
+++ b/xbzrle.c
@@ -0,0 +1,126 @@ 
+#include <stdint.h>
+#include <string.h>
+#include <assert.h>
+#include "cpu-all.h"
+#include "xbzrle.h"
+
+typedef struct {
+    uint64_t c;
+    uint64_t num;
+} zero_encoding_t;
+
+typedef struct {
+    uint64_t c;
+} char_encoding_t;
+
+static int rle_encode(uint64_t *in, int slen, uint8_t *out, const int dlen)
+{
+    int dl = 0;
+    uint64_t cp = 0, c, run_len = 0;
+
+    if (slen <=  0)
+        return -1;
+
+    while (1) {
+        if (!slen)
+            break;
+        c = *in++;
+        slen--;
+        if (!(cp || c)) {
+            run_len++;
+        } else if (!cp) {
+            ((zero_encoding_t *)out)->c = cp;
+            ((zero_encoding_t *)out)->num = run_len;
+            dl += sizeof(zero_encoding_t);
+            out += sizeof(zero_encoding_t);
+            run_len = 1;
+        } else {
+            ((char_encoding_t *)out)->c = cp;
+            dl += sizeof(char_encoding_t);
+            out += sizeof(char_encoding_t);
+                }
+        cp = c;
+    }
+
+    if (!cp) {
+        ((zero_encoding_t *)out)->c = cp;
+        ((zero_encoding_t *)out)->num = run_len;
+        dl += sizeof(zero_encoding_t);
+        out += sizeof(zero_encoding_t);
+    } else {
+        ((char_encoding_t *)out)->c = cp;
+        dl += sizeof(char_encoding_t);
+        out += sizeof(char_encoding_t);
+    }
+    return dl;
+}
+
+static int rle_decode(const uint8_t *in, int slen, uint64_t *out, int dlen)
+{
+    int tb = 0;
+    uint64_t run_len, c;
+
+    while (slen > 0) {
+        c = ((char_encoding_t *) in)->c;
+        if (c) {
+            slen -= sizeof(char_encoding_t);
+            in += sizeof(char_encoding_t);
+            *out++ = c;
+            tb++;
+            continue;
+        }
+        run_len = ((zero_encoding_t *) in)->num;
+        slen -= sizeof(zero_encoding_t);
+        in += sizeof(zero_encoding_t);
+        while (run_len-- > 0) {
+            *out++ = c;
+            tb++;
+        }
+    }
+    return tb;
+}
+
+static void xor_encode_word(uint8_t *dst, const uint8_t *src1,
+    const uint8_t *src2)
+{
+    int len = TARGET_PAGE_SIZE / sizeof (uint64_t);
+    uint64_t *dstw = (uint64_t *) dst;
+    const uint64_t *srcw1 = (const uint64_t *) src1;
+    const uint64_t *srcw2 = (const uint64_t *) src2;
+
+    while (len--) {
+        *dstw++ = *srcw1++ ^ *srcw2++;
+    }
+}
+
+int xbzrle_encode(uint8_t *xbzrle, const uint8_t *old, const uint8_t *curr,
+    const size_t max_compressed_len)
+{
+    int compressed_len;
+    uint8_t xor_buf[TARGET_PAGE_SIZE];
+    uint8_t work_buf[TARGET_PAGE_SIZE * 2]; /* worst case xbzrle is 150% */
+
+    xor_encode_word(xor_buf, old, curr);
+    compressed_len = rle_encode((uint64_t *)xor_buf,
+        sizeof(xor_buf)/sizeof(uint64_t), work_buf,
+        sizeof(work_buf));
+    if (compressed_len > max_compressed_len) {
+        return -1;
+    }
+    memcpy(xbzrle, work_buf, compressed_len);
+    return compressed_len;
+}
+
+int xbzrle_decode(uint8_t *curr, const uint8_t *old, const uint8_t *xbrle,
+    const size_t compressed_len)
+{
+    uint8_t xor_buf[TARGET_PAGE_SIZE];
+
+    int len = rle_decode(xbrle, compressed_len,
+         (uint64_t *)xor_buf, sizeof(xor_buf)/sizeof(uint64_t));
+    if (len < 0) {
+        return len;
+    }
+    xor_encode_word(curr, old, xor_buf);
+    return len * sizeof(uint64_t);
+}
diff --git a/xbzrle.h b/xbzrle.h
new file mode 100644
index 0000000..5d625a0
--- /dev/null
+++ b/xbzrle.h
@@ -0,0 +1,12 @@ 
+#ifndef _XBZRLE_H_
+#define _XBZRLE_H_
+
+#include <stdio.h>
+
+int xbzrle_encode(uint8_t *xbrle, const uint8_t *old, const uint8_t *curr,
+    const size_t len);
+int xbzrle_decode(uint8_t *curr, const uint8_t *old, const uint8_t *xbrle,
+    const size_t len);
+
+#endif
+