Patchwork [1/2,v8] add-cow file format

Submitter Robert Wang
Date April 18, 2012, 8:18 a.m.
Message ID <1334737097-20680-1-git-send-email-wdongxu@linux.vnet.ibm.com>
Permalink /patch/153433/
State New

Comments

Robert Wang - April 18, 2012, 8:18 a.m.
From: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>

Provide a new file format: add-cow. The usage can be found in add-cow.txt of
this patch.

CC: Marcelo Tosatti <mtosatti@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
---
 Makefile.objs          |    1 +
 block.c                |    2 +-
 block.h                |    1 +
 block/add-cow-cache.c  |  197 ++++++++++++++++++++++++++++
 block/add-cow.c        |  332 ++++++++++++++++++++++++++++++++++++++++++++++++
 block/add-cow.h        |   63 +++++++++
 block_int.h            |    1 +
 docs/specs/add-cow.txt |   68 ++++++++++
 8 files changed, 664 insertions(+), 1 deletions(-)
 create mode 100644 block/add-cow-cache.c
 create mode 100644 block/add-cow.c
 create mode 100644 block/add-cow.h
 create mode 100644 docs/specs/add-cow.txt
Stefan Hajnoczi - April 25, 2012, 4:09 p.m.
On Wed, Apr 18, 2012 at 9:18 AM, Dong Xu Wang
<wdongxu@linux.vnet.ibm.com> wrote:

QEMU normally does not cache data, only metadata.  There are a couple
of reasons:

1. The host page cache already provides this, so we don't need to
duplicate it in QEMU.  The user can decide whether to cache data in
host memory or not using the cache=writeback|writethrough vs
cache=none|directsync options.

2. It increases the risk of data loss since anything in QEMU memory is
likely to be lost in a crash or power failure.  When we use the host
page cache at least guest data is not lost when the QEMU process
crashes.

One consequence of caching eight 64 KB data clusters per entry is that
the cache cannot effectively reduce metadata I/O without increasing
QEMU's memory footprint a lot.  Since a cache entry only holds 8 bits
of the COW bitmap, we probably need a lot of seeks to access bitmap
metadata.  We could increase the number of cache entries, but each
entry takes more than 8 * 64 KB = 512 KB.

For these reasons I suggest only caching the COW bitmap.  Each entry
can cache 64 KB of the COW bitmap, that means 64 KB bitmap * 8 bits *
64 KB data bytes = allocation information for a 32 GB region of the
disk.  Since a single cache entry covers 32 GB of the disk we can
expect pretty good c
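
The sizing arithmetic in the paragraphs above can be checked with a
short sketch.  The function and parameter names are made up for
illustration; only the 64 KB and 8-bit figures come from this review.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8ULL

/*
 * Guest data bytes described by one cached chunk of the COW bitmap,
 * where each bit covers one cluster.  Hypothetical helper.
 */
static uint64_t bytes_covered(uint64_t bitmap_bytes, uint64_t cluster_size)
{
    assert(cluster_size > 0);
    return bitmap_bytes * BITS_PER_BYTE * cluster_size;
}

/*
 * Memory taken by one entry of a cache that stores the data clusters
 * themselves, one entry per bitmap byte (the scheme in this patch).
 */
static uint64_t data_entry_bytes(uint64_t cluster_size)
{
    assert(cluster_size > 0);
    return BITS_PER_BYTE * cluster_size;
}
```

With 64 KB clusters, bytes_covered(64 KB, 64 KB) comes to 32 GB, while
a data-caching entry costs 512 KB and describes only 512 KB of disk.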

> +static coroutine_fn int add_cow_co_writev(BlockDriverState *bs,
> +        int64_t sector_num, int remaining_sectors, QEMUIOVector *qiov)
> +{
> +    BDRVAddCowState *s = bs->opaque;
> +    int ret = 0;
> +    int cur_nr_sectors;
> +    QEMUIOVector hd_qiov;
> +    uint64_t bytes_done = 0;
> +    uint8_t *table;
> +    uint8_t bitmap;
> +    int64_t index;
> +    int i;
> +    uint8_t *cluster_data = NULL;
> +    qemu_co_mutex_lock(&s->lock);
> +    qemu_iovec_init(&hd_qiov, qiov->niov);
> +    while (remaining_sectors != 0) {
> +        index = sector_num & 1023;

Please use a constant to explain what this calculation does instead of 1023.

> +        cur_nr_sectors = MIN(remaining_sectors,
> +            (sector_num | 1023) - sector_num + 1);

A clearer expression would be: MIN(remaining_sectors,
SECTORS_PER_CLUSTER - index)

I just invented SECTORS_PER_CLUSTER but you should be able to define
something like that if you don't have it already.
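
A minimal sketch of what named constants could look like.  All names
here are hypothetical.  Note that the 1023 mask in the patch spans
8 clusters (one bitmap byte, 8 * 128 sectors), so the constant that
replaces it may need a name other than SECTORS_PER_CLUSTER:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants; the values match the patch (512 B, 64 KB). */
#define SECTOR_SIZE          512
#define CLUSTER_SIZE         65536
#define SECTORS_PER_CLUSTER  (CLUSTER_SIZE / SECTOR_SIZE)        /* 128 */
#define CLUSTERS_PER_ENTRY   8     /* bits in one bitmap byte */
#define SECTORS_PER_ENTRY    (SECTORS_PER_CLUSTER * CLUSTERS_PER_ENTRY)

/* Offset of sector_num inside its cache entry: sector_num & 1023. */
static int64_t entry_index(int64_t sector_num)
{
    assert(sector_num >= 0);
    return sector_num % SECTORS_PER_ENTRY;
}

/* Sectors that can be handled before crossing into the next entry. */
static int sectors_this_round(int64_t sector_num, int remaining_sectors)
{
    int64_t left = SECTORS_PER_ENTRY - entry_index(sector_num);

    return remaining_sectors < left ? remaining_sectors : (int)left;
}
```

sectors_this_round() gives the same result as the original
MIN(remaining_sectors, (sector_num | 1023) - sector_num + 1).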

> +
> +        ret = add_cow_cache_get(bs, s->bitmap_cache,
> +            sector_num & ~1023, &bitmap, (void **)&table);
> +        if (ret < 0) {
> +            goto fail;
> +        }

Here I wonder why add_cow_cache_get() doesn't allow us to pass in
sector_num (unmodified) and tells us whether or not this COW bit is
set.  Instead it fills in a uint8_t that we need to index into, which
requires every caller to duplicate the bitmap indexing code.

An even better strategy is to use bdrv_co_is_allocated() because it
tells you how many contiguous clusters are allocated.  If you're lucky
and they are all allocated/unallocated you can handle the entire I/O
request in one host I/O operation.  It's more efficient than doing
things 1 cluster at a time.
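
To illustrate both suggestions, here is a sketch over a plain
in-memory bitmap.  cow_bit_is_set() and cow_is_allocated() are
hypothetical helpers, not existing QEMU functions; the second one only
mimics the "how many contiguous sectors share this state" idea behind
bdrv_co_is_allocated().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTORS_PER_CLUSTER 128   /* 64 KB clusters, 512-byte sectors */

/* Callers pass the unmodified sector number; indexing is done here. */
static bool cow_bit_is_set(const uint8_t *bitmap, int64_t sector_num)
{
    int64_t cluster = sector_num / SECTORS_PER_CLUSTER;

    return (bitmap[cluster / 8] >> (cluster % 8)) & 1;
}

/*
 * Returns the allocation state at sector_num.  *pnum receives the
 * number of sectors, capped at nb_sectors, that share that state.
 */
static bool cow_is_allocated(const uint8_t *bitmap, int64_t sector_num,
                             int nb_sectors, int *pnum)
{
    bool first = cow_bit_is_set(bitmap, sector_num);
    int64_t end = sector_num + nb_sectors;
    int64_t cur = sector_num;

    assert(nb_sectors > 0);
    while (cur < end && cow_bit_is_set(bitmap, cur) == first) {
        /* Jump to the first sector of the next cluster. */
        cur = (cur / SECTORS_PER_CLUSTER + 1) * SECTORS_PER_CLUSTER;
    }
    if (cur > end) {
        cur = end;
    }
    *pnum = (int)(cur - sector_num);
    return first;
}
```

A request that falls entirely inside one run (*pnum == nb_sectors) can
then be sent to the .raw file or the backing file in a single
operation.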

> +
> +        cluster_data = qemu_blockalign(bs, BDRV_SECTOR_SIZE * cur_nr_sectors);
> +        qemu_iovec_reset(&hd_qiov);
> +        qemu_iovec_copy(&hd_qiov, qiov, bytes_done,
> +            cur_nr_sectors * BDRV_SECTOR_SIZE);
> +        qemu_iovec_to_buffer(&hd_qiov, cluster_data);
> +
> +        memcpy(table + index * BDRV_SECTOR_SIZE,
> +            cluster_data,
> +            BDRV_SECTOR_SIZE * cur_nr_sectors);
> +        for (i = index / 128;

Please use a constant.  It should be clear what 128 means (I guess 128
x 512 byte sectors = 64 KB cluster size).

> +            i <= (index + cur_nr_sectors - 1) / 128;
> +            i++) {
> +                bitmap |= 1 << i;
> +        }
> +        add_cow_cache_entry_mark_dirty(s->bitmap_cache,
> +            bitmap,
> +            table);

It seems you always mark this cached COW bitmap entry dirty, even when
the cluster was already allocated.  When you switch to a pure metadata
cache this will cause unnecessary I/O when flushing or evicting cache
entries.  We should only mark the table dirty if a COW bit
transitioned from 0 -> 1.
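
One way to do this, sketched with a simplified stand-in for the cache
entry (CowEntry and cow_entry_set_bits() are hypothetical, not part of
the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a cache entry holding one bitmap byte. */
typedef struct {
    uint8_t bitmap;
    bool dirty;
} CowEntry;

/*
 * Set bits first_bit..last_bit.  The entry is marked dirty, and true
 * is returned, only if at least one bit changed from 0 to 1.
 */
static bool cow_entry_set_bits(CowEntry *e, int first_bit, int last_bit)
{
    uint8_t old = e->bitmap;
    int i;

    assert(0 <= first_bit && first_bit <= last_bit && last_bit < 8);
    for (i = first_bit; i <= last_bit; i++) {
        e->bitmap |= (uint8_t)(1u << i);
    }
    if (e->bitmap == old) {
        return false;       /* rewrite of allocated clusters */
    }
    e->dirty = true;
    return true;
}
```

Rewriting clusters that are already allocated leaves dirty untouched,
so flushing or evicting the entry causes no metadata I/O.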

> +        remaining_sectors -= cur_nr_sectors;
> +        sector_num += cur_nr_sectors;
> +        bytes_done += cur_nr_sectors * BDRV_SECTOR_SIZE;
> +    }
> +    ret = 0;
> +fail:
> +    ret = add_cow_cache_flush(bs, s->bitmap_cache);
> +    if (ret < 0) {
> +        goto fail;
> +    }

There are two options with metadata flushing:

1. If BDRV_O_CACHE_WB is set then we should follow the rule that
metadata updates are buffered in memory (for speed).  Data writes are
issued immediately (we don't buffer them because we don't want to use
too much memory for guest data).  If a flush is needed, then we must
first ensure all data is written and flushed to the .raw file.  Then
we write out metadata and flush it.

2. If not BDRV_O_CACHE_WB then we need to write out data to the .raw
file, flush the .raw file, and then write out metadata.

Note that if no COW bits transitioned from 0 -> 1 then we have no
metadata updates and can simply perform the data writes to the .raw
file!
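
The ordering rules above can be written down as a small model.  This
is only a sketch that records the order of steps as strings; it
performs no I/O, and model_write() and model_flush() are invented
names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_STEPS 8

typedef struct {
    const char *steps[MAX_STEPS];
    int count;
} StepLog;

static void log_step(StepLog *log, const char *step)
{
    assert(log->count < MAX_STEPS);
    log->steps[log->count++] = step;
}

/* One guest write.  cache_wb models BDRV_O_CACHE_WB being set. */
static void model_write(StepLog *log, bool cache_wb, bool bit_transitioned)
{
    log_step(log, "write data to .raw");
    if (!bit_transitioned) {
        return;                 /* no metadata update needed */
    }
    if (cache_wb) {
        log_step(log, "buffer bitmap update in memory");
        return;
    }
    log_step(log, "flush .raw");
    log_step(log, "write bitmap to .add-cow");
}

/* A flush request: data must be stable before metadata goes out. */
static void model_flush(StepLog *log, bool bitmap_dirty)
{
    log_step(log, "flush .raw");
    if (bitmap_dirty) {
        log_step(log, "write bitmap to .add-cow");
        log_step(log, "flush .add-cow");
    }
}
```

In both modes the bitmap never reaches the .add-cow file before the
data it describes has been flushed to the .raw file.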

> +    qemu_co_mutex_unlock(&s->lock);
> +    qemu_vfree(cluster_data);

It seems like cluster_data gets leaked since the allocation is inside
the loop and may happen multiple times but we only free once.
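
One possible fix is to allocate a single bounce buffer, sized for the
largest chunk, before the loop and free it once afterwards.  The
sketch below uses malloc()/free() and a counter in place of
qemu_blockalign()/qemu_vfree(); all names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SECTOR_SIZE 512

static int live_buffers;    /* allocations not yet freed */

static void *buf_alloc(size_t size)
{
    void *p = malloc(size);

    assert(p != NULL);
    live_buffers++;
    return p;
}

static void buf_free(void *p)
{
    if (p) {
        free(p);
        live_buffers--;
    }
}

/* Process a request in chunks of at most max_chunk sectors. */
static int process_chunks(int remaining_sectors, int max_chunk)
{
    int chunks = 0;
    unsigned char *bounce;

    assert(max_chunk > 0);
    bounce = buf_alloc((size_t)max_chunk * SECTOR_SIZE);
    while (remaining_sectors > 0) {
        int cur = remaining_sectors < max_chunk ? remaining_sectors
                                                : max_chunk;

        /* Stand-in for copying guest data into the bounce buffer. */
        memset(bounce, 0, (size_t)cur * SECTOR_SIZE);
        remaining_sectors -= cur;
        chunks++;
    }
    buf_free(bounce);       /* exactly one free for one allocation */
    return chunks;
}
```

Freeing at the end of every loop iteration would work as well; the
point is that each allocation is paired with exactly one free.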

> +    qemu_iovec_destroy(&hd_qiov);
> +    return ret;
> +}
> +
> +static int bdrv_add_cow_truncate(BlockDriverState *bs, int64_t offset)
> +{
> +    return bdrv_truncate(bs->file,
> +        sizeof(AddCowHeader) + ((offset / BDRV_SECTOR_SIZE + 1023) >> 10));

This calculation is unclear to me.  I think this is saying that each
COW bit covers 64 KB of image data.  Please rewrite the expression
using constants from add-cow.h.  Don't precompute parts of the
expression, the compiler will do that for you and it's more important
to show where this calculation comes from to the reader of the code.
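
For example, something along these lines.  The constant names are
hypothetical, and HEADER_SIZE stands in for sizeof(AddCowHeader),
which is 2560 bytes according to the spec in this patch.  Unlike the
original expression, this sketch rounds a partial cluster up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants; values taken from the patch and its spec. */
#define CLUSTER_SIZE       65536
#define CLUSTERS_PER_BYTE  8       /* one COW bit per cluster */
#define HEADER_SIZE        2560    /* 8 + 4 + 1024 + 1024 + 500 */

/* Size of the .add-cow file needed to describe virtual_size bytes. */
static int64_t add_cow_file_size(int64_t virtual_size)
{
    int64_t clusters, bitmap_bytes;

    assert(virtual_size >= 0);
    clusters = (virtual_size + CLUSTER_SIZE - 1) / CLUSTER_SIZE;
    bitmap_bytes = (clusters + CLUSTERS_PER_BYTE - 1) / CLUSTERS_PER_BYTE;
    return HEADER_SIZE + bitmap_bytes;
}
```

For an 8 GB image this gives 131072 clusters and a 16384-byte bitmap,
the same value as ((offset / BDRV_SECTOR_SIZE + 1023) >> 10).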

I think the .raw file should also be truncated.  Since add_cow_open()
uses the .raw file size we need to keep the .raw file correctly sized
at all times.

> +=Specification=
> +
> +The file format looks like this:
> +
> + +---------------+--------------------------+
> + |     Header    |           Data           |
> + +---------------+--------------------------+

'Metadata' or even 'COW bitmap' is clearer than 'Data'.  No guest data
is stored in the .add-cow file, only metadata.

Patch

diff --git a/Makefile.objs b/Makefile.objs
index 5c3bcda..c32c627 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -52,6 +52,7 @@  block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vv
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
 block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-nested-y += qed-check.o
+block-nested-y += add-cow.o add-cow-cache.o
 block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
 block-nested-y += stream.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
diff --git a/block.c b/block.c
index c0c90f0..abada9f 100644
--- a/block.c
+++ b/block.c
@@ -194,7 +194,7 @@  static void bdrv_io_limits_intercept(BlockDriverState *bs,
 }
 
 /* check if the path starts with "<protocol>:" */
-static int path_has_protocol(const char *path)
+int path_has_protocol(const char *path)
 {
 #ifdef _WIN32
     if (is_windows_drive(path) ||
diff --git a/block.h b/block.h
index f163e54..f74c79e 100644
--- a/block.h
+++ b/block.h
@@ -319,6 +319,7 @@  char *bdrv_snapshot_dump(char *buf, int buf_size, QEMUSnapshotInfo *sn);
 
 char *get_human_readable_size(char *buf, int buf_size, int64_t size);
 int path_is_absolute(const char *path);
+int path_has_protocol(const char *path);
 void path_combine(char *dest, int dest_size,
                   const char *base_path,
                   const char *filename);
diff --git a/block/add-cow-cache.c b/block/add-cow-cache.c
new file mode 100644
index 0000000..2ea0ac4
--- /dev/null
+++ b/block/add-cow-cache.c
@@ -0,0 +1,197 @@ 
+/*
+ * Cache For QEMU ADD-COW Disk Format
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * Authors:
+ *  Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+
+#include "block_int.h"
+#include "qemu-common.h"
+#include "add-cow.h"
+
+AddCowCache *add_cow_cache_create(BlockDriverState *bs, int num_tables)
+{
+    BDRVAddCowState *s = bs->opaque;
+    AddCowCache *c;
+    int i;
+
+    c = g_malloc0(sizeof(*c));
+    c->size = num_tables;
+    c->entries = g_malloc0(sizeof(*c->entries) * num_tables);
+
+    for (i = 0; i < c->size; i++) {
+        c->entries[i].table = qemu_blockalign(bs, 8 * s->cluster_size);
+        c->entries[i].offset = -1;
+    }
+
+    return c;
+}
+
+void add_cow_cache_destroy(BlockDriverState *bs, AddCowCache *c)
+{
+    int i;
+
+    for (i = 0; i < c->size; i++) {
+        qemu_vfree(c->entries[i].table);
+    }
+
+    g_free(c->entries);
+    g_free(c);
+}
+
+static int add_cow_cache_find_entry_to_replace(AddCowCache *c)
+{
+    int i;
+    int min_count = INT_MAX;
+    int min_index = -1;
+
+
+    for (i = 0; i < c->size; i++) {
+        if (c->entries[i].cache_hits < min_count) {
+            min_index = i;
+            min_count = c->entries[i].cache_hits;
+        }
+
+        c->entries[i].cache_hits /= 2;
+    }
+
+    return min_index;
+}
+
+static int add_cow_cache_entry_flush(BlockDriverState *bs,
+                                        AddCowCache *c, int i)
+{
+    BDRVAddCowState *s = bs->opaque;
+    int ret = 0, j;
+
+    if (!c->entries[i].dirty || (-1 == c->entries[i].offset)) {
+        return 0;
+    }
+
+    for (j = 0; j < 8; j++) {
+        if (c->entries[i].bitmap & (1 << j)) {
+            ret = bdrv_pwrite(s->image_hd,
+                c->entries[i].offset * BDRV_SECTOR_SIZE + s->cluster_size * j,
+                c->entries[i].table + s->cluster_size * j,
+                s->cluster_size);
+        }
+        if (ret < 0) {
+            return ret;
+        }
+    }
+    ret = bdrv_flush(s->image_hd);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = bdrv_pwrite(bs->file,
+                sizeof(AddCowHeader) + (c->entries[i].offset >> 10),
+                &c->entries[i].bitmap,
+                1);
+    if (ret < 0) {
+        return ret;
+    }
+    ret = bdrv_flush(bs->file);
+    if (ret < 0) {
+        return ret;
+    }
+
+    c->entries[i].dirty = false;
+    return 0;
+}
+
+void add_cow_cache_entry_mark_dirty(AddCowCache *c, uint8_t bitmap, void *table)
+{
+    int i;
+
+    for (i = 0; i < c->size; i++) {
+        if (c->entries[i].table == table) {
+            goto found;
+        }
+    }
+    abort();
+
+found:
+    c->entries[i].dirty = true;
+    c->entries[i].bitmap = bitmap;
+}
+
+int add_cow_cache_flush(BlockDriverState *bs, AddCowCache *c)
+{
+    int result = 0;
+    int ret;
+    int i;
+
+    for (i = 0; i < c->size; i++) {
+        ret = add_cow_cache_entry_flush(bs, c, i);
+        if (ret < 0 && result != -ENOSPC) {
+            result = ret;
+        }
+    }
+    return result;
+}
+
+int add_cow_cache_get(BlockDriverState *bs, AddCowCache *c,
+    uint64_t sector_num, uint8_t *bitmap, void **table)
+{
+    BDRVAddCowState *s = bs->opaque;
+    int i, j;
+    int ret;
+    uint64_t offset = sector_num >> 10;
+
+    for (i = 0; i < c->size; i++) {
+        if (c->entries[i].offset == sector_num) {
+            goto found;
+        }
+    }
+
+    i = add_cow_cache_find_entry_to_replace(c);
+    if (i < 0) {
+        return i;
+    }
+
+    ret = add_cow_cache_entry_flush(bs, c, i);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = bdrv_pread(bs->file, sizeof(AddCowHeader) + offset,
+                        &c->entries[i].bitmap, 1);
+    if (ret < 0) {
+        return ret;
+    }
+
+    for (j = 0; j < 8; j++) {
+        if (s->image_hd->total_sectors * BDRV_SECTOR_SIZE <
+            sector_num + j * s->cluster_size) {
+            break;
+        }
+        if (c->entries[i].bitmap & (1 << j)) {
+            ret = bdrv_pread(s->image_hd,
+                sector_num * BDRV_SECTOR_SIZE + j * s->cluster_size,
+                c->entries[i].table + j * s->cluster_size,
+                s->cluster_size);
+        } else {
+            ret = bdrv_pread(bs->backing_hd,
+                sector_num * BDRV_SECTOR_SIZE + j * s->cluster_size,
+                c->entries[i].table + j * s->cluster_size,
+                s->cluster_size);
+        }
+        if (ret < 0) {
+            return ret;
+        }
+    }
+    c->entries[i].cache_hits = 32;
+    c->entries[i].offset = sector_num;
+
+found:
+    c->entries[i].cache_hits++;
+    *table = c->entries[i].table;
+    *bitmap = c->entries[i].bitmap;
+    return 0;
+}
diff --git a/block/add-cow.c b/block/add-cow.c
new file mode 100644
index 0000000..cbbd5f6
--- /dev/null
+++ b/block/add-cow.c
@@ -0,0 +1,332 @@ 
+/*
+ * QEMU ADD-COW Disk Format
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * Authors:
+ *  Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include "block_int.h"
+#include "module.h"
+#include "add-cow.h"
+
+static int add_cow_probe(const uint8_t *buf, int buf_size, const char *filename)
+{
+    const AddCowHeader *header = (const AddCowHeader *)buf;
+
+    if (be64_to_cpu(header->magic) == ADD_COW_MAGIC &&
+        be32_to_cpu(header->version) == ADD_COW_VERSION) {
+        return 100;
+    } else {
+        return 0;
+    }
+}
+static int add_cow_create(const char *filename, QEMUOptionParameter *options)
+{
+    AddCowHeader header;
+    int64_t image_sectors = 0;
+    const char *backing_filename = NULL;
+    const char *image_filename = NULL;
+    int ret;
+    BlockDriverState *bs, *image_bs = NULL, *backing_bs = NULL;
+
+    while (options && options->name) {
+        if (!strcmp(options->name, BLOCK_OPT_SIZE)) {
+            image_sectors = options->value.n / BDRV_SECTOR_SIZE;
+        } else if (!strcmp(options->name, BLOCK_OPT_BACKING_FILE)) {
+            backing_filename = options->value.s;
+        } else if (!strcmp(options->name, BLOCK_OPT_IMAGE_FILE)) {
+            image_filename = options->value.s;
+        }
+        options++;
+    }
+
+    if (!backing_filename || !image_filename) {
+        error_report("Both backing_file and image_file should be given.");
+        return -EINVAL;
+    }
+
+    ret = bdrv_file_open(&image_bs, image_filename, BDRV_O_RDWR
+            | BDRV_O_CACHE_WB);
+    if (ret < 0) {
+        return ret;
+    }
+    image_sectors = image_bs->total_sectors;
+    bdrv_delete(image_bs);
+
+    ret = bdrv_file_open(&backing_bs, backing_filename, BDRV_O_RDWR
+            | BDRV_O_CACHE_WB);
+    if (ret < 0) {
+        return ret;
+    }
+    bdrv_delete(backing_bs);
+
+    ret = bdrv_create_file(filename, NULL);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = bdrv_file_open(&bs, filename, BDRV_O_RDWR);
+    if (ret < 0) {
+        return ret;
+    }
+
+    memset(&header, 0, sizeof(header));
+    header.magic = cpu_to_be64(ADD_COW_MAGIC);
+    header.version = cpu_to_be32(ADD_COW_VERSION);
+    pstrcpy(header.backing_file, sizeof(header.backing_file), backing_filename);
+    pstrcpy(header.image_file, sizeof(header.image_file), image_filename);
+
+    ret = bdrv_pwrite(bs, 0, &header, sizeof(header));
+    if (ret < 0) {
+        bdrv_delete(bs);
+        return ret;
+    }
+
+    BlockDriver *drv = bdrv_find_format("add-cow");
+    assert(drv != NULL);
+    ret = bdrv_open(bs, filename, BDRV_O_RDWR | BDRV_O_NO_FLUSH, drv);
+    if (ret < 0) {
+        bdrv_delete(bs);
+        return ret;
+    }
+
+    ret = bdrv_truncate(bs, image_sectors * BDRV_SECTOR_SIZE);
+    bdrv_delete(bs);
+    return ret;
+}
+
+static int add_cow_open(BlockDriverState *bs, int flags)
+{
+    AddCowHeader        header;
+    char                image_filename[ADD_COW_FILE_LEN];
+    BlockDriver         *image_drv = NULL;
+    int                 ret;
+    BDRVAddCowState     *s = bs->opaque;
+
+    ret = bdrv_pread(bs->file, 0, &header, sizeof(header));
+    if (ret != sizeof(header)) {
+        goto fail;
+    }
+
+    if (be64_to_cpu(header.magic) != ADD_COW_MAGIC) {
+        ret = -EINVAL;
+        goto fail;
+    }
+    if (be32_to_cpu(header.version) != ADD_COW_VERSION) {
+        char version[64];
+        snprintf(version, sizeof(version), "ADD-COW version %d",
+            be32_to_cpu(header.version));
+        qerror_report(QERR_UNKNOWN_BLOCK_FORMAT_FEATURE,
+            bs->device_name, "add-cow", version);
+        ret = -ENOTSUP;
+        goto fail;
+    }
+
+    QEMU_BUILD_BUG_ON(sizeof(bs->backing_file) != sizeof(header.backing_file));
+    pstrcpy(bs->backing_file, sizeof(bs->backing_file), header.backing_file);
+
+    if (header.image_file[0] == '\0') {
+        ret = -ENOENT;
+        goto fail;
+    }
+    header.image_file[ADD_COW_FILE_LEN - 1] = '\0';
+    s->image_hd = bdrv_new("");
+    if (path_has_protocol(header.image_file)) {
+        pstrcpy(image_filename, sizeof(image_filename), header.image_file);
+    } else {
+        path_combine(image_filename, sizeof(image_filename),
+                     bs->filename, header.image_file);
+    }
+
+    image_drv = bdrv_find_format("raw");
+    ret = bdrv_open(s->image_hd, image_filename, flags, image_drv);
+    if (ret < 0) {
+        bdrv_delete(s->image_hd);
+        goto fail;
+    }
+    bs->total_sectors = s->image_hd->total_sectors;
+    s->cluster_size = ADD_COW_CLUSTER_SIZE;
+    s->bitmap_cache = add_cow_cache_create(bs, ADD_COW_CACHE_SIZE);
+    qemu_co_mutex_init(&s->lock);
+    return 0;
+ fail:
+    return ret;
+}
+
+static void add_cow_close(BlockDriverState *bs)
+{
+    BDRVAddCowState *s = bs->opaque;
+    add_cow_cache_destroy(bs, s->bitmap_cache);
+    bdrv_delete(s->image_hd);
+}
+
+static coroutine_fn int add_cow_co_readv(BlockDriverState *bs,
+    int64_t sector_num, int remaining_sectors, QEMUIOVector *qiov)
+{
+    BDRVAddCowState *s = bs->opaque;
+    int cur_nr_sectors;
+    uint64_t bytes_done = 0;
+    int ret = 0;
+    uint8_t *table;
+    uint8_t bitmap;
+    QEMUIOVector hd_qiov;
+    qemu_iovec_init(&hd_qiov, qiov->niov);
+
+    qemu_co_mutex_lock(&s->lock);
+    while (remaining_sectors != 0) {
+        cur_nr_sectors = MIN(remaining_sectors,
+            (sector_num | 1023) - sector_num + 1);
+        qemu_iovec_reset(&hd_qiov);
+        qemu_iovec_copy(&hd_qiov, qiov, bytes_done,
+            cur_nr_sectors * BDRV_SECTOR_SIZE);
+        ret = add_cow_cache_get(bs, s->bitmap_cache,
+            sector_num & ~1023, &bitmap, (void **)&table);
+        if (ret < 0) {
+            goto fail;
+        }
+        qemu_iovec_from_buffer(&hd_qiov,
+                table + (sector_num & 1023) * BDRV_SECTOR_SIZE,
+                BDRV_SECTOR_SIZE * cur_nr_sectors);
+
+        remaining_sectors -= cur_nr_sectors;
+        sector_num += cur_nr_sectors;
+        bytes_done += cur_nr_sectors * BDRV_SECTOR_SIZE;
+    }
+    ret = 0;
+fail:
+    qemu_co_mutex_unlock(&s->lock);
+    qemu_iovec_destroy(&hd_qiov);
+    return ret;
+}
+
+static coroutine_fn int add_cow_co_writev(BlockDriverState *bs,
+        int64_t sector_num, int remaining_sectors, QEMUIOVector *qiov)
+{
+    BDRVAddCowState *s = bs->opaque;
+    int ret = 0;
+    int cur_nr_sectors;
+    QEMUIOVector hd_qiov;
+    uint64_t bytes_done = 0;
+    uint8_t *table;
+    uint8_t bitmap;
+    int64_t index;
+    int i;
+    uint8_t *cluster_data = NULL;
+    qemu_co_mutex_lock(&s->lock);
+    qemu_iovec_init(&hd_qiov, qiov->niov);
+    while (remaining_sectors != 0) {
+        index = sector_num & 1023;
+        cur_nr_sectors = MIN(remaining_sectors,
+            (sector_num | 1023) - sector_num + 1);
+
+        ret = add_cow_cache_get(bs, s->bitmap_cache,
+            sector_num & ~1023, &bitmap, (void **)&table);
+        if (ret < 0) {
+            goto fail;
+        }
+
+        cluster_data = qemu_blockalign(bs, BDRV_SECTOR_SIZE * cur_nr_sectors);
+        qemu_iovec_reset(&hd_qiov);
+        qemu_iovec_copy(&hd_qiov, qiov, bytes_done,
+            cur_nr_sectors * BDRV_SECTOR_SIZE);
+        qemu_iovec_to_buffer(&hd_qiov, cluster_data);
+
+        memcpy(table + index * BDRV_SECTOR_SIZE,
+            cluster_data,
+            BDRV_SECTOR_SIZE * cur_nr_sectors);
+        for (i = index / 128;
+            i <= (index + cur_nr_sectors - 1) / 128;
+            i++) {
+                bitmap |= 1 << i;
+        }
+        add_cow_cache_entry_mark_dirty(s->bitmap_cache,
+            bitmap,
+            table);
+        remaining_sectors -= cur_nr_sectors;
+        sector_num += cur_nr_sectors;
+        bytes_done += cur_nr_sectors * BDRV_SECTOR_SIZE;
+    }
+    ret = 0;
+fail:
+    ret = add_cow_cache_flush(bs, s->bitmap_cache);
+    if (ret < 0) {
+        goto fail;
+    }
+    qemu_co_mutex_unlock(&s->lock);
+    qemu_vfree(cluster_data);
+    qemu_iovec_destroy(&hd_qiov);
+    return ret;
+}
+
+static int bdrv_add_cow_truncate(BlockDriverState *bs, int64_t offset)
+{
+    return bdrv_truncate(bs->file,
+        sizeof(AddCowHeader) + ((offset / BDRV_SECTOR_SIZE + 1023) >> 10));
+}
+
+static coroutine_fn int add_cow_co_flush(BlockDriverState *bs)
+{
+    BDRVAddCowState *s = bs->opaque;
+    int ret;
+
+    qemu_co_mutex_lock(&s->lock);
+    ret = add_cow_cache_flush(bs, s->bitmap_cache);
+    qemu_co_mutex_unlock(&s->lock);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return bdrv_co_flush(bs->file);
+}
+
+static QEMUOptionParameter add_cow_create_options[] = {
+    {
+        .name = BLOCK_OPT_SIZE,
+        .type = OPT_SIZE,
+        .help = "Virtual disk size"
+    },
+    {
+        .name = BLOCK_OPT_BACKING_FILE,
+        .type = OPT_STRING,
+        .help = "File name of a base image"
+    },
+    {
+        .name = BLOCK_OPT_IMAGE_FILE,
+        .type = OPT_STRING,
+        .help = "File name of a image file"
+    },
+    {
+        .name = BLOCK_OPT_BACKING_FMT,
+        .type = OPT_STRING,
+        .help = "Image format of the base image"
+    },
+    { NULL }
+};
+
+static BlockDriver bdrv_add_cow = {
+    .format_name                = "add-cow",
+    .instance_size              = sizeof(BDRVAddCowState),
+    .bdrv_probe                 = add_cow_probe,
+    .bdrv_open                  = add_cow_open,
+    .bdrv_close                 = add_cow_close,
+    .bdrv_create                = add_cow_create,
+    .bdrv_co_readv              = add_cow_co_readv,
+    .bdrv_co_writev             = add_cow_co_writev,
+    .bdrv_truncate              = bdrv_add_cow_truncate,
+
+    .create_options             = add_cow_create_options,
+    .bdrv_co_flush_to_disk      = add_cow_co_flush,
+};
+
+static void bdrv_add_cow_init(void)
+{
+    bdrv_register(&bdrv_add_cow);
+}
+
+block_init(bdrv_add_cow_init);
diff --git a/block/add-cow.h b/block/add-cow.h
new file mode 100644
index 0000000..4805896
--- /dev/null
+++ b/block/add-cow.h
@@ -0,0 +1,63 @@ 
+/*
+ * QEMU ADD-COW Disk Format
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * Authors:
+ *  Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+
+#ifndef BLOCK_ADD_COW_H
+#define BLOCK_ADD_COW_H
+
+#define ADD_COW_MAGIC       (((uint64_t)'A' << 56) | ((uint64_t)'D' << 48) | \
+                            ((uint64_t)'D' << 40) | ((uint64_t)'_' << 32) | \
+                            ((uint64_t)'C' << 24) | ((uint64_t)'O' << 16) | \
+                            ((uint64_t)'W' << 8) | 0xFF)
+#define ADD_COW_VERSION         1
+#define ADD_COW_FILE_LEN        1024
+#define ADD_COW_CACHE_SIZE      16
+#define ADD_COW_CLUSTER_SIZE    65536
+
+typedef struct AddCowHeader {
+    uint64_t        magic;
+    uint32_t        version;
+    char            backing_file[ADD_COW_FILE_LEN];
+    char            image_file[ADD_COW_FILE_LEN];
+    char            reserved[500];
+} QEMU_PACKED AddCowHeader;
+
+typedef struct AddCowCachedTable {
+    /* data for image_file */
+    void    *table;
+    /* offset of bitmap */
+    int64_t offset;
+    /* bitmap indicates each cluster is dirty or not */
+    uint8_t bitmap;
+    bool    dirty;
+    int     cache_hits;
+} AddCowCachedTable;
+
+typedef struct AddCowCache {
+    AddCowCachedTable       *entries;
+    int                     size;
+} AddCowCache;
+
+typedef struct BDRVAddCowState {
+    BlockDriverState    *image_hd;
+    CoMutex             lock;
+    int                 cluster_size;
+    AddCowCache         *bitmap_cache;
+} BDRVAddCowState;
+
+AddCowCache *add_cow_cache_create(BlockDriverState *bs, int num_tables);
+void add_cow_cache_destroy(BlockDriverState *bs, AddCowCache *c);
+void add_cow_cache_entry_mark_dirty(AddCowCache *c,
+    uint8_t bitmap, void *table);
+int add_cow_cache_get(BlockDriverState *bs, AddCowCache *c,
+    uint64_t offset, uint8_t *bitmap, void **table);
+int add_cow_cache_flush(BlockDriverState *bs, AddCowCache *c);
+#endif
diff --git a/block_int.h b/block_int.h
index 0e5a032..9e0e06c 100644
--- a/block_int.h
+++ b/block_int.h
@@ -50,6 +50,7 @@ 
 #define BLOCK_OPT_TABLE_SIZE    "table_size"
 #define BLOCK_OPT_PREALLOC      "preallocation"
 #define BLOCK_OPT_SUBFMT        "subformat"
+#define BLOCK_OPT_IMAGE_FILE    "image_file"
 
 typedef struct BdrvTrackedRequest BdrvTrackedRequest;
 
diff --git a/docs/specs/add-cow.txt b/docs/specs/add-cow.txt
new file mode 100644
index 0000000..4684e5e
--- /dev/null
+++ b/docs/specs/add-cow.txt
@@ -0,0 +1,68 @@ 
+== General ==
+
+Raw file format does not support backing_file and copy on write feature.
+The add-cow image format makes it possible to use backing files with raw
+image by keeping a separate .add-cow metadata file.  Once all sectors
+have been written to in the raw image it is safe to discard the .add-cow
+and backing files and instead use the raw image directly.
+
+When using add-cow, the procedure may look like this:
+(ubuntu.img is a disk image which has been installed OS.)
+    1)  Create a raw image with the same size of ubuntu.img
+            qemu-img create -f raw test.raw 8G
+    2)  Create a add-cow image which will store dirty bitmap
+            qemu-img create -f add-cow test.add-cow -o backing_file=ubuntu.img,image_file=test.raw
+    3)  Run qemu with add-cow image
+            qemu -drive if=virtio,file=test.add-cow
+
+=Specification=
+
+The file format looks like this:
+
+ +---------------+--------------------------+
+ |     Header    |           Data           |
+ +---------------+--------------------------+
+
+All numbers in add-cow are stored in Big Endian byte order.
+
+== Header ==
+
+The Header is included in the first bytes:
+
+    Byte  0 -  7:       magic
+                        add-cow magic string ("ADD_COW\xff")
+
+          8 -  11:      version
+                        Version number (only valid value is 1 now)
+
+          12 - 1035:    backing_file
+                        backing_file file name related to add-cow file. All
+                        unused bytes are padded with zeros. Must not be longer
+                        than 1023 bytes.
+
+         1036 - 2059:   image_file
+                        image_file is a raw file. All unused bytes are padded
+                        with zeros. Must not be longer than 1023 bytes.
+
+         2060  - 2559:   The Reserved field is used to make sure Data field starts
+                        at the multiple of 512, not used currently. All bytes are
+                        filled with 0.
+
+== Data ==
+
+The Data field starts at the 2560th byte, stores a bitmap related to backing_file
+and image_file. The bitmap will track whether the sector in backing_file is dirty
+or not.
+
+Each bit in the bitmap indicates one cluster's status. One cluster includes 128
+sectors, then each bit indicates 512 * 128 = 64k bytes, So the size of bitmap is
+calculated according to virtual size of image_file. In each byte, bit 0 to 7
+will track the 1st to 8th cluster in sequence, bit orders in one byte look like:
+ +----+----+----+----+----+----+----+----+
+ | b7 | b6 | b5 | b4 | b3 | b2 | b1 | b0 |
+ +----+----+----+----+----+----+----+----+
+
+If the bit is 0, indicates the sector has not been allocated in image_file, data
+should be loaded from backing_file while reading; if the bit is 1,  indicates the
+related sector has been dirty, should be loaded from image_file while reading.
+Writing to a sector causes the corresponding bit to be set to 1.