[v4,4/6] introduce new vma archive format

Message ID 1361352723-218544-5-git-send-email-dietmar@proxmox.com
State New

Commit Message

Dietmar Maurer Feb. 20, 2013, 9:32 a.m. UTC
This is a very simple archive format, see docs/specs/vma_spec.txt

Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
---
 Makefile                |    3 +-
 Makefile.objs           |    2 +-
 backup.h                |    1 +
 blockdev.c              |    6 +-
 docs/specs/vma_spec.txt |   24 ++
 vma-reader.c            |  799 ++++++++++++++++++++++++++++++++++++++++
 vma-writer.c            |  932 +++++++++++++++++++++++++++++++++++++++++++++++
 vma.c                   |  559 ++++++++++++++++++++++++++++
 vma.h                   |  145 ++++++++
 9 files changed, 2467 insertions(+), 4 deletions(-)
 create mode 100644 docs/specs/vma_spec.txt
 create mode 100644 vma-reader.c
 create mode 100644 vma-writer.c
 create mode 100644 vma.c
 create mode 100644 vma.h

Comments

Eric Blake Feb. 21, 2013, 12:32 a.m. UTC | #1
On 02/20/2013 02:32 AM, Dietmar Maurer wrote:
> This is a very simple archive format, see docs/specs/vma_spec.txt
> 
> Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
> ---

> +++ b/docs/specs/vma_spec.txt
> @@ -0,0 +1,24 @@
> +=Virtual Machine Archive format (VMA)=
> +
> +This format contains a header which includes the VM configuration as
> +binary blobs, and a list of devices (dev_id, name).

Is there a magic number, for quickly identifying whether a file is
likely to be a vma?  What endianness are multi-byte numbers interpreted
with?  Does the overall header file leave ample room for adding later
extensions in a manner that is reliably detected as an unsupported
feature in older tools?

> +
> +The actual VM image data is stored inside extents. An extent contains
> +up to 64 clusters, and start with a 512 byte header containing
> +additional information for those clusters.

Doesn't this create alignment slowdowns on modern disks that prefer 4k
alignment?  Wouldn't it be better to have the header be 4096 bytes, so
that each of the 64 clusters in an extent is also 4096-aligned?  If you
_do_ go with a 4096 header per extent, then it might be better to go
with 16 bytes per cluster and 256 clusters per extent, instead of 8
bytes per cluster and 512 clusters per extent.

> +
> +We use a cluster size of 65536, and use 8 bytes for each
> +cluster in the header to store the following information:
> +
> +* 1 byte dev_id (to identity the drive)
> +* 2 bytes zero indicator (mark zero regions (16x4096))
> +* 4 bytes cluster number

Is that sufficient, or are we artificially limiting the maximum size of
a disk image that can be stored to 64k*4G = 128T? Again, going with
16-bytes per cluster instead of 8 bytes per cluster, to get up to a
4096-byte header alignment so that all clusters fall on nice alignment
boundaries, would leave you room to supply a 64-bit offset instead of a
32-bit cluster number.  Don't know if that would be helpful or not, but
food for thought.

> +* 1 byte not used (reserved)

Can these be rearranged to 'dev_id, reserved, zero indicator, cluster
number' to achieve natural alignment when reading the cluster number?

> +
> +We only store non-zero blocks (such block is 4096 bytes).

So if I understand correctly, your current layout is divided into
extents of  up to 4194816 bytes each (512 header, then 4M divided into
64 clusters of 64k each).  Then, if the zero indicator is all 0, then
the corresponding cluster will be 64k bytes on the wire; if it is
0x0001, then the first 4096 bytes of the corresponding cluster will be
all zeros, and the cluster itself will occupy only 60k of the vma file?
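
Taking the reading in the paragraph above at face value, the sizes can be checked with a small C sketch. This is only an illustration: the names (stored_cluster_bytes, max_extent_bytes, popcount16) are hypothetical, and the bit polarity (a set bit marks a 4096-byte block of zeros that is not stored) is the assumption stated in the question, not something taken from vma.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_SIZE 65536u  /* cluster size given in the spec text */
#define BLOCK_SIZE   4096u   /* zero-tracking granularity given in the spec text */
#define BLOCKS_PER_CLUSTER (CLUSTER_SIZE / BLOCK_SIZE)  /* 16 */

/* Count the bits set in a 16-bit mask. */
static unsigned popcount16(uint16_t mask)
{
    unsigned n = 0;

    while (mask) {
        mask &= (uint16_t)(mask - 1);  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Bytes stored in the archive for one cluster.
 * Assumption: a set bit in zero_mask means "this block is all zeros
 * and is omitted from the file".
 */
static size_t stored_cluster_bytes(uint16_t zero_mask)
{
    unsigned zero_blocks = popcount16(zero_mask);

    assert(zero_blocks <= BLOCKS_PER_CLUSTER);
    return (size_t)(BLOCKS_PER_CLUSTER - zero_blocks) * BLOCK_SIZE;
}

/* Largest possible extent: 512-byte header plus 64 fully stored clusters. */
static size_t max_extent_bytes(void)
{
    return 512u + 64u * (size_t)CLUSTER_SIZE;
}
```

With these definitions, stored_cluster_bytes(0x0000) gives the full 64k cluster, stored_cluster_bytes(0x0001) gives 60k, and max_extent_bytes() gives the 4194816 bytes mentioned above.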

> +
> +Each archive is marked with a uuid. The archive header and all
> +extent headers includes that uuid and a MD5 checksum (over header
> +data).

Layout of this header?

Hint - look at how the qcow2 file format is specified.  You need a lot
more details - enough that someone could independently implement a
program to read and create vma images that would be compatible with what
your implementation produces.

> +++ b/vma-reader.c
> @@ -0,0 +1,799 @@
> +/*
> + * VMA: Virtual Machine Archive
> + *
> + * Copyright (C) 2012 Proxmox Server Solutions

It's 2013.

> +
> +#define BITS_PER_LONG  (sizeof(unsigned long) * 8)

8 is a magic number; you should be using CHAR_BIT from limits.h instead.
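
A minimal sketch of the suggested change, assuming the macro is only used for indexing an unsigned long bitmap as in the patch. The helper names bitmap_test and bitmap_set are hypothetical and are not the functions from vma-reader.c.

```c
#include <assert.h>
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>

/* Number of bits in an unsigned long, without hard-coding 8. */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Return 1 if bit n is set in the bitmap, 0 otherwise. */
static int bitmap_test(const unsigned long *bitmap, unsigned long n)
{
    assert(bitmap != NULL);
    return !!(bitmap[n / BITS_PER_LONG] & (1UL << (n % BITS_PER_LONG)));
}

/* Set bit n in the bitmap. */
static void bitmap_set(unsigned long *bitmap, unsigned long n)
{
    assert(bitmap != NULL);
    bitmap[n / BITS_PER_LONG] |= 1UL << (n % BITS_PER_LONG);
}
```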
Dietmar Maurer Feb. 21, 2013, 8:20 a.m. UTC | #2
> > +This format contains a header which includes the VM configuration as
> > +binary blobs, and a list of devices (dev_id, name).
> 
> Is there a magic number, for quickly identifying whether a file is likely to be a
> vma?

Yes  ('VMA')

> What endianness are multi-byte numbers interpreted with?

BE

> Does the
> overall header file leave ample room for adding later extensions in a manner
> that is reliably detected as an unsupported feature in older tools?

yes

> > +The actual VM image data is stored inside extents. An extent contains
> > +up to 64 clusters, and start with a 512 byte header containing
> > +additional information for those clusters.
> 
> Doesn't this create alignment slowdowns on modern disks that prefer 4k
> alignment?

No, because it is meant to be read sequentially. Direct access to specific blocks
is not needed.

> Wouldn't it be better to have the header be 4096 bytes, so that each
> of the 64 clusters in an extent is also 4096-aligned?  If you _do_ go with a 4096
> header per extent, then it might be better to go with 16 bytes per cluster and
> 256 clusters per extent, instead of 8 bytes per cluster and 512 clusters per
> extent.

That would increase the extent size to 16MB, and would increase the memory footprint
because we need to have at least 2 extents in RAM.

But again, there is no need to align to 4096 bytes.

> > +We use a cluster size of 65536, and use 8 bytes for each cluster in
> > +the header to store the following information:
> > +
> > +* 1 byte dev_id (to identity the drive)
> > +* 2 bytes zero indicator (mark zero regions (16x4096))
> > +* 4 bytes cluster number
> 
> Is that sufficient, or are we artificially limiting the maximum size of a disk image
> that can be stored to 64k*4G = 128T? Again, going with 16-bytes per cluster
> instead of 8 bytes per cluster, to get up to a 4096-byte header alignment so that
> all clusters fall on nice alignment boundaries, would leave you room to supply a
> 64-bit offset instead of a 32-bit cluster number.  Don't know if that would be
> helpful or not, but food for thought.

Honestly, 64k*4G = 128T is not a limit for me. And we still have one byte reserved,
so we can have up to 1P per image, and up to 253 images.

For now I have no plans to backup such large VMs.
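
For reference, the address limits follow directly from the width of the cluster number and the 64 KiB cluster size. The sketch below is an illustration only, and max_image_bytes is a hypothetical helper. Note the arithmetic: a full unsigned 32-bit cluster number addresses 2^48 bytes (256 TiB), while the 128T figure quoted in this thread corresponds to 31 usable bits. How the implementation actually treats the field is not visible from the spec text.

```c
#include <assert.h>
#include <stdint.h>

#define CLUSTER_BITS 16  /* 64 KiB clusters, as in the spec text */

/*
 * Largest image size in bytes that can be addressed when the cluster
 * number is cluster_number_bits wide.
 */
static uint64_t max_image_bytes(unsigned cluster_number_bits)
{
    /* The result 2^(bits + 16) must fit in a uint64_t. */
    assert(cluster_number_bits + CLUSTER_BITS < 64);
    return UINT64_C(1) << (cluster_number_bits + CLUSTER_BITS);
}
```

Under these assumptions, 31 bits give 128 TiB, 32 bits give 256 TiB, and 40 bits (the 32-bit field plus the reserved byte) give 64 PiB.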
 
> > +* 1 byte not used (reserved)
> 
> Can these be rearranged to 'dev_id, reserved, zero indicator, cluster number' to
> achieve natural alignment when reading the cluster number?

It is already ordered that way in the code. I will fix the order in the text.
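
To make the field order concrete, here is a minimal decoding sketch for one 8-byte entry in the order 'dev_id, reserved, zero indicator, cluster number', with multi-byte fields read as big-endian as stated earlier in this reply. The ClusterEntry type and parse_cluster_entry function are hypothetical names for illustration; the authoritative layout is whatever vma.h defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one 8-byte cluster entry. */
typedef struct ClusterEntry {
    uint8_t  dev_id;       /* byte 0: identifies the drive */
    uint8_t  reserved;     /* byte 1: not used */
    uint16_t zero_mask;    /* bytes 2-3: zero indicator, big-endian */
    uint32_t cluster_num;  /* bytes 4-7: cluster number, big-endian */
} ClusterEntry;

/* Decode one entry from its on-disk bytes without relying on host endianness. */
static ClusterEntry parse_cluster_entry(const unsigned char buf[8])
{
    ClusterEntry e;

    assert(buf != NULL);
    e.dev_id = buf[0];
    e.reserved = buf[1];
    e.zero_mask = (uint16_t)((buf[2] << 8) | buf[3]);
    e.cluster_num = ((uint32_t)buf[4] << 24) | ((uint32_t)buf[5] << 16) |
                    ((uint32_t)buf[6] << 8) | (uint32_t)buf[7];
    return e;
}
```

With this order the 16-bit field starts at offset 2 and the 32-bit field at offset 4, so both are naturally aligned within the entry.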
 
> > +We only store non-zero blocks (such block is 4096 bytes).
> 
> So if I understand correctly, your current layout is divided into extents of  up to
> 4194816 bytes each (512 header, then 4M divided into
> 64 clusters of 64k each).  Then, if the zero indicator is all 0, then the
> corresponding cluster will be 64k bytes on the wire; if it is 0x0001, then the first
> 4096 bytes of the corresponding cluster will be all zeros, and the cluster itself
> will occupy only 60k of the vma file?

yes

> > +Each archive is marked with a uuid. The archive header and all extent
> > +headers includes that uuid and a MD5 checksum (over header data).
> 
> Layout of this header?

see vma.h
 
> Hint - look at how the qcow2 file format is specified.  You need a lot more
> details - enough that someone could independently implement a program to
> read and create vma images that would be compatible with what your
> implementation produces.

The format is really simple, you have the definitions in the header file,
and a reference implementation. I am quite sure any experienced developer 
can write an implementation in a few hours.

We do not talk about an extremely complex format like 'qcow2' here.

> > +++ b/vma-reader.c
> > @@ -0,0 +1,799 @@
> > +/*
> > + * VMA: Virtual Machine Archive
> > + *
> > + * Copyright (C) 2012 Proxmox Server Solutions
> 
> It's 2013.
> 
> > +
> > +#define BITS_PER_LONG  (sizeof(unsigned long) * 8)
> 
> 8 is a magic number; you should be using CHAR_BIT from limits.h instead.

Ok, will change that.
Kevin Wolf Feb. 21, 2013, 1:03 p.m. UTC | #3
On Thu, Feb 21, 2013 at 08:20:28AM +0000, Dietmar Maurer wrote:
> Honestly, 64k*4G = 128T is not a limit for me. And we still have one byte reserved,
> so we can have up to 1P per image, and up to 253 images.
> 
> For now I have no plans to backup such large VMs.

640k ought to be enough for anybody?

Code is easy enough to change that "works for now" can be good enough.
Changing file formats and external interfaces is hard, though, so better
get it right the first time. I don't want to get patches for a new
format in a year or two just because 128T was enough for the first few
users.

> > Hint - look at how the qcow2 file format is specified.  You need a lot more
> > details - enough that someone could independently implement a program to
> > read and create vma images that would be compatible with what your
> > implementation produces.
> 
> The format is really simple, you have the definitions in the header file,
> and a reference implementation. I am quite sure any experienced developer 
> can write an implementation in a few hours.
> 
> We do not talk about an extremely complex format like 'qcow2' here.

This is not an excuse for lacking details in the spec. Quite the
opposite. If someone has to look at your code in order to implement the
format, you have failed as a spec writer.

Kevin
Dietmar Maurer Feb. 21, 2013, 3:32 p.m. UTC | #4
> > For now I have no plans to backup such large VMs.
> 
> 640k ought to be enough for anybody?
> 
> Code is easy enough to change that "works for now" can be good enough.
> Changing file formats and external interfaces is hard, though, so better get it
> right the first time. I don't want to get patches for a new format in a year or two
> just because 128T was enough for the first few users.

OK, will try to remove those limits.

> > > Hint - look at how the qcow2 file format is specified.  You need a
> > > lot more details - enough that someone could independently implement
> > > a program to read and create vma images that would be compatible
> > > with what your implementation produces.
> >
> > The format is really simple, you have the definitions in the header
> > file, and a reference implementation. I am quite sure any experienced
> > developer can write an implementation in a few hours.
> >
> > We do not talk about an extremely complex format like 'qcow2' here.
> 
> This is not an excuse for lacking details in the spec. Quite the opposite. If
> someone has to look at your code in order to implement the format, you have
> failed as a spec writer.

I do not agree here. Clean source code is as good as a spec.
Eric Blake Feb. 21, 2013, 5:49 p.m. UTC | #5
On 02/21/2013 08:32 AM, Dietmar Maurer wrote:
>>>> Hint - look at how the qcow2 file format is specified.  You need a
>>>> lot more details - enough that someone could independently implement
>>>> a program to read and create vma images that would be compatible
>>>> with what your implementation produces.
>>>
>>> The format is really simple, you have the definitions in the header
>>> file, and a reference implementation. I am quite sure any experienced
>>> developer can write an implementation in a few hours.
>>>
>>> We do not talk about an extremely complex format like 'qcow2' here.
>>
>> This is not an excuse for lacking details in the spec. Quite the opposite. If
>> someone has to look at your code in order to implement the format, you have
>> failed as a spec writer.
> 
> I do not agree here. Clean source code is as good as a spec.

Not in qemu.  There's a reason that we ask for clean specs, independent
of source code.  That is the only way that we can later change the
source code to do something more efficient or to define an extension,
and still have clean documentation of what the extensions are, vs. how
an older version will behave.  The initial implementation source code
might be easy to read, but that condition is not guaranteed to last.

If reading the source code to determine the header format is as easy as
you say, then you should have no problem writing the spec as detailed as
we have asked.
Dietmar Maurer Feb. 22, 2013, 5:48 a.m. UTC | #6
> Not in qemu.  There's a reason that we ask for clean specs, independent of
> source code.  That is the only way that we can later change the source code to
> do something more efficient or to define an extension, and still have clean
> documentation of what the extensions are, vs. how an older version will behave.
> The initial implementation source code might be easy to read, but that condition
> is not guaranteed to last.
> 
> If reading the source code to determine the header format is as easy as you say,
> then you should have no problem writing the spec as detailed as we have asked.

Sure, that is no problem. But so far there is no indication that this code will be added
to qemu.
Stefan Hajnoczi Feb. 22, 2013, 10:04 a.m. UTC | #7
On Fri, Feb 22, 2013 at 05:48:33AM +0000, Dietmar Maurer wrote:
> > Not in qemu.  There's a reason that we ask for clean specs, independent of
> > source code.  That is the only way that we can later change the source code to
> > do something more efficient or to define an extension, and still have clean
> > documentation of what the extensions are, vs. how an older version will behave.
> > The initial implementation source code might be easy to read, but that condition
> > is not guaranteed to last.
> > 
> > If reading the source code to determine the header format is as easy as you say,
> > then you should have no problem writing the spec as detailed as we have asked.
> 
> Sure, that is no problem. But so far there is no indication that this code will be added
> to qemu.

FWIW the backup block job looks like a good feature and there's enough
time to get it merged for QEMU 1.5.

I'm not convinced by the backup writer part of this series, but don't
let the discussions about that discourage you.  The fact that we are
discussing it means that it's worth discussing and we just need to keep
communicating until we arrive at something that makes sense for everyone.

Stefan
Dietmar Maurer Feb. 22, 2013, 10:25 a.m. UTC | #8
> > Sure, that is no problem. But so far there is no indication that this
> > code will be added to qemu.
> 
> FWIW the backup block job looks like a good feature and there's enough time to
> get it merged for QEMU 1.5.
> 
> I'm not convinced by the backup writer part of this series, but don't let the
> discussions about that discourage you.  The fact that we are discussing it means
> that it's worth discussing and we just need to keep communicating until we
> arrive at something that makes sense for everyone.

Sure. I just want to concentrate on the first parts of the series before I start 
writing perfect documentation.
Dietmar Maurer Feb. 22, 2013, 11:21 a.m. UTC | #9
> > For now I have no plans to backup such large VMs.
> 
> 640k ought to be enough for anybody?
> 
> Code is easy enough to change that "works for now" can be good enough.
> Changing file formats and external interfaces is hard, though, so better get it
> right the first time. I don't want to get patches for a new format in a year or two
> just because 128T was enough for the first few users.

The limits are not really arbitrary - I choose them carefully to keep the overhead small.

We currently only need 8 bytes per cluster, which results in small files.

So I guess it is better to use a different format (version) if you want to store such big files.
Kevin Wolf Feb. 22, 2013, 12:24 p.m. UTC | #10
Am 22.02.2013 um 12:21 hat Dietmar Maurer geschrieben:
> > > For now I have no plans to backup such large VMs.
> > 
> > 640k ought to be enough for anybody?
> > 
> > Code is easy enough to change that "works for now" can be good enough.
> > Changing file formats and external interfaces is hard, though, so better get it
> > right the first time. I don't want to get patches for a new format in a year or two
> > just because 128T was enough for the first few users.
> 
> The limits are not really arbitrary - I choose them carefully to keep the overhead small.
> 
> We currently only need 8 bytes per cluster, which results in small files.

So how big is the metadata overhead? If you always copy 64k at once
(which is a very conservative assumption - at least for the background
copy you'll have much larger blocks), then an 8 byte header for each is
0.01%. Increasing that to 0.02% doesn't sound like a huge problem to me.
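
The percentages above can be reproduced with integer arithmetic. This is a rough sketch and overhead_ppm is a hypothetical helper. It reports per-entry metadata overhead in parts per million of the payload actually written, so that a cluster with only one stored 4096-byte block can be compared with a fully stored 64k cluster.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Metadata overhead in parts per million: entry_bytes of header per
 * payload_bytes of stored data (rounded down).
 */
static uint64_t overhead_ppm(uint64_t entry_bytes, uint64_t payload_bytes)
{
    assert(payload_bytes > 0);
    return entry_bytes * 1000000u / payload_bytes;
}
```

For a full 64k cluster this gives 122 ppm (about 0.012%) with 8-byte entries and 244 ppm (about 0.024%) with 16-byte entries. If only a single 4096-byte block of the cluster is stored, the same entries cost 1953 ppm and 3906 ppm respectively.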

And I think some other reasons were already suggested why using 16 bytes
would be better, so maybe we should just do it.

Kevin
Dietmar Maurer Feb. 22, 2013, 1:11 p.m. UTC | #11
> > We currently only need 8 bytes per cluster, which results in small files.
> 
> So how big is the metadata overhead? If you always copy 64k at once (which is a
> very conservative assumption - at least for the background copy you'll have
> much larger blocks), then an 8 byte header for each is 0.01%. Increasing that to
> 0.02% doesn't sound like a huge problem to me.

We do not always write 64K, because we omit regions filled with zeroes.
So the picture gets a bit worse when you have sparse images.

For example, if you backup a file which mostly contains empty blocks,
you will duplicate the size of the resulting file!

And this is a very common case.

Opposed to that, VM images >= 1Petabyte are very uncommon.
Kevin Wolf Feb. 22, 2013, 1:23 p.m. UTC | #12
Am 22.02.2013 um 14:11 hat Dietmar Maurer geschrieben:
> > > We currently only need 8 bytes per cluster, which results in small files.
> > 
> > So how big is the metadata overhead? If you always copy 64k at once (which is a
> > very conservative assumption - at least for the background copy you'll have
> > much larger blocks), then an 8 byte header for each is 0.01%. Increasing that to
> > 0.02% doesn't sound like a huge problem to me.
> 
> We do not always write 64K, because we omit regions filled with zeroes.
> So the picture gets a bit worse when you have sparse images.
> 
> For example, if you backup a file which mostly contains empty blocks,
> you will duplicate the size of the resulting file!

If describing an empty image takes more than a couple of bytes, you're
doing something seriously wrong. Because for that case it would be easy
enough to have an entry that just says "0 GB - 2 GB" is zero, and
that's it. 16 bytes for 2 GB of virtual disk size, sounds pretty good to
me.
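
To illustrate the kind of entry meant here, a 16-byte record holding a 64-bit byte offset and a 64-bit length is enough to describe any zero range. This is purely a hypothetical sketch of the idea and not part of the VMA format or the patch; ZeroRun, put_be64 and encode_zero_run are made-up names, and big-endian is assumed only because that is the byte order stated for VMA earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one run of zeros in the virtual disk. */
typedef struct ZeroRun {
    uint64_t offset;  /* start of the zero range, in bytes */
    uint64_t length;  /* length of the zero range, in bytes */
} ZeroRun;

/* Store a 64-bit value in big-endian byte order. */
static void put_be64(unsigned char *out, uint64_t v)
{
    int i;

    for (i = 0; i < 8; i++) {
        out[i] = (unsigned char)(v >> (56 - 8 * i));
    }
}

/* Serialize a run into 16 bytes: offset first, then length. */
static void encode_zero_run(const ZeroRun *run, unsigned char out[16])
{
    assert(run != NULL);
    assert(out != NULL);
    put_be64(out, run->offset);
    put_be64(out + 8, run->length);
}
```

A run of { 0, 2 GB } then covers the "0 GB - 2 GB" example in exactly 16 bytes.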

And it would only be needed if VMAs support backing files, because for
normal sparse blocks, I don't even see any reason why the VMA should
contain any information about them. They are never written to, so they
are by definition sparse.

Kevin
Dietmar Maurer Feb. 22, 2013, 1:52 p.m. UTC | #13
> > For example, if you backup a file which mostly contains empty blocks,
> > you will duplicate the size of the resulting file!
> 
> If describing an empty image takes more than a couple of bytes, you're doing
> something seriously wrong. 

really?

> Because for that case it would be easy enough to
> have an entry that just says "0 GB - 2 GB" is zero, and that's it. 16 bytes for 2 GB
> of virtual disk size, sounds pretty good to me.

Zero region are distributed, not continuous, in most cases.

> And it would only be needed if VMAs support backing files, because for normal
> sparse blocks, I don't even see any reason why the VMA should contain any
> information about them. They are never written to, so they are by definition
> sparse.

We track zero regions at 4K level, and Cluster size is 64K.

So I normally just use 1bit to store information about empty blocks.

I thought that is quite good, but you obviously have a better idea?
Dietmar Maurer Feb. 22, 2013, 1:57 p.m. UTC | #14
> And it would only be needed if VMAs support backing files, because for normal
> sparse blocks, I don't even see any reason why the VMA should contain any
> information about them. They are never written to, so they are by definition
> sparse.

But we also talk about regions filled with zeroes here.
Dietmar Maurer Feb. 22, 2013, 2:31 p.m. UTC | #15
> > Honestly, 64k*4G = 128T is not a limit for me. And we still have one
> > byte reserved, so we can have up to 1P per image, and up to 253 images.
> >
> > For now I have no plans to backup such large VMs.
> 
> 640k ought to be enough for anybody?
> 
> Code is easy enough to change that "works for now" can be good enough.
> Changing file formats and external interfaces is hard, though, so better get it
> right the first time. I don't want to get patches for a new format in a year or two
> just because 128T was enough for the first few users.

So what address space do you want - 64bit?
Kevin Wolf Feb. 22, 2013, 2:54 p.m. UTC | #16
Am 22.02.2013 um 15:31 hat Dietmar Maurer geschrieben:
> > > Honestly, 64k*4G = 128T is not a limit for me. And we still have one
> > > byte reserved, so we can have up to 1P per image, and up to 253 images.
> > >
> > > For now I have no plans to backup such large VMs.
> > 
> > 640k ought to be enough for anybody?
> > 
> > Code is easy enough to change that "works for now" can be good enough.
> > Changing file formats and external interfaces is hard, though, so better get it
> > right the first time. I don't want to get patches for a new format in a year or two
> > just because 128T was enough for the first few users.
> 
> So what address space do you want - 64bit?

Yes, I think 64 bits makes the most sense.

Kevin
Kevin Wolf Feb. 22, 2013, 2:56 p.m. UTC | #17
Am 22.02.2013 um 14:57 hat Dietmar Maurer geschrieben:
> > And it would only be needed if VMAs support backing files, because for normal
> > sparse blocks, I don't even see any reason why the VMA should contain any
> > information about them. They are never written to, so they are by definition
> > sparse.
> 
> But we also talk about regions filled with zeroes here.

Without backing files, this is equivalent.

Kevin
Kevin Wolf Feb. 22, 2013, 3:02 p.m. UTC | #18
Am 22.02.2013 um 14:52 hat Dietmar Maurer geschrieben:
> > > For example, if you backup a file which mostly contains empty blocks,
> > > you will duplicate the size of the resulting file!
> > 
> > If describing an empty image takes more than a couple of bytes, you're doing
> > something seriously wrong. 
> 
> really?
> 
> > Because for that case it would be easy enough to
> > have an entry that just says "0 GB - 2 GB" is zero, and that's it. 16 bytes for 2 GB
> > of virtual disk size, sounds pretty good to me.
> 
> Zero region are distributed, not continuous, in most cases.

Yes, but an image "which mostly contains empty block" has zero regions
that are way larger than 64k. Maybe you can't describe full 2 GB with 16
bytes, but quite sure some megabytes.

> > And it would only be needed if VMAs support backing files, because for normal
> > sparse blocks, I don't even see any reason why the VMA should contain any
> > information about them. They are never written to, so they are by definition
> > sparse.
> 
> We track zero regions at 4K level, and Cluster size is 64K.
> 
> So I normally just use 1bit to store information about empty blocks.
> 
> I thought that is quite good, but you obviously have a better idea?

Maybe I didn't understand right then. These zero regions are just for
describing empty sectors within non-empty clusters? But why are you
concerned about mostly empty images then?

Kevin
Dietmar Maurer Feb. 22, 2013, 4:13 p.m. UTC | #19
> > I thought that is quite good, but you obviously have a better idea?
> 
> Maybe I didn't understand right then. These zero regions are just for describing
> empty sectors within non-empty clusters? But why are you concerned about
> mostly empty images then?

They are more likely than Petabyte images.

Anyway, I guess I can support 64bit address space by using 9bytes overhead. There is
no need to use 16bytes.
Dietmar Maurer Feb. 22, 2013, 4:20 p.m. UTC | #20
> > We track zero regions at 4K level, and Cluster size is 64K.
> >
> > So I normally just use 1bit to store information about empty blocks.
> >
> > I thought that is quite good, but you obviously have a better idea?
> 
> Maybe I didn't understand right then. These zero regions are just for describing
> empty sectors within non-empty clusters? 

I currently also write an entry for empty sectors. But this is only done as additional
error check, to prove that the backup tool wrote all data.
Dietmar Maurer Feb. 22, 2013, 4:29 p.m. UTC | #21
> > > We track zero regions at 4K level, and Cluster size is 64K.
> > >
> > > So I normally just use 1bit to store information about empty blocks.
> > >
> > > I thought that is quite good, but you obviously have a better idea?
> >
> > Maybe I didn't understand right then. These zero regions are just for
> > describing empty sectors within non-empty clusters?
> 
> I currently also write an entry for empty sectors. But this is only done as
> additional error check, to prove that the backup tool wrote all data.

BTW, this can also be useful when you have a damaged backup file. The extent header
contains a checksum and special markers. All that info together lets you know
exactly which blocks are missing.

Patch

diff --git a/Makefile b/Makefile
index 0d9099a..16f1c25 100644
--- a/Makefile
+++ b/Makefile
@@ -115,7 +115,7 @@  ifeq ($(CONFIG_SMARTCARD_NSS),y)
 include $(SRC_PATH)/libcacard/Makefile
 endif
 
-all: $(DOCS) $(TOOLS) $(HELPERS-y) recurse-all
+all: $(DOCS) $(TOOLS) vma$(EXESUF) $(HELPERS-y) recurse-all
 
 config-host.h: config-host.h-timestamp
 config-host.h-timestamp: config-host.mak
@@ -167,6 +167,7 @@  qemu-img.o: qemu-img-cmds.h
 qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a
 qemu-nbd$(EXESUF): qemu-nbd.o $(block-obj-y) libqemuutil.a libqemustub.a
 qemu-io$(EXESUF): qemu-io.o cmd.o $(block-obj-y) libqemuutil.a libqemustub.a
+vma$(EXESUF): vma.o vma-writer.o vma-reader.o $(block-obj-y)  libqemuutil.a libqemustub.a
 
 qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o
 
diff --git a/Makefile.objs b/Makefile.objs
index df64f70..91f133b 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -13,7 +13,7 @@  block-obj-$(CONFIG_POSIX) += aio-posix.o
 block-obj-$(CONFIG_WIN32) += aio-win32.o
 block-obj-y += block/
 block-obj-y += qapi-types.o qapi-visit.o
-block-obj-y += backup.o
+block-obj-y += vma-writer.o backup.o
 
 block-obj-y += qemu-coroutine.o qemu-coroutine-lock.o qemu-coroutine-io.o
 block-obj-y += qemu-coroutine-sleep.o
diff --git a/backup.h b/backup.h
index c8ba153..406f011 100644
--- a/backup.h
+++ b/backup.h
@@ -15,6 +15,7 @@ 
 #define QEMU_BACKUP_H
 
 #include <uuid/uuid.h>
+#include "block/block.h"
 
 #define BACKUP_CLUSTER_BITS 16
 #define BACKUP_CLUSTER_SIZE (1<<BACKUP_CLUSTER_BITS)
diff --git a/blockdev.c b/blockdev.c
index c340fde..1cfc780 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -21,6 +21,7 @@ 
 #include "trace.h"
 #include "sysemu/arch_init.h"
 #include "backup.h"
+#include "vma.h"
 
 static QTAILQ_HEAD(drivelist, DriveInfo) drives = QTAILQ_HEAD_INITIALIZER(drives);
 
@@ -1530,10 +1531,11 @@  char *qmp_backup(const char *backup_file, bool has_format, BackupFormat format,
     /* Todo: try to auto-detect format based on file name */
     format = has_format ? format : BACKUP_FORMAT_VMA;
 
-    /* fixme: find driver for specifued format */
     const BackupDriver *driver = NULL;
 
-    if (!driver) {
+    if (format == BACKUP_FORMAT_VMA) {
+        driver = &backup_vma_driver;
+    } else {
         error_set(errp, ERROR_CLASS_GENERIC_ERROR, "unknown backup format");
         return NULL;
     }
diff --git a/docs/specs/vma_spec.txt b/docs/specs/vma_spec.txt
new file mode 100644
index 0000000..052c629
--- /dev/null
+++ b/docs/specs/vma_spec.txt
@@ -0,0 +1,24 @@ 
+=Virtual Machine Archive format (VMA)=
+
+This format contains a header which includes the VM configuration as
+binary blobs, and a list of devices (dev_id, name).
+
+The actual VM image data is stored inside extents. An extent contains
+up to 64 clusters, and start with a 512 byte header containing
+additional information for those clusters.
+
+We use a cluster size of 65536, and use 8 bytes for each
+cluster in the header to store the following information:
+
+* 1 byte dev_id (to identity the drive)
+* 2 bytes zero indicator (mark zero regions (16x4096))
+* 4 bytes cluster number
+* 1 byte not used (reserved)
+
+We only store non-zero blocks (such block is 4096 bytes).
+
+Each archive is marked with a uuid. The archive header and all
+extent headers includes that uuid and a MD5 checksum (over header
+data).
+
+
diff --git a/vma-reader.c b/vma-reader.c
new file mode 100644
index 0000000..7e81847
--- /dev/null
+++ b/vma-reader.c
@@ -0,0 +1,799 @@ 
+/*
+ * VMA: Virtual Machine Archive
+ *
+ * Copyright (C) 2012 Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <glib.h>
+#include <uuid/uuid.h>
+
+#include "qemu-common.h"
+#include "qemu/timer.h"
+#include "qemu/ratelimit.h"
+#include "vma.h"
+#include "block/block.h"
+
+#define BITS_PER_LONG  (sizeof(unsigned long) * 8)
+
+static unsigned char zero_vma_block[VMA_BLOCK_SIZE];
+
+typedef struct VmaRestoreState {
+    BlockDriverState *bs;
+    bool write_zeroes;
+    unsigned long *bitmap;
+    int bitmap_size;
+}  VmaRestoreState;
+
+struct VmaReader {
+    int fd;
+    GChecksum *md5csum;
+    GHashTable *blob_hash;
+    unsigned char *head_data;
+    VmaDeviceInfo devinfo[256];
+    VmaRestoreState rstate[256];
+    GList *cdata_list;
+    guint8 vmstate_stream;
+    uint32_t vmstate_clusters;
+    /* to show restore percentage if run with -v */
+    time_t start_time;
+    int64_t cluster_count;
+    int64_t clusters_read;
+    int clusters_read_per;
+};
+
+static guint
+g_int32_hash(gconstpointer v)
+{
+    return *(const uint32_t *)v;
+}
+
+static gboolean
+g_int32_equal(gconstpointer v1, gconstpointer v2)
+{
+    return *((const uint32_t *)v1) == *((const uint32_t *)v2);
+}
+
+static int vma_reader_get_bitmap(VmaRestoreState *rstate, int64_t cluster_num)
+{
+    assert(rstate);
+    assert(rstate->bitmap);
+
+    unsigned long val, idx, bit;
+
+    idx = cluster_num / BITS_PER_LONG;
+
+    assert(rstate->bitmap_size > idx);
+
+    bit = cluster_num % BITS_PER_LONG;
+    val = rstate->bitmap[idx];
+
+    return !!(val & (1UL << bit));
+}
+
+static void vma_reader_set_bitmap(VmaRestoreState *rstate, int64_t cluster_num,
+                                  int dirty)
+{
+    assert(rstate);
+    assert(rstate->bitmap);
+
+    unsigned long val, idx, bit;
+
+    idx = cluster_num / BITS_PER_LONG;
+
+    assert(rstate->bitmap_size > idx);
+
+    bit = cluster_num % BITS_PER_LONG;
+    val = rstate->bitmap[idx];
+    if (dirty) {
+        if (!(val & (1UL << bit))) {
+            val |= 1UL << bit;
+        }
+    } else {
+        if (val & (1UL << bit)) {
+            val &= ~(1UL << bit);
+        }
+    }
+    rstate->bitmap[idx] = val;
+}
+
+typedef struct VmaBlob {
+    uint32_t start;
+    uint32_t len;
+    void *data;
+} VmaBlob;
+
+static const VmaBlob *get_header_blob(VmaReader *vmar, uint32_t pos)
+{
+    assert(vmar);
+    assert(vmar->blob_hash);
+
+    return g_hash_table_lookup(vmar->blob_hash, &pos);
+}
+
+static const char *get_header_str(VmaReader *vmar, uint32_t pos)
+{
+    const VmaBlob *blob = get_header_blob(vmar, pos);
+    if (!blob) {
+        return NULL;
+    }
+    const char *res = (char *)blob->data;
+    if (blob->len == 0 || res[blob->len-1] != '\0') {
+        return NULL;
+    }
+    return res;
+}
+
+static ssize_t
+safe_read(int fd, unsigned char *buf, size_t count)
+{
+    ssize_t n;
+
+    do {
+        n = read(fd, buf, count);
+    } while (n < 0 && errno == EINTR);
+
+    return n;
+}
+
+static ssize_t
+full_read(int fd, unsigned char *buf, size_t len)
+{
+    ssize_t n;
+    size_t total;
+
+    total = 0;
+
+    while (len > 0) {
+        n = safe_read(fd, buf, len);
+
+        if (n == 0) {
+            return total;
+        }
+
+        if (n < 0) {
+            break;
+        }
+
+        buf += n;
+        total += n;
+        len -= n;
+    }
+
+    if (len) {
+        return -1;
+    }
+
+    return total;
+}
+
+void vma_reader_destroy(VmaReader *vmar)
+{
+    assert(vmar);
+
+    if (vmar->fd >= 0) {
+        close(vmar->fd);
+    }
+
+    if (vmar->cdata_list) {
+        g_list_free(vmar->cdata_list);
+    }
+
+    int i;
+    for (i = 1; i < 256; i++) {
+        if (vmar->rstate[i].bitmap) {
+            g_free(vmar->rstate[i].bitmap);
+        }
+    }
+
+    if (vmar->md5csum) {
+        g_checksum_free(vmar->md5csum);
+    }
+
+    if (vmar->blob_hash) {
+        g_hash_table_destroy(vmar->blob_hash);
+    }
+
+    if (vmar->head_data) {
+        g_free(vmar->head_data);
+    }
+
+    g_free(vmar);
+
+}
+
+static int vma_reader_read_head(VmaReader *vmar, Error **errp)
+{
+    assert(vmar);
+    assert(errp);
+    assert(*errp == NULL);
+
+    unsigned char md5sum[16];
+    int i;
+    int ret = 0;
+
+    vmar->head_data = g_malloc(sizeof(VmaHeader));
+
+    if (full_read(vmar->fd, vmar->head_data, sizeof(VmaHeader)) !=
+        sizeof(VmaHeader)) {
+        error_setg(errp, "can't read vma header - %s",
+                   errno ? strerror(errno) : "got EOF");
+        return -1;
+    }
+
+    VmaHeader *h = (VmaHeader *)vmar->head_data;
+
+    if (h->magic != VMA_MAGIC) {
+        error_setg(errp, "not a vma file - wrong magic number");
+        return -1;
+    }
+
+    uint32_t header_size = GUINT32_FROM_BE(h->header_size);
+    int need = header_size - sizeof(VmaHeader);
+    if (need <= 0) {
+        error_setg(errp, "wrong vma header size %u", header_size);
+        return -1;
+    }
+
+    vmar->head_data = g_realloc(vmar->head_data, header_size);
+    h = (VmaHeader *)vmar->head_data;
+
+    if (full_read(vmar->fd, vmar->head_data + sizeof(VmaHeader), need) !=
+        need) {
+        error_setg(errp, "can't read vma header data - %s",
+                   errno ? strerror(errno) : "got EOF");
+        return -1;
+    }
+
+    memcpy(md5sum, h->md5sum, 16);
+    memset(h->md5sum, 0, 16);
+
+    g_checksum_reset(vmar->md5csum);
+    g_checksum_update(vmar->md5csum, vmar->head_data, header_size);
+    gsize csize = 16;
+    g_checksum_get_digest(vmar->md5csum, (guint8 *)(h->md5sum), &csize);
+
+    if (memcmp(md5sum, h->md5sum, 16) != 0) {
+        error_setg(errp, "wrong vma header checksum");
+        return -1;
+    }
+
+    /* we can modify header data after checksum verify */
+    h->header_size = header_size;
+
+    h->version = GUINT32_FROM_BE(h->version);
+    if (h->version != 1) {
+        error_setg(errp, "wrong vma version %u", h->version);
+        return -1;
+    }
+
+    h->ctime = GUINT64_FROM_BE(h->ctime);
+    h->blob_buffer_offset = GUINT32_FROM_BE(h->blob_buffer_offset);
+    h->blob_buffer_size = GUINT32_FROM_BE(h->blob_buffer_size);
+
+    uint32_t bstart = h->blob_buffer_offset + 1;
+    uint32_t bend = h->blob_buffer_offset + h->blob_buffer_size;
+
+    if (bstart <= sizeof(VmaHeader)) {
+        error_setg(errp, "wrong vma blob buffer offset %u",
+                   h->blob_buffer_offset);
+        return -1;
+    }
+
+    if (bend > header_size) {
+        error_setg(errp, "wrong vma blob buffer size %u/%u",
+                   h->blob_buffer_offset, h->blob_buffer_size);
+        return -1;
+    }
+
+    while ((bstart + 2) <= bend) {
+        uint32_t size = vmar->head_data[bstart] +
+            (vmar->head_data[bstart+1] << 8);
+        if ((bstart + size + 2) <= bend) {
+            VmaBlob *blob = g_new0(VmaBlob, 1);
+            blob->start = bstart - h->blob_buffer_offset;
+            blob->len = size;
+            blob->data = vmar->head_data + bstart + 2;
+            g_hash_table_insert(vmar->blob_hash, &blob->start, blob);
+        }
+        bstart += size + 2;
+    }
+
+
+    int count = 0;
+    for (i = 1; i < 256; i++) {
+        VmaDeviceInfoHeader *dih = &h->dev_info[i];
+        uint32_t devname_ptr = GUINT32_FROM_BE(dih->devname_ptr);
+        uint64_t size = GUINT64_FROM_BE(dih->size);
+        const char *devname = get_header_str(vmar, devname_ptr);
+
+        if (size && devname) {
+            count++;
+            vmar->devinfo[i].size = size;
+            vmar->devinfo[i].devname = devname;
+
+            if (strcmp(devname, "vmstate") == 0) {
+                vmar->vmstate_stream = i;
+            }
+        }
+    }
+
+    if (!count) {
+        error_setg(errp, "vma does not contain data");
+        return -1;
+    }
+
+    for (i = 0; i < VMA_MAX_CONFIGS; i++) {
+        uint32_t name_ptr = GUINT32_FROM_BE(h->config_names[i]);
+        uint32_t data_ptr = GUINT32_FROM_BE(h->config_data[i]);
+
+        if (!(name_ptr && data_ptr)) {
+            continue;
+        }
+        const char *name = get_header_str(vmar, name_ptr);
+        const VmaBlob *blob = get_header_blob(vmar, data_ptr);
+
+        if (!(name && blob)) {
+            error_setg(errp, "vma contains invalid data pointers");
+            return -1;
+        }
+
+        VmaConfigData *cdata = g_new0(VmaConfigData, 1);
+        cdata->name = name;
+        cdata->data = blob->data;
+        cdata->len = blob->len;
+
+        vmar->cdata_list = g_list_append(vmar->cdata_list, cdata);
+    }
+
+    return ret;
+}
+
+VmaReader *vma_reader_create(const char *filename, Error **errp)
+{
+    assert(filename);
+    assert(errp);
+
+    VmaReader *vmar = g_new0(VmaReader, 1);
+
+    if (strcmp(filename, "-") == 0) {
+        vmar->fd = dup(0);
+    } else {
+        vmar->fd = open(filename, O_RDONLY);
+    }
+
+    if (vmar->fd < 0) {
+        error_setg(errp, "can't open file %s - %s", filename,
+                   strerror(errno));
+        goto err;
+    }
+
+    vmar->md5csum = g_checksum_new(G_CHECKSUM_MD5);
+    if (!vmar->md5csum) {
+        error_setg(errp, "can't allocate checksum");
+        goto err;
+    }
+
+    vmar->blob_hash = g_hash_table_new_full(g_int32_hash, g_int32_equal,
+                                            NULL, g_free);
+
+    if (vma_reader_read_head(vmar, errp) < 0) {
+        goto err;
+    }
+
+    return vmar;
+
+err:
+    if (vmar) {
+        vma_reader_destroy(vmar);
+    }
+
+    return NULL;
+}
+
+VmaHeader *vma_reader_get_header(VmaReader *vmar)
+{
+    assert(vmar);
+    assert(vmar->head_data);
+
+    return (VmaHeader *)(vmar->head_data);
+}
+
+GList *vma_reader_get_config_data(VmaReader *vmar)
+{
+    assert(vmar);
+    assert(vmar->head_data);
+
+    return vmar->cdata_list;
+}
+
+VmaDeviceInfo *vma_reader_get_device_info(VmaReader *vmar, guint8 dev_id)
+{
+    assert(vmar);
+    assert(dev_id);
+
+    if (vmar->devinfo[dev_id].size && vmar->devinfo[dev_id].devname) {
+        return &vmar->devinfo[dev_id];
+    }
+
+    return NULL;
+}
+
+int vma_reader_register_bs(VmaReader *vmar, guint8 dev_id, BlockDriverState *bs,
+                           bool write_zeroes, Error **errp)
+{
+    assert(vmar);
+    assert(bs != NULL);
+    assert(dev_id);
+    assert(vmar->rstate[dev_id].bs == NULL);
+
+    int64_t size = bdrv_getlength(bs);
+    if (size != vmar->devinfo[dev_id].size) {
+        error_setg(errp, "vma_reader_register_bs for stream %s failed - "
+                   "unexpected size %zd != %zd", vmar->devinfo[dev_id].devname,
+                   size, vmar->devinfo[dev_id].size);
+        return -1;
+    }
+
+    vmar->rstate[dev_id].bs = bs;
+    vmar->rstate[dev_id].write_zeroes = write_zeroes;
+
+    int64_t bitmap_size = (size/BDRV_SECTOR_SIZE) +
+        (VMA_CLUSTER_SIZE/BDRV_SECTOR_SIZE) * BITS_PER_LONG - 1;
+    bitmap_size /= (VMA_CLUSTER_SIZE/BDRV_SECTOR_SIZE) * BITS_PER_LONG;
+
+    vmar->rstate[dev_id].bitmap_size = bitmap_size;
+    vmar->rstate[dev_id].bitmap = g_new0(unsigned long, bitmap_size);
+
+    vmar->cluster_count += (size + VMA_CLUSTER_SIZE - 1)/VMA_CLUSTER_SIZE;
+
+    return 0;
+}
+
+static ssize_t safe_write(int fd, void *buf, size_t count)
+{
+    ssize_t n;
+
+    do {
+        n = write(fd, buf, count);
+    } while (n < 0 && errno == EINTR);
+
+    return n;
+}
+
+static ssize_t full_write(int fd, void *buf, size_t len)
+{
+    ssize_t n;
+    size_t total;
+
+    total = 0;
+
+    while (len > 0) {
+        n = safe_write(fd, buf, len);
+        if (n < 0) {
+            return n;
+        }
+        buf += n;
+        total += n;
+        len -= n;
+    }
+
+    if (len) {
+        /* incomplete write ? */
+        return -1;
+    }
+
+    return total;
+}
+
+static int restore_write_data(VmaReader *vmar, guint8 dev_id,
+                              BlockDriverState *bs, int vmstate_fd,
+                              unsigned char *buf, int64_t sector_num,
+                              int nb_sectors, Error **errp)
+{
+    assert(vmar);
+
+    if (dev_id == vmar->vmstate_stream) {
+        if (vmstate_fd >= 0) {
+            int len = nb_sectors * BDRV_SECTOR_SIZE;
+            int res = full_write(vmstate_fd, buf, len);
+            if (res < 0) {
+                error_setg(errp, "write vmstate failed %d", res);
+                return -1;
+            }
+        }
+    } else {
+        int res = bdrv_write(bs, sector_num, buf, nb_sectors);
+        if (res < 0) {
+            error_setg(errp, "bdrv_write to %s failed (%d)",
+                       bdrv_get_device_name(bs), res);
+            return -1;
+        }
+    }
+    return 0;
+}
+static int restore_extent(VmaReader *vmar, unsigned char *buf,
+                          int extent_size, int vmstate_fd,
+                          bool verbose, Error **errp)
+{
+    assert(vmar);
+    assert(buf);
+
+    VmaExtentHeader *ehead = (VmaExtentHeader *)buf;
+    int start = VMA_EXTENT_HEADER_SIZE;
+    int i;
+
+    for (i = 0; i < VMA_BLOCKS_PER_EXTENT; i++) {
+        uint64_t block_info = GUINT64_FROM_BE(ehead->blockinfo[i]);
+        uint64_t cluster_num = block_info & 0xffffffff;
+        uint8_t dev_id = (block_info >> 32) & 0xff;
+        uint16_t mask = block_info >> (32+16);
+        int64_t max_sector;
+
+        if (!dev_id) {
+            continue;
+        }
+
+        VmaRestoreState *rstate = &vmar->rstate[dev_id];
+        BlockDriverState *bs = NULL;
+
+        if (dev_id != vmar->vmstate_stream) {
+            bs = rstate->bs;
+            if (!bs) {
+                error_setg(errp, "got wrong dev id %d", dev_id);
+                return -1;
+            }
+
+            if (vma_reader_get_bitmap(rstate, cluster_num)) {
+                error_setg(errp, "duplicate cluster %" PRIu64 " in stream %s",
+                           cluster_num, vmar->devinfo[dev_id].devname);
+                return -1;
+            }
+            vma_reader_set_bitmap(rstate, cluster_num, 1);
+
+            max_sector = vmar->devinfo[dev_id].size/BDRV_SECTOR_SIZE;
+        } else {
+            max_sector = G_MAXINT64;
+            if (cluster_num != vmar->vmstate_clusters) {
+                error_setg(errp, "found out of order vmstate data");
+                return -1;
+            }
+            vmar->vmstate_clusters++;
+        }
+
+        vmar->clusters_read++;
+
+        if (verbose) {
+            time_t duration = time(NULL) - vmar->start_time;
+            int percent = (vmar->clusters_read*100)/vmar->cluster_count;
+            if (percent != vmar->clusters_read_per) {
+                printf("progress %d%% (read %lld bytes, duration %ld sec)\n",
+                       percent, (long long)vmar->clusters_read*VMA_CLUSTER_SIZE,
+                       (long)duration);
+                fflush(stdout);
+                vmar->clusters_read_per = percent;
+            }
+        }
+
+        /* try to write whole clusters to speedup restore */
+        if (mask == 0xffff) {
+            if ((start + VMA_CLUSTER_SIZE) > extent_size) {
+                error_setg(errp, "short vma extent - too many blocks");
+                return -1;
+            }
+            int64_t sector_num = (cluster_num * VMA_CLUSTER_SIZE) /
+                BDRV_SECTOR_SIZE;
+            int64_t end_sector = sector_num +
+                VMA_CLUSTER_SIZE/BDRV_SECTOR_SIZE;
+
+            if (end_sector > max_sector) {
+                end_sector = max_sector;
+            }
+
+            if (end_sector <= sector_num) {
+                error_setg(errp, "got wrong block address - write beyond end");
+                return -1;
+            }
+
+            int nb_sectors = end_sector - sector_num;
+            if (restore_write_data(vmar, dev_id, bs, vmstate_fd, buf + start,
+                                   sector_num, nb_sectors, errp) < 0) {
+                return -1;
+            }
+
+            start += VMA_CLUSTER_SIZE;
+        } else {
+            int j;
+            int bit = 1;
+
+            for (j = 0; j < 16; j++) {
+                int64_t sector_num = (cluster_num*VMA_CLUSTER_SIZE +
+                                      j*VMA_BLOCK_SIZE)/BDRV_SECTOR_SIZE;
+
+                int64_t end_sector = sector_num +
+                    VMA_BLOCK_SIZE/BDRV_SECTOR_SIZE;
+                if (end_sector > max_sector) {
+                    end_sector = max_sector;
+                }
+
+                if (mask & bit) {
+                    if ((start + VMA_BLOCK_SIZE) > extent_size) {
+                        error_setg(errp, "short vma extent - too many blocks");
+                        return -1;
+                    }
+
+                    if (end_sector <= sector_num) {
+                        error_setg(errp, "got wrong block address - "
+                                   "write beyond end");
+                        return -1;
+                    }
+
+                    int nb_sectors = end_sector - sector_num;
+                    if (restore_write_data(vmar, dev_id, bs, vmstate_fd,
+                                           buf + start, sector_num,
+                                           nb_sectors, errp) < 0) {
+                        return -1;
+                    }
+
+                    start += VMA_BLOCK_SIZE;
+
+                } else {
+
+                    if (rstate->write_zeroes && (end_sector > sector_num)) {
+                        /* TODO: use bdrv_co_write_zeroes (but that needs to
+                         * run inside a coroutine?)
+                         */
+                        int nb_sectors = end_sector - sector_num;
+                        if (restore_write_data(vmar, dev_id, bs, vmstate_fd,
+                                               zero_vma_block, sector_num,
+                                               nb_sectors, errp) < 0) {
+                            return -1;
+                        }
+                    }
+                }
+
+                bit = bit << 1;
+            }
+        }
+    }
+
+    if (start != extent_size) {
+        error_setg(errp, "vma extent error - missing blocks");
+        return -1;
+    }
+
+    return 0;
+}
+
+int vma_reader_restore(VmaReader *vmar, int vmstate_fd, bool verbose,
+                       Error **errp)
+{
+    assert(vmar);
+    assert(vmar->head_data);
+
+    int ret = 0;
+    unsigned char buf[VMA_MAX_EXTENT_SIZE];
+    int buf_pos = 0;
+    unsigned char md5sum[16];
+    VmaHeader *h = (VmaHeader *)vmar->head_data;
+
+    vmar->start_time = time(NULL);
+
+    while (1) {
+        int bytes = full_read(vmar->fd, buf + buf_pos, sizeof(buf) - buf_pos);
+        if (bytes < 0) {
+            error_setg(errp, "read failed - %s", strerror(errno));
+            return -1;
+        }
+
+        buf_pos += bytes;
+
+        if (!buf_pos) {
+            break; /* EOF */
+        }
+
+        if (buf_pos < VMA_EXTENT_HEADER_SIZE) {
+            error_setg(errp, "read short extent (%d bytes)", buf_pos);
+            return -1;
+        }
+
+        VmaExtentHeader *ehead = (VmaExtentHeader *)buf;
+
+        /* extract md5sum */
+        memcpy(md5sum, ehead->md5sum, sizeof(ehead->md5sum));
+        memset(ehead->md5sum, 0, sizeof(ehead->md5sum));
+
+        g_checksum_reset(vmar->md5csum);
+        g_checksum_update(vmar->md5csum, buf, VMA_EXTENT_HEADER_SIZE);
+        gsize csize = 16;
+        g_checksum_get_digest(vmar->md5csum, ehead->md5sum, &csize);
+
+        if (memcmp(md5sum, ehead->md5sum, 16) != 0) {
+            error_setg(errp, "wrong vma extent header checksum");
+            return -1;
+        }
+
+        if (memcmp(h->uuid, ehead->uuid, sizeof(ehead->uuid)) != 0) {
+            error_setg(errp, "wrong vma extent uuid");
+            return -1;
+        }
+
+        if (ehead->magic != VMA_EXTENT_MAGIC || ehead->reserved1 != 0) {
+            error_setg(errp, "wrong vma extent header magic");
+            return -1;
+        }
+
+        int block_count = GUINT16_FROM_BE(ehead->block_count);
+        int extent_size = VMA_EXTENT_HEADER_SIZE + block_count*VMA_BLOCK_SIZE;
+
+        if (buf_pos < extent_size) {
+            error_setg(errp, "short vma extent (%d < %d)", buf_pos,
+                       extent_size);
+            return -1;
+        }
+
+        if (restore_extent(vmar, buf, extent_size, vmstate_fd, verbose,
+                           errp) < 0) {
+            return -1;
+        }
+
+        if (buf_pos > extent_size) {
+            memmove(buf, buf + extent_size, buf_pos - extent_size);
+            buf_pos = buf_pos - extent_size;
+        } else {
+            buf_pos = 0;
+        }
+    }
+
+    bdrv_drain_all();
+
+    int i;
+    for (i = 1; i < 256; i++) {
+        VmaRestoreState *rstate = &vmar->rstate[i];
+        if (!rstate->bs) {
+            continue;
+        }
+
+        if (bdrv_flush(rstate->bs) < 0) {
+            error_setg(errp, "vma bdrv_flush %s failed",
+                       vmar->devinfo[i].devname);
+            return -1;
+        }
+
+        if (vmar->devinfo[i].size &&
+            (strcmp(vmar->devinfo[i].devname, "vmstate") != 0)) {
+            assert(rstate->bitmap);
+
+            int64_t cluster_num, end;
+
+            end = (vmar->devinfo[i].size + VMA_CLUSTER_SIZE - 1) /
+                VMA_CLUSTER_SIZE;
+
+            for (cluster_num = 0; cluster_num < end; cluster_num++) {
+                if (!vma_reader_get_bitmap(rstate, cluster_num)) {
+                    error_setg(errp, "detected missing cluster %" PRId64 " "
+                               "for stream %s", cluster_num,
+                               vmar->devinfo[i].devname);
+                    return -1;
+                }
+            }
+        }
+    }
+
+    return ret;
+}
+
diff --git a/vma-writer.c b/vma-writer.c
new file mode 100644
index 0000000..761d7ca
--- /dev/null
+++ b/vma-writer.c
@@ -0,0 +1,932 @@ 
+/*
+ * VMA: Virtual Machine Archive
+ *
+ * Copyright (C) 2012 Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <glib.h>
+#include <uuid/uuid.h>
+
+#include "qemu-common.h"
+#include "vma.h"
+#include "block/block.h"
+#include "monitor/monitor.h"
+
+#define DEBUG_VMA 0
+
+#define DPRINTF(fmt, ...)\
+    do { if (DEBUG_VMA) { printf("vma: " fmt, ## __VA_ARGS__); } } while (0)
+
+#define WRITE_BUFFERS 5
+
+typedef struct VmaAIOCB VmaAIOCB;
+struct VmaAIOCB {
+    unsigned char buffer[VMA_MAX_EXTENT_SIZE];
+    VmaWriter *vmaw;
+    size_t bytes;
+    Coroutine *co;
+};
+
+struct VmaWriter {
+    int fd;
+    FILE *cmd;
+    int status;
+    char errmsg[8192];
+    uuid_t uuid;
+    bool header_written;
+    bool closed;
+
+    /* we always write extents */
+    unsigned char outbuf[VMA_MAX_EXTENT_SIZE];
+    int outbuf_pos; /* in bytes */
+    int outbuf_count; /* in VMA_BLOCKS */
+    uint64_t outbuf_block_info[VMA_BLOCKS_PER_EXTENT];
+
+    VmaAIOCB *aiocbs[WRITE_BUFFERS];
+    CoQueue wqueue;
+
+    GChecksum *md5csum;
+    CoMutex writer_lock;
+    CoMutex flush_lock;
+    Coroutine *co_writer;
+
+    /* drive information */
+    VmaStreamInfo stream_info[256];
+    guint stream_count;
+
+    guint8 vmstate_stream;
+    uint32_t vmstate_clusters;
+
+    /* header blob table */
+    char *header_blob_table;
+    uint32_t header_blob_table_size;
+    uint32_t header_blob_table_pos;
+
+    /* store for config blobs */
+    uint32_t config_names[VMA_MAX_CONFIGS]; /* offset into blob_buffer table */
+    uint32_t config_data[VMA_MAX_CONFIGS];  /* offset into blob_buffer table */
+    uint32_t config_count;
+};
+
+void vma_writer_set_error(VmaWriter *vmaw, const char *fmt, ...)
+{
+    va_list ap;
+
+    if (vmaw->status < 0) {
+        return;
+    }
+
+    vmaw->status = -1;
+
+    va_start(ap, fmt);
+    g_vsnprintf(vmaw->errmsg, sizeof(vmaw->errmsg), fmt, ap);
+    va_end(ap);
+
+    DPRINTF("vma_writer_set_error: %s\n", vmaw->errmsg);
+}
+
+static uint32_t allocate_header_blob(VmaWriter *vmaw, const char *data,
+                                     size_t len)
+{
+    if (len > 65535) {
+        return 0;
+    }
+
+    if (!vmaw->header_blob_table ||
+        (vmaw->header_blob_table_size <
+         (vmaw->header_blob_table_pos + len + 2))) {
+        int newsize = ((vmaw->header_blob_table_pos + len + 2 + 511)/512)*512;
+
+        vmaw->header_blob_table = g_realloc(vmaw->header_blob_table, newsize);
+        memset(vmaw->header_blob_table + vmaw->header_blob_table_size,
+               0, newsize - vmaw->header_blob_table_size);
+        vmaw->header_blob_table_size = newsize;
+    }
+
+    uint32_t cpos = vmaw->header_blob_table_pos;
+    vmaw->header_blob_table[cpos] = len & 255;
+    vmaw->header_blob_table[cpos+1] = (len >> 8) & 255;
+    memcpy(vmaw->header_blob_table + cpos + 2, data, len);
+    vmaw->header_blob_table_pos += len + 2;
+    return cpos;
+}
+
+static uint32_t allocate_header_string(VmaWriter *vmaw, const char *str)
+{
+    assert(vmaw);
+
+    size_t len = strlen(str) + 1;
+
+    return allocate_header_blob(vmaw, str, len);
+}
+
+int vma_writer_add_config(VmaWriter *vmaw, const char *name, gpointer data,
+                          gsize len)
+{
+    assert(vmaw);
+    assert(!vmaw->header_written);
+    assert(vmaw->config_count < VMA_MAX_CONFIGS);
+    assert(name);
+    assert(data);
+    assert(len);
+
+    uint32_t name_ptr = allocate_header_string(vmaw, name);
+    if (!name_ptr) {
+        return -1;
+    }
+
+    uint32_t data_ptr = allocate_header_blob(vmaw, data, len);
+    if (!data_ptr) {
+        return -1;
+    }
+
+    vmaw->config_names[vmaw->config_count] = name_ptr;
+    vmaw->config_data[vmaw->config_count] = data_ptr;
+
+    vmaw->config_count++;
+
+    return 0;
+}
+
+int vma_writer_register_stream(VmaWriter *vmaw, const char *devname,
+                               size_t size)
+{
+    assert(vmaw);
+    assert(devname);
+    assert(!vmaw->status);
+
+    if (vmaw->header_written) {
+        vma_writer_set_error(vmaw, "vma_writer_register_stream: header "
+                             "already written");
+        return -1;
+    }
+
+    guint n = vmaw->stream_count + 1;
+
+    /* we can have dev_ids from 1 to 255 (0 reserved)
+     * 255(-1) reserved for safety
+     */
+    if (n > 254) {
+        vma_writer_set_error(vmaw, "vma_writer_register_stream: "
+                             "too many drives");
+        return -1;
+    }
+
+    if (size == 0) {
+        vma_writer_set_error(vmaw, "vma_writer_register_stream: "
+                             "got strange size %zu", size);
+        return -1;
+    }
+
+    DPRINTF("vma_writer_register_stream %s %zu %u\n", devname, size, n);
+
+    vmaw->stream_info[n].devname = g_strdup(devname);
+    vmaw->stream_info[n].size = size;
+
+    vmaw->stream_info[n].cluster_count = (size + VMA_CLUSTER_SIZE - 1) /
+        VMA_CLUSTER_SIZE;
+
+    vmaw->stream_count = n;
+
+    if (strcmp(devname, "vmstate") == 0) {
+        vmaw->vmstate_stream = n;
+    }
+
+    return n;
+}
+
+static void vma_co_continue_write(void *opaque)
+{
+    VmaWriter *vmaw = opaque;
+
+    DPRINTF("vma_co_continue_write\n");
+    qemu_coroutine_enter(vmaw->co_writer, NULL);
+}
+
+static int vma_co_write_finished(void *opaque)
+{
+    VmaWriter *vmaw = opaque;
+
+    return (vmaw->co_writer != 0);
+}
+
+static ssize_t coroutine_fn
+vma_co_write(VmaWriter *vmaw, const void *buf, size_t bytes)
+{
+    size_t done = 0;
+    ssize_t ret;
+
+    /* atomic writes (we cannot interleave writes) */
+    qemu_co_mutex_lock(&vmaw->writer_lock);
+
+    DPRINTF("vma_co_write enter %zd\n", bytes);
+
+    assert(vmaw->co_writer == NULL);
+
+    vmaw->co_writer = qemu_coroutine_self();
+
+    qemu_aio_set_fd_handler(vmaw->fd, NULL, vma_co_continue_write,
+                            vma_co_write_finished, vmaw);
+
+    DPRINTF("vma_co_write wait until writable\n");
+    qemu_coroutine_yield();
+    DPRINTF("vma_co_write starting %zd\n", bytes);
+
+    while (done < bytes) {
+        ret = write(vmaw->fd, buf + done, bytes - done);
+        if (ret > 0) {
+            done += ret;
+            DPRINTF("vma_co_write written %zd %zd\n", done, ret);
+        } else if (ret < 0) {
+            if (errno == EAGAIN || errno == EWOULDBLOCK) {
+                DPRINTF("vma_co_write yield %zd\n", done);
+                qemu_coroutine_yield();
+                DPRINTF("vma_co_write restart %zd\n", done);
+            } else {
+                vma_writer_set_error(vmaw, "vma_co_write write error - %s",
+                                     strerror(errno));
+                done = -1; /* always return failure for partial writes */
+                break;
+            }
+        } else if (ret == 0) {
+            /* should not happen - simply try again */
+        }
+    }
+
+    qemu_aio_set_fd_handler(vmaw->fd, NULL, NULL, NULL, NULL);
+
+    vmaw->co_writer = NULL;
+
+    qemu_co_mutex_unlock(&vmaw->writer_lock);
+
+    DPRINTF("vma_co_write leave %zd\n", done);
+    return done;
+}
+
+static void coroutine_fn vma_co_writer_task(void *opaque)
+{
+    VmaAIOCB *cb = opaque;
+
+    DPRINTF("vma_co_writer_task start\n");
+
+    ssize_t done = vma_co_write(cb->vmaw, cb->buffer, cb->bytes);
+    DPRINTF("vma_co_writer_task write done %zd\n", done);
+
+    if (done != cb->bytes) {
+        DPRINTF("vma_co_writer_task failed write %zu %zd\n", cb->bytes, done);
+        vma_writer_set_error(cb->vmaw, "vma_co_writer_task failed write %zd",
+                             done);
+    }
+
+    cb->bytes = 0;
+
+    qemu_co_queue_next(&cb->vmaw->wqueue);
+
+    DPRINTF("vma_co_writer_task end\n");
+}
+
+static void coroutine_fn vma_queue_flush(VmaWriter *vmaw)
+{
+    DPRINTF("vma_queue_flush enter\n");
+
+    assert(vmaw);
+
+    while (1) {
+        int i;
+        VmaAIOCB *cb = NULL;
+        for (i = 0; i < WRITE_BUFFERS; i++) {
+            if (vmaw->aiocbs[i]->bytes) {
+                cb = vmaw->aiocbs[i];
+                DPRINTF("FOUND USED AIO BUFFER %d %zd\n", i,
+                        vmaw->aiocbs[i]->bytes);
+                break;
+            }
+        }
+        if (!cb) {
+            break;
+        }
+        qemu_co_queue_wait(&vmaw->wqueue);
+    }
+
+    DPRINTF("vma_queue_flush leave\n");
+}
+
+/**
+ * NOTE: pipe buffer size is only 4096 bytes on Linux (see 'ulimit -a')
+ * So we need to create a coroutine to allow 'parallel' execution.
+ */
+static ssize_t coroutine_fn
+vma_queue_write(VmaWriter *vmaw, const void *buf, size_t bytes)
+{
+    DPRINTF("vma_queue_write enter %zd\n", bytes);
+
+    assert(vmaw);
+    assert(buf);
+    assert(bytes <= VMA_MAX_EXTENT_SIZE);
+
+    VmaAIOCB *cb = NULL;
+    while (!cb) {
+        int i;
+        for (i = 0; i < WRITE_BUFFERS; i++) {
+            if (!vmaw->aiocbs[i]->bytes) {
+                cb = vmaw->aiocbs[i];
+                break;
+            }
+        }
+        if (!cb) {
+            qemu_co_queue_wait(&vmaw->wqueue);
+        }
+    }
+
+    memcpy(cb->buffer, buf, bytes);
+    cb->bytes = bytes;
+    cb->vmaw = vmaw;
+
+    DPRINTF("vma_queue_write start %zd\n", bytes);
+    cb->co = qemu_coroutine_create(vma_co_writer_task);
+    qemu_coroutine_enter(cb->co, cb);
+
+    DPRINTF("vma_queue_write leave\n");
+
+    return bytes;
+}
+
+VmaWriter *vma_writer_create(const char *filename, uuid_t uuid, Error **errp)
+{
+    const char *p;
+
+    assert(sizeof(VmaHeader) == (4096 + 8192));
+    assert(sizeof(VmaExtentHeader) == 512);
+
+    VmaWriter *vmaw = g_new0(VmaWriter, 1);
+    vmaw->fd = -1;
+
+    vmaw->md5csum = g_checksum_new(G_CHECKSUM_MD5);
+    if (!vmaw->md5csum) {
+        error_setg(errp, "can't allocate checksum");
+        goto err;
+    }
+
+    if (strstart(filename, "exec:", &p)) {
+        vmaw->cmd = popen(p, "w");
+        if (vmaw->cmd == NULL) {
+            error_setg(errp, "can't popen command '%s' - %s", p,
+                       strerror(errno));
+            goto err;
+        }
+        vmaw->fd = fileno(vmaw->cmd);
+
+        /* try to use O_NONBLOCK and O_DIRECT */
+        fcntl(vmaw->fd, F_SETFL, fcntl(vmaw->fd, F_GETFL)|O_NONBLOCK);
+        fcntl(vmaw->fd, F_SETFL, fcntl(vmaw->fd, F_GETFL)|O_DIRECT);
+
+    } else {
+        struct stat st;
+        int oflags;
+        const char *tmp_id_str;
+
+        if ((stat(filename, &st) == 0) && S_ISFIFO(st.st_mode)) {
+            oflags = O_NONBLOCK|O_DIRECT|O_WRONLY;
+            vmaw->fd = qemu_open(filename, oflags, 0644);
+        } else if (strstart(filename, "/dev/fdset/", &tmp_id_str)) {
+            oflags = O_NONBLOCK|O_DIRECT|O_WRONLY;
+            vmaw->fd = qemu_open(filename, oflags, 0644);
+        } else if (strstart(filename, "/dev/fdname/", &tmp_id_str)) {
+            vmaw->fd = monitor_get_fd(cur_mon, tmp_id_str, errp);
+            if (vmaw->fd < 0) {
+                goto err;
+            }
+            /* try to use O_NONBLOCK and O_DIRECT */
+            fcntl(vmaw->fd, F_SETFL, fcntl(vmaw->fd, F_GETFL)|O_NONBLOCK);
+            fcntl(vmaw->fd, F_SETFL, fcntl(vmaw->fd, F_GETFL)|O_DIRECT);
+        } else {
+            oflags = O_NONBLOCK|O_DIRECT|O_WRONLY|O_CREAT|O_EXCL;
+            vmaw->fd = qemu_open(filename, oflags, 0644);
+        }
+
+        if (vmaw->fd < 0) {
+            error_setg(errp, "can't open file %s - %s", filename,
+                       strerror(errno));
+            goto err;
+        }
+    }
+
+    /* we use O_DIRECT, so we need to align IO buffers */
+    int i;
+    for (i = 0; i < WRITE_BUFFERS; i++) {
+        vmaw->aiocbs[i] = qemu_memalign(512, sizeof(VmaAIOCB));
+        memset(vmaw->aiocbs[i], 0, sizeof(VmaAIOCB));
+    }
+
+    vmaw->outbuf_count = 0;
+    vmaw->outbuf_pos = VMA_EXTENT_HEADER_SIZE;
+
+    vmaw->header_blob_table_pos = 1; /* start at pos 1 */
+
+    qemu_co_mutex_init(&vmaw->writer_lock);
+    qemu_co_mutex_init(&vmaw->flush_lock);
+    qemu_co_queue_init(&vmaw->wqueue);
+
+    uuid_copy(vmaw->uuid, uuid);
+
+    return vmaw;
+
+err:
+    if (vmaw) {
+        if (vmaw->cmd) {
+            pclose(vmaw->cmd);
+        } else if (vmaw->fd >= 0) {
+            close(vmaw->fd);
+        }
+
+        if (vmaw->md5csum) {
+            g_checksum_free(vmaw->md5csum);
+        }
+
+        g_free(vmaw);
+    }
+
+    return NULL;
+}
+
+static int coroutine_fn vma_write_header(VmaWriter *vmaw)
+{
+    assert(vmaw);
+    int header_clusters = 8;
+    char buf[65536*header_clusters];
+    VmaHeader *head = (VmaHeader *)buf;
+
+    int i;
+
+    DPRINTF("VMA WRITE HEADER\n");
+
+    if (vmaw->status < 0) {
+        return vmaw->status;
+    }
+
+    memset(buf, 0, sizeof(buf));
+
+    head->magic = VMA_MAGIC;
+    head->version = GUINT32_TO_BE(1); /* v1 */
+    memcpy(head->uuid, vmaw->uuid, 16);
+
+    time_t ctime = time(NULL);
+    head->ctime = GUINT64_TO_BE(ctime);
+
+    if (!vmaw->stream_count) {
+        return -1;
+    }
+
+    for (i = 0; i < VMA_MAX_CONFIGS; i++) {
+        head->config_names[i] = GUINT32_TO_BE(vmaw->config_names[i]);
+        head->config_data[i] = GUINT32_TO_BE(vmaw->config_data[i]);
+    }
+
+    /* 32 bytes per device (12 used currently) = 8192 bytes max */
+    for (i = 1; i <= 254; i++) {
+        VmaStreamInfo *si = &vmaw->stream_info[i];
+        if (si->size) {
+            assert(si->devname);
+            uint32_t devname_ptr = allocate_header_string(vmaw, si->devname);
+            if (!devname_ptr) {
+                return -1;
+            }
+            head->dev_info[i].devname_ptr = GUINT32_TO_BE(devname_ptr);
+            head->dev_info[i].size = GUINT64_TO_BE(si->size);
+        }
+    }
+
+    uint32_t header_size = sizeof(VmaHeader) + vmaw->header_blob_table_size;
+    head->header_size = GUINT32_TO_BE(header_size);
+
+    if (header_size > sizeof(buf)) {
+        return -1; /* just to be sure */
+    }
+
+    uint32_t blob_buffer_offset = sizeof(VmaHeader);
+    memcpy(buf + blob_buffer_offset, vmaw->header_blob_table,
+           vmaw->header_blob_table_size);
+    head->blob_buffer_offset = GUINT32_TO_BE(blob_buffer_offset);
+    head->blob_buffer_size = GUINT32_TO_BE(vmaw->header_blob_table_pos);
+
+    g_checksum_reset(vmaw->md5csum);
+    g_checksum_update(vmaw->md5csum, (const guchar *)buf, header_size);
+    gsize csize = 16;
+    g_checksum_get_digest(vmaw->md5csum, (guint8 *)(head->md5sum), &csize);
+
+    return vma_queue_write(vmaw, buf, header_size);
+}
+
+static int coroutine_fn vma_writer_flush(VmaWriter *vmaw)
+{
+    assert(vmaw);
+
+    int ret;
+    int i;
+
+    if (vmaw->status < 0) {
+        return vmaw->status;
+    }
+
+    if (!vmaw->header_written) {
+        vmaw->header_written = true;
+        ret = vma_write_header(vmaw);
+        if (ret < 0) {
+            vma_writer_set_error(vmaw, "vma_writer_flush: write header failed");
+            return ret;
+        }
+    }
+
+    DPRINTF("VMA WRITE FLUSH %d %d\n", vmaw->outbuf_count, vmaw->outbuf_pos);
+
+
+    VmaExtentHeader *ehead = (VmaExtentHeader *)vmaw->outbuf;
+
+    ehead->magic = VMA_EXTENT_MAGIC;
+    ehead->reserved1 = 0;
+
+    for (i = 0; i < VMA_BLOCKS_PER_EXTENT; i++) {
+        ehead->blockinfo[i] = GUINT64_TO_BE(vmaw->outbuf_block_info[i]);
+    }
+
+    guint16 block_count = (vmaw->outbuf_pos - VMA_EXTENT_HEADER_SIZE) /
+        VMA_BLOCK_SIZE;
+
+    ehead->block_count = GUINT16_TO_BE(block_count);
+
+    memcpy(ehead->uuid, vmaw->uuid, sizeof(ehead->uuid));
+    memset(ehead->md5sum, 0, sizeof(ehead->md5sum));
+
+    g_checksum_reset(vmaw->md5csum);
+    g_checksum_update(vmaw->md5csum, vmaw->outbuf, VMA_EXTENT_HEADER_SIZE);
+    gsize csize = 16;
+    g_checksum_get_digest(vmaw->md5csum, ehead->md5sum, &csize);
+
+    int bytes = vmaw->outbuf_pos;
+    ret = vma_queue_write(vmaw, vmaw->outbuf, bytes);
+    if (ret != bytes) {
+        vma_writer_set_error(vmaw, "vma_writer_flush: failed write");
+    }
+
+    vmaw->outbuf_count = 0;
+    vmaw->outbuf_pos = VMA_EXTENT_HEADER_SIZE;
+
+    for (i = 0; i < VMA_BLOCKS_PER_EXTENT; i++) {
+        vmaw->outbuf_block_info[i] = 0;
+    }
+
+    return vmaw->status;
+}
+
+static int vma_count_open_streams(VmaWriter *vmaw)
+{
+    g_assert(vmaw != NULL);
+
+    int i;
+    int open_drives = 0;
+    for (i = 0; i <= 255; i++) {
+        if (vmaw->stream_info[i].size && !vmaw->stream_info[i].finished) {
+            open_drives++;
+        }
+    }
+
+    return open_drives;
+}
+
+/**
+ * all jobs should call this when there is no more data
+ * Returns: number of remaining streams (0 ==> finished)
+ */
+int coroutine_fn
+vma_writer_close_stream(VmaWriter *vmaw, uint8_t dev_id)
+{
+    g_assert(vmaw != NULL);
+
+    DPRINTF("vma_writer_close_stream %d\n", dev_id);
+    if (!vmaw->stream_info[dev_id].size) {
+        vma_writer_set_error(vmaw, "vma_writer_close_stream: "
+                             "no such stream %d", dev_id);
+        return -1;
+    }
+    if (vmaw->stream_info[dev_id].finished) {
+        vma_writer_set_error(vmaw, "vma_writer_close_stream: "
+                             "stream already closed %d", dev_id);
+        return -1;
+    }
+
+    vmaw->stream_info[dev_id].finished = true;
+
+    int open_drives = vma_count_open_streams(vmaw);
+
+    if (open_drives <= 0) {
+        DPRINTF("vma_writer_close_stream all drives completed\n");
+        qemu_co_mutex_lock(&vmaw->flush_lock);
+        int ret = vma_writer_flush(vmaw);
+        qemu_co_mutex_unlock(&vmaw->flush_lock);
+        if (ret < 0) {
+            vma_writer_set_error(vmaw, "vma_writer_close_stream: flush failed");
+        }
+    }
+
+    return open_drives;
+}
+
+int vma_writer_get_status(VmaWriter *vmaw, VmaStatus *status)
+{
+    int i;
+
+    g_assert(vmaw != NULL);
+
+    if (status) {
+        status->status = vmaw->status;
+        g_strlcpy(status->errmsg, vmaw->errmsg, sizeof(status->errmsg));
+        for (i = 0; i <= 255; i++) {
+            status->stream_info[i] = vmaw->stream_info[i];
+        }
+
+        uuid_unparse_lower(vmaw->uuid, status->uuid_str);
+        /* status is optional, so only touch it inside the NULL check */
+        status->closed = vmaw->closed;
+    }
+
+    return vmaw->status;
+}
+
+static int coroutine_fn vma_writer_get_buffer(VmaWriter *vmaw)
+{
+    int ret = 0;
+
+    qemu_co_mutex_lock(&vmaw->flush_lock);
+
+    /* wait until buffer is available */
+    while (vmaw->outbuf_count >= (VMA_BLOCKS_PER_EXTENT - 1)) {
+        ret = vma_writer_flush(vmaw);
+        if (ret < 0) {
+            vma_writer_set_error(vmaw, "vma_writer_get_buffer: flush failed");
+            break;
+        }
+    }
+
+    qemu_co_mutex_unlock(&vmaw->flush_lock);
+
+    return ret;
+}
+
+
+int64_t coroutine_fn
+vma_writer_write(VmaWriter *vmaw, uint8_t dev_id, int64_t cluster_num,
+                 unsigned char *buf, size_t *zero_bytes)
+{
+    g_assert(vmaw != NULL);
+    g_assert(zero_bytes != NULL);
+
+    *zero_bytes = 0;
+
+    if (vmaw->status < 0) {
+        return vmaw->status;
+    }
+
+    if (!dev_id || !vmaw->stream_info[dev_id].size) {
+        vma_writer_set_error(vmaw, "vma_writer_write: "
+                             "no such stream %d", dev_id);
+        return -1;
+    }
+
+    if (vmaw->stream_info[dev_id].finished) {
+        vma_writer_set_error(vmaw, "vma_writer_write: "
+                             "stream already closed %d", dev_id);
+        return -1;
+    }
+
+
+    if (cluster_num >= (((uint64_t)1)<<32)) {
+        vma_writer_set_error(vmaw, "vma_writer_write: "
+                             "cluster number out of range");
+        return -1;
+    }
+
+    if (dev_id == vmaw->vmstate_stream) {
+        if (cluster_num != vmaw->vmstate_clusters) {
+            vma_writer_set_error(vmaw, "vma_writer_write: "
+                                 "non sequential vmstate write");
+        }
+        vmaw->vmstate_clusters++;
+    } else if (cluster_num >= vmaw->stream_info[dev_id].cluster_count) {
+        vma_writer_set_error(vmaw, "vma_writer_write: cluster number too big");
+        return -1;
+    }
+
+    /* wait until buffer is available */
+    if (vma_writer_get_buffer(vmaw) < 0) {
+        vma_writer_set_error(vmaw, "vma_writer_write: "
+                             "vma_writer_get_buffer failed");
+        return -1;
+    }
+
+    DPRINTF("VMA WRITE %d %" PRId64 "\n", dev_id, cluster_num);
+
+    int i;
+    int bit = 1;
+    uint16_t mask = 0;
+    for (i = 0; i < 16; i++) {
+        unsigned char *vmablock = buf + (i*VMA_BLOCK_SIZE);
+        if (!buffer_is_zero(vmablock, VMA_BLOCK_SIZE)) {
+            mask |= bit;
+            memcpy(vmaw->outbuf + vmaw->outbuf_pos, vmablock, VMA_BLOCK_SIZE);
+            vmaw->outbuf_pos += VMA_BLOCK_SIZE;
+        } else {
+            DPRINTF("VMA WRITE %" PRId64 " ZERO BLOCK %d\n", cluster_num, i);
+            vmaw->stream_info[dev_id].zero_bytes += VMA_BLOCK_SIZE;
+            *zero_bytes += VMA_BLOCK_SIZE;
+        }
+
+        bit = bit << 1;
+    }
+
+    uint64_t block_info = ((uint64_t)mask) << (32+16);
+    block_info |= ((uint64_t)dev_id) << 32;
+    block_info |= (cluster_num & 0xffffffff);
+    vmaw->outbuf_block_info[vmaw->outbuf_count] = block_info;
+
+    DPRINTF("VMA WRITE MASK %" PRId64 " %" PRIx64 "\n", cluster_num,
+            block_info);
+    vmaw->outbuf_count++;
+
+    /** NOTE: We always write whole clusters, but we correctly set
+     * transferred bytes. So transferred == size when everything
+     * went OK.
+     */
+    size_t transferred = VMA_CLUSTER_SIZE;
+
+    if (dev_id != vmaw->vmstate_stream) {
+        uint64_t last = (cluster_num + 1) * VMA_CLUSTER_SIZE;
+        if (last > vmaw->stream_info[dev_id].size) {
+            uint64_t diff = last - vmaw->stream_info[dev_id].size;
+            if (diff >= VMA_CLUSTER_SIZE) {
+                vma_writer_set_error(vmaw, "vma_writer_write: "
+                                     "read after last cluster");
+                return -1;
+            }
+            transferred -= diff;
+        }
+    }
+
+    vmaw->stream_info[dev_id].transferred += transferred;
+
+    return transferred;
+}
+
+int vma_writer_close(VmaWriter *vmaw, Error **errp)
+{
+    g_assert(vmaw != NULL);
+
+    int i;
+
+    vma_queue_flush(vmaw);
+
+    /* this should not happen - just to be sure */
+    while (!qemu_co_queue_empty(&vmaw->wqueue)) {
+        DPRINTF("vma_writer_close wait\n");
+        co_sleep_ns(rt_clock, 1000000);
+    }
+
+    if (vmaw->cmd) {
+        if (pclose(vmaw->cmd) < 0) {
+            vma_writer_set_error(vmaw, "vma_writer_close: "
+                                 "pclose failed - %s", strerror(errno));
+        }
+    } else {
+        if (close(vmaw->fd) < 0) {
+            vma_writer_set_error(vmaw, "vma_writer_close: "
+                                 "close failed - %s", strerror(errno));
+        }
+    }
+
+    for (i = 0; i <= 255; i++) {
+        VmaStreamInfo *si = &vmaw->stream_info[i];
+        if (si->size) {
+            if (!si->finished) {
+                vma_writer_set_error(vmaw, "vma_writer_close: "
+                                     "detected open stream '%s'", si->devname);
+            } else if ((si->transferred != si->size) &&
+                       (i != vmaw->vmstate_stream)) {
+                vma_writer_set_error(vmaw, "vma_writer_close: incomplete "
+                                     "stream '%s' (%" PRIu64 " != %" PRIu64 ")",
+                                     si->devname, si->transferred, si->size);
+            }
+        }
+    }
+
+    for (i = 0; i <= 255; i++) {
+        vmaw->stream_info[i].finished = 1; /* mark as closed */
+    }
+
+    vmaw->closed = 1;
+
+    if (vmaw->status < 0 && errp && *errp == NULL) {
+        error_setg(errp, "%s", vmaw->errmsg);
+    }
+
+    return vmaw->status;
+}
+
+void vma_writer_destroy(VmaWriter *vmaw)
+{
+    assert(vmaw);
+
+    int i;
+
+    for (i = 0; i <= 255; i++) {
+        if (vmaw->stream_info[i].devname) {
+            g_free(vmaw->stream_info[i].devname);
+        }
+    }
+
+    if (vmaw->md5csum) {
+        g_checksum_free(vmaw->md5csum);
+    }
+
+    for (i = 0; i < WRITE_BUFFERS; i++) {
+        qemu_vfree(vmaw->aiocbs[i]); /* allocated with qemu_memalign() */
+    }
+
+    g_free(vmaw);
+}
+
+/* backup driver plugin */
+
+static int vma_dump_cb(void *opaque, uint8_t dev_id, int64_t cluster_num,
+                       unsigned char *buf, size_t *zero_bytes)
+{
+    VmaWriter *vmaw = opaque;
+
+    return vma_writer_write(vmaw, dev_id, cluster_num, buf, zero_bytes);
+}
+
+static int vma_close_cb(void *opaque, Error **errp)
+{
+    VmaWriter *vmaw = opaque;
+
+    int res = vma_writer_close(vmaw, errp);
+    vma_writer_destroy(vmaw);
+
+    return res;
+}
+
+static int vma_complete_cb(void *opaque, uint8_t dev_id, int ret)
+{
+    VmaWriter *vmaw = opaque;
+
+    if (ret < 0) {
+        vma_writer_set_error(vmaw, "backup_complete_cb %d", ret);
+    }
+
+    return vma_writer_close_stream(vmaw, dev_id);
+}
+
+static int vma_register_stream_cb(void *opaque, const char *devname,
+                                  size_t size)
+{
+    VmaWriter *vmaw = opaque;
+
+    return vma_writer_register_stream(vmaw, devname, size);
+}
+
+static int vma_register_config_cb(void *opaque, const char *name,
+                                  gpointer data, size_t data_len)
+{
+    VmaWriter *vmaw = opaque;
+
+    return vma_writer_add_config(vmaw, name, data, data_len);
+}
+
+static void *vma_open_cb(const char *filename, uuid_t uuid, Error **errp)
+{
+    return vma_writer_create(filename, uuid, errp);
+}
+
+const BackupDriver backup_vma_driver = {
+    .format = "vma",
+    .open_cb = vma_open_cb,
+    .close_cb = vma_close_cb,
+    .register_config_cb = vma_register_config_cb,
+    .register_stream_cb = vma_register_stream_cb,
+    .dump_cb = vma_dump_cb,
+    .complete_cb = vma_complete_cb,
+};
+
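For readers following the extent format: vma_writer_write() above records, for every 64 KiB cluster, a 16-bit mask of its non-zero 4 KiB blocks together with the device id and the cluster number in a single 64-bit word, which vma_writer_flush() then stores big-endian in the extent header. The following is only a minimal standalone sketch of that packing. All helper names (vma_nonzero_mask, vma_pack_block_info and the accessors) are hypothetical and not part of the patch, and the byte-wise zero test is a simple stand-in for QEMU's buffer_is_zero().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLOCK_SIZE   4096   /* mirrors VMA_BLOCK_SIZE */
#define SKETCH_BLOCKS       16     /* 4 KiB blocks per 64 KiB cluster */

/* Stand-in for buffer_is_zero(): true if all bytes are zero. */
static inline bool sketch_block_is_zero(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Bit i of the result is set when block i of the cluster holds data. */
static inline uint16_t vma_nonzero_mask(const unsigned char *cluster)
{
    uint16_t mask = 0;
    for (int i = 0; i < SKETCH_BLOCKS; i++) {
        const unsigned char *blk = cluster + (size_t)i * SKETCH_BLOCK_SIZE;
        if (!sketch_block_is_zero(blk, SKETCH_BLOCK_SIZE)) {
            mask |= (uint16_t)(1u << i);
        }
    }
    return mask;
}

/* Layout used by the writer: bits 48..63 mask, bits 32..39 dev_id,
 * bits 0..31 cluster number (host byte order here). */
static inline uint64_t vma_pack_block_info(uint16_t mask, uint8_t dev_id,
                                           uint32_t cluster_num)
{
    uint64_t block_info = ((uint64_t)mask) << (32 + 16);
    block_info |= ((uint64_t)dev_id) << 32;
    block_info |= cluster_num;
    return block_info;
}

static inline uint16_t vma_block_info_mask(uint64_t bi)
{
    return (uint16_t)(bi >> 48);
}

static inline uint8_t vma_block_info_dev_id(uint64_t bi)
{
    return (uint8_t)(bi >> 32);
}

static inline uint32_t vma_block_info_cluster(uint64_t bi)
{
    return (uint32_t)bi;
}

/* Number of 4 KiB blocks actually stored for the cluster (popcount). */
static inline unsigned vma_block_info_stored_blocks(uint64_t bi)
{
    unsigned n = 0;
    for (uint16_t mask = vma_block_info_mask(bi); mask; mask >>= 1) {
        n += mask & 1;
    }
    return n;
}
```

A reader of the archive would presumably apply the accessors after converting the stored word back from big-endian; blocks whose mask bit is clear are not present in the extent payload and are treated as zero.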
diff --git a/vma.c b/vma.c
new file mode 100644
index 0000000..b2e276c
--- /dev/null
+++ b/vma.c
@@ -0,0 +1,559 @@ 
+/*
+ * VMA: Virtual Machine Archive
+ *
+ * Copyright (C) 2012 Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <glib.h>
+
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "vma.h"
+#include "block/block.h"
+
+static void help(void)
+{
+    const char *help_msg =
+        "usage: vma command [command options]\n"
+        "\n"
+        "vma list <filename>\n"
+        "vma create <filename> [-v] [-c config] pathname ...\n"
+        "vma extract <filename> [-v] [-r <fifo>] <targetdir>\n"
+        ;
+
+    printf("%s", help_msg);
+    exit(1);
+}
+
+static const char *extract_devname(const char *path, char **devname, int index)
+{
+    assert(path);
+
+    const char *sep = strchr(path, '=');
+
+    if (sep) {
+        *devname = g_strndup(path, sep - path);
+        path = sep + 1;
+    } else {
+        if (index >= 0) {
+            *devname = g_strdup_printf("disk%d", index);
+        } else {
+            *devname = NULL;
+        }
+    }
+
+    return path;
+}
+
+static void print_content(VmaReader *vmar)
+{
+    assert(vmar);
+
+    VmaHeader *head = vma_reader_get_header(vmar);
+
+    GList *l = vma_reader_get_config_data(vmar);
+    while (l && l->data) {
+        VmaConfigData *cdata = (VmaConfigData *)l->data;
+        l = g_list_next(l);
+        printf("CFG: size: %u name: %s\n", cdata->len, cdata->name);
+    }
+
+    int i;
+    VmaDeviceInfo *di;
+    for (i = 1; i < 255; i++) {
+        di = vma_reader_get_device_info(vmar, i);
+        if (di) {
+            if (strcmp(di->devname, "vmstate") == 0) {
+                printf("VMSTATE: dev_id=%d memory: %" PRIu64 "\n", i, di->size);
+            } else {
+                printf("DEV: dev_id=%d size: %" PRIu64 " devname: %s\n",
+                       i, di->size, di->devname);
+            }
+        }
+    }
+    time_t created = (time_t)head->ctime; /* ctime is printed last */
+    printf("CTIME: %s", ctime(&created));
+    fflush(stdout);
+}
+
+static int list_content(int argc, char **argv)
+{
+    int c, ret = 0;
+    const char *filename;
+
+    for (;;) {
+        c = getopt(argc, argv, "h");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case '?':
+        case 'h':
+            help();
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+    /* Get the filename */
+    if ((optind + 1) != argc) {
+        help();
+    }
+    filename = argv[optind++];
+
+    Error *errp = NULL;
+    VmaReader *vmar = vma_reader_create(filename, &errp);
+
+    if (!vmar) {
+        g_error("%s", error_get_pretty(errp));
+    }
+
+    print_content(vmar);
+
+    vma_reader_destroy(vmar);
+
+    return ret;
+}
+
+typedef struct RestoreMap {
+    char *devname;
+    char *path;
+    bool write_zero;
+} RestoreMap;
+
+static int extract_content(int argc, char **argv)
+{
+    int c, ret = 0;
+    int verbose = 0;
+    const char *filename;
+    const char *dirname;
+    const char *readmap = NULL;
+
+    for (;;) {
+        c = getopt(argc, argv, "hvr:");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case '?':
+        case 'h':
+            help();
+            break;
+        case 'r':
+            readmap = optarg;
+            break;
+        case 'v':
+            verbose = 1;
+            break;
+        default:
+            help();
+        }
+    }
+
+    /* Get the filename */
+    if ((optind + 2) != argc) {
+        help();
+    }
+    filename = argv[optind++];
+    dirname = argv[optind++];
+
+    Error *errp = NULL;
+    VmaReader *vmar = vma_reader_create(filename, &errp);
+
+    if (!vmar) {
+        g_error("%s", error_get_pretty(errp));
+    }
+
+    if (mkdir(dirname, 0777) < 0) {
+        g_error("unable to create target directory %s - %s",
+                dirname, strerror(errno));
+    }
+
+    GList *l = vma_reader_get_config_data(vmar);
+    while (l && l->data) {
+        VmaConfigData *cdata = (VmaConfigData *)l->data;
+        l = g_list_next(l);
+        char *cfgfn = g_strdup_printf("%s/%s", dirname, cdata->name);
+        GError *err = NULL;
+        if (!g_file_set_contents(cfgfn, (gchar *)cdata->data, cdata->len,
+                                 &err)) {
+            g_error("unable to write file: %s", err->message);
+        }
+    }
+
+    GHashTable *devmap = g_hash_table_new(g_str_hash, g_str_equal);
+
+    if (readmap) {
+        print_content(vmar);
+
+        FILE *map = fopen(readmap, "r");
+        if (!map) {
+            g_error("unable to open fifo %s - %s", readmap, strerror(errno));
+        }
+
+        while (1) {
+            char inbuf[8192];
+            char *line = fgets(inbuf, sizeof(inbuf), map);
+            if (!line || line[0] == '\0' || !strcmp(line, "done\n")) {
+                break;
+            }
+            int len = strlen(line);
+            if (line[len - 1] == '\n') {
+                line[len - 1] = '\0';
+                if (len == 1) {
+                    break;
+                }
+            }
+
+            const char *path;
+            bool write_zero;
+            if (line[0] == '0' && line[1] == ':') {
+                path = inbuf + 2;
+                write_zero = false;
+            } else if (line[0] == '1' && line[1] == ':') {
+                path = inbuf + 2;
+                write_zero = true;
+            } else {
+                g_error("read map failed - parse error ('%s')", inbuf);
+            }
+
+            char *devname = NULL;
+            path = extract_devname(path, &devname, -1);
+            if (!devname) {
+                g_error("read map failed - no dev name specified ('%s')",
+                        inbuf);
+            }
+
+            RestoreMap *rmap = g_new0(RestoreMap, 1);
+            rmap->devname = devname; /* takes ownership of the copy */
+            rmap->path = g_strdup(path);
+            rmap->write_zero = write_zero;
+
+            g_hash_table_insert(devmap, rmap->devname, rmap);
+
+        }
+    }
+
+    int i;
+    int vmstate_fd = -1;
+    guint8 vmstate_stream = 0;
+
+    for (i = 1; i < 255; i++) {
+        VmaDeviceInfo *di = vma_reader_get_device_info(vmar, i);
+        if (di && (strcmp(di->devname, "vmstate") == 0)) {
+            vmstate_stream = i;
+            char *statefn = g_strdup_printf("%s/vmstate.bin", dirname);
+            vmstate_fd = open(statefn, O_WRONLY|O_CREAT|O_EXCL, 0644);
+            if (vmstate_fd < 0) {
+                g_error("create vmstate file '%s' failed - %s", statefn,
+                        strerror(errno));
+            }
+            g_free(statefn);
+        } else if (di) {
+            char *devfn = NULL;
+            int flags = BDRV_O_RDWR|BDRV_O_CACHE_WB;
+            bool write_zero = true;
+
+            if (readmap) {
+                RestoreMap *map;
+                map = (RestoreMap *)g_hash_table_lookup(devmap, di->devname);
+                if (map == NULL) {
+                    g_error("no device name mapping for %s", di->devname);
+                }
+                devfn = map->path;
+                write_zero = map->write_zero;
+            } else {
+                devfn = g_strdup_printf("%s/tmp-disk-%s.raw",
+                                        dirname, di->devname);
+                printf("DEVINFO %s %" PRIu64 "\n", devfn, di->size);
+
+                bdrv_img_create(devfn, "raw", NULL, NULL, NULL, di->size,
+                                flags, &errp);
+                if (error_is_set(&errp)) {
+                    g_error("can't create file %s: %s", devfn,
+                            error_get_pretty(errp));
+                }
+
+                /* Note: we created an empty file above, so there is no
+                 * need to write zeroes (so we generate a sparse file)
+                 */
+                write_zero = false;
+            }
+
+            BlockDriverState *bs = bdrv_new(di->devname);
+            if (bdrv_open(bs, devfn, flags, NULL)) {
+                g_error("can't open file %s", devfn);
+            }
+            if (vma_reader_register_bs(vmar, i, bs, write_zero, &errp) < 0) {
+                g_error("%s", error_get_pretty(errp));
+            }
+
+            if (!readmap) {
+                g_free(devfn);
+            }
+        }
+    }
+
+    if (vma_reader_restore(vmar, vmstate_fd, verbose, &errp) < 0) {
+        g_error("restore failed - %s", error_get_pretty(errp));
+    }
+
+    if (!readmap) {
+        for (i = 1; i < 255; i++) {
+            VmaDeviceInfo *di = vma_reader_get_device_info(vmar, i);
+            if (di && (i != vmstate_stream)) {
+                char *tmpfn = g_strdup_printf("%s/tmp-disk-%s.raw",
+                                              dirname, di->devname);
+                char *fn = g_strdup_printf("%s/disk-%s.raw",
+                                           dirname, di->devname);
+                if (rename(tmpfn, fn) != 0) {
+                    g_error("rename %s to %s failed - %s",
+                            tmpfn, fn, strerror(errno));
+                }
+            }
+        }
+    }
+
+    vma_reader_destroy(vmar);
+
+    bdrv_close_all();
+
+    return ret;
+}
+
+typedef struct BackupCB {
+    VmaWriter *vmaw;
+    uint8_t dev_id;
+} BackupCB;
+
+static int backup_dump_cb(void *opaque, BlockDriverState *bs,
+                          int64_t cluster_num, unsigned char *buf)
+{
+    BackupCB *bcb = opaque;
+    size_t zb = 0;
+    if (vma_writer_write(bcb->vmaw, bcb->dev_id, cluster_num, buf, &zb) < 0) {
+        g_warning("backup_dump_cb vma_writer_write failed");
+        return -1;
+    }
+
+    return 0;
+}
+
+static void backup_complete_cb(void *opaque, int ret)
+{
+    BackupCB *bcb = opaque;
+
+    if (ret < 0) {
+        vma_writer_set_error(bcb->vmaw, "backup_complete_cb %d", ret);
+    }
+
+    if (vma_writer_close_stream(bcb->vmaw, bcb->dev_id) <= 0) {
+        Error *err = NULL;
+        if (vma_writer_close(bcb->vmaw, &err) != 0) {
+            g_warning("vma_writer_close failed %s", error_get_pretty(err));
+        }
+    }
+}
+
+static int create_archive(int argc, char **argv)
+{
+    int i, c, res;
+    int verbose = 0;
+    const char *archivename;
+    GList *config_files = NULL;
+
+    for (;;) {
+        c = getopt(argc, argv, "hvc:");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case '?':
+        case 'h':
+            help();
+            break;
+        case 'c':
+            config_files = g_list_append(config_files, optarg);
+            break;
+        case 'v':
+            verbose = 1;
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+
+    /* make sure we have archive name and at least one path */
+    if ((optind + 2) > argc) {
+        help();
+    }
+
+    archivename = argv[optind++];
+
+    uuid_t uuid;
+    uuid_generate(uuid);
+
+    Error *local_err = NULL;
+    VmaWriter *vmaw = vma_writer_create(archivename, uuid, &local_err);
+
+    if (vmaw == NULL) {
+        g_error("%s", error_get_pretty(local_err));
+    }
+
+    GList *l = config_files;
+    while (l && l->data) {
+        char *name = l->data;
+        char *cdata = NULL;
+        gsize clen = 0;
+        GError *err = NULL;
+        if (!g_file_get_contents(name, &cdata, &clen, &err)) {
+            unlink(archivename);
+            g_error("Unable to read file: %s", err->message);
+        }
+
+        if (vma_writer_add_config(vmaw, name, cdata, clen) != 0) {
+            unlink(archivename);
+            g_error("Unable to append config data %s (len = %zu)",
+                    name, clen);
+        }
+        l = g_list_next(l);
+    }
+
+    int ind = 0;
+    while (optind < argc) {
+        const char *path = argv[optind++];
+        char *devname = NULL;
+        path = extract_devname(path, &devname, ind++);
+
+        BlockDriver *drv = NULL;
+        BlockDriverState *bs = bdrv_new(devname);
+
+        res = bdrv_open(bs, path, BDRV_O_CACHE_WB , drv);
+        if (res < 0) {
+            unlink(archivename);
+            g_error("bdrv_open '%s' failed", path);
+        }
+        int64_t size = bdrv_getlength(bs);
+        int dev_id = vma_writer_register_stream(vmaw, devname, size);
+        if (dev_id <= 0) {
+            unlink(archivename);
+            g_error("vma_writer_register_stream '%s' failed", devname);
+        }
+
+        BackupCB *bcb = g_new0(BackupCB, 1);
+        bcb->vmaw = vmaw;
+        bcb->dev_id = dev_id;
+
+        if (backup_job_create(bs, backup_dump_cb, backup_complete_cb,
+                              bcb, 0) < 0) {
+            unlink(archivename);
+            g_error("backup_job_create failed");
+        } else {
+            backup_job_start(bs, false);
+        }
+    }
+
+    VmaStatus vmastat;
+    int percent = 0;
+    int last_percent = -1;
+
+    while (1) {
+        main_loop_wait(false);
+        vma_writer_get_status(vmaw, &vmastat);
+
+        if (verbose) {
+
+            uint64_t total = 0;
+            uint64_t transferred = 0;
+            uint64_t zero_bytes = 0;
+
+            int i;
+            for (i = 0; i < 256; i++) {
+                if (vmastat.stream_info[i].size) {
+                    total += vmastat.stream_info[i].size;
+                    transferred += vmastat.stream_info[i].transferred;
+                    zero_bytes += vmastat.stream_info[i].zero_bytes;
+                }
+            }
+            percent = (transferred*100)/total;
+            if (percent != last_percent) {
+                printf("progress %d%% %" PRIu64 "/%" PRIu64 " %" PRIu64 "\n",
+                       percent, transferred, total, zero_bytes);
+
+                last_percent = percent;
+            }
+        }
+
+        if (vmastat.closed) {
+            break;
+        }
+    }
+
+    bdrv_drain_all();
+
+    vma_writer_get_status(vmaw, &vmastat);
+
+    if (verbose) {
+        for (i = 0; i < 256; i++) {
+            VmaStreamInfo *si = &vmastat.stream_info[i];
+            if (si->size) {
+                printf("image %s: size=%zd zeros=%zd saved=%zd\n", si->devname,
+                       si->size, si->zero_bytes, si->size - si->zero_bytes);
+            }
+        }
+    }
+
+    if (vmastat.status < 0) {
+        unlink(archivename);
+        g_error("creating vma archive failed");
+    }
+
+    return 0;
+}
+
+int main(int argc, char **argv)
+{
+    const char *cmdname;
+
+    error_set_progname(argv[0]);
+
+    qemu_init_main_loop();
+
+    bdrv_init();
+
+    if (argc < 2) {
+        help();
+    }
+
+    cmdname = argv[1];
+    argc--; argv++;
+
+
+    if (!strcmp(cmdname, "list")) {
+        return list_content(argc, argv);
+    } else if (!strcmp(cmdname, "create")) {
+        return create_archive(argc, argv);
+    } else if (!strcmp(cmdname, "extract")) {
+        return extract_content(argc, argv);
+    }
+
+    help();
+    return 0;
+}
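The read-map handling in extract_content() expects one line per device of the form "<0|1>:<devname>=<path>", where the leading digit selects whether zero blocks are written out, and a line reading "done" (or an empty line) ends the map. As a rough illustration only, the parsing step could be isolated as below. parse_map_line() is a hypothetical helper, not something the patch provides, and unlike the patch it copies the device name into a caller-supplied buffer and reports errors through its return value instead of calling g_error().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Parse one map line "<0|1>:<devname>=<path>" (newline already stripped).
 * On success returns 0, stores the flag in *write_zero, the device name
 * in devname[] and a pointer to the path (inside 'line') in *path.
 * Returns -1 on any parse error.
 */
static inline int parse_map_line(const char *line, bool *write_zero,
                                 char *devname, size_t devname_size,
                                 const char **path)
{
    /* The prefix must be exactly "0:" or "1:". */
    if ((line[0] != '0' && line[0] != '1') || line[1] != ':') {
        return -1;
    }
    *write_zero = (line[0] == '1');

    /* The rest is "<devname>=<path>"; the '=' is mandatory in a map. */
    const char *spec = line + 2;
    const char *sep = strchr(spec, '=');
    if (!sep) {
        return -1;
    }

    size_t len = (size_t)(sep - spec);
    if (len >= devname_size) {
        return -1; /* name does not fit into the caller's buffer */
    }
    memcpy(devname, spec, len);
    devname[len] = '\0';

    *path = sep + 1;
    return 0;
}
```

The device and path names used when trying this out (for example "1:drive-virtio0=/dev/vg/vm-100-disk-1") are made-up values; the device name has to match a devname stored in the archive for the lookup in extract_content() to succeed.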
diff --git a/vma.h b/vma.h
new file mode 100644
index 0000000..76d0dc8
--- /dev/null
+++ b/vma.h
@@ -0,0 +1,145 @@ 
+/*
+ * VMA: Virtual Machine Archive
+ *
+ * Copyright (C) Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef BACKUP_VMA_H
+#define BACKUP_VMA_H
+
+#include "backup.h"
+#include "error.h"
+
+#define VMA_BLOCK_BITS 12
+#define VMA_BLOCK_SIZE (1<<VMA_BLOCK_BITS)
+#define VMA_CLUSTER_BITS (VMA_BLOCK_BITS+4)
+#define VMA_CLUSTER_SIZE (1<<VMA_CLUSTER_BITS)
+
+#if VMA_CLUSTER_SIZE != 65536
+#error unexpected cluster size
+#endif
+
+#define VMA_EXTENT_HEADER_SIZE 512
+#define VMA_BLOCKS_PER_EXTENT 59
+#define VMA_MAX_CONFIGS 256
+
+#define VMA_MAX_EXTENT_SIZE \
+    (VMA_EXTENT_HEADER_SIZE+VMA_CLUSTER_SIZE*VMA_BLOCKS_PER_EXTENT)
+#if VMA_MAX_EXTENT_SIZE != 3867136
+#error unexpected VMA_MAX_EXTENT_SIZE
+#endif
+
+/* File Format Definitions */
+
+#define VMA_MAGIC (GUINT32_TO_BE(('V'<<24)|('M'<<16)|('A'<<8)|0x00))
+#define VMA_EXTENT_MAGIC (GUINT32_TO_BE(('V'<<24)|('M'<<16)|('A'<<8)|'E'))
+
+typedef struct VmaDeviceInfoHeader {
+    uint32_t devname_ptr; /* offset into blob_buffer table */
+    uint32_t reserved0;
+    uint64_t size; /* device size in bytes */
+    uint64_t reserved1;
+    uint64_t reserved2;
+} VmaDeviceInfoHeader;
+
+typedef struct VmaHeader {
+    uint32_t magic;
+    uint32_t version;
+    unsigned char uuid[16];
+    int64_t ctime;
+    unsigned char md5sum[16];
+
+    uint32_t blob_buffer_offset;
+    uint32_t blob_buffer_size;
+    uint32_t header_size;
+
+    unsigned char reserved[1984];
+
+    uint32_t config_names[VMA_MAX_CONFIGS]; /* offset into blob_buffer table */
+    uint32_t config_data[VMA_MAX_CONFIGS];  /* offset into blob_buffer table */
+
+    VmaDeviceInfoHeader dev_info[256];
+} VmaHeader;
+
+typedef struct VmaExtentHeader {
+    uint32_t magic;
+    uint16_t reserved1;
+    uint16_t block_count;
+    unsigned char uuid[16];
+    unsigned char md5sum[16];
+    uint64_t blockinfo[VMA_BLOCKS_PER_EXTENT];
+} VmaExtentHeader;
+
+/* functions/definitions to read/write vma files */
+
+typedef struct VmaReader VmaReader;
+
+typedef struct VmaWriter VmaWriter;
+
+typedef struct VmaConfigData {
+    const char *name;
+    const void *data;
+    uint32_t len;
+} VmaConfigData;
+
+typedef struct VmaStreamInfo {
+    uint64_t size;
+    uint64_t cluster_count;
+    uint64_t transferred;
+    uint64_t zero_bytes;
+    int finished;
+    char *devname;
+} VmaStreamInfo;
+
+typedef struct VmaStatus {
+    int status;
+    bool closed;
+    char errmsg[8192];
+    char uuid_str[37];
+    VmaStreamInfo stream_info[256];
+} VmaStatus;
+
+typedef struct VmaDeviceInfo {
+    uint64_t size; /* device size in bytes */
+    const char *devname;
+} VmaDeviceInfo;
+
+extern const BackupDriver backup_vma_driver;
+
+VmaWriter *vma_writer_create(const char *filename, uuid_t uuid, Error **errp);
+int vma_writer_close(VmaWriter *vmaw, Error **errp);
+void vma_writer_destroy(VmaWriter *vmaw);
+int vma_writer_add_config(VmaWriter *vmaw, const char *name, gpointer data,
+                          size_t len);
+int vma_writer_register_stream(VmaWriter *vmaw, const char *devname,
+                               size_t size);
+
+int64_t coroutine_fn vma_writer_write(VmaWriter *vmaw, uint8_t dev_id,
+                                      int64_t cluster_num, unsigned char *buf,
+                                      size_t *zero_bytes);
+
+int coroutine_fn vma_writer_close_stream(VmaWriter *vmaw, uint8_t dev_id);
+
+int vma_writer_get_status(VmaWriter *vmaw, VmaStatus *status);
+void vma_writer_set_error(VmaWriter *vmaw, const char *fmt, ...);
+
+
+VmaReader *vma_reader_create(const char *filename, Error **errp);
+void vma_reader_destroy(VmaReader *vmar);
+VmaHeader *vma_reader_get_header(VmaReader *vmar);
+GList *vma_reader_get_config_data(VmaReader *vmar);
+VmaDeviceInfo *vma_reader_get_device_info(VmaReader *vmar, guint8 dev_id);
+int vma_reader_register_bs(VmaReader *vmar, guint8 dev_id,
+                           BlockDriverState *bs, bool write_zeroes,
+                           Error **errp);
+int vma_reader_restore(VmaReader *vmar, int vmstate_fd, bool verbose,
+                       Error **errp);
+
+#endif /* BACKUP_VMA_H */
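The constants in vma.h fit together as follows: the extent header is 4 + 2 + 2 + 16 + 16 bytes of fixed fields plus 59 eight-byte blockinfo entries, which is exactly 512 bytes, and a full extent is that header plus 59 clusters of 64 KiB. The sketch below merely re-derives those numbers in a standalone file. The struct is copied from the header above, while the SKETCH_* macros and vma_extent_bytes() are hypothetical names introduced for illustration. It assumes, as vma_writer_flush() suggests, that an extent on disk consists of the header followed by only the non-zero 4 KiB blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLOCK_SIZE          (1 << 12)  /* VMA_BLOCK_SIZE */
#define SKETCH_CLUSTER_SIZE        (1 << 16)  /* VMA_CLUSTER_SIZE */
#define SKETCH_EXTENT_HEADER_SIZE  512        /* VMA_EXTENT_HEADER_SIZE */
#define SKETCH_BLOCKS_PER_EXTENT   59         /* VMA_BLOCKS_PER_EXTENT */

/* Same field layout as VmaExtentHeader in vma.h. */
typedef struct SketchExtentHeader {
    uint32_t magic;
    uint16_t reserved1;
    uint16_t block_count;
    unsigned char uuid[16];
    unsigned char md5sum[16];
    uint64_t blockinfo[SKETCH_BLOCKS_PER_EXTENT];
} SketchExtentHeader;

/* 40 bytes of fixed fields + 59 * 8 bytes of blockinfo = 512 bytes. */
_Static_assert(sizeof(SketchExtentHeader) == SKETCH_EXTENT_HEADER_SIZE,
               "extent header must be exactly 512 bytes");

/* Bytes written for one extent that stores 'stored_blocks' 4 KiB blocks. */
static inline size_t vma_extent_bytes(unsigned stored_blocks)
{
    return SKETCH_EXTENT_HEADER_SIZE +
           (size_t)stored_blocks * SKETCH_BLOCK_SIZE;
}

/* Upper bound: every block of every cluster in the extent is non-zero. */
static inline size_t vma_max_extent_bytes(void)
{
    unsigned blocks_per_cluster = SKETCH_CLUSTER_SIZE / SKETCH_BLOCK_SIZE;
    return vma_extent_bytes(SKETCH_BLOCKS_PER_EXTENT * blocks_per_cluster);
}
```

With these definitions vma_max_extent_bytes() evaluates to 512 + 59 * 65536 = 3867136, the value the #if check on VMA_MAX_EXTENT_SIZE guards.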