Patchwork [v3,1/6] RFC: Efficient VM backup for qemu

Submitter Dietmar Maurer
Date Feb. 19, 2013, 11:31 a.m.
Message ID <1361273503-974882-1-git-send-email-dietmar@proxmox.com>
Permalink /patch/221668/
State New

Comments

Dietmar Maurer - Feb. 19, 2013, 11:31 a.m.
This series provides a way to efficiently back up VMs.

* Backup to a single archive file
* Backup contains all data to restore the VM (full backup)
* Do not depend on storage type or image format
* Avoid use of temporary storage
* Store sparse images efficiently

The file docs/backup-rfc.txt contains more details.

Changes since v1:

* fix spelling errors
* move BackupInfo from BDS to BackupBlockJob
* introduce BackupDriver to allow more than one backup format
* vma: add support to store vmstate (size is not known in advance)
* add ability to store VM state

Changes since v2:

* BackupDriver: remove cancel_cb
* use enum for BackupFormat
* vma: use bdrv_open instead of bdrv_file_open
* vma: fix aio, use O_DIRECT
* backup one drive after another (try to avoid high load)

Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
---
 docs/backup-rfc.txt |  119 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 119 insertions(+), 0 deletions(-)
 create mode 100644 docs/backup-rfc.txt
Eric Blake - Feb. 19, 2013, 7:53 p.m.
On 02/19/2013 04:31 AM, Dietmar Maurer wrote:
> This series provides a way to efficiently backup VMs.
> 
> * Backup to a single archive file
> * Backup contain all data to restore VM (full backup)
> * Do not depend on storage type or image format
> * Avoid use of temporary storage
> * store sparse images efficiently

It is customary to send a 0/6 cover letter for details like this, rather
than slamming it into the first patch (git send-email --cover-letter).
Remember, once it is in git, it is no longer as easy to identify where a
series starts and ends, so the contents of the cover letter is not
essential to git history, just to reviewers.
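
For readers unfamiliar with the workflow Eric describes, a cover letter can be produced like this (the branch range and recipient address below are placeholders, not taken from this thread):

```shell
# Generate the series with an extra editable 0000 cover letter
git format-patch --cover-letter -o outgoing/ origin/master..my-backup-branch

# Fill in outgoing/0000-cover-letter.patch, then send everything
git send-email --to=qemu-devel@nongnu.org outgoing/*.patch
```

The cover letter is message 0/N of the series: reviewers see the overview, but it never becomes a commit, so nothing from it ends up in git history.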

> 
> The file docs/backup-rfc.txt contains more details.

While naming the file *-rfc is fine for an RFC patch series, it better
not be the final name that you actually want committed.

> 
> Changes since v1:
> 
> * fix spelling errors
> * move BackupInfo from BDS to BackupBlockJob
> * introduce BackupDriver to allow more than one backup format
> * vma: add suport to store vmstate (size is not known in advance)
> * add ability to store VM state
> 
> Changes since v2:
> 
> * BackupDriver: remove cancel_cb
> * use enum for BackupFormat
> * vma: use bdrv_open instead of bdrv_file_open
> * vma: fix aio, use O_DIRECT
> * backup one drive after another (try to avoid high load)

Also, it is customary to list series revision history after the ---
separator; again, something useful for reviewers, but pointless in the
actual git history.

> 
> Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
> ---
>  docs/backup-rfc.txt |  119 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 119 insertions(+), 0 deletions(-)
>  create mode 100644 docs/backup-rfc.txt
> 
> diff --git a/docs/backup-rfc.txt b/docs/backup-rfc.txt
> new file mode 100644
> index 0000000..5b4b3df
> --- /dev/null
> +++ b/docs/backup-rfc.txt
> @@ -0,0 +1,119 @@
> +RFC: Efficient VM backup for qemu

You already have RFC in the subject line; you don't need it here in your
proposed contents.

> +
> +That basically means that any data written during backup involve
> +considerable overhead. For LVM we get the following steps:
> +
> +1.) read original data (VM write)

Shouldn't that be '(VM read)'?

> +2.) write original data into snapshot (VM write)
> +3.) write new data (VM write)
> +4.) read data from snapshot (backup)
> +5.) write data from snapshot into tar file (backup)
> +
> +Another approach to backup VM images is to create a new qcow2 image
> +which use the old image as base. During backup, writes are redirected
> +to the new image, so the old image represents a 'snapshot'. After
> +backup, data need to be copied back from new image into the old
> +one (commit). So a simple write during backup triggers the following
> +steps:
> +
> +1.) write new data to new image (VM write)
> +2.) read data from old image (backup)
> +3.) write data from old image into tar file (backup)
> +
> +4.) read data from new image (commit)
> +5.) write data to old image (commit)
> +
> +This is in fact the same overhead as before. Other tools like qemu
> +livebackup produces similar overhead (2 reads, 3 writes).
> +
> +Some storage types/formats supports internal snapshots using some kind
> +of reference counting (rados, sheepdog, dm-thin, qcow2). It would be possible
> +to use that for backups, but for now we want to be storage-independent.
> +
> +Note: It turned out that taking a qcow2 snapshot can take a very long
> +time on larger files.

That's an independent issue, and there have been patches proposed to try
and reduce that time.

> +
> +=Make it more efficient=
> +
> +The be more efficient, we simply need to avoid unnecessary steps. The
> +following steps are always required:
> +
> +1.) read old data before it gets overwritten
> +2.) write that data into the backup archive
> +3.) write new data (VM write)
> +
> +As you can see, this involves only one read, an two writes.

s/an/and/

> +
> +To make that work, our backup archive need to be able to store image
> +data 'out of order'. It is important to notice that this will not work
> +with traditional archive formats like tar.

Are you also requiring that the output file descriptor be seekable?  Tar
has the advantage of using a pipe; requiring a seekable file might be an
acceptable tradeoff, but it does limit what you can do when you can't
pass a pipe in for the destination.

> +
> +During backup we simply intercept writes, then read existing data and
> +store that directly into the archive. After that we can continue the
> +write.
> +
> +==Advantages==
> +
> +* very good performance (1 read, 2 writes)
> +* works on any storage type and image format.
> +* avoid usage of temporary storage
> +* we can define a new and simple archive format, which is able to
> +  store sparse files efficiently.
> +
> +Note: Storing sparse files is a mess with existing archive
> +formats. For example, tar requires information about holes at the
> +beginning of the archive.
> +
> +==Disadvantages==
> +
> +* we need to define a new archive format
> +
> +Note: Most existing archive formats are optimized to store small files
> +including file attributes. We simply do not need that for VM archives.
> +
> +* archive contains data 'out of order'
> +
> +If you want to access image data in sequential order, you need to
> +re-order archive data. It would be possible to to that on the fly,
> +using temporary files.
> +
> +Fortunately, a normal restore/extract works perfectly with 'out of
> +order' data, because the target files are seekable.
> +
> +* slow backup storage can slow down VM during backup
> +
> +It is important to note that we only do sequential writes to the
> +backup storage. Furthermore one can compress the backup stream. IMHO,
> +it is better to slow down the VM a bit. All other solutions creates
> +large amounts of temporary data during backup.
> +
> +=Archive format requirements=
> +
> +The basic requirement for such new format is that we can store image
> +date 'out of order'. It is also very likely that we have less than 256
> +drives/images per VM, and we want to be able to store VM configuration
> +files.
> +
> +We have defined a very simply format with those properties, see:
> +
> +docs/specs/vma_spec.txt

This file should be part of the same patch that first mentions it.

> +
> +Please let us know if you know an existing format which provides the
> +same functionality.
> +
> +
>
Dietmar Maurer - Feb. 20, 2013, 6:02 a.m.
> > * Backup to a single archive file
> > * Backup contain all data to restore VM (full backup)
> > * Do not depend on storage type or image format
> > * Avoid use of temporary storage
> > * store sparse images efficiently
> 
> It is customary to send a 0/6 cover letter for details like this, rather than
> slamming it into the first patch (git send-email --cover-letter).

But how do I maintain the content of that cover-letter when it is not part of the git tree?
Dietmar Maurer - Feb. 20, 2013, 7:23 a.m.
First, many thanks for the review!

> It is customary to send a 0/6 cover letter for details like this, rather than
> slamming it into the first patch (git send-email --cover-letter).
> Remember, once it is in git, it is no longer as easy to identify where a series
> starts and ends, so the contents of the cover letter is not essential to git
> history, just to reviewers.
> 
> >
> > The file docs/backup-rfc.txt contains more details.
> 
> While naming the file *-rfc is fine for an RFC patch series, it better not be the
> final name that you actually want committed.
> 
> >
> > Changes since v1:
> >
> > * fix spelling errors
> > * move BackupInfo from BDS to BackupBlockJob
> > * introduce BackupDriver to allow more than one backup format
> > * vma: add suport to store vmstate (size is not known in advance)
> > * add ability to store VM state
> >
> > Changes since v2:
> >
> > * BackupDriver: remove cancel_cb
> > * use enum for BackupFormat
> > * vma: use bdrv_open instead of bdrv_file_open
> > * vma: fix aio, use O_DIRECT
> > * backup one drive after another (try to avoid high load)
> 
> Also, it is customary to list series revision history after the --- separator;
> again, something useful for reviewers, but pointless in the actual git history.

OK, I will send a cover-letter next time.

> > Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
> > ---
> >  docs/backup-rfc.txt |  119
> > +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 119 insertions(+), 0 deletions(-)  create mode
> > 100644 docs/backup-rfc.txt
> >
> > diff --git a/docs/backup-rfc.txt b/docs/backup-rfc.txt new file mode
> > 100644 index 0000000..5b4b3df
> > --- /dev/null
> > +++ b/docs/backup-rfc.txt
> > @@ -0,0 +1,119 @@
> > +RFC: Efficient VM backup for qemu
> 
> You already have RFC in the subject line; you don't need it here in your
> proposed contents.

OK

> > +
> > +That basically means that any data written during backup involve
> > +considerable overhead. For LVM we get the following steps:
> > +
> > +1.) read original data (VM write)
> 
> Shouldn't that be '(VM read)'?

No, that 'read' is triggered by the VM write.

> > +2.) write original data into snapshot (VM write)
> > +3.) write new data (VM write)
> > +4.) read data from snapshot (backup)
> > +5.) write data from snapshot into tar file (backup)
> > +
> > +Another approach to backup VM images is to create a new qcow2 image
> > +which use the old image as base. During backup, writes are redirected
> > +to the new image, so the old image represents a 'snapshot'. After
> > +backup, data need to be copied back from new image into the old one
> > +(commit). So a simple write during backup triggers the following
> > +steps:
> > +
> > +1.) write new data to new image (VM write)
> > +2.) read data from old image (backup)
> > +3.) write data from old image into tar file (backup)
> > +
> > +4.) read data from new image (commit)
> > +5.) write data to old image (commit)
> > +
> > +This is in fact the same overhead as before. Other tools like qemu
> > +livebackup produces similar overhead (2 reads, 3 writes).
> > +
> > +Some storage types/formats supports internal snapshots using some
> > +kind of reference counting (rados, sheepdog, dm-thin, qcow2). It
> > +would be possible to use that for backups, but for now we want to be
> > storage-independent.
> > +
> > +Note: It turned out that taking a qcow2 snapshot can take a very long
> > +time on larger files.
> 
> That's an independent issue, and there have been patches proposed to try
> and reduce that time.

Will remove that comment.

> > +
> > +=Make it more efficient=
> > +
> > +The be more efficient, we simply need to avoid unnecessary steps. The
> > +following steps are always required:
> > +
> > +1.) read old data before it gets overwritten
> > +2.) write that data into the backup archive
> > +3.) write new data (VM write)
> > +
> > +As you can see, this involves only one read, an two writes.
> 
> s/an/and/
> 
> > +
> > +To make that work, our backup archive need to be able to store image
> > +data 'out of order'. It is important to notice that this will not
> > +work with traditional archive formats like tar.
> 
> Are you also requiring that the output file descriptor be seekable?

No, it works with pipes (like tar).
Markus Armbruster - Feb. 20, 2013, 8:01 a.m.
Dietmar Maurer <dietmar@proxmox.com> writes:

>> > * Backup to a single archive file
>> > * Backup contain all data to restore VM (full backup)
>> > * Do not depend on storage type or image format
>> > * Avoid use of temporary storage
>> > * store sparse images efficiently
>> 
>> It is customary to send a 0/6 cover letter for details like this, rather than
>> slamming it into the first patch (git send-email --cover-letter).
>
> But how do I maintain the content of that cover-letter when it is not
> part of the git tree?

Nothing stops you from committing it to your branch.  The extra commit
isn't sent out, of course.

I guess people usually archive it in e-mail instead.
Kevin Wolf - Feb. 20, 2013, 11:03 a.m.
On Wed, Feb 20, 2013 at 09:01:16AM +0100, Markus Armbruster wrote:
> Dietmar Maurer <dietmar@proxmox.com> writes:
> 
> >> > * Backup to a single archive file
> >> > * Backup contain all data to restore VM (full backup)
> >> > * Do not depend on storage type or image format
> >> > * Avoid use of temporary storage
> >> > * store sparse images efficiently
> >> 
> >> It is customary to send a 0/6 cover letter for details like this, rather than
> >> slamming it into the first patch (git send-email --cover-letter).
> >
> > But how do I maintain the content of that cover-letter when it is not
> > part of the git tree?
> 
> Nothing stops you from committing it to your branch.  The extra commit
> isn't sent out, of course.
> 
> I guess people usually archive it in e-mail instead.

I use separate format-patch and send-email steps, so I'll still have the
cover letter file around when I send the next version.

Kevin

Patch

diff --git a/docs/backup-rfc.txt b/docs/backup-rfc.txt
new file mode 100644
index 0000000..5b4b3df
--- /dev/null
+++ b/docs/backup-rfc.txt
@@ -0,0 +1,119 @@ 
+RFC: Efficient VM backup for qemu
+
+=Requirements=
+
+* Backup to a single archive file
+* Backup needs to contain all data to restore VM (full backup)
+* Do not depend on storage type or image format
+* Avoid use of temporary storage
+* store sparse images efficiently
+
+=Introduction=
+
+Most VM backup solutions use some kind of snapshot to get a consistent
+VM view at a specific point in time. For example, we previously used
+LVM to create a snapshot of all used VM images, which are then copied
+into a tar file.
+
+That basically means that any data written during backup involves
+considerable overhead. For LVM we get the following steps:
+
+1.) read original data (VM write)
+2.) write original data into snapshot (VM write)
+3.) write new data (VM write)
+4.) read data from snapshot (backup)
+5.) write data from snapshot into tar file (backup)
+
+Another approach to backing up VM images is to create a new qcow2 image
+which uses the old image as its base. During backup, writes are redirected
+to the new image, so the old image represents a 'snapshot'. After
+backup, the data needs to be copied back from the new image into the old
+one (commit). So a simple write during backup triggers the following
+steps:
+
+1.) write new data to new image (VM write)
+2.) read data from old image (backup)
+3.) write data from old image into tar file (backup)
+
+4.) read data from new image (commit)
+5.) write data to old image (commit)
+
+This is in fact the same overhead as before. Other tools like qemu
+livebackup produce similar overhead (2 reads, 3 writes).
+
+Some storage types/formats support internal snapshots using some kind
+of reference counting (rados, sheepdog, dm-thin, qcow2). It would be possible
+to use that for backups, but for now we want to be storage-independent.
+
+Note: It turned out that taking a qcow2 snapshot can take a very long
+time on larger files.
+
+=Make it more efficient=
+
+To be more efficient, we simply need to avoid unnecessary steps. The
+following steps are always required:
+
+1.) read old data before it gets overwritten
+2.) write that data into the backup archive
+3.) write new data (VM write)
+
+As you can see, this involves only one read, an two writes.
+
+To make that work, our backup archive needs to be able to store image
+data 'out of order'. It is important to notice that this will not work
+with traditional archive formats like tar.
+
+During backup we simply intercept writes, then read existing data and
+store that directly into the archive. After that we can continue the
+write.
+
+==Advantages==
+
+* very good performance (1 read, 2 writes)
+* works on any storage type and image format.
+* avoid usage of temporary storage
+* we can define a new and simple archive format, which is able to
+  store sparse files efficiently.
+
+Note: Storing sparse files is a mess with existing archive
+formats. For example, tar requires information about holes at the
+beginning of the archive.
+
+==Disadvantages==
+
+* we need to define a new archive format
+
+Note: Most existing archive formats are optimized to store small files
+including file attributes. We simply do not need that for VM archives.
+
+* archive contains data 'out of order'
+
+If you want to access image data in sequential order, you need to
+re-order archive data. It would be possible to do that on the fly,
+using temporary files.
+
+Fortunately, a normal restore/extract works perfectly with 'out of
+order' data, because the target files are seekable.
+
+* slow backup storage can slow down VM during backup
+
+It is important to note that we only do sequential writes to the
+backup storage. Furthermore one can compress the backup stream. IMHO,
+it is better to slow down the VM a bit. All other solutions create
+large amounts of temporary data during backup.
+
+=Archive format requirements=
+
+The basic requirement for such a new format is that we can store image
+data 'out of order'. It is also very likely that we have less than 256
+drives/images per VM, and we want to be able to store VM configuration
+files.
+
+We have defined a very simple format with those properties, see:
+
+docs/specs/vma_spec.txt
+
+Please let us know if you know an existing format which provides the
+same functionality.
+
+
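
The copy-before-write scheme described above can be sketched in a few lines. This is a toy model only: the cluster size, the record layout, and every name below are invented for illustration and do not reflect the actual vma/qemu implementation.

```python
import io
import struct

CLUSTER = 4096  # illustrative cluster size, not vma's real layout


class BackupJob:
    """Toy copy-before-write backup: old data is appended to the archive
    (out of order, tagged with its cluster number) before a guest write
    is allowed to land on the disk image."""

    def __init__(self, disk: bytearray, archive):
        self.disk = disk
        self.archive = archive   # sequential append-only stream (pipe-friendly)
        self.saved = set()       # clusters already copied into the archive

    def _save_cluster(self, n):
        if n in self.saved:
            return
        self.saved.add(n)
        data = bytes(self.disk[n * CLUSTER:(n + 1) * CLUSTER])
        if data.count(0) == len(data):
            return               # sparse: all-zero clusters are simply skipped
        # record = cluster number + payload; only appends, never seeks
        self.archive.write(struct.pack("<Q", n))
        self.archive.write(data)

    def guest_write(self, offset, data):
        first = offset // CLUSTER
        last = (offset + len(data) - 1) // CLUSTER
        for n in range(first, last + 1):
            self._save_cluster(n)                    # 1 read + 1 archive write
        self.disk[offset:offset + len(data)] = data  # then the VM write proceeds

    def finish(self):
        # flush every cluster the guest never touched during the backup
        for n in range(len(self.disk) // CLUSTER):
            self._save_cluster(n)
```

Because each record carries its cluster number, the archive is written strictly sequentially (so it can go to a pipe or through a compressor), while restore simply seeks in the target image to put each cluster back in place.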