
[v2] persistent dirty bitmap: add QDB file spec.

Message ID 1416479664-3414-1-git-send-email-vsementsov@parallels.com
State New

Commit Message

Vladimir Sementsov-Ogievskiy Nov. 20, 2014, 10:34 a.m. UTC
A QDB file stores dirty bitmaps. The specification is based on the
qcow2 specification.

Saving several bitmaps is necessary when the server shuts down during a
backup. In this case two tables are available for each disk: one
collected during the previous period and one active. This feature is
still open to discussion, though.

Big-endian format and the Standard Cluster Descriptor are used to simplify
integration with qcow2, so that internal bitmaps for qcow2 can be
supported in the future.

The idea is that the same procedure that writes the data to a QDB file
could do the same for QCOW2. The only difference is the cluster refcount
table. Whether we should use it here is still an open question.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@parallels.com>
---
 docs/specs/qdb.txt | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 docs/specs/qdb.txt

Comments

Vladimir Sementsov-Ogievskiy Nov. 20, 2014, 10:41 a.m. UTC | #1
Also, it may be better to make this a qcow2 extension. The bitmap would
then be saved in a separate qcow2 file that contains only the bitmap(s)
and no other data (no disk, no snapshots).

Best regards,
Vladimir

On 20.11.2014 13:34, Vladimir Sementsov-Ogievskiy wrote:
> QDB file is for storing dirty bitmap. The specification is based on
> qcow2 specification.
>
> Saving several bitmaps is necessary when server shutdowns during
> backup. In this case 2 tables for each disk are available. One
> collected for a previous period and one active. Though this feature
> is discussable.
>
> Big endian format and Standard Cluster Descriptor are used to simplify
> integration with qcow2, to support internal bitmaps for qcow2 in future.
>
> The idea is that the same procedure writing the data to QDB file could
> do the same for QCOW2. The only difference is cluster refcount table.
> Should we use it here or not is still questionable.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@parallels.com>
> ---
>   docs/specs/qdb.txt | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 132 insertions(+)
>   create mode 100644 docs/specs/qdb.txt
>
> diff --git a/docs/specs/qdb.txt b/docs/specs/qdb.txt
> new file mode 100644
> index 0000000..d570a69
> --- /dev/null
> +++ b/docs/specs/qdb.txt
> @@ -0,0 +1,132 @@
> +== General ==
> +
> +"QDB" means "Qemu Dirty Bitmaps". QDB file can store several dirty bitmaps.
> +QDB file is organized in units of constant size, which are called clusters.
> +
> +All numbers in QDB are stored in Big Endian byte order.
> +
> +== Header ==
> +
> +The first cluster of a QDB image contains the file header:
> +
> +    Byte  0 -  3:   magic
> +                    QDB magic string ("QDB\0")
> +
> +          4 -  7:   version
> +                    Version number (valid value is 1)
> +
> +          8 - 11:   cluster_bits
> +                    Number of bits that are used for addressing an offset
> +                    within a cluster (1 << cluster_bits is the cluster size).
> +                    Must not be less than 9 (i.e. 512 byte clusters).
> +
> +         12 - 15:   nb_bitmaps
> +                    Number of bitmaps contained in the file
> +
> +         16 - 23:   bitmaps_offset
> +                    Offset into the QDB file at which the bitmap table starts.
> +                    Must be aligned to a cluster boundary.
> +
> +         24 - 27:   header_length
> +                    Length of the header structure in bytes.
> +
> +Like in qcow2, directly after the image header, optional sections called header extensions can
> +be stored. Each extension has a structure like the following:
> +
> +    Byte  0 -  3:   Header extension type:
> +                        0x00000000 - End of the header extension area
> +                        other      - Unknown header extension, can be safely
> +                                     ignored
> +
> +          4 -  7:   Length of the header extension data
> +
> +          8 -  n:   Header extension data
> +
> +          n -  m:   Padding to round up the header extension size to the next
> +                    multiple of 8.
> +
> +Unless stated otherwise, each header extension type shall appear at most once
> +in the same image.
> +
> +== Cluster mapping ==
> +
> +QDB uses a ONE-level structure for the mapping of
> +bitmaps to host clusters. It is called L1 table.
> +
> +The L1 table has a variable size (stored in the Bitmap table entry) and may
> +use multiple clusters, however it must be contiguous in the QDB file.
> +
> +Given a offset into the bitmap, the offset into the QDB file can be
> +obtained as follows:
> +
> +    offset = l1_table[offset / cluster_size] + (offset % cluster_size)
> +
> +L1 table entry:
> +
> +    Bit  0 -  61:   Cluster descriptor
> +
> +        62 -  63:   Reserved
> +
> +Standard Cluster Descriptor (the same as in qcow2):
> +
> +    Bit       0:    If set to 1, the cluster reads as all zeros. The host
> +                    cluster offset can be used to describe a preallocation,
> +                    but it won't be used for reading data from this cluster,
> +                    nor is data read from the backing file if the cluster is
> +                    unallocated.
> +
> +         1 -  8:    Reserved (set to 0)
> +
> +         9 - 55:    Bits 9-55 of host cluster offset. Must be aligned to a
> +                    cluster boundary. If the offset is 0, the cluster is
> +                    unallocated.
> +
> +        56 - 61:    Reserved (set to 0)
> +
> +If a cluster is unallocated, read requests shall read zero.
> +
> +== Bitmap table ==
> +
> +QDB supports storing of several bitmaps.
> +
> +A directory of all bitmaps is stored in the bitmap table, a contiguous area
> +in the QDB file, whose starting offset and length are given by the header
> +fields bitmaps_offset and nb_bitmaps. The entries of the bitmap table
> +have variable length, depending on the length of name and extra data.
> +
> +Bitmap table entry:
> +
> +    Byte 0 -  7:    Offset into the QDB file at which the L1 table for the
> +                    bitmap starts. Must be aligned to a cluster boundary.
> +
> +         8 - 11:    Number of entries in the L1 table of the bitmap
> +
> +        12 - 15:    Bitmap granularity
> +                    As represented in HBitmap structure. Given a granularity of
> +                    G, each bit in the bitmap will actually represent a group
> +                    of 2^G bytes.
> +
> +        16 - 23:    Bitmap size
> +                    The size of really stored data is
> +                    (size + (1 << granularity) - 1) >> granularity
> +
> +             24:    Bitmap enabled flag (valid values are 1 and 0)
> +
> +             25:    File dirty flag (valid values are 1 and 0;
> +                    if value is 1 the bitmap is inconsistent)
> +
> +        26 - 27:    Size of the bitmap name
> +
> +        36 - 39:    Size of extra data in the table entry (used for future
> +                    extensions of the format)
> +
> +        variable:   Extra data for future extensions. Unknown fields must be
> +                    ignored.
> +
> +        variable:   Name of the bitmap (not null terminated)
> +
> +        variable:   Padding to round up the bitmap table entry size to the
> +                    next multiple of 8.
> +
> +The fields "size", "granularity", "enabled" and "name" are corresponding with
> +the fields in struct BdrvDirtyBitmap.
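
For reference, the cluster mapping lookup from the quoted spec could be sketched in C roughly as follows. This is only an illustration, not code from the patch: the names qdb_map_offset, QDB_HOST_OFFSET_MASK and QDB_FLAG_ZERO are made up here, the L1 table is assumed to be already loaded and converted to host byte order, cluster_bits is assumed to be validated, and bit 0 is taken to be the least significant bit, as in qcow2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits 9-55 of an L1 entry hold the host cluster offset (hypothetical name). */
#define QDB_HOST_OFFSET_MASK 0x00fffffffffffe00ULL
/* Bit 0: the cluster reads as all zeros (hypothetical name). */
#define QDB_FLAG_ZERO 1ULL

/*
 * Translate an offset into the bitmap data to an offset into the QDB file.
 * Returns false if the cluster reads as zeros (zero flag set or cluster
 * unallocated) or if the offset lies beyond the L1 table; *file_offset is
 * left untouched in that case.
 */
static bool qdb_map_offset(const uint64_t *l1_table, uint32_t l1_entries,
                           uint32_t cluster_bits, uint64_t offset,
                           uint64_t *file_offset)
{
    uint64_t cluster_size = 1ULL << cluster_bits;
    uint64_t index = offset >> cluster_bits;   /* offset / cluster_size */
    uint64_t entry, host;

    if (index >= l1_entries) {
        return false;
    }
    entry = l1_table[index];
    host = entry & QDB_HOST_OFFSET_MASK;
    if ((entry & QDB_FLAG_ZERO) || host == 0) {
        return false;                          /* reads as zeros */
    }
    *file_offset = host + (offset & (cluster_size - 1));
    return true;
}
```

A caller that gets false back would fill the read buffer with zeros, which matches the "unallocated clusters read as zero" rule in the spec.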
Stefan Hajnoczi Nov. 20, 2014, 11:36 a.m. UTC | #2
On Thu, Nov 20, 2014 at 01:41:14PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Also, it may be better to make this as qcow2 extension. And bitmap will be
> saved in separate qcow2 file, which will contain only the bitmap(s) and no
> other data (no disk, no snapshots).

I think you are on to something with the idea of making the persistent
dirty bitmap itself a disk image.

That way drive-mirror and other commands can be used to live migrate the
dirty bitmap along with the guest's disks.  This allows both QEMU and
management tools to reuse existing code.

(We may need to allow multiple block jobs per BlockDriverState to make
this work but in theory that can be done.)

There is a constraint if we want to get live migration for free: The
bitmap contents must be accessible with bdrv_read() and
bdrv_get_block_status() to skip zero regions.

Putting the dirty bitmap into its own data structure in qcow2 and not
accessible as a BlockDriverState bdrv_read() means custom code must be
written to migrate the dirty bitmap.

So I suggest putting the bitmap contents into a disk image that can be
accessed as a BlockDriverState with bdrv_read().  The metadata (bitmap
name, granularity, etc) doesn't need to be stored in the image file
because management tools must be aware of it anyway.

The only thing besides the data that really needs to be stored is the
up-to-date flag to decide whether this dirty bitmap was synced cleanly.
A much simpler format would do for that.

Stefan
Eric Blake Nov. 21, 2014, 12:24 a.m. UTC | #3
On 11/20/2014 03:34 AM, Vladimir Sementsov-Ogievskiy wrote:
> QDB file is for storing dirty bitmap. The specification is based on
> qcow2 specification.
> 
> Saving several bitmaps is necessary when server shutdowns during
> backup. In this case 2 tables for each disk are available. One
> collected for a previous period and one active. Though this feature
> is discussable.
> 
> Big endian format and Standard Cluster Descriptor are used to simplify
> integration with qcow2, to support internal bitmaps for qcow2 in future.
> 
> The idea is that the same procedure writing the data to QDB file could
> do the same for QCOW2. The only difference is cluster refcount table.
> Should we use it here or not is still questionable.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@parallels.com>
> ---
>  docs/specs/qdb.txt | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 132 insertions(+)
>  create mode 100644 docs/specs/qdb.txt

No comment on whether the approach itself makes sense - just a
high-level review of this document in isolation.

> 
> diff --git a/docs/specs/qdb.txt b/docs/specs/qdb.txt
> new file mode 100644
> index 0000000..d570a69
> --- /dev/null
> +++ b/docs/specs/qdb.txt
> @@ -0,0 +1,132 @@
> +== General ==

Missing a copyright notice.  Yeah, you've got a lot of bad examples in
this directory (in docs/* in general), but there ARE a few of the newer
files that are starting to buck the trend and use a copyright/license blurb.

> +
> +"QDB" means "Qemu Dirty Bitmaps". QDB file can store several dirty bitmaps.
> +QDB file is organized in units of constant size, which are called clusters.
> +
> +All numbers in QDB are stored in Big Endian byte order.
> +
> +== Header ==
> +
> +The first cluster of a QDB image contains the file header:
> +
> +    Byte  0 -  3:   magic
> +                    QDB magic string ("QDB\0")
> +
> +          4 -  7:   version
> +                    Version number (valid value is 1)
> +
> +          8 - 11:   cluster_bits
> +                    Number of bits that are used for addressing an offset
> +                    within a cluster (1 << cluster_bits is the cluster size).
> +                    Must not be less than 9 (i.e. 512 byte clusters).

Is there a maximum?

> +
> +         12 - 15:   nb_bitmaps
> +                    Number of bitmaps contained in the file
> +
> +         16 - 23:   bitmaps_offset
> +                    Offset into the QDB file at which the bitmap table starts.
> +                    Must be aligned to a cluster boundary.
> +
> +         24 - 27:   header_length
> +                    Length of the header structure in bytes.

Does that include the length of all extensions?  Should we enforce a
maximum header length of one cluster?
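
To make the open questions concrete, here is a minimal sketch of how a reader might parse and validate the fixed header fields listed in the spec. It is not part of the patch. The struct and function names are hypothetical, and two of the checks are assumptions that the spec does not state: the upper bound QDB_MAX_CLUSTER_BITS (borrowed from qcow2's limit of 21) and the rule that header_length must fit in one cluster.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QDB_HEADER_MIN_LEN   28  /* bytes 0-27 as listed in the spec */
#define QDB_MAX_CLUSTER_BITS 21  /* ASSUMPTION: not in the spec, mirrors qcow2 */

typedef struct QdbHeader {       /* hypothetical in-memory form */
    uint32_t version;
    uint32_t cluster_bits;
    uint32_t nb_bitmaps;
    uint64_t bitmaps_offset;
    uint32_t header_length;
} QdbHeader;

/* Read big-endian integers from a byte buffer. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t be64(const uint8_t *p)
{
    return ((uint64_t)be32(p) << 32) | be32(p + 4);
}

/* Returns 0 on success, -1 if the buffer is not a plausible QDB header. */
static int qdb_parse_header(const uint8_t *buf, size_t len, QdbHeader *h)
{
    if (len < QDB_HEADER_MIN_LEN || memcmp(buf, "QDB\0", 4) != 0) {
        return -1;
    }
    h->version        = be32(buf + 4);
    h->cluster_bits   = be32(buf + 8);
    h->nb_bitmaps     = be32(buf + 12);
    h->bitmaps_offset = be64(buf + 16);
    h->header_length  = be32(buf + 24);

    if (h->version != 1) {
        return -1;
    }
    if (h->cluster_bits < 9 || h->cluster_bits > QDB_MAX_CLUSTER_BITS) {
        return -1;
    }
    /* The bitmap table must start on a cluster boundary. */
    if (h->bitmaps_offset & ((1ULL << h->cluster_bits) - 1)) {
        return -1;
    }
    /* ASSUMPTION: the header covers at least the fixed fields and fits in one cluster. */
    if (h->header_length < QDB_HEADER_MIN_LEN ||
        h->header_length > (1ULL << h->cluster_bits)) {
        return -1;
    }
    return 0;
}
```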

> +
> +Like in qcow2, directly after the image header, optional sections called header extensions can
> +be stored. Each extension has a structure like the following:
> +
> +    Byte  0 -  3:   Header extension type:
> +                        0x00000000 - End of the header extension area
> +                        other      - Unknown header extension, can be safely
> +                                     ignored
> +
> +          4 -  7:   Length of the header extension data
> +
> +          8 -  n:   Header extension data
> +
> +          n -  m:   Padding to round up the header extension size to the next
> +                    multiple of 8.
> +
> +Unless stated otherwise, each header extension type shall appear at most once
> +in the same image.

I like how qcow2 v3 has a header extension for listing the name of each
header extension, for nicer error messages.  Also, I think that
declaring all unknown extensions as ignorable may be dangerous, since
you lack a capability bitmask.  Maybe it would be wise to copy the qcow2
v3 capabilities (including flags for ignorable vs. mandatory support of
given features, where a client can sanely decide what to do if it does
not recognize a feature).
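
As an illustration of the extension layout under discussion (type, length, data, padding to a multiple of 8), a reader could walk the extension area roughly like this. This is a sketch under stated assumptions, not code from the patch: the function names are hypothetical, and it follows the current draft in skipping every unknown type, which is exactly the behaviour a capability bitmask would refine.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 32-bit integer from a byte buffer. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Round n up to the next multiple of 8, as the spec requires for padding. */
static size_t qdb_align8(size_t n)
{
    return (n + 7) & ~(size_t)7;
}

/*
 * Walk the header extension area in buf[0..len) and return the number of
 * extensions found before the end marker (type 0), or -1 if the area is
 * truncated. Unknown types are skipped, as the draft spec allows.
 */
static int qdb_count_extensions(const uint8_t *buf, size_t len)
{
    size_t pos = 0;
    int count = 0;

    for (;;) {
        uint32_t type, ext_len;
        size_t padded;

        if (len - pos < 8) {
            return -1;              /* no room for type + length */
        }
        type = be32(buf + pos);
        ext_len = be32(buf + pos + 4);
        if (type == 0) {
            return count;           /* end of the header extension area */
        }
        pos += 8;
        if (ext_len > len - pos) {
            return -1;              /* data runs past the buffer */
        }
        padded = qdb_align8(ext_len);
        if (padded > len - pos) {
            return -1;              /* padding runs past the buffer */
        }
        pos += padded;
        count++;
    }
}
```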

> +
> +        26 - 27:    Size of the bitmap name
> +
> +        36 - 39:    Size of extra data in the table entry (used for future
> +                    extensions of the format)
> +
> +        variable:   Extra data for future extensions. Unknown fields must be
> +                    ignored.

This block has width 0 if bytes 36-39 are 0?  How are extensions
identified?  Are they required to be done like overall file headers,
with an id, length, and then variable data, so that it is possible to
scan to the end of each unknown extension to see if the next extension
is known?  This is where capability bits in the overall header may make
more sense.

> +
> +        variable:   Name of the bitmap (not null terminated)

The length of this block is determined by bytes 26-27?

> +
> +        variable:   Padding to round up the bitmap table entry size to the
> +                    next multiple of 8.
> +
> +The fields "size", "granularity", "enabled" and "name" are corresponding with
> +the fields in struct BdrvDirtyBitmap.
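
The "size of really stored data" formula in the bitmap table entry can be spelled out as a small C helper. This is only a sketch: the function names are hypothetical, and the second helper rests on assumptions that the spec does not state (bits packed eight per byte, one L1 entry per cluster of bitmap data).

```c
#include <stdint.h>

/*
 * Number of bits stored for a bitmap, per the formula in the spec:
 * (size + (1 << granularity) - 1) >> granularity
 */
static uint64_t qdb_stored_bits(uint64_t size, uint32_t granularity)
{
    return (size + (1ULL << granularity) - 1) >> granularity;
}

/*
 * ASSUMPTION: bits are packed eight per byte and each L1 entry maps one
 * cluster of bitmap data, so this is the number of L1 entries needed.
 */
static uint64_t qdb_l1_entries_needed(uint64_t size, uint32_t granularity,
                                      uint32_t cluster_bits)
{
    uint64_t bytes = (qdb_stored_bits(size, granularity) + 7) / 8;
    uint64_t cluster_size = 1ULL << cluster_bits;

    return (bytes + cluster_size - 1) >> cluster_bits;
}
```

Under these assumptions, a bitmap of size 2^40 with granularity 16 stores 2^24 bits, that is 2 MiB of data, which needs 32 L1 entries with 64 KiB clusters.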
Vladimir Sementsov-Ogievskiy Nov. 21, 2014, 10:27 a.m. UTC | #4
> There is a constraint if we want to get live migration for free: The
> bitmap contents must be accessible with bdrv_read() and
> bdrv_get_block_status() to skip zero regions.
Hm. I'm afraid it still will not be free. If the bitmap is active, its
current version is in memory. To migrate the bitmap file like a disk image,
we would have to start syncing it with every write to the corresponding
disk, doubling the number of I/O operations.

Moreover, we have normal dirty bitmaps, which have no name/file. Do we
migrate them? What if, for example, the migration occurs while a backup is
in progress? Active bitmaps should be migrated in the same way for
persistent/named/normal bitmaps. I can't find bitmap migration in the qemu
source; is there any?

Or are you talking about migrating disabled bitmaps? Hm. We should sync the
bitmap file on bitmap_disable. A disabled persistent bitmap is just a
static file of ~30 MB; we can easily migrate it without a common procedure
with cow or something like that.

Best regards,
Vladimir

Vladimir Sementsov-Ogievskiy Nov. 21, 2014, 12:56 p.m. UTC | #5
> The metadata (bitmap
> name, granularity, etc) doesn't need to be stored in the image file
> because management tools must be aware of it anyway.
What tools do you mean? In my opinion a dirty bitmap should exist as a
separate object. If it exists, it should be loaded with its drive image
and it should be maintained by qemu (loaded and enabled as a
BdrvDirtyBitmap). If we use the qcow2 format for dirty bitmaps, we can
store the metadata using a header extension.

Snapshots may also be used to store several bitmaps in the case where the
server shuts down during a backup and we need to store both the current
active bitmap and its snapshot used by the backup.

Best regards,
Vladimir

Stefan Hajnoczi Nov. 21, 2014, 4:55 p.m. UTC | #6
On Fri, Nov 21, 2014 at 01:27:40PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> >There is a constraint if we want to get live migration for free: The
> >bitmap contents must be accessible with bdrv_read() and
> >bdrv_get_block_status() to skip zero regions.
> Hm. I'm afraid, it still will not be free. If bitmap is active, it's actual
> version is in memory. To migrate bitmap file like a disk image, we should
> start syncing it with every write to corresponding disk, doubling number of
> io.

It would be possible to drive-mirror the persistent dirty bitmap and
then flush it like all drives when the guest vCPUs are paused for
migration.

After thinking more about it though, this approach places more I/O into
the critical guest downtime phase.  In other words, slow disk I/O could
lead to long guest downtimes while QEMU tries to write out the dirty
bitmap.

> Moreover, we have normal dirty bitmaps, which have no name/file, do we
> migrate them? If, for example, the migration occurs when backup in progress?
> Active bitmaps should be migrated in the same way for
> persistent/named/normal bitmaps. I can't find in qemu source, is there
> bitmap migration?

bs->dirty_bitmaps is not migrated, in fact none of BlockDriverState is
migrated.

QEMU only migrates emulated device state (e.g. the hardware registers
and associated state).  It does not migrate host state that the guest
cannot see, like the dirty bitmap.

> Or you are saying about migrating disabled bitmaps? Hm. We should sync
> bitmap file on bitmap_disable. Disabled persistent bitmap is just a static
> file ~30mb, we can easily migrate it without common procedure with cow or
> something like this..

Active dirty bitmaps should migrate too.  I'm thinking now that the
appropriate thing is to add live migration of dirty bitmaps to QEMU
(regardless of whether they are active or not).

Stefan
Vladimir Sementsov-Ogievskiy Nov. 24, 2014, 9:19 a.m. UTC | #7
> Active dirty bitmaps should migrate too.  I'm thinking now that the
> appropriate thing is to add live migration of dirty bitmaps to QEMU
> (regardless of whether they are active or not).
Only for persistent bitmaps, or for all named bitmaps? If for all named
bitmaps, then this migration should not be tied to the bitmap file
and its format.

Best regards,
Vladimir

Vladimir Sementsov-Ogievskiy Nov. 25, 2014, 5:58 p.m. UTC | #8
> I'm thinking now that the
> appropriate thing is to add live migration of dirty bitmaps to QEMU
> (regardless of whether they are active or not).
Digging around in the code, I've found this:

In mig_save_device_dirty, which is actually one iteration of live block
migration, after sending a sector we need to clear the corresponding bit in
the migration dirty bitmap (bmds->dirty_bitmap). But we clear such bits in
all bitmaps associated with this device:

bdrv_reset_dirty(bmds->bs, sector, nr_sectors);

which is

void bdrv_reset_dirty(BlockDriverState *bs, int64_t cur_sector,
                      int nr_sectors)
{
    BdrvDirtyBitmap *bitmap;
    QLIST_FOREACH(bitmap, &bs->dirty_bitmaps, list) {
        hbitmap_reset(bitmap->bitmap, cur_sector, nr_sectors);
    }
}

I don't know why it is done this way, but with such an approach we can't
talk about dirty bitmap migration. In fact, all other dirty bitmaps not
related to this migration are broken because of this.

Either it's a mistake or I don't understand the concept of several dirty
bitmaps per device in qemu. I thought that they were separate entities
maintained by qemu, and that other subsystems like backup or migration
could create a bitmap for themselves and use it without touching other
bitmaps. Am I wrong?
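
The effect described above can be reproduced with a toy model. This is a self-contained sketch, not QEMU code: ToyDevice and the toy_* functions are hypothetical stand-ins, each bitmap tracks only 64 sectors, and callers are assumed to pass ranges that fit in those 64 bits. toy_reset_all mirrors the loop in bdrv_reset_dirty, while toy_reset_one shows the alternative where each user clears only its own bitmap.

```c
#include <stdint.h>

#define NB_BITMAPS 2
enum { MIGRATION = 0, BACKUP = 1 };

/* Toy device: each bitmap tracks 64 sectors, one bit per sector. */
typedef struct ToyDevice {
    uint64_t bitmaps[NB_BITMAPS];
} ToyDevice;

/* Mask covering 'count' sectors starting at 'first' (first + count <= 64). */
static uint64_t range_mask(unsigned first, unsigned count)
{
    uint64_t m = (count >= 64) ? ~0ULL : ((1ULL << count) - 1);
    return m << first;
}

/* Mirrors the loop in bdrv_reset_dirty(): clears the range in EVERY bitmap. */
static void toy_reset_all(ToyDevice *d, unsigned first, unsigned count)
{
    for (int i = 0; i < NB_BITMAPS; i++) {
        d->bitmaps[i] &= ~range_mask(first, count);
    }
}

/* Alternative: clear the range only in the bitmap owned by the caller. */
static void toy_reset_one(ToyDevice *d, int which, unsigned first,
                          unsigned count)
{
    d->bitmaps[which] &= ~range_mask(first, count);
}
```

With both bitmaps starting at 0xff, toy_reset_all(&dev, 0, 4) leaves 0xf0 in both, so the backup bitmap silently loses four dirty sectors; toy_reset_one(&dev, MIGRATION, 0, 4) leaves the backup bitmap at 0xff.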

Best regards,
Vladimir

Vladimir Sementsov-Ogievskiy Nov. 28, 2014, 1:28 p.m. UTC | #9
On 21.11.2014 19:55, Stefan Hajnoczi wrote:
> Active dirty bitmaps should migrate too.  I'm thinking now that the
> appropriate thing is to add live migration of dirty bitmaps to QEMU
> (regardless of whether they are active or not).
I think we should migrate named dirty bitmaps that are not currently in use.
If some external mechanism uses the bitmap (for example, backup), we
actually can't migrate this process, because we would need to restore
the whole backup structure including a pointer to the bitmap, which is
too hard and involves more than bitmap migration.
So, if a named bitmap is enabled but not used (only bdrv_aligned_pwritev
writes to it), it can be migrated. For this I see the following solutions:

1) Just save all corresponding pieces of the named bitmaps with every
migrated block. The block size is 1 MB, so the overhead for additionally
migrating a bitmap with 64 KB granularity would be 2 bytes, and it would be
256 bytes for a bitmap with 512-byte granularity. This approach needs
additional fields in BlkMigBlock for saving the bitmap pieces.

2) Add a DIRTY flag to the migrated block flags to distinguish blocks that
became dirty while migrating. Save all the bitmaps separately, and also
update them on block_load when we receive a block with the DIRTY flag set.
Some information will be lost: migrated dirty bitmaps may be "more
dirty" than the original ones. This approach needs an additional field
"bool dirty" in BlkMigBlock, and saving this flag in blk_send.

These solutions don't depend on the "persistence" of dirty bitmaps or on
the persistent bitmap file format.
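
The per-block overhead figures in solution 1 can be checked with a few lines of C. This is just the arithmetic; the function name is hypothetical, and both sizes are assumed to be powers of two with the granularity no larger than the block size.

```c
#include <stdint.h>

/*
 * Bytes of bitmap data that cover one migrated block.
 * Both arguments are in bytes; one bitmap bit covers 'granularity' bytes.
 */
static uint64_t bitmap_bytes_per_block(uint64_t block_size,
                                       uint64_t granularity)
{
    uint64_t bits = block_size / granularity;

    return (bits + 7) / 8;
}
```

For a 1 MB block this gives 16 bits (2 bytes) at 64 KB granularity and 2048 bits (256 bytes) at 512-byte granularity, matching the numbers above.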

Best regards,
Vladimir

Stefan Hajnoczi Dec. 1, 2014, 11:02 a.m. UTC | #10
On Fri, Nov 28, 2014 at 04:28:57PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 21.11.2014 19:55, Stefan Hajnoczi wrote:
> >Active dirty bitmaps should migrate too.  I'm thinking now that the
> >appropriate thing is to add live migration of dirty bitmaps to QEMU
> >(regardless of whether they are active or not).
> I think, we should migrate named dirty bitmaps, which are not used now. So
> if some external mechanism uses the bitmap (for example - backup) - we
> actually can't migrate this process, because we will need to restore the
> whole backup structure including a pointer to the bitmap, which is too hard
> and includes not only bitmap migration.
> So, if named bitmap is enabled, but not used (only bdrv_aligned_pwritev
> writes to it) it can be migrated. For this I see the following solutions:
> 
> 1) Just save all corresponding pieces of named bitmaps with every migrated
> block. The block size is 1mb, so the overhead for migrating additionally a
> bitmap with 64kb granularity would be 2b, and it would be 256b for bitmap
> with 512b granularity. This approach needs additional fields in BlkMigBlock,
> for saving bitmaps pieces.

block-migration.c is not used for all live migration.  So it's important
not to tie dirty bitmap migration to block-migration.c, at least there
needs to be a way to skip actually copying disk contents in
block-migration.c.

(When there is shared storage that both source and destination hosts can
access then block-migration.c is not used.  Also, there is a newer
non-shared storage migration mechanism that is used instead of
block-migration.c which is not tied into the live migration data stream,
so block-migration.c is optional.)

> 2) Add a DIRTY flag to the migrated block flags to distinguish blocks that
> became dirty while migrating. Save all the bitmaps separately, and also
> update them on block_load when we receive a block with the DIRTY flag set.
> Some information will be lost: migrated dirty bitmaps may be "more dirty"
> than the original ones. This approach needs an additional field "bool dirty"
> in BlkMigBlock, and saving this flag in blk_send.
> 
> These solutions don't depend on the "persistence" of dirty bitmaps or on the
> persistent bitmap file format.

That's an important characteristic since we probably want to migrate
named dirty bitmaps, whether they are persistent or not.

Stefan

Patch

diff --git a/docs/specs/qdb.txt b/docs/specs/qdb.txt
new file mode 100644
index 0000000..d570a69
--- /dev/null
+++ b/docs/specs/qdb.txt
@@ -0,0 +1,132 @@ 
+== General ==
+
+"QDB" means "Qemu Dirty Bitmaps". QDB file can store several dirty bitmaps.
+QDB file is organized in units of constant size, which are called clusters.
+
+All numbers in QDB are stored in Big Endian byte order.
+
+== Header ==
+
+The first cluster of a QDB image contains the file header:
+
+    Byte  0 -  3:   magic
+                    QDB magic string ("QDB\0")
+
+          4 -  7:   version
+                    Version number (valid value is 1)
+
+          8 - 11:   cluster_bits
+                    Number of bits that are used for addressing an offset
+                    within a cluster (1 << cluster_bits is the cluster size).
+                    Must not be less than 9 (i.e. 512 byte clusters).
+
+         12 - 15:   nb_bitmaps
+                    Number of bitmaps contained in the file
+
+         16 - 23:   bitmaps_offset
+                    Offset into the QDB file at which the bitmap table starts.
+                    Must be aligned to a cluster boundary.
+
+         24 - 27:   header_length
+                    Length of the header structure in bytes.
+
+Directly after the image header, optional sections called header extensions
+can be stored, as in qcow2. Each extension has a structure like the following:
+
+    Byte  0 -  3:   Header extension type:
+                        0x00000000 - End of the header extension area
+                        other      - Unknown header extension, can be safely
+                                     ignored
+
+          4 -  7:   Length of the header extension data
+
+          8 -  n:   Header extension data
+
+          n -  m:   Padding to round up the header extension size to the next
+                    multiple of 8.
+
+Unless stated otherwise, each header extension type shall appear at most once
+in the same image.
+
+== Cluster mapping ==
+
+QDB uses a one-level structure, called the L1 table, for mapping bitmaps to
+host clusters.
+
+The L1 table has a variable size (stored in the bitmap table entry) and may
+span multiple clusters; however, it must be contiguous in the QDB file.
+
+Given an offset into the bitmap, the offset into the QDB file can be
+obtained as follows:
+
+    file_offset = l1_table[offset / cluster_size] + (offset % cluster_size)
+
+L1 table entry:
+
+    Bit  0 -  61:   Cluster descriptor
+
+        62 -  63:   Reserved
+
+Standard Cluster Descriptor (the same as in qcow2):
+
+    Bit       0:    If set to 1, the cluster reads as all zeros. The host
+                    cluster offset can be used to describe a preallocation,
+                    but it won't be used for reading data from this cluster,
+                    nor is data read from the backing file if the cluster is
+                    unallocated.
+
+         1 -  8:    Reserved (set to 0)
+
+         9 - 55:    Bits 9-55 of host cluster offset. Must be aligned to a
+                    cluster boundary. If the offset is 0, the cluster is
+                    unallocated.
+
+        56 - 61:    Reserved (set to 0)
+
+If a cluster is unallocated, read requests shall read zero.
+
+== Bitmap table ==
+
+QDB supports storing several bitmaps.
+
+A directory of all bitmaps is stored in the bitmap table, a contiguous area
+in the QDB file, whose starting offset and length are given by the header
+fields bitmaps_offset and nb_bitmaps. The entries of the bitmap table
+have variable length, depending on the length of name and extra data.
+
+Bitmap table entry:
+
+    Byte 0 -  7:    Offset into the QDB file at which the L1 table for the
+                    bitmap starts. Must be aligned to a cluster boundary.
+
+         8 - 11:    Number of entries in the L1 table of the bitmap
+
+        12 - 15:    Bitmap granularity
+                    As represented in HBitmap structure. Given a granularity of
+                    G, each bit in the bitmap will actually represent a group
+                    of 2^G bytes.
+
+        16 - 23:    Bitmap size
+                    The amount of data actually stored, in bits, is
+                    (size + (1 << granularity) - 1) >> granularity
+
+             24:    Bitmap enabled flag (valid values are 1 and 0)
+
+             25:    File dirty flag (valid values are 1 and 0;
+                    if value is 1 the bitmap is inconsistent)
+
+        26 - 27:    Size of the bitmap name
+
+        36 - 39:    Size of extra data in the table entry (used for future
+                    extensions of the format)
+
+        variable:   Extra data for future extensions. Unknown fields must be
+                    ignored.
+
+        variable:   Name of the bitmap (not null terminated)
+
+        variable:   Padding to round up the bitmap table entry size to the
+                    next multiple of 8.
+
+The fields "size", "granularity", "enabled" and "name" correspond to the
+fields in struct BdrvDirtyBitmap.
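
For illustration only (not part of the patch): a minimal Python sketch of how
a reader might decode the 28-byte header, translate a bitmap offset through
the L1 table, and compute the stored bitmap size, following the layout
described in the spec above. The function names (parse_header, map_offset,
stored_bits) and the constants are hypothetical and do not exist in QEMU.
Treating bit 0 as the least significant bit, as qcow2 does, is an assumption.

```python
import struct

# Header layout from the spec: magic, version, cluster_bits, nb_bitmaps,
# bitmaps_offset, header_length; all big endian, 28 bytes in total.
HEADER_FMT = ">4sIIIQI"
HEADER_SIZE = struct.calcsize(HEADER_FMT)

ZERO_FLAG = 1 << 0                            # bit 0: cluster reads as zeros
OFFSET_MASK = ((1 << 56) - 1) & ~((1 << 9) - 1)  # bits 9-55: host offset


def parse_header(buf):
    """Decode the fixed part of the header (hypothetical helper)."""
    magic, version, cluster_bits, nb_bitmaps, bitmaps_offset, header_length = \
        struct.unpack(HEADER_FMT, buf[:HEADER_SIZE])
    if magic != b"QDB\0":
        raise ValueError("bad magic")
    if version != 1:
        raise ValueError("unsupported version")
    if cluster_bits < 9:
        raise ValueError("cluster_bits must not be less than 9")
    return {
        "cluster_bits": cluster_bits,
        "nb_bitmaps": nb_bitmaps,
        "bitmaps_offset": bitmaps_offset,
        "header_length": header_length,
    }


def map_offset(l1_table, offset, cluster_bits):
    """Translate an offset into the bitmap to an offset into the file.

    Returns None when the cluster is unallocated or flagged as all zeros,
    in which case a read is expected to return zeros.
    """
    cluster_size = 1 << cluster_bits
    entry = l1_table[offset // cluster_size]
    host = entry & OFFSET_MASK
    if (entry & ZERO_FLAG) or host == 0:
        return None
    return host + (offset % cluster_size)


def stored_bits(size, granularity):
    """Number of bits stored for a bitmap, per the formula in the spec."""
    return (size + (1 << granularity) - 1) >> granularity
```

A caller would read the first cluster, pass it to parse_header, then load the
L1 table referenced by a bitmap table entry as a list of 64-bit big-endian
integers before calling map_offset.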