Patchwork [RFC] Specification for qcow2 version 3

Submitter Kevin Wolf
Date May 9, 2011, 3:51 p.m.
Message ID <1304956314-7806-1-git-send-email-kwolf@redhat.com>
Permalink /patch/94785/
State New

Comments

Kevin Wolf - May 9, 2011, 3:51 p.m.
Hi all,

this is a first draft for what I think could be added when we increase qcow2's
version number to 3. This includes points that have been made by several people
over the past few months. We're probably not going to implement this next week,
but I think it's important to get discussions started early, so here it is.

I hope the intentions of each change are clear, but feel free to ask if they
aren't. Also, when I wasn't sure if/how exactly to specify things, I left a
TODO in some places.

Kevin
---
 docs/specs/qcow2.txt |   98 +++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 77 insertions(+), 21 deletions(-)
Kevin Wolf - May 13, 2011, 12:29 p.m.
On 09.05.2011 17:51, Kevin Wolf wrote:
> Hi all,
> 
> this is a first draft for what I think could be added when we increase qcow2's
> version number to 3. This includes points that have been made by several people
> over the past few months. We're probably not going to implement this next week,
> but I think it's important to get discussions started early, so here it is.
> 
> I hope the intentions of each change are clear, but feel free to ask if they
> aren't. Also, when I wasn't sure if/how exactly to specify things, I left a
> TODO in some places.

Silence means that you all agree with the proposal?

Kevin
Stefan Hajnoczi - May 24, 2011, 10:41 a.m.
On Mon, May 9, 2011 at 4:51 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> I hope the intentions of each change are clear, but feel free to ask if they
> aren't. Also, when I wasn't sure if/how exactly to specify things, I left a
> TODO in some places.

Here is what I've picked up on and a summary for lazy readers who
don't want to reverse-engineer the rationale for proposed changes:

1. Feature bits

In order to support extending the format in the future a flexible
mechanism for specifying image features is required.  The single file
format version number isn't enough to express the various
compatibility strategies that could apply when introducing new
features.

Qcow2v3 adds feature bitfields for specifying individual format features.

2. Sub-clusters

A 64-cluster region of the image file can be allocated at once in
order to reduce fragmentation.  The sub-cluster bitfield indicates
which sub-clusters are actually allocated, eliminating the need to
zero out (or read from the backing file) the entire 64-cluster region
at allocation time.

3. Zero clusters

Cluster descriptor bit 0 can mark clusters as zero.  This prevents
access to the backing file and instead reads zeroes.

This is not really compatible with sub-clusters because it works at
cluster granularity?

Zero clusters enable efficient TRIM implementation even when a backing
file is in use.
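
To make the interaction between points 2 and 3 concrete, here is a minimal sketch of how a reader might decide where the data for one sub-cluster comes from under the draft layout. The function name, the return strings and the way the two 64-bit words are passed in are assumptions made for illustration; only the bit positions come from the proposed spec text.

```python
ZERO_FLAG = 1 << 0          # bit 0 of the Standard Cluster Descriptor (draft)
COMPRESSED_FLAG = 1 << 62   # bit 62 of the L2 entry


def read_source(l2_entry, subcluster_bitmap, subcluster_index):
    """Classify a read of one sub-cluster as 'image', 'zero' or 'backing'.

    l2_entry is bits 0-63 of the L2 entry, subcluster_bitmap is bits 64-127
    (the first sub-cluster is the LSB). Hypothetical helper, not QEMU code.
    """
    if l2_entry & COMPRESSED_FLAG:
        raise NotImplementedError("compressed clusters are outside this sketch")
    if not 0 <= subcluster_index < 64:
        raise ValueError("sub-cluster index must be in 0..63")
    if (subcluster_bitmap >> subcluster_index) & 1:
        return "image"      # allocated: data lives in the image file
    if l2_entry & ZERO_FLAG:
        return "zero"       # unallocated, but the zero flag hides the backing file
    # Unallocated and no zero flag: fall through to the backing file
    # (which in turn reads as zeros where no backing data exists).
    return "backing"
```

Note that the zero flag applies to every unallocated sub-cluster of the cluster at once, which is the granularity concern raised above.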

> @@ -67,6 +67,42 @@ The first cluster of a qcow2 image contains the file header:
>                     Offset into the image file at which the snapshot table
>                     starts. Must be aligned to a cluster boundary.
>
> +If the version is 3 or higher, the header has the following additional fields.
> +For version 2, the values are assumed to be zero, unless specified otherwise
> +in the description of a field.
> +
> +         72 - 75:   incompatible_features

Is there a reason to use 32-bit instead of 64-bit?  I think virtio
recently learnt that wider feature bitfields are useful :).

> +                    Bitmask of incompatible features. An implementation must
> +                    fail to open an image if an unknown bit is set.
> +
> +                    Bit 0:      The reference counts in the image file may be
> +                                inaccurate. Implementations must check/rebuild
> +                                them if they rely on them.
> +
> +                    Bit 1:      Enable subclusters. This affects the L2 table
> +                                format.
> +
> +                    Bits 2-31:  Reserved (set to 0)
> +
> +         76 - 79:   compatible_features
> +                    Bitmask of compatible features. An implementation can
> +                    safely ignore any unknown bits that are set.
> +                    No compatible feature bits are defined yet.

Reserved, set to 0.

> +
> +         80 - 83:   autoclear_features
> +                    Bitmask of auto-clear features. An implementation may only
> +                    write to an image with unknown auto-clear features if it
> +                    clears the respective bits from this field first.
> +                    No auto-clear feature bits are defined yet.

Reserved, set to 0.
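
The three feature fields differ only in what an implementation does with bits it does not know. A minimal sketch of the open-time check, assuming a reader that understands draft incompatible bits 0 and 1 and no auto-clear bits (the constant and function names are made up for this example):

```python
# Hypothetical masks of the bits this reader understands.
KNOWN_INCOMPATIBLE = 0b11   # draft bits 0 (dirty refcounts) and 1 (subclusters)
KNOWN_AUTOCLEAR = 0         # the draft defines no auto-clear bits yet


def check_features(incompatible, autoclear, writable):
    """Return the autoclear_features value to keep, or raise if opening must fail."""
    unknown = incompatible & ~KNOWN_INCOMPATIBLE
    if unknown:
        # Unknown incompatible bit: the image must not be opened at all.
        raise ValueError(f"unsupported incompatible feature bits: {unknown:#x}")
    # Unknown compatible bits can be ignored, so they are not inspected here.
    if writable:
        # Unknown auto-clear bits have to be cleared before the first write.
        return autoclear & KNOWN_AUTOCLEAR
    return autoclear
```

A read-only open leaves autoclear_features untouched, since the rule in the draft only constrains writers.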

> +
> +         84 - 87:   refcount_bits
> +                    Size of a reference count block entry in bits. For version 2
> +                    images, the size is always 16 bits.

Version 2 does not have this field but always uses the default size of
16 bits?  I'm checking because earlier you wrote "For version 2, the
values are assumed to be zero, unless specified otherwise in the
description of a field".  But you don't expect v2 files to actually
store the value 16 here, right?

Valid ranges for this field?

> +                    [ TODO: Define order in sub-byte sizes ]
> +
> +        [ TODO: Add per-L2-table dirty flag to L1? ]
> +        [ TODO: Add per-refcount-block full flag to refcount table? ]
> +
>  Directly after the image header, optional sections called header extensions can
>  be stored. Each extension has a structure like the following:
>
> @@ -87,6 +123,8 @@ The remaining space between the end of the header extension area and the end of
>  the first cluster can be used for other data. Usually, the backing file name is
>  stored there.
>
> +[ TODO Feature name table? ]

There was discussion about using string names rather than feature
bits.  This would make failure on unknown feature bits much clearer to
end-users: unable to open test.qcow3, feature "new_feature" not
supported

The issue with feature names as strings is that it makes header
parsing more difficult - especially updating in place (delete or
insert).  For this reason I don't see string names as essential.

Perhaps there was another requirement for feature names that I forgot about?

> +
>
>  == Host cluster management ==
>
> @@ -138,7 +176,8 @@ guest clusters to host clusters. They are called L1 and L2 table.
>
>  The L1 table has a variable size (stored in the header) and may use multiple
>  clusters, however it must be contiguous in the image file. L2 tables are
> -exactly one cluster in size.
> +exactly one cluster in size if subclusters are disabled, and two clusters if
> +they are enabled.
>
>  Given a offset into the virtual disk, the offset into the image file can be
>  obtained as follows:
> @@ -168,9 +207,32 @@ L1 table entry:
>                     refcount is exactly one. This information is only accurate
>                     in the active L1 table.
>
> -L2 table entry (for normal clusters):
> +L2 table entry:
>
> -    Bit  0 -  8:    Reserved (set to 0)
> +    Bit  0 -  61:   Cluster descriptor
> +
> +              62:   0 for standard clusters
> +                    1 for compressed clusters
> +
> +              63:   0 for a cluster that is unused or requires COW, 1 if its
> +                    refcount is exactly one. This information is only accurate
> +                    in L2 tables that are reachable from the the active L1
> +                    table.
> +
> +        64 - 127:   If subclusters are enabled, this contains a bitmask that
> +                    describes the allocation status of all 64 subclusters. The
> +                    first subcluster is represented by the LSB. A 0 bit means
> +                    that the subcluster is unallocated.
> +
> +Standard Cluster Descriptor:
> +
> +    Bit       0:    If set to 1, the cluster reads as all zeros instead of
> +                    referring to the backing file if the (sub-)cluster is
> +                    unallocated.
> +
> +                    With version 2, this is always 0.
> +
> +         1 -  8:    Reserved (set to 0)
>
>          9 - 55:    Bits 9-55 of host cluster offset. Must be aligned to a
>                     cluster boundary. If the offset is 0, the cluster is
> @@ -178,29 +240,17 @@ L2 table entry (for normal clusters):
>
>         56 - 61:    Reserved (set to 0)
>
> -             62:    0 (this cluster is not compressed)
> -
> -             63:    0 for a cluster that is unused or requires COW, 1 if its
> -                    refcount is exactly one. This information is only accurate
> -                    in L2 tables that are reachable from the the active L1
> -                    table.
>
> -L2 table entry (for compressed clusters; x = 62 - (cluster_size - 8)):
> +Compressed Clusters Descriptor (x = 62 - (cluster_size - 8)):
>
>     Bit  0 -  x:    Host cluster offset. This is usually _not_ aligned to a
>                     cluster boundary!
>
>        x+1 - 61:    Compressed size of the images in sectors of 512 bytes
>
> -             62:    1 (this cluster is compressed using zlib)
> -
> -             63:    0 for a cluster that is unused or requires COW, 1 if its
> -                    refcount is exactly one. This information is only accurate
> -                    in L2 tables that are reachable from the the active L1
> -                    table.
> -
> -If a cluster is unallocated, read requests shall read the data from the backing
> -file. If there is no backing file or the backing file is smaller than the image,
> +If a cluster or a subcluster is unallocated, read requests shall read the data
> +from the backing file (except if bit 0 in the Standard Cluster Descriptor is
> +set). If there is no backing file or the backing file is smaller than the image,
>  they shall read zeros for all parts that are not covered by the backing file.
>
>
> @@ -253,7 +303,13 @@ Snapshot table entry:
>         36 - 39:    Size of extra data in the table entry (used for future
>                     extensions of the format)
>
> -        variable:   Extra data for future extensions. Must be ignored.
> +        variable:   Extra data for future extensions. Unknown fields must be
> +                    ignored. Currently defined are (offset relative to snapshot
> +                    table entry):
> +
> +                    Byte 40 - 47:   Size of the VM state in bytes. 0 if no VM
> +                                    state is saved. If this field is present,
> +                                    the 32-bit value in bytes 32-35 is ignored.

This is because you want a 64-bit VM state offset?

Need to add a note that this is v3-specific?

This field now precedes the id_str and name variable length data?

Stefan
Kevin Wolf - May 24, 2011, 11:15 a.m.
On 24.05.2011 12:41, Stefan Hajnoczi wrote:
> 3. Zero clusters
> 
> Cluster descriptor bit 0 can mark clusters as zero.  This prevents
> access to the backing file and instead reads zeroes.
> 
> This is not really compatible with sub-clusters because it works at
> cluster granularity?

Right, that's something I wanted to discuss, too. In fact, it works
just fine with subclusters if you don't have a backing file, but you
can't have backing file references and zeros in the same cluster.

Should we use two bits for each subcluster? This would either reduce the
number of subclusters to 32, or we'd have to increase the size of L2
entries even further. A factor of 32 would mean 64k/2M which still
sounds reasonable, but it's not nice to have it as an absolute upper limit.
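
The arithmetic behind that trade-off, as a small sketch. It assumes the bitmap stays at 64 bits per L2 entry and reads "64k/2M" as 64 KiB sub-clusters inside a 2 MiB cluster; the helper name is made up for this example.

```python
BITMAP_BITS = 64            # bits 64-127 of the L2 entry in the draft
KiB = 1024
MiB = 1024 * KiB


def max_cluster_size(subcluster_size, bits_per_subcluster):
    """Largest cluster a fixed-size bitmap can describe at a given sub-cluster size."""
    subclusters = BITMAP_BITS // bits_per_subcluster
    return subclusters * subcluster_size


# One bit per sub-cluster (current draft): 64 sub-clusters per cluster.
one_bit = max_cluster_size(64 * KiB, 1)
# Two bits per sub-cluster (allocated / zero / backing): only 32 remain.
two_bits = max_cluster_size(64 * KiB, 2)
```

With 64 KiB sub-clusters the one-bit layout tops out at 4 MiB clusters and the two-bit layout at 2 MiB.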

> Zero clusters enable efficient TRIM implementation even when a backing
> file is in use.

Actually, I think the main use case was maintaining sparseness over copy
on read.

> 
>> @@ -67,6 +67,42 @@ The first cluster of a qcow2 image contains the file header:
>>                     Offset into the image file at which the snapshot table
>>                     starts. Must be aligned to a cluster boundary.
>>
>> +If the version is 3 or higher, the header has the following additional fields.
>> +For version 2, the values are assumed to be zero, unless specified otherwise
>> +in the description of a field.
>> +
>> +         72 - 75:   incompatible_features
> 
> Is there a reason to use 32-bit instead of 64-bit?  I think virtio
> recently learnt that wider feature bitfields are useful :).

Not really, I'll change that.

> 
>> +                    Bitmask of incompatible features. An implementation must
>> +                    fail to open an image if an unknown bit is set.
>> +
>> +                    Bit 0:      The reference counts in the image file may be
>> +                                inaccurate. Implementations must check/rebuild
>> +                                them if they rely on them.
>> +
>> +                    Bit 1:      Enable subclusters. This affects the L2 table
>> +                                format.
>> +
>> +                    Bits 2-31:  Reserved (set to 0)
>> +
>> +         76 - 79:   compatible_features
>> +                    Bitmask of compatible features. An implementation can
>> +                    safely ignore any unknown bits that are set.
>> +                    No compatible feature bits are defined yet.
> 
> Reserved, set to 0.
> 
>> +
>> +         80 - 83:   autoclear_features
>> +                    Bitmask of auto-clear features. An implementation may only
>> +                    write to an image with unknown auto-clear features if it
>> +                    clears the respective bits from this field first.
>> +                    No auto-clear feature bits are defined yet.
> 
> Reserved, set to 0.

Will change it.

> 
>> +
>> +         84 - 87:   refcount_bits
>> +                    Size of a reference count block entry in bits. For version 2
>> +                    images, the size is always 16 bits.
> 
> Version 2 does not have this field but always uses the default size of
> 16 bits?  I'm checking because earlier you wrote "For version 2, the
> values are assumed to be zero, unless specified otherwise in the
> description of a field".  But you don't expect v2 files to actually
> store the value 16 here, right?

Right. Would it be clearer as "For version 2 images, the size is always
assumed to be 16 bits"?

> Valid ranges for this field?

I think restricting it to powers of two makes sense. I'm not sure if we
should impose further restrictions. Allowing a value in the format
doesn't automatically mean that qemu must support it. I think initially
we'll only allow 16 bits.
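
A minimal sketch of the rule as discussed so far: version 2 images always use 16 bits regardless of the header, and version 3 values are only checked for being a power of two. The function names are invented for this example, and the entries-per-block calculation assumes a refcount block stays exactly one cluster in size, as in the existing format.

```python
def effective_refcount_bits(version, header_value):
    """Refcount entry width in bits for an image (hypothetical helper)."""
    if version == 2:
        return 16           # v2 has no such field; 16 bits is implied
    if header_value < 1 or header_value & (header_value - 1):
        raise ValueError("refcount_bits must be a power of two")
    return header_value


def refcount_entries_per_block(cluster_size, refcount_bits):
    """Number of refcount entries that fit in a one-cluster refcount block."""
    return cluster_size * 8 // refcount_bits
```

An implementation that only supports 16 bits could reject everything else after this check without violating the format.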

>> +                    [ TODO: Define order in sub-byte sizes ]

Another thing to discuss: do you think we'll want to have sub-byte sizes
other than 1?

>> +
>> +        [ TODO: Add per-L2-table dirty flag to L1? ]
>> +        [ TODO: Add per-refcount-block full flag to refcount table? ]

What do you think about these? Helpful or not? Add as an incompatible
feature flag later or consider it from the beginning?

>> +
>>  Directly after the image header, optional sections called header extensions can
>>  be stored. Each extension has a structure like the following:
>>
>> @@ -87,6 +123,8 @@ The remaining space between the end of the header extension area and the end of
>>  the first cluster can be used for other data. Usually, the backing file name is
>>  stored there.
>>
>> +[ TODO Feature name table? ]
> 
> There was discussion about using string names rather than feature
> bits.  This would make failure on unknown feature bits much clearer to
> end-users: unable to open test.qcow3, feature "new_feature" not
> supported
> 
> The issue with feature names as strings is that it makes header
> parsing more difficult - especially updating in place (delete or
> insert).  For this reason I don't see string names as essential.
> 
> Perhaps there was another requirement for feature names that I forgot about?

Yes, it was about error reporting. I wouldn't replace the feature bits,
but rather add a header extension (yes, they are still useful :-)) that
contains a table which maps feature flags to strings.
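
Roughly how such a table could be used for error reporting, as a sketch only. The on-disk layout of the header extension is not defined anywhere in this thread, so the table is modelled as a plain dict from bit number to name, and the function name is made up.

```python
def describe_unknown_bits(unknown_mask, names):
    """Turn a mask of unknown feature bits into human-readable names.

    names maps bit number -> string, as a feature name table might provide.
    Bits without an entry fall back to a generic "bit N" label.
    """
    parts = []
    bit = 0
    while unknown_mask >> bit:
        if (unknown_mask >> bit) & 1:
            parts.append(names.get(bit, f"bit {bit}"))
        bit += 1
    return parts


# Example: bits 2 and 3 are unknown, the table only names bit 2.
message = "unable to open image, features not supported: " + ", ".join(
    describe_unknown_bits(0b1100, {2: "new_feature"})
)
```

The feature bits stay authoritative; the table only improves the message shown to the user.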

>> +
>>
>>  == Host cluster management ==
>>
>> @@ -138,7 +176,8 @@ guest clusters to host clusters. They are called L1 and L2 table.
>>
>>  The L1 table has a variable size (stored in the header) and may use multiple
>>  clusters, however it must be contiguous in the image file. L2 tables are
>> -exactly one cluster in size.
>> +exactly one cluster in size if subclusters are disabled, and two clusters if
>> +they are enabled.
>>
>>  Given a offset into the virtual disk, the offset into the image file can be
>>  obtained as follows:
>> @@ -168,9 +207,32 @@ L1 table entry:
>>                     refcount is exactly one. This information is only accurate
>>                     in the active L1 table.
>>
>> -L2 table entry (for normal clusters):
>> +L2 table entry:
>>
>> -    Bit  0 -  8:    Reserved (set to 0)
>> +    Bit  0 -  61:   Cluster descriptor
>> +
>> +              62:   0 for standard clusters
>> +                    1 for compressed clusters
>> +
>> +              63:   0 for a cluster that is unused or requires COW, 1 if its
>> +                    refcount is exactly one. This information is only accurate
>> +                    in L2 tables that are reachable from the the active L1
>> +                    table.
>> +
>> +        64 - 127:   If subclusters are enabled, this contains a bitmask that
>> +                    describes the allocation status of all 64 subclusters. The
>> +                    first subcluster is represented by the LSB. A 0 bit means
>> +                    that the subcluster is unallocated.
>> +
>> +Standard Cluster Descriptor:
>> +
>> +    Bit       0:    If set to 1, the cluster reads as all zeros instead of
>> +                    referring to the backing file if the (sub-)cluster is
>> +                    unallocated.
>> +
>> +                    With version 2, this is always 0.
>> +
>> +         1 -  8:    Reserved (set to 0)
>>
>>          9 - 55:    Bits 9-55 of host cluster offset. Must be aligned to a
>>                     cluster boundary. If the offset is 0, the cluster is
>> @@ -178,29 +240,17 @@ L2 table entry (for normal clusters):
>>
>>         56 - 61:    Reserved (set to 0)
>>
>> -             62:    0 (this cluster is not compressed)
>> -
>> -             63:    0 for a cluster that is unused or requires COW, 1 if its
>> -                    refcount is exactly one. This information is only accurate
>> -                    in L2 tables that are reachable from the the active L1
>> -                    table.
>>
>> -L2 table entry (for compressed clusters; x = 62 - (cluster_size - 8)):
>> +Compressed Clusters Descriptor (x = 62 - (cluster_size - 8)):
>>
>>     Bit  0 -  x:    Host cluster offset. This is usually _not_ aligned to a
>>                     cluster boundary!
>>
>>        x+1 - 61:    Compressed size of the images in sectors of 512 bytes
>>
>> -             62:    1 (this cluster is compressed using zlib)
>> -
>> -             63:    0 for a cluster that is unused or requires COW, 1 if its
>> -                    refcount is exactly one. This information is only accurate
>> -                    in L2 tables that are reachable from the the active L1
>> -                    table.
>> -
>> -If a cluster is unallocated, read requests shall read the data from the backing
>> -file. If there is no backing file or the backing file is smaller than the image,
>> +If a cluster or a subcluster is unallocated, read requests shall read the data
>> +from the backing file (except if bit 0 in the Standard Cluster Descriptor is
>> +set). If there is no backing file or the backing file is smaller than the image,
>>  they shall read zeros for all parts that are not covered by the backing file.
>>
>>
>> @@ -253,7 +303,13 @@ Snapshot table entry:
>>         36 - 39:    Size of extra data in the table entry (used for future
>>                     extensions of the format)
>>
>> -        variable:   Extra data for future extensions. Must be ignored.
>> +        variable:   Extra data for future extensions. Unknown fields must be
>> +                    ignored. Currently defined are (offset relative to snapshot
>> +                    table entry):
>> +
>> +                    Byte 40 - 47:   Size of the VM state in bytes. 0 if no VM
>> +                                    state is saved. If this field is present,
>> +                                    the 32-bit value in bytes 32-35 is ignored.
> 
> This is because you want a 64-bit VM state offset?

Right. Currently we can't snapshot VMs with >= 4 GB RAM. This is a
change that has been on my list for a long time, but it was never
important enough. It doesn't even require v3, but now that we change the
format anyway, I thought I'd include it.

> Need to add a note that this is v3-specific?
> 
> This field now precedes the id_str and name variable length data?

It's not really v3-specific, it depends on extra_data_size. An
implementation that supports only v2 could implement 64-bit VM state
size without any problems.

And yes, it would precede id_str and name.
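
A minimal sketch of the lookup this implies: the 64-bit field is used whenever the extra data is large enough to contain it, independent of the image version. It assumes big-endian integers as in the rest of the format and that the entry is passed in as bytes starting at its offset 0; the function name is made up.

```python
import struct


def vm_state_size(entry):
    """Return the VM state size in bytes from a raw snapshot table entry."""
    # Bytes 32-35: legacy 32-bit VM state size; bytes 36-39: extra data size.
    size32, extra_size = struct.unpack_from(">II", entry, 32)
    if extra_size >= 8:
        # Bytes 40-47 of the entry hold the 64-bit size and override bytes 32-35.
        (size64,) = struct.unpack_from(">Q", entry, 40)
        return size64
    return size32
```

An entry written by an older implementation has an extra data size of 0, so the 32-bit value keeps being used for it.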

Kevin

Patch

diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
index 8fc3cb2..adf5616 100644
--- a/docs/specs/qcow2.txt
+++ b/docs/specs/qcow2.txt
@@ -18,7 +18,7 @@  The first cluster of a qcow2 image contains the file header:
                     QCOW magic string ("QFI\xfb")
 
           4 -  7:   version
-                    Version number (only valid value is 2)
+                    Version number (valid values are 2 and 3)
 
           8 - 15:   backing_file_offset
                     Offset into the image file at which the backing file name
@@ -67,6 +67,42 @@  The first cluster of a qcow2 image contains the file header:
                     Offset into the image file at which the snapshot table
                     starts. Must be aligned to a cluster boundary.
 
+If the version is 3 or higher, the header has the following additional fields.
+For version 2, the values are assumed to be zero, unless specified otherwise
+in the description of a field.
+
+         72 - 75:   incompatible_features
+                    Bitmask of incompatible features. An implementation must
+                    fail to open an image if an unknown bit is set.
+
+                    Bit 0:      The reference counts in the image file may be
+                                inaccurate. Implementations must check/rebuild
+                                them if they rely on them.
+
+                    Bit 1:      Enable subclusters. This affects the L2 table
+                                format.
+
+                    Bits 2-31:  Reserved (set to 0)
+
+         76 - 79:   compatible_features
+                    Bitmask of compatible features. An implementation can
+                    safely ignore any unknown bits that are set.
+                    No compatible feature bits are defined yet.
+
+         80 - 83:   autoclear_features
+                    Bitmask of auto-clear features. An implementation may only
+                    write to an image with unknown auto-clear features if it
+                    clears the respective bits from this field first.
+                    No auto-clear feature bits are defined yet.
+
+         84 - 87:   refcount_bits
+                    Size of a reference count block entry in bits. For version 2
+                    images, the size is always 16 bits.
+                    [ TODO: Define order in sub-byte sizes ]
+
+        [ TODO: Add per-L2-table dirty flag to L1? ]
+        [ TODO: Add per-refcount-block full flag to refcount table? ]
+
 Directly after the image header, optional sections called header extensions can
 be stored. Each extension has a structure like the following:
 
@@ -87,6 +123,8 @@  The remaining space between the end of the header extension area and the end of
 the first cluster can be used for other data. Usually, the backing file name is
 stored there.
 
+[ TODO Feature name table? ]
+
 
 == Host cluster management ==
 
@@ -138,7 +176,8 @@  guest clusters to host clusters. They are called L1 and L2 table.
 
 The L1 table has a variable size (stored in the header) and may use multiple
 clusters, however it must be contiguous in the image file. L2 tables are
-exactly one cluster in size.
+exactly one cluster in size if subclusters are disabled, and two clusters if
+they are enabled.
 
 Given a offset into the virtual disk, the offset into the image file can be
 obtained as follows:
@@ -168,9 +207,32 @@  L1 table entry:
                     refcount is exactly one. This information is only accurate
                     in the active L1 table.
 
-L2 table entry (for normal clusters):
+L2 table entry:
 
-    Bit  0 -  8:    Reserved (set to 0)
+    Bit  0 -  61:   Cluster descriptor
+
+              62:   0 for standard clusters
+                    1 for compressed clusters
+
+              63:   0 for a cluster that is unused or requires COW, 1 if its
+                    refcount is exactly one. This information is only accurate
+                    in L2 tables that are reachable from the the active L1
+                    table.
+
+        64 - 127:   If subclusters are enabled, this contains a bitmask that
+                    describes the allocation status of all 64 subclusters. The
+                    first subcluster is represented by the LSB. A 0 bit means
+                    that the subcluster is unallocated.
+
+Standard Cluster Descriptor:
+
+    Bit       0:    If set to 1, the cluster reads as all zeros instead of
+                    referring to the backing file if the (sub-)cluster is
+                    unallocated.
+
+                    With version 2, this is always 0.
+
+         1 -  8:    Reserved (set to 0)
 
          9 - 55:    Bits 9-55 of host cluster offset. Must be aligned to a
                     cluster boundary. If the offset is 0, the cluster is
@@ -178,29 +240,17 @@  L2 table entry (for normal clusters):
 
         56 - 61:    Reserved (set to 0)
 
-             62:    0 (this cluster is not compressed)
-
-             63:    0 for a cluster that is unused or requires COW, 1 if its
-                    refcount is exactly one. This information is only accurate
-                    in L2 tables that are reachable from the the active L1
-                    table.
 
-L2 table entry (for compressed clusters; x = 62 - (cluster_size - 8)):
+Compressed Clusters Descriptor (x = 62 - (cluster_size - 8)):
 
     Bit  0 -  x:    Host cluster offset. This is usually _not_ aligned to a
                     cluster boundary!
 
        x+1 - 61:    Compressed size of the images in sectors of 512 bytes
 
-             62:    1 (this cluster is compressed using zlib)
-
-             63:    0 for a cluster that is unused or requires COW, 1 if its
-                    refcount is exactly one. This information is only accurate
-                    in L2 tables that are reachable from the the active L1
-                    table.
-
-If a cluster is unallocated, read requests shall read the data from the backing
-file. If there is no backing file or the backing file is smaller than the image,
+If a cluster or a subcluster is unallocated, read requests shall read the data
+from the backing file (except if bit 0 in the Standard Cluster Descriptor is
+set). If there is no backing file or the backing file is smaller than the image,
 they shall read zeros for all parts that are not covered by the backing file.
 
 
@@ -253,7 +303,13 @@  Snapshot table entry:
         36 - 39:    Size of extra data in the table entry (used for future
                     extensions of the format)
 
-        variable:   Extra data for future extensions. Must be ignored.
+        variable:   Extra data for future extensions. Unknown fields must be
+                    ignored. Currently defined are (offset relative to snapshot
+                    table entry):
+
+                    Byte 40 - 47:   Size of the VM state in bytes. 0 if no VM
+                                    state is saved. If this field is present,
+                                    the 32-bit value in bytes 32-35 is ignored.
 
         variable:   Unique ID string for the snapshot (not null terminated)