diff mbox

[v2,3/7] docs: Add QED image format specification

Message ID 1286552914-27014-4-git-send-email-stefanha@linux.vnet.ibm.com
State New
Headers show

Commit Message

Stefan Hajnoczi Oct. 8, 2010, 3:48 p.m. UTC
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 94 insertions(+), 0 deletions(-)
 create mode 100644 docs/specs/qed_spec.txt

Comments

Avi Kivity Oct. 10, 2010, 9:20 a.m. UTC | #1
On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
> ---
>   docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
>   1 files changed, 94 insertions(+), 0 deletions(-)
>
> +Feature bits:
> +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.

Great that QED_F_NEED_CHECK is now non-optional.  However, suppose we 
add a new structure (e.g. persistent freelist); it's now impossible to 
tell whether the structure is updated or not:

1 new qemu opens image
2 writes persistent freelist
3 clears need_check
4 shuts down
5 old qemu opens image
6 doesn't update persistent freelist
7 clears need_check
8 shuts down

The image is now inconsistent, but has need_check cleared.

We can address this by having a third feature bitmask that is 
autocleared by guests that don't recognize various bits; so the sequence 
becomes:

1 new qemu opens image
2 writes persistent freelist
3 clears need_check
4 sets persistent_freelist
5 shuts down
6 old qemu opens image
7 clears persistent_freelist (and any other bits it doesn't recognize)
8 doesn't update persistent freelist
9 clears need_check
10 shuts down

The image is now consistent, since the persistent freelist has disappeared.

> +* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.
> +

It was suggested to have just a bit saying whether the backing format is 
raw or not.  This way you don't need to store the format.
Stefan Hajnoczi Oct. 11, 2010, 10:09 a.m. UTC | #2
On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
>  On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
> >---
> >  docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 94 insertions(+), 0 deletions(-)
> >
> >+Feature bits:
> >+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> >+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
> 
> Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
> we add a new structure (e.g. persistent freelist); it's now
> impossible to tell whether the structure is updated or not:
> 
> 1 new qemu opens image
> 2 writes persistent freelist
> 3 clears need_check
> 4 shuts down
> 5 old qemu opens image
> 6 doesn't update persistent freelist
> 7 clears need_check
> 8 shuts down
> 
> The image is now inconsistent, but has need_check cleared.
> 
> We can address this by having a third feature bitmask that is
> autocleared by guests that don't recognize various bits; so the
> sequence becomes:
> 
> 1 new qemu opens image
> 2 writes persistent freelist
> 3 clears need_check
> 4 sets persistent_freelist
> 5 shuts down
> 6 old qemu opens image
> 7 clears persistent_freelist (and any other bits it doesn't recognize)
> 8 doesn't update persistent freelist
> 9 clears need_check
> 10 shuts down
> 
> The image is now consistent, since the persistent freelist has disappeared.

It is more complicated than just the feature bit.  The freelist requires
space in the image file.  Clearing the persistent_freelist bit leaks the
freelist.

We can solve this by using a compat feature bit and an autoclear feature
bit.  The autoclear bit specifies whether or not the freelist is valid
and the compat feature bit specifices whether or not the freelist
exists.

When the new qemu opens the image again it notices that the autoclear
bit is unset but the compat bit is set.  This means the freelist is
out-of-date and its space can be reclaimed.

I don't like the idea of doing these feature bit acrobatics because they
add complexity.  On the other hand your proposal ensures backward
compatibility in the case of an optional data structure that needs to
stay in sync with the image.  I'm just not 100% convinced it's worth it.

> >+* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.
> >+
> 
> It was suggested to have just a bit saying whether the backing
> format is raw or not.  This way you don't need to store the format.

Will fix in v3.

Stefan
Avi Kivity Oct. 11, 2010, 1:04 p.m. UTC | #3
On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote:
> On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
> >   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >  >Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
> >  >---
> >  >   docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
> >  >   1 files changed, 94 insertions(+), 0 deletions(-)
> >  >
> >  >+Feature bits:
> >  >+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> >  >+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
> >
> >  Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
> >  we add a new structure (e.g. persistent freelist); it's now
> >  impossible to tell whether the structure is updated or not:
> >
> >  1 new qemu opens image
> >  2 writes persistent freelist
> >  3 clears need_check
> >  4 shuts down
> >  5 old qemu opens image
> >  6 doesn't update persistent freelist
> >  7 clears need_check
> >  8 shuts down
> >
> >  The image is now inconsistent, but has need_check cleared.
> >
> >  We can address this by having a third feature bitmask that is
> >  autocleared by guests that don't recognize various bits; so the
> >  sequence becomes:
> >
> >  1 new qemu opens image
> >  2 writes persistent freelist
> >  3 clears need_check
> >  4 sets persistent_freelist
> >  5 shuts down
> >  6 old qemu opens image
> >  7 clears persistent_freelist (and any other bits it doesn't recognize)
> >  8 doesn't update persistent freelist
> >  9 clears need_check
> >  10 shuts down
> >
> >  The image is now consistent, since the persistent freelist has disappeared.
>
> It is more complicated than just the feature bit.  The freelist requires
> space in the image file.  Clearing the persistent_freelist bit leaks the
> freelist.
>
> We can solve this by using a compat feature bit and an autoclear feature
> bit.  The autoclear bit specifies whether or not the freelist is valid
> and the compat feature bit specifices whether or not the freelist
> exists.
>
> When the new qemu opens the image again it notices that the autoclear
> bit is unset but the compat bit is set.  This means the freelist is
> out-of-date and its space can be reclaimed.
>
> I don't like the idea of doing these feature bit acrobatics because they
> add complexity.  On the other hand your proposal ensures backward
> compatibility in the case of an optional data structure that needs to
> stay in sync with the image.  I'm just not 100% convinced it's worth it.

My scenario ends up with data corruption if we move to an old qemu and 
then back again, without any aborts.

A leak is acceptable (it won't grow; it's just an unused, incorrect 
freelist), but data corruption is not.
Stefan Hajnoczi Oct. 11, 2010, 1:42 p.m. UTC | #4
On Mon, Oct 11, 2010 at 03:04:03PM +0200, Avi Kivity wrote:
>  On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote:
> >On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
> >>   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >>  >Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
> >>  >---
> >>  >   docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
> >>  >   1 files changed, 94 insertions(+), 0 deletions(-)
> >>  >
> >>  >+Feature bits:
> >>  >+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> >>  >+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
> >>
> >>  Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
> >>  we add a new structure (e.g. persistent freelist); it's now
> >>  impossible to tell whether the structure is updated or not:
> >>
> >>  1 new qemu opens image
> >>  2 writes persistent freelist
> >>  3 clears need_check
> >>  4 shuts down
> >>  5 old qemu opens image
> >>  6 doesn't update persistent freelist
> >>  7 clears need_check
> >>  8 shuts down
> >>
> >>  The image is now inconsistent, but has need_check cleared.
> >>
> >>  We can address this by having a third feature bitmask that is
> >>  autocleared by guests that don't recognize various bits; so the
> >>  sequence becomes:
> >>
> >>  1 new qemu opens image
> >>  2 writes persistent freelist
> >>  3 clears need_check
> >>  4 sets persistent_freelist
> >>  5 shuts down
> >>  6 old qemu opens image
> >>  7 clears persistent_freelist (and any other bits it doesn't recognize)
> >>  8 doesn't update persistent freelist
> >>  9 clears need_check
> >>  10 shuts down
> >>
> >>  The image is now consistent, since the persistent freelist has disappeared.
> >
> >It is more complicated than just the feature bit.  The freelist requires
> >space in the image file.  Clearing the persistent_freelist bit leaks the
> >freelist.
> >
> >We can solve this by using a compat feature bit and an autoclear feature
> >bit.  The autoclear bit specifies whether or not the freelist is valid
> >and the compat feature bit specifices whether or not the freelist
> >exists.
> >
> >When the new qemu opens the image again it notices that the autoclear
> >bit is unset but the compat bit is set.  This means the freelist is
> >out-of-date and its space can be reclaimed.
> >
> >I don't like the idea of doing these feature bit acrobatics because they
> >add complexity.  On the other hand your proposal ensures backward
> >compatibility in the case of an optional data structure that needs to
> >stay in sync with the image.  I'm just not 100% convinced it's worth it.
> 
> My scenario ends up with data corruption if we move to an old qemu
> and then back again, without any aborts.
> 
> A leak is acceptable (it won't grow; it's just an unused, incorrect
> freelist), but data corruption is not.

The alternative is for the freelist to be a non-compat feature bit.
That means older QEMU binaries cannot use a QED image that has enabled
the freelist.

Stefan
Avi Kivity Oct. 11, 2010, 1:44 p.m. UTC | #5
On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
> >
> >  A leak is acceptable (it won't grow; it's just an unused, incorrect
> >  freelist), but data corruption is not.
>
> The alternative is for the freelist to be a non-compat feature bit.
> That means older QEMU binaries cannot use a QED image that has enabled
> the freelist.

For this one feature.  What about others?
Kevin Wolf Oct. 11, 2010, 1:58 p.m. UTC | #6
Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 94 insertions(+), 0 deletions(-)
>  create mode 100644 docs/specs/qed_spec.txt
> 
> diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
> new file mode 100644
> index 0000000..c942b8e
> --- /dev/null
> +++ b/docs/specs/qed_spec.txt
> @@ -0,0 +1,94 @@
> +=Specification=
> +
> +The file format looks like this:
> +
> + +----------+----------+----------+-----+
> + | cluster0 | cluster1 | cluster2 | ... |
> + +----------+----------+----------+-----+
> +
> +The first cluster begins with the '''header'''.  The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file.  A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''.  L1 and L2 tables are composed of one or more contiguous clusters.
> +
> +Normally the file size will be a multiple of the cluster size.  If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written.  Legitimate extra information should use space between the header and the first regular cluster.
> +
> +All fields are little-endian.
> +
> +==Header==
> + Header {
> +     uint32_t magic;               /* QED\0 */
> + 
> +     uint32_t cluster_size;        /* in bytes */
> +     uint32_t table_size;          /* for L1 and L2 tables, in clusters */
> +     uint32_t header_size;         /* in clusters */
> + 
> +     uint64_t features;            /* format feature bits */
> +     uint64_t compat_features;     /* compat feature bits */
> +     uint64_t l1_table_offset;     /* in bytes */
> +     uint64_t image_size;          /* total logical image size, in bytes */
> + 
> +     /* if (features & QED_F_BACKING_FILE) */
> +     uint32_t backing_filename_offset; /* in bytes from start of header */
> +     uint32_t backing_filename_size;   /* in bytes */
> + 
> +     /* if (compat_features & QED_CF_BACKING_FORMAT) */
> +     uint32_t backing_fmt_offset;  /* in bytes from start of header */
> +     uint32_t backing_fmt_size;    /* in bytes */

It was discussed before, but I don't think we came to a conclusion. Are
there any circumstances under which you don't want to set the
QED_CF_BACKING_FORMAT flag?

Also it's unclear what this "if" actually means: If the flag isn't set,
are the fields zero, are they undefined or are they even completely
missing and the offsets of the following fields must be adjusted?

> + }
> +
> +Field descriptions:
> +* cluster_size must be a power of 2 in range [2^12, 2^26].
> +* table_size must be a power of 2 in range [1, 16].

Is there a reason why this must be a power of two?

> +* header_size is the number of clusters used by the header and any additional information stored before regular clusters.
> +* features and compat_features are bitmaps where active file format features can be selectively enabled.  The difference between the two is that an image file that uses unknown compat_features bits can be safely opened without knowing how to interpret those bits.  If an image file has an unsupported features bit set then it is not possible to open that image (the image is not backwards-compatible).
> +* l1_table_offset must be a multiple of cluster_size.

And it is the offset of the first byte of the L1 table in the image file.

> +* image_size is the block device size seen by the guest and must be a multiple of cluster_size.

So there are image sizes that can't be accurately represented in QED? I
think that's a bad idea. Even more so because I can't see how it greatly
simplifies implementation (you save the operation for rounding up on
open/create, that's it) - it looks like a completely arbitrary restriction.

> +* backing_filename and backing_fmt are both strings in (byte offset, byte size) form.  They are not NUL-terminated and do not have alignment constraints.

A description of the meaning of these strings is missing.

> +
> +Feature bits:
> +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
> +* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.

I suggest adding a headline "Compatibility Feature Bits". Seeing 0x01
twice is confusing at first sight.

> +
> +==Tables==
> +
> +Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
> +
> + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
> +  
> + Table {
> +     uint64_t offsets[TABLE_NOFFSETS];
> + }
> +
> +The tables are organized as follows:
> +
> +                    +----------+
> +                    | L1 table |
> +                    +----------+
> +               ,------'  |  '------.
> +          +----------+   |    +----------+
> +          | L2 table |  ...   | L2 table |
> +          +----------+        +----------+
> +      ,------'  |  '------.
> + +----------+   |    +----------+
> + |   Data   |  ...   |   Data   |
> + +----------+        +----------+
> +
> +A table is made up of one or more contiguous clusters.  The table_size header field determines table size for an image file.  For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
> +
> +The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
> + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
> +
> +Logical offsets are translated into cluster offsets as follows:
> +
> +  table_bits table_bits    cluster_bits
> +  <--------> <--------> <--------------->
> + +----------+----------+-----------------+
> + | L1 index | L2 index |     byte offset |
> + +----------+----------+-----------------+
> + 
> +       Structure of a logical offset
> +
> + def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
> +   l2_offset = l1_table[l1_index]
> +   l2_table = load_table(l2_offset)
> +   cluster_offset = l2_table[l2_index]
> +   return cluster_offset + byte_offset

Should we reserve some bits in the table entries in case we need some
flags later? Also, I suppose all table entries must be cluster aligned?

What happened to the other sections that older versions of the spec
contained? For example, this version doesn't specify any more what the
semantics of unallocated clusters and backing files is.

Kevin
Stefan Hajnoczi Oct. 11, 2010, 2:06 p.m. UTC | #7
On Mon, Oct 11, 2010 at 03:44:38PM +0200, Avi Kivity wrote:
>  On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
> >>
> >>  A leak is acceptable (it won't grow; it's just an unused, incorrect
> >>  freelist), but data corruption is not.
> >
> >The alternative is for the freelist to be a non-compat feature bit.
> >That means older QEMU binaries cannot use a QED image that has enabled
> >the freelist.
> 
> For this one feature.  What about others?

Compat features that need to be in sync with the image state will either
require specific checks (e.g. checksum or shadow of the state) or they
need to be non-compat features and are not backwards compatible.

I'm not opposing autoclear feature bits themselves, they are a neat
idea.  However, they will initially have no users so is this something
we really want to carry?

Stefan
Avi Kivity Oct. 11, 2010, 2:12 p.m. UTC | #8
On 10/11/2010 04:06 PM, Stefan Hajnoczi wrote:
> On Mon, Oct 11, 2010 at 03:44:38PM +0200, Avi Kivity wrote:
> >   On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
> >  >>
> >  >>   A leak is acceptable (it won't grow; it's just an unused, incorrect
> >  >>   freelist), but data corruption is not.
> >  >
> >  >The alternative is for the freelist to be a non-compat feature bit.
> >  >That means older QEMU binaries cannot use a QED image that has enabled
> >  >the freelist.
> >
> >  For this one feature.  What about others?
>
> Compat features that need to be in sync with the image state will either
> require specific checks (e.g. checksum or shadow of the state) or they
> need to be non-compat features and are not backwards compatible.
>
> I'm not opposing autoclear feature bits themselves, they are a neat
> idea.  However, they will initially have no users so is this something
> we really want to carry?

Hard to tell in advance.  It seems like a simple feature with potential 
to make our lives easier in the future.

Anything we develop in the next few months can be rolled back into the 
baseline, assuming we declare the format unstable while 0.14 is in 
development, so this is really about post 0.14 features.
Anthony Liguori Oct. 11, 2010, 2:54 p.m. UTC | #9
On 10/11/2010 08:04 AM, Avi Kivity wrote:
>  On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote:
>> On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
>> >   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
>> > >Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
>> > >---
>> > >   docs/specs/qed_spec.txt |   94 
>> +++++++++++++++++++++++++++++++++++++++++++++++
>> > >   1 files changed, 94 insertions(+), 0 deletions(-)
>> > >
>> > >+Feature bits:
>> > >+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
>> > >+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check 
>> before use.
>> >
>> >  Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
>> >  we add a new structure (e.g. persistent freelist); it's now
>> >  impossible to tell whether the structure is updated or not:
>> >
>> >  1 new qemu opens image
>> >  2 writes persistent freelist
>> >  3 clears need_check
>> >  4 shuts down
>> >  5 old qemu opens image
>> >  6 doesn't update persistent freelist
>> >  7 clears need_check
>> >  8 shuts down
>> >
>> >  The image is now inconsistent, but has need_check cleared.
>> >
>> >  We can address this by having a third feature bitmask that is
>> >  autocleared by guests that don't recognize various bits; so the
>> >  sequence becomes:
>> >
>> >  1 new qemu opens image
>> >  2 writes persistent freelist
>> >  3 clears need_check
>> >  4 sets persistent_freelist
>> >  5 shuts down
>> >  6 old qemu opens image
>> >  7 clears persistent_freelist (and any other bits it doesn't 
>> recognize)
>> >  8 doesn't update persistent freelist
>> >  9 clears need_check
>> >  10 shuts down
>> >
>> >  The image is now consistent, since the persistent freelist has 
>> disappeared.
>>
>> It is more complicated than just the feature bit.  The freelist requires
>> space in the image file.  Clearing the persistent_freelist bit leaks the
>> freelist.
>>
>> We can solve this by using a compat feature bit and an autoclear feature
>> bit.  The autoclear bit specifies whether or not the freelist is valid
>> and the compat feature bit specifices whether or not the freelist
>> exists.
>>
>> When the new qemu opens the image again it notices that the autoclear
>> bit is unset but the compat bit is set.  This means the freelist is
>> out-of-date and its space can be reclaimed.
>>
>> I don't like the idea of doing these feature bit acrobatics because they
>> add complexity.  On the other hand your proposal ensures backward
>> compatibility in the case of an optional data structure that needs to
>> stay in sync with the image.  I'm just not 100% convinced it's worth it.
>
> My scenario ends up with data corruption if we move to an old qemu and 
> then back again, without any aborts.
>
> A leak is acceptable (it won't grow; it's just an unused, incorrect 
> freelist), but data corruption is not.

A leak is unacceptable.  It means an image can grow to an unbounded 
size.  If you are a server provider offering multitenancy, then a 
malicious guest can potentially grow the image beyond it's allotted size 
causing a Denial of Service attack against another tenant.

A freelist has to be a non-optional feature.  When the freelist bit is 
set, an older QEMU cannot read the image.  If the freelist is completed 
used, the freelist bit can be cleared and the image is then usable by 
older QEMUs.

Regards,

Anthony Liguori
Avi Kivity Oct. 11, 2010, 2:58 p.m. UTC | #10
On 10/11/2010 04:54 PM, Anthony Liguori wrote:
> On 10/11/2010 08:04 AM, Avi Kivity wrote:
>>  On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote:
>>> On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
>>> >   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
>>> > >Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
>>> > >---
>>> > >   docs/specs/qed_spec.txt |   94 
>>> +++++++++++++++++++++++++++++++++++++++++++++++
>>> > >   1 files changed, 94 insertions(+), 0 deletions(-)
>>> > >
>>> > >+Feature bits:
>>> > >+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
>>> > >+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check 
>>> before use.
>>> >
>>> >  Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
>>> >  we add a new structure (e.g. persistent freelist); it's now
>>> >  impossible to tell whether the structure is updated or not:
>>> >
>>> >  1 new qemu opens image
>>> >  2 writes persistent freelist
>>> >  3 clears need_check
>>> >  4 shuts down
>>> >  5 old qemu opens image
>>> >  6 doesn't update persistent freelist
>>> >  7 clears need_check
>>> >  8 shuts down
>>> >
>>> >  The image is now inconsistent, but has need_check cleared.
>>> >
>>> >  We can address this by having a third feature bitmask that is
>>> >  autocleared by guests that don't recognize various bits; so the
>>> >  sequence becomes:
>>> >
>>> >  1 new qemu opens image
>>> >  2 writes persistent freelist
>>> >  3 clears need_check
>>> >  4 sets persistent_freelist
>>> >  5 shuts down
>>> >  6 old qemu opens image
>>> >  7 clears persistent_freelist (and any other bits it doesn't 
>>> recognize)
>>> >  8 doesn't update persistent freelist
>>> >  9 clears need_check
>>> >  10 shuts down
>>> >
>>> >  The image is now consistent, since the persistent freelist has 
>>> disappeared.
>>>
>>> It is more complicated than just the feature bit.  The freelist 
>>> requires
>>> space in the image file.  Clearing the persistent_freelist bit leaks 
>>> the
>>> freelist.
>>>
>>> We can solve this by using a compat feature bit and an autoclear 
>>> feature
>>> bit.  The autoclear bit specifies whether or not the freelist is valid
>>> and the compat feature bit specifices whether or not the freelist
>>> exists.
>>>
>>> When the new qemu opens the image again it notices that the autoclear
>>> bit is unset but the compat bit is set.  This means the freelist is
>>> out-of-date and its space can be reclaimed.
>>>
>>> I don't like the idea of doing these feature bit acrobatics because 
>>> they
>>> add complexity.  On the other hand your proposal ensures backward
>>> compatibility in the case of an optional data structure that needs to
>>> stay in sync with the image.  I'm just not 100% convinced it's worth 
>>> it.
>>
>> My scenario ends up with data corruption if we move to an old qemu 
>> and then back again, without any aborts.
>>
>> A leak is acceptable (it won't grow; it's just an unused, incorrect 
>> freelist), but data corruption is not.
>
> A leak is unacceptable.  It means an image can grow to an unbounded 
> size.  If you are a server provider offering multitenancy, then a 
> malicious guest can potentially grow the image beyond it's allotted 
> size causing a Denial of Service attack against another tenant.

This particular leak cannot grow, and is not controlled by the guest.

> A freelist has to be a non-optional feature.  When the freelist bit is 
> set, an older QEMU cannot read the image.  If the freelist is 
> completed used, the freelist bit can be cleared and the image is then 
> usable by older QEMUs.

Once we support TRIM (or detect zeros) we'll never have a clean freelist.
Anthony Liguori Oct. 11, 2010, 3:02 p.m. UTC | #11
On 10/11/2010 08:44 AM, Avi Kivity wrote:
>  On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
>> >
>> >  A leak is acceptable (it won't grow; it's just an unused, incorrect
>> >  freelist), but data corruption is not.
>>
>> The alternative is for the freelist to be a non-compat feature bit.
>> That means older QEMU binaries cannot use a QED image that has enabled
>> the freelist.
>
> For this one feature.  What about others?

A compat feature is one where the feature can be completely ignored 
(meaning that the QEMU does not have to understand the data format).

An example of a compat feature is copy-on-read.  It's merely a 
suggestion and there is no additional metadata.  If a QEMU doesn't 
understand it, it doesn't affect it's ability to read the image.

An example of a non-compat feature would be zero cluster entries.  Zero 
cluster entries are a special L2 table entry that indicates that a 
cluster's on-disk data is all zeros.  As long as there is at least 1 ZCE 
in the L2 tables, this feature bit must be set.  As soon as all of the 
ZCE bits are cleared, the feature bit can be unset.

An older QEMU will gracefully fail when presented with an image using 
ZCE bits.  An image with no ZCEs will work on older QEMUs.

Regards,

Anthony Liguori
Avi Kivity Oct. 11, 2010, 3:24 p.m. UTC | #12
On 10/11/2010 05:02 PM, Anthony Liguori wrote:
> On 10/11/2010 08:44 AM, Avi Kivity wrote:
>>  On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
>>> >
>>> >  A leak is acceptable (it won't grow; it's just an unused, incorrect
>>> >  freelist), but data corruption is not.
>>>
>>> The alternative is for the freelist to be a non-compat feature bit.
>>> That means older QEMU binaries cannot use a QED image that has enabled
>>> the freelist.
>>
>> For this one feature.  What about others?
>
> A compat feature is one where the feature can be completely ignored 
> (meaning that the QEMU does not have to understand the data format).
>
> An example of a compat feature is copy-on-read.  It's merely a 
> suggestion and there is no additional metadata.  If a QEMU doesn't 
> understand it, it doesn't affect it's ability to read the image.
>
> An example of a non-compat feature would be zero cluster entries.  
> Zero cluster entries are a special L2 table entry that indicates that 
> a cluster's on-disk data is all zeros.  As long as there is at least 1 
> ZCE in the L2 tables, this feature bit must be set.  As soon as all of 
> the ZCE bits are cleared, the feature bit can be unset.
>
> An older QEMU will gracefully fail when presented with an image using 
> ZCE bits.  An image with no ZCEs will work on older QEMUs.
>

What's the motivation behind ZCE?

There is yet a third type of feature, one which is not strictly needed 
in order to use the image, but if used, must be kept synchronized.  An 
example is the freelist.  Another example is a directory index for a 
filesystem.  I can't think of another example which would be relevant to 
QED -- metadata checksums perhaps? -- we can always declare it a 
non-compatible feature, but of course, it reduces compatibility.
Stefan Hajnoczi Oct. 11, 2010, 3:30 p.m. UTC | #13
On Mon, Oct 11, 2010 at 03:58:07PM +0200, Kevin Wolf wrote:
> Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> > Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> > ---
> >  docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 94 insertions(+), 0 deletions(-)
> >  create mode 100644 docs/specs/qed_spec.txt
> > 
> > diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
> > new file mode 100644
> > index 0000000..c942b8e
> > --- /dev/null
> > +++ b/docs/specs/qed_spec.txt
> > @@ -0,0 +1,94 @@
> > +=Specification=
> > +
> > +The file format looks like this:
> > +
> > + +----------+----------+----------+-----+
> > + | cluster0 | cluster1 | cluster2 | ... |
> > + +----------+----------+----------+-----+
> > +
> > +The first cluster begins with the '''header'''.  The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file.  A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''.  L1 and L2 tables are composed of one or more contiguous clusters.
> > +
> > +Normally the file size will be a multiple of the cluster size.  If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written.  Legitimate extra information should use space between the header and the first regular cluster.
> > +
> > +All fields are little-endian.
> > +
> > +==Header==
> > + Header {
> > +     uint32_t magic;               /* QED\0 */
> > + 
> > +     uint32_t cluster_size;        /* in bytes */
> > +     uint32_t table_size;          /* for L1 and L2 tables, in clusters */
> > +     uint32_t header_size;         /* in clusters */
> > + 
> > +     uint64_t features;            /* format feature bits */
> > +     uint64_t compat_features;     /* compat feature bits */
> > +     uint64_t l1_table_offset;     /* in bytes */
> > +     uint64_t image_size;          /* total logical image size, in bytes */
> > + 
> > +     /* if (features & QED_F_BACKING_FILE) */
> > +     uint32_t backing_filename_offset; /* in bytes from start of header */
> > +     uint32_t backing_filename_size;   /* in bytes */
> > + 
> > +     /* if (compat_features & QED_CF_BACKING_FORMAT) */
> > +     uint32_t backing_fmt_offset;  /* in bytes from start of header */
> > +     uint32_t backing_fmt_size;    /* in bytes */
> 
> It was discussed before, but I don't think we came to a conclusion. Are
> there any circumstances under which you don't want to set the
> QED_CF_BACKING_FORMAT flag?

I suggest the following:

QED_CF_BACKING_FORMAT_RAW = 0x1

When set, the backing file is a raw image and should not be probed for
its file format.  The default (unset) means that the backing image file
format may be probed.

Now the backing_fmt_{offset,size} are no longer necessary.

> 
> Also it's unclear what this "if" actually means: If the flag isn't set,
> are the fields zero, are they undefined or are they even completely
> missing and the offsets of the following fields must be adjusted?

I have updated the wiki:

"Fields predicated on a feature bit are only used when that feature is
set. The fields always take up header space, regardless of whether or
not the feature bit is set."

> 
> > + }
> > +
> > +Field descriptions:
> > +* cluster_size must be a power of 2 in range [2^12, 2^26].
> > +* table_size must be a power of 2 in range [1, 16].
> 
> Is there a reason why this must be a power of two?

The power of two makes logical-to-cluster offset translation easy and
cheap:

l2_table = get_l2_table(l1_table[(logical >> l2_shift) & l2_mask])
cluster = l2_table[logical >> l1_shift] + (logical & cluster_mask)

> 
> > +* header_size is the number of clusters used by the header and any additional information stored before regular clusters.
> > +* features and compat_features are bitmaps where active file format features can be selectively enabled.  The difference between the two is that an image file that uses unknown compat_features bits can be safely opened without knowing how to interpret those bits.  If an image file has an unsupported features bit set then it is not possible to open that image (the image is not backwards-compatible).
> > +* l1_table_offset must be a multiple of cluster_size.
> 
> And it is the offset of the first byte of the L1 table in the image file.

Updated, thanks.

> 
> > +* image_size is the block device size seen by the guest and must be a multiple of cluster_size.
> 
> So there are image sizes that can't be accurately represented in QED? I
> think that's a bad idea. Even more so because I can't see how it greatly
> simplifies implementation (you save the operation for rounding up on
> open/create, that's it) - it looks like a completely arbitrary restriction.

Good point.  I will try to lift this restriction in v3.

> 
> > +* backing_filename and backing_fmt are both strings in (byte offset, byte size) form.  They are not NUL-terminated and do not have alignment constraints.
> 
> A description of the meaning of these strings is missing.

Update:
"The backing filename string is given in the
backing_filename_{offset,size} fields and may be an absolute path or
relative to the image file."

> 
> > +
> > +Feature bits:
> > +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> > +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
> > +* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.
> 
> I suggest adding a headline "Compatibility Feature Bits". Seeing 0x01
> twice is confusing at first sight.

Updated, thanks.

> 
> > +
> > +==Tables==
> > +
> > +Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
> > +
> > + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
> > +  
> > + Table {
> > +     uint64_t offsets[TABLE_NOFFSETS];
> > + }
> > +
> > +The tables are organized as follows:
> > +
> > +                    +----------+
> > +                    | L1 table |
> > +                    +----------+
> > +               ,------'  |  '------.
> > +          +----------+   |    +----------+
> > +          | L2 table |  ...   | L2 table |
> > +          +----------+        +----------+
> > +      ,------'  |  '------.
> > + +----------+   |    +----------+
> > + |   Data   |  ...   |   Data   |
> > + +----------+        +----------+
> > +
> > +A table is made up of one or more contiguous clusters.  The table_size header field determines table size for an image file.  For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
> > +
> > +The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
> > + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
> > +
> > +Logical offsets are translated into cluster offsets as follows:
> > +
> > +  table_bits table_bits    cluster_bits
> > +  <--------> <--------> <--------------->
> > + +----------+----------+-----------------+
> > + | L1 index | L2 index |     byte offset |
> > + +----------+----------+-----------------+
> > + 
> > +       Structure of a logical offset
> > +
> > + def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
> > +   l2_offset = l1_table[l1_index]
> > +   l2_table = load_table(l2_offset)
> > +   cluster_offset = l2_table[l2_index]
> > +   return cluster_offset + byte_offset
> 
> Should we reserve some bits in the table entries in case we need some
> flags later? Also, I suppose all table entries must be cluster aligned?

Yes, let's do that.  At least for sparse zero cluster tracking we need a
bit.  The minimum 4k cluster size gives us 12 bits to play with.

> 
> What happened to the other sections that older versions of the spec
> contained? For example, this version doesn't specify any more what the
> semantics of unallocated clusters and backing files is.

I removed them because they don't describe the on-disk layout and were
more of a way to think through the implementation than a format
specification.  It was more a decision to focus my effort on improving
the on-disk layout specification than anything else.

Do you want the semantics in the specification, or is it okay to leave
that part on the wiki only?

Stefan
Avi Kivity Oct. 11, 2010, 3:39 p.m. UTC | #14
On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
> >
> >  It was discussed before, but I don't think we came to a conclusion. Are
> >  there any circumstances under which you don't want to set the
> >  QED_CF_BACKING_FORMAT flag?
>
> I suggest the following:
>
> QED_CF_BACKING_FORMAT_RAW = 0x1
>
> When set, the backing file is a raw image and should not be probed for
> its file format.  The default (unset) means that the backing image file
> format may be probed.
>
> Now the backing_fmt_{offset,size} are no longer necessary.

Should it not be an incompatible option?  If the backing disk starts 
with a format magic, it will be probed by an older qemu, incorrectly.
Anthony Liguori Oct. 11, 2010, 3:41 p.m. UTC | #15
On 10/11/2010 10:24 AM, Avi Kivity wrote:
>  On 10/11/2010 05:02 PM, Anthony Liguori wrote:
>> On 10/11/2010 08:44 AM, Avi Kivity wrote:
>>>  On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
>>>> >
>>>> >  A leak is acceptable (it won't grow; it's just an unused, incorrect
>>>> >  freelist), but data corruption is not.
>>>>
>>>> The alternative is for the freelist to be a non-compat feature bit.
>>>> That means older QEMU binaries cannot use a QED image that has enabled
>>>> the freelist.
>>>
>>> For this one feature.  What about others?
>>
>> A compat feature is one where the feature can be completely ignored 
>> (meaning that the QEMU does not have to understand the data format).
>>
>> An example of a compat feature is copy-on-read.  It's merely a 
>> suggestion and there is no additional metadata.  If a QEMU doesn't 
>> understand it, it doesn't affect it's ability to read the image.
>>
>> An example of a non-compat feature would be zero cluster entries.  
>> Zero cluster entries are a special L2 table entry that indicates that 
>> a cluster's on-disk data is all zeros.  As long as there is at least 
>> 1 ZCE in the L2 tables, this feature bit must be set.  As soon as all 
>> of the ZCE bits are cleared, the feature bit can be unset.
>>
>> An older QEMU will gracefully fail when presented with an image using 
>> ZCE bits.  An image with no ZCEs will work on older QEMUs.
>>
>
> What's the motivation behind ZCE?

It's very useful for Copy-on-Read.  If the cluster in the backing file 
is unallocated, then when you do a copy-on-read, you don't want to write 
out a zero cluster since you'd expand the image to it's maximum size.

It's also useful for operations like compaction in the absence of TRIM.  
The common implementation on platforms like VMware is to open a file and 
write zeros to it until it fills up the filesystem.  You then delete the 
file.  The result is that any unallocated data on the disk is written as 
zero and combined with zero-detection in the image format, you can 
compact the image size by marking unallocated blocks as ZCE.

> There is yet a third type of feature, one which is not strictly needed 
> in order to use the image, but if used, must be kept synchronized.  An 
> example is the freelist.  Another example is a directory index for a 
> filesystem.  I can't think of another example which would be relevant 
> to QED -- metadata checksums perhaps? -- we can always declare it a 
> non-compatible feature, but of course, it reduces compatibility.

You're suggesting a feature that is not strictly needed, but that needs 
to be kept up to date.  If it can't be kept up to date, something needs 
to happen to remove it.  Let's call this a transient feature.

Most of the transient features can be removed given some bit of code.  
For instance, ZCE can be removed by writing out zero clusters or writing 
an unallocated L2 entry if there is no backing file.

I think we could add a qemu-img demote command or something like that 
that attempted to remove features when possible.  That doesn't give you 
instant compatibility but I'm doubtful that you can come up with a 
generic way to remove a feature from an image without knowing anything 
about the image.

Regards,

Anthony Liguori

>
>
>
Stefan Hajnoczi Oct. 11, 2010, 3:46 p.m. UTC | #16
On Mon, Oct 11, 2010 at 05:39:01PM +0200, Avi Kivity wrote:
>  On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
> >>
> >>  It was discussed before, but I don't think we came to a conclusion. Are
> >>  there any circumstances under which you don't want to set the
> >>  QED_CF_BACKING_FORMAT flag?
> >
> >I suggest the following:
> >
> >QED_CF_BACKING_FORMAT_RAW = 0x1
> >
> >When set, the backing file is a raw image and should not be probed for
> >its file format.  The default (unset) means that the backing image file
> >format may be probed.
> >
> >Now the backing_fmt_{offset,size} are no longer necessary.
> 
> Should it not be an incompatible option?  If the backing disk starts
> with a format magic, it will be probed by an older qemu,
> incorrectly.

Agreed, it should be a non-compat feature bit.

Stefan
Avi Kivity Oct. 11, 2010, 3:47 p.m. UTC | #17
On 10/11/2010 05:41 PM, Anthony Liguori wrote:
> On 10/11/2010 10:24 AM, Avi Kivity wrote:
>>  On 10/11/2010 05:02 PM, Anthony Liguori wrote:
>>> On 10/11/2010 08:44 AM, Avi Kivity wrote:
>>>>  On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
>>>>> >
>>>>> >  A leak is acceptable (it won't grow; it's just an unused, 
>>>>> incorrect
>>>>> >  freelist), but data corruption is not.
>>>>>
>>>>> The alternative is for the freelist to be a non-compat feature bit.
>>>>> That means older QEMU binaries cannot use a QED image that has 
>>>>> enabled
>>>>> the freelist.
>>>>
>>>> For this one feature.  What about others?
>>>
>>> A compat feature is one where the feature can be completely ignored 
>>> (meaning that the QEMU does not have to understand the data format).
>>>
>>> An example of a compat feature is copy-on-read.  It's merely a 
>>> suggestion and there is no additional metadata.  If a QEMU doesn't 
>>> understand it, it doesn't affect it's ability to read the image.
>>>
>>> An example of a non-compat feature would be zero cluster entries.  
>>> Zero cluster entries are a special L2 table entry that indicates 
>>> that a cluster's on-disk data is all zeros.  As long as there is at 
>>> least 1 ZCE in the L2 tables, this feature bit must be set.  As soon 
>>> as all of the ZCE bits are cleared, the feature bit can be unset.
>>>
>>> An older QEMU will gracefully fail when presented with an image 
>>> using ZCE bits.  An image with no ZCEs will work on older QEMUs.
>>>
>>
>> What's the motivation behind ZCE?
>
> It's very useful for Copy-on-Read.  If the cluster in the backing file 
> is unallocated, then when you do a copy-on-read, you don't want to 
> write out a zero cluster since you'd expand the image to it's maximum 
> size.
>
> It's also useful for operations like compaction in the absence of 
> TRIM.  The common implementation on platforms like VMware is to open a 
> file and write zeros to it until it fills up the filesystem.  You then 
> delete the file.  The result is that any unallocated data on the disk 
> is written as zero and combined with zero-detection in the image 
> format, you can compact the image size by marking unallocated blocks 
> as ZCE.

Both make sense.  The latter is also useful with TRIM: if you have a 
backing image it's better to implement TRIM with ZCE rather than 
exposing the cluster from the backing file; it saves you a COW when you 
later reallocate the cluster.

>
>> There is yet a third type of feature, one which is not strictly 
>> needed in order to use the image, but if used, must be kept 
>> synchronized.  An example is the freelist.  Another example is a 
>> directory index for a filesystem.  I can't think of another example 
>> which would be relevant to QED -- metadata checksums perhaps? -- we 
>> can always declare it a non-compatible feature, but of course, it 
>> reduces compatibility.
>
> You're suggesting a feature that is not strictly needed, but that 
> needs to be kept up to date.  If it can't be kept up to date, 
> something needs to happen to remove it.  Let's call this a transient 
> feature.
>
> Most of the transient features can be removed given some bit of code.  
> For instance, ZCE can be removed by writing out zero clusters or 
> writing an unallocated L2 entry if there is no backing file.
>
> I think we could add a qemu-img demote command or something like that 
> that attempted to remove features when possible.  That doesn't give 
> you instant compatibility but I'm doubtful that you can come up with a 
> generic way to remove a feature from an image without knowing anything 
> about the image.
>

That should work, and in the worst case there is qemu-img convert (which 
should be taught about format options).
Anthony Liguori Oct. 11, 2010, 3:49 p.m. UTC | #18
On 10/11/2010 09:58 AM, Avi Kivity wrote:
>> A leak is unacceptable.  It means an image can grow to an unbounded 
>> size.  If you are a server provider offering multitenancy, then a 
>> malicious guest can potentially grow the image beyond it's allotted 
>> size causing a Denial of Service attack against another tenant.
>
>
> This particular leak cannot grow, and is not controlled by the guest.

As the image gets moved from hypervisor to hypervisor, it can keep 
growing if given a chance to fill up the disk, then trim it all way.

In a mixed hypervisor environment, it just becomes a numbers game.

>> A freelist has to be a non-optional feature.  When the freelist bit 
>> is set, an older QEMU cannot read the image.  If the freelist is 
>> completed used, the freelist bit can be cleared and the image is then 
>> usable by older QEMUs.
>
> Once we support TRIM (or detect zeros) we'll never have a clean freelist.

Zero detection doesn't add to the free list.

A potential solution here is to treat TRIM a little differently than 
we've been discussing.

When TRIM happens, don't immediately write an unallocated cluster entry 
for the L2.  Leave the L2 entry in-tact.  Don't actually write a UCE to 
the L2 until you actually allocate the block.

This implies a cost because you'll need to do metadata syncs to make 
this work.  However, that eliminates leakage.

Regards,

Anthony Liguori
Kevin Wolf Oct. 11, 2010, 3:50 p.m. UTC | #19
Am 11.10.2010 17:30, schrieb Stefan Hajnoczi:
> On Mon, Oct 11, 2010 at 03:58:07PM +0200, Kevin Wolf wrote:
>> Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
>>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>>> ---
>>>  docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 files changed, 94 insertions(+), 0 deletions(-)
>>>  create mode 100644 docs/specs/qed_spec.txt
>>>
>>> diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
>>> new file mode 100644
>>> index 0000000..c942b8e
>>> --- /dev/null
>>> +++ b/docs/specs/qed_spec.txt
>>> @@ -0,0 +1,94 @@
>>> +=Specification=
>>> +
>>> +The file format looks like this:
>>> +
>>> + +----------+----------+----------+-----+
>>> + | cluster0 | cluster1 | cluster2 | ... |
>>> + +----------+----------+----------+-----+
>>> +
>>> +The first cluster begins with the '''header'''.  The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file.  A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''.  L1 and L2 tables are composed of one or more contiguous clusters.
>>> +
>>> +Normally the file size will be a multiple of the cluster size.  If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written.  Legitimate extra information should use space between the header and the first regular cluster.
>>> +
>>> +All fields are little-endian.
>>> +
>>> +==Header==
>>> + Header {
>>> +     uint32_t magic;               /* QED\0 */
>>> + 
>>> +     uint32_t cluster_size;        /* in bytes */
>>> +     uint32_t table_size;          /* for L1 and L2 tables, in clusters */
>>> +     uint32_t header_size;         /* in clusters */
>>> + 
>>> +     uint64_t features;            /* format feature bits */
>>> +     uint64_t compat_features;     /* compat feature bits */
>>> +     uint64_t l1_table_offset;     /* in bytes */
>>> +     uint64_t image_size;          /* total logical image size, in bytes */
>>> + 
>>> +     /* if (features & QED_F_BACKING_FILE) */
>>> +     uint32_t backing_filename_offset; /* in bytes from start of header */
>>> +     uint32_t backing_filename_size;   /* in bytes */
>>> + 
>>> +     /* if (compat_features & QED_CF_BACKING_FORMAT) */
>>> +     uint32_t backing_fmt_offset;  /* in bytes from start of header */
>>> +     uint32_t backing_fmt_size;    /* in bytes */
>>
>> It was discussed before, but I don't think we came to a conclusion. Are
>> there any circumstances under which you don't want to set the
>> QED_CF_BACKING_FORMAT flag?
> 
> I suggest the following:
> 
> QED_CF_BACKING_FORMAT_RAW = 0x1
> 
> When set, the backing file is a raw image and should not be probed for
> its file format.  The default (unset) means that the backing image file
> format may be probed.

And it must be set for raw image formats because we'll want to avoid
considering raw for the probing.

Works for me.

>>> + }
>>> +
>>> +Field descriptions:
>>> +* cluster_size must be a power of 2 in range [2^12, 2^26].
>>> +* table_size must be a power of 2 in range [1, 16].
>>
>> Is there a reason why this must be a power of two?
> 
> The power of two makes logical-to-cluster offset translation easy and
> cheap:
> 
> l2_table = get_l2_table(l1_table[(logical >> l2_shift) & l2_mask])
> cluster = l2_table[logical >> l1_shift] + (logical & cluster_mask)

Right, good point.

>>> +* image_size is the block device size seen by the guest and must be a multiple of cluster_size.
>>
>> So there are image sizes that can't be accurately represented in QED? I
>> think that's a bad idea. Even more so because I can't see how it greatly
>> simplifies implementation (you save the operation for rounding up on
>> open/create, that's it) - it looks like a completely arbitrary restriction.
> 
> Good point.  I will try to lift this restriction in v3.

Thanks.

>>> +* backing_filename and backing_fmt are both strings in (byte offset, byte size) form.  They are not NUL-terminated and do not have alignment constraints.
>>
>> A description of the meaning of these strings is missing.
> 
> Update:
> "The backing filename string is given in the
> backing_filename_{offset,size} fields and may be an absolute path or
> relative to the image file."

Looks good.

>>
>>> +
>>> +Feature bits:
>>> +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
>>> +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
>>> +* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.
>>
>> I suggest adding a headline "Compatibility Feature Bits". Seeing 0x01
>> twice is confusing at first sight.
> 
> Updated, thanks.
> 
>>
>>> +
>>> +==Tables==
>>> +
>>> +Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
>>> +
>>> + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
>>> +  
>>> + Table {
>>> +     uint64_t offsets[TABLE_NOFFSETS];
>>> + }
>>> +
>>> +The tables are organized as follows:
>>> +
>>> +                    +----------+
>>> +                    | L1 table |
>>> +                    +----------+
>>> +               ,------'  |  '------.
>>> +          +----------+   |    +----------+
>>> +          | L2 table |  ...   | L2 table |
>>> +          +----------+        +----------+
>>> +      ,------'  |  '------.
>>> + +----------+   |    +----------+
>>> + |   Data   |  ...   |   Data   |
>>> + +----------+        +----------+
>>> +
>>> +A table is made up of one or more contiguous clusters.  The table_size header field determines table size for an image file.  For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
>>> +
>>> +The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
>>> + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
>>> +
>>> +Logical offsets are translated into cluster offsets as follows:
>>> +
>>> +  table_bits table_bits    cluster_bits
>>> +  <--------> <--------> <--------------->
>>> + +----------+----------+-----------------+
>>> + | L1 index | L2 index |     byte offset |
>>> + +----------+----------+-----------------+
>>> + 
>>> +       Structure of a logical offset
>>> +
>>> + def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
>>> +   l2_offset = l1_table[l1_index]
>>> +   l2_table = load_table(l2_offset)
>>> +   cluster_offset = l2_table[l2_index]
>>> +   return cluster_offset + byte_offset
>>
>> Should we reserve some bits in the table entries in case we need some
>> flags later? Also, I suppose all table entries must be cluster aligned?
> 
> Yes, let's do that.  At least for sparse zero cluster tracking we need a
> bit.  The minimum 4k cluster size gives us 12 bits to play with.

Yes, 12 bits should be enough for a while.

>> What happened to the other sections that older versions of the spec
>> contained? For example, this version doesn't specify any more what the
>> semantics of unallocated clusters and backing files is.
> 
> I removed them because they don't describe the on-disk layout and were
> more of a way to think through the implementation than a format
> specification.  It was more a decision to focus my effort on improving
> the on-disk layout specification than anything else.
> 
> Do you want the semantics in the specification, or is it okay to leave
> that part on the wiki only?

I think, at least we need to describe what an unallocated table entry
looks like (it has a zero cluster offset - but may it have flags?) and
what this means for the virtual disk content (read from the backing
file, zero if there is no backing file). This definitely belongs in the
specification.

Other parts that describe qemu's implementation or best practices can be
left on the wiki only.

Kevin
Avi Kivity Oct. 11, 2010, 4:02 p.m. UTC | #20
On 10/11/2010 05:49 PM, Anthony Liguori wrote:
> On 10/11/2010 09:58 AM, Avi Kivity wrote:
>>> A leak is unacceptable.  It means an image can grow to an unbounded 
>>> size.  If you are a server provider offering multitenancy, then a 
>>> malicious guest can potentially grow the image beyond it's allotted 
>>> size causing a Denial of Service attack against another tenant.
>>
>>
>> This particular leak cannot grow, and is not controlled by the guest.
>
> As the image gets moved from hypervisor to hypervisor, it can keep 
> growing if given a chance to fill up the disk, then trim it all way.
>
> In a mixed hypervisor environment, it just becomes a numbers game.

I don't see how it can grow.  Both the freelist and the clusters it 
points to consume space, which becomes a leak once you move it to a 
hypervisor that doesn't understand the freelist.  The older hypervisor 
then allocates new blocks.  As soon as it performs a metadata scan (if 
ever), the freelist is reclaimed.

You could only get a growing leak if you moved it to a hypervisor that 
doesn't perform metadata scans, but then that is independent of the 
freelist.

>
>>> A freelist has to be a non-optional feature.  When the freelist bit 
>>> is set, an older QEMU cannot read the image.  If the freelist is 
>>> completed used, the freelist bit can be cleared and the image is 
>>> then usable by older QEMUs.
>>
>> Once we support TRIM (or detect zeros) we'll never have a clean 
>> freelist.
>
> Zero detection doesn't add to the free list.

Why not?  If a cluster is zero filled, you may drop it (assuming no 
backing image).

>
> A potential solution here is to treat TRIM a little differently than 
> we've been discussing.
>
> When TRIM happens, don't immediately write an unallocated cluster 
> entry for the L2.  Leave the L2 entry in-tact.  Don't actually write a 
> UCE to the L2 until you actually allocate the block.
>
> This implies a cost because you'll need to do metadata syncs to make 
> this work.  However, that eliminates leakage.

The information is lost on shutdown; and you can have a large number of 
unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a user 
expecting a visit from RIAA).

A slight twist on your proposal is to have an allocated-but-may-drop bit 
in a L2 entry.  TRIM or zero detection sets the bit (leaving the cluster 
number intact).  A following write to the cluster needs to clear the 
bit; if we reallocate the cluster we need to replace it with a ZCE.

This makes the freelist all L2 entries with the bit set; it may be less 
efficient than a custom data structure though.
Anthony Liguori Oct. 11, 2010, 4:10 p.m. UTC | #21
On 10/11/2010 11:02 AM, Avi Kivity wrote:
>  On 10/11/2010 05:49 PM, Anthony Liguori wrote:
>> On 10/11/2010 09:58 AM, Avi Kivity wrote:
>>>> A leak is unacceptable.  It means an image can grow to an unbounded 
>>>> size.  If you are a server provider offering multitenancy, then a 
>>>> malicious guest can potentially grow the image beyond it's allotted 
>>>> size causing a Denial of Service attack against another tenant.
>>>
>>>
>>> This particular leak cannot grow, and is not controlled by the guest.
>>
>> As the image gets moved from hypervisor to hypervisor, it can keep 
>> growing if given a chance to fill up the disk, then trim it all way.
>>
>> In a mixed hypervisor environment, it just becomes a numbers game.
>
> I don't see how it can grow.  Both the freelist and the clusters it 
> points to consume space, which becomes a leak once you move it to a 
> hypervisor that doesn't understand the freelist.  The older hypervisor 
> then allocates new blocks.  As soon as it performs a metadata scan (if 
> ever), the freelist is reclaimed.

Assume you don't ever do a metadata scan (which is really our design point).

If you move to a hypervisor that doesn't support it, then move to a 
hypervisor that does, you create a brand new freelist and start leaking 
more space.  This isn't a contrived scenario if you have a cloud 
environment with a mix of hosts.

You might not be able to get a ping-pong every time you provision, but 
with enough effort, you could create serious problems.

It's really an issue of correctness.  Making correctness trade-offs for 
the purpose of compatibility is a policy decision and not something we 
should bake into an image format.  If a tool feels strongly that it's a 
reasonable trade off to make, it can always fudge the feature bits itself.

>>
>>>> A freelist has to be a non-optional feature.  When the freelist bit 
>>>> is set, an older QEMU cannot read the image.  If the freelist is 
>>>> completed used, the freelist bit can be cleared and the image is 
>>>> then usable by older QEMUs.
>>>
>>> Once we support TRIM (or detect zeros) we'll never have a clean 
>>> freelist.
>>
>> Zero detection doesn't add to the free list.
>
> Why not?  If a cluster is zero filled, you may drop it (assuming no 
> backing image).

Sorry, I was thinking about the case of copy-on-read.  When you 
transition from UCE -> ZCE, nothing gets added to the free list.  But if 
you go from allocated -> ZCE, then you would add to the free list.

>>
>> A potential solution here is to treat TRIM a little differently than 
>> we've been discussing.
>>
>> When TRIM happens, don't immediately write an unallocated cluster 
>> entry for the L2.  Leave the L2 entry in-tact.  Don't actually write 
>> a UCE to the L2 until you actually allocate the block.
>>
>> This implies a cost because you'll need to do metadata syncs to make 
>> this work.  However, that eliminates leakage.
>
> The information is lost on shutdown; and you can have a large number 
> of unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a 
> user expecting a visit from RIAA).
>
> A slight twist on your proposal is to have an allocated-but-may-drop 
> bit in a L2 entry.  TRIM or zero detection sets the bit (leaving the 
> cluster number intact).  A following write to the cluster needs to 
> clear the bit; if we reallocate the cluster we need to replace it with 
> a ZCE.

Yeah, this is sort of what I was thinking.  You would still want a free 
list but it becomes totally optional because if it's lost, no data is 
leaked (assuming that the older version understands the bit).

I was suggesting that we store that bit in the free list though because 
that let's us support having older QEMUs with absolutely no knowledge 
still work.

> This makes the freelist all L2 entries with the bit set; it may be 
> less efficient than a custom data structure though.

We still want the freelist to avoid recreating it.  We also want to 
store the allocated-but-may-drop bit in the free list.

Regards,

Anthony Liguori
Anthony Liguori Oct. 11, 2010, 4:18 p.m. UTC | #22
On 10/11/2010 10:46 AM, Stefan Hajnoczi wrote:
> On Mon, Oct 11, 2010 at 05:39:01PM +0200, Avi Kivity wrote:
>    
>>   On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
>>      
>>>>   It was discussed before, but I don't think we came to a conclusion. Are
>>>>   there any circumstances under which you don't want to set the
>>>>   QED_CF_BACKING_FORMAT flag?
>>>>          
>>> I suggest the following:
>>>
>>> QED_CF_BACKING_FORMAT_RAW = 0x1
>>>
>>> When set, the backing file is a raw image and should not be probed for
>>> its file format.  The default (unset) means that the backing image file
>>> format may be probed.
>>>
>>> Now the backing_fmt_{offset,size} are no longer necessary.
>>>        
>> Should it not be an incompatible option?  If the backing disk starts
>> with a format magic, it will be probed by an older qemu,
>> incorrectly.
>>      
> Agreed, it should be a non-compat feature bit.
>    

If it's just raw or not raw, then I agree it should be non-compat.

I think we just need a feature bit then that indicates that the backing 
file is non-probeable which certainly simplifies the implementation.

QED_F_BACKING_FORMAT_NOPROBE maybe?

Regards,

Anthony Liguori

> Stefan
>
>
Anthony Liguori Oct. 11, 2010, 5:14 p.m. UTC | #23
On 10/11/2010 11:18 AM, Anthony Liguori wrote:
> On 10/11/2010 10:46 AM, Stefan Hajnoczi wrote:
>> On Mon, Oct 11, 2010 at 05:39:01PM +0200, Avi Kivity wrote:
>>>   On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
>>>>>   It was discussed before, but I don't think we came to a 
>>>>> conclusion. Are
>>>>>   there any circumstances under which you don't want to set the
>>>>>   QED_CF_BACKING_FORMAT flag?
>>>> I suggest the following:
>>>>
>>>> QED_CF_BACKING_FORMAT_RAW = 0x1
>>>>
>>>> When set, the backing file is a raw image and should not be probed for
>>>> its file format.  The default (unset) means that the backing image 
>>>> file
>>>> format may be probed.
>>>>
>>>> Now the backing_fmt_{offset,size} are no longer necessary.
>>> Should it not be an incompatible option?  If the backing disk starts
>>> with a format magic, it will be probed by an older qemu,
>>> incorrectly.
>> Agreed, it should be a non-compat feature bit.
>
> If it's just raw or not raw, then I agree it should be non-compat.
>
> I think we just need a feature bit then that indicates that the 
> backing file is non-probeable which certainly simplifies the 
> implementation.
>
> QED_F_BACKING_FORMAT_NOPROBE maybe?

Er, thinking more, this is still a good idea but we still need 
QED_CF_BACKING_FORMAT because we specifically need to know when a 
protocol is specified.  Otherwise, we have no way of doing nbd as a 
backing file.

Regards,

Anthony Liguori

> Regards,
>
> Anthony Liguori
>
>> Stefan
>>
>
Kevin Wolf Oct. 12, 2010, 8:07 a.m. UTC | #24
Am 11.10.2010 19:14, schrieb Anthony Liguori:
> On 10/11/2010 11:18 AM, Anthony Liguori wrote:
>> On 10/11/2010 10:46 AM, Stefan Hajnoczi wrote:
>>> On Mon, Oct 11, 2010 at 05:39:01PM +0200, Avi Kivity wrote:
>>>>   On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
>>>>>>   It was discussed before, but I don't think we came to a 
>>>>>> conclusion. Are
>>>>>>   there any circumstances under which you don't want to set the
>>>>>>   QED_CF_BACKING_FORMAT flag?
>>>>> I suggest the following:
>>>>>
>>>>> QED_CF_BACKING_FORMAT_RAW = 0x1
>>>>>
>>>>> When set, the backing file is a raw image and should not be probed for
>>>>> its file format.  The default (unset) means that the backing image 
>>>>> file
>>>>> format may be probed.
>>>>>
>>>>> Now the backing_fmt_{offset,size} are no longer necessary.
>>>> Should it not be an incompatible option?  If the backing disk starts
>>>> with a format magic, it will be probed by an older qemu,
>>>> incorrectly.
>>> Agreed, it should be a non-compat feature bit.
>>
>> If it's just raw or not raw, then I agree it should be non-compat.
>>
>> I think we just need a feature bit then that indicates that the 
>> backing file is non-probeable which certainly simplifies the 
>> implementation.
>>
>> QED_F_BACKING_FORMAT_NOPROBE maybe?
> 
> Er, thinking more, this is still a good idea but we still need 
> QED_CF_BACKING_FORMAT because we specifically need to know when a 
> protocol is specified.  Otherwise, we have no way of doing nbd as a 
> backing file.

Well, the protocol is currently encoded in the file name, separated by a
colon. Of course, we want to get rid of that, but we still don't know
what we want instead. It's completely unrelated to the backing file
format, though, it's about the format of the backing file name.

Kevin
Avi Kivity Oct. 12, 2010, 10:25 a.m. UTC | #25
On 10/11/2010 06:10 PM, Anthony Liguori wrote:
> On 10/11/2010 11:02 AM, Avi Kivity wrote:
>>  On 10/11/2010 05:49 PM, Anthony Liguori wrote:
>>> On 10/11/2010 09:58 AM, Avi Kivity wrote:
>>>>> A leak is unacceptable.  It means an image can grow to an 
>>>>> unbounded size.  If you are a server provider offering 
>>>>> multitenancy, then a malicious guest can potentially grow the 
>>>>> image beyond it's allotted size causing a Denial of Service attack 
>>>>> against another tenant.
>>>>
>>>>
>>>> This particular leak cannot grow, and is not controlled by the guest.
>>>
>>> As the image gets moved from hypervisor to hypervisor, it can keep 
>>> growing if given a chance to fill up the disk, then trim it all way.
>>>
>>> In a mixed hypervisor environment, it just becomes a numbers game.
>>
>> I don't see how it can grow.  Both the freelist and the clusters it 
>> points to consume space, which becomes a leak once you move it to a 
>> hypervisor that doesn't understand the freelist.  The older 
>> hypervisor then allocates new blocks.  As soon as it performs a 
>> metadata scan (if ever), the freelist is reclaimed.
>
> Assume you don't ever do a metadata scan (which is really our design 
> point).

What about crashes?

>
> If you move to a hypervisor that doesn't support it, then move to a 
> hypervisor that does, you create a brand new freelist and start 
> leaking more space.  This isn't a contrived scenario if you have a 
> cloud environment with a mix of hosts.

It's only a leak if you don't do a metadata scan.

>
> You might not be able to get a ping-pong every time you provision, but 
> with enough effort, you could create serious problems.
>
> It's really an issue of correctness.  Making correctness trade-offs 
> for the purpose of compatibility is a policy decision and not 
> something we should bake into an image format.  If a tool feels 
> strongly that it's a reasonable trade off to make, it can always fudge 
> the feature bits itself.

I think the effort here is reasonable, clearing a bit on startup is not 
that complicated.

>>>
>>> A potential solution here is to treat TRIM a little differently than 
>>> we've been discussing.
>>>
>>> When TRIM happens, don't immediately write an unallocated cluster 
>>> entry for the L2.  Leave the L2 entry in-tact.  Don't actually write 
>>> a UCE to the L2 until you actually allocate the block.
>>>
>>> This implies a cost because you'll need to do metadata syncs to make 
>>> this work.  However, that eliminates leakage.
>>
>> The information is lost on shutdown; and you can have a large number 
>> of unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a 
>> user expecting a visit from RIAA).
>>
>> A slight twist on your proposal is to have an allocated-but-may-drop 
>> bit in a L2 entry.  TRIM or zero detection sets the bit (leaving the 
>> cluster number intact).  A following write to the cluster needs to 
>> clear the bit; if we reallocate the cluster we need to replace it 
>> with a ZCE.
>
> Yeah, this is sort of what I was thinking.  You would still want a 
> free list but it becomes totally optional because if it's lost, no 
> data is leaked (assuming that the older version understands the bit).
>
> I was suggesting that we store that bit in the free list though 
> because that let's us support having older QEMUs with absolutely no 
> knowledge still work.

It doesn't - on rewrite an old qemu won't clear the bit, so a newer qemu 
would think it's still free.

The autoclear bit solves it nicely - the old qemu automatically drops 
the allocated-but-may-drop bits, undoing any TRIMs (which is 
unfortunate) but preserving consistency.
Stefan Hajnoczi Oct. 12, 2010, 1:16 p.m. UTC | #26
On Tue, Oct 12, 2010 at 10:07:53AM +0200, Kevin Wolf wrote:
> Am 11.10.2010 19:14, schrieb Anthony Liguori:
> > On 10/11/2010 11:18 AM, Anthony Liguori wrote:
> >> On 10/11/2010 10:46 AM, Stefan Hajnoczi wrote:
> >>> On Mon, Oct 11, 2010 at 05:39:01PM +0200, Avi Kivity wrote:
> >>>>   On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
> >>>>>>   It was discussed before, but I don't think we came to a 
> >>>>>> conclusion. Are
> >>>>>>   there any circumstances under which you don't want to set the
> >>>>>>   QED_CF_BACKING_FORMAT flag?
> >>>>> I suggest the following:
> >>>>>
> >>>>> QED_CF_BACKING_FORMAT_RAW = 0x1
> >>>>>
> >>>>> When set, the backing file is a raw image and should not be probed for
> >>>>> its file format.  The default (unset) means that the backing image 
> >>>>> file
> >>>>> format may be probed.
> >>>>>
> >>>>> Now the backing_fmt_{offset,size} are no longer necessary.
> >>>> Should it not be an incompatible option?  If the backing disk starts
> >>>> with a format magic, it will be probed by an older qemu,
> >>>> incorrectly.
> >>> Agreed, it should be a non-compat feature bit.
> >>
> >> If it's just raw or not raw, then I agree it should be non-compat.
> >>
> >> I think we just need a feature bit then that indicates that the 
> >> backing file is non-probeable which certainly simplifies the 
> >> implementation.
> >>
> >> QED_F_BACKING_FORMAT_NOPROBE maybe?
> > 
> > Er, thinking more, this is still a good idea but we still need 
> > QED_CF_BACKING_FORMAT because we specifically need to know when a 
> > protocol is specified.  Otherwise, we have no way of doing nbd as a 
> > backing file.
> 
> Well, the protocol is currently encoded in the file name, separated by a
> colon. Of course, we want to get rid of that, but we still don't know
> what we want instead. It's completely unrelated to the backing file
> format, though, it's about the format of the backing file name.

I agree with Kevin.  There's no need to have the ill-defined backing
format AFAICT.

Stefan
Anthony Liguori Oct. 12, 2010, 1:32 p.m. UTC | #27
On 10/12/2010 08:16 AM, Stefan Hajnoczi wrote:
>
>> Well, the protocol is currently encoded in the file name, separated by a
>> colon. Of course, we want to get rid of that, but we still don't know
>> what we want instead. It's completely unrelated to the backing file
>> format, though, it's about the format of the backing file name.
>>      
> I agree with Kevin.  There's no need to have the ill-defined backing
> format AFAICT.
>    

Yeah, I've now convinced myself we don't need backing format name too.

Regards,

Anthony Liguori

> Stefan
>
diff mbox

Patch

diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
new file mode 100644
index 0000000..c942b8e
--- /dev/null
+++ b/docs/specs/qed_spec.txt
@@ -0,0 +1,94 @@ 
+=Specification=
+
+The file format looks like this:
+
+ +----------+----------+----------+-----+
+ | cluster0 | cluster1 | cluster2 | ... |
+ +----------+----------+----------+-----+
+
+The first cluster begins with the '''header'''.  The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file.  A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''.  L1 and L2 tables are composed of one or more contiguous clusters.
+
+Normally the file size will be a multiple of the cluster size.  If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written.  Legitimate extra information should use space between the header and the first regular cluster.
+
+All fields are little-endian.
+
+==Header==
+ Header {
+     uint32_t magic;               /* QED\0 */
+ 
+     uint32_t cluster_size;        /* in bytes */
+     uint32_t table_size;          /* for L1 and L2 tables, in clusters */
+     uint32_t header_size;         /* in clusters */
+ 
+     uint64_t features;            /* format feature bits */
+     uint64_t compat_features;     /* compat feature bits */
+     uint64_t l1_table_offset;     /* in bytes */
+     uint64_t image_size;          /* total logical image size, in bytes */
+ 
+     /* if (features & QED_F_BACKING_FILE) */
+     uint32_t backing_filename_offset; /* in bytes from start of header */
+     uint32_t backing_filename_size;   /* in bytes */
+ 
+     /* if (compat_features & QED_CF_BACKING_FORMAT) */
+     uint32_t backing_fmt_offset;  /* in bytes from start of header */
+     uint32_t backing_fmt_size;    /* in bytes */
+ }
+
+Field descriptions:
+* cluster_size must be a power of 2 in range [2^12, 2^26].
+* table_size must be a power of 2 in range [1, 16].
+* header_size is the number of clusters used by the header and any additional information stored before regular clusters.
+* features and compat_features are bitmaps where active file format features can be selectively enabled.  The difference between the two is that an image file that uses unknown compat_features bits can be safely opened without knowing how to interpret those bits.  If an image file has an unsupported features bit set then it is not possible to open that image (the image is not backwards-compatible).
+* l1_table_offset must be a multiple of cluster_size.
+* image_size is the block device size seen by the guest and must be a multiple of cluster_size.
+* backing_filename and backing_fmt are both strings in (byte offset, byte size) form.  They are not NUL-terminated and do not have alignment constraints.
+
+Feature bits:
+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
+* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.
+
+==Tables==
+
+Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
+
+ #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
+  
+ Table {
+     uint64_t offsets[TABLE_NOFFSETS];
+ }
+
+The tables are organized as follows:
+
+                    +----------+
+                    | L1 table |
+                    +----------+
+               ,------'  |  '------.
+          +----------+   |    +----------+
+          | L2 table |  ...   | L2 table |
+          +----------+        +----------+
+      ,------'  |  '------.
+ +----------+   |    +----------+
+ |   Data   |  ...   |   Data   |
+ +----------+        +----------+
+
+A table is made up of one or more contiguous clusters.  The table_size header field determines table size for an image file.  For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
+
+The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
+ header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
+
+Logical offsets are translated into cluster offsets as follows:
+
+  table_bits table_bits    cluster_bits
+  <--------> <--------> <--------------->
+ +----------+----------+-----------------+
+ | L1 index | L2 index |     byte offset |
+ +----------+----------+-----------------+
+ 
+       Structure of a logical offset
+
+ def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
+   l2_offset = l1_table[l1_index]
+   l2_table = load_table(l2_offset)
+   cluster_offset = l2_table[l2_index]
+   return cluster_offset + byte_offset