Patchwork [RFC,V4,01/30] qcow2: Add deduplication to the qcow2 specification.

login
register
mail settings
Submitter Benoît Canet
Date Jan. 2, 2013, 4:16 p.m.
Message ID <1357143393-29832-2-git-send-email-benoit@irqsave.net>
Download mbox | patch
Permalink /patch/209073/
State New
Headers show

Comments

Benoît Canet - Jan. 2, 2013, 4:16 p.m.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 docs/specs/qcow2.txt |  100 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 99 insertions(+), 1 deletion(-)
Eric Blake - Jan. 3, 2013, 6:18 p.m.
On 01/02/2013 09:16 AM, Benoît Canet wrote:
> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
>  docs/specs/qcow2.txt |  100 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 99 insertions(+), 1 deletion(-)
> 
> diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> index 36a559d..c9c0d47 100644
> --- a/docs/specs/qcow2.txt
> +++ b/docs/specs/qcow2.txt
> @@ -80,7 +80,12 @@ in the description of a field.
>                                  tables to repair refcounts before accessing the
>                                  image.
>  
> -                    Bits 1-63:  Reserved (set to 0)
> +                    Bit 1:      Deduplication bit.  If this bit is set then
> +                                deduplication is used on this image.

This part seems fine; and I agree with making this an incompatible
feature (as an older qemu that does not understand dedup would not keep
the dedup table up-to-date).

> +                                L2 tables size 64KB is different from
> +                                cluster size 4KB.

Umm, doesn't the cluster_bits (bytes 20-23 of the header) determine the
size of a cluster, rather than assuming a cluster is always 4KB?  And
later on, the spec says that "L2 tables are exactly one cluster in
size.", so I'm not sure what this comment is doing here.  Or are you
stating that deduplication _also_ has an L2 table, which is fixed in
size (unlike the normal L2 table for actual data)?

> +== Deduplication ==
> +
> +The deduplication extension contains the informations concerning the

s/informations concerning the/information concerning/

> +deduplication.
> +
> +    Byte   0 - 7:   Offset of the RAM deduplication table
> +
> +          8 - 11:   Size of the RAM deduplication table = number of L1 64-bit
> +                    pointers
> +
> +              12:   Hash algo enum field
> +                        0: SHA-256
> +                        1: SHA3
> +                        2: SKEIN-256
> +
> +              13:   Dedup stategies bitmap

s/stategies/strategies/

> +                        0: RAM based hash lookup
> +                        1: Disk based hash lookup
> +
> +Disk based lookup structure will be described in a future QCOW2 specification.

Does that mean that strategy must be 0 for now?

> +
> +== Deduplication table (RAM method) ==
> +
> +The deduplication table maps a physical offset to a data hash and
> +logical offset. It is used to store permanently the informations required to

s/store permanently the informations/permanently store the information/

> +do the deduplication. It is loaded at startup into a RAM based representation
> +used to do the lookups.
> +
> +The deduplication table contains 64-bit offsets to the level 2 deduplication
> +table blocks.
> +Each entry of these blocks contains a 32-byte SHA256 hash followed by the
> +64-bit logical offset of the first encountered cluster having this hash.
> +
> +== Deduplication table schematic (RAM method) ==
> +
> +0       l1_dedup_index                                              Size
> +              |
> +|--------------------------------------------------------------------|
> +|             |                                                      |
> +|             |        L1 Deduplication table                        |
> +|             |                                                      |
> +|--------------------------------------------------------------------|
> +              |
> +              |
> +              |
> +0             |           l2_dedup_block_entries
> +              |
> +|---------------------------------|
> +|                                 |
> +|    L2 deduplication block       |
> +|                                 |
> +|                 l2_dedup_index  |
> +|---------------------------------|
> +                         |
> +         0               |              40
> +                         |
> +         |-------------------------------|
> +         |                               |
> +         |    Deduplication table entry  |
> +         |                               |
> +         |-------------------------------|
> +
> +
> +== Deduplication table entry description (RAM method) ==
> +
> +Each L2 deduplication table entry has the following structure:
> +
> +    Byte  0 - 31:   hash of data cluster
> +
> +         32 - 39:   Logical offset of first encountered block having
> +                    this hash
> +
> +== Deduplication table arithmetics (RAM method) ==
> +
> +Entries in the deduplication table are ordered by physical cluster index.
> +
> +The number of entries in an l2 deduplication table block is :
> +l2_dedup_block_entries = dedup_block_size / (32 + 8)

I'd write this as CEIL(dedup_block_size / (32 + 8)) to make it clear
that it rounds up...

> +
> +The index in the level 1 deduplication table is :
> +l1_dedup_index = physical_cluster_index / l2_block_cluster_entries
> +
> +The index in the level 2 deduplication table is:
> +l2_dedup_index = physical_cluster_index % l2_block_cluster_entries
> +
> +cluster_size = 4096
> +dedup_block_size = 65536
> +l2_size = 65536
> +
> +The 16 remaining bytes in each l2 deduplication blocks are set to zero and
> +reserved for a future usage.

...and move this paragraph closer to the point where you mention the
rounding up in size.

> +
>  == Host cluster management ==
>  
>  qcow2 manages the allocation of host clusters by maintaining a reference count
>
Benoît Canet - Jan. 4, 2013, 2:49 p.m.
> > +                                L2 tables size 64KB is different from
> > +                                cluster size 4KB.
> 
> Umm, doesn't the cluster_bits (bytes 20-23 of the header) determine the
> size of a cluster, rather than assuming a cluster is always 4KB?  And
> later on, the spec says that "L2 tables are exactly one cluster in
> size.", so I'm not sure what this comment is doing here.  Or are you
> stating that deduplication _also_ has an L2 table, which is fixed in
> size (unlike the normal L2 table for actual data)?

As most filesystems (ntfs, ext2/3/4, xfs) use 4KB blocs the deduplication
works very well with 4KB clusters.

The problem with 4KB cluster is that L2 table allocations are done very often
and require a flush to disk which kill performance.

So my patchset break compatibility with regular qcow2 image by using 64KB L2
table which are of a different size than the 4KB cluster.

I'll let the choice for the user to choose the cluster size but will default
to 4KB when creating a deduplicated image.
Benoît Canet - Jan. 16, 2013, 2:50 p.m.
> I'd write this as CEIL(dedup_block_size / (32 + 8)) to make it clear
> that it rounds up...

Isn't it FLOOR instead of CEIL ? (off by one error) ?

Benoît
Eric Blake - Jan. 16, 2013, 3:58 p.m.
On 01/16/2013 07:50 AM, Benoît Canet wrote:
>> I'd write this as CEIL(dedup_block_size / (32 + 8)) to make it clear
>> that it rounds up...
> 
> Isn't it FLOOR instead of CEIL ? (off by one error) ?

Indeed, my reply was a bit too hasty, and I mixed terminology.  The
number of entries that fits in a page is determined by rounding down
(floor); the amount of memory consumed by that many entries is
determined by rounding up to page size (ceil).  But at least you got
what I meant :)

Patch

diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
index 36a559d..c9c0d47 100644
--- a/docs/specs/qcow2.txt
+++ b/docs/specs/qcow2.txt
@@ -80,7 +80,12 @@  in the description of a field.
                                 tables to repair refcounts before accessing the
                                 image.
 
-                    Bits 1-63:  Reserved (set to 0)
+                    Bit 1:      Deduplication bit.  If this bit is set then
+                                deduplication is used on this image.
+                                L2 tables size 64KB is different from
+                                cluster size 4KB.
+
+                    Bits 2-63:  Reserved (set to 0)
 
          80 -  87:  compatible_features
                     Bitmask of compatible features. An implementation can
@@ -116,6 +121,7 @@  be stored. Each extension has a structure like the following:
                         0x00000000 - End of the header extension area
                         0xE2792ACA - Backing file format name
                         0x6803f857 - Feature name table
+                        0xCD8E819B - Deduplication
                         other      - Unknown header extension, can be safely
                                      ignored
 
@@ -159,6 +165,98 @@  the header extension data. Each entry look like this:
                     terminated if it has full length)
 
 
+== Deduplication ==
+
+The deduplication extension contains the informations concerning the
+deduplication.
+
+    Byte   0 - 7:   Offset of the RAM deduplication table
+
+          8 - 11:   Size of the RAM deduplication table = number of L1 64-bit
+                    pointers
+
+              12:   Hash algo enum field
+                        0: SHA-256
+                        1: SHA3
+                        2: SKEIN-256
+
+              13:   Dedup stategies bitmap
+                        0: RAM based hash lookup
+                        1: Disk based hash lookup
+
+Disk based lookup structure will be described in a future QCOW2 specification.
+
+== Deduplication table (RAM method) ==
+
+The deduplication table maps a physical offset to a data hash and
+logical offset. It is used to store permanently the informations required to
+do the deduplication. It is loaded at startup into a RAM based representation
+used to do the lookups.
+
+The deduplication table contains 64-bit offsets to the level 2 deduplication
+table blocks.
+Each entry of these blocks contains a 32-byte SHA256 hash followed by the
+64-bit logical offset of the first encountered cluster having this hash.
+
+== Deduplication table schematic (RAM method) ==
+
+0       l1_dedup_index                                              Size
+              |
+|--------------------------------------------------------------------|
+|             |                                                      |
+|             |        L1 Deduplication table                        |
+|             |                                                      |
+|--------------------------------------------------------------------|
+              |
+              |
+              |
+0             |           l2_dedup_block_entries
+              |
+|---------------------------------|
+|                                 |
+|    L2 deduplication block       |
+|                                 |
+|                 l2_dedup_index  |
+|---------------------------------|
+                         |
+         0               |              40
+                         |
+         |-------------------------------|
+         |                               |
+         |    Deduplication table entry  |
+         |                               |
+         |-------------------------------|
+
+
+== Deduplication table entry description (RAM method) ==
+
+Each L2 deduplication table entry has the following structure:
+
+    Byte  0 - 31:   hash of data cluster
+
+         32 - 39:   Logical offset of first encountered block having
+                    this hash
+
+== Deduplication table arithmetics (RAM method) ==
+
+Entries in the deduplication table are ordered by physical cluster index.
+
+The number of entries in an l2 deduplication table block is :
+l2_dedup_block_entries = dedup_block_size / (32 + 8)
+
+The index in the level 1 deduplication table is :
+l1_dedup_index = physical_cluster_index / l2_block_cluster_entries
+
+The index in the level 2 deduplication table is:
+l2_dedup_index = physical_cluster_index % l2_block_cluster_entries
+
+cluster_size = 4096
+dedup_block_size = 65536
+l2_size = 65536
+
+The 16 remaining bytes in each l2 deduplication blocks are set to zero and
+reserved for a future usage.
+
 == Host cluster management ==
 
 qcow2 manages the allocation of host clusters by maintaining a reference count