
[RFC,V8,01/24] qcow2: Add journal specification.

Message ID 1371738392-9594-2-git-send-email-benoit@irqsave.net
State New

Commit Message

Benoît Canet June 20, 2013, 2:26 p.m. UTC
---
 docs/specs/qcow2.txt |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

Comments

Stefan Hajnoczi July 2, 2013, 2:42 p.m. UTC | #1
On Thu, Jun 20, 2013 at 04:26:09PM +0200, Benoît Canet wrote:
> ---
>  docs/specs/qcow2.txt |   42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> index 36a559d..a4ffc85 100644
> --- a/docs/specs/qcow2.txt
> +++ b/docs/specs/qcow2.txt
> @@ -350,3 +350,45 @@ Snapshot table entry:
>          variable:   Unique ID string for the snapshot (not null terminated)
>  
>          variable:   Name of the snapshot (not null terminated)
> +
> +== Journal ==
> +
> +QCOW2 can use one or more instance of a metadata journal.

s/instance/instances/

Is there a reason to use multiple journals rather than a single journal
for all entry types?  The single journal area avoids seeks.

> +
> +A journal is a sequential log of journal entries appended on a previously
> +allocated and reseted area.

I think you say "previously reset area" instead of "reseted".  Another
option is "initialized area".

> +A journal is designed like a linked list with each entry pointing to the next
> +so it's easy to iterate over entries.
> +
> +A journal uses the following constants to denote the type of each entry
> +
> +TYPE_NONE = 0xFF      default value of any bytes in a reseted journal
> +TYPE_END  = 1         the entry ends a journal cluster and point to the next
> +                      cluster
> +TYPE_HASH = 2         the entry contains a deduplication hash
> +
> +QCOW2 journal entry:
> +
> +    Byte 0         :    Size of the entry: size = 2 + n with size <= 254

This is not clear.  I'm wondering if the +2 is included in the byte
value or not.  I'm also wondering what a byte value of zero means and
what a byte value of 255 means.

Please include an example to illustrate how this field works.

> +
> +         1         :    Type of the entry
> +
> +         2 - size  :    The optional n bytes structure carried by entry
> +
> +A journal is divided into clusters and no journal entry can be spilled on two
> +clusters. This avoid having to read more than one cluster to get a single entry.
> +
> +For this purpose an entry with the end type is added at the end of a journal
> +cluster before starting to write in the next cluster.
> +The size of such an entry is set so the entry points to the next cluster.
> +
> +As any journal cluster must be ended with an end entry the size of regular
> +journal entries is limited to 254 bytes in order to always left room for an end
> +entry which mimimal size is two bytes.
> +
> +The only cases where size > 254 are none entries where size = 255.
> +
> +The replay of a journal stop when the first end none entry is reached.

s/stop/stops/

> +The journal cluster size is 4096 bytes.
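
To make the linked-list iteration concrete, here is a minimal sketch of replaying one journal cluster as quoted above (constants and field layout taken from the quoted spec; the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    TYPE_END  = 1,
    TYPE_HASH = 2,
    TYPE_NONE = 0xFF,
    JOURNAL_CLUSTER_SIZE = 4096,
};

/* Walk the entries of one journal cluster, byte 0 of each entry being
 * its total size (header included) and byte 1 its type.  Returns the
 * number of payload entries seen before the first end/none entry. */
static int replay_cluster(const uint8_t *cluster)
{
    int off = 0, entries = 0;

    while (off + 2 <= JOURNAL_CLUSTER_SIZE) {
        uint8_t size = cluster[off];
        uint8_t type = cluster[off + 1];

        if (type == TYPE_END || type == TYPE_NONE) {
            break;      /* end of committed entries in this cluster */
        }
        if (size < 2) {
            break;      /* corrupt: size must include the 2-byte header */
        }
        /* payload is cluster[off + 2] .. cluster[off + size - 1] */
        entries++;
        off += size;    /* the size field points to the next entry */
    }
    return entries;
}
```

A freshly reset cluster is all TYPE_NONE bytes (0xFF), so replay stops immediately; a TYPE_END entry would instead send the reader on to the next cluster.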

Questions about this layout:

1. Journal entries have no integrity mechanism, which is especially
   important if they span physical sectors where cheap disks may perform
   a partial write.  This would leave a corrupt journal.  If the last
   bytes are a checksum then you can get some confidence that the entry
   was fully written and is valid.

   Did I miss something?

2. Byte-granularity means that read-modify-write is necessary to append
   entries to the journal.  Therefore a failure could destroy previously
   committed entries.

   Any ideas how existing journals handle this?
Kevin Wolf July 2, 2013, 2:54 p.m. UTC | #2
On 02.07.2013 at 16:42, Stefan Hajnoczi wrote:
> On Thu, Jun 20, 2013 at 04:26:09PM +0200, Benoît Canet wrote:
> > ---
> >  docs/specs/qcow2.txt |   42 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> > 
> > diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> > index 36a559d..a4ffc85 100644
> > --- a/docs/specs/qcow2.txt
> > +++ b/docs/specs/qcow2.txt
> > @@ -350,3 +350,45 @@ Snapshot table entry:
> >          variable:   Unique ID string for the snapshot (not null terminated)
> >  
> >          variable:   Name of the snapshot (not null terminated)
> > +
> > +== Journal ==
> > +
> > +QCOW2 can use one or more instance of a metadata journal.
> 
> s/instance/instances/
> 
> Is there a reason to use multiple journals rather than a single journal
> for all entry types?  The single journal area avoids seeks.
> 
> > +
> > +A journal is a sequential log of journal entries appended on a previously
> > +allocated and reseted area.
> 
> I think you say "previously reset area" instead of "reseted".  Another
> option is "initialized area".
> 
> > +A journal is designed like a linked list with each entry pointing to the next
> > +so it's easy to iterate over entries.
> > +
> > +A journal uses the following constants to denote the type of each entry
> > +
> > +TYPE_NONE = 0xFF      default value of any bytes in a reseted journal
> > +TYPE_END  = 1         the entry ends a journal cluster and point to the next
> > +                      cluster
> > +TYPE_HASH = 2         the entry contains a deduplication hash
> > +
> > +QCOW2 journal entry:
> > +
> > +    Byte 0         :    Size of the entry: size = 2 + n with size <= 254
> 
> This is not clear.  I'm wondering if the +2 is included in the byte
> value or not.  I'm also wondering what a byte value of zero means and
> what a byte value of 255 means.
> 
> Please include an example to illustrate how this field works.
> 
> > +
> > +         1         :    Type of the entry
> > +
> > +         2 - size  :    The optional n bytes structure carried by entry
> > +
> > +A journal is divided into clusters and no journal entry can be spilled on two
> > +clusters. This avoid having to read more than one cluster to get a single entry.
> > +
> > +For this purpose an entry with the end type is added at the end of a journal
> > +cluster before starting to write in the next cluster.
> > +The size of such an entry is set so the entry points to the next cluster.
> > +
> > +As any journal cluster must be ended with an end entry the size of regular
> > +journal entries is limited to 254 bytes in order to always left room for an end
> > +entry which mimimal size is two bytes.
> > +
> > +The only cases where size > 254 are none entries where size = 255.
> > +
> > +The replay of a journal stop when the first end none entry is reached.
> 
> s/stop/stops/
> 
> > +The journal cluster size is 4096 bytes.
> 
> Questions about this layout:
> 
> 1. Journal entries have no integrity mechanism, which is especially
>    important if they span physical sectors where cheap disks may perform
>    a partial write.  This would leave a corrupt journal.  If the last
>    bytes are a checksum then you can get some confidence that the entry
>    was fully written and is valid.
> 
>    Did I miss something?

Adding a checksum sounds like a good idea.
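
One possible shape for such a checksum, assuming the last two bytes of each entry hold a checksum over the preceding bytes (the actual checksum function is left open in this thread; Fletcher-16 here is purely illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Plain Fletcher-16 over a byte buffer; illustrative only. */
static uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint32_t sum1 = 0, sum2 = 0;

    for (size_t i = 0; i < len; i++) {
        sum1 = (sum1 + data[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (uint16_t)((sum2 << 8) | sum1);
}

/* Verify an entry whose last two bytes hold the checksum of the
 * preceding size + type + payload bytes. */
static int entry_checksum_ok(const uint8_t *entry, uint8_t size)
{
    uint16_t stored = (uint16_t)((entry[size - 2] << 8) | entry[size - 1]);

    return fletcher16(entry, (size_t)(size - 2)) == stored;
}
```

A partially written entry would then fail verification during replay instead of being applied.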

> 2. Byte-granularity means that read-modify-write is necessary to append
>    entries to the journal.  Therefore a failure could destroy previously
>    committed entries.
> 
>    Any ideas how existing journals handle this?

You commit only whole blocks. So in this case we can consider a block
only committed as soon as a TYPE_END entry has been written (and after
that we won't touch it any more until the journalled changes have been
flushed to disk).
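
A minimal sketch of this whole-block commit scheme, under the quoted spec's layout (struct and function names hypothetical): entries are buffered into a 4096-byte block, and the block is closed with a TYPE_END entry spanning the unused tail once the next entry no longer fits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { TYPE_END = 1, BLOCK_SIZE = 4096 };

struct journal_block {
    uint8_t buf[BLOCK_SIZE];
    int     off;                /* next free byte */
};

/* Try to append an entry (2-byte header + n payload bytes).  If it does
 * not fit while still leaving room for a TYPE_END entry, close the block
 * instead with a TYPE_END entry whose size spans exactly the unused
 * tail, so it "points" to the next block.  Returns 0 on append, 1 when
 * the block was closed and is now committable as a whole. */
static int block_append(struct journal_block *b,
                        uint8_t type, const uint8_t *payload, uint8_t n)
{
    uint8_t size = 2 + n;       /* header included, so size <= 254 */

    /* Always keep >= 2 bytes free so a TYPE_END entry fits afterwards.
     * (Note: a 255-byte tail would collide with the TYPE_NONE size
     * encoding; the spec as quoted leaves that case ambiguous.) */
    if (b->off + size + 2 > BLOCK_SIZE) {
        b->buf[b->off]     = (uint8_t)(BLOCK_SIZE - b->off);
        b->buf[b->off + 1] = TYPE_END;
        b->off = BLOCK_SIZE;
        return 1;
    }
    b->buf[b->off]     = size;
    b->buf[b->off + 1] = type;
    memcpy(&b->buf[b->off + 2], payload, n);
    b->off += size;
    return 0;
}
```

Only a block terminated by TYPE_END is considered committed, which matches the "commit only whole blocks" rule above.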

There's one "interesting" case: cache=writethrough. I'm not entirely
sure yet what to do with it, but it's slow anyway, so using one block
per entry and therefore flushing the journal very often might actually
be not totally unreasonable.

Another thing I'm not sure about is whether a fixed 4k block is good or
if we should leave it configurable. I don't think making it an option
would hurt (not necessarily modifiable with qemu-img, but as a field
in the file format).

Kevin
Benoît Canet July 2, 2013, 9:23 p.m. UTC | #3
> > +QCOW2 can use one or more instance of a metadata journal.
> 
> s/instance/instances/
> 
> Is there a reason to use multiple journals rather than a single journal
> for all entry types?  The single journal area avoids seeks.

Here are the main reasons for this:

For deduplication, some patterns, like cycles of insertions and deletions,
could leave the hash table almost empty while filling the journal.

If the journal is full and the hash table is empty, a packing operation is
started.

Basically a new journal is created and only the entries present in the hash
table are reinserted.

This is why I want to keep the deduplication journal apart from the regular
qcow2 journal: to avoid interference between a pack operation and regular qcow2
journal entries.

The other thing is that freezing the log store would need a replay of the
regular qcow2 entries, as it triggers a reset of the journal.

Also, since deduplication will not work on spinning disks, I discarded the seek
time factor.

Maybe committing the dedup journal in erase-block-sized chunks would be a good
idea to reduce random writes to the SSD.

The additional reason for having multiple journals is that the SILT paper
proposes a mode where a prefix of the hash is used to dispatch insertions into
multiple stores, and that is easier to do with multiple journals.

> 
> > +
> > +A journal is a sequential log of journal entries appended on a previously
> > +allocated and reseted area.
> 
> I think you say "previously reset area" instead of "reseted".  Another
> option is "initialized area".
> 
> > +A journal is designed like a linked list with each entry pointing to the next
> > +so it's easy to iterate over entries.
> > +
> > +A journal uses the following constants to denote the type of each entry
> > +
> > +TYPE_NONE = 0xFF      default value of any bytes in a reseted journal
> > +TYPE_END  = 1         the entry ends a journal cluster and point to the next
> > +                      cluster
> > +TYPE_HASH = 2         the entry contains a deduplication hash
> > +
> > +QCOW2 journal entry:
> > +
> > +    Byte 0         :    Size of the entry: size = 2 + n with size <= 254
> 
> This is not clear.  I'm wondering if the +2 is included in the byte
> value or not.  I'm also wondering what a byte value of zero means and
> what a byte value of 255 means.

I am counting the journal entry header in the size. So yes, the +2 is included
in the byte value. For example, an entry carrying an 8-byte hash payload would
have a size byte of 10.
A byte value of 0, 1 or 255 is an error.

Maybe this design is bogus and I should only count the payload size in the size
field. It would leave fewer tricky cases.

> 
> Please include an example to illustrate how this field works.
> 
> > +
> > +         1         :    Type of the entry
> > +
> > +         2 - size  :    The optional n bytes structure carried by entry
> > +
> > +A journal is divided into clusters and no journal entry can be spilled on two
> > +clusters. This avoid having to read more than one cluster to get a single entry.
> > +
> > +For this purpose an entry with the end type is added at the end of a journal
> > +cluster before starting to write in the next cluster.
> > +The size of such an entry is set so the entry points to the next cluster.
> > +
> > +As any journal cluster must be ended with an end entry the size of regular
> > +journal entries is limited to 254 bytes in order to always left room for an end
> > +entry which mimimal size is two bytes.
> > +
> > +The only cases where size > 254 are none entries where size = 255.
> > +
> > +The replay of a journal stop when the first end none entry is reached.
> 
> s/stop/stops/
> 
> > +The journal cluster size is 4096 bytes.
> 
> Questions about this layout:
> 
> 1. Journal entries have no integrity mechanism, which is especially
>    important if they span physical sectors where cheap disks may perform
>    a partial write.  This would leave a corrupt journal.  If the last
>    bytes are a checksum then you can get some confidence that the entry
>    was fully written and is valid.

I will add a checksum mechanism.

Do you have any preferences regarding the checksum function?

> 
>    Did I miss something?
> 
> 2. Byte-granularity means that read-modify-write is necessary to append
>    entries to the journal.  Therefore a failure could destroy previously
>    committed entries.

It's designed to be committed in 4 KB blocks.

> 
>    Any ideas how existing journals handle this?
>
Benoît Canet July 2, 2013, 9:26 p.m. UTC | #4
> > 2. Byte-granularity means that read-modify-write is necessary to append
> >    entries to the journal.  Therefore a failure could destroy previously
> >    committed entries.
> > 
> >    Any ideas how existing journals handle this?
> 
> You commit only whole blocks. So in this case we can consider a block
> only committed as soon as a TYPE_END entry has been written (and after
> that we won't touch it any more until the journalled changes have been
> flushed to disk).
> 
> There's one "interesting" case: cache=writethrough. I'm not entirely
> sure yet what to do with it, but it's slow anyway, so using one block
> per entry and therefore flushing the journal very often might actually
> be not totally unreasonable.

This would surely finish killing performance, because it would mean one I/O per
metadata write to disk.

> 
> Another thing I'm not sure about is whether a fixed 4k block is good or
> if we should leave it configurable. I don't think making it an option
> would hurt (not necessarily modifyable with qemu-img, but as a field
> in the file format).

I agree.
I am also thinking about making the number of blocks flushed at once
configurable.

Benoît
Stefan Hajnoczi July 3, 2013, 7:51 a.m. UTC | #5
On Tue, Jul 02, 2013 at 04:54:46PM +0200, Kevin Wolf wrote:
> Am 02.07.2013 um 16:42 hat Stefan Hajnoczi geschrieben:
> > On Thu, Jun 20, 2013 at 04:26:09PM +0200, Benoît Canet wrote:
> > > ---
> > >  docs/specs/qcow2.txt |   42 ++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 42 insertions(+)
> > > 
> > > diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> > > index 36a559d..a4ffc85 100644
> > > --- a/docs/specs/qcow2.txt
> > > +++ b/docs/specs/qcow2.txt
> > > @@ -350,3 +350,45 @@ Snapshot table entry:
> > >          variable:   Unique ID string for the snapshot (not null terminated)
> > >  
> > >          variable:   Name of the snapshot (not null terminated)
> > > +
> > > +== Journal ==
> > > +
> > > +QCOW2 can use one or more instance of a metadata journal.
> > 
> > s/instance/instances/
> > 
> > Is there a reason to use multiple journals rather than a single journal
> > for all entry types?  The single journal area avoids seeks.
> > 
> > > +
> > > +A journal is a sequential log of journal entries appended on a previously
> > > +allocated and reseted area.
> > 
> > I think you say "previously reset area" instead of "reseted".  Another
> > option is "initialized area".
> > 
> > > +A journal is designed like a linked list with each entry pointing to the next
> > > +so it's easy to iterate over entries.
> > > +
> > > +A journal uses the following constants to denote the type of each entry
> > > +
> > > +TYPE_NONE = 0xFF      default value of any bytes in a reseted journal
> > > +TYPE_END  = 1         the entry ends a journal cluster and point to the next
> > > +                      cluster
> > > +TYPE_HASH = 2         the entry contains a deduplication hash
> > > +
> > > +QCOW2 journal entry:
> > > +
> > > +    Byte 0         :    Size of the entry: size = 2 + n with size <= 254
> > 
> > This is not clear.  I'm wondering if the +2 is included in the byte
> > value or not.  I'm also wondering what a byte value of zero means and
> > what a byte value of 255 means.
> > 
> > Please include an example to illustrate how this field works.
> > 
> > > +
> > > +         1         :    Type of the entry
> > > +
> > > +         2 - size  :    The optional n bytes structure carried by entry
> > > +
> > > +A journal is divided into clusters and no journal entry can be spilled on two
> > > +clusters. This avoid having to read more than one cluster to get a single entry.
> > > +
> > > +For this purpose an entry with the end type is added at the end of a journal
> > > +cluster before starting to write in the next cluster.
> > > +The size of such an entry is set so the entry points to the next cluster.
> > > +
> > > +As any journal cluster must be ended with an end entry the size of regular
> > > +journal entries is limited to 254 bytes in order to always left room for an end
> > > +entry which mimimal size is two bytes.
> > > +
> > > +The only cases where size > 254 are none entries where size = 255.
> > > +
> > > +The replay of a journal stop when the first end none entry is reached.
> > 
> > s/stop/stops/
> > 
> > > +The journal cluster size is 4096 bytes.
> > 
> > Questions about this layout:
> > 
> > 1. Journal entries have no integrity mechanism, which is especially
> >    important if they span physical sectors where cheap disks may perform
> >    a partial write.  This would leave a corrupt journal.  If the last
> >    bytes are a checksum then you can get some confidence that the entry
> >    was fully written and is valid.
> > 
> >    Did I miss something?
> 
> Adding a checksum sounds like a good idea.
> 
> > 2. Byte-granularity means that read-modify-write is necessary to append
> >    entries to the journal.  Therefore a failure could destroy previously
> >    committed entries.
> > 
> >    Any ideas how existing journals handle this?
> 
> You commit only whole blocks. So in this case we can consider a block
> only committed as soon as a TYPE_END entry has been written (and after
> that we won't touch it any more until the journalled changes have been
> flushed to disk).
> 
> There's one "interesting" case: cache=writethrough. I'm not entirely
> sure yet what to do with it, but it's slow anyway, so using one block
> per entry and therefore flushing the journal very often might actually
> be not totally unreasonable.
> 
> Another thing I'm not sure about is whether a fixed 4k block is good or
> if we should leave it configurable. I don't think making it an option
> would hurt (not necessarily modifyable with qemu-img, but as a field
> in the file format).

Making block size configurable seems like a good idea so we can adapt to
disk performance and data integrity characteristics.

Stefan
Stefan Hajnoczi July 3, 2013, 8:01 a.m. UTC | #6
On Tue, Jul 02, 2013 at 11:23:56PM +0200, Benoît Canet wrote:
> > > +QCOW2 can use one or more instance of a metadata journal.
> > 
> > s/instance/instances/
> > 
> > Is there a reason to use multiple journals rather than a single journal
> > for all entry types?  The single journal area avoids seeks.
> 
> Here are the main reason for this:
> 
> For the deduplication some patterns like cycles of insertion/deletion could
> leave the hash table almost empty while filling the journal.
> 
> If the journal is full and the hash table is empty a packing operation is
> started.
> 
> Basically a new journal is created and only the entry presents in the hash table
> are reinserted.
> 
> This is why I want to keep the deduplication journal appart from regular qcow2
> journal: to avoid interferences between a pack operation and regular qcow2
> journal entries.
> 
> The other thing is that freezing the log store would need a replay of regular
> qcow2 entries as it trigger a reset of the journal.
> 
> Also since deduplication will not work on spinning disk I discarded the seek
> time factor.
> 
> Maybe commiting the dedupe journal by erase block sized chunk would be a good
> idea to reduce random writes to the SSD.
> 
> The additional reason for having multiple journals is that the SILT paper
> propose a mode where prefix of the hash is used to dispatch insertions in
> multiples store and it easier to do with multiple journals.

It sounds like the journal is more than just a data integrity mechanism.
It's an integral part of your dedup algorithm and you plan to carefully
manage it while rebuilding some of the other dedup data structures.

Does this mean the journal forms the first-stage data structure for
deduplication?  Dedup records will accumulate in the journal until it
becomes time to convert them in bulk into a more compact representation?

When I read this specification I was thinking of a journal purely for
logging operations.  You could use a commit record to mark previous
records applied.  Upon startup, qcow2 would inspect uncommitted records
and deal with them.

We just need to figure out how to define a good interface so that the
journal can be used in a general way but also for dedup's specific
needs.

Stefan
Kevin Wolf July 3, 2013, 8:04 a.m. UTC | #7
On 02.07.2013 at 23:23, Benoît Canet wrote:
> Also since deduplication will not work on spinning disk I discarded the seek
> time factor.

Care to explain that in more detail? Why shouldn't it work on spinning
disks?

Kevin
Kevin Wolf July 3, 2013, 8:08 a.m. UTC | #8
On 02.07.2013 at 23:26, Benoît Canet wrote:
> > > 2. Byte-granularity means that read-modify-write is necessary to append
> > >    entries to the journal.  Therefore a failure could destroy previously
> > >    committed entries.
> > > 
> > >    Any ideas how existing journals handle this?
> > 
> > You commit only whole blocks. So in this case we can consider a block
> > only committed as soon as a TYPE_END entry has been written (and after
> > that we won't touch it any more until the journalled changes have been
> > flushed to disk).
> > 
> > There's one "interesting" case: cache=writethrough. I'm not entirely
> > sure yet what to do with it, but it's slow anyway, so using one block
> > per entry and therefore flushing the journal very often might actually
> > be not totally unreasonable.
> 
> This sure would finish to kill the performance because this would be an io
> per metadata written to disk.

cache=writethrough already pretty much kills performance because it's
not only an I/O per metadata write, but also a flush.

The question is, do we have any option to avoid it?

> > Another thing I'm not sure about is whether a fixed 4k block is good or
> > if we should leave it configurable. I don't think making it an option
> > would hurt (not necessarily modifyable with qemu-img, but as a field
> > in the file format).
> 
> I agree.
> I also think about make the number of block to be flushed at once configurable.

This is more of a runtime option. We can store a default in the image,
though.

Kevin
Stefan Hajnoczi July 3, 2013, 8:12 a.m. UTC | #9
On Tue, Jul 02, 2013 at 11:23:56PM +0200, Benoît Canet wrote:
> >    Any ideas how existing journals handle this?

By the way, I don't know much about journalling techniques.  So I'm
asking you these questions so that either you can answer them straight
away or because they might warrant a look at existing journal
implementations like:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2
http://www.sqlite.org/cgi/src/dir?name=src
http://blitiri.com.ar/p/libjio/

Stefan
Benoît Canet July 3, 2013, 12:30 p.m. UTC | #10
> Care to explain that in more detail? Why shouldn't it work on spinning
> disks?

Hashes are random, so they introduce random read accesses.

With a QCOW2 cluster size of 4 KB, the deduplication code will, when writing
duplicated data, do one random read per 4 KB block to deduplicate.

A server-grade hard disk is rated for 250 IOPS. This translates into 1 MB/s of
deduplicated data. Not very usable.

By contrast, a Samsung 840 Pro SSD is rated for 80k IOPS of random reads.
That should translate into 320 MB/s of potentially deduplicated data.
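
The arithmetic behind these figures can be sketched as follows (one random 4 KB read per deduplicated 4 KB block, so throughput is simply IOPS times block size; the helper name is hypothetical):

```c
#include <assert.h>

/* Throughput of deduplicated writes in KB/s, given the device's random
 * read IOPS and the cluster size in KB: one read per cluster. */
static long dedup_rate_kb_per_s(long iops, long cluster_kb)
{
    return iops * cluster_kb;
}
```

250 IOPS * 4 KB gives 1000 KB/s (about 1 MB/s); 80,000 IOPS * 4 KB gives 320,000 KB/s (320 MB/s).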

Having the dedup metadata on SSD and the actual data on disk would solve the
problem, but it would need a block backend.

Benoît
Benoît Canet July 3, 2013, 12:35 p.m. UTC | #11
> Does this mean the journal forms the first-stage data structure for
> deduplication?  Dedup records will accumulate in the journal until it
> becomes time to convert them in bulk into a more compact representation?

The journal is mainly used to persist the last inserted dedup metadata across
QEMU stops and restarts. I replay it at startup to rebuild the hash table.
So yes, it's the first stage, even though it's never used for regular queries.

> 
> When I read this specification I was thinking of a journal purely for
> logging operations.  You could use a commit record to mark previous
> records applied.  Upon startup, qcow2 would inspect uncommitted records
> and deal with them.

Maybe that could help regular QCOW2 usage. I don't know.
Benoît Canet July 3, 2013, 12:53 p.m. UTC | #12
> By the way, I don't know much about journalling techniques.  So I'm
> asking you these questions so that either you can answer them straight
> away or because they might warrant a look at existing journal
> implementations like:

I tried to do something simple and performant for the deduplication use case.

That explains why there is no concept of transactions and why the journal's
blocks are flushed asynchronously, in order to have a high insertion rate.

I agree with your previous comment: it is more a log than a journal.

> 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2
> http://www.sqlite.org/cgi/src/dir?name=src
> http://blitiri.com.ar/p/libjio/

I will try to find a paper on journal design.

Benoît
Stefan Hajnoczi July 4, 2013, 7:13 a.m. UTC | #13
On Wed, Jul 03, 2013 at 02:53:27PM +0200, Benoît Canet wrote:
> > By the way, I don't know much about journalling techniques.  So I'm
> > asking you these questions so that either you can answer them straight
> > away or because they might warrant a look at existing journal
> > implementations like:
> 
> I tried to so something simple and performing for the deduplication usage.
> 
> That explain that there is no concept of transaction and that the journal's
> block are flushed asynchronously in order to have an high insertion rate.
> 
> I agree with your previous comment is more a log than a journal.

Simple is good.  Even for deduplication alone, I think data integrity is
critical - otherwise we risk stale dedup metadata pointing to clusters
that are unallocated or do not contain the right data.  So the journal
will probably need to follow techniques for commits/checksums.

Stefan
Benoît Canet July 4, 2013, 10:01 a.m. UTC | #14
> Simple is good.  Even for deduplication alone, I think data integrity is
> critical - otherwise we risk stale dedup metadata pointing to clusters
> that are unallocated or do not contain the right data.  So the journal
> will probably need to follow techniques for commits/checksums.

I agree that checksums are missing for the dedup.
Maybe we could even use some kind of error-correcting code instead of a checksum.

Concerning data integrity, the events that the deduplication code cannot lose
are hash deletions, because they mark a previously inserted hash as obsolete.

The problem with a commit/flush mechanism on hash deletions is that it would
slow down the store insertion speed and also cause some extra SSD wear.

To solve this, I relied on the fact that the dedup metadata as a whole is
disposable.

So I implemented a "dedup dirty" bit.

When QEMU stops, the journal is flushed and the dirty bit is cleared.
When QEMU starts and the dirty bit is set, a crash is detected and _all_ the
deduplication metadata is dropped.
QCOW2 data integrity won't suffer; only the dedup ratio will be lower.

As you once said on IRC, crashes don't happen often.
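
A minimal sketch of this dirty-bit scheme (struct and field names hypothetical; persisting the bit into the image header is not shown):

```c
#include <assert.h>
#include <stdbool.h>

struct dedup_state {
    bool dirty;           /* persisted in the image, e.g. a header bit */
    bool metadata_valid;  /* whether the hash table can be rebuilt */
};

/* On startup: a set dirty bit means the previous run crashed.  The
 * dedup metadata is disposable, so it is dropped rather than repaired;
 * data integrity is unaffected, only the dedup ratio suffers. */
static void dedup_open(struct dedup_state *s)
{
    if (s->dirty) {
        s->metadata_valid = false;  /* drop _all_ dedup metadata */
    }
    s->dirty = true;                /* in use until a clean shutdown */
}

/* On clean shutdown: flush the journal, then clear the dirty bit. */
static void dedup_close(struct dedup_state *s)
{
    /* ... flush journal blocks to disk ... */
    s->dirty = false;
}
```

A clean stop/start cycle keeps the metadata; a start after a crash (no close) drops it.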

Benoît
Benoît Canet July 16, 2013, 10:45 p.m. UTC | #15
> > Simple is good.  Even for deduplication alone, I think data integrity is
> > critical - otherwise we risk stale dedup metadata pointing to clusters
> > that are unallocated or do not contain the right data.  So the journal
> > will probably need to follow techniques for commits/checksums.
>

I'll add checksums to the journal and clean up the journal entry size mess soon.

For the transactional/commits aspect of the journal I think that we need Kevin's
point of view on the subject.

Best regards

Benoît
Kevin Wolf July 17, 2013, 8:20 a.m. UTC | #16
On 17.07.2013 at 00:45, Benoît Canet wrote:
> > > Simple is good.  Even for deduplication alone, I think data integrity is
> > > critical - otherwise we risk stale dedup metadata pointing to clusters
> > > that are unallocated or do not contain the right data.  So the journal
> > > will probably need to follow techniques for commits/checksums.
> >
> 
> I'll add checksums to the journal and clean the journal entry size mess soon.
> 
> For the transactional/commits aspect of the journal I think that we need Kevin's
> point of view on the subject.

Sorry, I was going to prepare a patch that does journalling for the
existing metadata, but once again other things stole my time. I'm still
planning to do that, though, and then we can compare whether your
requirements are fulfilled with it as well.

Kevin

Patch

diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
index 36a559d..a4ffc85 100644
--- a/docs/specs/qcow2.txt
+++ b/docs/specs/qcow2.txt
@@ -350,3 +350,45 @@  Snapshot table entry:
         variable:   Unique ID string for the snapshot (not null terminated)
 
         variable:   Name of the snapshot (not null terminated)
+
+== Journal ==
+
+QCOW2 can use one or more instance of a metadata journal.
+
+A journal is a sequential log of journal entries appended on a previously
+allocated and reseted area.
+A journal is designed like a linked list with each entry pointing to the next
+so it's easy to iterate over entries.
+
+A journal uses the following constants to denote the type of each entry
+
+TYPE_NONE = 0xFF      default value of any bytes in a reseted journal
+TYPE_END  = 1         the entry ends a journal cluster and point to the next
+                      cluster
+TYPE_HASH = 2         the entry contains a deduplication hash
+
+QCOW2 journal entry:
+
+    Byte 0         :    Size of the entry: size = 2 + n with size <= 254
+
+         1         :    Type of the entry
+
+         2 - size  :    The optional n bytes structure carried by entry
+
+A journal is divided into clusters and no journal entry can be spilled on two
+clusters. This avoid having to read more than one cluster to get a single entry.
+
+For this purpose an entry with the end type is added at the end of a journal
+cluster before starting to write in the next cluster.
+The size of such an entry is set so the entry points to the next cluster.
+
+As any journal cluster must be ended with an end entry the size of regular
+journal entries is limited to 254 bytes in order to always left room for an end
+entry which mimimal size is two bytes.
+
+The only cases where size > 254 are none entries where size = 255.
+
+The replay of a journal stop when the first end none entry is reached.
+
+The journal cluster size is 4096 bytes.
+