Message ID: 1371738392-9594-2-git-send-email-benoit@irqsave.net
On Thu, Jun 20, 2013 at 04:26:09PM +0200, Benoît Canet wrote:
> ---
>  docs/specs/qcow2.txt | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
>
> diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> index 36a559d..a4ffc85 100644
> --- a/docs/specs/qcow2.txt
> +++ b/docs/specs/qcow2.txt
> @@ -350,3 +350,45 @@ Snapshot table entry:
>          variable:   Unique ID string for the snapshot (not null terminated)
>
>          variable:   Name of the snapshot (not null terminated)
> +
> +== Journal ==
> +
> +QCOW2 can use one or more instance of a metadata journal.

s/instance/instances/

Is there a reason to use multiple journals rather than a single journal
for all entry types?  The single journal area avoids seeks.

> +
> +A journal is a sequential log of journal entries appended on a previously
> +allocated and reseted area.

I think you should say "previously reset area" instead of "reseted".
Another option is "initialized area".

> +A journal is designed like a linked list with each entry pointing to the next
> +so it's easy to iterate over entries.
> +
> +A journal uses the following constants to denote the type of each entry
> +
> +TYPE_NONE = 0xFF default value of any bytes in a reseted journal
> +TYPE_END  = 1    the entry ends a journal cluster and point to the next
> +                 cluster
> +TYPE_HASH = 2    the entry contains a deduplication hash
> +
> +QCOW2 journal entry:
> +
> +    Byte  0      :    Size of the entry: size = 2 + n with size <= 254

This is not clear.  I'm wondering if the +2 is included in the byte
value or not.  I'm also wondering what a byte value of zero means and
what a byte value of 255 means.

Please include an example to illustrate how this field works.

> +
> +          1      :    Type of the entry
> +
> +          2 - size:   The optional n bytes structure carried by entry
> +
> +A journal is divided into clusters and no journal entry can be spilled on two
> +clusters. This avoid having to read more than one cluster to get a single entry.
> +
> +For this purpose an entry with the end type is added at the end of a journal
> +cluster before starting to write in the next cluster.
> +The size of such an entry is set so the entry points to the next cluster.
> +
> +As any journal cluster must be ended with an end entry the size of regular
> +journal entries is limited to 254 bytes in order to always left room for an end
> +entry which mimimal size is two bytes.
> +
> +The only cases where size > 254 are none entries where size = 255.
> +
> +The replay of a journal stop when the first end none entry is reached.

s/stop/stops/

> +The journal cluster size is 4096 bytes.

Questions about this layout:

1. Journal entries have no integrity mechanism, which is especially
   important if they span physical sectors where cheap disks may perform
   a partial write.  This would leave a corrupt journal.  If the last
   bytes are a checksum then you can get some confidence that the entry
   was fully written and is valid.

   Did I miss something?

2. Byte-granularity means that read-modify-write is necessary to append
   entries to the journal.  Therefore a failure could destroy previously
   committed entries.

   Any ideas how existing journals handle this?
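To make the size-byte question concrete, here is a tiny sketch (Python, purely illustrative; the helper name is hypothetical, not QEMU code) of the reading where the 2-byte header is counted inside the size byte:

```python
# Assumed reading of the spec: size = 2 + n, where the 2 covers the
# size byte itself plus the type byte, and n is the payload length.

def payload_len(size_byte: int) -> int:
    """Return n, the payload length, for a regular journal entry."""
    if size_byte < 2 or size_byte > 254:
        raise ValueError("invalid size byte for a regular entry")
    return size_byte - 2  # size = 2 + n

# A hash entry carrying an 8-byte payload has size byte 10:
assert payload_len(10) == 8
# The minimal possible entry (header only, e.g. a bare end entry):
assert payload_len(2) == 0
```

Under this reading, values 0 and 1 cannot describe a well-formed entry, which is exactly the ambiguity the review is asking the spec to spell out.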
Am 02.07.2013 um 16:42 hat Stefan Hajnoczi geschrieben:
> On Thu, Jun 20, 2013 at 04:26:09PM +0200, Benoît Canet wrote:
> [...]
>
> Questions about this layout:
>
> 1. Journal entries have no integrity mechanism, which is especially
>    important if they span physical sectors where cheap disks may perform
>    a partial write.  This would leave a corrupt journal.  If the last
>    bytes are a checksum then you can get some confidence that the entry
>    was fully written and is valid.
>
>    Did I miss something?

Adding a checksum sounds like a good idea.

> 2. Byte-granularity means that read-modify-write is necessary to append
>    entries to the journal.  Therefore a failure could destroy previously
>    committed entries.
>
>    Any ideas how existing journals handle this?

You commit only whole blocks. So in this case we can consider a block
only committed as soon as a TYPE_END entry has been written (and after
that we won't touch it any more until the journalled changes have been
flushed to disk).

There's one "interesting" case: cache=writethrough. I'm not entirely
sure yet what to do with it, but it's slow anyway, so using one block
per entry and therefore flushing the journal very often might actually
not be totally unreasonable.

Another thing I'm not sure about is whether a fixed 4k block is good or
if we should leave it configurable. I don't think making it an option
would hurt (not necessarily modifiable with qemu-img, but as a field
in the file format).

Kevin
> > +QCOW2 can use one or more instance of a metadata journal.
>
> s/instance/instances/
>
> Is there a reason to use multiple journals rather than a single journal
> for all entry types?  The single journal area avoids seeks.

Here are the main reasons for this:

For the deduplication, some patterns like cycles of insertion/deletion
could leave the hash table almost empty while filling the journal.

If the journal is full and the hash table is empty, a packing operation
is started.

Basically a new journal is created and only the entries present in the
hash table are reinserted.

This is why I want to keep the deduplication journal apart from the
regular qcow2 journal: to avoid interference between a pack operation
and regular qcow2 journal entries.

The other thing is that freezing the log store would need a replay of
the regular qcow2 entries, as it triggers a reset of the journal.

Also, since deduplication will not work on spinning disks, I discarded
the seek time factor.

Maybe committing the dedup journal in erase-block-sized chunks would be
a good idea to reduce random writes to the SSD.

The additional reason for having multiple journals is that the SILT
paper proposes a mode where a prefix of the hash is used to dispatch
insertions into multiple stores, and that is easier to do with multiple
journals.

> [...]
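The SILT-style dispatch mentioned above can be sketched in a few lines (Python, hypothetical helper, not QEMU code): the leading bits of the content hash pick which journal/store receives the insertion.

```python
# Hypothetical sketch of prefix-based dispatch across multiple journals,
# as described in the SILT paper reference: the first byte of the hash
# selects the target journal.  n_journals is assumed to be a power of
# two so the prefix maps onto journals evenly.

def journal_index(hash_bytes: bytes, n_journals: int) -> int:
    return hash_bytes[0] % n_journals

# Two hashes with different prefixes land in different journals:
assert journal_index(b"\x00rest-of-hash", 4) == 0
assert journal_index(b"\x07rest-of-hash", 4) == 3
```

Since a cryptographic hash is uniformly distributed, this spreads insertions evenly without any coordination between the stores.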
> > +QCOW2 journal entry:
> >
> > +    Byte  0      :    Size of the entry: size = 2 + n with size <= 254
>
> This is not clear.  I'm wondering if the +2 is included in the byte
> value or not.  I'm also wondering what a byte value of zero means and
> what a byte value of 255 means.

I am counting the journal entry header in the size, so yes, the +2 is
in the byte value.

A byte value of zero, 1 or 255 is an error.

Maybe this design is bogus and I should only count the payload size in
the size field. It would leave fewer tricky cases.

> Please include an example to illustrate how this field works.
>
> [...]
>
> Questions about this layout:
>
> 1. Journal entries have no integrity mechanism, which is especially
>    important if they span physical sectors where cheap disks may perform
>    a partial write.  This would leave a corrupt journal.  If the last
>    bytes are a checksum then you can get some confidence that the entry
>    was fully written and is valid.

I will add a checksum mechanism. Do you have any preferences regarding
the checksum function?

>    Did I miss something?
>
> 2. Byte-granularity means that read-modify-write is necessary to append
>    entries to the journal.  Therefore a failure could destroy previously
>    committed entries.

It's designed to be committed in 4KB blocks.

>    Any ideas how existing journals handle this?
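One candidate answer to the checksum-function question is a per-entry CRC32 trailer. The sketch below (Python, purely illustrative; the entry layout with a trailing CRC is an assumption, not the spec) shows how a torn write would be caught:

```python
import struct
import zlib

# Hypothetical entry layout: size byte, type byte, payload, then a
# little-endian CRC32 over everything before it.  This is one possible
# checksum scheme, not the one the spec chose.

def make_entry(entry_type: int, payload: bytes) -> bytes:
    size = 2 + len(payload)              # header counted in the size byte
    body = bytes([size, entry_type]) + payload
    return body + struct.pack("<I", zlib.crc32(body))

def entry_is_valid(entry: bytes) -> bool:
    body, (crc,) = entry[:-4], struct.unpack("<I", entry[-4:])
    return zlib.crc32(body) == crc

e = make_entry(2, b"\x11" * 8)           # a TYPE_HASH entry
assert entry_is_valid(e)

# CRC32 detects all single-bit errors, so a flipped bit in the body
# (e.g. from a partial sector write) invalidates the entry:
corrupted = bytes([e[0] ^ 0x01]) + e[1:]
assert not entry_is_valid(corrupted)
```

CRC32 is cheap and detects all single-bit and short burst errors; it cannot correct anything, which is where the later error-correcting-code idea in this thread differs.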
> > 2. Byte-granularity means that read-modify-write is necessary to append
> >    entries to the journal.  Therefore a failure could destroy previously
> >    committed entries.
> >
> >    Any ideas how existing journals handle this?
>
> You commit only whole blocks. So in this case we can consider a block
> only committed as soon as a TYPE_END entry has been written (and after
> that we won't touch it any more until the journalled changes have been
> flushed to disk).
>
> There's one "interesting" case: cache=writethrough. I'm not entirely
> sure yet what to do with it, but it's slow anyway, so using one block
> per entry and therefore flushing the journal very often might actually
> not be totally unreasonable.

This would surely finish killing the performance, because it would mean
one I/O per metadata write to disk.

> Another thing I'm not sure about is whether a fixed 4k block is good or
> if we should leave it configurable. I don't think making it an option
> would hurt (not necessarily modifiable with qemu-img, but as a field
> in the file format).

I agree.
I am also thinking about making the number of blocks flushed at once
configurable.

Benoît
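Kevin's "commit only whole blocks" rule can be sketched as follows (Python, illustrative only; how oversized gaps are filled is an assumption based on the TYPE_NONE/0xFF reset semantics, not something the spec states):

```python
TYPE_END = 1
CLUSTER_SIZE = 4096

def close_cluster(entries: bytes) -> bytes:
    """Pad a partially filled journal cluster so it can be committed whole."""
    remaining = CLUSTER_SIZE - len(entries)
    assert remaining >= 2, "regular entries always leave room for an end entry"
    if remaining > 255:
        # Assumed interpretation: a gap too large for one size byte is
        # left as reset 0xFF "none" bytes, which replay treats as
        # end-of-log anyway.
        return entries + b"\xff" * remaining
    # The end entry's size byte spans the rest of the cluster, so the
    # entry "points to" the next cluster boundary.
    return entries + bytes([remaining, TYPE_END]) + b"\xff" * (remaining - 2)

# One 10-byte hash entry followed by padding gives a full 4096-byte block:
cluster = close_cluster(bytes([10, 2]) + b"\x11" * 8)
assert len(cluster) == CLUSTER_SIZE
```

The block is only considered committed once this padding write completes, so a crash mid-append can at worst lose the uncommitted tail block, never previously committed ones.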
On Tue, Jul 02, 2013 at 04:54:46PM +0200, Kevin Wolf wrote:
> Am 02.07.2013 um 16:42 hat Stefan Hajnoczi geschrieben:
> [...]
>
> Another thing I'm not sure about is whether a fixed 4k block is good or
> if we should leave it configurable. I don't think making it an option
> would hurt (not necessarily modifiable with qemu-img, but as a field
> in the file format).

Making the block size configurable seems like a good idea so we can
adapt to disk performance and data integrity characteristics.

Stefan
On Tue, Jul 02, 2013 at 11:23:56PM +0200, Benoît Canet wrote:
> Here are the main reasons for this:
>
> For the deduplication, some patterns like cycles of insertion/deletion
> could leave the hash table almost empty while filling the journal.
>
> If the journal is full and the hash table is empty, a packing operation
> is started.
>
> Basically a new journal is created and only the entries present in the
> hash table are reinserted.
>
> [...]
>
> The additional reason for having multiple journals is that the SILT
> paper proposes a mode where a prefix of the hash is used to dispatch
> insertions into multiple stores, and that is easier to do with multiple
> journals.

It sounds like the journal is more than just a data integrity mechanism.
It's an integral part of your dedup algorithm and you plan to carefully
manage it while rebuilding some of the other dedup data structures.

Does this mean the journal forms the first-stage data structure for
deduplication?  Dedup records will accumulate in the journal until it
becomes time to convert them in bulk into a more compact representation?

When I read this specification I was thinking of a journal purely for
logging operations.  You could use a commit record to mark previous
records applied.  Upon startup, qcow2 would inspect uncommitted records
and deal with them.

We just need to figure out how to define a good interface so that the
journal can be used in a general way but also for dedup's specific
needs.

Stefan
Am 02.07.2013 um 23:23 hat Benoît Canet geschrieben:
> Also, since deduplication will not work on spinning disks, I discarded
> the seek time factor.

Care to explain that in more detail? Why shouldn't it work on spinning
disks?

Kevin
Am 02.07.2013 um 23:26 hat Benoît Canet geschrieben:
> > You commit only whole blocks. So in this case we can consider a block
> > only committed as soon as a TYPE_END entry has been written (and after
> > that we won't touch it any more until the journalled changes have been
> > flushed to disk).
> >
> > There's one "interesting" case: cache=writethrough. I'm not entirely
> > sure yet what to do with it, but it's slow anyway, so using one block
> > per entry and therefore flushing the journal very often might actually
> > not be totally unreasonable.
>
> This would surely finish killing the performance, because it would mean
> one I/O per metadata write to disk.

cache=writethrough already pretty much kills performance because it's
not only an I/O per metadata write, but also a flush. The question is,
do we have any option to avoid it?

> > Another thing I'm not sure about is whether a fixed 4k block is good or
> > if we should leave it configurable. I don't think making it an option
> > would hurt (not necessarily modifiable with qemu-img, but as a field
> > in the file format).
>
> I agree.
> I am also thinking about making the number of blocks flushed at once
> configurable.

This is more of a runtime option. We can store a default in the image,
though.

Kevin
On Tue, Jul 02, 2013 at 11:23:56PM +0200, Benoît Canet wrote:
> > Any ideas how existing journals handle this?
By the way, I don't know much about journalling techniques. So I'm
asking you these questions so that either you can answer them straight
away or because they might warrant a look at existing journal
implementations like:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2
http://www.sqlite.org/cgi/src/dir?name=src
http://blitiri.com.ar/p/libjio/
Stefan
> Care to explain that in more detail? Why shouldn't it work on spinning
> disks?

Hashes are random, so they introduce random read accesses.

With a QCOW2 cluster size of 4KB, the deduplication code will do one
random read per 4KB block when writing duplicated data.

A server-grade hard disk is rated for 250 IOPS. This translates into
1 MB/s of deduplicated data. Not very usable.

On the contrary, a Samsung 840 Pro SSD is rated for 80k IOPS of random
reads. That should translate into 320 MB/s of potentially deduplicated
data.

Having the dedup metadata on SSD and the actual data on a spinning disk
would solve the problem, but it would need a block backend.

Benoît
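The back-of-the-envelope arithmetic behind those throughput figures is simply IOPS times block size, since each deduplicated 4KB block costs one random read:

```python
# One random 4 KB read per deduplicated block, bounded by the device's
# random-read IOPS (figures taken from the message above).

BLOCK = 4 * 1024          # 4 KB qcow2 cluster

hdd_iops = 250            # server-grade spinning disk
ssd_iops = 80_000         # SSD rated for 80k random-read IOPS

hdd_throughput = hdd_iops * BLOCK      # bytes/s of deduplicated data
ssd_throughput = ssd_iops * BLOCK

assert hdd_throughput == 1_000 * 1024          # ~1 MB/s
assert ssd_throughput == 320_000 * 1024        # ~320 MB/s
```

This is why the seek-time argument for a single journal area carries little weight here: the workload is IOPS-bound on the hash lookups, not seek-bound on the journal.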
> Does this mean the journal forms the first-stage data structure for
> deduplication?  Dedup records will accumulate in the journal until it
> becomes time to convert them in bulk into a more compact representation?

The journal is mainly used to persist the last inserted dedup metadata
across QEMU stops and restarts. I replay it at startup to rebuild the
hash table.

So yes, it's the first stage, even if it's never used for regular
queries.

> When I read this specification I was thinking of a journal purely for
> logging operations.  You could use a commit record to mark previous
> records applied.  Upon startup, qcow2 would inspect uncommitted records
> and deal with them.

Maybe that could help regular QCOW2 usage. I don't know.
> By the way, I don't know much about journalling techniques.  So I'm
> asking you these questions so that either you can answer them straight
> away or because they might warrant a look at existing journal
> implementations like:

I tried to do something simple and performant for the deduplication
usage.

That explains why there is no concept of transactions and why the
journal's blocks are flushed asynchronously, in order to have a high
insertion rate.

I agree with your previous comment: it is more a log than a journal.

> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2
> http://www.sqlite.org/cgi/src/dir?name=src
> http://blitiri.com.ar/p/libjio/

I will try to find a paper on journal design.

Benoît
On Wed, Jul 03, 2013 at 02:53:27PM +0200, Benoît Canet wrote:
> I tried to do something simple and performant for the deduplication
> usage.
>
> That explains why there is no concept of transactions and why the
> journal's blocks are flushed asynchronously, in order to have a high
> insertion rate.
>
> I agree with your previous comment: it is more a log than a journal.

Simple is good.  Even for deduplication alone, I think data integrity is
critical - otherwise we risk stale dedup metadata pointing to clusters
that are unallocated or do not contain the right data.  So the journal
will probably need to follow techniques for commits/checksums.

Stefan
> Simple is good.  Even for deduplication alone, I think data integrity is
> critical - otherwise we risk stale dedup metadata pointing to clusters
> that are unallocated or do not contain the right data.  So the journal
> will probably need to follow techniques for commits/checksums.

I agree that checksums are missing for the dedup journal. Maybe we
could even use some kind of error-correcting code instead of a
checksum.

Concerning data integrity, the events that the deduplication code
cannot lose are hash deletions, because they mark a previously inserted
hash as obsolete.

The problem with a commit/flush mechanism on hash deletion is that it
would slow down the store insertion speed and also create some extra
SSD wear-out.

To solve this I considered the fact that the dedup metadata as a whole
is disposable, so I implemented a "dedup dirty" bit.

When QEMU stops, the journal is flushed and the dirty bit is cleared.

When QEMU starts and the dirty bit is set, a crash is detected and
_all_ the deduplication metadata is dropped. QCOW2 data integrity won't
suffer; only the dedup ratio will be lower.

As you said once on IRC, crashes don't happen often.

Benoît
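The dirty-bit protocol described above can be sketched in a few lines (Python, hypothetical helpers and dict-based state, not QEMU code): because the dedup metadata is disposable, a crash simply discards it instead of requiring a transactionally consistent journal.

```python
# Sketch of the "dedup dirty" bit logic: the bit is set while the
# on-disk journal may be stale, and cleared only after a clean flush.

def qemu_start(image: dict) -> None:
    if image["dedup_dirty"]:
        # Crash detected: drop all dedup metadata.  Data integrity is
        # untouched; only the future dedup ratio suffers.
        image["dedup_metadata"] = {}
    image["dedup_dirty"] = True

def qemu_stop(image: dict) -> None:
    # (journal flush would happen here)
    image["dedup_dirty"] = False

img = {"dedup_dirty": False, "dedup_metadata": {"hash": 1}}
qemu_start(img)
qemu_stop(img)
assert img["dedup_metadata"] == {"hash": 1}   # clean shutdown keeps it

img["dedup_dirty"] = True                      # simulate a crash
qemu_start(img)
assert img["dedup_metadata"] == {}             # metadata dropped
```

The trade-off is explicit: no per-deletion flush cost or SSD wear, in exchange for losing the accumulated dedup state on the rare crash.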
> > Simple is good.  Even for deduplication alone, I think data integrity is
> > critical - otherwise we risk stale dedup metadata pointing to clusters
> > that are unallocated or do not contain the right data.  So the journal
> > will probably need to follow techniques for commits/checksums.

I'll add checksums to the journal and clean up the journal entry size
mess soon.

For the transactional/commit aspects of the journal I think we need
Kevin's point of view on the subject.

Best regards

Benoît
Am 17.07.2013 um 00:45 hat Benoît Canet geschrieben:
> I'll add checksums to the journal and clean up the journal entry size
> mess soon.
>
> For the transactional/commit aspects of the journal I think we need
> Kevin's point of view on the subject.

Sorry, I was going to prepare a patch that does journalling for the
existing metadata, but once again other things stole my time. I'm still
planning to do that, though, and then we can compare whether your
requirements are fulfilled with it as well.

Kevin
diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
index 36a559d..a4ffc85 100644
--- a/docs/specs/qcow2.txt
+++ b/docs/specs/qcow2.txt
@@ -350,3 +350,45 @@ Snapshot table entry:
         variable:   Unique ID string for the snapshot (not null terminated)
 
         variable:   Name of the snapshot (not null terminated)
+
+== Journal ==
+
+QCOW2 can use one or more instance of a metadata journal.
+
+A journal is a sequential log of journal entries appended on a previously
+allocated and reseted area.
+A journal is designed like a linked list with each entry pointing to the next
+so it's easy to iterate over entries.
+
+A journal uses the following constants to denote the type of each entry
+
+TYPE_NONE = 0xFF default value of any bytes in a reseted journal
+TYPE_END  = 1    the entry ends a journal cluster and point to the next
+                 cluster
+TYPE_HASH = 2    the entry contains a deduplication hash
+
+QCOW2 journal entry:
+
+    Byte  0      :    Size of the entry: size = 2 + n with size <= 254
+
+          1      :    Type of the entry
+
+          2 - size:   The optional n bytes structure carried by entry
+
+A journal is divided into clusters and no journal entry can be spilled on two
+clusters. This avoid having to read more than one cluster to get a single entry.
+
+For this purpose an entry with the end type is added at the end of a journal
+cluster before starting to write in the next cluster.
+The size of such an entry is set so the entry points to the next cluster.
+
+As any journal cluster must be ended with an end entry the size of regular
+journal entries is limited to 254 bytes in order to always left room for an end
+entry which mimimal size is two bytes.
+
+The only cases where size > 254 are none entries where size = 255.
+
+The replay of a journal stop when the first end none entry is reached.
+
+The journal cluster size is 4096 bytes.
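The replay described in the proposed spec can be sketched as a simple loop (Python, purely illustrative; it assumes the size byte counts the 2-byte header, as clarified later in the thread, and treats reset 0xFF bytes as end-of-log):

```python
TYPE_NONE, TYPE_END, TYPE_HASH = 0xFF, 1, 2
CLUSTER_SIZE = 4096

def replay(journal: bytes) -> list:
    """Walk the linked list of entries, collecting payload-carrying ones."""
    entries, off = [], 0
    while off + 1 < len(journal):
        size, etype = journal[off], journal[off + 1]
        if etype == TYPE_NONE:
            break                        # reset 0xFF area: end of the log
        if etype != TYPE_END:            # end entries only carry padding
            entries.append((etype, journal[off + 2 : off + size]))
        off += size                      # each entry points to the next
    return entries

# One hash entry (size byte 10 = 2-byte header + 8-byte hash) followed
# by a reset area:
journal = bytes([10, TYPE_HASH]) + b"\x11" * 8 + b"\xff" * (CLUSTER_SIZE - 10)
assert replay(journal) == [(TYPE_HASH, b"\x11" * 8)]
```

Because an end entry's size spans the rest of its cluster, `off += size` naturally lands on the next cluster boundary, which is why no entry ever needs to straddle two clusters.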