Message ID | 1400751770-14594-1-git-send-email-stefanha@redhat.com |
---|---|
State | New |
Headers | show |
On 05/22/2014 03:42 AM, Stefan Hajnoczi wrote: > Normally one would expect that qcow2 image file lengths are multiples of > the cluster size. This is not true in all cases and the spec should > document this so implementers remember to accept such files. > > $ qemu-img create -f qcow2 foo.qcow2 2G > Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off > $ ls -l foo.qcow2 > -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2 > $ bc -q > 3 * (64 * 1024) + 512 > 197120 > > The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB > cluster size needs. The rest of the L1 table is omitted from the file > but allocation will continue at the next cluster boundary. > > Reported-by: Maria Kustova <maxa@catit.be> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > docs/specs/qcow2.txt | 4 ++++ > 1 file changed, 4 insertions(+) Reviewed-by: Eric Blake <eblake@redhat.com>
Am 22.05.2014 um 11:42 hat Stefan Hajnoczi geschrieben: > Normally one would expect that qcow2 image file lengths are multiples of > the cluster size. This is not true in all cases and the spec should > document this so implementers remember to accept such files. > > $ qemu-img create -f qcow2 foo.qcow2 2G > Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off > $ ls -l foo.qcow2 > -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2 > $ bc -q > 3 * (64 * 1024) + 512 > 197120 > > The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB > cluster size needs. The rest of the L1 table is omitted from the file > but allocation will continue at the next cluster boundary. > > Reported-by: Maria Kustova <maxa@catit.be> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > docs/specs/qcow2.txt | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt > index f19536a..a46ee57 100644 > --- a/docs/specs/qcow2.txt > +++ b/docs/specs/qcow2.txt > @@ -4,6 +4,10 @@ A qcow2 image file is organized in units of constant size, which are called > (host) clusters. A cluster is the unit in which all allocations are done, > both for actual guest data and for image metadata. > > +If the end of the image file is not on a cluster boundary, the missing data is > +treated as zeroes. This layout can occur when an L1 table or refcount table is > +the last thing in the file and not all entries in the table are used. > + > Likewise, the virtual disk as seen by the guest is divided into (guest) > clusters of the same size. Why don't we specify this for _any_ data after EOF, as we discussed on IRC, instead of just for a partial last cluster? If we restrict it to the last cluster, specifying the data as zero doesn't make a whole lot of sense because then the data there wouldn't be supposed to be interpreted anyway. Kevin
On Fri, May 23, 2014 at 01:18:50PM +0200, Kevin Wolf wrote: > Am 22.05.2014 um 11:42 hat Stefan Hajnoczi geschrieben: > > diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt > > index f19536a..a46ee57 100644 > > --- a/docs/specs/qcow2.txt > > +++ b/docs/specs/qcow2.txt > > @@ -4,6 +4,10 @@ A qcow2 image file is organized in units of constant size, which are called > > (host) clusters. A cluster is the unit in which all allocations are done, > > both for actual guest data and for image metadata. > > > > +If the end of the image file is not on a cluster boundary, the missing data is > > +treated as zeroes. This layout can occur when an L1 table or refcount table is > > +the last thing in the file and not all entries in the table are used. > > + > > Likewise, the virtual disk as seen by the guest is divided into (guest) > > clusters of the same size. > > Why don't we specify this for _any_ data after EOF, as we discussed on > IRC, instead of just for a partial last cluster? > > If we restrict it to the last cluster, specifying the data as zero > doesn't make a whole lot of sense because then the data there wouldn't > be supposed to be interpreted anyway. That's too general. Specifying it that way means almost all offsets become valid because you're just supposed to read zeroes! That will definitely cause trouble, let's be stricter here. Stefan
The Thursday 22 May 2014 à 11:42:50 (+0200), Stefan Hajnoczi wrote : > Normally one would expect that qcow2 image file lengths are multiples of > the cluster size. This is not true in all cases and the spec should > document this so implementers remember to accept such files. > > $ qemu-img create -f qcow2 foo.qcow2 2G > Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off > $ ls -l foo.qcow2 > -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2 > $ bc -q > 3 * (64 * 1024) + 512 > 197120 > > The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB > cluster size needs. The rest of the L1 table is omitted from the file > but allocation will continue at the next cluster boundary. I think we should fix this to allocate a whole extra cluster instead of 512B. These days most SSD pages (smalled write unit) are 4KB or 16KB. Having most of the file shifted by 512B mean that the SSD controller will have to to Read/Modify/Write cycles for most write hence impacting performance and SSD life. Best regards Benoît > > Reported-by: Maria Kustova <maxa@catit.be> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > docs/specs/qcow2.txt | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt > index f19536a..a46ee57 100644 > --- a/docs/specs/qcow2.txt > +++ b/docs/specs/qcow2.txt > @@ -4,6 +4,10 @@ A qcow2 image file is organized in units of constant size, which are called > (host) clusters. A cluster is the unit in which all allocations are done, > both for actual guest data and for image metadata. > > +If the end of the image file is not on a cluster boundary, the missing data is > +treated as zeroes. This layout can occur when an L1 table or refcount table is > +the last thing in the file and not all entries in the table are used. > + > Likewise, the virtual disk as seen by the guest is divided into (guest) > clusters of the same size. > > -- > 1.9.0 >
On Mon, May 26, 2014 at 04:36:15PM +0200, Benoît Canet wrote: > The Thursday 22 May 2014 à 11:42:50 (+0200), Stefan Hajnoczi wrote : > > Normally one would expect that qcow2 image file lengths are multiples of > > the cluster size. This is not true in all cases and the spec should > > document this so implementers remember to accept such files. > > > > $ qemu-img create -f qcow2 foo.qcow2 2G > > Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off > > $ ls -l foo.qcow2 > > -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2 > > $ bc -q > > 3 * (64 * 1024) + 512 > > 197120 > > > > The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB > > cluster size needs. The rest of the L1 table is omitted from the file > > but allocation will continue at the next cluster boundary. > > I think we should fix this to allocate a whole extra cluster instead > of 512B. > These days most SSD pages (smalled write unit) are 4KB or 16KB. > Having most of the file shifted by 512B mean that the SSD controller > will have to to Read/Modify/Write cycles for most write hence impacting > performance and SSD life. It's not shifted. I thought the last sentence explains this: "The rest of the L1 table is omitted from the file but allocation will continue at the next cluster boundary." Are you worried that the host file system will lay out data poorly because the file looks like this? | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster | B = bytes C = clusters My guess is the next cluster will be aligned to a reasonable boundary on the physical disk. Stefan
Il 27/05/2014 18:00, Benoît Canet ha scritto: > > Are you worried that the host file system will lay out data poorly > > because the file looks like this? > > > > | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster | > > > > B = bytes > > C = clusters > > > > My guess is the next cluster will be aligned to a reasonable boundary on > > the physical disk. > > I have some kind of doubt. Does anyone knows a filesystem guru ? Not a guru, but indeed there is a risk that the layout will be worse than necessary. I think holes are ignored unless they are big enough, but 64K-512 is probably enough to create one. Indeed here I get this: $ qemu-img create -f qcow2 foo.qcow2 10G $ qemu-io -c 'write 0 512' foo.qcow2 $ qemu-img map -f raw foo.qcow2 10G Offset Length Mapped to File 0 0x31000 0 foo.qcow2 0x40000 0x20000 0x40000 foo.qcow2 I don't know if this has any practical impact, but if the fix is easy... Paolo
Benoît Canet <benoit.canet@irqsave.net> writes: > The Tuesday 27 May 2014 à 15:24:00 (+0200), Stefan Hajnoczi wrote : >> On Mon, May 26, 2014 at 04:36:15PM +0200, Benoît Canet wrote: >> > The Thursday 22 May 2014 à 11:42:50 (+0200), Stefan Hajnoczi wrote : >> > > Normally one would expect that qcow2 image file lengths are multiples of >> > > the cluster size. This is not true in all cases and the spec should >> > > document this so implementers remember to accept such files. >> > > >> > > $ qemu-img create -f qcow2 foo.qcow2 2G >> > > Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off >> > > cluster_size=65536 lazy_refcounts=off >> > > $ ls -l foo.qcow2 >> > > -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2 >> > > $ bc -q >> > > 3 * (64 * 1024) + 512 >> > > 197120 >> > > >> > > The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB >> > > cluster size needs. The rest of the L1 table is omitted from the file >> > > but allocation will continue at the next cluster boundary. >> > >> > I think we should fix this to allocate a whole extra cluster instead >> > of 512B. >> > These days most SSD pages (smalled write unit) are 4KB or 16KB. >> > Having most of the file shifted by 512B mean that the SSD controller >> > will have to to Read/Modify/Write cycles for most write hence impacting >> > performance and SSD life. >> >> It's not shifted. I thought the last sentence explains this: >> "The rest of the L1 table is omitted from the file but allocation will >> continue at the next cluster boundary." >> >> Are you worried that the host file system will lay out data poorly >> because the file looks like this? >> >> | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster | >> >> B = bytes >> C = clusters >> >> My guess is the next cluster will be aligned to a reasonable boundary on >> the physical disk. > > I have some kind of doubt. Does anyone knows a filesystem guru ? I'm not one, but here goes anyway. Aligning to a multiple of the SSD's erase block size can only help. A common erase block size today is 128KiB. The going recommendation for *partition* alignment (which should also be aligned to erase block size) is 1MiB. What this means for QCOW2 I'll leave to the good folks working on it. Here's some (dated) advice on aligning for SSDs from a real filesystem guru: http://tytso.livejournal.com/2009/02/20/
The Tuesday 27 May 2014 à 15:24:00 (+0200), Stefan Hajnoczi wrote : > On Mon, May 26, 2014 at 04:36:15PM +0200, Benoît Canet wrote: > > The Thursday 22 May 2014 à 11:42:50 (+0200), Stefan Hajnoczi wrote : > > > Normally one would expect that qcow2 image file lengths are multiples of > > > the cluster size. This is not true in all cases and the spec should > > > document this so implementers remember to accept such files. > > > > > > $ qemu-img create -f qcow2 foo.qcow2 2G > > > Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off > > > $ ls -l foo.qcow2 > > > -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2 > > > $ bc -q > > > 3 * (64 * 1024) + 512 > > > 197120 > > > > > > The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB > > > cluster size needs. The rest of the L1 table is omitted from the file > > > but allocation will continue at the next cluster boundary. > > > > I think we should fix this to allocate a whole extra cluster instead > > of 512B. > > These days most SSD pages (smalled write unit) are 4KB or 16KB. > > Having most of the file shifted by 512B mean that the SSD controller > > will have to to Read/Modify/Write cycles for most write hence impacting > > performance and SSD life. > > It's not shifted. I thought the last sentence explains this: > "The rest of the L1 table is omitted from the file but allocation will > continue at the next cluster boundary." > > Are you worried that the host file system will lay out data poorly > because the file looks like this? > > | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster | > > B = bytes > C = clusters > > My guess is the next cluster will be aligned to a reasonable boundary on > the physical disk. I have some kind of doubt. Does anyone knows a filesystem guru ? > > Stefan
The Tuesday 27 May 2014 à 17:29:16 (+0200), Paolo Bonzini wrote : > Il 27/05/2014 18:00, Benoît Canet ha scritto: > >> Are you worried that the host file system will lay out data poorly > >> because the file looks like this? > >> > >> | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster | > >> > >> B = bytes > >> C = clusters > >> > >> My guess is the next cluster will be aligned to a reasonable boundary on > >> the physical disk. > > > >I have some kind of doubt. Does anyone knows a filesystem guru ? > > Not a guru, but indeed there is a risk that the layout will be worse > than necessary. > > I think holes are ignored unless they are big enough, but 64K-512 is > probably enough to create one. Indeed here I get this: > > $ qemu-img create -f qcow2 foo.qcow2 10G > $ qemu-io -c 'write 0 512' foo.qcow2 > $ qemu-img map -f raw foo.qcow2 10G > Offset Length Mapped to File > 0 0x31000 0 foo.qcow2 > 0x40000 0x20000 0x40000 foo.qcow2 That seems to be 4k aligned. It's better than I though. Best regards Benoît > > I don't know if this has any practical impact, but if the fix is easy... > > Paolo >
On Tue, May 27, 2014 at 05:56:35PM +0200, Markus Armbruster wrote: > Benoît Canet <benoit.canet@irqsave.net> writes: > > > The Tuesday 27 May 2014 à 15:24:00 (+0200), Stefan Hajnoczi wrote : > >> On Mon, May 26, 2014 at 04:36:15PM +0200, Benoît Canet wrote: > >> > The Thursday 22 May 2014 à 11:42:50 (+0200), Stefan Hajnoczi wrote : > >> > > Normally one would expect that qcow2 image file lengths are multiples of > >> > > the cluster size. This is not true in all cases and the spec should > >> > > document this so implementers remember to accept such files. > >> > > > >> > > $ qemu-img create -f qcow2 foo.qcow2 2G > >> > > Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off > >> > > cluster_size=65536 lazy_refcounts=off > >> > > $ ls -l foo.qcow2 > >> > > -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2 > >> > > $ bc -q > >> > > 3 * (64 * 1024) + 512 > >> > > 197120 > >> > > > >> > > The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB > >> > > cluster size needs. The rest of the L1 table is omitted from the file > >> > > but allocation will continue at the next cluster boundary. > >> > > >> > I think we should fix this to allocate a whole extra cluster instead > >> > of 512B. > >> > These days most SSD pages (smalled write unit) are 4KB or 16KB. > >> > Having most of the file shifted by 512B mean that the SSD controller > >> > will have to to Read/Modify/Write cycles for most write hence impacting > >> > performance and SSD life. > >> > >> It's not shifted. I thought the last sentence explains this: > >> "The rest of the L1 table is omitted from the file but allocation will > >> continue at the next cluster boundary." > >> > >> Are you worried that the host file system will lay out data poorly > >> because the file looks like this? > >> > >> | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster | > >> > >> B = bytes > >> C = clusters > >> > >> My guess is the next cluster will be aligned to a reasonable boundary on > >> the physical disk. > > > > I have some kind of doubt. Does anyone knows a filesystem guru ? > > I'm not one, but here goes anyway. > > Aligning to a multiple of the SSD's erase block size can only help. A > common erase block size today is 128KiB. The going recommendation for > *partition* alignment (which should also be aligned to erase block size) > is 1MiB. What this means for QCOW2 I'll leave to the good folks working > on it. > > Here's some (dated) advice on aligning for SSDs from a real filesystem > guru: http://tytso.livejournal.com/2009/02/20/ We need to do this in two steps: 1. Update the qcow2 specification to clarify that existing files may not be multiples of cluster size. 2. Update QEMU implementation to write full clusters to the file. That way we get the performance benefits but warn implementors that files might not be multiples of cluster size. This patch addresses #1, so unless anyone objects to the spec wording, I think it should be merged.
Am 27.05.2014 um 19:45 hat Benoît Canet geschrieben: > The Tuesday 27 May 2014 à 17:29:16 (+0200), Paolo Bonzini wrote : > > Il 27/05/2014 18:00, Benoît Canet ha scritto: > > >> Are you worried that the host file system will lay out data poorly > > >> because the file looks like this? > > >> > > >> | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster | > > >> > > >> B = bytes > > >> C = clusters > > >> > > >> My guess is the next cluster will be aligned to a reasonable boundary on > > >> the physical disk. > > > > > >I have some kind of doubt. Does anyone knows a filesystem guru ? > > > > Not a guru, but indeed there is a risk that the layout will be worse > > than necessary. > > > > I think holes are ignored unless they are big enough, but 64K-512 is > > probably enough to create one. Indeed here I get this: > > > > $ qemu-img create -f qcow2 foo.qcow2 10G > > $ qemu-io -c 'write 0 512' foo.qcow2 > > $ qemu-img map -f raw foo.qcow2 10G > > Offset Length Mapped to File > > 0 0x31000 0 foo.qcow2 > > 0x40000 0x20000 0x40000 foo.qcow2 > > That seems to be 4k aligned. It's better than I though. These 4k are probably the host file system block size. Kevin
diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt index f19536a..a46ee57 100644 --- a/docs/specs/qcow2.txt +++ b/docs/specs/qcow2.txt @@ -4,6 +4,10 @@ A qcow2 image file is organized in units of constant size, which are called (host) clusters. A cluster is the unit in which all allocations are done, both for actual guest data and for image metadata. +If the end of the image file is not on a cluster boundary, the missing data is +treated as zeroes. This layout can occur when an L1 table or refcount table is +the last thing in the file and not all entries in the table are used. + Likewise, the virtual disk as seen by the guest is divided into (guest) clusters of the same size.
Normally one would expect that qcow2 image file lengths are multiples of the cluster size. This is not true in all cases and the spec should document this so implementers remember to accept such files. $ qemu-img create -f qcow2 foo.qcow2 2G Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off $ ls -l foo.qcow2 -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2 $ bc -q 3 * (64 * 1024) + 512 197120 The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB cluster size needs. The rest of the L1 table is omitted from the file but allocation will continue at the next cluster boundary. Reported-by: Maria Kustova <maxa@catit.be> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> --- docs/specs/qcow2.txt | 4 ++++ 1 file changed, 4 insertions(+)