diff mbox

docs: clarify that qcow2 file size is not always a cluster multiple

Message ID 1400751770-14594-1-git-send-email-stefanha@redhat.com
State New
Headers show

Commit Message

Stefan Hajnoczi May 22, 2014, 9:42 a.m. UTC
Normally one would expect that qcow2 image file lengths are multiples of
the cluster size.  This is not true in all cases and the spec should
document this so implementers remember to accept such files.

$ qemu-img create -f qcow2 foo.qcow2 2G
Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off
$ ls -l foo.qcow2
-rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2
$ bc -q
3 * (64 * 1024) + 512
197120

The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB
cluster size needs.  The rest of the L1 table is omitted from the file
but allocation will continue at the next cluster boundary.

Reported-by: Maria Kustova <maxa@catit.be>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 docs/specs/qcow2.txt | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Eric Blake May 22, 2014, 4:52 p.m. UTC | #1
On 05/22/2014 03:42 AM, Stefan Hajnoczi wrote:
> Normally one would expect that qcow2 image file lengths are multiples of
> the cluster size.  This is not true in all cases and the spec should
> document this so implementers remember to accept such files.
> 
> $ qemu-img create -f qcow2 foo.qcow2 2G
> Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off
> $ ls -l foo.qcow2
> -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2
> $ bc -q
> 3 * (64 * 1024) + 512
> 197120
> 
> The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB
> cluster size needs.  The rest of the L1 table is omitted from the file
> but allocation will continue at the next cluster boundary.
> 
> Reported-by: Maria Kustova <maxa@catit.be>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  docs/specs/qcow2.txt | 4 ++++
>  1 file changed, 4 insertions(+)

Reviewed-by: Eric Blake <eblake@redhat.com>
Kevin Wolf May 23, 2014, 11:18 a.m. UTC | #2
Am 22.05.2014 um 11:42 hat Stefan Hajnoczi geschrieben:
> Normally one would expect that qcow2 image file lengths are multiples of
> the cluster size.  This is not true in all cases and the spec should
> document this so implementers remember to accept such files.
> 
> $ qemu-img create -f qcow2 foo.qcow2 2G
> Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off
> $ ls -l foo.qcow2
> -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2
> $ bc -q
> 3 * (64 * 1024) + 512
> 197120
> 
> The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB
> cluster size needs.  The rest of the L1 table is omitted from the file
> but allocation will continue at the next cluster boundary.
> 
> Reported-by: Maria Kustova <maxa@catit.be>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  docs/specs/qcow2.txt | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> index f19536a..a46ee57 100644
> --- a/docs/specs/qcow2.txt
> +++ b/docs/specs/qcow2.txt
> @@ -4,6 +4,10 @@ A qcow2 image file is organized in units of constant size, which are called
>  (host) clusters. A cluster is the unit in which all allocations are done,
>  both for actual guest data and for image metadata.
>  
> +If the end of the image file is not on a cluster boundary, the missing data is
> +treated as zeroes.  This layout can occur when an L1 table or refcount table is
> +the last thing in the file and not all entries in the table are used.
> +
>  Likewise, the virtual disk as seen by the guest is divided into (guest)
>  clusters of the same size.

Why don't we specify this for _any_ data after EOF, as we discussed on
IRC, instead of just for a partial last cluster?

If we restrict it to the last cluster, specifying the data as zero
doesn't make a whole lot of sense because then the data there wouldn't
be supposed to be interpreted anyway.

Kevin
Stefan Hajnoczi May 26, 2014, 11:57 a.m. UTC | #3
On Fri, May 23, 2014 at 01:18:50PM +0200, Kevin Wolf wrote:
> Am 22.05.2014 um 11:42 hat Stefan Hajnoczi geschrieben:
> > diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> > index f19536a..a46ee57 100644
> > --- a/docs/specs/qcow2.txt
> > +++ b/docs/specs/qcow2.txt
> > @@ -4,6 +4,10 @@ A qcow2 image file is organized in units of constant size, which are called
> >  (host) clusters. A cluster is the unit in which all allocations are done,
> >  both for actual guest data and for image metadata.
> >  
> > +If the end of the image file is not on a cluster boundary, the missing data is
> > +treated as zeroes.  This layout can occur when an L1 table or refcount table is
> > +the last thing in the file and not all entries in the table are used.
> > +
> >  Likewise, the virtual disk as seen by the guest is divided into (guest)
> >  clusters of the same size.
> 
> Why don't we specify this for _any_ data after EOF, as we discussed on
> IRC, instead of just for a partial last cluster?
> 
> If we restrict it to the last cluster, specifying the data as zero
> doesn't make a whole lot of sense because then the data there wouldn't
> be supposed to be interpreted anyway.

That's too general.  Specifying it that way means almost all offsets
become valid because you're just supposed to read zeroes!  That will
definitely cause trouble, let's be stricter here.

Stefan
Benoît Canet May 26, 2014, 2:36 p.m. UTC | #4
The Thursday 22 May 2014 à 11:42:50 (+0200), Stefan Hajnoczi wrote :
> Normally one would expect that qcow2 image file lengths are multiples of
> the cluster size.  This is not true in all cases and the spec should
> document this so implementers remember to accept such files.
> 
> $ qemu-img create -f qcow2 foo.qcow2 2G
> Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off
> $ ls -l foo.qcow2
> -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2
> $ bc -q
> 3 * (64 * 1024) + 512
> 197120
> 
> The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB
> cluster size needs.  The rest of the L1 table is omitted from the file
> but allocation will continue at the next cluster boundary.

I think we should fix this to allocate a whole extra cluster instead
of 512B.
These days most SSD pages (smalled write unit) are 4KB or 16KB.
Having most of the file shifted by 512B mean that the SSD controller
will have to to Read/Modify/Write cycles for most write hence impacting
performance and SSD life.

Best regards

Benoît

> 
> Reported-by: Maria Kustova <maxa@catit.be>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  docs/specs/qcow2.txt | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> index f19536a..a46ee57 100644
> --- a/docs/specs/qcow2.txt
> +++ b/docs/specs/qcow2.txt
> @@ -4,6 +4,10 @@ A qcow2 image file is organized in units of constant size, which are called
>  (host) clusters. A cluster is the unit in which all allocations are done,
>  both for actual guest data and for image metadata.
>  
> +If the end of the image file is not on a cluster boundary, the missing data is
> +treated as zeroes.  This layout can occur when an L1 table or refcount table is
> +the last thing in the file and not all entries in the table are used.
> +
>  Likewise, the virtual disk as seen by the guest is divided into (guest)
>  clusters of the same size.
>  
> -- 
> 1.9.0
>
Stefan Hajnoczi May 27, 2014, 1:24 p.m. UTC | #5
On Mon, May 26, 2014 at 04:36:15PM +0200, Benoît Canet wrote:
> The Thursday 22 May 2014 à 11:42:50 (+0200), Stefan Hajnoczi wrote :
> > Normally one would expect that qcow2 image file lengths are multiples of
> > the cluster size.  This is not true in all cases and the spec should
> > document this so implementers remember to accept such files.
> > 
> > $ qemu-img create -f qcow2 foo.qcow2 2G
> > Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off
> > $ ls -l foo.qcow2
> > -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2
> > $ bc -q
> > 3 * (64 * 1024) + 512
> > 197120
> > 
> > The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB
> > cluster size needs.  The rest of the L1 table is omitted from the file
> > but allocation will continue at the next cluster boundary.
> 
> I think we should fix this to allocate a whole extra cluster instead
> of 512B.
> These days most SSD pages (smalled write unit) are 4KB or 16KB.
> Having most of the file shifted by 512B mean that the SSD controller
> will have to to Read/Modify/Write cycles for most write hence impacting
> performance and SSD life.

It's not shifted.  I thought the last sentence explains this:
"The rest of the L1 table is omitted from the file but allocation will
continue at the next cluster boundary."

Are you worried that the host file system will lay out data poorly
because the file looks like this?

| header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster |

B = bytes
C = clusters

My guess is the next cluster will be aligned to a reasonable boundary on
the physical disk.

Stefan
Paolo Bonzini May 27, 2014, 3:29 p.m. UTC | #6
Il 27/05/2014 18:00, Benoît Canet ha scritto:
> > Are you worried that the host file system will lay out data poorly
> > because the file looks like this?
> >
> > | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster |
> >
> > B = bytes
> > C = clusters
> >
> > My guess is the next cluster will be aligned to a reasonable boundary on
> > the physical disk.
>
> I have some kind of doubt. Does anyone knows a filesystem guru ?

Not a guru, but indeed there is a risk that the layout will be worse 
than necessary.

I think holes are ignored unless they are big enough, but 64K-512 is 
probably enough to create one.  Indeed here I get this:

     $ qemu-img create -f qcow2 foo.qcow2 10G
     $ qemu-io -c 'write 0 512' foo.qcow2
     $ qemu-img map -f raw foo.qcow2 10G
     Offset          Length          Mapped to       File
     0               0x31000         0               foo.qcow2
     0x40000         0x20000         0x40000         foo.qcow2

I don't know if this has any practical impact, but if the fix is easy...

Paolo
Markus Armbruster May 27, 2014, 3:56 p.m. UTC | #7
Benoît Canet <benoit.canet@irqsave.net> writes:

> The Tuesday 27 May 2014 à 15:24:00 (+0200), Stefan Hajnoczi wrote :
>> On Mon, May 26, 2014 at 04:36:15PM +0200, Benoît Canet wrote:
>> > The Thursday 22 May 2014 à 11:42:50 (+0200), Stefan Hajnoczi wrote :
>> > > Normally one would expect that qcow2 image file lengths are multiples of
>> > > the cluster size.  This is not true in all cases and the spec should
>> > > document this so implementers remember to accept such files.
>> > > 
>> > > $ qemu-img create -f qcow2 foo.qcow2 2G
>> > > Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off
>> > > cluster_size=65536 lazy_refcounts=off
>> > > $ ls -l foo.qcow2
>> > > -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2
>> > > $ bc -q
>> > > 3 * (64 * 1024) + 512
>> > > 197120
>> > > 
>> > > The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB
>> > > cluster size needs.  The rest of the L1 table is omitted from the file
>> > > but allocation will continue at the next cluster boundary.
>> > 
>> > I think we should fix this to allocate a whole extra cluster instead
>> > of 512B.
>> > These days most SSD pages (smalled write unit) are 4KB or 16KB.
>> > Having most of the file shifted by 512B mean that the SSD controller
>> > will have to to Read/Modify/Write cycles for most write hence impacting
>> > performance and SSD life.
>> 
>> It's not shifted.  I thought the last sentence explains this:
>> "The rest of the L1 table is omitted from the file but allocation will
>> continue at the next cluster boundary."
>> 
>> Are you worried that the host file system will lay out data poorly
>> because the file looks like this?
>> 
>> | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster |
>> 
>> B = bytes
>> C = clusters
>> 
>> My guess is the next cluster will be aligned to a reasonable boundary on
>> the physical disk.
>
> I have some kind of doubt. Does anyone knows a filesystem guru ?

I'm not one, but here goes anyway.

Aligning to a multiple of the SSD's erase block size can only help.  A
common erase block size today is 128KiB.  The going recommendation for
*partition* alignment (which should also be aligned to erase block size)
is 1MiB.  What this means for QCOW2 I'll leave to the good folks working
on it.

Here's some (dated) advice on aligning for SSDs from a real filesystem
guru: http://tytso.livejournal.com/2009/02/20/
Benoît Canet May 27, 2014, 4 p.m. UTC | #8
The Tuesday 27 May 2014 à 15:24:00 (+0200), Stefan Hajnoczi wrote :
> On Mon, May 26, 2014 at 04:36:15PM +0200, Benoît Canet wrote:
> > The Thursday 22 May 2014 à 11:42:50 (+0200), Stefan Hajnoczi wrote :
> > > Normally one would expect that qcow2 image file lengths are multiples of
> > > the cluster size.  This is not true in all cases and the spec should
> > > document this so implementers remember to accept such files.
> > > 
> > > $ qemu-img create -f qcow2 foo.qcow2 2G
> > > Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off cluster_size=65536 lazy_refcounts=off
> > > $ ls -l foo.qcow2
> > > -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2
> > > $ bc -q
> > > 3 * (64 * 1024) + 512
> > > 197120
> > > 
> > > The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB
> > > cluster size needs.  The rest of the L1 table is omitted from the file
> > > but allocation will continue at the next cluster boundary.
> > 
> > I think we should fix this to allocate a whole extra cluster instead
> > of 512B.
> > These days most SSD pages (smalled write unit) are 4KB or 16KB.
> > Having most of the file shifted by 512B mean that the SSD controller
> > will have to to Read/Modify/Write cycles for most write hence impacting
> > performance and SSD life.
> 
> It's not shifted.  I thought the last sentence explains this:
> "The rest of the L1 table is omitted from the file but allocation will
> continue at the next cluster boundary."
> 
> Are you worried that the host file system will lay out data poorly
> because the file looks like this?
> 
> | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster |
> 
> B = bytes
> C = clusters
> 
> My guess is the next cluster will be aligned to a reasonable boundary on
> the physical disk.

I have some kind of doubt. Does anyone knows a filesystem guru ?

> 
> Stefan
Benoît Canet May 27, 2014, 5:45 p.m. UTC | #9
The Tuesday 27 May 2014 à 17:29:16 (+0200), Paolo Bonzini wrote :
> Il 27/05/2014 18:00, Benoît Canet ha scritto:
> >> Are you worried that the host file system will lay out data poorly
> >> because the file looks like this?
> >>
> >> | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster |
> >>
> >> B = bytes
> >> C = clusters
> >>
> >> My guess is the next cluster will be aligned to a reasonable boundary on
> >> the physical disk.
> >
> >I have some kind of doubt. Does anyone knows a filesystem guru ?
> 
> Not a guru, but indeed there is a risk that the layout will be worse
> than necessary.
> 
> I think holes are ignored unless they are big enough, but 64K-512 is
> probably enough to create one.  Indeed here I get this:
> 
>     $ qemu-img create -f qcow2 foo.qcow2 10G
>     $ qemu-io -c 'write 0 512' foo.qcow2
>     $ qemu-img map -f raw foo.qcow2 10G
>     Offset          Length          Mapped to       File
>     0               0x31000         0               foo.qcow2
>     0x40000         0x20000         0x40000         foo.qcow2

That seems to be 4k aligned. It's better than I though.

Best regards

Benoît

> 
> I don't know if this has any practical impact, but if the fix is easy...
> 
> Paolo
>
Stefan Hajnoczi May 28, 2014, 9:18 a.m. UTC | #10
On Tue, May 27, 2014 at 05:56:35PM +0200, Markus Armbruster wrote:
> Benoît Canet <benoit.canet@irqsave.net> writes:
> 
> > The Tuesday 27 May 2014 à 15:24:00 (+0200), Stefan Hajnoczi wrote :
> >> On Mon, May 26, 2014 at 04:36:15PM +0200, Benoît Canet wrote:
> >> > The Thursday 22 May 2014 à 11:42:50 (+0200), Stefan Hajnoczi wrote :
> >> > > Normally one would expect that qcow2 image file lengths are multiples of
> >> > > the cluster size.  This is not true in all cases and the spec should
> >> > > document this so implementers remember to accept such files.
> >> > > 
> >> > > $ qemu-img create -f qcow2 foo.qcow2 2G
> >> > > Formatting 'foo.qcow2', fmt=qcow2 size=2147483648 encryption=off
> >> > > cluster_size=65536 lazy_refcounts=off
> >> > > $ ls -l foo.qcow2
> >> > > -rw-r--r-- 1 stefanha stefanha 197120 May 22 11:40 foo.qcow2
> >> > > $ bc -q
> >> > > 3 * (64 * 1024) + 512
> >> > > 197120
> >> > > 
> >> > > The extra sector are the 4 L1 table entries that a 2 GB disk with 64 KB
> >> > > cluster size needs.  The rest of the L1 table is omitted from the file
> >> > > but allocation will continue at the next cluster boundary.
> >> > 
> >> > I think we should fix this to allocate a whole extra cluster instead
> >> > of 512B.
> >> > These days most SSD pages (smalled write unit) are 4KB or 16KB.
> >> > Having most of the file shifted by 512B mean that the SSD controller
> >> > will have to to Read/Modify/Write cycles for most write hence impacting
> >> > performance and SSD life.
> >> 
> >> It's not shifted.  I thought the last sentence explains this:
> >> "The rest of the L1 table is omitted from the file but allocation will
> >> continue at the next cluster boundary."
> >> 
> >> Are you worried that the host file system will lay out data poorly
> >> because the file looks like this?
> >> 
> >> | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster |
> >> 
> >> B = bytes
> >> C = clusters
> >> 
> >> My guess is the next cluster will be aligned to a reasonable boundary on
> >> the physical disk.
> >
> > I have some kind of doubt. Does anyone knows a filesystem guru ?
> 
> I'm not one, but here goes anyway.
> 
> Aligning to a multiple of the SSD's erase block size can only help.  A
> common erase block size today is 128KiB.  The going recommendation for
> *partition* alignment (which should also be aligned to erase block size)
> is 1MiB.  What this means for QCOW2 I'll leave to the good folks working
> on it.
> 
> Here's some (dated) advice on aligning for SSDs from a real filesystem
> guru: http://tytso.livejournal.com/2009/02/20/

We need to do this in two steps:

1. Update the qcow2 specification to clarify that existing files may not
   be multiples of cluster size.

2. Update QEMU implementation to write full clusters to the file.

That way we get the performance benefits but warn implementors that
files might not be multiples of cluster size.

This patch addresses #1, so unless anyone objects to the spec wording, I
think it should be merged.
Kevin Wolf May 28, 2014, 11:29 a.m. UTC | #11
Am 27.05.2014 um 19:45 hat Benoît Canet geschrieben:
> The Tuesday 27 May 2014 à 17:29:16 (+0200), Paolo Bonzini wrote :
> > Il 27/05/2014 18:00, Benoît Canet ha scritto:
> > >> Are you worried that the host file system will lay out data poorly
> > >> because the file looks like this?
> > >>
> > >> | header (1C) | refcounts (2C) | L1 (512B) | hole | Next cluster |
> > >>
> > >> B = bytes
> > >> C = clusters
> > >>
> > >> My guess is the next cluster will be aligned to a reasonable boundary on
> > >> the physical disk.
> > >
> > >I have some kind of doubt. Does anyone knows a filesystem guru ?
> > 
> > Not a guru, but indeed there is a risk that the layout will be worse
> > than necessary.
> > 
> > I think holes are ignored unless they are big enough, but 64K-512 is
> > probably enough to create one.  Indeed here I get this:
> > 
> >     $ qemu-img create -f qcow2 foo.qcow2 10G
> >     $ qemu-io -c 'write 0 512' foo.qcow2
> >     $ qemu-img map -f raw foo.qcow2 10G
> >     Offset          Length          Mapped to       File
> >     0               0x31000         0               foo.qcow2
> >     0x40000         0x20000         0x40000         foo.qcow2
> 
> That seems to be 4k aligned. It's better than I though.

These 4k are probably the host file system block size.

Kevin
diff mbox

Patch

diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
index f19536a..a46ee57 100644
--- a/docs/specs/qcow2.txt
+++ b/docs/specs/qcow2.txt
@@ -4,6 +4,10 @@  A qcow2 image file is organized in units of constant size, which are called
 (host) clusters. A cluster is the unit in which all allocations are done,
 both for actual guest data and for image metadata.
 
+If the end of the image file is not on a cluster boundary, the missing data is
+treated as zeroes.  This layout can occur when an L1 table or refcount table is
+the last thing in the file and not all entries in the table are used.
+
 Likewise, the virtual disk as seen by the guest is divided into (guest)
 clusters of the same size.