Patchwork qcow2: Metadata preallocation

Submitter Kevin Wolf
Date Aug. 14, 2009, 3 p.m.
Message ID <1250262015-996-1-git-send-email-kwolf@redhat.com>
Permalink /patch/31406/
State Superseded

Comments

Kevin Wolf - Aug. 14, 2009, 3 p.m.
This introduces a qemu-img create option for qcow2 which allows the metadata to
be preallocated, i.e. clusters are reserved in the refcount table and L1/L2
tables, but no data is written to them. Metadata is quite small, so this
happens in almost no time.

Especially with qcow2 on virtio this helps to gain a bit of performance during
the initial writes. However, as soon as you create a snapshot, we're back to the
normal slow speed, obviously. So this isn't the real fix, but kind of a cheat
while we're still having trouble with qcow2 on virtio.

Note that the option is disabled by default and needs to be specified
explicitly using qemu-img create -f qcow2 -o preallocation=metadata.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c |   83 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 block_int.h   |    1 +
 2 files changed, 82 insertions(+), 2 deletions(-)
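
The option handling described above (off by default, enabled with preallocation=metadata, anything else rejected) can be sketched on its own. This is a minimal illustration, not the patch's code: the enum and the function name parse_prealloc are hypothetical, and the patch itself stores the mode in a plain int inside qcow_create().

```c
#include <string.h>

/* Hypothetical enum for illustration; the patch uses a plain int. */
enum prealloc_mode {
    PREALLOC_INVALID = -1,
    PREALLOC_OFF = 0,
    PREALLOC_METADATA = 1
};

/*
 * Map the value of the "preallocation" option to a mode.
 * A NULL value means the option was not given, which is treated as "off".
 */
static enum prealloc_mode parse_prealloc(const char *value)
{
    if (value == NULL || strcmp(value, "off") == 0) {
        return PREALLOC_OFF;
    }
    if (strcmp(value, "metadata") == 0) {
        return PREALLOC_METADATA;
    }
    /* Unknown modes are rejected; the caller would report -EINVAL. */
    return PREALLOC_INVALID;
}
```

A caller would treat PREALLOC_INVALID the same way the patch treats an unknown string: print an error and fail image creation.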
Avi Kivity - Aug. 16, 2009, 11:58 a.m.
On 08/14/2009 06:00 PM, Kevin Wolf wrote:
> This introduces a qemu-img create option for qcow2 which allows the metadata to
> be preallocated, i.e. clusters are reserved in the refcount table and L1/L2
> tables, but no data is written to them. Metadata is quite small, so this
> happens in almost no time.
>
> Especially with qcow2 on virtio this helps to gain a bit of performance during
> the initial writes. However, as soon as you create a snapshot, we're back to the
> normal slow speed, obviously. So this isn't the real fix, but kind of a cheat
> while we're still having trouble with qcow2 on virtio.
>
> Note that the option is disabled by default and needs to be specified
> explicitly using qemu-img create -f qcow2 -o preallocation=metadata.
>
>    

Can't say I'm thrilled with this.  I'd prefer coalescing metadata 
updates on parallel writes.  I don't object to this though.

> +    /*
> +     * It is expected that the image file is large enough to actually contain
> +     * all of the allocated clusters (otherwise we get failing reads after
> +     * EOF). So just write some zeros to the last sector.
> +     */
> +    if (cluster_offset != 0) {
> +        uint8_t buf[512];
> +        memset(buf, 0, 512);
> +        bdrv_write(s->hd, (cluster_offset >> 9) + num - 1, buf, 1);
> +    }
> +
>    

Older versions of Windows don't support sparse files, and newer ones 
need a flag.  It's a good idea to set this flag when opening on Windows.
Filip Navara - Aug. 16, 2009, 12:12 p.m.
On Sun, Aug 16, 2009 at 1:58 PM, Avi Kivity<avi@redhat.com> wrote:
> Older versions of Windows don't support sparse files, and newer ones need a
> flag.

It's supported since Windows 2000, btw.

>  It's a good idea to set this flag when opening on Windows.

FILE_ATTRIBUTE_SPARSE_FILE? You can't actually set it when
opening/creating the file, a separate call to
DeviceIoControl/FSCTL_SET_SPARSE is needed.

Best regards,
Filip Navara
Jamie Lokier - Aug. 16, 2009, 4:48 p.m.
Filip Navara wrote:
> FILE_ATTRIBUTE_SPARSE_FILE? You can't actually set it when
> opening/creating the file, a separate call to
> DeviceIoControl/FSCTL_SET_SPARSE is needed.

I see that you increase the file size by writing zeros to the end.

Can't you use the Windows equivalent of unix ftruncate() to extend the
file instead, after FSCTL_SET_SPARSE?

-- Jamie
Jamie Lokier - Aug. 16, 2009, 8:28 p.m.
Jamie Lokier wrote:
> Filip Navara wrote:
> > FILE_ATTRIBUTE_SPARSE_FILE? You can't actually set it when
> > opening/creating the file, a separate call to
> > DeviceIoControl/FSCTL_SET_SPARSE is needed.
> 
> I see that you increase the file size by writing zeros to the end.
> 
> Can't you use the Windows equivalent of unix ftruncate() to extend the
> file instead, after FSCTL_SET_SPARSE?

Specifically: 

    SetEndOfFile
	Use this after SetFilePointer to change the length of a file
	or stream.  If used on a sparse file or stream, increasing
	the length creates a sparse region.

-- Jamie
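
The suggestion in this part of the thread (mark the file sparse, then extend it by moving the end of file instead of writing zeros) might look roughly like the sketch below. The function name extend_sparse is hypothetical, and SetFilePointerEx is used here in place of the SetFilePointer call named above so that 64-bit sizes work in a single call. The Windows branch is untested and only shows the order of the calls; the POSIX branch uses ftruncate().

```c
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#include <winioctl.h>
#include <io.h>

/* Extend the file behind fd to new_size bytes without writing any data. */
static int extend_sparse(int fd, int64_t new_size)
{
    HANDLE h = (HANDLE)_get_osfhandle(fd);
    DWORD returned;
    LARGE_INTEGER pos;

    if (h == INVALID_HANDLE_VALUE) {
        return -1;
    }

    /*
     * Mark the file sparse first. The result is ignored on purpose: if the
     * filesystem has no sparse support, the extension below still works but
     * may allocate real space.
     */
    DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &returned, NULL);

    /* Move the file pointer to the new length, then set EOF there. */
    pos.QuadPart = new_size;
    if (!SetFilePointerEx(h, pos, NULL, FILE_BEGIN)) {
        return -1;
    }
    return SetEndOfFile(h) ? 0 : -1;
}
#else
#include <sys/types.h>
#include <unistd.h>

/* Extend the file behind fd to new_size bytes without writing any data. */
static int extend_sparse(int fd, int64_t new_size)
{
    return ftruncate(fd, (off_t)new_size);
}
#endif
```

Either branch replaces the "write 512 zero bytes to the last sector" step with a pure length change, so no data block has to be written just to move EOF.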
Kevin Wolf - Aug. 17, 2009, 7:11 a.m.
Avi Kivity wrote:
> On 08/14/2009 06:00 PM, Kevin Wolf wrote:
>> This introduces a qemu-img create option for qcow2 which allows the metadata to
>> be preallocated, i.e. clusters are reserved in the refcount table and L1/L2
>> tables, but no data is written to them. Metadata is quite small, so this
>> happens in almost no time.
>>
>> Especially with qcow2 on virtio this helps to gain a bit of performance during
>> the initial writes. However, as soon as you create a snapshot, we're back to the
>> normal slow speed, obviously. So this isn't the real fix, but kind of a cheat
>> while we're still having trouble with qcow2 on virtio.
>>
>> Note that the option is disabled by default and needs to be specified
>> explicitly using qemu-img create -f qcow2 -o preallocation=metadata.
>>
>>    
> 
> Can't say I'm thrilled with this.  I'd prefer coalescing metadata 
> updates on parallel writes.  I don't object to this though.

Even with improved concurrent cluster allocation, you might profit from
metadata preallocation by having less fragmented qcow2 images which
avoids splitting up requests. Not sure if this is relevant in practice
though.

>> +    /*
>> +     * It is expected that the image file is large enough to actually contain
>> +     * all of the allocated clusters (otherwise we get failing reads after
>> +     * EOF). So just write some zeros to the last sector.
>> +     */
>> +    if (cluster_offset != 0) {
>> +        uint8_t buf[512];
>> +        memset(buf, 0, 512);
>> +        bdrv_write(s->hd, (cluster_offset >> 9) + num - 1, buf, 1);
>> +    }
>> +
>>    
> 
> Older versions of Windows don't support sparse files, and newer ones 
> need a flag.  It's a good idea to set this flag when opening on Windows.

I'm certainly hoping that raw-win32 is doing whatever needs to be done?
The mentioned FSCTL_SET_SPARSE seems to be there at least.

Kevin
Kevin Wolf - Aug. 17, 2009, 7:16 a.m.
Jamie Lokier wrote:
> Filip Navara wrote:
>> FILE_ATTRIBUTE_SPARSE_FILE? You can't actually set it when
>> opening/creating the file, a separate call to
>> DeviceIoControl/FSCTL_SET_SPARSE is needed.
> 
> I see that you increase the file size by writing zeros to the end.
> 
> Can't you use the Windows equivalent of unix ftruncate() to extend the
> file instead, after FSCTL_SET_SPARSE?

There actually exists a bdrv_truncate(). I wasn't aware of that. If you
prefer, I can resend the patch with bdrv_truncate instead of a zero write.

Kevin
Avi Kivity - Aug. 17, 2009, 7:45 a.m.
On 08/17/2009 10:11 AM, Kevin Wolf wrote:
> Avi Kivity wrote:
>    
>> On 08/14/2009 06:00 PM, Kevin Wolf wrote:
>>      
>>> This introduces a qemu-img create option for qcow2 which allows the metadata to
>>> be preallocated, i.e. clusters are reserved in the refcount table and L1/L2
>>> tables, but no data is written to them. Metadata is quite small, so this
>>> happens in almost no time.
>>>
>>> Especially with qcow2 on virtio this helps to gain a bit of performance during
>>> the initial writes. However, as soon as you create a snapshot, we're back to the
>>> normal slow speed, obviously. So this isn't the real fix, but kind of a cheat
>>> while we're still having trouble with qcow2 on virtio.
>>>
>>> Note that the option is disabled by default and needs to be specified
>>> explicitly using qemu-img create -f qcow2 -o preallocation=metadata.
>>>
>>>
>>>        
>> Can't say I'm thrilled with this.  I'd prefer coalescing metadata
>> updates on parallel writes.  I don't object to this though.
>>      
> Even with improved concurrent cluster allocation, you might profit from
> metadata preallocation by having less fragmented qcow2 images which
> avoids splitting up requests. Not sure if this is relevant in practice
> though.
>    

What I meant was that I prefer changes that improve performance 
throughout the lifetime of the image rather than the initial writes, 
especially as there's a space tradeoff.  It is not a strong objection, 
just a mild preference.

> I'm certainly hoping that raw-win32 is doing whatever needs to be done?
> The mentioned FSCTL_SET_SPARSE seems to be there at least.
>    

Ah, I looked at the open code and missed it.
Kevin Wolf - Aug. 17, 2009, 7:58 a.m.
Avi Kivity wrote:
>> Even with improved concurrent cluster allocation, you might profit from
>> metadata preallocation by having less fragmented qcow2 images which
>> avoids splitting up requests. Not sure if this is relevant in practice
>> though.
> 
> What I meant was that I prefer changes that improve performance 
> throughout the lifetime of the image rather than the initial writes, 
> especially as there's a space tradeoff. 

Avoiding fragmentation could improve performance during normal operation
(that is, as long as you don't use snapshots). And I wouldn't worry
about the space tradeoff: Metadata for a 10 GB image is under 2 MB, and
a good part of it would be needed anyway.

But I completely agree that it is not the solution to all of our problems.

Kevin
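
The "under 2 MB for a 10 GB image" figure can be checked with a back-of-the-envelope estimate. The sketch below rests on assumptions that are not stated in the thread: 8-byte L1 and L2 entries, 16-bit refcount entries, 8-byte refcount table entries, a single header cluster, and no snapshots. The function name qcow2_metadata_clusters is hypothetical.

```c
#include <stdint.h>

/* Integer division, rounding up. */
static uint64_t div_round_up(uint64_t a, uint64_t b)
{
    return (a + b - 1) / b;
}

/*
 * Estimate how many clusters of metadata a fully mapped image needs.
 * Assumed layout (see the paragraph above): 8-byte L1/L2 entries,
 * 2-byte refcount entries, 8-byte refcount table entries, 1 header cluster.
 */
static uint64_t qcow2_metadata_clusters(uint64_t image_size,
                                        uint64_t cluster_size)
{
    uint64_t data = div_round_up(image_size, cluster_size);
    uint64_t l2_tables = div_round_up(data, cluster_size / 8);
    uint64_t l1 = div_round_up(l2_tables * 8, cluster_size);
    uint64_t fixed = 1 /* header */ + l1 + l2_tables;
    uint64_t refblocks = 0, reftable = 0, prev;

    /* Refcount structures also cover themselves: iterate to a fixed point. */
    do {
        prev = refblocks + reftable;
        uint64_t total = data + fixed + refblocks + reftable;
        refblocks = div_round_up(total, cluster_size / 2);
        reftable = div_round_up(refblocks * 8, cluster_size);
    } while (refblocks + reftable != prev);

    return fixed + refblocks + reftable;
}
```

Under these assumptions, qcow2_metadata_clusters(10ULL << 30, 65536) gives 29 clusters (20 L2 tables, 1 L1 cluster, 1 header cluster, 6 refcount blocks, 1 refcount table cluster), i.e. 29 * 65536 = 1,900,544 bytes, which is consistent with the estimate above.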

Patch

diff --git a/block/qcow2.c b/block/qcow2.c
index a5bf205..88e0c71 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -638,9 +638,56 @@  static int get_bits_from_size(size_t size)
     return res;
 }
 
+
+static int preallocate(BlockDriverState *bs)
+{
+    BDRVQcowState *s = bs->opaque;
+    uint64_t cluster_offset;
+    uint64_t nb_sectors;
+    uint64_t offset;
+    int num;
+    QCowL2Meta meta;
+
+    nb_sectors = bdrv_getlength(bs) >> 9;
+    offset = 0;
+
+    while (nb_sectors) {
+        num = MIN(nb_sectors, INT_MAX >> 9);
+        cluster_offset = qcow2_alloc_cluster_offset(bs, offset, 0, num, &num,
+            &meta);
+
+        if (cluster_offset == 0) {
+            return -1;
+        }
+
+        if (qcow2_alloc_cluster_link_l2(bs, cluster_offset, &meta) < 0) {
+            qcow2_free_any_clusters(bs, cluster_offset, meta.nb_clusters);
+            return -1;
+        }
+
+        /* TODO Preallocate data if requested */
+
+        nb_sectors -= num;
+        offset += num << 9;
+    }
+
+    /*
+     * It is expected that the image file is large enough to actually contain
+     * all of the allocated clusters (otherwise we get failing reads after
+     * EOF). So just write some zeros to the last sector.
+     */
+    if (cluster_offset != 0) {
+        uint8_t buf[512];
+        memset(buf, 0, 512);
+        bdrv_write(s->hd, (cluster_offset >> 9) + num - 1, buf, 1);
+    }
+
+    return 0;
+}
+
 static int qcow_create2(const char *filename, int64_t total_size,
                         const char *backing_file, const char *backing_format,
-                        int flags, size_t cluster_size)
+                        int flags, size_t cluster_size, int prealloc)
 {
 
     int fd, header_size, backing_filename_len, l1_size, i, shift, l2_bits;
@@ -762,6 +809,16 @@  static int qcow_create2(const char *filename, int64_t total_size,
     qemu_free(s->refcount_table);
     qemu_free(s->refcount_block);
     close(fd);
+
+    /* Preallocate metadata */
+    if (prealloc) {
+        BlockDriverState *bs;
+        bs = bdrv_new("");
+        bdrv_open(bs, filename, BDRV_O_CACHE_WB);
+        preallocate(bs);
+        bdrv_close(bs);
+    }
+
     return 0;
 }
 
@@ -772,6 +829,7 @@  static int qcow_create(const char *filename, QEMUOptionParameter *options)
     uint64_t sectors = 0;
     int flags = 0;
     size_t cluster_size = 65536;
+    int prealloc = 0;
 
     /* Read out options */
     while (options && options->name) {
@@ -787,12 +845,28 @@  static int qcow_create(const char *filename, QEMUOptionParameter *options)
             if (options->value.n) {
                 cluster_size = options->value.n;
             }
+        } else if (!strcmp(options->name, BLOCK_OPT_PREALLOC)) {
+            if (!options->value.s || !strcmp(options->value.s, "off")) {
+                prealloc = 0;
+            } else if (!strcmp(options->value.s, "metadata")) {
+                prealloc = 1;
+            } else {
+                fprintf(stderr, "Invalid preallocation mode: '%s'\n",
+                    options->value.s);
+                return -EINVAL;
+            }
         }
         options++;
     }
 
+    if (backing_file && prealloc) {
+        fprintf(stderr, "Backing file and preallocation cannot be used at "
+            "the same time\n");
+        return -EINVAL;
+    }
+
     return qcow_create2(filename, sectors, backing_file, backing_fmt, flags,
-        cluster_size);
+        cluster_size, prealloc);
 }
 
 static int qcow_make_empty(BlockDriverState *bs)
@@ -982,6 +1056,11 @@  static QEMUOptionParameter qcow_create_options[] = {
         .type = OPT_SIZE,
         .help = "qcow2 cluster size"
     },
+    {
+        .name = BLOCK_OPT_PREALLOC,
+        .type = OPT_STRING,
+        .help = "Preallocation mode (allowed values: off, metadata)"
+    },
     { NULL }
 };
 
diff --git a/block_int.h b/block_int.h
index 8898d91..0902fd4 100644
--- a/block_int.h
+++ b/block_int.h
@@ -37,6 +37,7 @@ 
 #define BLOCK_OPT_BACKING_FILE  "backing_file"
 #define BLOCK_OPT_BACKING_FMT   "backing_fmt"
 #define BLOCK_OPT_CLUSTER_SIZE  "cluster_size"
+#define BLOCK_OPT_PREALLOC      "preallocation"
 
 typedef struct AIOPool {
     void (*cancel)(BlockDriverAIOCB *acb);