Patchwork [RFC,2/2] qcow2: Use bdrv_(p)write_sync for metadata writes

login
register
mail settings
Submitter Kevin Wolf
Date June 17, 2010, 12:03 p.m.
Message ID <1276776211-11174-3-git-send-email-kwolf@redhat.com>
Download mbox | patch
Permalink /patch/56025/
State New
Headers show

Comments

Kevin Wolf - June 17, 2010, 12:03 p.m.
Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c  |   16 ++++++++--------
 block/qcow2-refcount.c |   18 +++++++++---------
 block/qcow2-snapshot.c |   14 +++++++-------
 block/qcow2.c          |   10 +++++-----
 4 files changed, 29 insertions(+), 29 deletions(-)
Stefan Hajnoczi - June 17, 2010, 2:19 p.m.
On Thu, Jun 17, 2010 at 1:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.

Any performance numbers?  This change is necessary for correctness but
I wonder what the performance impact is for users.

Stefan
Kevin Wolf - June 17, 2010, 2:39 p.m.
Am 17.06.2010 16:19, schrieb Stefan Hajnoczi:
> On Thu, Jun 17, 2010 at 1:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.
> 
> Any performance numbers?  This change is necessary for correctness but
> I wonder what the performance impact is for users.

No numbers yet, but as you say we need to do it anyway. It should
definitely be better than any other option that I can think of
(cache=writethrough or some O_DIRECT|O_DSYNC mode) in that it only hurts
performance when metadata is actually changed. As long as we only write
guest data, there is no difference.

Making it a barrier instead of a flush would probably be better, have
you already had a look at this since we talked about it?

Kevin
Stefan Hajnoczi - June 17, 2010, 7:47 p.m.
On Thu, Jun 17, 2010 at 3:39 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 17.06.2010 16:19, schrieb Stefan Hajnoczi:
>> On Thu, Jun 17, 2010 at 1:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.
>>
>> Any performance numbers?  This change is necessary for correctness but
>> I wonder what the performance impact is for users.
>
> No numbers yet, but as you say we need to do it anyway. It should
> definitely be better than any other option that I can think of
> (cache=writethrough or some O_DIRECT|O_DSYNC mode) in that it only hurts
> performance when metadata is actually changed. As long as we only write
> guest data, there is no difference.
>
> Making it a barrier instead of a flush would probably be better, have
> you already had a look at this since we talked about it?

No I haven't seen a userspace barrier solution.  I'm not actively
working on that at the moment but am keeping an eye out for solutions.
 I'm not sure what to make of sync_file_range(2), the man page
suggests it cannot be relied on since only already allocated blocks
will be flushed appropriately.  Filesystem metadata changes are not
flushed, which is especially problematic for copy-on-write filesystems
(according to the man page).

Thinking about the performance more, I predict that guest OS
installations will slow down but day-to-day usage (especially
modifying already allocated blocks) will be alright for the reasons
you mentioned.

Stefan
Kevin Wolf - June 18, 2010, 7:54 a.m.
Am 17.06.2010 21:47, schrieb Stefan Hajnoczi:
> On Thu, Jun 17, 2010 at 3:39 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 17.06.2010 16:19, schrieb Stefan Hajnoczi:
>>> On Thu, Jun 17, 2010 at 1:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.
>>>
>>> Any performance numbers?  This change is necessary for correctness but
>>> I wonder what the performance impact is for users.
>>
>> No numbers yet, but as you say we need to do it anyway. It should
>> definitely be better than any other option that I can think of
>> (cache=writethrough or some O_DIRECT|O_DSYNC mode) in that it only hurts
>> performance when metadata is actually changed. As long as we only write
>> guest data, there is no difference.
>>
>> Making it a barrier instead of a flush would probably be better, have
>> you already had a look at this since we talked about it?
> 
> No I haven't seen a userspace barrier solution.  I'm not actively
> working on that at the moment but am keeping an eye out for solutions.
>  I'm not sure what to make of sync_file_range(2), the man page
> suggests it cannot be relied on since only already allocated blocks
> will be flushed appropriately.  Filesystem metadata changes are not
> flushed, which is especially problematic for copy-on-write filesystems
> (according to the man page).

qcow2 files are expected to grow, so this probably doesn't help us. At
best we could try to detect if the file grows and decide if to use
sync_file_range or fdatasync.

Of course, it would be better to have a flag or something to include the
necessary metadata, but I have no idea how it's implemented and if this
is easily possible.

> Thinking about the performance more, I predict that guest OS
> installations will slow down but day-to-day usage (especially
> modifying already allocated blocks) will be alright for the reasons
> you mentioned.

Right, it's only growth that will get an impact. The pooint is that it's
not only installation (in most cases you could even use cache=unsafe for
that) but that you start to grow again after taking a snapshot - be it
internal (savevm) or external (backing file). This is where it hurts.

Kevin
Christoph Hellwig - June 18, 2010, 7:58 a.m.
On Fri, Jun 18, 2010 at 09:54:24AM +0200, Kevin Wolf wrote:
> qcow2 files are expected to grow, so this probably doesn't help us. At
> best we could try to detect if the file grows and decide if to use
> sync_file_range or fdatasync.

No.  sync_file_range is always unsafe for use with a real filesystem.
It never even calls into the filesystem and never flushes the disk write
cache.  For our use case it's exactly a no-op.

> Of course, it would be better to have a flag or something to include the
> necessary metadata, but I have no idea how it's implemented and if this
> is easily possible.

We could add a flag to it.  But rather than adding more crap to this
bastard that should never have been added I'd rather add a real
fsync_range system call.

Note that for our use case it doesn't matter anyway.  For an image file
accessed with cache=none there will be no data in the pagecache, and
the writeback of the inode and associated metadata (btree blocks or
indirect blocks) will be same wether we give it a range or not, and same
for the disk cache flush.

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 5760ad6..05cf6c2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -64,7 +64,7 @@  int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
     BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_WRITE_TABLE);
     for(i = 0; i < s->l1_size; i++)
         new_l1_table[i] = cpu_to_be64(new_l1_table[i]);
-    ret = bdrv_pwrite(bs->file, new_l1_table_offset, new_l1_table, new_l1_size2);
+    ret = bdrv_pwrite_sync(bs->file, new_l1_table_offset, new_l1_table, new_l1_size2);
     if (ret != new_l1_size2)
         goto fail;
     for(i = 0; i < s->l1_size; i++)
@@ -74,7 +74,7 @@  int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
     BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_ACTIVATE_TABLE);
     cpu_to_be32w((uint32_t*)data, new_l1_size);
     cpu_to_be64w((uint64_t*)(data + 4), new_l1_table_offset);
-    ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, l1_size), data,sizeof(data));
+    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size), data,sizeof(data));
     if (ret != sizeof(data)) {
         goto fail;
     }
@@ -207,7 +207,7 @@  static int write_l1_entry(BlockDriverState *bs, int l1_index)
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
-    ret = bdrv_pwrite(bs->file, s->l1_table_offset + 8 * l1_start_index,
+    ret = bdrv_pwrite_sync(bs->file, s->l1_table_offset + 8 * l1_start_index,
         buf, sizeof(buf));
     if (ret < 0) {
         return ret;
@@ -263,7 +263,7 @@  static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table)
     }
     /* write the l2 table to the file */
     BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_WRITE);
-    ret = bdrv_pwrite(bs->file, l2_offset, l2_table,
+    ret = bdrv_pwrite_sync(bs->file, l2_offset, l2_table,
         s->l2_size * sizeof(uint64_t));
     if (ret < 0) {
         goto fail;
@@ -413,8 +413,8 @@  static int copy_sectors(BlockDriverState *bs, uint64_t start_sect,
                         &s->aes_encrypt_key);
     }
     BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
-    ret = bdrv_write(bs->file, (cluster_offset >> 9) + n_start,
-                     s->cluster_data, n);
+    ret = bdrv_write_sync(bs->file, (cluster_offset >> 9) + n_start,
+        s->cluster_data, n);
     if (ret < 0)
         return ret;
     return 0;
@@ -631,7 +631,7 @@  uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 
     BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
     l2_table[l2_index] = cpu_to_be64(cluster_offset);
-    if (bdrv_pwrite(bs->file,
+    if (bdrv_pwrite_sync(bs->file,
                     l2_offset + l2_index * sizeof(uint64_t),
                     l2_table + l2_index,
                     sizeof(uint64_t)) != sizeof(uint64_t))
@@ -655,7 +655,7 @@  static int write_l2_entries(BlockDriverState *bs, uint64_t *l2_table,
     int ret;
 
     BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE);
-    ret = bdrv_pwrite(bs->file, l2_offset + start_offset,
+    ret = bdrv_pwrite_sync(bs->file, l2_offset + start_offset,
         &l2_table[l2_start_index], len);
     if (ret < 0) {
         return ret;
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 41e1da9..540bf49 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -44,7 +44,7 @@  static int write_refcount_block(BlockDriverState *bs)
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE);
-    if (bdrv_pwrite(bs->file, s->refcount_block_cache_offset,
+    if (bdrv_pwrite_sync(bs->file, s->refcount_block_cache_offset,
             s->refcount_block_cache, size) != size)
     {
         return -EIO;
@@ -269,7 +269,7 @@  static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
 
     /* Now the new refcount block needs to be written to disk */
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE);
-    ret = bdrv_pwrite(bs->file, new_block, s->refcount_block_cache,
+    ret = bdrv_pwrite_sync(bs->file, new_block, s->refcount_block_cache,
         s->cluster_size);
     if (ret < 0) {
         goto fail_block;
@@ -279,7 +279,7 @@  static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
     if (refcount_table_index < s->refcount_table_size) {
         uint64_t data64 = cpu_to_be64(new_block);
         BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_HOOKUP);
-        ret = bdrv_pwrite(bs->file,
+        ret = bdrv_pwrite_sync(bs->file,
             s->refcount_table_offset + refcount_table_index * sizeof(uint64_t),
             &data64, sizeof(data64));
         if (ret < 0) {
@@ -359,7 +359,7 @@  static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
 
     /* Write refcount blocks to disk */
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE_BLOCKS);
-    ret = bdrv_pwrite(bs->file, meta_offset, new_blocks,
+    ret = bdrv_pwrite_sync(bs->file, meta_offset, new_blocks,
         blocks_clusters * s->cluster_size);
     qemu_free(new_blocks);
     if (ret < 0) {
@@ -372,7 +372,7 @@  static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE_TABLE);
-    ret = bdrv_pwrite(bs->file, table_offset, new_table,
+    ret = bdrv_pwrite_sync(bs->file, table_offset, new_table,
         table_size * sizeof(uint64_t));
     if (ret < 0) {
         goto fail_table;
@@ -387,7 +387,7 @@  static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
     cpu_to_be64w((uint64_t*)data, table_offset);
     cpu_to_be32w((uint32_t*)(data + 8), table_clusters);
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_SWITCH_TABLE);
-    ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, refcount_table_offset),
+    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, refcount_table_offset),
         data, sizeof(data));
     if (ret < 0) {
         goto fail_table;
@@ -444,7 +444,7 @@  static int write_refcount_block_entries(BlockDriverState *bs,
     size = (last_index - first_index) << REFCOUNT_SHIFT;
 
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE_PART);
-    ret = bdrv_pwrite(bs->file,
+    ret = bdrv_pwrite_sync(bs->file,
         refcount_block_offset + (first_index << REFCOUNT_SHIFT),
         &s->refcount_block_cache[first_index], size);
     if (ret < 0) {
@@ -826,7 +826,7 @@  int qcow2_update_snapshot_refcount(BlockDriverState *bs,
                 }
             }
             if (l2_modified) {
-                if (bdrv_pwrite(bs->file,
+                if (bdrv_pwrite_sync(bs->file,
                                 l2_offset, l2_table, l2_size) != l2_size)
                     goto fail;
             }
@@ -850,7 +850,7 @@  int qcow2_update_snapshot_refcount(BlockDriverState *bs,
     if (l1_modified) {
         for(i = 0; i < l1_size; i++)
             cpu_to_be64s(&l1_table[i]);
-        if (bdrv_pwrite(bs->file, l1_table_offset, l1_table,
+        if (bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table,
                         l1_size2) != l1_size2)
             goto fail;
         for(i = 0; i < l1_size; i++)
diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index 2a21c17..4c33996 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -158,24 +158,24 @@  static int qcow_write_snapshots(BlockDriverState *bs)
         h.id_str_size = cpu_to_be16(id_str_size);
         h.name_size = cpu_to_be16(name_size);
         offset = align_offset(offset, 8);
-        if (bdrv_pwrite(bs->file, offset, &h, sizeof(h)) != sizeof(h))
+        if (bdrv_pwrite_sync(bs->file, offset, &h, sizeof(h)) != sizeof(h))
             goto fail;
         offset += sizeof(h);
-        if (bdrv_pwrite(bs->file, offset, sn->id_str, id_str_size) != id_str_size)
+        if (bdrv_pwrite_sync(bs->file, offset, sn->id_str, id_str_size) != id_str_size)
             goto fail;
         offset += id_str_size;
-        if (bdrv_pwrite(bs->file, offset, sn->name, name_size) != name_size)
+        if (bdrv_pwrite_sync(bs->file, offset, sn->name, name_size) != name_size)
             goto fail;
         offset += name_size;
     }
 
     /* update the various header fields */
     data64 = cpu_to_be64(snapshots_offset);
-    if (bdrv_pwrite(bs->file, offsetof(QCowHeader, snapshots_offset),
+    if (bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, snapshots_offset),
                     &data64, sizeof(data64)) != sizeof(data64))
         goto fail;
     data32 = cpu_to_be32(s->nb_snapshots);
-    if (bdrv_pwrite(bs->file, offsetof(QCowHeader, nb_snapshots),
+    if (bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, nb_snapshots),
                     &data32, sizeof(data32)) != sizeof(data32))
         goto fail;
 
@@ -284,7 +284,7 @@  int qcow2_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
     for(i = 0; i < s->l1_size; i++) {
         l1_table[i] = cpu_to_be64(s->l1_table[i]);
     }
-    if (bdrv_pwrite(bs->file, sn->l1_table_offset,
+    if (bdrv_pwrite_sync(bs->file, sn->l1_table_offset,
                     l1_table, s->l1_size * sizeof(uint64_t)) !=
         (s->l1_size * sizeof(uint64_t)))
         goto fail;
@@ -335,7 +335,7 @@  int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
     if (bdrv_pread(bs->file, sn->l1_table_offset,
                    s->l1_table, l1_size2) != l1_size2)
         goto fail;
-    if (bdrv_pwrite(bs->file, s->l1_table_offset,
+    if (bdrv_pwrite_sync(bs->file, s->l1_table_offset,
                     s->l1_table, l1_size2) != l1_size2)
         goto fail;
     for(i = 0;i < s->l1_size; i++) {
diff --git a/block/qcow2.c b/block/qcow2.c
index 1562557..70c5bb3 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -736,7 +736,7 @@  static int qcow2_update_ext_header(BlockDriverState *bs,
         backing_file_offset = sizeof(QCowHeader) + offset;
     }
 
-    ret = bdrv_pwrite(bs->file, sizeof(QCowHeader), buf, ext_size);
+    ret = bdrv_pwrite_sync(bs->file, sizeof(QCowHeader), buf, ext_size);
     if (ret < 0) {
         goto fail;
     }
@@ -745,13 +745,13 @@  static int qcow2_update_ext_header(BlockDriverState *bs,
     uint64_t be_backing_file_offset = cpu_to_be64(backing_file_offset);
     uint32_t be_backing_file_size = cpu_to_be32(backing_file_len);
 
-    ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, backing_file_offset),
+    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, backing_file_offset),
         &be_backing_file_offset, sizeof(uint64_t));
     if (ret < 0) {
         goto fail;
     }
 
-    ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, backing_file_size),
+    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, backing_file_size),
         &be_backing_file_size, sizeof(uint32_t));
     if (ret < 0) {
         goto fail;
@@ -1035,8 +1035,8 @@  static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
 
     /* write updated header.size */
     offset = cpu_to_be64(offset);
-    ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, size),
-                      &offset, sizeof(uint64_t));
+    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, size),
+                           &offset, sizeof(uint64_t));
     if (ret < 0) {
         return ret;
     }