diff mbox

[v3] block/vdi: Use bdrv_flush after metadata updates

Message ID 1430971496-32659-1-git-send-email-phoeagon@gmail.com
State New
Headers show

Commit Message

phoeagon May 7, 2015, 4:04 a.m. UTC
From: phoeagon <phoeagon@gmail.com>

In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.

Only when write is successful that bdrv_flush is called.

Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
---
 block/vdi.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Stefan Hajnoczi May 7, 2015, 10:09 a.m. UTC | #1
On Thu, May 07, 2015 at 12:04:56PM +0800, Zhe Qiu wrote:
> From: phoeagon <phoeagon@gmail.com>
> 
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> 
> Only when write is successful that bdrv_flush is called.
> 
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> ---
>  block/vdi.c | 3 +++
>  1 file changed, 3 insertions(+)

CCing Stefan Weil and Kevin Wolf (see output from
scripts/get_maintainer.pl -f block/vdi.c).

> 
> diff --git a/block/vdi.c b/block/vdi.c
> index 7642ef3..dfe8ade 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
>          logout("will write %u block map sectors starting from entry %u\n",
>                 n_sectors, bmap_first);
>          ret = bdrv_write(bs->file, offset, base, n_sectors);
> +        if (ret >= 0) {
> +            ret = bdrv_flush(bs->file);
> +        }
>      }
>  
>      return ret;
> -- 
> 2.4.0
> 
>
Eric Blake May 7, 2015, 3:05 p.m. UTC | #2
On 05/06/2015 10:04 PM, Zhe Qiu wrote:
> From: phoeagon <phoeagon@gmail.com>
> 
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> 

Please wrap commit comments to be under 80 columns (in fact, under 72 is
good, because 'git log' adds spaces when displaying commit bodies).

Your notation of commit~commit is unusual; ranges in git are usually
spelled commit..commit.  Also, it's okay to abbreviate commit SHAs to 8
bytes or so.  So to shrink your long line, I'd write b0ad5a45..078a458e.

> Only when write is successful that bdrv_flush is called.
> 
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>

Your 'From:' and 'Signed-off-by:' lines have different spellings, which
makes it more confusing when searching for patches by you.  You can add
an entry to .mailmap (as a separate patch) to retro-actively consolidate
entries, but it is better to catch things like this up front before we
even have the problem of separate names.

I suspect you want "Zhe Qiu" as the spelling of your name in both lines
(git config can be taught to set up the preferred name to attribute both
for signing patches and for sending email).  It is also acceptable to
use UTF-8 to spell your name in native characters, or even a combination
of "native name (ascii counterpart)"
Kevin Wolf May 8, 2015, 10:43 a.m. UTC | #3
Am 07.05.2015 um 06:04 hat Zhe Qiu geschrieben:
> From: phoeagon <phoeagon@gmail.com>
> 
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.

The justification for these patches (in 2010!) was more or less that we
didn't know whether the implementations were safe without the flushes.
Many of the flushes added back then have been removed again until today
because they have turned out to be unnecessary.

> Only when write is successful that bdrv_flush is called.
> 
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>

Please describe why VDI needs this flush to be safe. This is best done
by describing a case where not doing the flush would lead to image
corruption in case of a crash.

To my knowledge, the VDI driver is completely safe without it.

Kevin
phoeagon May 8, 2015, 11:50 a.m. UTC | #4
In case of correctness, lacking a sync here does not introduce data
corruption I can think of. But this reduces the volatile window during
which the metadata changes are NOT guaranteed on disk. Without a barrier,
in case of power loss you may end up with the bitmap changes on disk and
not the header block, or vice versa. Neither introduces data corruption
directly, but since VDI doesn't have proper fix mechanism for qemu-img,
once the leak is introduced you have to "convert" to fix it, consuming a
long time if the disk is large.

This patch does not fix the issue entirely, and it does not substitute for
proper check-and-fix implementation. But this should bring about minor
performance degradation (only 1 extra sync per allocation) but greatly
reduces the metadata inconsistency window.

On Fri, May 8, 2015 at 6:43 PM Kevin Wolf <kwolf@redhat.com> wrote:

> Am 07.05.2015 um 06:04 hat Zhe Qiu geschrieben:
> > From: phoeagon <phoeagon@gmail.com>
> >
> > In reference to
> b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2,
> metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to
> succeeding writes.
>
> The justification for these patches (in 2010!) was more or less that we
> didn't know whether the implementations were safe without the flushes.
> Many of the flushes added back then have been removed again until today
> because they have turned out to be unnecessary.
>
> > Only when write is successful that bdrv_flush is called.
> >
> > Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
>
> Please describe why VDI needs this flush to be safe. This is best done
> by describing a case where not doing the flush would lead to image
> corruption in case of a crash.
>
> To my knowledge, the VDI driver is completely safe without it.
>
> Kevin
>
Kevin Wolf May 8, 2015, 12:02 p.m. UTC | #5
Am 08.05.2015 um 13:50 hat phoeagon geschrieben:
> In case of correctness, lacking a sync here does not introduce data corruption
> I can think of. But this reduces the volatile window during which the metadata
> changes are NOT guaranteed on disk. Without a barrier, in case of power loss
> you may end up with the bitmap changes on disk and not the header block, or
> vice versa. Neither introduces data corruption directly, but since VDI doesn't
> have proper fix mechanism for qemu-img, once the leak is introduced you have to
> "convert" to fix it, consuming a long time if the disk is large.

This is true. I'm not sure how big a problem this is in practice,
though.

> This patch does not fix the issue entirely, and it does not substitute for
> proper check-and-fix implementation. But this should bring about minor
> performance degradation (only 1 extra sync per allocation) but greatly reduces
> the metadata inconsistency window.

Did you benchmark this? From the past experience with flushes in qemu
block drivers, one sync per allocation certainly doesn't sound "minor".

What could possibly save us from the worst is that VDI has a relatively
large block size (or rather, that we don't support images with different
block sizes).

Kevin
phoeagon May 8, 2015, 12:56 p.m. UTC | #6
I tested it on host-btrfs, guest-ext4, (connected to guest via virtualized
IDE) with 1G VDI image, testing with "dbench 10":

synced:
Writeback: 39.4852M/s 326.812ms
Unsafe: 432.72M/s 1029.313ms

normal:
Writeback: 39.1884M/s 311.383ms
Unsafe: 413.592M/s 280.951ms

Although I hear that dbench is not a good I/O benchmark (and iozone is just
a little too much hassle) but I'm pretty sure the difference, if any, is
within fluctuation in this case. Under my setting even a raw "dd
of=/dev/sdb" (from within a VM) doesn't have higher than 20M/s even without
this extra write barrier in the proposed patch.

So although what described above is not comprehensive (you can probably
extract the most overhead by deliberately O_DIRECT writing at 1M stride,
512K each, no application level sync in "writeback" cache mode of VDI, what
originally mostly resides in host page cache now must be flushed to hard
disk, probably in an inconvenient order if host FS highly fragmented), I
doubt consecutive raw writes cover several Megabytes or guest-FS based
workload would see much performance overhead after introducing this extra
sync.

On Fri, May 8, 2015 at 8:02 PM Kevin Wolf <kwolf@redhat.com> wrote:

> Am 08.05.2015 um 13:50 hat phoeagon geschrieben:
> > In case of correctness, lacking a sync here does not introduce data
> corruption
> > I can think of. But this reduces the volatile window during which the
> metadata
> > changes are NOT guaranteed on disk. Without a barrier, in case of power
> loss
> > you may end up with the bitmap changes on disk and not the header block,
> or
> > vice versa. Neither introduces data corruption directly, but since VDI
> doesn't
> > have proper fix mechanism for qemu-img, once the leak is introduced you
> have to
> > "convert" to fix it, consuming a long time if the disk is large.
>
> This is true. I'm not sure how big a problem this is in practice,
> though.
>
> > This patch does not fix the issue entirely, and it does not substitute
> for
> > proper check-and-fix implementation. But this should bring about minor
> > performance degradation (only 1 extra sync per allocation) but greatly
> reduces
> > the metadata inconsistency window.
>
> Did you benchmark this? From the past experience with flushes in qemu
> block drivers, one sync per allocation certainly doesn't sound "minor".
>
> What could possibly save us from the worst is that VDI has a relatively
> large block size (or rather, that we don't support images with different
> block sizes).
>
> Kevin
>
Max Reitz May 8, 2015, 1:10 p.m. UTC | #7
On 07.05.2015 06:04, Zhe Qiu wrote:
> From: phoeagon <phoeagon@gmail.com>
>
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> ---
>   block/vdi.c | 3 +++
>   1 file changed, 3 insertions(+)

Reviewed-by: Max Reitz <mreitz@redhat.com>

Thanks!

> diff --git a/block/vdi.c b/block/vdi.c
> index 7642ef3..dfe8ade 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
>           logout("will write %u block map sectors starting from entry %u\n",
>                  n_sectors, bmap_first);
>           ret = bdrv_write(bs->file, offset, base, n_sectors);
> +        if (ret >= 0) {
> +            ret = bdrv_flush(bs->file);
> +        }
>       }
>   
>       return ret;
diff mbox

Patch

diff --git a/block/vdi.c b/block/vdi.c
index 7642ef3..dfe8ade 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -713,6 +713,9 @@  static int vdi_co_write(BlockDriverState *bs,
         logout("will write %u block map sectors starting from entry %u\n",
                n_sectors, bmap_first);
         ret = bdrv_write(bs->file, offset, base, n_sectors);
+        if (ret >= 0) {
+            ret = bdrv_flush(bs->file);
+        }
     }
 
     return ret;