Message ID | 1430971496-32659-1-git-send-email-phoeagon@gmail.com |
---|---|
State | New |
Headers | show |
On Thu, May 07, 2015 at 12:04:56PM +0800, Zhe Qiu wrote: > From: phoeagon <phoeagon@gmail.com> > > In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes. > > Only when write is successful that bdrv_flush is called. > > Signed-off-by: Zhe Qiu <phoeagon@gmail.com> > --- > block/vdi.c | 3 +++ > 1 file changed, 3 insertions(+) CCing Stefan Weil and Kevin Wolf (see output from scripts/get_maintainer.pl -f block/vdi.c). > > diff --git a/block/vdi.c b/block/vdi.c > index 7642ef3..dfe8ade 100644 > --- a/block/vdi.c > +++ b/block/vdi.c > @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs, > logout("will write %u block map sectors starting from entry %u\n", > n_sectors, bmap_first); > ret = bdrv_write(bs->file, offset, base, n_sectors); > + if (ret >= 0) { > + ret = bdrv_flush(bs->file); > + } > } > > return ret; > -- > 2.4.0 > >
On 05/06/2015 10:04 PM, Zhe Qiu wrote: > From: phoeagon <phoeagon@gmail.com> > > In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes. > Please wrap commit comments to be under 80 columns (in fact, under 72 is good, because 'git log' adds spaces when displaying commit bodies). Your notation of commit~commit is unusual; ranges in git are usually spelled commit..commit. Also, it's okay to abbreviate commit SHAs to 8 bytes or so. So to shrink your long line, I'd write b0ad5a45..078a458e. > Only when write is successful that bdrv_flush is called. > > Signed-off-by: Zhe Qiu <phoeagon@gmail.com> Your 'From:' and 'Signed-off-by:' lines have different spellings, which makes it more confusing when searching for patches by you. You can add an entry to .mailmap (as a separate patch) to retro-actively consolidate entries, but it is better to catch things like this up front before we even have the problem of separate names. I suspect you want "Zhe Qiu" as the spelling of your name in both lines (git config can be taught to set up the preferred name to attribute both for signing patches and for sending email). It is also acceptable to use UTF-8 to spell your name in native characters, or even a combination of "native name (ascii counterpart)"
Am 07.05.2015 um 06:04 hat Zhe Qiu geschrieben: > From: phoeagon <phoeagon@gmail.com> > > In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes. The justification for these patches (in 2010!) was more or less that we didn't know whether the implementations were safe without the flushes. Many of the flushes added back then have been removed again until today because they have turned out to be unnecessary. > Only when write is successful that bdrv_flush is called. > > Signed-off-by: Zhe Qiu <phoeagon@gmail.com> Please describe why VDI needs this flush to be safe. This is best done by describing a case where not doing the flush would lead to image corruption in case of a crash. To my knowledge, the VDI driver is completely safe without it. Kevin
In case of correctness, lacking a sync here does not introduce data corruption I can think of. But this reduces the volatile window during which the metadata changes are NOT guaranteed on disk. Without a barrier, in case of power loss you may end up with the bitmap changes on disk and not the header block, or vice versa. Neither introduces data corruption directly, but since VDI doesn't have proper fix mechanism for qemu-img, once the leak is introduced you have to "convert" to fix it, consuming a long time if the disk is large. This patch does not fix the issue entirely, and it does not substitute for proper check-and-fix implementation. But this should bring about minor performance degradation (only 1 extra sync per allocation) but greatly reduces the metadata inconsistency window. On Fri, May 8, 2015 at 6:43 PM Kevin Wolf <kwolf@redhat.com> wrote: > Am 07.05.2015 um 06:04 hat Zhe Qiu geschrieben: > > From: phoeagon <phoeagon@gmail.com> > > > > In reference to > b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, > metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to > succeeding writes. > > The justification for these patches (in 2010!) was more or less that we > didn't know whether the implementations were safe without the flushes. > Many of the flushes added back then have been removed again until today > because they have turned out to be unnecessary. > > > Only when write is successful that bdrv_flush is called. > > > > Signed-off-by: Zhe Qiu <phoeagon@gmail.com> > > Please describe why VDI needs this flush to be safe. This is best done > by describing a case where not doing the flush would lead to image > corruption in case of a crash. > > To my knowledge, the VDI driver is completely safe without it. > > Kevin >
Am 08.05.2015 um 13:50 hat phoeagon geschrieben: > In case of correctness, lacking a sync here does not introduce data corruption > I can think of. But this reduces the volatile window during which the metadata > changes are NOT guaranteed on disk. Without a barrier, in case of power loss > you may end up with the bitmap changes on disk and not the header block, or > vice versa. Neither introduces data corruption directly, but since VDI doesn't > have proper fix mechanism for qemu-img, once the leak is introduced you have to > "convert" to fix it, consuming a long time if the disk is large. This is true. I'm not sure how big a problem this is in practice, though. > This patch does not fix the issue entirely, and it does not substitute for > proper check-and-fix implementation. But this should bring about minor > performance degradation (only 1 extra sync per allocation) but greatly reduces > the metadata inconsistency window. Did you benchmark this? From the past experience with flushes in qemu block drivers, one sync per allocation certainly doesn't sound "minor". What could possibly save us from the worst is that VDI has a relatively large block size (or rather, that we don't support images with different block sizes). Kevin
I tested it on host-btrfs, guest-ext4, (connected to guest via virtualized IDE) with 1G VDI image, testing with "dbench 10": synced: Writeback: 39.4852M/s 326.812ms Unsafe: 432.72M/s 1029.313ms normal: Writeback: 39.1884M/s 311.383ms Unsafe: 413.592M/s 280.951ms Although I hear that dbench is not a good I/O benchmark (and iozone is just a little too much hassle) but I'm pretty sure the difference, if any, is within fluctuation in this case. Under my setting even a raw "dd of=/dev/sdb" (from within a VM) doesn't have higher than 20M/s even without this extra write barrier in the proposed patch. So although what described above is not comprehensive (you can probably extract the most overhead by deliberately O_DIRECT writing at 1M stride, 512K each, no application level sync in "writeback" cache mode of VDI, what originally mostly resides in host page cache now must be flushed to hard disk, probably in an inconvenient order if host FS highly fragmented), I doubt consecutive raw writes cover several Megabytes or guest-FS based workload would see much performance overhead after introducing this extra sync. On Fri, May 8, 2015 at 8:02 PM Kevin Wolf <kwolf@redhat.com> wrote: > Am 08.05.2015 um 13:50 hat phoeagon geschrieben: > > In case of correctness, lacking a sync here does not introduce data > corruption > > I can think of. But this reduces the volatile window during which the > metadata > > changes are NOT guaranteed on disk. Without a barrier, in case of power > loss > > you may end up with the bitmap changes on disk and not the header block, > or > > vice versa. Neither introduces data corruption directly, but since VDI > doesn't > > have proper fix mechanism for qemu-img, once the leak is introduced you > have to > > "convert" to fix it, consuming a long time if the disk is large. > > This is true. I'm not sure how big a problem this is in practice, > though. > > > This patch does not fix the issue entirely, and it does not substitute > for > > proper check-and-fix implementation. But this should bring about minor > > performance degradation (only 1 extra sync per allocation) but greatly > reduces > > the metadata inconsistency window. > > Did you benchmark this? From the past experience with flushes in qemu > block drivers, one sync per allocation certainly doesn't sound "minor". > > What could possibly save us from the worst is that VDI has a relatively > large block size (or rather, that we don't support images with different > block sizes). > > Kevin >
On 07.05.2015 06:04, Zhe Qiu wrote: > From: phoeagon <phoeagon@gmail.com> > > In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes. > > Only when write is successful that bdrv_flush is called. > > Signed-off-by: Zhe Qiu <phoeagon@gmail.com> > --- > block/vdi.c | 3 +++ > 1 file changed, 3 insertions(+) Reviewed-by: Max Reitz <mreitz@redhat.com> Thanks! > diff --git a/block/vdi.c b/block/vdi.c > index 7642ef3..dfe8ade 100644 > --- a/block/vdi.c > +++ b/block/vdi.c > @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs, > logout("will write %u block map sectors starting from entry %u\n", > n_sectors, bmap_first); > ret = bdrv_write(bs->file, offset, base, n_sectors); > + if (ret >= 0) { > + ret = bdrv_flush(bs->file); > + } > } > > return ret;
diff --git a/block/vdi.c b/block/vdi.c index 7642ef3..dfe8ade 100644 --- a/block/vdi.c +++ b/block/vdi.c @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs, logout("will write %u block map sectors starting from entry %u\n", n_sectors, bmap_first); ret = bdrv_write(bs->file, offset, base, n_sectors); + if (ret >= 0) { + ret = bdrv_flush(bs->file); + } } return ret;