
[v4] block/vdi: Use bdrv_flush after metadata updates

Message ID 1431011818-15822-1-git-send-email-phoeagon@gmail.com
State New

Commit Message

phoeagon May 7, 2015, 3:16 p.m. UTC
In reference to b0ad5a45...078a458e, metadata writes to
qcow2/cow/qcow/vpc/vmdk are all synced prior to subsequent writes.

bdrv_flush is called only when the metadata write succeeds.
Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
---
 block/vdi.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Max Reitz May 8, 2015, 1:14 p.m. UTC | #1
On 07.05.2015 17:16, Zhe Qiu wrote:
> In reference to b0ad5a45...078a458e, metadata writes to
> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> ---
>   block/vdi.c | 3 +++
>   1 file changed, 3 insertions(+)

I missed Kevin's arguments before, but I think that adding this is more 
correct than not having it; and when thinking about speed, this is vdi, 
a format supported for compatibility. Writing isn't our most important 
concern anyway (especially considering that it's even possible to disable 
write support for vdi at compile time).

So if we wanted to optimize it, we'd probably have to cache multiple 
allocations, do them at once and then flush afterwards (like the 
metadata cache we have in qcow2?), but that is complicated (like the 
metadata cache in qcow2), certainly too complicated for a format 
supported for compatibility (unless someone really wants to implement it).
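
For illustration only, here is a minimal sketch of what such batching could
look like. The helper and its parameters are hypothetical (nothing like this
exists in block/vdi.c); only bdrv_write() and bdrv_flush() are interfaces the
driver already uses:

/* Hypothetical sketch: write every block-map sector dirtied since the
 * last flush in a single request, then flush once for the whole batch.
 * dirty_first/dirty_last and bmap_base_sector are made-up parameters. */
static int vdi_writeback_bmap_range(BlockDriverState *bs, uint8_t *bmap_bytes,
                                    int64_t bmap_base_sector,
                                    uint32_t dirty_first, uint32_t dirty_last)
{
    int n_sectors = dirty_last - dirty_first + 1;
    int ret;

    ret = bdrv_write(bs->file, bmap_base_sector + dirty_first,
                     bmap_bytes + (uint64_t)dirty_first * SECTOR_SIZE,
                     n_sectors);
    if (ret >= 0) {
        /* One flush per batch instead of one flush per allocation */
        ret = bdrv_flush(bs->file);
    }
    return ret;
}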

Reviewed-by: Max Reitz <mreitz@redhat.com>

>
> diff --git a/block/vdi.c b/block/vdi.c
> index 7642ef3..dfe8ade 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
>           logout("will write %u block map sectors starting from entry %u\n",
>                  n_sectors, bmap_first);
>           ret = bdrv_write(bs->file, offset, base, n_sectors);
> +        if (ret >= 0) {
> +            ret = bdrv_flush(bs->file);
> +        }
>       }
>   
>       return ret;
Kevin Wolf May 8, 2015, 1:55 p.m. UTC | #2
Am 08.05.2015 um 15:14 hat Max Reitz geschrieben:
> On 07.05.2015 17:16, Zhe Qiu wrote:
> >In reference to b0ad5a45...078a458e, metadata writes to
> >qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> >
> >Only when write is successful that bdrv_flush is called.
> >
> >Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> >---
> >  block/vdi.c | 3 +++
> >  1 file changed, 3 insertions(+)
> 
> I missed Kevin's arguments before, but I think that adding this is
> more correct than not having it; and when thinking about speed, this
> is vdi, a format supported for compatibility.

If you use it only as a convert target, you probably care more about
speed than about leaks in case of a host crash.

> So if we wanted to optimize it, we'd probably have to cache multiple
> allocations, do them at once and then flush afterwards (like the
> metadata cache we have in qcow2?)

That would defeat the purpose of this patch which aims at having
metadata and data written out almost at the same time. On the other
hand, fully avoiding the problem instead of just making the window
smaller would require a journal, which VDI just doesn't have.

I'm not convinced of this patch, but I'll defer to Stefan Weil as the
VDI maintainer.

Kevin
phoeagon May 8, 2015, 2:43 p.m. UTC | #3
Then I would guess the same reasoning should apply to VMDK/VPC as well...
Their metadata update protocol is not atomic either, and a sync after a
metadata update doesn't fix the whole problem in theory either. Yet the
metadata sync patches, dating back to 2010, are still there. Removing those
write barriers should also be a performance boost, if conversion
performance is our major concern.

I think when converting images, one can always opt for "cache=unsafe" to
avoid potential performance degradation from a conservative cache mode (it
should really be the default for qemu-img convert, but I don't know whether
it is), so conversion performance shouldn't be a reason to sacrifice
VM-runtime consistency.

On Fri, May 8, 2015 at 9:55 PM Kevin Wolf <kwolf@redhat.com> wrote:

> Am 08.05.2015 um 15:14 hat Max Reitz geschrieben:
> > On 07.05.2015 17:16, Zhe Qiu wrote:
> > >In reference to b0ad5a45...078a458e, metadata writes to
> > >qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> > >
> > >Only when write is successful that bdrv_flush is called.
> > >
> > >Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> > >---
> > >  block/vdi.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> >
> > I missed Kevin's arguments before, but I think that adding this is
> > more correct than not having it; and when thinking about speed, this
> > is vdi, a format supported for compatibility.
>
> If you use it only as a convert target, you probably care more about
> speed than about leaks in case of a host crash.
>
> > So if we wanted to optimize it, we'd probably have to cache multiple
> > allocations, do them at once and then flush afterwards (like the
> > metadata cache we have in qcow2?)
>
> That would defeat the purpose of this patch which aims at having
> metadata and data written out almost at the same time. On the other
> hand, fully avoiding the problem instead of just making the window
> smaller would require a journal, which VDI just doesn't have.
>
> I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> VDI maintainer.
>
> Kevin
>
Stefan Weil May 8, 2015, 9:26 p.m. UTC | #4
Am 08.05.2015 um 15:55 schrieb Kevin Wolf:
> Am 08.05.2015 um 15:14 hat Max Reitz geschrieben:
>> On 07.05.2015 17:16, Zhe Qiu wrote:
>>> In reference to b0ad5a45...078a458e, metadata writes to
>>> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>>>
>>> Only when write is successful that bdrv_flush is called.
>>>
>>> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
>>> ---
>>>   block/vdi.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>> I missed Kevin's arguments before, but I think that adding this is
>> more correct than not having it; and when thinking about speed, this
>> is vdi, a format supported for compatibility.
> If you use it only as a convert target, you probably care more about
> speed than about leaks in case of a host crash.
>
>> So if we wanted to optimize it, we'd probably have to cache multiple
>> allocations, do them at once and then flush afterwards (like the
>> metadata cache we have in qcow2?)
> That would defeat the purpose of this patch which aims at having
> metadata and data written out almost at the same time. On the other
> hand, fully avoiding the problem instead of just making the window
> smaller would require a journal, which VDI just doesn't have.
>
> I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> VDI maintainer.
>
> Kevin

Thanks for asking. I share your concerns regarding reduced performance
caused by bdrv_flush. Conversions to VDI will take longer (how much?),
and installation of an OS on a new VDI disk image will also be slower,
because those are the typical scenarios where disk usage grows.

@phoeagon: Did the benchmark which you used allocate additional disk
storage? If not, or if it only allocated once and then spent its time on
already allocated blocks, that benchmark was not valid for this case.

On the other hand I don't see a need for the flushing because the kind
of failures (power failure) and their consequences seem to be acceptable
for typical VDI usage, namely either image conversion or tests with
existing images.

That's why I'd prefer not to use bdrv_flush here. Could we make
bdrv_flush optional (either generally or for cases like this one) so
both people who prefer speed and people who would want
bdrv_flush to decrease the likelihood of inconsistencies can be
satisfied?
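
As a rough sketch of one shape such a knob could take (the "sync-bmap"
option name and the s->sync_bmap field are made up, not part of the real
driver; the QemuOpts calls follow the usual pattern other block drivers use
for runtime options):

static QemuOptsList vdi_runtime_opts = {
    .name = "vdi",
    .head = QTAILQ_HEAD_INITIALIZER(vdi_runtime_opts.head),
    .desc = {
        {
            .name = "sync-bmap",   /* hypothetical option name */
            .type = QEMU_OPT_BOOL,
            .help = "Flush the image file after block map updates",
        },
        { /* end of list */ }
    },
};

/* In vdi_open(), something along these lines
 * (declarations of opts, local_err etc. omitted): */
    opts = qemu_opts_create(&vdi_runtime_opts, NULL, 0, &error_abort);
    qemu_opts_absorb_qdict(opts, options, &local_err);
    s->sync_bmap = qemu_opt_get_bool(opts, "sync-bmap", true);
    qemu_opts_del(opts);

/* ...and in vdi_co_write(), gate the new flush on it: */
    ret = bdrv_write(bs->file, offset, base, n_sectors);
    if (ret >= 0 && s->sync_bmap) {
        ret = bdrv_flush(bs->file);
    }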

Stefan
phoeagon May 9, 2015, 3:54 a.m. UTC | #5
Thanks. Dbench does not logically allocate new disk space all the time,
because it's an FS-level benchmark that creates files and deletes them.
Therefore it also depends on the guest FS; say, a btrfs guest FS allocates
about 1.8x the space that EXT4 does, due to its COW nature. It does cause
the FS to allocate some space during about 1/3 of the test duration, I
think. But this does not mitigate it too much, because an FS often writes
in strides rather than consecutively, which causes write amplification at
allocation time.

So I tested it with qemu-img convert from a 400M raw file:
zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe
-O vdi /run/shm/rand 1.vdi

real 0m0.402s
user 0m0.206s
sys 0m0.202s
zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t
writeback -O vdi /run/shm/rand 1.vdi

real 0m8.678s
user 0m0.169s
sys 0m0.500s
zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi
/run/shm/rand 1.vdi

real 0m4.320s
user 0m0.148s
sys 0m0.471s
zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand
1.vdi
real 0m0.489s
user 0m0.173s
sys 0m0.325s

zheq-PC sdb # time qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi

real 0m0.515s
user 0m0.168s
sys 0m0.357s
zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -O vdi
/run/shm/rand 1.vdi

real 0m0.431s
user 0m0.192s
sys 0m0.248s

Although 400M is not a giant file, it does show the trend.
As you can see, when there are drastic allocation needs, and when there is
no extra buffering from a virtualized host, the throughput drops by about
50%. But it still has no effect on "unsafe" mode, as predicted. Also, I
believe that expecting to use a half-converted image is seldom the use
case, while host crashes and power loss are not so unimaginable.
It looks like qemu-img convert uses "unsafe" by default as well, so even
novice "qemu-img convert" users are not likely to notice any performance
degradation.

I have not yet tried guest OS installation on top, but I guess a new flag
for one-time faster OS installation is not likely to be useful, and
"cache=unsafe" already does the trick.


On Sat, May 9, 2015 at 5:26 AM Stefan Weil <sw@weilnetz.de> wrote:

> Am 08.05.2015 um 15:55 schrieb Kevin Wolf:
> > Am 08.05.2015 um 15:14 hat Max Reitz geschrieben:
> >> On 07.05.2015 17:16, Zhe Qiu wrote:
> >>> In reference to b0ad5a45...078a458e, metadata writes to
> >>> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> >>>
> >>> Only when write is successful that bdrv_flush is called.
> >>>
> >>> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> >>> ---
> >>>   block/vdi.c | 3 +++
> >>>   1 file changed, 3 insertions(+)
> >> I missed Kevin's arguments before, but I think that adding this is
> >> more correct than not having it; and when thinking about speed, this
> >> is vdi, a format supported for compatibility.
> > If you use it only as a convert target, you probably care more about
> > speed than about leaks in case of a host crash.
> >
> >> So if we wanted to optimize it, we'd probably have to cache multiple
> >> allocations, do them at once and then flush afterwards (like the
> >> metadata cache we have in qcow2?)
> > That would defeat the purpose of this patch which aims at having
> > metadata and data written out almost at the same time. On the other
> > hand, fully avoiding the problem instead of just making the window
> > smaller would require a journal, which VDI just doesn't have.
> >
> > I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> > VDI maintainer.
> >
> > Kevin
>
> Thanks for asking. I share your concerns regarding reduced performance
> caused by bdrv_flush. Conversions to VDI will take longer (how much?),
> and also installation of an OS on a new VDI disk image will be slower
> because that are the typical scenarios where the disk usage grows.
>
> @phoeagon: Did the benchmark which you used allocate additional disk
> storage? If not or if it only allocated once and then spent some time
> on already allocated blocks, that benchmark was not valid for this case.
>
> On the other hand I don't see a need for the flushing because the kind
> of failures (power failure) and their consequences seem to be acceptable
> for typical VDI usage, namely either image conversion or tests with
> existing images.
>
> That's why I'd prefer not to use bdrv_flush here. Could we make
> bdrv_flush optional (either generally or for cases like this one) so
> both people who prefer speed and people who would want
> bdrv_flush to decrease the likelihood of inconsistencies can be
> satisfied?
>
> Stefan
>
>
phoeagon May 9, 2015, 3:59 a.m. UTC | #6
BTW, how do you usually measure the time it takes to install a Linux
distro? Most distro ISOs do NOT have unattended installation in place.
(True, I can bake my own ISOs for this...) But do you have any ISOs made
ready for this purpose?

On Sat, May 9, 2015 at 11:54 AM phoeagon <phoeagon@gmail.com> wrote:

> Thanks. Dbench does not logically allocate new disk space all the time,
> because it's a FS level benchmark that creates file and deletes them.
> Therefore it also depends on the guest FS, say, a btrfs guest FS allocates
> about 1.8x space of that from EXT4, due to its COW nature. It does cause
> the FS to allocate some space during about 1/3 of the test duration I
> think. But this does not mitigate it too much because a FS often writes in
> a stride rather than consecutively, which causes write amplification at
> allocation times.
>
> So I tested it with qemu-img convert from a 400M raw file:
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe
> -O vdi /run/shm/rand 1.vdi
>
> real 0m0.402s
> user 0m0.206s
> sys 0m0.202s
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t
> writeback -O vdi /run/shm/rand 1.vdi
>
> real 0m8.678s
> user 0m0.169s
> sys 0m0.500s
> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi
> /run/shm/rand 1.vdi
>
> real 0m4.320s
> user 0m0.148s
> sys 0m0.471s
> zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand
> 1.vdi
> real 0m0.489s
> user 0m0.173s
> sys 0m0.325s
>
> zheq-PC sdb # time qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi
>
> real 0m0.515s
> user 0m0.168s
> sys 0m0.357s
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -O vdi
> /run/shm/rand 1.vdi
>
> real 0m0.431s
> user 0m0.192s
> sys 0m0.248s
>
> Although 400M is not a giant file, it does show the trend.
> As you can see when there's drastic allocation needs, and when there no
> extra buffering from a virtualized host, the throughput drops about 50%.
> But still it has no effect on "unsafe" mode, as predicted. Also I believe
> that expecting to use a half-converted image is seldom the use case, while
> host crash and power loss are not so unimaginable.
> Looks like qemu-img convert is using "unsafe" as default as well, so even
> novice "qemu-img convert" users are not likely to find performance
> degradation.
>
> I have not yet tried guest OS installation on top, but I guess a new flag
> for one-time faster OS installation is not likely useful, and
> "cache=unsafe" already does the trick.
>
>
> On Sat, May 9, 2015 at 5:26 AM Stefan Weil <sw@weilnetz.de> wrote:
>
>> Am 08.05.2015 um 15:55 schrieb Kevin Wolf:
>> > Am 08.05.2015 um 15:14 hat Max Reitz geschrieben:
>> >> On 07.05.2015 17:16, Zhe Qiu wrote:
>> >>> In reference to b0ad5a45...078a458e, metadata writes to
>> >>> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>> >>>
>> >>> Only when write is successful that bdrv_flush is called.
>> >>>
>> >>> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
>> >>> ---
>> >>>   block/vdi.c | 3 +++
>> >>>   1 file changed, 3 insertions(+)
>> >> I missed Kevin's arguments before, but I think that adding this is
>> >> more correct than not having it; and when thinking about speed, this
>> >> is vdi, a format supported for compatibility.
>> > If you use it only as a convert target, you probably care more about
>> > speed than about leaks in case of a host crash.
>> >
>> >> So if we wanted to optimize it, we'd probably have to cache multiple
>> >> allocations, do them at once and then flush afterwards (like the
>> >> metadata cache we have in qcow2?)
>> > That would defeat the purpose of this patch which aims at having
>> > metadata and data written out almost at the same time. On the other
>> > hand, fully avoiding the problem instead of just making the window
>> > smaller would require a journal, which VDI just doesn't have.
>> >
>> > I'm not convinced of this patch, but I'll defer to Stefan Weil as the
>> > VDI maintainer.
>> >
>> > Kevin
>>
>> Thanks for asking. I share your concerns regarding reduced performance
>> caused by bdrv_flush. Conversions to VDI will take longer (how much?),
>> and also installation of an OS on a new VDI disk image will be slower
>> because that are the typical scenarios where the disk usage grows.
>>
>> @phoeagon: Did the benchmark which you used allocate additional disk
>> storage? If not or if it only allocated once and then spent some time
>> on already allocated blocks, that benchmark was not valid for this case.
>>
>> On the other hand I don't see a need for the flushing because the kind
>> of failures (power failure) and their consequences seem to be acceptable
>> for typical VDI usage, namely either image conversion or tests with
>> existing images.
>>
>> That's why I'd prefer not to use bdrv_flush here. Could we make
>> bdrv_flush optional (either generally or for cases like this one) so
>> both people who prefer speed and people who would want
>> bdrv_flush to decrease the likelihood of inconsistencies can be
>> satisfied?
>>
>> Stefan
>>
>>
Stefan Weil May 9, 2015, 6:39 a.m. UTC | #7
Am 09.05.2015 um 05:59 schrieb phoeagon:
> BTW, how do you usually measure the time to install a Linux distro 
> within? Most distros ISOs do NOT have unattended installation ISOs in 
> place. (True I can bake my own ISOs for this...) But do you have any 
> ISOs made ready for this purpose?
>
> On Sat, May 9, 2015 at 11:54 AM phoeagon <phoeagon@gmail.com 
> <mailto:phoeagon@gmail.com>> wrote:
>
>     Thanks. Dbench does not logically allocate new disk space all the
>     time, because it's a FS level benchmark that creates file and
>     deletes them. Therefore it also depends on the guest FS, say, a
>     btrfs guest FS allocates about 1.8x space of that from EXT4, due
>     to its COW nature. It does cause the FS to allocate some space
>     during about 1/3 of the test duration I think. But this does not
>     mitigate it too much because a FS often writes in a stride rather
>     than consecutively, which causes write amplification at allocation
>     times.
>
>     So I tested it with qemu-img convert from a 400M raw file:
>     zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t
>     unsafe -O vdi /run/shm/rand 1.vdi
>
>     real 0m0.402s
>     user 0m0.206s
>     sys 0m0.202s
>     zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t
>     writeback -O vdi /run/shm/rand 1.vdi
>


I assume that the target file /run/shm/rand 1.vdi is not on a physical disk.
Then flushing data will be fast. For real hard disks (not SSDs) the 
situation is
different: the r/w heads of the hard disk have to move between data location
and the beginning of the written file where the metadata is written, so
I expect a larger effect there.

For measuring installation time of an OS, I'd take a reproducible 
installation source (hard disk or DVD, no network connection) and take
the time for those parts of the installation where many packages are
installed without any user interaction. For Linux you won't need a stop
watch, because the package directories in /usr/share/doc have nice
timestamps.

Stefan
phoeagon May 9, 2015, 7:41 a.m. UTC | #8
Full Linux Mint (17.1) Installation with writeback:

With VDI extra sync 4min35s
Vanilla: 3min17s

which is consistent with 'qemu-img convert' (slightly less overhead,
because some phases of the installation are actually CPU bound).
Still much faster than other "sync-after-metadata" formats like VPC
(vanilla VPC: 7min43s).
The thing is, whoever needs to set up a new Linux system every day
probably has pre-installed images to start with, and others just don't
install an OS every day.



On Sat, May 9, 2015 at 2:39 PM Stefan Weil <sw@weilnetz.de> wrote:

> Am 09.05.2015 um 05:59 schrieb phoeagon:
>
> BTW, how do you usually measure the time to install a Linux distro within?
> Most distros ISOs do NOT have unattended installation ISOs in place. (True
> I can bake my own ISOs for this...) But do you have any ISOs made ready for
> this purpose?
>
> On Sat, May 9, 2015 at 11:54 AM phoeagon <phoeagon@gmail.com> wrote:
>
>> Thanks. Dbench does not logically allocate new disk space all the time,
>> because it's a FS level benchmark that creates file and deletes them.
>> Therefore it also depends on the guest FS, say, a btrfs guest FS allocates
>> about 1.8x space of that from EXT4, due to its COW nature. It does cause
>> the FS to allocate some space during about 1/3 of the test duration I
>> think. But this does not mitigate it too much because a FS often writes in
>> a stride rather than consecutively, which causes write amplification at
>> allocation times.
>>
>> So I tested it with qemu-img convert from a 400M raw file:
>> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe
>> -O vdi /run/shm/rand 1.vdi
>>
>>  real 0m0.402s
>> user 0m0.206s
>> sys 0m0.202s
>> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t
>> writeback -O vdi /run/shm/rand 1.vdi
>>
>
>
> I assume that the target file /run/shm/rand 1.vdi is not on a physical
> disk.
> Then flushing data will be fast. For real hard disks (not SSDs) the
> situation is
> different: the r/w heads of the hard disk have to move between data
> location
> and the beginning of the written file where the metadata is written, so
> I expect a larger effect there.
>
> For measuring installation time of an OS, I'd take a reproducible
> installation
> source (hard disk or DVD, no network connection) and take the time for
> those parts of the installation where many packets are installed without
> any user interaction. For Linux you won't need a stop watch, because the
> packet directories in /usr/share/doc have nice timestamps.
>
>
> Stefan
>
>
Paolo Bonzini May 10, 2015, 3:01 p.m. UTC | #9
On 09/05/2015 05:54, phoeagon wrote:
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
> 
> real 0m8.678s
> user 0m0.169s
> sys 0m0.500s
> 
> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
> real 0m4.320s
> user 0m0.148s
> sys 0m0.471s

This means that 3.83 seconds are spent when bdrv_close() calls
bdrv_flush().  That's the only difference between writeback
and unsafe in qemu-img convert.

The remaining part of the time (4.85 seconds instead of 0.49
seconds) means that, at least on your hardware, sequential writes
to unallocated space become 10 times slower with your patch.

Since the default qemu-img convert case isn't slowed down, I
would think that correctness trumps performance.  Nevertheless,
it's a huge difference.
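
(For readers following along, those figures are presumably derived from the
timings above roughly as follows:

    vanilla writeback - vanilla unsafe   = 4.320s - 0.489s ~= 3.83s
        -> cost of the single bdrv_flush() issued by bdrv_close()
    patched writeback - that final flush = 8.678s - 3.83s  ~= 4.85s
        -> the sequential-write phase with the per-allocation flushes
    4.85s / 0.49s ~= 10x slower for the write phase itself)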

Paolo

> zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
> real	0m0.489s
> user	0m0.173s
> sys	0m0.325s
Stefan Weil May 10, 2015, 4:02 p.m. UTC | #10
Am 10.05.2015 um 17:01 schrieb Paolo Bonzini:
>
> On 09/05/2015 05:54, phoeagon wrote:
>> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>>
>> real 0m8.678s
>> user 0m0.169s
>> sys 0m0.500s
>>
>> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>> real 0m4.320s
>> user 0m0.148s
>> sys 0m0.471s
> This means that 3.83 seconds are spent when bdrv_close() calls
> bdrv_flush().  That's the only difference between writeback
> and unsafe in qemu-img convert.
>
> The remaining part of the time (4.85 seconds instead of 0.49
> seconds) means that, at least on your hardware, sequential writes
> to unallocated space become 10 times slower with your patch.
>
> Since the default qemu-img convert case isn't slowed down, I
> would think that correctness trumps performance.  Nevertheless,
> it's a huge difference.
>
> Paolo

I doubt that the convert case isn't slowed down.

Writing to a tmpfs, as was obviously done for the test, is not a typical
use case.
With real hard disks I expect a significant slowdown.

Stefan
phoeagon May 10, 2015, 4:05 p.m. UTC | #11
Just for clarity, I was not writing to tmpfs. I was READING from tmpfs. I
was writing to a path named 'sdb' (as you see in the prompt) which is a
btrfs on an SSD Drive. I don't have an HDD to test on though.

On Mon, May 11, 2015 at 12:02 AM Stefan Weil <sw@weilnetz.de> wrote:

> Am 10.05.2015 um 17:01 schrieb Paolo Bonzini:
> >
> > On 09/05/2015 05:54, phoeagon wrote:
> >> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t
> writeback -O vdi /run/shm/rand 1.vdi
> >>
> >> real 0m8.678s
> >> user 0m0.169s
> >> sys 0m0.500s
> >>
> >> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi
> /run/shm/rand 1.vdi
> >> real 0m4.320s
> >> user 0m0.148s
> >> sys 0m0.471s
> > This means that 3.83 seconds are spent when bdrv_close() calls
> > bdrv_flush().  That's the only difference between writeback
> > and unsafe in qemu-img convert.
> >
> > The remaining part of the time (4.85 seconds instead of 0.49
> > seconds) means that, at least on your hardware, sequential writes
> > to unallocated space become 10 times slower with your patch.
> >
> > Since the default qemu-img convert case isn't slowed down, I
> > would think that correctness trumps performance.  Nevertheless,
> > it's a huge difference.
> >
> > Paolo
>
> I doubt that the convert case isn't slowed down.
>
> Writing to a tmpfs as it was obviously done for the test is not a
> typical use case.
> With real hard disks I expect a significant slowdown.
>
> Stefan
>
>
Paolo Bonzini May 10, 2015, 4:10 p.m. UTC | #12
On 10/05/2015 18:02, Stefan Weil wrote:
>> Since the default qemu-img convert case isn't slowed down, I
>> would think that correctness trumps performance.  Nevertheless,
>> it's a huge difference.
> 
> I doubt that the convert case isn't slowed down.

The *default* convert case isn't slowed down because "qemu-img convert"
defaults to the "unsafe" cache mode.

The *non-default* convert case with flushes was slowed down indeed: 2x
in total (if you include the final flush done by bdrv_close), and 10x if
you only consider the sequential write part of convert.

Paolo

> Writing to a tmpfs as it was obviously done for the test is not a
> typical use case.
> With real hard disks I expect a significant slowdown.
> 
> Stefan
Stefan Weil May 10, 2015, 4:26 p.m. UTC | #13
Am 10.05.2015 um 18:10 schrieb Paolo Bonzini:
> On 10/05/2015 18:02, Stefan Weil wrote:
>>> Since the default qemu-img convert case isn't slowed down, I
>>> would think that correctness trumps performance.  Nevertheless,
>>> it's a huge difference.
>> I doubt that the convert case isn't slowed down.
> The *default* convert case isn't slowed down because "qemu-img convert"
> defaults to the "unsafe" cache mode.
>
> The *non-default* convert case with flushes was slowed down indeed: 2x
> in total (if you include the final flush done by bdrv_close), and 10x if
> you only consider the sequential write part of convert.
>
> Paolo


For those who might be interested:

The relevant GPL source code from VirtualBox is available here:

https://www.virtualbox.org/browser/vbox/trunk/src/VBox/Storage

If I interpret that code correctly, blocks are normally written 
asynchronously,
but changes of metadata (new block allocation) are written synchronously.

See file VDI.cpp (function vdiBlockAllocUpdate) and VD.cpp 
(vdIOIntWriteMeta).

Stefan
phoeagon May 10, 2015, 5:14 p.m. UTC | #14
I'm not familiar with the VirtualBox code base, but it looks like
"vdIOIntWriteMeta" can work both synchronously and asynchronously, and
"vdiBlockAllocUpdate" looks async to me. Frankly, skimming through the code
for 5 min doesn't enlighten me too much on its detailed implementation, but
it looks like at least VirtualBox has a VDI repair that fixes block leaks
relatively easily.

I would agree that a more complete implementation of VDI check-and-repair
might be better in this particular situation. I'm not sure if there are
other cases where a flush after a metadata update might be better, but it
doesn't look like qemu-img auto repair is coming to other writable image
formats in the near future.

Also, I think that memory exhaustion and consequent page cache eviction are
not too uncommon on computers not originally designed to run VMs. Many
laptops are still stuck with 4GB of memory, and there seem to be widespread
instructions on tuning down the swappiness to favor dropping the page cache
rather than swapping out memory, all of which adds to the odds of metadata
inconsistency.

On Mon, May 11, 2015 at 12:26 AM Stefan Weil <sw@weilnetz.de> wrote:

> Am 10.05.2015 um 18:10 schrieb Paolo Bonzini:
> > On 10/05/2015 18:02, Stefan Weil wrote:
> >>> Since the default qemu-img convert case isn't slowed down, I
> >>> would think that correctness trumps performance.  Nevertheless,
> >>> it's a huge difference.
> >> I doubt that the convert case isn't slowed down.
> > The *default* convert case isn't slowed down because "qemu-img convert"
> > defaults to the "unsafe" cache mode.
> >
> > The *non-default* convert case with flushes was slowed down indeed: 2x
> > in total (if you include the final flush done by bdrv_close), and 10x if
> > you only consider the sequential write part of convert.
> >
> > Paolo
>
>
> For those who might be interested:
>
> The relevant GPL source code from VirtualBox is available here:
>
> https://www.virtualbox.org/browser/vbox/trunk/src/VBox/Storage
>
> If I interpret that code correctly, blocks are normally written
> asynchronously,
> but changes of metadata (new block allocation) are written synchronously.
>
> See file VDI.cpp (function vdiBlockAllocUpdate) and VD.cpp
> (vdIOIntWriteMeta).
>
> Stefan
>
>

Patch

diff --git a/block/vdi.c b/block/vdi.c
index 7642ef3..dfe8ade 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -713,6 +713,9 @@  static int vdi_co_write(BlockDriverState *bs,
         logout("will write %u block map sectors starting from entry %u\n",
                n_sectors, bmap_first);
         ret = bdrv_write(bs->file, offset, base, n_sectors);
+        if (ret >= 0) {
+            ret = bdrv_flush(bs->file);
+        }
     }
 
     return ret;