Message ID: 1431011818-15822-1-git-send-email-phoeagon@gmail.com
State: New
On 07.05.2015 17:16, Zhe Qiu wrote:
> In reference to b0ad5a45...078a458e, metadata writes to
> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> Only when the write is successful is bdrv_flush called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> ---
>  block/vdi.c | 3 +++
>  1 file changed, 3 insertions(+)

I missed Kevin's arguments before, but I think that adding this is more
correct than not having it; and when thinking about speed, this is vdi, a
format supported for compatibility. Writing isn't our most important
concern anyway (especially considering that it's even possible to disable
write support for vdi at compile time).

So if we wanted to optimize it, we'd probably have to cache multiple
allocations, do them at once and then flush afterwards (like the metadata
cache we have in qcow2?), but that is complicated (like the metadata cache
in qcow2), certainly too complicated for a format supported for
compatibility (unless someone really wants to implement it).

Reviewed-by: Max Reitz <mreitz@redhat.com>

> diff --git a/block/vdi.c b/block/vdi.c
> index 7642ef3..dfe8ade 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
>          logout("will write %u block map sectors starting from entry %u\n",
>                 n_sectors, bmap_first);
>          ret = bdrv_write(bs->file, offset, base, n_sectors);
> +        if (ret >= 0) {
> +            ret = bdrv_flush(bs->file);
> +        }
>      }
>
>      return ret;
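The correctness argument here can be sketched in a few lines of Python (a toy model, not QEMU code; all names below are invented for illustration). Without the flush, a guest write that vdi_co_write has already reported as successful may still sit entirely in the host page cache and vanish on a power loss; with the flush, everything acknowledged to the guest is durable:

```python
# Toy model of a write-back cache over a VDI-like image (illustration
# only; the classes and keys here are invented, not QEMU structures).
class Image:
    def __init__(self):
        self.disk = {}    # what is durably on disk
        self.cache = {}   # dirty writes still in the host page cache

    def write(self, key, value):
        self.cache[key] = value       # buffered, not yet durable

    def flush(self):
        self.disk.update(self.cache)  # like bdrv_flush(): make it durable
        self.cache.clear()

    def crash(self):
        self.cache.clear()            # power loss: dirty cache is gone

def guest_write(img, block, data, flush_after_alloc):
    img.write(("data", block), data)
    img.write(("bmap", block), "allocated")  # block-map (metadata) update
    if flush_after_alloc:
        img.flush()                          # the patched behaviour
    return 0  # write reported successful to the guest

# Without the flush, a write acknowledged to the guest can vanish
# entirely on a host crash.
img = Image()
guest_write(img, 0, "payload", flush_after_alloc=False)
img.crash()
lost = ("bmap", 0) not in img.disk

# With the flush, the acknowledged write survives the crash.
img2 = Image()
guest_write(img2, 0, "payload", flush_after_alloc=True)
img2.crash()
kept = img2.disk.get(("data", 0)) == "payload"

print(lost, kept)  # True True
```

This only models durability of acknowledged writes; it deliberately ignores the partial-ordering case Kevin raises below, where data and metadata can still hit the disk in different orders within the unflushed window.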
Am 08.05.2015 um 15:14 hat Max Reitz geschrieben:
> On 07.05.2015 17:16, Zhe Qiu wrote:
> > In reference to b0ad5a45...078a458e, metadata writes to
> > qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> I missed Kevin's arguments before, but I think that adding this is more
> correct than not having it; and when thinking about speed, this is vdi,
> a format supported for compatibility.

If you use it only as a convert target, you probably care more about
speed than about leaks in case of a host crash.

> So if we wanted to optimize it, we'd probably have to cache multiple
> allocations, do them at once and then flush afterwards (like the
> metadata cache we have in qcow2?)

That would defeat the purpose of this patch, which aims at having
metadata and data written out almost at the same time. On the other hand,
fully avoiding the problem instead of just making the window smaller
would require a journal, which VDI just doesn't have.

I'm not convinced of this patch, but I'll defer to Stefan Weil as the
VDI maintainer.

Kevin
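Kevin's journal remark can be made concrete with a toy sketch (again invented names, nothing resembling an actual on-disk format): an allocation touches two structures, and a journal makes the pair atomic across a crash by recording the intent first and replaying it on recovery.

```python
# Toy write-ahead journal (illustration only). An allocation must update
# both the data block and the block map; the journal makes that pair
# atomic across a crash, which a bare flush cannot do.
class JournaledImage:
    def __init__(self):
        self.disk = {}
        self.journal = []  # assume journal appends are flushed atomically

    def alloc(self, block, data, crash_between=False):
        # 1. Record the intent in the journal before touching the image.
        self.journal.append(("data", block, data))
        self.journal.append(("bmap", block, "allocated"))
        # 2. Apply the updates to the main image.
        self.disk[("data", block)] = data
        if crash_between:
            return  # simulated crash: bmap update never reaches disk
        self.disk[("bmap", block)] = "allocated"
        # 3. Commit: the updates are durable, drop the journal entries.
        self.journal.clear()

    def recover(self):
        # On startup, replay whatever intents are still journaled.
        for kind, block, value in self.journal:
            self.disk[(kind, block)] = value
        self.journal.clear()

img = JournaledImage()
img.alloc(0, "payload", crash_between=True)  # crash mid-update
img.recover()                                # replay completes the pair
print(img.disk[("bmap", 0)])  # allocated
```

VDI has no such journal on disk, which is exactly why the patch can only shrink the inconsistency window, not close it.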
Then I would guess the same reasoning should apply to VMDK/VPC as well...
Their metadata update protocol is not atomic either, and a sync after a
metadata update doesn't theoretically fix the whole thing either. Yet the
metadata sync patches, some dating back to 2010, are still there. If
conversion performance is our major concern, removing those write
barriers should be a performance boost as well.

I think when converting images, one can always opt for "cache=unsafe" to
avoid potential performance degradation from a conservative cache mode
(it really should be the default for qemu-img convert, but I don't know
if it is), so conversion performance shouldn't be a reason to sacrifice
VM-runtime consistency.

On Fri, May 8, 2015 at 9:55 PM Kevin Wolf <kwolf@redhat.com> wrote:
> That would defeat the purpose of this patch, which aims at having
> metadata and data written out almost at the same time. On the other
> hand, fully avoiding the problem instead of just making the window
> smaller would require a journal, which VDI just doesn't have.
>
> I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> VDI maintainer.
>
> Kevin
Am 08.05.2015 um 15:55 schrieb Kevin Wolf:
> I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> VDI maintainer.

Thanks for asking. I share your concerns regarding reduced performance
caused by bdrv_flush. Conversions to VDI will take longer (how much?),
and installation of an OS on a new VDI disk image will also be slower,
because those are the typical scenarios where the disk usage grows.

@phoeagon: Did the benchmark which you used allocate additional disk
storage? If not, or if it only allocated once and then spent some time on
already allocated blocks, that benchmark was not valid for this case.

On the other hand, I don't see a need for the flushing, because the kind
of failures (power failure) and their consequences seem to be acceptable
for typical VDI usage, namely either image conversion or tests with
existing images.

That's why I'd prefer not to use bdrv_flush here. Could we make
bdrv_flush optional (either generally or for cases like this one), so
that both people who prefer speed and people who would want bdrv_flush to
decrease the likelihood of inconsistencies can be satisfied?

Stefan
Thanks. Dbench does not logically allocate new disk space all the time,
because it's an FS-level benchmark that creates files and deletes them.
Therefore it also depends on the guest FS; say, a btrfs guest FS
allocates about 1.8x the space of EXT4, due to its COW nature. It does
cause the FS to allocate some space during about 1/3 of the test
duration, I think. But this does not mitigate it too much, because an FS
often writes in a stride rather than consecutively, which causes write
amplification at allocation time.

So I tested it with qemu-img convert from a 400M raw file:

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
real    0m0.402s
user    0m0.206s
sys     0m0.202s

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
real    0m8.678s
user    0m0.169s
sys     0m0.500s

zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
real    0m4.320s
user    0m0.148s
sys     0m0.471s

zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
real    0m0.489s
user    0m0.173s
sys     0m0.325s

zheq-PC sdb # time qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi
real    0m0.515s
user    0m0.168s
sys     0m0.357s

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi
real    0m0.431s
user    0m0.192s
sys     0m0.248s

Although 400M is not a giant file, it does show the trend. As you can
see, when there are drastic allocation needs and no extra buffering from
a virtualized host, the throughput drops about 50%. But it still has no
effect on "unsafe" mode, as predicted. Also, I believe that expecting to
use a half-converted image is seldom the use case, while host crashes and
power loss are not so unimaginable.

It looks like qemu-img convert is using "unsafe" as the default as well,
so even novice "qemu-img convert" users are not likely to see a
performance degradation.
I have not yet tried a guest OS installation on top, but I guess a new
flag for one-time faster OS installation is not likely to be useful, and
"cache=unsafe" already does the trick.

On Sat, May 9, 2015 at 5:26 AM Stefan Weil <sw@weilnetz.de> wrote:
> @phoeagon: Did the benchmark which you used allocate additional disk
> storage? If not, or if it only allocated once and then spent some time
> on already allocated blocks, that benchmark was not valid for this
> case.
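The "about 50%" figure can be read straight off the two writeback timings above, since both runs convert the same 400M source file:

```python
# Check the "throughput drops about 50%" claim from the timings above.
size_mb = 400
patched_writeback = 8.678  # s, qemu-img convert -t writeback, patched build
vanilla_writeback = 4.320  # s, same command, unpatched qemu-img

tp_patched = size_mb / patched_writeback  # ~46 MB/s
tp_vanilla = size_mb / vanilla_writeback  # ~93 MB/s
drop = 1 - tp_patched / tp_vanilla
print(f"{drop:.0%}")  # ~50%
```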
BTW, how do you usually measure the time to install a Linux distro? Most
distro ISOs do NOT have unattended installation in place. (True, I can
bake my own ISOs for this...) But do you have any ISOs made ready for
this purpose?

On Sat, May 9, 2015 at 11:54 AM phoeagon <phoeagon@gmail.com> wrote:
> So I tested it with qemu-img convert from a 400M raw file:
> [...]
> Although 400M is not a giant file, it does show the trend.
Am 09.05.2015 um 05:59 schrieb phoeagon:
> BTW, how do you usually measure the time to install a Linux distro?
>
> So I tested it with qemu-img convert from a 400M raw file:
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t
> writeback -O vdi /run/shm/rand 1.vdi

I assume that the target file /run/shm/rand 1.vdi is not on a physical
disk. Then flushing data will be fast. For real hard disks (not SSDs) the
situation is different: the r/w heads of the hard disk have to move
between the data location and the beginning of the written file where the
metadata is written, so I expect a larger effect there.

For measuring the installation time of an OS, I'd take a reproducible
installation source (hard disk or DVD, no network connection) and take
the time for those parts of the installation where many packages are
installed without any user interaction. For Linux you won't need a
stopwatch, because the package directories in /usr/share/doc have nice
timestamps.

Stefan
Full Linux Mint (17.1) installation with writeback:

With VDI extra sync: 4min35s
Vanilla:             3min17s

which is consistent with 'qemu-img convert' (slightly less overhead,
because some phases of the installation are actually CPU-bound). Still
much faster than other "sync-after-metadata" formats like VPC (vanilla
VPC: 7min43s).

The thing is, he who needs to set up a new Linux system every day
probably has pre-installed images to start with, and others just don't
install an OS every day.

On Sat, May 9, 2015 at 2:39 PM Stefan Weil <sw@weilnetz.de> wrote:
> For measuring the installation time of an OS, I'd take a reproducible
> installation source (hard disk or DVD, no network connection) and take
> the time for those parts of the installation where many packages are
> installed without any user interaction.
On 09/05/2015 05:54, phoeagon wrote:
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>
> real    0m8.678s
> user    0m0.169s
> sys     0m0.500s
>
> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
> real    0m4.320s
> user    0m0.148s
> sys     0m0.471s

This means that 3.83 seconds are spent when bdrv_close() calls
bdrv_flush(). That's the only difference between writeback and unsafe in
qemu-img convert.

The remaining part of the time (4.85 seconds instead of 0.49 seconds)
means that, at least on your hardware, sequential writes to unallocated
space become 10 times slower with your patch.

Since the default qemu-img convert case isn't slowed down, I would think
that correctness trumps performance. Nevertheless, it's a huge
difference.

Paolo

> zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
> real    0m0.489s
> user    0m0.173s
> sys     0m0.325s
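Paolo's decomposition follows from the four timings: unsafe differs from writeback only by the final bdrv_flush() in bdrv_close(), so the vanilla writeback/unsafe gap isolates that flush, and subtracting it from the patched writeback run isolates the sequential write phase:

```python
# Reproduce Paolo's arithmetic from the timings quoted above.
vanilla_writeback = 4.320  # qemu-img convert -t writeback, unpatched
vanilla_unsafe    = 0.489  # qemu-img convert -t unsafe, unpatched
patched_writeback = 8.678  # writeback convert with the patch applied

# unsafe skips the final bdrv_flush() in bdrv_close(); writeback doesn't.
final_flush = vanilla_writeback - vanilla_unsafe       # ~3.83 s
# Subtract that final flush to isolate the sequential write phase.
patched_write_phase = patched_writeback - final_flush  # ~4.85 s
slowdown = patched_write_phase / vanilla_unsafe        # ~10x
print(final_flush, patched_write_phase, slowdown)
```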
Am 10.05.2015 um 17:01 schrieb Paolo Bonzini:
> Since the default qemu-img convert case isn't slowed down, I would
> think that correctness trumps performance. Nevertheless, it's a huge
> difference.

I doubt that the convert case isn't slowed down.

Writing to a tmpfs as it was obviously done for the test is not a typical
use case. With real hard disks I expect a significant slowdown.

Stefan
Just for clarity, I was not writing to tmpfs. I was READING from tmpfs. I
was writing to a path named 'sdb' (as you see in the prompt), which is a
btrfs on an SSD drive. I don't have an HDD to test on, though.

On Mon, May 11, 2015 at 12:02 AM Stefan Weil <sw@weilnetz.de> wrote:
> Writing to a tmpfs as it was obviously done for the test is not a
> typical use case. With real hard disks I expect a significant slowdown.
>
> Stefan
On 10/05/2015 18:02, Stefan Weil wrote:
> I doubt that the convert case isn't slowed down.

The *default* convert case isn't slowed down because "qemu-img convert"
defaults to the "unsafe" cache mode.

The *non-default* convert case with flushes was slowed down indeed: 2x in
total (if you include the final flush done by bdrv_close), and 10x if you
only consider the sequential write part of convert.

Paolo
Am 10.05.2015 um 18:10 schrieb Paolo Bonzini:
> The *non-default* convert case with flushes was slowed down indeed: 2x
> in total (if you include the final flush done by bdrv_close), and 10x
> if you only consider the sequential write part of convert.

For those who might be interested: the relevant GPL source code from
VirtualBox is available here:

https://www.virtualbox.org/browser/vbox/trunk/src/VBox/Storage

If I interpret that code correctly, blocks are normally written
asynchronously, but changes of metadata (new block allocation) are
written synchronously. See file VDI.cpp (function vdiBlockAllocUpdate)
and VD.cpp (vdIOIntWriteMeta).

Stefan
I'm not familiar with the VirtualBox code base, but it looks like
"vdIOIntWriteMeta" can work both synchronously and asynchronously, and
"vdiBlockAllocUpdate" looks async to me. Frankly, skimming through the
code for 5 minutes doesn't enlighten me much on the detailed
implementation, but it looks like at least VirtualBox has a VDI repair
facility that fixes block leaks relatively easily.

I would agree that a more complete implementation of VDI
check-and-repair might be better in this particular situation. I'm not
sure if there are other cases where a flush after a metadata update
might be better, but it doesn't look like qemu-img auto repair is coming
to other writable image formats in the near future.

Also, I think that memory exhaustion and consequent page cache eviction
are not too uncommon on computers not originally designed to run VMs.
Many laptops are still stuck with 4GB of memory, and there seem to be
widespread instructions on tuning down the swappiness to favor page
cache drops over swapping out memory, all of which adds to the odds of
metadata inconsistency.

On Mon, May 11, 2015 at 12:26 AM Stefan Weil <sw@weilnetz.de> wrote:
> If I interpret that code correctly, blocks are normally written
> asynchronously, but changes of metadata (new block allocation) are
> written synchronously. See file VDI.cpp (function vdiBlockAllocUpdate)
> and VD.cpp (vdIOIntWriteMeta).
>
> Stefan
diff --git a/block/vdi.c b/block/vdi.c
index 7642ef3..dfe8ade 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
         logout("will write %u block map sectors starting from entry %u\n",
                n_sectors, bmap_first);
         ret = bdrv_write(bs->file, offset, base, n_sectors);
+        if (ret >= 0) {
+            ret = bdrv_flush(bs->file);
+        }
     }

     return ret;
In reference to b0ad5a45...078a458e, metadata writes to
qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.

bdrv_flush is only called when the write is successful.

Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
---
 block/vdi.c | 3 +++
 1 file changed, 3 insertions(+)
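The error-handling shape of the patch (flush only if the write succeeded, and let the flush's result replace the write's result) maps directly onto plain POSIX file I/O. A minimal Python sketch of the same pattern, using os.pwrite/os.fsync as stand-ins for bdrv_write/bdrv_flush (durable_write is an invented name, not a QEMU function):

```python
import os
import tempfile

def durable_write(fd, offset, buf):
    """Mirror the patch's control flow: flush only after a successful
    write, and let a flush failure override the write's success."""
    try:
        os.pwrite(fd, buf, offset)       # the bdrv_write() step
    except OSError as e:
        return -e.errno                  # write failed: don't flush
    try:
        os.fsync(fd)                     # the added bdrv_flush() step
    except OSError as e:
        return -e.errno                  # flush failure becomes the result
    return 0

fd, path = tempfile.mkstemp()
ret = durable_write(fd, 0, b"bmap sector")
os.close(fd)
os.remove(path)
print(ret)  # 0
```

As in the C patch, a caller that sees 0 knows both the write and the flush completed, so the update is durable, not merely cached.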