Message ID: 4CCADCF9.5030508@linux.vnet.ibm.com
State: New
Am 29.10.2010 16:40, schrieb Anthony Liguori:
> On 10/29/2010 09:29 AM, Kevin Wolf wrote:
>> Am 29.10.2010 16:15, schrieb Anthony Liguori:
>>> I don't think it's a bad idea to do that but to the extent that the
>>> block API is designed after posix file I/O, close does not usually
>>> imply flush.
>>
>> I don't think it really resembles POSIX. More or less the only thing
>> they have in common is that both provide open, read, write and close,
>> which is something that probably any API for file accesses provides.
>>
>> The operation you're talking about here is bdrv_flush/fsync that is
>> not implied by a POSIX close?
>
> Yes. But I think for the purposes of this patch, a bdrv_cancel_all()
> would be just as good. The intention is to eliminate pending I/O
> requests, the fsync is just a side effect.

Well, if I'm not mistaken, bdrv_flush would provide only this side
effect and not the semantics that you're really looking for. This is
why I suggested adding both bdrv_flush and qemu_aio_flush. We could
probably introduce a qemu_aio_flush variant that flushes only one
BlockDriverState - this is what you really want.

>>>> And why do we have to flush here, but not before other uses of
>>>> bdrv_close(), such as eject_device()?
>>>
>>> Good question. Kevin should also confirm, but looking at the code, I
>>> think flush() is needed before close. If there's a pending I/O event
>>> and you close before the I/O event is completed, you'll get a
>>> callback for completion against a bogus BlockDriverState.
>>>
>>> I can't find anything in either raw-posix or the generic block layer
>>> that would mitigate this.
>>
>> I'm not aware of anything either. This is what qemu_aio_flush would do.
>>
>> It seems reasonable to me to call both qemu_aio_flush and bdrv_flush
>> in bdrv_close. We probably don't really need to call bdrv_flush to
>> operate correctly, but it can't hurt and bdrv_close shouldn't happen
>> that often anyway.
>
> I agree. Re: qemu_aio_flush, we have to wait for it to complete which
> gets a little complicated in bdrv_close().

qemu_aio_flush is the function that waits for requests to complete.

> I think it would be better to make bdrv_flush() call bdrv_aio_flush()
> if an explicit bdrv_flush method isn't provided. Something like the
> attached (still need to test).
>
> Does that seem reasonable?

I'm not sure why you want to introduce this emulation. Are there any
drivers that implement bdrv_aio_flush, but not bdrv_flush? They are
definitely broken.

Today, bdrv_aio_flush is emulated using bdrv_flush if the driver
doesn't provide it explicitly.

I think this also means that your first patch would kill any drivers
implementing neither bdrv_flush nor bdrv_aio_flush because they'd try
to emulate the other function in an endless recursion.

And apart from that, as said above, bdrv_flush doesn't do the right
thing anyway. ;-)

Kevin
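[For readers following along: the per-device variant Kevin mentions might look roughly like the sketch below. The bs->in_flight counter is an assumption for illustration only -- the block layer does not track requests per BlockDriverState, so the submission and completion paths would have to maintain such a counter; qemu_aio_wait() is the real event-loop primitive.]

/* Hypothetical per-device qemu_aio_flush(): spin the AIO event loop
 * until this BlockDriverState has no requests in flight.
 * bs->in_flight is assumed, not existing code. */
static void qemu_aio_flush_bs(BlockDriverState *bs)
{
    while (bs->in_flight > 0) {
        qemu_aio_wait();    /* dispatch completions until the device is idle */
    }
}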
On 10/29/2010 09:57 AM, Kevin Wolf wrote:
> Am 29.10.2010 16:40, schrieb Anthony Liguori:
>> On 10/29/2010 09:29 AM, Kevin Wolf wrote:
>>> Am 29.10.2010 16:15, schrieb Anthony Liguori:
>>>> I don't think it's a bad idea to do that but to the extent that the
>>>> block API is designed after posix file I/O, close does not usually
>>>> imply flush.
>>>
>>> I don't think it really resembles POSIX. More or less the only thing
>>> they have in common is that both provide open, read, write and
>>> close, which is something that probably any API for file accesses
>>> provides.
>>>
>>> The operation you're talking about here is bdrv_flush/fsync that is
>>> not implied by a POSIX close?
>>
>> Yes. But I think for the purposes of this patch, a bdrv_cancel_all()
>> would be just as good. The intention is to eliminate pending I/O
>> requests, the fsync is just a side effect.
>
> Well, if I'm not mistaken, bdrv_flush would provide only this side
> effect and not the semantics that you're really looking for. This is
> why I suggested adding both bdrv_flush and qemu_aio_flush. We could
> probably introduce a qemu_aio_flush variant that flushes only one
> BlockDriverState - this is what you really want.
>
>>>>>> And why do we have to flush here, but not before other uses of
>>>>>> bdrv_close(), such as eject_device()?
>>>>>
>>>>> Good question. Kevin should also confirm, but looking at the code,
>>>>> I think flush() is needed before close. If there's a pending I/O
>>>>> event and you close before the I/O event is completed, you'll get
>>>>> a callback for completion against a bogus BlockDriverState.
>>>>>
>>>>> I can't find anything in either raw-posix or the generic block
>>>>> layer that would mitigate this.
>>>>
>>>> I'm not aware of anything either. This is what qemu_aio_flush would
>>>> do.
>>>>
>>>> It seems reasonable to me to call both qemu_aio_flush and
>>>> bdrv_flush in bdrv_close. We probably don't really need to call
>>>> bdrv_flush to operate correctly, but it can't hurt and bdrv_close
>>>> shouldn't happen that often anyway.
>>>
>>> I agree. Re: qemu_aio_flush, we have to wait for it to complete
>>> which gets a little complicated in bdrv_close().
>>
>> qemu_aio_flush is the function that waits for requests to complete.

Please excuse me while my head explodes ;-)

I think we've got a bit of a problem.

We have:

1) bdrv_flush() - sends an fdatasync

2) bdrv_aio_flush() - sends an fdatasync using the thread pool

3) qemu_aio_flush() - waits for all pending aio requests to complete

But we use bdrv_aio_flush() to implement a barrier and we don't
actually preserve those barrier semantics in the thread pool.

That is, if I do:

bdrv_aio_write() -> A
bdrv_aio_write() -> B
bdrv_aio_flush() -> C

This will get queued as three requests on the thread pool. (A) is a
write, (B) is a write, and (C) is an fdatasync.

But if this gets picked up by three separate threads, the ordering
isn't guaranteed. It might be C, B, A. So semantically, is
bdrv_aio_flush() supposed to flush any *pending* writes or any
*completed* writes? If it's the latter, we're okay, but if it's the
former, we're broken.

If it's supposed to flush any pending writes, then my patch series is
correct in theory.

Regards,

Anthony Liguori

>> I think it would be better to make bdrv_flush() call bdrv_aio_flush()
>> if an explicit bdrv_flush method isn't provided. Something like the
>> attached (still need to test).
>>
>> Does that seem reasonable?
>
> I'm not sure why you want to introduce this emulation. Are there any
> drivers that implement bdrv_aio_flush, but not bdrv_flush? They are
> definitely broken.
>
> Today, bdrv_aio_flush is emulated using bdrv_flush if the driver
> doesn't provide it explicitly.
>
> I think this also means that your first patch would kill any drivers
> implementing neither bdrv_flush nor bdrv_aio_flush because they'd try
> to emulate the other function in an endless recursion.
>
> And apart from that, as said above, bdrv_flush doesn't do the right
> thing anyway. ;-)
>
> Kevin
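[The endless recursion Kevin warns about is easy to see in a toy model. The struct and function names below only mimic QEMU's BlockDriver shapes; this is a self-contained illustration, not the real code.]

/* Toy model of the mutual-emulation hazard: a driver that implements
 * neither flush callback sends each wrapper into the other, forever. */
typedef struct ToyDriver {
    int (*drv_flush)(void);        /* synchronous flush, may be NULL */
    int (*drv_aio_flush)(void);    /* asynchronous flush, may be NULL */
} ToyDriver;

static ToyDriver drv;              /* implements neither callback */

static int toy_aio_flush(void);

static int toy_flush(void)
{
    if (drv.drv_flush) {
        return drv.drv_flush();
    }
    return toy_aio_flush();        /* emulate sync flush via the async path */
}

static int toy_aio_flush(void)
{
    if (drv.drv_aio_flush) {
        return drv.drv_aio_flush();
    }
    return toy_flush();            /* emulate async flush via the sync path:
                                    * endless recursion if both are NULL */
}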
Am 29.10.2010 17:28, schrieb Anthony Liguori:
> On 10/29/2010 09:57 AM, Kevin Wolf wrote:
>> Am 29.10.2010 16:40, schrieb Anthony Liguori:
>>> On 10/29/2010 09:29 AM, Kevin Wolf wrote:
>>>> Am 29.10.2010 16:15, schrieb Anthony Liguori:
>>>>> I don't think it's a bad idea to do that but to the extent that
>>>>> the block API is designed after posix file I/O, close does not
>>>>> usually imply flush.
>>>>
>>>> I don't think it really resembles POSIX. More or less the only
>>>> thing they have in common is that both provide open, read, write
>>>> and close, which is something that probably any API for file
>>>> accesses provides.
>>>>
>>>> The operation you're talking about here is bdrv_flush/fsync that is
>>>> not implied by a POSIX close?
>>>
>>> Yes. But I think for the purposes of this patch, a bdrv_cancel_all()
>>> would be just as good. The intention is to eliminate pending I/O
>>> requests, the fsync is just a side effect.
>>
>> Well, if I'm not mistaken, bdrv_flush would provide only this side
>> effect and not the semantics that you're really looking for. This is
>> why I suggested adding both bdrv_flush and qemu_aio_flush. We could
>> probably introduce a qemu_aio_flush variant that flushes only one
>> BlockDriverState - this is what you really want.
>>
>>>>>>> And why do we have to flush here, but not before other uses of
>>>>>>> bdrv_close(), such as eject_device()?
>>>>>>
>>>>>> Good question. Kevin should also confirm, but looking at the
>>>>>> code, I think flush() is needed before close. If there's a
>>>>>> pending I/O event and you close before the I/O event is
>>>>>> completed, you'll get a callback for completion against a bogus
>>>>>> BlockDriverState.
>>>>>>
>>>>>> I can't find anything in either raw-posix or the generic block
>>>>>> layer that would mitigate this.
>>>>>
>>>>> I'm not aware of anything either. This is what qemu_aio_flush
>>>>> would do.
>>>>>
>>>>> It seems reasonable to me to call both qemu_aio_flush and
>>>>> bdrv_flush in bdrv_close. We probably don't really need to call
>>>>> bdrv_flush to operate correctly, but it can't hurt and bdrv_close
>>>>> shouldn't happen that often anyway.
>>>>
>>>> I agree. Re: qemu_aio_flush, we have to wait for it to complete
>>>> which gets a little complicated in bdrv_close().
>>>
>>> qemu_aio_flush is the function that waits for requests to complete.
>>
>> Please excuse me while my head explodes ;-)
>>
>> I think we've got a bit of a problem.
>>
>> We have:
>>
>> 1) bdrv_flush() - sends an fdatasync
>>
>> 2) bdrv_aio_flush() - sends an fdatasync using the thread pool
>>
>> 3) qemu_aio_flush() - waits for all pending aio requests to complete
>>
>> But we use bdrv_aio_flush() to implement a barrier and we don't
>> actually preserve those barrier semantics in the thread pool.

Not really. We use it to implement flush commands, which I think don't
necessarily constitute a barrier by themselves.

>> That is, if I do:
>>
>> bdrv_aio_write() -> A
>> bdrv_aio_write() -> B
>> bdrv_aio_flush() -> C
>>
>> This will get queued as three requests on the thread pool. (A) is a
>> write, (B) is a write, and (C) is an fdatasync.
>>
>> But if this gets picked up by three separate threads, the ordering
>> isn't guaranteed. It might be C, B, A. So semantically, is
>> bdrv_aio_flush() supposed to flush any *pending* writes or any
>> *completed* writes? If it's the latter, we're okay, but if it's the
>> former, we're broken.

Right, so don't do that. ;-)

bdrv_aio_flush, as I understand it, is meant to flush only completed
writes. We've had this discussion before and if I understood right,
this is also how real hardware generally works. So to get barrier
semantics, you as an OS need to flush your queue, i.e. you wait for A
and B to complete before you issue C.

Christoph should be able to give the details on this.

Kevin
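[Kevin's "flush your queue first" rule, seen from the caller's side, comes out roughly as below. The pending_writes counter and callback names are illustrative scaffolding, not part of the block API; bdrv_aio_flush() and the (void *opaque, int ret) completion signature are the real interface.]

/* Sketch: issue the flush (C) only after every write it is meant to
 * cover (A and B) has completed. */
static int pending_writes;         /* A and B still in flight */

static void flush_done(void *opaque, int ret)
{
    /* C has completed: everything written before it is now stable */
}

static void write_done(void *opaque, int ret)
{
    BlockDriverState *bs = opaque;

    if (--pending_writes == 0) {
        /* the queue is drained; only now is it safe to issue C */
        bdrv_aio_flush(bs, flush_done, NULL);
    }
}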
On Fri, Oct 29, 2010 at 06:08:03PM +0200, Kevin Wolf wrote:
>> I think we've got a bit of a problem.
>>
>> We have:
>>
>> 1) bdrv_flush() - sends an fdatasync
>>
>> 2) bdrv_aio_flush() - sends an fdatasync using the thread pool
>>
>> 3) qemu_aio_flush() - waits for all pending aio requests to complete
>>
>> But we use bdrv_aio_flush() to implement a barrier and we don't
>> actually preserve those barrier semantics in the thread pool.
>
> Not really. We use it to implement flush commands, which I think don't
> necessarily constitute a barrier by themselves.

Yes. Just as with normal disks, qemu has absolutely no concept of I/O
barriers. I/O barriers are an abstraction inside the Linux kernel that
we fortunately finally got rid of.

Qemu just gets a cache flush command from the guest and executes it,
usually asynchronously, as synchronous block I/O with a single
outstanding request is not very performant. The filesystem in the guest
handles the ordering around it.

> bdrv_aio_flush, as I understand it, is meant to flush only completed
> writes.

Exactly. The guest OS tracks writes and only issues a cache flush once
all I/Os it wants to see flushed have been ACKed by the storage
hardware / qemu.
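[In a device model, that forwarding looks roughly like the sketch below, in the spirit of what virtio-blk and IDE do with a guest flush command; request bookkeeping and error handling are omitted and the function names are made up.]

static void guest_flush_done(void *opaque, int ret)
{
    /* complete the guest's flush command here, reporting ret */
}

static void handle_guest_flush(BlockDriverState *bs)
{
    /* the guest has already waited for the writes it wants covered,
     * so a plain asynchronous flush is enough */
    bdrv_aio_flush(bs, guest_flush_done, NULL);
}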
From 094049974796ddf78ee2f1541bffa40fe1176a1a Mon Sep 17 00:00:00 2001
From: Anthony Liguori <aliguori@us.ibm.com>
Date: Fri, 29 Oct 2010 09:37:25 -0500
Subject: [PATCH 2/2] block: add bdrv_flush to bdrv_close

To ensure that there are no pending completions before destroying a
block device.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

diff --git a/block.c b/block.c
index fc8defd..d2aed1b 100644
--- a/block.c
+++ b/block.c
@@ -644,6 +644,8 @@ unlink_and_fail:
 void bdrv_close(BlockDriverState *bs)
 {
     if (bs->drv) {
+        bdrv_flush(bs);
+
         if (bs == bs_snapshots) {
             bs_snapshots = NULL;
         }
--
1.7.0.4
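[For reference, folding Kevin's earlier suggestion into this close path -- wait for in-flight AIO first, then flush -- would look roughly like the following sketch. The rest of the teardown is elided and this is untested, not the committed change.]

void bdrv_close(BlockDriverState *bs)
{
    if (bs->drv) {
        qemu_aio_flush();   /* wait for all pending AIO requests to complete */
        bdrv_flush(bs);     /* then push data out; cheap, since close is rare */

        /* ... existing teardown (bs_snapshots, drv->bdrv_close(), ...) ... */
    }
}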