[2/3] v2 Fix Block Hotplug race with drive_unplug()

Message ID 4CCADCF9.5030508@linux.vnet.ibm.com
State New

Commit Message

Anthony Liguori Oct. 29, 2010, 2:40 p.m. UTC
On 10/29/2010 09:29 AM, Kevin Wolf wrote:
> Am 29.10.2010 16:15, schrieb Anthony Liguori:
>    
>> On 10/29/2010 09:01 AM, Markus Armbruster wrote:
>>      
>>> Ryan Harper<ryanh@us.ibm.com>   writes:
>>>        
>>>> diff --git a/block.c b/block.c
>>>> index a19374d..be47655 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -1328,6 +1328,13 @@ void bdrv_set_removable(BlockDriverState *bs, int removable)
>>>>        }
>>>>    }
>>>>
>>>> +void bdrv_unplug(BlockDriverState *bs)
>>>> +{
>>>> +    qemu_aio_flush();
>>>> +    bdrv_flush(bs);
>>>> +    bdrv_close(bs);
>>>> +}
>>>>
>>>>          
>>> Stupid question: why doesn't bdrv_close() flush automatically?
>>>
>>>        
>> I don't think it's a bad idea to do that but to the extent that the
>> block API is designed after posix file I/O, close does not usually imply
>> flush.
>>      
> I don't think it really resembles POSIX. More or less the only thing
> they have in common is that both provide open, read, write and close,
> which is something that probably any API for file accesses provides.
>
> The operation you're talking about here is bdrv_flush/fsync that is not
> implied by a POSIX close?
>    

Yes.  But I think for the purposes of this patch, a bdrv_cancel_all() 
would be just as good.  The intention is to eliminate pending I/O 
requests, the fsync is just a side effect.

>>> And why do we have to flush here, but not before other uses of
>>> bdrv_close(), such as eject_device()?
>>>
>>>        
>> Good question.  Kevin should also confirm, but looking at the code, I
>> think flush() is needed before close.  If there's a pending I/O event
>> and you close before the I/O event is completed, you'll get a callback
>> for completion against a bogus BlockDriverState.
>>
>> I can't find anything in either raw-posix or the generic block layer
>> that would mitigate this.
>>      
> I'm not aware of anything either. This is what qemu_aio_flush would do.
>
> It seems reasonable to me to call both qemu_aio_flush and bdrv_flush in
> bdrv_close. We probably don't really need to call bdrv_flush to operate
> correctly, but it can't hurt and bdrv_close shouldn't happen that often
> anyway.
>    

I agree.  Re: qemu_aio_flush, we have to wait for it to complete which 
gets a little complicated in bdrv_close().  I think it would be better 
to make bdrv_flush() call bdrv_aio_flush() if an explicit bdrv_flush 
method isn't provided.  Something like the attached (still need to test).
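
Roughly along these lines, say (just a sketch of the idea, not the actual
attachment; the FlushCookie struct and bdrv_flush_em() name are made up for
illustration, only bdrv_aio_flush() and qemu_aio_wait() are existing
interfaces):

struct FlushCookie {
    int done;
    int ret;
};

static void bdrv_flush_em_cb(void *opaque, int ret)
{
    struct FlushCookie *cookie = opaque;

    /* Record the flush result and mark it as completed. */
    cookie->ret = ret;
    cookie->done = 1;
}

/* Emulate a synchronous flush on top of the AIO interface by pumping
 * the AIO event loop until our completion callback has run. */
static int bdrv_flush_em(BlockDriverState *bs)
{
    struct FlushCookie cookie = { 0, 0 };
    BlockDriverAIOCB *acb;

    acb = bdrv_aio_flush(bs, bdrv_flush_em_cb, &cookie);
    if (!acb) {
        return -EIO;
    }

    while (!cookie.done) {
        qemu_aio_wait();
    }

    return cookie.ret;
}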

Does that seem reasonable?

Regards,

Anthony Liguori

> Kevin
>

Comments

Kevin Wolf Oct. 29, 2010, 2:57 p.m. UTC | #1
Am 29.10.2010 16:40, schrieb Anthony Liguori:
> On 10/29/2010 09:29 AM, Kevin Wolf wrote:
>> Am 29.10.2010 16:15, schrieb Anthony Liguori:
>>> I don't think it's a bad idea to do that but to the extent that the
>>> block API is designed after posix file I/O, close does not usually imply
>>> flush.
>>>      
>> I don't think it really resembles POSIX. More or less the only thing
>> they have in common is that both provide open, read, write and close,
>> which is something that probably any API for file accesses provides.
>>
>> The operation you're talking about here is bdrv_flush/fsync that is not
>> implied by a POSIX close?
>>    
> 
> Yes.  But I think for the purposes of this patch, a bdrv_cancel_all() 
> would be just as good.  The intention is to eliminate pending I/O 
> requests, the fsync is just a side effect.

Well, if I'm not mistaken, bdrv_flush would provide only this side
effect and not the semantics that you're really looking for. This is why
I suggested adding both bdrv_flush and qemu_aio_flush. We could probably
introduce a qemu_aio_flush variant that flushes only one
BlockDriverState - this is what you really want.
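
Something like qemu_aio_flush_bs(bs), say (purely hypothetical; the block
layer doesn't track in-flight requests per BlockDriverState today, so both
the function and the counter below are made up for illustration):

/* Process AIO completions until this particular device has no
 * outstanding requests left.  Assumes a per-device in-flight counter
 * that the block layer would have to start maintaining. */
void qemu_aio_flush_bs(BlockDriverState *bs)
{
    while (bs->in_flight_requests > 0) {
        qemu_aio_wait();
    }
}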

>>>> And why do we have to flush here, but not before other uses of
>>>> bdrv_close(), such as eject_device()?
>>>>
>>>>        
>>> Good question.  Kevin should also confirm, but looking at the code, I
>>> think flush() is needed before close.  If there's a pending I/O event
>>> and you close before the I/O event is completed, you'll get a callback
>>> for completion against a bogus BlockDriverState.
>>>
>>> I can't find anything in either raw-posix or the generic block layer
>>> that would mitigate this.
>>>      
>> I'm not aware of anything either. This is what qemu_aio_flush would do.
>>
>> It seems reasonable to me to call both qemu_aio_flush and bdrv_flush in
>> bdrv_close. We probably don't really need to call bdrv_flush to operate
>> correctly, but it can't hurt and bdrv_close shouldn't happen that often
>> anyway.
>>    
> 
> I agree.  Re: qemu_aio_flush, we have to wait for it to complete which 
> gets a little complicated in bdrv_close().  

qemu_aio_flush is the function that waits for requests to complete.

> I think it would be better 
> to make bdrv_flush() call bdrv_aio_flush() if an explicit bdrv_flush 
> method isn't provided.  Something like the attached (still need to test).
> 
> Does that seem reasonable?

I'm not sure why you want to introduce this emulation. Are there any
drivers that implement bdrv_aio_flush, but not bdrv_flush? They are
definitely broken.

Today, bdrv_aio_flush is emulated using bdrv_flush if the driver doesn't
provide it explicitly.

I think this also means that your first patch would kill any drivers
implementing neither bdrv_flush nor bdrv_aio_flush because they'd try to
emulate the other function in an endless recursion.

And apart from that, as said above, bdrv_flush doesn't do the right
thing anyway. ;-)

Kevin
Anthony Liguori Oct. 29, 2010, 3:28 p.m. UTC | #2
On 10/29/2010 09:57 AM, Kevin Wolf wrote:
> Am 29.10.2010 16:40, schrieb Anthony Liguori:
>    
>> On 10/29/2010 09:29 AM, Kevin Wolf wrote:
>>      
>>> Am 29.10.2010 16:15, schrieb Anthony Liguori:
>>>        
>>>> I don't think it's a bad idea to do that but to the extent that the
>>>> block API is designed after posix file I/O, close does not usually imply
>>>> flush.
>>>>
>>>>          
>>> I don't think it really resembles POSIX. More or less the only thing
>>> they have in common is that both provide open, read, write and close,
>>> which is something that probably any API for file accesses provides.
>>>
>>> The operation you're talking about here is bdrv_flush/fsync that is not
>>> implied by a POSIX close?
>>>
>>>        
>> Yes.  But I think for the purposes of this patch, a bdrv_cancel_all()
>> would be just as good.  The intention is to eliminate pending I/O
>> requests, the fsync is just a side effect.
>>      
> Well, if I'm not mistaken, bdrv_flush would provide only this side
> effect and not the semantics that you're really looking for. This is why
> I suggested adding both bdrv_flush and qemu_aio_flush. We could probably
> introduce a qemu_aio_flush variant that flushes only one
> BlockDriverState - this is what you really want.
>
>    
>>>>> And why do we have to flush here, but not before other uses of
>>>>> bdrv_close(), such as eject_device()?
>>>>>
>>>>>
>>>>>            
>>>> Good question.  Kevin should also confirm, but looking at the code, I
>>>> think flush() is needed before close.  If there's a pending I/O event
>>>> and you close before the I/O event is completed, you'll get a callback
>>>> for completion against a bogus BlockDriverState.
>>>>
>>>> I can't find anything in either raw-posix or the generic block layer
>>>> that would mitigate this.
>>>>
>>>>          
>>> I'm not aware of anything either. This is what qemu_aio_flush would do.
>>>
>>> It seems reasonable to me to call both qemu_aio_flush and bdrv_flush in
>>> bdrv_close. We probably don't really need to call bdrv_flush to operate
>>> correctly, but it can't hurt and bdrv_close shouldn't happen that often
>>> anyway.
>>>
>>>        
>> I agree.  Re: qemu_aio_flush, we have to wait for it to complete which
>> gets a little complicated in bdrv_close().
>>      
> qemu_aio_flush is the function that waits for requests to complete.
>    

Please excuse me while my head explodes ;-)

I think we've got a bit of a problem.

We have:

1) bdrv_flush() - sends an fdatasync

2) bdrv_aio_flush() - sends an fdatasync using the thread pool

3) qemu_aio_flush() - waits for all pending aio requests to complete

But we use bdrv_aio_flush() to implement a barrier and we don't actually 
preserve those barrier semantics in the thread pool.

That is:

If I do:

bdrv_aio_write() -> A
bdrv_aio_write() -> B
bdrv_aio_flush() -> C

This will get queued as three requests on the thread pool.  (A) is a 
write, (B) is a write, and (C) is a fdatasync.

But if this gets picked up by three separate threads, the ordering isn't 
guaranteed.  It might be C, B, A.  So semantically, is bdrv_aio_flush() 
supposed to flush any *pending* writes or any *completed* writes?  If 
it's the latter, we're okay, but if it's the former, we're broken.

If it's supposed to flush any pending writes, then my patch series is 
correct in theory.

Regards,

Anthony Liguori

>> I think it would be better
>> to make bdrv_flush() call bdrv_aio_flush() if an explicit bdrv_flush
>> method isn't provided.  Something like the attached (still need to test).
>>
>> Does that seem reasonable?
>>      
> I'm not sure why you want to introduce this emulation. Are there any
> drivers that implement bdrv_aio_flush, but not bdrv_flush? They are
> definitely broken.
>
> Today, bdrv_aio_flush is emulated using bdrv_flush if the driver doesn't
> provide it explicitly.
>
> I think this also means that your first patch would kill any drivers
> implementing neither bdrv_flush nor bdrv_aio_flush because they'd try to
> emulate the other function in an endless recursion.
>
> And apart from that, as said above, bdrv_flush doesn't do the right
> thing anyway. ;-)
>
> Kevin
>
Kevin Wolf Oct. 29, 2010, 4:08 p.m. UTC | #3
Am 29.10.2010 17:28, schrieb Anthony Liguori:
> On 10/29/2010 09:57 AM, Kevin Wolf wrote:
>> Am 29.10.2010 16:40, schrieb Anthony Liguori:
>>    
>>> On 10/29/2010 09:29 AM, Kevin Wolf wrote:
>>>      
>>>> Am 29.10.2010 16:15, schrieb Anthony Liguori:
>>>>        
>>>>> I don't think it's a bad idea to do that but to the extent that the
>>>>> block API is designed after posix file I/O, close does not usually imply
>>>>> flush.
>>>>>
>>>>>          
>>>> I don't think it really resembles POSIX. More or less the only thing
>>>> they have in common is that both provide open, read, write and close,
>>>> which is something that probably any API for file accesses provides.
>>>>
>>>> The operation you're talking about here is bdrv_flush/fsync that is not
>>>> implied by a POSIX close?
>>>>
>>>>        
>>> Yes.  But I think for the purposes of this patch, a bdrv_cancel_all()
>>> would be just as good.  The intention is to eliminate pending I/O
>>> requests, the fsync is just a side effect.
>>>      
>> Well, if I'm not mistaken, bdrv_flush would provide only this side
>> effect and not the semantics that you're really looking for. This is why
>> I suggested adding both bdrv_flush and qemu_aio_flush. We could probably
>> introduce a qemu_aio_flush variant that flushes only one
>> BlockDriverState - this is what you really want.
>>
>>    
>>>>>> And why do we have to flush here, but not before other uses of
>>>>>> bdrv_close(), such as eject_device()?
>>>>>>
>>>>>>
>>>>>>            
>>>>> Good question.  Kevin should also confirm, but looking at the code, I
>>>>> think flush() is needed before close.  If there's a pending I/O event
>>>>> and you close before the I/O event is completed, you'll get a callback
>>>>> for completion against a bogus BlockDriverState.
>>>>>
>>>>> I can't find anything in either raw-posix or the generic block layer
>>>>> that would mitigate this.
>>>>>
>>>>>          
>>>> I'm not aware of anything either. This is what qemu_aio_flush would do.
>>>>
>>>> It seems reasonable to me to call both qemu_aio_flush and bdrv_flush in
>>>> bdrv_close. We probably don't really need to call bdrv_flush to operate
>>>> correctly, but it can't hurt and bdrv_close shouldn't happen that often
>>>> anyway.
>>>>
>>>>        
>>> I agree.  Re: qemu_aio_flush, we have to wait for it to complete which
>>> gets a little complicated in bdrv_close().
>>>      
>> qemu_aio_flush is the function that waits for requests to complete.
>>    
> 
> Please excuse me while my head explodes ;-)
> 
> I think we've got a bit of a problem.
> 
> We have:
> 
> 1) bdrv_flush() - sends an fdatasync
> 
> 2) bdrv_aio_flush() - sends an fdatasync using the thread pool
> 
> 3) qemu_aio_flush() - waits for all pending aio requests to complete
> 
> But we use bdrv_aio_flush() to implement a barrier and we don't actually 
> preserve those barrier semantics in the thread pool.

Not really. We use it to implement flush commands, which I think don't
necessarily constitute a barrier by themselves.

> That is:
> 
> If I do:
> 
> bdrv_aio_write() -> A
> bdrv_aio_write() -> B
> bdrv_aio_flush() -> C
> 
> This will get queued as three requests on the thread pool.  (A) is a 
> write, (B) is a write, and (C) is a fdatasync.
> 
> But if this gets picked up by three separate threads, the ordering isn't 
> guaranteed.  It might be C, B, A.  So semantically, is bdrv_aio_flush() 
> supposed to flush any *pending* writes or any *completed* writes?  If 
> it's the latter, we're okay, but if it's the former, we're broken.

Right, so don't do that. ;-)

bdrv_aio_flush, as I understand it, is meant to flush only completed
writes. We've had this discussion before and if I understood right, this
is also how real hardware works generally. So to get barrier semantics
you as an OS need to flush your queue, i.e. you wait for A and B to
complete before you issue C.
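
In other words, the caller side looks roughly like this (a hypothetical
sketch, not code from this series; the BarrierState struct and callbacks
are made up, only bdrv_aio_writev() and bdrv_aio_flush() are real
interfaces, and error handling is omitted):

struct BarrierState {
    BlockDriverState *bs;
    int writes_in_flight;
};

static void flush_done(void *opaque, int ret)
{
    /* C has completed: everything written before it is on stable storage. */
}

static void write_done(void *opaque, int ret)
{
    struct BarrierState *s = opaque;

    /* Only issue the flush (C) once both writes (A and B) have completed. */
    if (--s->writes_in_flight == 0) {
        bdrv_aio_flush(s->bs, flush_done, s);
    }
}

static void submit_with_flush(struct BarrierState *s,
                              QEMUIOVector *a, QEMUIOVector *b)
{
    s->writes_in_flight = 2;
    /* Placeholder sector offsets/lengths, assuming 512-byte sectors. */
    bdrv_aio_writev(s->bs, 0, a, a->size / 512, write_done, s);   /* A */
    bdrv_aio_writev(s->bs, 64, b, b->size / 512, write_done, s);  /* B */
}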

Christoph should be able to detail on this.

Kevin
Christoph Hellwig Oct. 30, 2010, 1:25 p.m. UTC | #4
On Fri, Oct 29, 2010 at 06:08:03PM +0200, Kevin Wolf wrote:
> > I think we've got a bit of a problem.
> > 
> > We have:
> > 
> > 1) bdrv_flush() - sends an fdatasync
> > 
> > 2) bdrv_aio_flush() - sends an fdatasync using the thread pool
> > 
> > 3) qemu_aio_flush() - waits for all pending aio requests to complete
> > 
> > But we use bdrv_aio_flush() to implement a barrier and we don't actually 
> > preserve those barrier semantics in the thread pool.
> 
> Not really. We use it to implement flush commands, which I think don't
> necessarily constitute a barrier by themselves.

Yes.  Just as with normal disks, qemu has absolutely no concept of I/O
barriers.  I/O barriers are an abstraction inside the Linux kernel that
we fortunately finally got rid of.

Qemu just gets a cache flush command from the guest and executes it.
Usually asynchronously, as synchronous block I/O with a single
outstanding request is not very performant.  The filesystem in the guest
handles the ordering around it.

> bdrv_aio_flush, as I understand it, is meant to flush only completed
> writes.

Exactly.  The guest OS tracks writes and only issues a cache flush if
all I/Os it wants to see flushed have been ACKed by the storage hardware
/ qemu.

Patch

From 094049974796ddf78ee2f1541bffa40fe1176a1a Mon Sep 17 00:00:00 2001
From: Anthony Liguori <aliguori@us.ibm.com>
Date: Fri, 29 Oct 2010 09:37:25 -0500
Subject: [PATCH 2/2] block: add bdrv_flush to bdrv_close

To ensure that there are no pending completions before destroying a block
device.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

diff --git a/block.c b/block.c
index fc8defd..d2aed1b 100644
--- a/block.c
+++ b/block.c
@@ -644,6 +644,8 @@  unlink_and_fail:
 void bdrv_close(BlockDriverState *bs)
 {
     if (bs->drv) {
+        bdrv_flush(bs);
+
         if (bs == bs_snapshots) {
             bs_snapshots = NULL;
         }
-- 
1.7.0.4