Patchwork Block I/O outside the QEMU global mutex was "Re: [RFC PATCH 00/17] Support for multiple "AIO contexts""

login
register
mail settings
Submitter Paolo Bonzini
Date Oct. 9, 2012, 4:26 p.m.
Message ID <5074502F.5030706@redhat.com>
Download mbox | patch
Permalink /patch/190360/
State New
Headers show

Comments

Paolo Bonzini - Oct. 9, 2012, 4:26 p.m.
Il 09/10/2012 17:37, Anthony Liguori ha scritto:
>>> >> In the very short term, I can imagine an aio fastpath that was only
>>> >> implemented in terms of the device API.  We could have a slow path that
>>> >> acquired the BQL.
>> >
>> > Not sure I follow.
> 
> As long as the ioeventfd thread can acquire qemu_mutex in order to call
> bdrv_* functions.  The new device-only API could do this under the
> covers for everything but the linux-aio fast path initially.

Ok, so it's about the locking.  I'm not even sure we need locking if we
have cooperative multitasking.  For example if bdrv_aio_readv/writev
is called from a VCPU thread, it can just schedule a bottom half for
itself in the appropriate AioContext.  Similarly for block jobs.

The only part where I'm not sure how it would work is bdrv_read/write,
because of the strange "qemu_aio_wait() calls select with a lock taken".
Maybe we can just forbid synchronous I/O if you set a non-default
AioContext.

This would be entirely hidden in the block layer.  For example the
following does it for bdrv_aio_readv/writev:



Then we can add a bdrv_aio_readv/writev_unlocked API to the protocols, which
would run outside the bottom half and provide the desired fast path.

Paolo

> That means that we can convert block devices to use the device-only API
> across the board (provided we make BQL recursive).
> 
> It also means we get at least some of the benefits of data-plane in the
> short term.
Anthony Liguori - Oct. 9, 2012, 6:26 p.m.
Paolo Bonzini <pbonzini@redhat.com> writes:

> Il 09/10/2012 17:37, Anthony Liguori ha scritto:
>>>> >> In the very short term, I can imagine an aio fastpath that was only
>>>> >> implemented in terms of the device API.  We could have a slow path that
>>>> >> acquired the BQL.
>>> >
>>> > Not sure I follow.
>> 
>> As long as the ioeventfd thread can acquire qemu_mutex in order to call
>> bdrv_* functions.  The new device-only API could do this under the
>> covers for everything but the linux-aio fast path initially.
>
> Ok, so it's about the locking.  I'm not even sure we need locking if we
> have cooperative multitasking.  For example if bdrv_aio_readv/writev
> is called from a VCPU thread, it can just schedule a bottom half for
> itself in the appropriate AioContext.  Similarly for block jobs.


Okay, let's separate out the two issues here though.  One is whether we
need a device specific block API.  The second is whether we should short
cut to a fast path in the short term and go after a fully unlocked bdrv_
layer in the long(shortish?) term.

So let's talk about your proposal...

> The only part where I'm not sure how it would work is bdrv_read/write,
> because of the strange "qemu_aio_wait() calls select with a lock taken".
> Maybe we can just forbid synchronous I/O if you set a non-default
> AioContext.

Not sure how practical that is.  The is an awful lot of sync I/O still left.

> This would be entirely hidden in the block layer.  For example the
> following does it for bdrv_aio_readv/writev:
>
> diff --git a/block.c b/block.c
> index e95f613..7165e82 100644
> --- a/block.c
> +++ b/block.c
> @@ -3712,15 +3712,6 @@ static AIOPool bdrv_em_co_aio_pool = {
>      .cancel             = bdrv_aio_co_cancel_em,
>  };
>  
> -static void bdrv_co_em_bh(void *opaque)
> -{
> -    BlockDriverAIOCBCoroutine *acb = opaque;
> -
> -    acb->common.cb(acb->common.opaque, acb->req.error);
> -    qemu_bh_delete(acb->bh);
> -    qemu_aio_release(acb);
> -}
> -
>  /* Invoke bdrv_co_do_readv/bdrv_co_do_writev */
>  static void coroutine_fn bdrv_co_do_rw(void *opaque)
>  {
> @@ -3735,8 +3726,17 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
>              acb->req.nb_sectors, acb->req.qiov, 0);
>      }
>  
> -    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
> -    qemu_bh_schedule(acb->bh);
> +    acb->common.cb(acb->common.opaque, acb->req.error);
> +    qemu_aio_release(acb);
> +}
> +
> +static void bdrv_co_em_bh(void *opaque)
> +{
> +    BlockDriverAIOCBCoroutine *acb = opaque;
> +
> +    qemu_bh_delete(acb->bh);
> +    co = qemu_coroutine_create(bdrv_co_do_rw);
> +    qemu_coroutine_enter(co, acb);
>  }
>  
>  static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
> @@ -3756,8 +3756,8 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
>      acb->req.qiov = qiov;
>      acb->is_write = is_write;
>  
> -    co = qemu_coroutine_create(bdrv_co_do_rw);
> -    qemu_coroutine_enter(co, acb);
> +    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
> +    qemu_bh_schedule(acb->bh);
>  
>      return &acb->common;
>  }
>
>
> Then we can add a bdrv_aio_readv/writev_unlocked API to the protocols, which
> would run outside the bottom half and provide the desired fast path.

This works for some of the block layer I think.  How does this interact
with thread pools for AIO?

But this wouldn't work well with things like NBD or curl, right?  What's
the plan there?

Regards,

Anthony Liguori

>
> Paolo
>
>> That means that we can convert block devices to use the device-only API
>> across the board (provided we make BQL recursive).
>> 
>> It also means we get at least some of the benefits of data-plane in the
>> short term.
Paolo Bonzini - Oct. 10, 2012, 7:11 a.m.
Il 09/10/2012 20:26, Anthony Liguori ha scritto:
> Paolo Bonzini <pbonzini@redhat.com> writes:
> 
>> Il 09/10/2012 17:37, Anthony Liguori ha scritto:
>>>>>>> In the very short term, I can imagine an aio fastpath that was only
>>>>>>> implemented in terms of the device API.  We could have a slow path that
>>>>>>> acquired the BQL.
>>>>>
>>>>> Not sure I follow.
>>>
>>> As long as the ioeventfd thread can acquire qemu_mutex in order to call
>>> bdrv_* functions.  The new device-only API could do this under the
>>> covers for everything but the linux-aio fast path initially.
>>
>> Ok, so it's about the locking.  I'm not even sure we need locking if we
>> have cooperative multitasking.  For example if bdrv_aio_readv/writev
>> is called from a VCPU thread, it can just schedule a bottom half for
>> itself in the appropriate AioContext.  Similarly for block jobs.
> 
> Okay, let's separate out the two issues here though.  One is whether we
> need a device specific block API.  The second is whether we should short
> cut to a fast path in the short term and go after a fully unlocked bdrv_
> layer in the long(shortish?) term.
> 
> So let's talk about your proposal...
> 
>> The only part where I'm not sure how it would work is bdrv_read/write,
>> because of the strange "qemu_aio_wait() calls select with a lock taken".
>> Maybe we can just forbid synchronous I/O if you set a non-default
>> AioContext.
> 
> Not sure how practical that is.  The is an awful lot of sync I/O still left.

Hmm, yeah, perhaps we need to bite the bullet and use a recursive lock.
 The lock would be taken by:

- sync I/O ops

- monitor commands that currently call bdrv_drain_all

- aio_poll when calling bottom halves or handlers

The rest of the proposal however would stand (especially with reference
to block jobs).

I think we can proceed incrementally.  The first obvious step is to
s/qemu_bh_new/aio_bh_new/ in the whole block layer (including the
CoQueue stuff), which would also help fixing the qemu-char bug that Jan
reported.

>> This would be entirely hidden in the block layer.  For example the
>> following does it for bdrv_aio_readv/writev:
>>
>> diff --git a/block.c b/block.c
>> index e95f613..7165e82 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3712,15 +3712,6 @@ static AIOPool bdrv_em_co_aio_pool = {
>>      .cancel             = bdrv_aio_co_cancel_em,
>>  };
>>  
>> -static void bdrv_co_em_bh(void *opaque)
>> -{
>> -    BlockDriverAIOCBCoroutine *acb = opaque;
>> -
>> -    acb->common.cb(acb->common.opaque, acb->req.error);
>> -    qemu_bh_delete(acb->bh);
>> -    qemu_aio_release(acb);
>> -}
>> -
>>  /* Invoke bdrv_co_do_readv/bdrv_co_do_writev */
>>  static void coroutine_fn bdrv_co_do_rw(void *opaque)
>>  {
>> @@ -3735,8 +3726,17 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
>>              acb->req.nb_sectors, acb->req.qiov, 0);
>>      }
>>  
>> -    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
>> -    qemu_bh_schedule(acb->bh);
>> +    acb->common.cb(acb->common.opaque, acb->req.error);
>> +    qemu_aio_release(acb);
>> +}
>> +
>> +static void bdrv_co_em_bh(void *opaque)
>> +{
>> +    BlockDriverAIOCBCoroutine *acb = opaque;
>> +
>> +    qemu_bh_delete(acb->bh);
>> +    co = qemu_coroutine_create(bdrv_co_do_rw);
>> +    qemu_coroutine_enter(co, acb);
>>  }
>>  
>>  static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
>> @@ -3756,8 +3756,8 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
>>      acb->req.qiov = qiov;
>>      acb->is_write = is_write;
>>  
>> -    co = qemu_coroutine_create(bdrv_co_do_rw);
>> -    qemu_coroutine_enter(co, acb);
>> +    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
>> +    qemu_bh_schedule(acb->bh);
>>  
>>      return &acb->common;
>>  }
>>
>>
>> Then we can add a bdrv_aio_readv/writev_unlocked API to the protocols, which
>> would run outside the bottom half and provide the desired fast path.
> 
> This works for some of the block layer I think.  How does this interact
> with thread pools for AIO?
> 
> But this wouldn't work well with things like NBD or curl, right?  What's
> the plan there?

NBD uses coroutines; curl can use the non-unlocked
bdrv_aio_readv/writev.  In both cases they would execute in the
dataplane thread.  qcow2-over-raw would also execute its read/write code
entirely from the dataplane thread, for example.

Paolo
Anthony Liguori - Oct. 10, 2012, 12:25 p.m.
Paolo Bonzini <pbonzini@redhat.com> writes:

> Il 09/10/2012 20:26, Anthony Liguori ha scritto:
>> Paolo Bonzini <pbonzini@redhat.com> writes:
>> 
>>> Il 09/10/2012 17:37, Anthony Liguori ha scritto:
>>>>>>>> In the very short term, I can imagine an aio fastpath that was only
>>>>>>>> implemented in terms of the device API.  We could have a slow path that
>>>>>>>> acquired the BQL.
>>>>>>
>>>>>> Not sure I follow.
>>>>
>>>> As long as the ioeventfd thread can acquire qemu_mutex in order to call
>>>> bdrv_* functions.  The new device-only API could do this under the
>>>> covers for everything but the linux-aio fast path initially.
>>>
>>> Ok, so it's about the locking.  I'm not even sure we need locking if we
>>> have cooperative multitasking.  For example if bdrv_aio_readv/writev
>>> is called from a VCPU thread, it can just schedule a bottom half for
>>> itself in the appropriate AioContext.  Similarly for block jobs.
>> 
>> Okay, let's separate out the two issues here though.  One is whether we
>> need a device specific block API.  The second is whether we should short
>> cut to a fast path in the short term and go after a fully unlocked bdrv_
>> layer in the long(shortish?) term.
>> 
>> So let's talk about your proposal...
>> 
>>> The only part where I'm not sure how it would work is bdrv_read/write,
>>> because of the strange "qemu_aio_wait() calls select with a lock taken".
>>> Maybe we can just forbid synchronous I/O if you set a non-default
>>> AioContext.
>> 
>> Not sure how practical that is.  The is an awful lot of sync I/O still left.
>
> Hmm, yeah, perhaps we need to bite the bullet and use a recursive lock.
>  The lock would be taken by:
>
> - sync I/O ops
>
> - monitor commands that currently call bdrv_drain_all
>
> - aio_poll when calling bottom halves or handlers
>
> The rest of the proposal however would stand (especially with reference
> to block jobs).
>
> I think we can proceed incrementally.  The first obvious step is to
> s/qemu_bh_new/aio_bh_new/ in the whole block layer (including the
> CoQueue stuff), which would also help fixing the qemu-char bug that Jan
> reported.
>
>>> This would be entirely hidden in the block layer.  For example the
>>> following does it for bdrv_aio_readv/writev:
>>>
>>> diff --git a/block.c b/block.c
>>> index e95f613..7165e82 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -3712,15 +3712,6 @@ static AIOPool bdrv_em_co_aio_pool = {
>>>      .cancel             = bdrv_aio_co_cancel_em,
>>>  };
>>>  
>>> -static void bdrv_co_em_bh(void *opaque)
>>> -{
>>> -    BlockDriverAIOCBCoroutine *acb = opaque;
>>> -
>>> -    acb->common.cb(acb->common.opaque, acb->req.error);
>>> -    qemu_bh_delete(acb->bh);
>>> -    qemu_aio_release(acb);
>>> -}
>>> -
>>>  /* Invoke bdrv_co_do_readv/bdrv_co_do_writev */
>>>  static void coroutine_fn bdrv_co_do_rw(void *opaque)
>>>  {
>>> @@ -3735,8 +3726,17 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
>>>              acb->req.nb_sectors, acb->req.qiov, 0);
>>>      }
>>>  
>>> -    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
>>> -    qemu_bh_schedule(acb->bh);
>>> +    acb->common.cb(acb->common.opaque, acb->req.error);
>>> +    qemu_aio_release(acb);
>>> +}
>>> +
>>> +static void bdrv_co_em_bh(void *opaque)
>>> +{
>>> +    BlockDriverAIOCBCoroutine *acb = opaque;
>>> +
>>> +    qemu_bh_delete(acb->bh);
>>> +    co = qemu_coroutine_create(bdrv_co_do_rw);
>>> +    qemu_coroutine_enter(co, acb);
>>>  }
>>>  
>>>  static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
>>> @@ -3756,8 +3756,8 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
>>>      acb->req.qiov = qiov;
>>>      acb->is_write = is_write;
>>>  
>>> -    co = qemu_coroutine_create(bdrv_co_do_rw);
>>> -    qemu_coroutine_enter(co, acb);
>>> +    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
>>> +    qemu_bh_schedule(acb->bh);
>>>  
>>>      return &acb->common;
>>>  }
>>>
>>>
>>> Then we can add a bdrv_aio_readv/writev_unlocked API to the protocols, which
>>> would run outside the bottom half and provide the desired fast path.
>> 
>> This works for some of the block layer I think.  How does this interact
>> with thread pools for AIO?
>> 
>> But this wouldn't work well with things like NBD or curl, right?  What's
>> the plan there?
>
> NBD uses coroutines; curl can use the non-unlocked
> bdrv_aio_readv/writev.  In both cases they would execute in the
> dataplane thread.  qcow2-over-raw would also execute its read/write code
> entirely from the dataplane thread, for example.

Does that mean that we'd stop processing the queue if we're waiting for
an I/O completion to handle meta data operations?

If that's the case, that probably will hurt performance overall.

Regards,

Anthony Liguori

>
> Paolo
Paolo Bonzini - Oct. 10, 2012, 1:31 p.m.
Il 10/10/2012 14:25, Anthony Liguori ha scritto:
>> > NBD uses coroutines; curl can use the non-unlocked
>> > bdrv_aio_readv/writev.  In both cases they would execute in the
>> > dataplane thread.  qcow2-over-raw would also execute its read/write code
>> > entirely from the dataplane thread, for example.
> Does that mean that we'd stop processing the queue if we're waiting for
> an I/O completion to handle meta data operations?
> 
> If that's the case, that probably will hurt performance overall.

From discussion on IRC it looked like this was ENOCAFFEINE. :)

Paolo
Anthony Liguori - Oct. 10, 2012, 2:44 p.m.
Paolo Bonzini <pbonzini@redhat.com> writes:

> Il 10/10/2012 14:25, Anthony Liguori ha scritto:
>>> > NBD uses coroutines; curl can use the non-unlocked
>>> > bdrv_aio_readv/writev.  In both cases they would execute in the
>>> > dataplane thread.  qcow2-over-raw would also execute its read/write code
>>> > entirely from the dataplane thread, for example.
>> Does that mean that we'd stop processing the queue if we're waiting for
>> an I/O completion to handle meta data operations?
>> 
>> If that's the case, that probably will hurt performance overall.
>
>>From discussion on IRC it looked like this was ENOCAFFEINE. :)
>
> Paolo

Correct :-)

Regards,

Anthony Liguori

Patch

diff --git a/block.c b/block.c
index e95f613..7165e82 100644
--- a/block.c
+++ b/block.c
@@ -3712,15 +3712,6 @@  static AIOPool bdrv_em_co_aio_pool = {
     .cancel             = bdrv_aio_co_cancel_em,
 };
 
-static void bdrv_co_em_bh(void *opaque)
-{
-    BlockDriverAIOCBCoroutine *acb = opaque;
-
-    acb->common.cb(acb->common.opaque, acb->req.error);
-    qemu_bh_delete(acb->bh);
-    qemu_aio_release(acb);
-}
-
 /* Invoke bdrv_co_do_readv/bdrv_co_do_writev */
 static void coroutine_fn bdrv_co_do_rw(void *opaque)
 {
@@ -3735,8 +3726,17 @@  static void coroutine_fn bdrv_co_do_rw(void *opaque)
             acb->req.nb_sectors, acb->req.qiov, 0);
     }
 
-    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
-    qemu_bh_schedule(acb->bh);
+    acb->common.cb(acb->common.opaque, acb->req.error);
+    qemu_aio_release(acb);
+}
+
+static void bdrv_co_em_bh(void *opaque)
+{
+    BlockDriverAIOCBCoroutine *acb = opaque;
+
+    qemu_bh_delete(acb->bh);
+    co = qemu_coroutine_create(bdrv_co_do_rw);
+    qemu_coroutine_enter(co, acb);
 }
 
 static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
@@ -3756,8 +3756,8 @@  static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
     acb->req.qiov = qiov;
     acb->is_write = is_write;
 
-    co = qemu_coroutine_create(bdrv_co_do_rw);
-    qemu_coroutine_enter(co, acb);
+    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
+    qemu_bh_schedule(acb->bh);
 
     return &acb->common;
 }