Message ID | 5074502F.5030706@redhat.com |
---|---|
State | New |
Headers | show |
Paolo Bonzini <pbonzini@redhat.com> writes: > Il 09/10/2012 17:37, Anthony Liguori ha scritto: >>>> >> In the very short term, I can imagine an aio fastpath that was only >>>> >> implemented in terms of the device API. We could have a slow path that >>>> >> acquired the BQL. >>> > >>> > Not sure I follow. >> >> As long as the ioeventfd thread can acquire qemu_mutex in order to call >> bdrv_* functions. The new device-only API could do this under the >> covers for everything but the linux-aio fast path initially. > > Ok, so it's about the locking. I'm not even sure we need locking if we > have cooperative multitasking. For example if bdrv_aio_readv/writev > is called from a VCPU thread, it can just schedule a bottom half for > itself in the appropriate AioContext. Similarly for block jobs. Okay, let's separate out the two issues here though. One is whether we need a device specific block API. The second is whether we should short cut to a fast path in the short term and go after a fully unlocked bdrv_ layer in the long(shortish?) term. So let's talk about your proposal... > The only part where I'm not sure how it would work is bdrv_read/write, > because of the strange "qemu_aio_wait() calls select with a lock taken". > Maybe we can just forbid synchronous I/O if you set a non-default > AioContext. Not sure how practical that is. The is an awful lot of sync I/O still left. > This would be entirely hidden in the block layer. For example the > following does it for bdrv_aio_readv/writev: > > diff --git a/block.c b/block.c > index e95f613..7165e82 100644 > --- a/block.c > +++ b/block.c > @@ -3712,15 +3712,6 @@ static AIOPool bdrv_em_co_aio_pool = { > .cancel = bdrv_aio_co_cancel_em, > }; > > -static void bdrv_co_em_bh(void *opaque) > -{ > - BlockDriverAIOCBCoroutine *acb = opaque; > - > - acb->common.cb(acb->common.opaque, acb->req.error); > - qemu_bh_delete(acb->bh); > - qemu_aio_release(acb); > -} > - > /* Invoke bdrv_co_do_readv/bdrv_co_do_writev */ > static void coroutine_fn bdrv_co_do_rw(void *opaque) > { > @@ -3735,8 +3726,17 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque) > acb->req.nb_sectors, acb->req.qiov, 0); > } > > - acb->bh = qemu_bh_new(bdrv_co_em_bh, acb); > - qemu_bh_schedule(acb->bh); > + acb->common.cb(acb->common.opaque, acb->req.error); > + qemu_aio_release(acb); > +} > + > +static void bdrv_co_em_bh(void *opaque) > +{ > + BlockDriverAIOCBCoroutine *acb = opaque; > + > + qemu_bh_delete(acb->bh); > + co = qemu_coroutine_create(bdrv_co_do_rw); > + qemu_coroutine_enter(co, acb); > } > > static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs, > @@ -3756,8 +3756,8 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs, > acb->req.qiov = qiov; > acb->is_write = is_write; > > - co = qemu_coroutine_create(bdrv_co_do_rw); > - qemu_coroutine_enter(co, acb); > + acb->bh = qemu_bh_new(bdrv_co_em_bh, acb); > + qemu_bh_schedule(acb->bh); > > return &acb->common; > } > > > Then we can add a bdrv_aio_readv/writev_unlocked API to the protocols, which > would run outside the bottom half and provide the desired fast path. This works for some of the block layer I think. How does this interact with thread pools for AIO? But this wouldn't work well with things like NBD or curl, right? What's the plan there? Regards, Anthony Liguori > > Paolo > >> That means that we can convert block devices to use the device-only API >> across the board (provided we make BQL recursive). >> >> It also means we get at least some of the benefits of data-plane in the >> short term.
Il 09/10/2012 20:26, Anthony Liguori ha scritto: > Paolo Bonzini <pbonzini@redhat.com> writes: > >> Il 09/10/2012 17:37, Anthony Liguori ha scritto: >>>>>>> In the very short term, I can imagine an aio fastpath that was only >>>>>>> implemented in terms of the device API. We could have a slow path that >>>>>>> acquired the BQL. >>>>> >>>>> Not sure I follow. >>> >>> As long as the ioeventfd thread can acquire qemu_mutex in order to call >>> bdrv_* functions. The new device-only API could do this under the >>> covers for everything but the linux-aio fast path initially. >> >> Ok, so it's about the locking. I'm not even sure we need locking if we >> have cooperative multitasking. For example if bdrv_aio_readv/writev >> is called from a VCPU thread, it can just schedule a bottom half for >> itself in the appropriate AioContext. Similarly for block jobs. > > Okay, let's separate out the two issues here though. One is whether we > need a device specific block API. The second is whether we should short > cut to a fast path in the short term and go after a fully unlocked bdrv_ > layer in the long(shortish?) term. > > So let's talk about your proposal... > >> The only part where I'm not sure how it would work is bdrv_read/write, >> because of the strange "qemu_aio_wait() calls select with a lock taken". >> Maybe we can just forbid synchronous I/O if you set a non-default >> AioContext. > > Not sure how practical that is. The is an awful lot of sync I/O still left. Hmm, yeah, perhaps we need to bite the bullet and use a recursive lock. The lock would be taken by: - sync I/O ops - monitor commands that currently call bdrv_drain_all - aio_poll when calling bottom halves or handlers The rest of the proposal however would stand (especially with reference to block jobs). I think we can proceed incrementally. The first obvious step is to s/qemu_bh_new/aio_bh_new/ in the whole block layer (including the CoQueue stuff), which would also help fixing the qemu-char bug that Jan reported. >> This would be entirely hidden in the block layer. For example the >> following does it for bdrv_aio_readv/writev: >> >> diff --git a/block.c b/block.c >> index e95f613..7165e82 100644 >> --- a/block.c >> +++ b/block.c >> @@ -3712,15 +3712,6 @@ static AIOPool bdrv_em_co_aio_pool = { >> .cancel = bdrv_aio_co_cancel_em, >> }; >> >> -static void bdrv_co_em_bh(void *opaque) >> -{ >> - BlockDriverAIOCBCoroutine *acb = opaque; >> - >> - acb->common.cb(acb->common.opaque, acb->req.error); >> - qemu_bh_delete(acb->bh); >> - qemu_aio_release(acb); >> -} >> - >> /* Invoke bdrv_co_do_readv/bdrv_co_do_writev */ >> static void coroutine_fn bdrv_co_do_rw(void *opaque) >> { >> @@ -3735,8 +3726,17 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque) >> acb->req.nb_sectors, acb->req.qiov, 0); >> } >> >> - acb->bh = qemu_bh_new(bdrv_co_em_bh, acb); >> - qemu_bh_schedule(acb->bh); >> + acb->common.cb(acb->common.opaque, acb->req.error); >> + qemu_aio_release(acb); >> +} >> + >> +static void bdrv_co_em_bh(void *opaque) >> +{ >> + BlockDriverAIOCBCoroutine *acb = opaque; >> + >> + qemu_bh_delete(acb->bh); >> + co = qemu_coroutine_create(bdrv_co_do_rw); >> + qemu_coroutine_enter(co, acb); >> } >> >> static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs, >> @@ -3756,8 +3756,8 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs, >> acb->req.qiov = qiov; >> acb->is_write = is_write; >> >> - co = qemu_coroutine_create(bdrv_co_do_rw); >> - qemu_coroutine_enter(co, acb); >> + acb->bh = qemu_bh_new(bdrv_co_em_bh, acb); >> + qemu_bh_schedule(acb->bh); >> >> return &acb->common; >> } >> >> >> Then we can add a bdrv_aio_readv/writev_unlocked API to the protocols, which >> would run outside the bottom half and provide the desired fast path. > > This works for some of the block layer I think. How does this interact > with thread pools for AIO? > > But this wouldn't work well with things like NBD or curl, right? What's > the plan there? NBD uses coroutines; curl can use the non-unlocked bdrv_aio_readv/writev. In both cases they would execute in the dataplane thread. qcow2-over-raw would also execute its read/write code entirely from the dataplane thread, for example. Paolo
Paolo Bonzini <pbonzini@redhat.com> writes: > Il 09/10/2012 20:26, Anthony Liguori ha scritto: >> Paolo Bonzini <pbonzini@redhat.com> writes: >> >>> Il 09/10/2012 17:37, Anthony Liguori ha scritto: >>>>>>>> In the very short term, I can imagine an aio fastpath that was only >>>>>>>> implemented in terms of the device API. We could have a slow path that >>>>>>>> acquired the BQL. >>>>>> >>>>>> Not sure I follow. >>>> >>>> As long as the ioeventfd thread can acquire qemu_mutex in order to call >>>> bdrv_* functions. The new device-only API could do this under the >>>> covers for everything but the linux-aio fast path initially. >>> >>> Ok, so it's about the locking. I'm not even sure we need locking if we >>> have cooperative multitasking. For example if bdrv_aio_readv/writev >>> is called from a VCPU thread, it can just schedule a bottom half for >>> itself in the appropriate AioContext. Similarly for block jobs. >> >> Okay, let's separate out the two issues here though. One is whether we >> need a device specific block API. The second is whether we should short >> cut to a fast path in the short term and go after a fully unlocked bdrv_ >> layer in the long(shortish?) term. >> >> So let's talk about your proposal... >> >>> The only part where I'm not sure how it would work is bdrv_read/write, >>> because of the strange "qemu_aio_wait() calls select with a lock taken". >>> Maybe we can just forbid synchronous I/O if you set a non-default >>> AioContext. >> >> Not sure how practical that is. The is an awful lot of sync I/O still left. > > Hmm, yeah, perhaps we need to bite the bullet and use a recursive lock. > The lock would be taken by: > > - sync I/O ops > > - monitor commands that currently call bdrv_drain_all > > - aio_poll when calling bottom halves or handlers > > The rest of the proposal however would stand (especially with reference > to block jobs). > > I think we can proceed incrementally. The first obvious step is to > s/qemu_bh_new/aio_bh_new/ in the whole block layer (including the > CoQueue stuff), which would also help fixing the qemu-char bug that Jan > reported. > >>> This would be entirely hidden in the block layer. For example the >>> following does it for bdrv_aio_readv/writev: >>> >>> diff --git a/block.c b/block.c >>> index e95f613..7165e82 100644 >>> --- a/block.c >>> +++ b/block.c >>> @@ -3712,15 +3712,6 @@ static AIOPool bdrv_em_co_aio_pool = { >>> .cancel = bdrv_aio_co_cancel_em, >>> }; >>> >>> -static void bdrv_co_em_bh(void *opaque) >>> -{ >>> - BlockDriverAIOCBCoroutine *acb = opaque; >>> - >>> - acb->common.cb(acb->common.opaque, acb->req.error); >>> - qemu_bh_delete(acb->bh); >>> - qemu_aio_release(acb); >>> -} >>> - >>> /* Invoke bdrv_co_do_readv/bdrv_co_do_writev */ >>> static void coroutine_fn bdrv_co_do_rw(void *opaque) >>> { >>> @@ -3735,8 +3726,17 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque) >>> acb->req.nb_sectors, acb->req.qiov, 0); >>> } >>> >>> - acb->bh = qemu_bh_new(bdrv_co_em_bh, acb); >>> - qemu_bh_schedule(acb->bh); >>> + acb->common.cb(acb->common.opaque, acb->req.error); >>> + qemu_aio_release(acb); >>> +} >>> + >>> +static void bdrv_co_em_bh(void *opaque) >>> +{ >>> + BlockDriverAIOCBCoroutine *acb = opaque; >>> + >>> + qemu_bh_delete(acb->bh); >>> + co = qemu_coroutine_create(bdrv_co_do_rw); >>> + qemu_coroutine_enter(co, acb); >>> } >>> >>> static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs, >>> @@ -3756,8 +3756,8 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs, >>> acb->req.qiov = qiov; >>> acb->is_write = is_write; >>> >>> - co = qemu_coroutine_create(bdrv_co_do_rw); >>> - qemu_coroutine_enter(co, acb); >>> + acb->bh = qemu_bh_new(bdrv_co_em_bh, acb); >>> + qemu_bh_schedule(acb->bh); >>> >>> return &acb->common; >>> } >>> >>> >>> Then we can add a bdrv_aio_readv/writev_unlocked API to the protocols, which >>> would run outside the bottom half and provide the desired fast path. >> >> This works for some of the block layer I think. How does this interact >> with thread pools for AIO? >> >> But this wouldn't work well with things like NBD or curl, right? What's >> the plan there? > > NBD uses coroutines; curl can use the non-unlocked > bdrv_aio_readv/writev. In both cases they would execute in the > dataplane thread. qcow2-over-raw would also execute its read/write code > entirely from the dataplane thread, for example. Does that mean that we'd stop processing the queue if we're waiting for an I/O completion to handle meta data operations? If that's the case, that probably will hurt performance overall. Regards, Anthony Liguori > > Paolo
Il 10/10/2012 14:25, Anthony Liguori ha scritto: >> > NBD uses coroutines; curl can use the non-unlocked >> > bdrv_aio_readv/writev. In both cases they would execute in the >> > dataplane thread. qcow2-over-raw would also execute its read/write code >> > entirely from the dataplane thread, for example. > Does that mean that we'd stop processing the queue if we're waiting for > an I/O completion to handle meta data operations? > > If that's the case, that probably will hurt performance overall. From discussion on IRC it looked like this was ENOCAFFEINE. :) Paolo
Paolo Bonzini <pbonzini@redhat.com> writes: > Il 10/10/2012 14:25, Anthony Liguori ha scritto: >>> > NBD uses coroutines; curl can use the non-unlocked >>> > bdrv_aio_readv/writev. In both cases they would execute in the >>> > dataplane thread. qcow2-over-raw would also execute its read/write code >>> > entirely from the dataplane thread, for example. >> Does that mean that we'd stop processing the queue if we're waiting for >> an I/O completion to handle meta data operations? >> >> If that's the case, that probably will hurt performance overall. > >>From discussion on IRC it looked like this was ENOCAFFEINE. :) > > Paolo Correct :-) Regards, Anthony Liguori
diff --git a/block.c b/block.c index e95f613..7165e82 100644 --- a/block.c +++ b/block.c @@ -3712,15 +3712,6 @@ static AIOPool bdrv_em_co_aio_pool = { .cancel = bdrv_aio_co_cancel_em, }; -static void bdrv_co_em_bh(void *opaque) -{ - BlockDriverAIOCBCoroutine *acb = opaque; - - acb->common.cb(acb->common.opaque, acb->req.error); - qemu_bh_delete(acb->bh); - qemu_aio_release(acb); -} - /* Invoke bdrv_co_do_readv/bdrv_co_do_writev */ static void coroutine_fn bdrv_co_do_rw(void *opaque) { @@ -3735,8 +3726,17 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque) acb->req.nb_sectors, acb->req.qiov, 0); } - acb->bh = qemu_bh_new(bdrv_co_em_bh, acb); - qemu_bh_schedule(acb->bh); + acb->common.cb(acb->common.opaque, acb->req.error); + qemu_aio_release(acb); +} + +static void bdrv_co_em_bh(void *opaque) +{ + BlockDriverAIOCBCoroutine *acb = opaque; + + qemu_bh_delete(acb->bh); + co = qemu_coroutine_create(bdrv_co_do_rw); + qemu_coroutine_enter(co, acb); } static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs, @@ -3756,8 +3756,8 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs, acb->req.qiov = qiov; acb->is_write = is_write; - co = qemu_coroutine_create(bdrv_co_do_rw); - qemu_coroutine_enter(co, acb); + acb->bh = qemu_bh_new(bdrv_co_em_bh, acb); + qemu_bh_schedule(acb->bh); return &acb->common; }