Message ID | 1490628915-19459-1-git-send-email-den@openvz.org |
---|---|
State | New |
Headers | show |
@Subject: Do you mean "PATCH for-2.9?"? Because "RFC" to me means "please do not merge". ;-) I wouldn't mind a change like this to go into 2.9. On 27.03.2017 17:35, Denis V. Lunev wrote: > Recently we expirience hang with iothreads enabled with the following > call trace: > Thread 1 (Thread 0x7fa95efebc80 (LWP 177117)): > 0 ppoll () from /lib64/libc.so.6 > 2 qemu_poll_ns () at qemu-timer.c:313 > 3 aio_poll () at aio-posix.c:457 > 4 bdrv_flush () at block/io.c:2641 > 5 bdrv_close () at block.c:2143 > 6 bdrv_delete () at block.c:2352 > 7 bdrv_unref () at block.c:3429 > 8 blk_remove_bs () at block/block-backend.c:427 > 9 blk_delete () at block/block-backend.c:178 > 10 blk_unref () at block/block-backend.c:226 > 11 object_property_del_all () at qom/object.c:399 > 12 object_finalize () at qom/object.c:461 > 13 object_unref () at qom/object.c:898 > 14 object_property_del_child () at qom/object.c:422 > 15 qmp_marshal_device_del () at qmp-marshal.c:1145 > 16 handle_qmp_command () at /usr/src/debug/qemu-2.6.0/monitor.c:3929 > > Technically bdrv_flush() stucks in > while (rwco.ret == NOT_DONE) { > aio_poll(aio_context, true); > } > but rwco.ret is equal to 0 thus we have missed wakeup. Code investigation > reveals that we do not have performed aio_context_acquire() on this call > stack. > > This patch adds missed lock. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Max Reitz <mreitz@redhat.com> > CC: Eric Blake <eblake@redhat.com> > CC: Markus Armbruster <armbru@redhat.com> > --- > block/block-backend.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/block/block-backend.c b/block/block-backend.c > index 5742c09..65d5da9 100644 > --- a/block/block-backend.c > +++ b/block/block-backend.c > @@ -273,7 +273,11 @@ void blk_unref(BlockBackend *blk) > if (blk) { > assert(blk->refcnt > 0); > if (!--blk->refcnt) { > + AioContext *ctx = blk_get_aio_context(blk); > + > + aio_context_acquire(ctx); > blk_delete(blk); > + aio_context_release(ctx); But I don't think this is quite the correct place. The caller of blk_unref() should have acquired the AioContext already. As far as I can tell in this case that would be originally release_drive() in hw/core/qdev-properties-system.c and then blk_detach_dev(). I think the former would be the somehow more correct place but I can imagine the latter to be more useful in reality. I'll leave it to you. (As an alternative, you may of course convince me that this patch is indeed correct and should be taken as-is. :-)) Max > } > } > } >
On 03/27/2017 06:45 PM, Max Reitz wrote: > @Subject: Do you mean "PATCH for-2.9?"? Because "RFC" to me means > "please do not merge". ;-) > > I wouldn't mind a change like this to go into 2.9. I am quite sure that problem is here but unsure about the place. Here I use term 'RFC' as 'Can we start the discussion'? > On 27.03.2017 17:35, Denis V. Lunev wrote: >> Recently we expirience hang with iothreads enabled with the following >> call trace: >> Thread 1 (Thread 0x7fa95efebc80 (LWP 177117)): >> 0 ppoll () from /lib64/libc.so.6 >> 2 qemu_poll_ns () at qemu-timer.c:313 >> 3 aio_poll () at aio-posix.c:457 >> 4 bdrv_flush () at block/io.c:2641 >> 5 bdrv_close () at block.c:2143 >> 6 bdrv_delete () at block.c:2352 >> 7 bdrv_unref () at block.c:3429 >> 8 blk_remove_bs () at block/block-backend.c:427 >> 9 blk_delete () at block/block-backend.c:178 >> 10 blk_unref () at block/block-backend.c:226 >> 11 object_property_del_all () at qom/object.c:399 >> 12 object_finalize () at qom/object.c:461 >> 13 object_unref () at qom/object.c:898 >> 14 object_property_del_child () at qom/object.c:422 >> 15 qmp_marshal_device_del () at qmp-marshal.c:1145 >> 16 handle_qmp_command () at /usr/src/debug/qemu-2.6.0/monitor.c:3929 >> >> Technically bdrv_flush() stucks in >> while (rwco.ret == NOT_DONE) { >> aio_poll(aio_context, true); >> } >> but rwco.ret is equal to 0 thus we have missed wakeup. Code investigation >> reveals that we do not have performed aio_context_acquire() on this call >> stack. >> >> This patch adds missed lock. >> >> Signed-off-by: Denis V. Lunev <den@openvz.org> >> CC: Kevin Wolf <kwolf@redhat.com> >> CC: Max Reitz <mreitz@redhat.com> >> CC: Eric Blake <eblake@redhat.com> >> CC: Markus Armbruster <armbru@redhat.com> >> --- >> block/block-backend.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/block/block-backend.c b/block/block-backend.c >> index 5742c09..65d5da9 100644 >> --- a/block/block-backend.c >> +++ b/block/block-backend.c >> @@ -273,7 +273,11 @@ void blk_unref(BlockBackend *blk) >> if (blk) { >> assert(blk->refcnt > 0); >> if (!--blk->refcnt) { >> + AioContext *ctx = blk_get_aio_context(blk); >> + >> + aio_context_acquire(ctx); >> blk_delete(blk); >> + aio_context_release(ctx); > But I don't think this is quite the correct place. The caller of > blk_unref() should have acquired the AioContext already. As far as I can > tell in this case that would be originally release_drive() in > hw/core/qdev-properties-system.c and then blk_detach_dev(). > > I think the former would be the somehow more correct place but I can > imagine the latter to be more useful in reality. I'll leave it to you. > > (As an alternative, you may of course convince me that this patch is > indeed correct and should be taken as-is. :-)) ha-ha ;) no, I not going to try to convince you and that is why the patch was marked as RFC. I like release_drive() place. This should solve the problem for me. Thank you for the quick answer and good proposal. Den
diff --git a/block/block-backend.c b/block/block-backend.c index 5742c09..65d5da9 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -273,7 +273,11 @@ void blk_unref(BlockBackend *blk) if (blk) { assert(blk->refcnt > 0); if (!--blk->refcnt) { + AioContext *ctx = blk_get_aio_context(blk); + + aio_context_acquire(ctx); blk_delete(blk); + aio_context_release(ctx); } } }
Recently we expirience hang with iothreads enabled with the following call trace: Thread 1 (Thread 0x7fa95efebc80 (LWP 177117)): 0 ppoll () from /lib64/libc.so.6 2 qemu_poll_ns () at qemu-timer.c:313 3 aio_poll () at aio-posix.c:457 4 bdrv_flush () at block/io.c:2641 5 bdrv_close () at block.c:2143 6 bdrv_delete () at block.c:2352 7 bdrv_unref () at block.c:3429 8 blk_remove_bs () at block/block-backend.c:427 9 blk_delete () at block/block-backend.c:178 10 blk_unref () at block/block-backend.c:226 11 object_property_del_all () at qom/object.c:399 12 object_finalize () at qom/object.c:461 13 object_unref () at qom/object.c:898 14 object_property_del_child () at qom/object.c:422 15 qmp_marshal_device_del () at qmp-marshal.c:1145 16 handle_qmp_command () at /usr/src/debug/qemu-2.6.0/monitor.c:3929 Technically bdrv_flush() stucks in while (rwco.ret == NOT_DONE) { aio_poll(aio_context, true); } but rwco.ret is equal to 0 thus we have missed wakeup. Code investigation reveals that we do not have performed aio_context_acquire() on this call stack. This patch adds missed lock. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Kevin Wolf <kwolf@redhat.com> CC: Max Reitz <mreitz@redhat.com> CC: Eric Blake <eblake@redhat.com> CC: Markus Armbruster <armbru@redhat.com> --- block/block-backend.c | 4 ++++ 1 file changed, 4 insertions(+)