Message ID | 1435818839-5376-1-git-send-email-famz@redhat.com |
---|---|
State | New |
On 02.07.2015 at 08:33, Fam Zheng wrote:
> bdrv_flush() uses a loop like
>
>     while (rwco.ret == NOT_DONE) {
>         aio_poll(aio_context, true);
>     }
>
> to wait for thread pool, which may not get notified about the scheduled
> BH right away, if there is no new event that wakes up a blocking
> qemu_poll_ns(). In this case, it may even be a permanent hang.
>
> Wake the main thread up by writing to the event notifier fd.
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>
>
> ---
>
> I suspect this may relate to
>
> [Qemu-devel] "iothread: release iothread around aio_poll" causes random
> hangs at startup
>
> [http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg00623.html]
>
> reported by Christian Borntraeger. Because in iothread there is rarely
> any fd activity, so the blocking aio_poll() may block forever if it
> misses the BH schedule.
>
> Christian, could you test this patch against your reproducer?

Still does not work. It really seems to be triggered by the null device
(and there must be >= 2).

> ---
>  thread-pool.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/thread-pool.c b/thread-pool.c
> index ac909f4..9b9c065 100644
> --- a/thread-pool.c
> +++ b/thread-pool.c
> @@ -112,6 +112,7 @@ static void *worker_thread(void *opaque)
>          qemu_mutex_lock(&pool->lock);
>
>          qemu_bh_schedule(pool->completion_bh);
> +        aio_notify(pool->ctx);
>      }
>
>      pool->cur_threads--;
>

On 02/07/2015 08:33, Fam Zheng wrote:
> bdrv_flush() uses a loop like
>
>     while (rwco.ret == NOT_DONE) {
>         aio_poll(aio_context, true);
>     }
>
> to wait for thread pool, which may not get notified about the scheduled
> BH right away, if there is no new event that wakes up a blocking
> qemu_poll_ns().

That translates to "the dispatching optimization does not work". :) I
do not think that is the problem.

Paolo

On Thu, 07/02 09:11, Paolo Bonzini wrote:
>
>
> On 02/07/2015 08:33, Fam Zheng wrote:
> > bdrv_flush() uses a loop like
> >
> >     while (rwco.ret == NOT_DONE) {
> >         aio_poll(aio_context, true);
> >     }
> >
> > to wait for thread pool, which may not get notified about the scheduled
> > BH right away, if there is no new event that wakes up a blocking
> > qemu_poll_ns().
>
> That translates to "the dispatching optimization does not work". :) I
> do not think that is the problem.

I must be missing something. I see a hang locally with some AioContext
patches I'm testing, and this does fix it. I traced that qemu_bh_schedule
does call aio_notify and event_notifier_set, so it's curious. Still looking.

Fam

diff --git a/thread-pool.c b/thread-pool.c
index ac909f4..9b9c065 100644
--- a/thread-pool.c
+++ b/thread-pool.c
@@ -112,6 +112,7 @@ static void *worker_thread(void *opaque)
         qemu_mutex_lock(&pool->lock);
 
         qemu_bh_schedule(pool->completion_bh);
+        aio_notify(pool->ctx);
     }
 
     pool->cur_threads--;

bdrv_flush() uses a loop like

    while (rwco.ret == NOT_DONE) {
        aio_poll(aio_context, true);
    }

to wait for thread pool, which may not get notified about the scheduled
BH right away, if there is no new event that wakes up a blocking
qemu_poll_ns(). In this case, it may even be a permanent hang.

Wake the main thread up by writing to the event notifier fd.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Fam Zheng <famz@redhat.com>

---

I suspect this may relate to

[Qemu-devel] "iothread: release iothread around aio_poll" causes random
hangs at startup

[http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg00623.html]

reported by Christian Borntraeger. Because in iothread there is rarely
any fd activity, so the blocking aio_poll() may block forever if it
misses the BH schedule.

Christian, could you test this patch against your reproducer?

---
 thread-pool.c | 1 +
 1 file changed, 1 insertion(+)
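
For illustration only, the failure mode described in the commit message can
be modelled outside QEMU. The following is a minimal sketch, assuming Linux
eventfd and pthreads. The names demo_ctx, demo_worker and demo_wait are
hypothetical and are not QEMU code: the eventfd stands in for the event
notifier fd, and the atomic flag stands in for the scheduled BH. As noted in
the thread, qemu_bh_schedule() was traced to call aio_notify() already, so
this models the suspected scenario, not QEMU's confirmed behaviour.

```c
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical stand-in for an AioContext; not QEMU's real structure. */
struct demo_ctx {
    int notifier_fd;         /* eventfd standing in for the event notifier fd */
    atomic_int bh_scheduled; /* set once the "completion BH" is scheduled */
    int notify;              /* nonzero: worker also writes to notifier_fd */
};

/* Worker: mark the BH as scheduled and, optionally, wake the poller. */
static void *demo_worker(void *opaque)
{
    struct demo_ctx *ctx = opaque;
    uint64_t one = 1;

    atomic_store(&ctx->bh_scheduled, 1);
    if (ctx->notify) {
        /* Plays the role of the explicit aio_notify() added by the patch. */
        ssize_t n = write(ctx->notifier_fd, &one, sizeof(one));
        (void)n; /* a failed write simply looks like a missed wakeup here */
    }
    return NULL;
}

/*
 * Start the worker, then block in poll() on the notifier fd.
 * Returns 1 if the fd woke poll(), 0 if poll() timed out, -1 on error.
 */
static int demo_wait(int notify, int timeout_ms)
{
    struct demo_ctx ctx;
    struct pollfd pfd;
    pthread_t tid;
    int ret;

    ctx.notifier_fd = eventfd(0, 0);
    if (ctx.notifier_fd < 0) {
        return -1;
    }
    atomic_init(&ctx.bh_scheduled, 0);
    ctx.notify = notify;

    if (pthread_create(&tid, NULL, demo_worker, &ctx) != 0) {
        close(ctx.notifier_fd);
        return -1;
    }

    pfd.fd = ctx.notifier_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    /* A truly blocking poll would pass -1 and could hang; the timeout bounds it. */
    ret = poll(&pfd, 1, timeout_ms);

    pthread_join(tid, NULL);
    close(ctx.notifier_fd);
    return ret > 0 ? 1 : (ret == 0 ? 0 : -1);
}
```

Built with -pthread, demo_wait(1, 1000) should return 1 as soon as the worker
writes to the fd, while demo_wait(0, 100) should return 0 only after the
timeout expires, even though the flag was set well before that. With an
infinite timeout the second case would be the permanent hang the commit
message worries about.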