Message ID | 20150610093408.GC11648@ad.nay.redhat.com |
---|---|
State | New |
On 10.06.2015 at 11:34, Fam Zheng wrote:
> On Wed, 06/10 11:18, Christian Borntraeger wrote:
>> On 10.06.2015 at 04:12, Fam Zheng wrote:
>>> On Tue, 06/09 11:01, Christian Borntraeger wrote:
>>>> On 09.06.2015 at 04:28, Fam Zheng wrote:
>>>>> On Tue, 06/02 16:36, Christian Borntraeger wrote:
>>>>>> Paolo,
>>>>>>
>>>>>> I bisected
>>>>>> commit a0710f7995f914e3044e5899bd8ff6c43c62f916
>>>>>> Author:     Paolo Bonzini <pbonzini@redhat.com>
>>>>>> AuthorDate: Fri Feb 20 17:26:52 2015 +0100
>>>>>> Commit:     Kevin Wolf <kwolf@redhat.com>
>>>>>> CommitDate: Tue Apr 28 15:36:08 2015 +0200
>>>>>>
>>>>>>     iothread: release iothread around aio_poll
>>>>>>
>>>>>> to cause a problem with hanging guests.
>>>>>>
>>>>>> Having many guests, all with a kernel/ramdisk (via -kernel) and
>>>>>> several null block devices, will result in hangs. All hanging
>>>>>> guests are in partition detection code, waiting for an I/O to return,
>>>>>> so very early, maybe even the first I/O.
>>>>>>
>>>>>> Reverting that commit "fixes" the hangs.
>>>>>> Any ideas?
>>>>>
>>>>> Christian, I can't reproduce this on my x86 box with virtio-blk-pci. Do you
>>>>> have a reproducer for x86? Or could you collect backtraces for all the threads
>>>>> in QEMU when it hangs?
>>>>>
>>>>> My long shot is that the main loop is blocked at aio_context_acquire(ctx),
>>>>> while the iothread of that ctx is blocked at aio_poll(ctx, blocking).
>>>>
>>>> Here is a backtrace on s390. I need 2 or more disks (one is not enough).
>>>
>>> It shows that the iothreads and the main loop are all waiting for events, and the vcpu
>>> threads are running guest code.
>>>
>>> It could be the requests being leaked. Do you see this problem with a regular
>>> file-based image or the null-co driver? Maybe we're missing something about the
>>> AioContext in block/null.c.
>>
>> It seems to run fine with normal file-based images. As soon as I have two or more null-aio
>> devices it hangs pretty soon when doing a reboot loop.
>
> Ahh! If it's a reboot loop, the device reset thing may get fishy. I suspect the
> completion BH used by null-aio may be messed up; that's why I wonder whether
> null-co:// would work for you. Could you test that?

null-co also fails.

> Also, could you try the patch below with null-aio://, too?

The same. Guests still get stuck.

> Thanks,
> Fam
>
> ---
>
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index cd539aa..c87b444 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -652,15 +652,11 @@ static void virtio_blk_reset(VirtIODevice *vdev)
>  {
>      VirtIOBlock *s = VIRTIO_BLK(vdev);
>
> -    if (s->dataplane) {
> -        virtio_blk_data_plane_stop(s->dataplane);
> -    }
> -
> -    /*
> -     * This should cancel pending requests, but can't do nicely until there
> -     * are per-device request lists.
> -     */
>      blk_drain_all();
> +    if (s->dataplane) {
> +        virtio_blk_data_plane_stop(s->dataplane);
> +    }
> +
>      blk_set_enable_write_cache(s->blk, s->original_wce);
>  }
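To make Fam's "long shot" easier to picture, below is a toy model of that kind of hang. It is not QEMU code: every name in it is invented for the sketch, and it only mirrors the suspected shape of the problem, one thread blocking in a poll-style wait while still holding the lock the other thread is queued on.

/* hang_sketch.c - toy model, not QEMU code; build with: cc -pthread hang_sketch.c
 * The program hangs by design, mirroring the suspected lock/poll interaction. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER; /* stands in for the AioContext lock */
static volatile bool work_ready = false;

/* Stand-in for a blocking aio_poll(ctx, true): wait until someone queues work.
 * Crucially, the caller is still holding ctx_lock while sitting here. */
static void blocking_poll(void)
{
    while (!work_ready) {
        usleep(1000);
    }
}

/* "iothread": acquires the context, then blocks waiting for events. */
static void *iothread_fn(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ctx_lock);      /* think aio_context_acquire(ctx) */
    blocking_poll();                    /* blocks with the lock still held */
    pthread_mutex_unlock(&ctx_lock);    /* think aio_context_release(ctx) */
    return NULL;
}

int main(void)
{
    pthread_t iothread;

    pthread_create(&iothread, NULL, iothread_fn, NULL);
    usleep(100 * 1000);                 /* let the iothread grab the lock first */

    /* "main loop": needs the context before it can submit the I/O that would
     * wake the iothread up, but the iothread never lets go of it. */
    pthread_mutex_lock(&ctx_lock);      /* stalls forever, like aio_context_acquire */
    work_ready = true;                  /* never reached */
    pthread_mutex_unlock(&ctx_lock);

    pthread_join(iothread, NULL);
    printf("finished (never printed)\n");
    return 0;
}

Whether this is what actually happens here is exactly what the requested backtraces are meant to settle; the sketch only shows why "blocked at aio_context_acquire" on one side and "blocked at aio_poll" on the other reads like a lock-ordering problem.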
On 10.06.2015 at 11:34, Fam Zheng wrote:
> On Wed, 06/10 11:18, Christian Borntraeger wrote:
>> On 10.06.2015 at 04:12, Fam Zheng wrote:
>>> On Tue, 06/09 11:01, Christian Borntraeger wrote:
>>>> On 09.06.2015 at 04:28, Fam Zheng wrote:
>>>>> On Tue, 06/02 16:36, Christian Borntraeger wrote:
>>>>>> Paolo,
>>>>>>
>>>>>> I bisected
>>>>>> commit a0710f7995f914e3044e5899bd8ff6c43c62f916
>>>>>> Author:     Paolo Bonzini <pbonzini@redhat.com>
>>>>>> AuthorDate: Fri Feb 20 17:26:52 2015 +0100
>>>>>> Commit:     Kevin Wolf <kwolf@redhat.com>
>>>>>> CommitDate: Tue Apr 28 15:36:08 2015 +0200
>>>>>>
>>>>>>     iothread: release iothread around aio_poll
>>>>>>
>>>>>> to cause a problem with hanging guests.
>>>>>>
>>>>>> Having many guests, all with a kernel/ramdisk (via -kernel) and
>>>>>> several null block devices, will result in hangs. All hanging
>>>>>> guests are in partition detection code, waiting for an I/O to return,
>>>>>> so very early, maybe even the first I/O.
>>>>>>
>>>>>> Reverting that commit "fixes" the hangs.
>>>>>> Any ideas?

For what it's worth, I can no longer reproduce the issue on
current master + cherry-pick of a0710f7995f (iothread: release iothread around aio_poll).

bisect tells me that

commit 53ec73e264f481b79b52efcadc9ceb8f8996975c
Author:     Fam Zheng <famz@redhat.com>
AuthorDate: Fri May 29 18:53:14 2015 +0800
Commit:     Stefan Hajnoczi <stefanha@redhat.com>
CommitDate: Tue Jul 7 14:27:14 2015 +0100

    block: Use bdrv_drain to replace uncessary bdrv_drain_all

made the problem with blk-null go away. I still don't understand why.

Christian
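For readers who do not have the block layer paged in: the two calls that the bisected commit swaps differ mainly in scope. The sketch below is a toy model, not the actual QEMU implementation; the types, the fake_aio_poll() helper and the request counting are all invented, and the only point it tries to make is "one context" versus "every context".

/* drain_sketch.c - toy model of the scope difference, not QEMU's block layer */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct AioCtx {
    pthread_mutex_t lock;   /* stands in for aio_context_acquire/release */
    int in_flight;          /* requests that have not completed yet */
} AioCtx;

typedef struct BlockDev {
    AioCtx *ctx;            /* the context (iothread) this device lives in */
    struct BlockDev *next;
} BlockDev;

static BlockDev *all_devices;   /* global device list, like bdrv_states */

/* Stand-in for aio_poll(): "process" one pending event if there is one. */
static bool fake_aio_poll(AioCtx *ctx, bool blocking)
{
    (void)blocking;
    if (ctx->in_flight > 0) {
        ctx->in_flight--;
        return true;        /* progress was made */
    }
    return false;
}

/* Shape of bdrv_drain(bs): only this one device's context is involved. */
static void drain_one(BlockDev *dev)
{
    while (dev->ctx->in_flight > 0) {
        fake_aio_poll(dev->ctx, true);
    }
}

/* Shape of bdrv_drain_all(): every device, every context, until all are idle. */
static void drain_all(void)
{
    bool busy = true;

    while (busy) {
        busy = false;
        for (BlockDev *dev = all_devices; dev; dev = dev->next) {
            pthread_mutex_lock(&dev->ctx->lock);     /* acquire that context */
            busy |= fake_aio_poll(dev->ctx, false);
            busy |= dev->ctx->in_flight > 0;
            pthread_mutex_unlock(&dev->ctx->lock);   /* and release it again */
        }
    }
}

int main(void)
{
    AioCtx ctx1 = { PTHREAD_MUTEX_INITIALIZER, 3 };
    AioCtx ctx2 = { PTHREAD_MUTEX_INITIALIZER, 5 };
    BlockDev d2 = { &ctx2, NULL };
    BlockDev d1 = { &ctx1, &d2 };

    all_devices = &d1;
    drain_one(&d1);     /* touches only ctx1 */
    drain_all();        /* has to walk both ctx1 and ctx2 */
    return 0;
}

Since the hang needs two or more disks, and bdrv_drain_all() is the variant that has to visit every device's AioContext, the connection to the bisect result is at least plausible on its face.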
On 16/07/2015 13:03, Christian Borntraeger wrote:
> For what it's worth, I can no longer reproduce the issue on
> current master + cherry-pick of a0710f7995f (iothread: release iothread around aio_poll).
>
> bisect tells me that
>
> commit 53ec73e264f481b79b52efcadc9ceb8f8996975c
> Author:     Fam Zheng <famz@redhat.com>
> AuthorDate: Fri May 29 18:53:14 2015 +0800
> Commit:     Stefan Hajnoczi <stefanha@redhat.com>
> CommitDate: Tue Jul 7 14:27:14 2015 +0100
>
>     block: Use bdrv_drain to replace uncessary bdrv_drain_all
>
> made the problem with blk-null go away. I still don't understand why.

It could be related to the AioContext problem that I'm fixing these
days, too. Good news, we'll requeue the patch for 2.5.

Paolo
On 16.07.2015 at 13:20, Paolo Bonzini wrote:
>
> On 16/07/2015 13:03, Christian Borntraeger wrote:
>> For what it's worth, I can no longer reproduce the issue on
>> current master + cherry-pick of a0710f7995f (iothread: release iothread around aio_poll).
>>
>> bisect tells me that
>>
>> commit 53ec73e264f481b79b52efcadc9ceb8f8996975c
>> Author:     Fam Zheng <famz@redhat.com>
>> AuthorDate: Fri May 29 18:53:14 2015 +0800
>> Commit:     Stefan Hajnoczi <stefanha@redhat.com>
>> CommitDate: Tue Jul 7 14:27:14 2015 +0100
>>
>>     block: Use bdrv_drain to replace uncessary bdrv_drain_all
>>
>> made the problem with blk-null go away. I still don't understand why.
>
> It could be related to the AioContext problem that I'm fixing these
> days, too. Good news, we'll requeue the patch for 2.5.

That was also something that I had in mind (in fact I retested this to
check the ctx patch). master + cherry-pick of a0710f7995f + revert of
53ec73e26 + this fix still fails, so it was (is?) a different issue.

The interesting part is that this problem required 2 or more disks (and
we replace drain_all with single drains), so it somewhat sounds plausible.

Christian
On 16/07/2015 13:24, Christian Borntraeger wrote:
> The interesting part is that this problem required 2 or more disks (and
> we replace drain_all with single drains), so it somewhat sounds plausible.

Yes, indeed. It is very plausible. I wanted to reproduce it these days, so
thanks for saving me a lot of time! I'll test your exact setup (master +
AioContext fix + cherry-pick of a0710f7995f + revert of 53ec73e26).

Thanks,

Paolo
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index cd539aa..c87b444 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -652,15 +652,11 @@ static void virtio_blk_reset(VirtIODevice *vdev)
 {
     VirtIOBlock *s = VIRTIO_BLK(vdev);

-    if (s->dataplane) {
-        virtio_blk_data_plane_stop(s->dataplane);
-    }
-
-    /*
-     * This should cancel pending requests, but can't do nicely until there
-     * are per-device request lists.
-     */
     blk_drain_all();
+    if (s->dataplane) {
+        virtio_blk_data_plane_stop(s->dataplane);
+    }
+
     blk_set_enable_write_cache(s->blk, s->original_wce);
 }
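For context, the completion path Fam is worried about follows QEMU's usual bottom-half pattern. The snippet below is a simplified sketch of that pattern, not a verbatim copy of block/null.c; null_submit() and the struct layout are invented for the illustration, and as QEMU-internal code it only builds inside the QEMU tree.

#include "block/block_int.h"   /* QEMU-internal header: BlockDriverState, BlockAIOCB, QEMUBH */

typedef struct NullAIOCB {
    BlockAIOCB common;
    QEMUBH *bh;
} NullAIOCB;

/* Runs as a bottom half in the AioContext the request was scheduled on. */
static void null_bh_cb(void *opaque)
{
    NullAIOCB *acb = opaque;

    acb->common.cb(acb->common.opaque, 0);   /* report completion back to the device emulation */
    qemu_bh_delete(acb->bh);
    qemu_aio_unref(acb);
}

/* Invented helper name; the point is where the completion BH is anchored. */
static void null_submit(BlockDriverState *bs, NullAIOCB *acb)
{
    /* The BH belongs to whatever AioContext the BDS is in right now.  If a
     * reset tears down or switches that context before the BH has run, the
     * completion never reaches the guest. */
    acb->bh = aio_bh_new(bdrv_get_aio_context(bs), null_bh_cb, acb);
    qemu_bh_schedule(acb->bh);
}

That is the kind of breakage the test patch above probes for: by draining before virtio_blk_data_plane_stop(), any pending completion BHs still get a chance to run in the iothread's context instead of being left behind by the stop.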