Patchwork coroutine bug?, was Re: [PATCH] sheepdog: use coroutines

Submitter Stefan Hajnoczi
Date Jan. 3, 2012, 8:13 a.m.
Message ID <20120103081351.GA28636@stefanha-thinkpad.localdomain>
Permalink /patch/133989/
State New

Comments

Stefan Hajnoczi - Jan. 3, 2012, 8:13 a.m.
On Mon, Jan 02, 2012 at 10:38:11PM +0000, Stefan Hajnoczi wrote:
> On Mon, Jan 2, 2012 at 3:39 PM, Christoph Hellwig <hch@lst.de> wrote:
> > On Fri, Dec 30, 2011 at 10:35:01AM +0000, Stefan Hajnoczi wrote:
> >> If you can reproduce this bug and suspect coroutines are involved then I
> >
> > It's entirely reproducible.
> >
> > I've played around a bit and switched from the ucontext to the gthreads
> > coroutine implementation.  The result seems odd, but starts to make sense.
> >
> > Running the workload I now get the following message from qemu:
> >
> > Co-routine re-entered recursively
> >
> > and the gdb backtrace looks like:
> >
> > (gdb) bt
> > #0  0x00007f2fff36f405 in *__GI_raise (sig=<optimized out>)
> >    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> > #1  0x00007f2fff372680 in *__GI_abort () at abort.c:92
> > #2  0x00007f30019a6616 in qemu_coroutine_enter (co=0x7f3004d4d7b0, opaque=0x0)
> >    at qemu-coroutine.c:53
> > #3  0x00007f30019a5e82 in qemu_co_queue_next_bh (opaque=<optimized out>)
> >    at qemu-coroutine-lock.c:43
> > #4  0x00007f30018d5a72 in qemu_bh_poll () at async.c:71
> > #5  0x00007f3001982990 in main_loop_wait (nonblocking=<optimized out>)
> >    at main-loop.c:472
> > #6  0x00007f30018cf714 in main_loop () at /home/hch/work/qemu/vl.c:1481
> > #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
> >    at /home/hch/work/qemu/vl.c:3479
> >
> > Adding some printks suggests this happens when calling add_aio_request from
> > aio_read_response when either delaying creates or updating metadata,
> > although not every time one of these cases happens.
> >
> > I've tried to understand how the recursive calling happens, but unfortunately
> > the whole coroutine code lacks any documentation of how it should
> > behave or what it asserts about its callers.
> >
> >> I don't have a sheepdog setup here but if there's an easy way to
> >> reproduce please let me know and I'll take a look.
> >
> > With the small patch below applied to the sheepdog source I can reproduce
> > the issue on my laptop using the following setup:
> >
> > for port in 7000 7001 7002; do
> >    mkdir -p /mnt/sheepdog/$port
> >    /usr/sbin/sheep -p $port -c local /mnt/sheepdog/$port
> >    sleep 2
> > done
> >
> > collie cluster format
> > collie vdi create test 20G
> >
> > then start a qemu instance that uses the sheepdog volume using the
> > following device and drive lines:
> >
> >        -drive if=none,file=sheepdog:test,cache=none,id=test \
> >        -device virtio-blk-pci,drive=test,id=testdev \
> >
> > finally, in the guest run:
> >
> >        dd if=/dev/zero of=/dev/vdX bs=67108864 count=128 oflag=direct
> 
> Thanks for these instructions.  I can reproduce the issue here.
> 
> It seems suspicious the way that BDRVSheepdogState->co_recv and
> ->co_send work.  The code adds select(2) read/write callback functions
> on the sheepdog socket file descriptor.  When the socket becomes
> writeable or readable the co_send or co_recv coroutines are entered.
> So far, so good, this is how a coroutine is integrated into the main
> loop of QEMU.
> 
> The problem is that this patch is mixing things.  The co_recv
> coroutine runs aio_read_response(), which invokes send_pending_req().
> send_pending_req() invokes add_aio_request().  That function isn't
> suitable for co_recv's context because it actually sends data and hits
> a few blocking (yield) points.  It takes a coroutine mutex - but the
> select(2) read callback is still in place.  We're now still in the
> aio_read_response() call chain except we're actually not reading at
> all, we're trying to write!  And we'll get spurious wakeups if there
> is any data readable on the socket.
> 
> So the co_recv coroutine has two things in the system that will try to enter it:
> 1. The select(2) read callback on the sheepdog socket.
> 2. The add_aio_request() blocking operations, including a coroutine mutex.
> 
> This is bad: a yielded coroutine should have only one thing that will
> enter it.  It's rare that it makes sense to have multiple things
> entering a coroutine.
> 
> It's late here but I hope this gives Kazutaka some thoughts on what is
> causing the issue with this patch.

Poked around some more this morning.  Here is a patch that isolates the
bug:

---8<---
---8<---

When you run this with ucontext or GThread coroutines you hit this
assertion failure:

qemu-system-x86_64: qemu-coroutine-lock.c:66: qemu_co_queue_wait: Assertion `co != self' failed.

Here is an explanation for what the asserts are checking and how we
violate the constraint:

qemu_co_queue_next() and qemu_co_queue_next_bh() take a waiting
Coroutine off the wait queue and put it onto the unlock bh queue.  When
the BH executes the coroutines from the unlock bh queue are popped off
that queue and entered.  This is how coroutine wait queues work, and
coroutine mutexes are built on coroutine wait queues.

The problem is that due to the spurious wakeups introduced in the
sheepdog patch we are acquiring the mutex earlier than normal with a BH
unlock still pending.  Our coroutine can actually terminate by returning
from sheepdog.c:aio_read_response() while the BH is still pending.  When
we get around to executing the BH we're going to operate on a freed
coroutine - BOOM!

The reason we get different failure behavior with ucontext and
GThread is that they have different implementations and reuse
coroutines differently.  We've still done something illegal but the
undefined behavior happens to be different - both ucontext and GThread
are working correctly, the problem is in the sheepdog patch.

I suggest implementing coroutine socket I/O functions for both send
*and* receive.  Then sheepdog request processing becomes simple - we'll
have less callback and coroutine trickery because it's possible to write
synchronous coroutine code.

Right now the code is a mix of callbacks and some coroutine code (which
has the flaw I explained above).  Things will be much easier if the code
is converted fully to coroutines.

Stefan

Patch

diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index 26ad76b..0d7a03f 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -59,6 +59,16 @@  void coroutine_fn qemu_co_queue_wait(CoQueue *queue)
     QTAILQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
     qemu_coroutine_yield();
     assert(qemu_in_coroutine());
+    {
+        Coroutine *co;
+
+        QTAILQ_FOREACH(co, &queue->entries, co_queue_next) {
+            assert(co != self);
+        }
+        QTAILQ_FOREACH(co, &unlock_bh_queue, co_queue_next) {
+            assert(co != self);
+        }
+    }
 }
 
 void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue *queue)