From patchwork Fri Jan 6 11:16:16 2012
X-Patchwork-Submitter: MORITA Kazutaka
X-Patchwork-Id: 134625
Date: Fri, 06 Jan 2012 20:16:16 +0900
From: MORITA Kazutaka
To: Stefan Hajnoczi
Cc: kwolf@redhat.com, sheepdog@lists.wpkg.org, Christoph Hellwig,
 MORITA Kazutaka, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] coroutine bug?, was Re: [PATCH] sheepdog: use coroutines
In-Reply-To: <20120103081351.GA28636@stefanha-thinkpad.localdomain>
References: <1313152395-25248-1-git-send-email-morita.kazutaka@lab.ntt.co.jp>
 <20111223133850.GA12770@lst.de>
 <20111229120626.GA32331@lst.de>
 <20111230103500.GA1740@stefanha-thinkpad.localdomain>
 <20120102153959.GA22823@lst.de>
 <20120103081351.GA28636@stefanha-thinkpad.localdomain>

At Tue, 3 Jan 2012 08:13:51 +0000,
Stefan Hajnoczi wrote:
>
> On Mon, Jan 02, 2012 at 10:38:11PM +0000, Stefan Hajnoczi wrote:
> > On Mon, Jan 2, 2012 at 3:39 PM, Christoph Hellwig wrote:
> > > On Fri, Dec 30, 2011 at 10:35:01AM +0000, Stefan Hajnoczi wrote:
> > >> If you can reproduce this bug and suspect coroutines are involved then I
> > >
> > > It's entirely reproducible.
> > >
> > > I've played around a bit and switched from the ucontext to the gthreads
> > > coroutine implementation.  The result seems odd, but starts to make sense.
> > >
> > > Running the workload I now get the following message from qemu:
> > >
> > > Co-routine re-entered recursively
> > >
> > > and the gdb backtrace looks like:
> > >
> > > (gdb) bt
> > > #0  0x00007f2fff36f405 in *__GI_raise (sig=<optimized out>)
> > >     at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> > > #1  0x00007f2fff372680 in *__GI_abort () at abort.c:92
> > > #2  0x00007f30019a6616 in qemu_coroutine_enter (co=0x7f3004d4d7b0, opaque=0x0)
> > >     at qemu-coroutine.c:53
> > > #3  0x00007f30019a5e82 in qemu_co_queue_next_bh (opaque=<optimized out>)
> > >     at qemu-coroutine-lock.c:43
> > > #4  0x00007f30018d5a72 in qemu_bh_poll () at async.c:71
> > > #5  0x00007f3001982990 in main_loop_wait (nonblocking=<optimized out>)
> > >     at main-loop.c:472
> > > #6  0x00007f30018cf714 in main_loop () at /home/hch/work/qemu/vl.c:1481
> > > #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
> > >     at /home/hch/work/qemu/vl.c:3479
> > >
> > > Adding some printks suggests this happens when calling add_aio_request from
> > > aio_read_response when either delaying creates or updating metadata,
> > > although not every time one of these cases happens.
> > >
> > > I've tried to understand how the recursive calling happens, but unfortunately
> > > the whole coroutine code lacks any sort of documentation on how it should
> > > behave or what it assumes about its callers.
> > >
> > >> I don't have a sheepdog setup here but if there's an easy way to
> > >> reproduce please let me know and I'll take a look.
> > >
> > > With the small patch below applied to the sheepdog source I can reproduce
> > > the issue on my laptop using the following setup:
> > >
> > > for port in 7000 7001 7002; do
> > >     mkdir -p /mnt/sheepdog/$port
> > >     /usr/sbin/sheep -p $port -c local /mnt/sheepdog/$port
> > >     sleep 2
> > > done
> > >
> > > collie cluster format
> > > collie vdi create test 20G
> > >
> > > then start a qemu instance that uses the sheepdog volume with the
> > > following drive and device options:
> > >
> > >         -drive if=none,file=sheepdog:test,cache=none,id=test \
> > >         -device virtio-blk-pci,drive=test,id=testdev \
> > >
> > > finally, in the guest run:
> > >
> > >         dd if=/dev/zero of=/dev/vdX bs=67108864 count=128 oflag=direct
> >
> > Thanks for these instructions.  I can reproduce the issue here.
> >
> > The way BDRVSheepdogState->co_recv and ->co_send are used seems
> > suspicious.  The code adds select(2) read/write callback functions on
> > the sheepdog socket file descriptor.  When the socket becomes writable
> > or readable, the co_send or co_recv coroutine is entered.  So far, so
> > good: this is how a coroutine is integrated into the main loop of QEMU.
> >
> > The problem is that this patch mixes things up.  The co_recv coroutine
> > runs aio_read_response(), which invokes send_pending_req().
> > send_pending_req() invokes add_aio_request().  That function isn't
> > suitable for co_recv's context because it actually sends data and hits
> > a few blocking (yield) points.  It takes a coroutine mutex - but the
> > select(2) read callback is still in place.  We're still in the
> > aio_read_response() call chain, except we're not actually reading at
> > all, we're trying to write!  And we'll get spurious wakeups whenever
> > there is data readable on the socket.
> >
> > So the co_recv coroutine has two things in the system that will try to
> > enter it:
> >
> > 1. The select(2) read callback on the sheepdog socket.
> > 2. The add_aio_request() blocking operations, including a coroutine
> >    mutex.
> >
> > This is bad: a yielded coroutine should only have one thing that will
> > enter it.  It's rare that it makes sense to have multiple things
> > entering a coroutine.
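To restate the rule above with a minimal sketch (illustrative only, not
actual sheepdog.c code; recv_co_entry(), restart_co_recv() and
process_response() are made-up names): while a coroutine is yielded,
exactly one place may re-enter it.

static void coroutine_fn recv_co_entry(void *opaque)
{
    BDRVSheepdogState *s = opaque;

    for (;;) {
        /* Yield until the read callback below re-enters us. */
        qemu_coroutine_yield();

        /* Whatever runs here must not create a second wake-up source
         * (no CoMutex, no blocking send); otherwise two things can
         * enter this coroutine and it aborts with "Co-routine
         * re-entered recursively". */
        process_response(s);
    }
}

/* select(2) read callback: the single legal wake-up source. */
static void restart_co_recv(void *opaque)
{
    BDRVSheepdogState *s = opaque;

    qemu_coroutine_enter(s->co_recv, NULL);
}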
> >
> > It's late here but I hope this gives Kazutaka some thoughts on what is
> > causing the issue with this patch.
>
> Poked around some more this morning.  Here is a patch that isolates the
> bug:
>
> ---8<---
> diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
> index 26ad76b..0d7a03f 100644
> --- a/qemu-coroutine-lock.c
> +++ b/qemu-coroutine-lock.c
> @@ -59,6 +59,16 @@ void coroutine_fn qemu_co_queue_wait(CoQueue *queue)
>      QTAILQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
>      qemu_coroutine_yield();
>      assert(qemu_in_coroutine());
> +    {
> +        Coroutine *co;
> +
> +        QTAILQ_FOREACH(co, &queue->entries, co_queue_next) {
> +            assert(co != self);
> +        }
> +        QTAILQ_FOREACH(co, &unlock_bh_queue, co_queue_next) {
> +            assert(co != self);
> +        }
> +    }
>  }
>
>  void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue *queue)
> ---8<---
>
> When you run this with ucontext or GThread coroutines you hit this
> assertion failure:
>
> qemu-system-x86_64: qemu-coroutine-lock.c:66: qemu_co_queue_wait: Assertion `co != self' failed.
>
> Here is an explanation of what the asserts are checking and how we
> violate the constraint:
>
> qemu_co_queue_next() and qemu_co_queue_next_bh() take a waiting
> Coroutine off the wait queue and put it onto the unlock BH queue.  When
> the BH executes, the coroutines on the unlock BH queue are popped off
> that queue and entered.  This is how coroutine wait queues work, and
> coroutine mutexes are built on coroutine wait queues.
>
> The problem is that, due to the spurious wakeups introduced by the
> sheepdog patch, we acquire the mutex earlier than normal, with a BH
> unlock still pending.  Our coroutine can actually terminate by returning
> from sheepdog.c:aio_read_response() while the BH is still pending.  When
> we get around to executing the BH we operate on a freed coroutine -
> BOOM!

Thanks for your detailed explanation.  I confirmed that the following
hack fixes the problem:

---8<---
diff --git a/block/sheepdog.c b/block/sheepdog.c
index 17a79be..b27c770 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -626,6 +626,9 @@ static void coroutine_fn aio_read_response(void *opaque)
 
     switch (acb->aiocb_type) {
     case AIOCB_WRITE_UDATA:
+        /* this coroutine context is no longer suitable for co_recv
+         * because we may send data to update vdi objects */
+        s->co_recv = NULL;
         if (!is_data_obj(aio_req->oid)) {
             break;
         }
---8<---

>
> The reason why we get different failure behavior with ucontext and
> GThread is that they have different implementations and reuse
> coroutines differently.  We've still done something illegal, but the
> undefined behavior just happens to differ: both ucontext and GThread
> are working correctly; the problem is in the sheepdog patch.
>
> I suggest implementing coroutine socket I/O functions for both send
> *and* receive.  Then sheepdog request processing becomes simple: we'll
> have fewer callbacks and less coroutine trickery, because it becomes
> possible to write synchronous coroutine code.
>
> Right now the code is a mix of callbacks and some coroutine code (which
> has the flaw explained above).  Things will be much easier once the
> code is converted fully to coroutines.

Yes, the original patch aimed to make Sheepdog I/O asynchronous with a
small change, which is why the current sheepdog code is not clean.  This
looks like a good time to make the sheepdog driver fully coroutine-based;
a rough sketch of the direction follows.
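Untested sketch, not a patch: it assumes coroutine socket helpers
(spelled qemu_co_send()/qemu_co_recv() here) that transfer the whole
buffer, yielding instead of blocking, and that return the byte count or
a negative errno.  do_co_request() is a made-up name, the request and
response types stand in for the driver's existing wire structures, and
error handling is abbreviated.

static int coroutine_fn do_co_request(BDRVSheepdogState *s,
                                      SheepdogReq *req,
                                      void *data, unsigned int datalen)
{
    SheepdogRsp rsp;

    /* Send the request header and payload; the helper yields while
     * the socket buffer is full instead of registering a write
     * callback. */
    if (qemu_co_send(s->fd, req, sizeof(*req)) != sizeof(*req)) {
        return -EIO;
    }
    if (datalen && qemu_co_send(s->fd, data, datalen) != datalen) {
        return -EIO;
    }

    /* Receive the response the same way: no read callback is armed
     * while we sit here, so this coroutine keeps exactly one wake-up
     * source and the spurious re-entry problem cannot occur. */
    if (qemu_co_recv(s->fd, &rsp, sizeof(rsp)) != sizeof(rsp)) {
        return -EIO;
    }

    return rsp.result == SD_RES_SUCCESS ? 0 : -EIO;
}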
Thanks,

Kazutaka