[1/2] nbd: Drop connection if broken server is detected

Message ID d8754399-cbe6-325c-cb4e-07e47fd08cf0@redhat.com
State New

Commit Message

Eric Blake Aug. 11, 2017, 10:12 p.m.
On 08/11/2017 03:01 PM, Eric Blake wrote:
> On 08/11/2017 02:41 PM, Eric Blake wrote:
>>> Hmm, was it correct even before your patch? Is it safe to enter a coroutine
>>> (which we've scheduled by nbd_recv_coroutines_enter_all()), which is actually
>>> yielded inside nbd_rwv (not our yield in nbd_co_receive_reply)?
>>
>> I'm honestly not sure how to answer the question. In my testing, I was
>> unable to catch a coroutine yielding inside of nbd_rwv();
> 
> Single stepping through nbd_rwv(), I see that I/O is performed by
> sendmsg(), which either gets the message sent or, because of nonblocking
> mode, fails with EAGAIN, which gets turned into QIO_CHANNEL_ERR_BLOCK
> and indeed a call to qio_channel_yield() within nbd_rwv() - but it's
> timing sensitive, so I still haven't been able to provoke this scenario
> using gdb.
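
The EAGAIN path described above can be reproduced outside QEMU. What follows is only a minimal sketch under stated assumptions: it uses plain send() on an AF_UNIX socketpair rather than sendmsg() through QIOChannel, and DEMO_ERR_BLOCK, demo_send_once() and demo_sends_until_block() are hypothetical names standing in for the QIO_CHANNEL_ERR_BLOCK translation, not QEMU code. Nobody reads from the peer end, so the kernel buffer fills and the nonblocking send eventually reports that it would block, which is the point where nbd_rwv() would yield.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for QIO_CHANNEL_ERR_BLOCK (illustration only). */
#define DEMO_ERR_BLOCK (-2)

/* One send attempt on a nonblocking socket; EAGAIN becomes DEMO_ERR_BLOCK. */
static ssize_t demo_send_once(int fd, const void *buf, size_t len)
{
    ssize_t ret;

    do {
        ret = send(fd, buf, len, 0);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return DEMO_ERR_BLOCK;
    }
    return ret;
}

/*
 * Keep sending on a socketpair whose peer never reads.  Returns the number
 * of successful sends before the "would block" condition, or -1 if anything
 * else went wrong.
 */
static long demo_sends_until_block(void)
{
    int sv[2];
    int flags;
    char chunk[4096];
    long sends = 0;
    ssize_t ret;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    flags = fcntl(sv[0], F_GETFL);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    memset(chunk, 0, sizeof(chunk));
    while ((ret = demo_send_once(sv[0], chunk, sizeof(chunk))) > 0) {
        sends++;
    }

    close(sv[0]);
    close(sv[1]);
    return ret == DEMO_ERR_BLOCK ? sends : -1;
}
```

Calling demo_sends_until_block() from a test program returns a positive count on a typical Linux system; the exact count depends on the kernel's socket buffer size, which is why the real scenario is so timing sensitive.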

With this compiled into the client:


and that got further, only to crash at:

(gdb) c
Continuing.
readv failed: Input/output error
aio_write failed: Input/output error
qemu-io: block/block-backend.c:1211: blk_aio_write_entry: Assertion `!rwco->qiov || rwco->qiov->size == acb->bytes' failed.

where rwco->qiov->size was garbage.

At this point, while I still think it might be possible to come up with
a less-invasive solution than your v2 patch, I'm also at the point
where I want the bug fixed rather than me wallowing around trying to
debug coroutine interactions; and thus I'm leaning towards your v2 patch
as being more likely to be robust in the face of concurrency (it's not
killing ioc while other coroutines still exist, so much as just making
sure that every yield point checks if the kill switch has been triggered
for a short-circuit exit).  So I will probably be taking your version
and creating a pull request for -rc3 on Monday.  (Before I fully ack
your version, though, I _will_ be hammering at it under gdb the same way
I hammered at mine.)
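
The "kill switch checked at every yield point" idea can be sketched without real coroutines. This is only an illustration under stated assumptions: struct demo_session, its quit flag and demo_run_request() are hypothetical names, and each loop iteration merely stands in for code that resumes after a yield. It is not the v2 patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-connection state; 'quit' plays the role of the kill switch. */
struct demo_session {
    bool quit;
};

/*
 * Run a request made of 'steps' stages.  Each stage boundary stands in for
 * a yield point.  'kill_at' is the stage at which some other party flips the
 * kill switch while this request is parked (negative means never).
 * '*completed' reports how many stages ran before the exit.
 */
static int demo_run_request(struct demo_session *s, int steps, int kill_at,
                            int *completed)
{
    int i;

    *completed = 0;
    for (i = 0; i < steps; i++) {
        if (i == kill_at) {
            /* Another coroutine noticed a broken server meanwhile. */
            s->quit = true;
        }
        /* Resumed after a yield: check the switch before touching state. */
        if (s->quit) {
            return -EIO;
        }
        (*completed)++;
    }
    return 0;
}
```

The design point is that nothing is torn down underneath a parked request; each request notices the flag on its own and exits early.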

Patch

diff --git i/nbd/common.c w/nbd/common.c
index a2f28f2eec..f10e991eed 100644
--- i/nbd/common.c
+++ w/nbd/common.c
@@ -36,6 +36,10 @@  ssize_t nbd_rwv(QIOChannel *ioc, struct iovec *iov, size_t niov, size_t length,

     while (nlocal_iov > 0) {
         ssize_t len;
+        static int hack;
+        if (hack) {
+            len = QIO_CHANNEL_ERR_BLOCK;
+        } else
         if (do_read) {
             len = qio_channel_readv(ioc, local_iov, nlocal_iov, errp);
         } else {

I was able to set a breakpoint in gdb to temporarily manipulate 'hack'
at the right moment, in order to trigger what would happen if a
nbd_rwv() hit EAGAIN.  And sadly, I got a segfault using my patches,
because the last reference to ioc had been cleared in the meantime.


Program received signal SIGSEGV, Segmentation fault.
0x000055555562ee94 in object_class_dynamic_cast_assert (class=0x555555d9d1b0,
    typename=0x5555556c856d "qio-channel", file=0x5555556c8560 "io/channel.c",
    line=75, func=0x5555556c8670 <__func__.21506> "qio_channel_writev_full")
    at qom/object.c:705
705	    trace_object_class_dynamic_cast_assert(class ? class->type->name : "(null)",


#0  0x000055555562ee94 in object_class_dynamic_cast_assert (
    class=0x555555d9d1b0, typename=0x5555556c856d "qio-channel",
    file=0x5555556c8560 "io/channel.c", line=75,
    func=0x5555556c8670 <__func__.21506> "qio_channel_writev_full")
    at qom/object.c:705
#1  0x000055555562312d in qio_channel_writev_full (ioc=0x555555d9bde0,
    iov=0x555555d9ec90, niov=1, fds=0x0, nfds=0, errp=0x0)
    at io/channel.c:75
#2  0x0000555555623233 in qio_channel_writev (ioc=0x555555d9bde0,
    iov=0x555555d9ec90, niov=1, errp=0x0) at io/channel.c:102
#3  0x0000555555603590 in nbd_rwv (ioc=0x555555d9bde0, iov=0x555555d9ecd0,
    niov=1, length=1048576, do_read=false, errp=0x0) at nbd/common.c:46
#4  0x00005555555ebcca in nbd_co_send_request (bs=0x555555d94260,
    request=0x7fffda2819f0, qiov=0x555555d9ee88) at block/nbd-client.c:152


My next test is whether incrementing the ref-count to s->ioc for the
duration of nbd_co_send_request() is adequate to protect against this
problem:

diff --git i/block/nbd-client.c w/block/nbd-client.c
index 28b10f3fa2..cb0c4ebedf 100644
--- i/block/nbd-client.c
+++ w/block/nbd-client.c
@@ -147,6 +147,11 @@  static int nbd_co_send_request(BlockDriverState *bs,
         return -EPIPE;
     }

+    /*
+     * Make sure ioc stays live, even if another coroutine decides to
+     * kill the connection because of a server error.
+     */
+    object_ref(OBJECT(ioc));
     if (qiov) {
         qio_channel_set_cork(ioc, true);
         rc = nbd_send_request(ioc, request);
@@ -161,6 +166,7 @@  static int nbd_co_send_request(BlockDriverState *bs,
     } else {
         rc = nbd_send_request(ioc, request);
     }
+    object_unref(OBJECT(ioc));
     qemu_co_mutex_unlock(&s->send_mutex);
     return rc;
 }
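
The reasoning behind the temporary reference can be shown with a toy reference-counted object. This is a hedged sketch only: struct demo_chan and the demo_chan_*() helpers are hypothetical stand-ins for QOM's object_ref()/object_unref(), and the "other coroutine" is simulated by a plain unref in the middle of the function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy reference-counted channel; '*freed' records when it is destroyed. */
struct demo_chan {
    unsigned refcnt;
    bool *freed;
};

static struct demo_chan *demo_chan_new(bool *freed)
{
    struct demo_chan *c = malloc(sizeof(*c));

    assert(c != NULL);
    c->refcnt = 1;          /* reference held by the owner (think s->ioc) */
    c->freed = freed;
    return c;
}

static void demo_chan_ref(struct demo_chan *c)
{
    c->refcnt++;
}

static void demo_chan_unref(struct demo_chan *c)
{
    assert(c->refcnt > 0);
    if (--c->refcnt == 0) {
        *c->freed = true;
        free(c);
    }
}

/*
 * Simulate a sender that is parked mid-send while the owner drops its
 * reference.  Returns true if the channel was still alive at that point.
 */
static bool demo_send_survives_teardown(bool take_ref)
{
    bool freed = false;
    struct demo_chan *c = demo_chan_new(&freed);
    bool alive_during_send;

    if (take_ref) {
        demo_chan_ref(c);   /* sender pins the channel for the send */
    }

    demo_chan_unref(c);     /* owner tears the connection down meanwhile */
    alive_during_send = !freed;

    if (take_ref) {
        demo_chan_unref(c); /* sender is done; this frees the channel */
    }
    return alive_during_send;
}
```

With take_ref set, the channel outlives the teardown and is freed only when the sender drops its own reference. Without it, the channel is already gone by the time the sender would resume, which matches the use-after-free in the backtrace above. Pinning the object keeps the memory valid but does not by itself make the rest of the request state consistent.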