[RFC] nbd: decouple reconnect from drain

Message ID 20210310093232.519585-1-rvkagan@yandex-team.ru
State New
Series [RFC] nbd: decouple reconnect from drain

Commit Message

Roman Kagan March 10, 2021, 9:32 a.m. UTC
NBD connect coroutine takes an extra in_flight reference as if it's a
request handler.  This prevents drain from completing until the
connection coroutine releases the reference.

When NBD is configured to reconnect, however, this appears to be fatal
to the reconnection idea: the drain procedure wants the reconnection to
be suspended, but this is only possible if the in-flight requests are
canceled.

Fix this by making the connection coroutine stop messing with the
in-flight counter.  Instead, certain care is taken to properly move the
reconnection stuff from one aio_context to another in
.bdrv_{attach,detach}_aio_context callbacks.

Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
---
This patch passes the regular make check but fails some extra iotests,
in particular 277.  It obviously lacks more robust interaction with the
connection thread (which in general is fairly complex and hard to reason
about), and perhaps has some other drawbacks, so I'll work on this
further, but I'd appreciate some feedback on whether the idea is sound.
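
In condensed form, the aio-context handoff the patch introduces looks roughly
like this (full hunks below; the connection_co handoff and error handling are
omitted):

    static void nbd_client_detach_aio_context(BlockDriverState *bs)
    {
        BDRVNBDState *s = (BDRVNBDState *)bs->opaque;

        /* remember when the reconnect_delay_timer would have fired ... */
        s->reconnect_expire_time_ns = s->reconnect_delay_timer ?
            timer_expire_time_ns(s->reconnect_delay_timer) : -1;
        reconnect_delay_timer_del(s);
    }

    static void nbd_client_attach_aio_context_bh(void *opaque)
    {
        BlockDriverState *bs = opaque;
        BDRVNBDState *s = (BDRVNBDState *)bs->opaque;

        /* ... and re-arm it in the newly attached aio context */
        if (nbd_client_connecting_wait(s) && s->reconnect_expire_time_ns >= 0) {
            reconnect_delay_timer_init(s, s->reconnect_expire_time_ns);
        }
    }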

 block/nbd.c  | 134 ++++++++++++++++-----------------------------------
 nbd/client.c |   2 -
 2 files changed, 41 insertions(+), 95 deletions(-)

Comments

Vladimir Sementsov-Ogievskiy March 12, 2021, 12:35 p.m. UTC | #1
10.03.2021 12:32, Roman Kagan wrote:
> NBD connect coroutine takes an extra in_flight reference as if it's a
> request handler.  This prevents drain from completing until the
> connection coroutine releases the reference.
> 
> When NBD is configured to reconnect, however, this appears to be fatal
> to the reconnection idea: the drain procedure wants the reconnection to
> be suspended, but this is only possible if the in-flight requests are
> canceled.

As I remember from our conversation, the problem is not that we don't reconnect during the drained section, but exactly that we cancel requests on drained_begin, starting from commit 8c517de24a8a1dcbeb.

This is not a problem in scenarios where reconnect is a rare case and a failed request is acceptable. But if we have a bad connection and requests often have to wait for a reconnect (so it may be considered a kind of "latency"), then cancelling the waiting requests on any drain() really kills the reconnect feature.
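
For reference, this is the part of nbd_client_co_drain_begin() (removed by the
patch, see the hunk below) that makes the waiting requests fail:

    if (qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT) {
        /* demoting to NOWAIT makes the requests queued on free_sema stop
         * waiting for the reconnect and fail instead */
        s->state = NBD_CLIENT_CONNECTING_NOWAIT;
        qemu_co_queue_restart_all(&s->free_sema);
    }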

> 
> Fix this by making the connection coroutine stop messing with the
> in-flight counter.  Instead, certain care is taken to properly move the
> reconnection stuff from one aio_context to another in
> .bdrv_{attach,detach}_aio_context callbacks.
> 
> Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
> ---
> This patch passes the regular make check but fails some extra iotests,
> in particular 277.  It obviously lacks more robust interaction with the
> connection thread (which in general is fairly complex and hard to reason
> about), and perhaps has some other drawbacks, so I'll work on this
> further, but I'd appreciate some feedback on whether the idea is sound.
> 

In general I like the idea. The logic around drain in nbd is overcomplicated. And I never liked the fact that nbd_read_eof() has that dec-inc in_flight section. Some notes:

1. I hope the patch can be divided into several ones, as there are several things being done:

- removing use of in_flight counter introduced by 5ad81b4946
- do reconnect during drained section
- stop cancelling requests on .drained_begin

2. 5ad81b4946 was needed to make nbd_client_attach_aio_context() reenter connection_co only in one (or two) possible places, not on any yield.. And I don't see how it is achieved now.. This should be described in commit msg..

3. About cancelling requests on drained_begin. The behavior was introduced by 8c517de24a8a1dcbeb, to fix a deadlock. So, if now the deadlock is fixed another way, let's change the logic (don't cancel requests) in a separate patch, and note 8c517de24a8a1dcbeb commit and the commit that fixes deadlock the other way in the commit message.
Roman Kagan March 15, 2021, 5:36 a.m. UTC | #2
On Fri, Mar 12, 2021 at 03:35:25PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 10.03.2021 12:32, Roman Kagan wrote:
> > NBD connect coroutine takes an extra in_flight reference as if it's a
> > request handler.  This prevents drain from completing until the
> > connection coroutine releases the reference.
> > 
> > When NBD is configured to reconnect, however, this appears to be fatal
> > to the reconnection idea: the drain procedure wants the reconnection to
> > be suspended, but this is only possible if the in-flight requests are
> > canceled.
> 
> As I remember from our conversation, the problem is not that we don't
> reconnect during the drained section, but exactly that we cancel
> requests on drained_begin, starting from commit 8c517de24a8a1dcbeb.

Well, I'd put it the other way around: the problem is that we don't
reconnect during the drained section, so the requests can't be
successfully completed if the connection breaks: they can either stall
forever (before 8c517de24a8a1dcbeb) or be aborted (after
8c517de24a8a1dcbeb).

> This is not a problem in scenarios where reconnect is a rare case and
> a failed request is acceptable. But if we have a bad connection and
> requests often have to wait for a reconnect (so it may be considered a
> kind of "latency"), then cancelling the waiting requests on any drain()
> really kills the reconnect feature.

In our experience the primary purpose of reconnect is not to withstand
poor network links, but to allow restarting the NBD server without
disrupting the guest operation.  So failing the requests during the
server maintenance window is never acceptable, no matter how rare it is
(and in our practice it isn't rare).

> > Fix this by making the connection coroutine stop messing with the
> > in-flight counter.  Instead, certain care is taken to properly move the
> > reconnection stuff from one aio_context to another in
> > .bdrv_{attach,detach}_aio_context callbacks.
> > 
> > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> > Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
> > ---
> > This patch passes the regular make check but fails some extra iotests,
> > in particular 277.  It obviously lacks more robust interaction with the
> > connection thread (which in general is fairly complex and hard to reason
> > about), and perhaps has some other drawbacks, so I'll work on this
> > further, but I'd appreciate some feedback on whether the idea is sound.
> > 
> 
> In general I like the idea. The logic around drain in nbd is
> overcomplicated. And I never liked the fact that nbd_read_eof() has
> that dec-inc in_flight section. Some notes:
> 
> 1. I hope the patch can be divided into several ones, as there are
> several things being done:
> 
> - removing use of in_flight counter introduced by 5ad81b4946
> - do reconnect during drained section
> - stop cancelling requests on .drained_begin

I've split the (somewhat extended) patch into a series of 7, but not
exactly as you suggested.  In particular, I can't just stop aborting the
requests on .drained_begin as this would reintroduce the deadlock, so I
just remove the whole .drained_begin/end in a single patch.

> 2. 5ad81b4946 was needed to make nbd_client_attach_aio_context()
> reenter connection_co only in one (or two) possible places, not on any
> yield.. And I don't see how it is achieved now.. This should be
> described in commit msg..

My problem is that I failed to figure out why it was necessary in the
first place, so I think I don't achieve this with the series.

> 3. About cancelling requests on drained_begin. The behavior was
> introduced by 8c517de24a8a1dcbeb, to fix a deadlock. So, if now the
> deadlock is fixed another way, let's change the logic (don't cancel
> requests) in a separate patch, and note 8c517de24a8a1dcbeb commit and
> the commit that fixes deadlock the other way in the commit message.

As I mentioned earlier I did a patch that removed the root cause; a
separate patch removing just the request cancelation didn't look
justified.  I did mention the commits in the log.

Thanks for the review!  I'm now on to submitting a non-RFC version.
Roman.
Vladimir Sementsov-Ogievskiy March 15, 2021, 2:07 p.m. UTC | #3
15.03.2021 08:36, Roman Kagan wrote:
> On Fri, Mar 12, 2021 at 03:35:25PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> 10.03.2021 12:32, Roman Kagan wrote:
>>> NBD connect coroutine takes an extra in_flight reference as if it's a
>>> request handler.  This prevents drain from completing until the
>>> connection coroutine releases the reference.
>>>
>>> When NBD is configured to reconnect, however, this appears to be fatal
>>> to the reconnection idea: the drain procedure wants the reconnection to
>>> be suspended, but this is only possible if the in-flight requests are
>>> canceled.
>>
>> As I remember from our conversation, the problem is not that we don't
>> reconnect during the drained section, but exactly that we cancel
>> requests on drained_begin, starting from commit 8c517de24a8a1dcbeb.
> 
> Well, I'd put it the other way around: the problem is that we don't
> reconnect during the drained section, so the requests can't be
> successfully completed if the connection breaks: they can either stall
> forever (before 8c517de24a8a1dcbeb) or be aborted (after
> 8c517de24a8a1dcbeb).

The point of the drained section is that we don't have any in-flight requests while it is in effect.
So, all requests must be finished on drained_begin().

We can continue the reconnect process during the drained section. But the requests still have to be completed on drained_begin() anyway.
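
A toy model of that invariant (not the actual block-layer code, which lives in
block/io.c and polls with BDRV_POLL_WHILE()): drain cannot return while anything
still holds an in_flight reference, be it a request or, before this series, the
connection coroutine.

    #include <stdbool.h>

    /* toy illustration of the drain invariant, not QEMU code */
    typedef struct {
        int in_flight;  /* held by requests (and, before this series,
                         * also by the NBD connection coroutine) */
    } ToyNode;

    static bool toy_drain_poll(const ToyNode *n)
    {
        return n->in_flight > 0;
    }

    static void toy_dispatch_one(ToyNode *n)
    {
        /* stand-in for completing one pending request */
        if (n->in_flight > 0) {
            n->in_flight--;
        }
    }

    static void toy_drained_begin(ToyNode *n)
    {
        /* the real code polls the aio context here instead of spinning */
        while (toy_drain_poll(n)) {
            toy_dispatch_one(n);
        }
    }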

> 
>> This is not a problem in scenarios where reconnect is a rare case and
>> a failed request is acceptable. But if we have a bad connection and
>> requests often have to wait for a reconnect (so it may be considered a
>> kind of "latency"), then cancelling the waiting requests on any drain()
>> really kills the reconnect feature.
> 
> In our experience the primary purpose of reconnect is not to withstand
> poor network links, but to allow restarting the NBD server without
> disrupting the guest operation.  So failing the requests during the
> server maintenance window is never acceptable, no matter how rare it is
> (and in our practice it isn't rare).
> 
>>> Fix this by making the connection coroutine stop messing with the
>>> in-flight counter.  Instead, certain care is taken to properly move the
>>> reconnection stuff from one aio_context to another in
>>> .bdrv_{attach,detach}_aio_context callbacks.
>>>
>>> Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
>>> Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
>>> ---
>>> This patch passes the regular make check but fails some extra iotests,
>>> in particular 277.  It obviously lacks more robust interaction with the
>>> connection thread (which in general is fairly complex and hard to reason
>>> about), and perhaps has some other drawbacks, so I'll work on this
>>> further, but I'd appreciate some feedback on whether the idea is sound.
>>>
>>
>> In general I like the idea. The logic around drain in nbd is
>> overcomplicated. And I never liked the fact that nbd_read_eof() has
>> that dec-inc in_flight section. Some notes:
>>
>> 1. I hope the patch can be divided into several ones, as there are
>> several things being done:
>>
>> - removing use of in_flight counter introduced by 5ad81b4946
>> - do reconnect during drained section
>> - stop cancelling requests on .drained_begin
> 
> I've split the (somewhat extended) patch into a series of 7, but not
> exactly as you suggested.  In particular, I can't just stop aborting the
> requests on .drained_begin as this would reintroduce the deadlock, so I
> just remove the whole .drained_begin/end in a single patch.
> 
>> 2. 5ad81b4946 was needed to make nbd_client_attach_aio_context()
>> reenter connection_co only in one (or two) possible places, not on any
>> yield.. And I don't see how it is achieved now.. This should be
>> described in commit msg..
> 
> My problem is that I failed to figure out why it was necessary in the
> first place, so I think I don't achieve this with the series.

nbd_client_attach_aio_context() reenters connection_co. So we must be sure
that at any yield() of connection_co it is safe to enter it from
nbd_client_attach_aio_context(), or somehow prevent entering on unsafe
yields. That is what 5ad81b4946 does: it prevents entering connection_co from
any place except the two where we decrease the in_flight counter (and we
exploit the fact that nbd_client_attach_aio_context() is called inside a
drained section).
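
That pattern is visible in the nbd_read_eof() hunk at the end of the diff: the
in_flight reference is dropped only around the one yield where such a reentry
is safe, roughly:

    /* the only reentry-safe yield before this series (from nbd_read_eof) */
    bdrv_dec_in_flight(bs);             /* let the drain complete ...     */
    qio_channel_yield(ioc, G_IO_IN);    /* ... attach may reenter us here */
    bdrv_inc_in_flight(bs);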

> 
>> 3. About cancelling requests on drained_begin. The behavior was
>> introduced by 8c517de24a8a1dcbeb, to fix a deadlock. So, if now the
>> deadlock is fixed another way, let's change the logic (don't cancel
>> requests) in a separate patch, and note 8c517de24a8a1dcbeb commit and
>> the commit that fixes deadlock the other way in the commit message.
> 
> As I mentioned earlier I did a patch that removed the root cause; a
> separate patch removing just the request cancelation didn't look
> justified.  I did mention the commits in the log.
> 
> Thanks for the review!  I'm now on to submitting a non-RFC version.
> Roman.
>

Patch

diff --git a/block/nbd.c b/block/nbd.c
index c26dc5a54f..5319e543ab 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -117,8 +117,6 @@  typedef struct BDRVNBDState {
     Coroutine *connection_co;
     Coroutine *teardown_co;
     QemuCoSleepState *connection_co_sleep_ns_state;
-    bool drained;
-    bool wait_drained_end;
     int in_flight;
     NBDClientState state;
     int connect_status;
@@ -126,6 +124,7 @@  typedef struct BDRVNBDState {
     bool wait_in_flight;
 
     QEMUTimer *reconnect_delay_timer;
+    int64_t reconnect_expire_time_ns;
 
     NBDClientRequest requests[MAX_NBD_REQUESTS];
     NBDReply reply;
@@ -165,6 +164,18 @@  static void nbd_clear_bdrvstate(BDRVNBDState *s)
     s->x_dirty_bitmap = NULL;
 }
 
+static bool nbd_client_connecting(BDRVNBDState *s)
+{
+    NBDClientState state = qatomic_load_acquire(&s->state);
+    return state == NBD_CLIENT_CONNECTING_WAIT ||
+        state == NBD_CLIENT_CONNECTING_NOWAIT;
+}
+
+static bool nbd_client_connecting_wait(BDRVNBDState *s)
+{
+    return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT;
+}
+
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
     if (ret == -EIO) {
@@ -217,10 +228,6 @@  static void reconnect_delay_timer_cb(void *opaque)
 
 static void reconnect_delay_timer_init(BDRVNBDState *s, uint64_t expire_time_ns)
 {
-    if (qatomic_load_acquire(&s->state) != NBD_CLIENT_CONNECTING_WAIT) {
-        return;
-    }
-
     assert(!s->reconnect_delay_timer);
     s->reconnect_delay_timer = aio_timer_new(bdrv_get_aio_context(s->bs),
                                              QEMU_CLOCK_REALTIME,
@@ -233,8 +240,20 @@  static void nbd_client_detach_aio_context(BlockDriverState *bs)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
-    /* Timer is deleted in nbd_client_co_drain_begin() */
-    assert(!s->reconnect_delay_timer);
+    /*
+     * This runs in the (old, about to be detached) aio context of the @bs so
+     * accessing the stuff on @s is concurrency-free.
+     */
+    assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
+
+    /*
+     * Preserve the expiration time of the reconnect_delay_timer in order to
+     * resume it on the new aio context.
+     */
+    s->reconnect_expire_time_ns = s->reconnect_delay_timer ?
+        timer_expire_time_ns(s->reconnect_delay_timer) : -1;
+    reconnect_delay_timer_del(s);
+
     /*
      * If reconnect is in progress we may have no ->ioc.  It will be
      * re-instantiated in the proper aio context once the connection is
@@ -250,6 +269,16 @@  static void nbd_client_attach_aio_context_bh(void *opaque)
     BlockDriverState *bs = opaque;
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
+    /*
+     * This runs in the (new, just attached) aio context of the @bs so
+     * accessing the stuff on @s is concurrency-free.
+     */
+    assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
+
+    if (nbd_client_connecting_wait(s) && s->reconnect_expire_time_ns >= 0) {
+        reconnect_delay_timer_init(s, s->reconnect_expire_time_ns);
+    }
+
     if (s->connection_co) {
         /*
          * The node is still drained, so we know the coroutine has yielded in
@@ -259,7 +288,6 @@  static void nbd_client_attach_aio_context_bh(void *opaque)
          */
         qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
     }
-    bdrv_dec_in_flight(bs);
 }
 
 static void nbd_client_attach_aio_context(BlockDriverState *bs,
@@ -275,7 +303,6 @@  static void nbd_client_attach_aio_context(BlockDriverState *bs,
         qio_channel_attach_aio_context(QIO_CHANNEL(s->ioc), new_context);
     }
 
-    bdrv_inc_in_flight(bs);
 
     /*
      * Need to wait here for the BH to run because the BH must run while the
@@ -284,37 +311,6 @@  static void nbd_client_attach_aio_context(BlockDriverState *bs,
     aio_wait_bh_oneshot(new_context, nbd_client_attach_aio_context_bh, bs);
 }
 
-static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
-{
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-
-    s->drained = true;
-    if (s->connection_co_sleep_ns_state) {
-        qemu_co_sleep_wake(s->connection_co_sleep_ns_state);
-    }
-
-    nbd_co_establish_connection_cancel(bs, false);
-
-    reconnect_delay_timer_del(s);
-
-    if (qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT) {
-        s->state = NBD_CLIENT_CONNECTING_NOWAIT;
-        qemu_co_queue_restart_all(&s->free_sema);
-    }
-}
-
-static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)
-{
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-
-    s->drained = false;
-    if (s->wait_drained_end) {
-        s->wait_drained_end = false;
-        aio_co_wake(s->connection_co);
-    }
-}
-
-
 static void nbd_teardown_connection(BlockDriverState *bs)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
@@ -346,18 +342,6 @@  static void nbd_teardown_connection(BlockDriverState *bs)
     assert(!s->connection_co);
 }
 
-static bool nbd_client_connecting(BDRVNBDState *s)
-{
-    NBDClientState state = qatomic_load_acquire(&s->state);
-    return state == NBD_CLIENT_CONNECTING_WAIT ||
-        state == NBD_CLIENT_CONNECTING_NOWAIT;
-}
-
-static bool nbd_client_connecting_wait(BDRVNBDState *s)
-{
-    return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT;
-}
-
 static void connect_bh(void *opaque)
 {
     BDRVNBDState *state = opaque;
@@ -624,21 +608,9 @@  static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
         goto out;
     }
 
-    bdrv_dec_in_flight(s->bs);
 
     ret = nbd_client_handshake(s->bs, &local_err);
 
-    if (s->drained) {
-        s->wait_drained_end = true;
-        while (s->drained) {
-            /*
-             * We may be entered once from nbd_client_attach_aio_context_bh
-             * and then from nbd_client_co_drain_end. So here is a loop.
-             */
-            qemu_coroutine_yield();
-        }
-    }
-    bdrv_inc_in_flight(s->bs);
 
 out:
     s->connect_status = ret;
@@ -666,26 +638,10 @@  static coroutine_fn void nbd_co_reconnect_loop(BDRVNBDState *s)
     nbd_reconnect_attempt(s);
 
     while (nbd_client_connecting(s)) {
-        if (s->drained) {
-            bdrv_dec_in_flight(s->bs);
-            s->wait_drained_end = true;
-            while (s->drained) {
-                /*
-                 * We may be entered once from nbd_client_attach_aio_context_bh
-                 * and then from nbd_client_co_drain_end. So here is a loop.
-                 */
-                qemu_coroutine_yield();
-            }
-            bdrv_inc_in_flight(s->bs);
-        } else {
-            qemu_co_sleep_ns_wakeable(QEMU_CLOCK_REALTIME, timeout,
-                                      &s->connection_co_sleep_ns_state);
-            if (s->drained) {
-                continue;
-            }
-            if (timeout < max_timeout) {
-                timeout *= 2;
-            }
+        qemu_co_sleep_ns_wakeable(QEMU_CLOCK_REALTIME, timeout,
+                                  &s->connection_co_sleep_ns_state);
+        if (timeout < max_timeout) {
+            timeout *= 2;
         }
 
         nbd_reconnect_attempt(s);
@@ -766,7 +722,6 @@  static coroutine_fn void nbd_connection_entry(void *opaque)
 
     qemu_co_queue_restart_all(&s->free_sema);
     nbd_recv_coroutines_wake_all(s);
-    bdrv_dec_in_flight(s->bs);
 
     s->connection_co = NULL;
     if (s->ioc) {
@@ -2314,7 +2269,6 @@  static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
     nbd_init_connect_thread(s);
 
     s->connection_co = qemu_coroutine_create(nbd_connection_entry, s);
-    bdrv_inc_in_flight(bs);
     aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co);
 
     return 0;
@@ -2490,8 +2444,6 @@  static BlockDriver bdrv_nbd = {
     .bdrv_getlength             = nbd_getlength,
     .bdrv_detach_aio_context    = nbd_client_detach_aio_context,
     .bdrv_attach_aio_context    = nbd_client_attach_aio_context,
-    .bdrv_co_drain_begin        = nbd_client_co_drain_begin,
-    .bdrv_co_drain_end          = nbd_client_co_drain_end,
     .bdrv_refresh_filename      = nbd_refresh_filename,
     .bdrv_co_block_status       = nbd_client_co_block_status,
     .bdrv_dirname               = nbd_dirname,
@@ -2519,8 +2471,6 @@  static BlockDriver bdrv_nbd_tcp = {
     .bdrv_getlength             = nbd_getlength,
     .bdrv_detach_aio_context    = nbd_client_detach_aio_context,
     .bdrv_attach_aio_context    = nbd_client_attach_aio_context,
-    .bdrv_co_drain_begin        = nbd_client_co_drain_begin,
-    .bdrv_co_drain_end          = nbd_client_co_drain_end,
     .bdrv_refresh_filename      = nbd_refresh_filename,
     .bdrv_co_block_status       = nbd_client_co_block_status,
     .bdrv_dirname               = nbd_dirname,
@@ -2548,8 +2498,6 @@  static BlockDriver bdrv_nbd_unix = {
     .bdrv_getlength             = nbd_getlength,
     .bdrv_detach_aio_context    = nbd_client_detach_aio_context,
     .bdrv_attach_aio_context    = nbd_client_attach_aio_context,
-    .bdrv_co_drain_begin        = nbd_client_co_drain_begin,
-    .bdrv_co_drain_end          = nbd_client_co_drain_end,
     .bdrv_refresh_filename      = nbd_refresh_filename,
     .bdrv_co_block_status       = nbd_client_co_block_status,
     .bdrv_dirname               = nbd_dirname,
diff --git a/nbd/client.c b/nbd/client.c
index 0c2db4bcba..30d5383cb1 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1434,9 +1434,7 @@  nbd_read_eof(BlockDriverState *bs, QIOChannel *ioc, void *buffer, size_t size,
 
         len = qio_channel_readv(ioc, &iov, 1, errp);
         if (len == QIO_CHANNEL_ERR_BLOCK) {
-            bdrv_dec_in_flight(bs);
             qio_channel_yield(ioc, G_IO_IN);
-            bdrv_inc_in_flight(bs);
             continue;
         } else if (len < 0) {
             return -EIO;