Message ID | 20210610100802.5888-21-vsementsov@virtuozzo.com |
---|---|
State | New |
Headers | show |
Series | block/nbd: rework client connection | expand |
On Thu, Jun 10, 2021 at 01:07:50PM +0300, Vladimir Sementsov-Ogievskiy wrote: > Add an option for a thread to retry connection until succeeds. We'll for a thread to retry connecting until it succeeds. > use nbd/client-connection both for reconnect and for initial connection > in nbd_open(), so we need a possibility to use same NBDClientConnection > instance to connect once in nbd_open() and then use retry semantics for > reconnect. > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> > --- > include/block/nbd.h | 2 ++ > nbd/client-connection.c | 56 +++++++++++++++++++++++++++++++---------- > 2 files changed, 45 insertions(+), 13 deletions(-) > Reviewed-by: Eric Blake <eblake@redhat.com>
Reviving this thread, as a good as place as any for my question: On Thu, Jun 10, 2021 at 01:07:50PM +0300, Vladimir Sementsov-Ogievskiy wrote: > Add an option for a thread to retry connection until succeeds. We'll > use nbd/client-connection both for reconnect and for initial connection > in nbd_open(), so we need a possibility to use same NBDClientConnection > instance to connect once in nbd_open() and then use retry semantics for > reconnect. > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> > --- > include/block/nbd.h | 2 ++ > nbd/client-connection.c | 56 +++++++++++++++++++++++++++++++---------- > 2 files changed, 45 insertions(+), 13 deletions(-) > NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr, > bool do_negotiation, > const char *export_name, > @@ -154,23 +164,43 @@ static void *connect_thread_func(void *opaque) > NBDClientConnection *conn = opaque; > int ret; > bool do_free; > + uint64_t timeout = 1; > + uint64_t max_timeout = 16; > > - conn->sioc = qio_channel_socket_new(); > + while (true) { > + conn->sioc = qio_channel_socket_new(); > > - error_free(conn->err); > - conn->err = NULL; > - conn->updated_info = conn->initial_info; > + error_free(conn->err); > + conn->err = NULL; > + conn->updated_info = conn->initial_info; > > - ret = nbd_connect(conn->sioc, conn->saddr, > - conn->do_negotiation ? &conn->updated_info : NULL, > - conn->tlscreds, &conn->ioc, &conn->err); This says that on each retry attempt, we reset whether to ask the server for structured replies back to our original initial_info values. But when dealing with NBD retries in general, I suspect we have a bug. Consider what happens if our first connection requests structured replies and base:allocation block status, and we are successful. But later, the server disconnects, triggering a retry. Suppose that on our retry, we encounter a different server that no longer supports structured replies. We would no longer be justified in sending NBD_CMD_BLOCK_STATUS requests to the reconnected server. But I can't find anywhere in the code base that ensures that on a reconnect, the new server supplies at least as many extensions as the original server, nor anywhere that we would be able to gracefully handle an in-flight block status command that can no longer be successfully continued because the reconnect landed on a downgraded server. In general, you don't expect a server to downgrade its capabilities across restarts, so assuming that a retried connection will hit a server at least as capable as the original server is typical, even if unsafe. But it is easy enough to use nbdkit to write a server that purposefully downgrades its abilities after the first client connection, for testing how qemu falls apart if it continues making assumptions about the current server based solely on what it learned prior to retrying from the first server. Is this something we need to address quickly for inclusion in 6.2? Maybe by having a retry connect fail if the new server does not have the same capabilities as the old? Do we also need to care about a server reporting a different size export than the old server?
22.11.2021 19:30, Eric Blake wrote: > Reviving this thread, as a good as place as any for my question: > > On Thu, Jun 10, 2021 at 01:07:50PM +0300, Vladimir Sementsov-Ogievskiy wrote: >> Add an option for a thread to retry connection until succeeds. We'll >> use nbd/client-connection both for reconnect and for initial connection >> in nbd_open(), so we need a possibility to use same NBDClientConnection >> instance to connect once in nbd_open() and then use retry semantics for >> reconnect. >> >> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> >> --- >> include/block/nbd.h | 2 ++ >> nbd/client-connection.c | 56 +++++++++++++++++++++++++++++++---------- >> 2 files changed, 45 insertions(+), 13 deletions(-) > >> NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr, >> bool do_negotiation, >> const char *export_name, >> @@ -154,23 +164,43 @@ static void *connect_thread_func(void *opaque) >> NBDClientConnection *conn = opaque; >> int ret; >> bool do_free; >> + uint64_t timeout = 1; >> + uint64_t max_timeout = 16; >> >> - conn->sioc = qio_channel_socket_new(); >> + while (true) { >> + conn->sioc = qio_channel_socket_new(); >> >> - error_free(conn->err); >> - conn->err = NULL; >> - conn->updated_info = conn->initial_info; >> + error_free(conn->err); >> + conn->err = NULL; >> + conn->updated_info = conn->initial_info; >> >> - ret = nbd_connect(conn->sioc, conn->saddr, >> - conn->do_negotiation ? &conn->updated_info : NULL, >> - conn->tlscreds, &conn->ioc, &conn->err); > > This says that on each retry attempt, we reset whether to ask the > server for structured replies back to our original initial_info > values. > > But when dealing with NBD retries in general, I suspect we have a bug. > Consider what happens if our first connection requests structured > replies and base:allocation block status, and we are successful. But > later, the server disconnects, triggering a retry. Suppose that on > our retry, we encounter a different server that no longer supports > structured replies. We would no longer be justified in sending > NBD_CMD_BLOCK_STATUS requests to the reconnected server. But I can't > find anywhere in the code base that ensures that on a reconnect, the > new server supplies at least as many extensions as the original > server, nor anywhere that we would be able to gracefully handle an > in-flight block status command that can no longer be successfully > continued because the reconnect landed on a downgraded server. > > In general, you don't expect a server to downgrade its capabilities > across restarts, so assuming that a retried connection will hit a > server at least as capable as the original server is typical, even if > unsafe. But it is easy enough to use nbdkit to write a server that > purposefully downgrades its abilities after the first client > connection, for testing how qemu falls apart if it continues making > assumptions about the current server based solely on what it learned > prior to retrying from the first server. > > Is this something we need to address quickly for inclusion in 6.2? > Maybe by having a retry connect fail if the new server does not have > the same capabilities as the old? Do we also need to care about a > server reporting a different size export than the old server? > Yes that's a problem. We previously noted it here https://lists.gnu.org/archive/html/qemu-block/2021-06/msg00458.html Honestly, I didn't start any fix for that :(.. I agree, it would be good to fix it somehow in 6.2. I'll try to make something simple this week. Or did you already started doing some fix?
On Mon, Nov 22, 2021 at 08:17:34PM +0300, Vladimir Sementsov-Ogievskiy wrote: > 22.11.2021 19:30, Eric Blake wrote: > > Reviving this thread, as a good as place as any for my question: > > > > But I can't > > find anywhere in the code base that ensures that on a reconnect, the > > new server supplies at least as many extensions as the original > > server, nor anywhere that we would be able to gracefully handle an > > in-flight block status command that can no longer be successfully > > continued because the reconnect landed on a downgraded server. > > > > Yes that's a problem. We previously noted it here https://lists.gnu.org/archive/html/qemu-block/2021-06/msg00458.html Aha! And oops that we've let it slide this long. But now that we know the problem was also present in 6.1, and not a new regression in 6.2, it is less urgent to get a fix in. If we get one written in time, it's still game for inclusion for -rc2 or maybe -rc3, but as we get further along in the process, it should not hold up the final release (it would not be -rc4 material, for example). > > Honestly, I didn't start any fix for that :(.. I agree, it would be good to fix it somehow in 6.2. I'll try to make something simple this week. Or did you already started doing some fix? I haven't started writing anything on the topic. I've got some patches that I hope to post soon targetting 7.0 that allow an extension for NBD_CMD_BLOCK_STATUS to do a full round-trip 64-bit operation, which is why I noticed it (if the 64-bit extension is not present on reconnect, we have a problem - but it's no worse than the existing problems). But I'm currently so focused on getting the new feature interoperating between qemu, libnbd, and nbdkit, plus the US Thanksgiving break this week, that I'm happy to let you take first shot at client-side validation that a server reconnect does not lose capabilities.
diff --git a/include/block/nbd.h b/include/block/nbd.h index 5d86e6a393..5bb54d831c 100644 --- a/include/block/nbd.h +++ b/include/block/nbd.h @@ -409,6 +409,8 @@ const char *nbd_err_lookup(int err); /* nbd/client-connection.c */ typedef struct NBDClientConnection NBDClientConnection; +void nbd_client_connection_enable_retry(NBDClientConnection *conn); + NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr, bool do_negotiation, const char *export_name, diff --git a/nbd/client-connection.c b/nbd/client-connection.c index 4301099f1b..76346481ba 100644 --- a/nbd/client-connection.c +++ b/nbd/client-connection.c @@ -35,6 +35,7 @@ struct NBDClientConnection { QCryptoTLSCreds *tlscreds; NBDExportInfo initial_info; bool do_negotiation; + bool do_retry; QemuMutex mutex; @@ -60,6 +61,15 @@ struct NBDClientConnection { Coroutine *wait_co; }; +/* + * The function isn't protected by any mutex, only call it when the client + * connection attempt has not yet started. + */ +void nbd_client_connection_enable_retry(NBDClientConnection *conn) +{ + conn->do_retry = true; +} + NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr, bool do_negotiation, const char *export_name, @@ -154,23 +164,43 @@ static void *connect_thread_func(void *opaque) NBDClientConnection *conn = opaque; int ret; bool do_free; + uint64_t timeout = 1; + uint64_t max_timeout = 16; - conn->sioc = qio_channel_socket_new(); + while (true) { + conn->sioc = qio_channel_socket_new(); - error_free(conn->err); - conn->err = NULL; - conn->updated_info = conn->initial_info; + error_free(conn->err); + conn->err = NULL; + conn->updated_info = conn->initial_info; - ret = nbd_connect(conn->sioc, conn->saddr, - conn->do_negotiation ? &conn->updated_info : NULL, - conn->tlscreds, &conn->ioc, &conn->err); - if (ret < 0) { - object_unref(OBJECT(conn->sioc)); - conn->sioc = NULL; - } + ret = nbd_connect(conn->sioc, conn->saddr, + conn->do_negotiation ? &conn->updated_info : NULL, + conn->tlscreds, &conn->ioc, &conn->err); + + /* + * conn->updated_info will finally be returned to the user. Clear the + * pointers to our internally allocated strings, which are IN parameters + * of nbd_receive_negotiate() and therefore nbd_connect(). Caller + * shoudn't be interested in these fields. + */ + conn->updated_info.x_dirty_bitmap = NULL; + conn->updated_info.name = NULL; + + if (ret < 0) { + object_unref(OBJECT(conn->sioc)); + conn->sioc = NULL; + if (conn->do_retry) { + sleep(timeout); + if (timeout < max_timeout) { + timeout *= 2; + } + continue; + } + } - conn->updated_info.x_dirty_bitmap = NULL; - conn->updated_info.name = NULL; + break; + } qemu_mutex_lock(&conn->mutex);
Add an option for a thread to retry connection until succeeds. We'll use nbd/client-connection both for reconnect and for initial connection in nbd_open(), so we need a possibility to use same NBDClientConnection instance to connect once in nbd_open() and then use retry semantics for reconnect. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> --- include/block/nbd.h | 2 ++ nbd/client-connection.c | 56 +++++++++++++++++++++++++++++++---------- 2 files changed, 45 insertions(+), 13 deletions(-)