Patchwork NBD block device backend - 'improvements'

Submitter Nicholas Thomas
Date Feb. 14, 2011, 7:40 p.m.
Message ID <1297712422.12551.2.camel@den>
Permalink /patch/83153/
State New

Comments

Nicholas Thomas - Feb. 14, 2011, 7:40 p.m.
[Apologies for the cross-post - I originally sent this to the KVM ML -
obviously, it's far more appropriate here]

Hi,

I've been doing some work with /block/nbd.c with the aim of improving
its behaviour when the NBD server is inaccessible or goes away.

Current behaviour is to exit on startup if connecting to the NBD server
fails for any reason. If the connection dies once KVM is started, the
VM stays up but reads and writes fail. No attempt to re-establish the
connection is made. 

I've written a patch that changes the behaviour - instead of exiting at
startup, we wait for the NBD connection to be established, and we hang
on reads and writes until the connection is re-established.

I'm interested in getting the changes merged upstream, so I thought I'd
get in early and ask if you'd be interested in the patch, in principle;
whether the old behaviour would need to be preserved, making the new
behaviour accessible via a config option ("-drive
file=nbd:127.0.0.1:5000:retry=forever,..." ?); and whether I'm going
about the changes in a sane way (I've attached the current version of
the patch).

Another thing I've noticed is that the nbd library ( /nbd.c ) is not
IPv6-compatible ( "-drive file=nbd:\:\:1:5000", for instance ) - I don't
have a patch for that yet, but I'm going to need to write one :) -
presumably you'd like that merged upstream too (and I should make the
library use the functions provided in qemu_socket.h)?
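
For the sake of discussion, the connect path would presumably end up
looking something like the following - an untested sketch, with the
helper name invented; the real thing should just reuse whatever
qemu_socket.h already provides, and parsing the escaped colons out of
the filename is a separate problem again:

    /* Untested sketch; nbd_connect_inet() is a made-up name. */
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netdb.h>

    static int nbd_connect_inet(const char *host, const char *port)
    {
        struct addrinfo hints, *res, *ai;
        int sock = -1;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;      /* accept IPv4 and IPv6 */
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo(host, port, &hints, &res) != 0) {
            return -1;
        }
        for (ai = res; ai != NULL; ai = ai->ai_next) {
            sock = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (sock < 0) {
                continue;
            }
            if (connect(sock, ai->ai_addr, ai->ai_addrlen) == 0) {
                break;                    /* connected */
            }
            close(sock);
            sock = -1;
        }
        freeaddrinfo(res);
        return sock;
    }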

Regards,
Stefan Hajnoczi - Feb. 14, 2011, 8:32 p.m.
On Mon, Feb 14, 2011 at 7:40 PM, Nicholas Thomas <nick@lupine.me.uk> wrote:
> I've written a patch that changes the behaviour - instead of exiting at
> startup, we wait for the NBD connection to be established, and we hang
> on reads and writes until the connection is re-established.

Hi Nick,
I think reconnect is a useful feature.  For more info on submitting
patches to QEMU, please see
http://wiki.qemu.org/Contribute/SubmitAPatch.  It contains a few
points like sending patches inline instead of as an attachment (makes
review easier), using Signed-off-by:, and references to QEMU coding
style.

> I'm interested in getting the changes merged upstream, so I thought I'd
> get in early and ask if you'd be interested in the patch, in principle;
> whether the old behaviour would need to be preserved, making the new
> behaviour accessible via a config option ("-drive
> file=nbd:127.0.0.1:5000:retry=forever,..." ?); and whether I'm going
> about the changes in a sane way (I've attached the current version of
> the patch).

block/nbd.c needs to be made asynchronous in order for this change to
work.  Otherwise the only thing you can do is to block QEMU and the
VM, and even that shouldn't be done using sleep(2) because that could
stop the I/O thread which processes the QEMU monitor and other
external interfaces.  See below for more info.
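
(For reference, "asynchronous" here means implementing the aio entry
points instead of .bdrv_read/.bdrv_write.  From memory they look
roughly like this, but please check block.h/block_int.h for the
authoritative prototypes rather than trusting me:)

    /* Rough shape only, quoted from memory - see block_int.h.  The
     * driver queues the request, arms an fd handler for the reply and
     * returns immediately; cb(opaque, ret) is called later from the
     * event loop, so the VM and the monitor keep running meanwhile. */
    BlockDriverAIOCB *(*bdrv_aio_readv)(BlockDriverState *bs,
        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
        BlockDriverCompletionFunc *cb, void *opaque);
    BlockDriverAIOCB *(*bdrv_aio_writev)(BlockDriverState *bs,
        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
        BlockDriverCompletionFunc *cb, void *opaque);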

> Another thing I've noticed is that the nbd library ( /nbd.c ) is not
> IPv6-compatible ( "-drive file=nbd:\:\:1:5000", for instance ) - I don't
> have a patch for that yet, but I'm going to need to write one :) -
> presumably you'd like that merged upstream too (and I should make the
> library use the functions provided in qemu_socket.h)?

IPv6 would be nice and if you can consolidate that in qemu_socket.h,
then that's a win for non-nbd socket users too.

I have inlined your patch below for easy reviewing:

> diff --git a/block/nbd.c b/block/nbd.c
> index c8dc763..87da07e 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -39,9 +39,11 @@ typedef struct BDRVNBDState {
>      int sock;
>      off_t size;
>      size_t blocksize;
> +    char *filename;

block_int.h:BlockDriverState->filename already has this.

>  } BDRVNBDState;
>
> -static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
> +
> +static int nbd_open(BlockDriverState *bs)
>  {
>      BDRVNBDState *s = bs->opaque;
>      uint32_t nbdflags;
> @@ -56,7 +58,7 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
>      int ret;
>      int err = -EINVAL;
>
> -    file = qemu_strdup(filename);
> +    file = qemu_strdup(s->filename);
>
>      name = strstr(file, EN_OPTSTR);
>      if (name) {
> @@ -121,32 +123,62 @@ out:
>      return err;
>  }
>
> +// Puts the filename into the driver state and calls nbd_open - hanging until
> +// it is successful.

Please use /* */ comments instead, QEMU coding style.

> +static int nbd_setup(BlockDriverState *bs, const char* filename, int flags)
> +{
> +    BDRVNBDState *s = bs->opaque;
> +    int err = 1;
> +
> +    s->filename = qemu_strdup(filename);
> +    while (err != 0)
> +    {
> +        err = nbd_open(bs);
> +        // Avoid tight loops
> +        if (err != 0)
> +            sleep(1);

Please use {} even for single statement blocks.

> +    }
> +
> +    return err;
> +}

Waiting until the connection succeeds is a policy decision.  It might
be equally useful to abort starting QEMU or to start with the nbd
device unavailable (either returning I/O errors or not completing
requests until a connection is established).  There should be a way to
configure this; otherwise a user who depended on the old behaviour will
end up filing a bug report.

> +
>  static int nbd_read(BlockDriverState *bs, int64_t sector_num,
>                      uint8_t *buf, int nb_sectors)
>  {
>      BDRVNBDState *s = bs->opaque;
>      struct nbd_request request;
>      struct nbd_reply reply;
> +    bool success = false;
>
>      request.type = NBD_CMD_READ;
>      request.handle = (uint64_t)(intptr_t)bs;
> -    request.from = sector_num * 512;;
> +    request.from = sector_num * 512;
>      request.len = nb_sectors * 512;
>
> -    if (nbd_send_request(s->sock, &request) == -1)
> -        return -errno;
> +    while (success == false)
> +    {
> +        if ( (nbd_send_request(s->sock, &request) == -1) ||
> +             (nbd_receive_reply(s->sock, &reply) == -1)     )
> +        {
> +            // We hang here until the TCP session is established
> +            close(s->sock);
> +            while(nbd_open(bs) != 0)
> +                sleep(1);
> +            continue;

This hangs the VM.  It will literally stop executing code and be
frozen.  Doing this without hanging the VM involves changing
nbd_read()/nbd_write() to work asynchronously.

Also, on ENOSPC QEMU can already pause the VM using a different
mechanism.  That may not be appropriate here but it could be
interesting to see how that code works.
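
(If I remember correctly that is the -drive werror=stop/werror=enospc
handling: the VM is stopped when a request fails and the request is
retried when the monitor "cont" command resumes it.)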

> +        }
>
> -    if (nbd_receive_reply(s->sock, &reply) == -1)
> -        return -errno;
> +        if (reply.error !=0)
> +            return -reply.error;
>
> -    if (reply.error !=0)
> -        return -reply.error;
> +        if (reply.handle != request.handle)
> +            return -EIO;
>
> -    if (reply.handle != request.handle)
> -        return -EIO;
> +        // If reading the actual data fails, we retry the whole request
> +        if (nbd_wr_sync(s->sock, buf, request.len, 1) != request.len)
> +            continue;
>
> -    if (nbd_wr_sync(s->sock, buf, request.len, 1) != request.len)
> -        return -EIO;
> +        success = true;
> +    }
>
>      return 0;
>  }
> @@ -157,26 +189,39 @@ static int nbd_write(BlockDriverState *bs, int64_t sector_num,
>      BDRVNBDState *s = bs->opaque;
>      struct nbd_request request;
>      struct nbd_reply reply;
> +    bool success = false;
>
>      request.type = NBD_CMD_WRITE;
>      request.handle = (uint64_t)(intptr_t)bs;
>      request.from = sector_num * 512;;
>      request.len = nb_sectors * 512;
>
> -    if (nbd_send_request(s->sock, &request) == -1)
> -        return -errno;
> +    while (success == false)
> +    {
> +        if ( (nbd_send_request(s->sock, &request) == -1) ||
> +             (nbd_wr_sync(s->sock, (uint8_t*)buf, request.len, 0) != request.len) )
> +        {
> +            // We hang here until the TCP session is established
> +            close(s->sock);
> +            while(nbd_open(bs) != 0)
> +                sleep(1);
> +            continue;
> +        }
> +
> +        // We didn't get a reply from the write, so try again
> +        if (nbd_receive_reply(s->sock, &reply) == -1)
> +            continue;
>
> -    if (nbd_wr_sync(s->sock, (uint8_t*)buf, request.len, 0) != request.len)
> -        return -EIO;
> +        // Problem with the response itself
> +        if (reply.error !=0)
> +            return -reply.error;
>
> -    if (nbd_receive_reply(s->sock, &reply) == -1)
> -        return -errno;
> +        if (reply.handle != request.handle)
> +            return -EIO;
>
> -    if (reply.error !=0)
> -        return -reply.error;
> +        success = true;
> +    }
>
> -    if (reply.handle != request.handle)
> -        return -EIO;
>
>      return 0;
>  }
> @@ -191,6 +236,7 @@ static void nbd_close(BlockDriverState *bs)
>      request.from = 0;
>      request.len = 0;
>      nbd_send_request(s->sock, &request);
> +    qemu_free(s->filename);
>
>      close(s->sock);
>  }
> @@ -205,7 +251,7 @@ static int64_t nbd_getlength(BlockDriverState *bs)
>  static BlockDriver bdrv_nbd = {
>      .format_name	= "nbd",
>      .instance_size	= sizeof(BDRVNBDState),
> -    .bdrv_file_open	= nbd_open,
> +    .bdrv_file_open	= nbd_setup,
>      .bdrv_read		= nbd_read,
>      .bdrv_write		= nbd_write,
>      .bdrv_close		= nbd_close,
Kevin Wolf - Feb. 15, 2011, 11:09 a.m.
Am 14.02.2011 21:32, schrieb Stefan Hajnoczi:
> On Mon, Feb 14, 2011 at 7:40 PM, Nicholas Thomas <nick@lupine.me.uk> wrote:
>> I've written a patch that changes the behaviour - instead of exiting at
>> startup, we wait for the NBD connection to be established, and we hang
>> on reads and writes until the connection is re-established.
> 
> Hi Nick,
> I think reconnect is a useful feature.  For more info on submitting
> patches to QEMU, please see
> http://wiki.qemu.org/Contribute/SubmitAPatch.  It contains a few
> points like sending patches inline instead of as an attachment (makes
> review easier), using Signed-off-by:, and references to QEMU coding
> style.
> 
>> I'm interested in getting the changes merged upstream, so I thought I'd
>> get in early and ask if you'd be interested in the patch, in principle;
>> whether the old behaviour would need to be preserved, making the new
>> behaviour accessible via a config option ("-drive
>> file=nbd:127.0.0.1:5000:retry=forever,..." ?); and whether I'm going
>> about the changes in a sane way (I've attached the current version of
>> the patch).
> 
> block/nbd.c needs to be made asynchronous in order for this change to
> work.  

And even then it's not free of problems: for example, qemu_aio_flush()
will hang. We're having all kinds of fun with NFS servers that go away
and let requests hang indefinitely.

So maybe what we should add is a timeout option which defaults to 0
(fail immediately, like today).
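
Illustrative only, not a proposal for a real interface, but the
semantics I have in mind for the reconnect path are roughly:

    /* Sketch: timeout == 0 keeps today's behaviour (fail right away),
     * timeout > 0 keeps retrying for that many seconds before giving
     * up.  nbd_open(bs) is the single-argument version from Nick's
     * patch; the sleep() is only a placeholder, see the I/O thread
     * concern quoted just below. */
    static int nbd_reconnect(BlockDriverState *bs, int timeout)
    {
        time_t deadline = time(NULL) + timeout;

        while (nbd_open(bs) != 0) {
            if (timeout == 0 || time(NULL) >= deadline) {
                return -EIO;
            }
            sleep(1);
        }
        return 0;
    }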

> Otherwise the only thing you can do is to block QEMU and the
> VM, and even that shouldn't be done using sleep(2) because that could
> stop the I/O thread which processes the QEMU monitor and other
> external interfaces.  See below for more info.

Unconditionally stopping the VM from a block driver sounds wrong to me.
If you want to have this behaviour, the block driver should return an
error and you should use werror=stop.
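
(Concretely, something along the lines of "-drive
file=nbd:127.0.0.1:5000,if=virtio,werror=stop,rerror=stop" - the guest
is paused when a request fails and can be resumed from the monitor
once the server is reachable again.)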

>> Another thing I've noticed is that the nbd library ( /nbd.c ) is not
>> IPv6-compatible ( "-drive file=nbd:\:\:1:5000", for instance ) - I don't
>> have a patch for that yet, but I'm going to need to write one :) -
>> presumably you'd like that merged upstream too (and I should make the
>> library use the functions provided in qemu_socket.h)?
> 
> IPv6 would be nice and if you can consolidate that in qemu_socket.h,
> then that's a win for non-nbd socket users too.

Agreed.

Kevin
Nicholas Thomas - Feb. 15, 2011, 9:26 p.m.
Hi Kevin, Stefan.

On Tue, 2011-02-15 at 12:09 +0100, Kevin Wolf wrote:
> Am 14.02.2011 21:32, schrieb Stefan Hajnoczi:
[...]
> > block/nbd.c needs to be made asynchronous in order for this change to
> > work.  
> 
> And even then it's not free of problems: for example, qemu_aio_flush()
> will hang. We're having all kinds of fun with NFS servers that go away
> and let requests hang indefinitely.
> 
> So maybe what we should add is a timeout option which defaults to 0
> (fail immediately, like today).

Noted, so long as we can have -1 as "forever". I'm currently spending
time reworking block/nbd.c to be asynchronous, following the model in
block/sheepdog.c

There does seem to be a lot of scope for code duplication (setting up
the TCP connection, taking it down, the mechanics of actually reading /
writing bytes using the aio interface, etc) between the two, and
presumably for rbd as well. 

Reading http://www.mail-archive.com/qemu-devel@nongnu.org/msg36479.html
suggests it should be possible to have a "tcp" (+ "unix") protocol /
transport, which nbd+sheepdog could stack on top of (curl+rbd seem to
depend on their own libraries for managing the TCP part of the
connection).

They would implement talking the actual protocol, while the tcp/unix
transports would have the duplicatable bits.

I've not investigated it in code yet - it's possible I'm just letting my
appetite for abstraction run away with me. Thoughts?
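
To make that a bit more concrete, the kind of split I'm picturing is
something like this (pure strawman, every name invented):

    /* Strawman only.  nbd and sheepdog would implement just the
     * protocol half; the tcp/unix transports would carry the shared
     * (and IPv6-capable) connection handling behind this interface. */
    #include <stddef.h>

    typedef struct BlockTransport {
        int  (*connect)(void *opaque, const char *address);
        int  (*read)(void *opaque, void *buf, size_t len);
        int  (*write)(void *opaque, const void *buf, size_t len);
        void (*close)(void *opaque);
    } BlockTransport;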

> Unconditionally stopping the VM from a block driver sounds wrong to me.
> If you want to have this behaviour, the block driver should return an
> error and you should use werror=stop.

Unconditional? - if the socket manages to re-establish, the process
continues on its way (I guess we'd see the same behaviour if a send/recv
happened to take an unconscionably long time with the current code).

Making just the I/O hang until the network comes back, keeping guest
execution and qemu monitor working, is obviously better than that
(although not /strictly/ necessary for our particular use case), so I
hope to be able to offer an AIO NBD patch for review "soon". 

> > IPv6 would be nice and if you can consolidate that in qemu_socket.h,
> > then that's a win for non-nbd socket users too.
> 
> Agreed.

We'd get it for free with a unified TCP transport, as described above
(sheepdog already uses getaddrinfo and friends) - but if that's not
feasible, I'll be happy to supply a patch just for this. Much easier
than aio! :)

/Nick
Kevin Wolf - Feb. 16, 2011, noon
Hi Nick,

[ Please use "reply to all" so that the CC list is kept intact,
re-adding Stefan ]

Am 15.02.2011 22:26, schrieb Nicholas Thomas:
> Hi Kevin, Stefan.
> 
> On Tue, 2011-02-15 at 12:09 +0100, Kevin Wolf wrote:
>> Am 14.02.2011 21:32, schrieb Stefan Hajnoczi:
> [...]
>>> block/nbd.c needs to be made asynchronous in order for this change to
>>> work.  
>>
>> And even then it's not free of problems: for example, qemu_aio_flush()
>> will hang. We're having all kinds of fun with NFS servers that go away
>> and let requests hang indefinitely.
>>
>> So maybe what we should add is a timeout option which defaults to 0
>> (fail immediately, like today).
> 
> Noted, so long as we can have -1 as "forever". 

As long as it's optional, that's fine with me.

> I'm currently spending
> time reworking block/nbd.c to be asynchronous, following the model in
> block/sheepdog.c
> 
> There does seem to be a lot of scope for code duplication (setting up
> the TCP connection, taking it down, the mechanics of actually reading /
> writing bytes using the aio interface, etc) between the two, and
> presumably for rbd as well. 
> 
> Reading http://www.mail-archive.com/qemu-devel@nongnu.org/msg36479.html
> suggests it should be possible to have a "tcp" (+ "unix") protocol /
> transport, which nbd+sheepdog could stack on top of (curl+rbd seem to
> depend on their own libraries for managing the TCP part of the
> connection).
> 
> They would implement talking the actual protocol, while the tcp/unix
> transports would have the duplicatable bits.
> 
> I've not investigated it in code yet - it's possible I'm just letting my
> appetite for abstraction run away with me. Thoughts?

I'm not sure about how much duplication there actually is, but if you
can take a closer look and think it's worthwhile, we should probably
consider it.

>> Unconditionally stopping the VM from a block driver sounds wrong to me.
>> If you want to have this behaviour, the block driver should return an
>> error and you should use werror=stop.
> 
> Unconditional? - if the socket manages to re-establish, the process
> continues on its way (I guess we'd see the same behaviour if a send/recv
> happened to take an unconscionably long time with the current code).
> 
> Making just the I/O hang until the network comes back, keeping guest
> execution and qemu monitor working, is obviously better than that
> (although not /strictly/ necessary for our particular use case), so I
> hope to be able to offer an AIO NBD patch for review "soon". 

Maybe I wasn't very clear. I was talking about Stefan's suggestion to
completely stop the VM, like we already can do for I/O errors (see the
werror and rerror options for -drive). I think it's not what you're
looking for; you just need the timeout=-1 thing.

>>> IPv6 would be nice and if you can consolidate that in qemu_socket.h,
>>> then that's a win for non-nbd socket users too.
>>
>> Agreed.
> 
> We'd get it for free with a unified TCP transport, as described above
> (sheepdog already uses getaddrinfo and friends) - but if that's not
> feasible, I'll be happy to supply a patch just for this. Much easier
> than aio! :)

Sure, I think it would be a good thing to have. And even if you
implemented this unified TCP transport (I'm not sure yet what it would
look like), I think the basic support could still be in qemu_socket.h,
so that users outside the block layer can benefit from it, too.

Kevin
Nicholas Thomas - Feb. 17, 2011, 4:27 p.m.
Hi again,

On Wed, 2011-02-16 at 13:00 +0100, Kevin Wolf wrote:
> Am 15.02.2011 22:26, schrieb Nicholas Thomas:
> > On Tue, 2011-02-15 at 12:09 +0100, Kevin Wolf wrote:
> >> Am 14.02.2011 21:32, schrieb Stefan Hajnoczi:
> I'm not sure about how much duplication there actually is, but if you
> can take a closer look and think it's worthwhile, we should probably
> consider it.

It's a couple of hundred lines, I'd guess - on reflection, it's probably
not enough to be too bothered about - so I haven't. I'll submit a
patchset in a moment implementing the first part of what we've been
talking about (converting the NBD driver to use the aio interface).

Assuming we can get this merged, I'll submit an IPv6 and a timeout=
patch too. It might also be worth adding aio support to qemu-nbd - we
use the canonical nbd-server binary, so it won't affect us directly, but
it seems a shame for the server to be behind the client's capabilities.

/Nick

Patch

diff --git a/block/nbd.c b/block/nbd.c
index c8dc763..87da07e 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -39,9 +39,11 @@  typedef struct BDRVNBDState {
     int sock;
     off_t size;
     size_t blocksize;
+    char *filename;
 } BDRVNBDState;
 
-static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
+
+static int nbd_open(BlockDriverState *bs)
 {
     BDRVNBDState *s = bs->opaque;
     uint32_t nbdflags;
@@ -56,7 +58,7 @@  static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
     int ret;
     int err = -EINVAL;
 
-    file = qemu_strdup(filename);
+    file = qemu_strdup(s->filename);
 
     name = strstr(file, EN_OPTSTR);
     if (name) {
@@ -121,32 +123,62 @@  out:
     return err;
 }
 
+// Puts the filename into the driver state and calls nbd_open - hanging until
+// it is successful.
+static int nbd_setup(BlockDriverState *bs, const char* filename, int flags)
+{
+    BDRVNBDState *s = bs->opaque;
+    int err = 1;
+
+    s->filename = qemu_strdup(filename);
+    while (err != 0)
+    {
+        err = nbd_open(bs);
+        // Avoid tight loops
+        if (err != 0)
+            sleep(1);
+    }
+
+    return err;
+}
+
 static int nbd_read(BlockDriverState *bs, int64_t sector_num,
                     uint8_t *buf, int nb_sectors)
 {
     BDRVNBDState *s = bs->opaque;
     struct nbd_request request;
     struct nbd_reply reply;
+    bool success = false;
 
     request.type = NBD_CMD_READ;
     request.handle = (uint64_t)(intptr_t)bs;
-    request.from = sector_num * 512;;
+    request.from = sector_num * 512;
     request.len = nb_sectors * 512;
 
-    if (nbd_send_request(s->sock, &request) == -1)
-        return -errno;
+    while (success == false)
+    {
+        if ( (nbd_send_request(s->sock, &request) == -1) ||
+             (nbd_receive_reply(s->sock, &reply) == -1)     )
+        {
+            // We hang here until the TCP session is established
+            close(s->sock);
+            while(nbd_open(bs) != 0)
+                sleep(1);
+            continue;
+        }
 
-    if (nbd_receive_reply(s->sock, &reply) == -1)
-        return -errno;
+        if (reply.error !=0)
+            return -reply.error;
 
-    if (reply.error !=0)
-        return -reply.error;
+        if (reply.handle != request.handle)
+            return -EIO;
 
-    if (reply.handle != request.handle)
-        return -EIO;
+        // If reading the actual data fails, we retry the whole request
+        if (nbd_wr_sync(s->sock, buf, request.len, 1) != request.len)
+            continue;
 
-    if (nbd_wr_sync(s->sock, buf, request.len, 1) != request.len)
-        return -EIO;
+        success = true;
+    }
 
     return 0;
 }
@@ -157,26 +189,39 @@  static int nbd_write(BlockDriverState *bs, int64_t sector_num,
     BDRVNBDState *s = bs->opaque;
     struct nbd_request request;
     struct nbd_reply reply;
+    bool success = false;
 
     request.type = NBD_CMD_WRITE;
     request.handle = (uint64_t)(intptr_t)bs;
     request.from = sector_num * 512;;
     request.len = nb_sectors * 512;
 
-    if (nbd_send_request(s->sock, &request) == -1)
-        return -errno;
+    while (success == false)
+    {
+        if ( (nbd_send_request(s->sock, &request) == -1) ||
+             (nbd_wr_sync(s->sock, (uint8_t*)buf, request.len, 0) != request.len) )
+        {
+            // We hang here until the TCP session is established
+            close(s->sock);
+            while(nbd_open(bs) != 0)
+                sleep(1);
+            continue;
+        }
+
+        // We didn't get a reply from the write, so try again
+        if (nbd_receive_reply(s->sock, &reply) == -1)
+            continue;
 
-    if (nbd_wr_sync(s->sock, (uint8_t*)buf, request.len, 0) != request.len)
-        return -EIO;
+        // Problem with the response itself
+        if (reply.error !=0)
+            return -reply.error;
 
-    if (nbd_receive_reply(s->sock, &reply) == -1)
-        return -errno;
+        if (reply.handle != request.handle)
+            return -EIO;
 
-    if (reply.error !=0)
-        return -reply.error;
+        success = true;
+    }
 
-    if (reply.handle != request.handle)
-        return -EIO;
 
     return 0;
 }
@@ -191,6 +236,7 @@  static void nbd_close(BlockDriverState *bs)
     request.from = 0;
     request.len = 0;
     nbd_send_request(s->sock, &request);
+    qemu_free(s->filename);
 
     close(s->sock);
 }
@@ -205,7 +251,7 @@  static int64_t nbd_getlength(BlockDriverState *bs)
 static BlockDriver bdrv_nbd = {
     .format_name	= "nbd",
     .instance_size	= sizeof(BDRVNBDState),
-    .bdrv_file_open	= nbd_open,
+    .bdrv_file_open	= nbd_setup,
     .bdrv_read		= nbd_read,
     .bdrv_write		= nbd_write,
     .bdrv_close		= nbd_close,